AME - Escuela de Ciencia de Datos
September 2019
Departments of Forestry & Environmental Resources and Statistics
North Carolina State University
Maybe a combination of Statistics, Computer Science and Mathematics?
What about machine learning and big data?
In this talk we'll ignore any pre-conceived notions of what data science is and focus on:
The biggest impact we can have is when we make discoveries
about the world we live in.
Something to always keep in mind: the stories we tell come from our perspectives, shaped by our experiences and driven by our personal choices.
Being open and transparent about this is part of telling an impactful story.
Reproducible research:
Can anyone reproduce your workflow?
Do the results reproduce?
Data availability:
Research that's accessible to everyone:
Why did the story turn out the way it did?
What was tried?
What failed?
Allowing insights into your research process.
This is usually the easier part -- there are so many cool questions we can try to answer with data!
Some examples:
One of the hardest parts and one of the most important steps. Domain expertise comes in! [*]
Quantifying the behavior we want to observe. Evaluating and re-evaluating.
Is it feasible to collect that information?
What information can we not collect?
Re-examining our original question and re-formulating it with what's possible.
Data are objective → FALSE
We make decisions about what data to use to answer questions, decisions about how to collect the data, etc. Let's fully embrace subjectivity in every step of the process.
[*] Domain expertise is a must. Not using it can lead to extremely irresponsible practices.
Real World (of Mexican sharks)
Full Video: Pelagios Kakunjá YouTube Channel
Source: Pelagios Kakunjá
Quantified World
May be able to detect the shark's presence (if tagged)
Cannot record the presence of untagged animals
Surface currents may be available
Temperature
A lot of information cannot be recorded
→ This sets in motion the type of analysis that will be used.
For instance, let's look at this question:
How do white sharks react to tourism boats?
What do we mean by react? What type of behavior are we looking for?
We need to quantify their movements. Whole body movement, longitude/latitude?
What was done: data (location) is collected approximately every 5 minutes.
When we talk about 'white shark' behaviors for this project, we mean "movement patterns observed at a 5-minute temporal scale" based on positional data.
Even the phrase "movement patterns" can be and is quite subjective.
Data collected by Alison Towner.
Work with Pelagios Kakunjá.
For more shark work: check out MigraMar and Alex Hearn.
Organization
Organizing and structuring your data set is one of the most important aspects in any project.
Manipulation
Writing scripts to go from the raw data set to the formatted data set for analysis. In R, take a look at the tidyverse.
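For instance, a raw-to-formatted script might look like the sketch below. This is purely illustrative (not from the talk): the file name, column names, and the 5-minute binning are hypothetical.

```r
# Hypothetical sketch: bin raw tag detections into one row per shark
# per 5-minute interval. File and column names are made up.
library(tidyverse)
library(lubridate)

raw <- read_csv("raw_detections.csv")   # hypothetical raw file

formatted <- raw %>%
  mutate(time_bin = floor_date(timestamp, unit = "5 minutes")) %>%
  group_by(shark_id, time_bin) %>%
  summarise(lon = mean(longitude, na.rm = TRUE),
            lat = mean(latitude,  na.rm = TRUE),
            .groups = "drop")

write_csv(formatted, "formatted_detections.csv")
```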
Once we've quantified the question we'd like to answer and collected data, exploratory data analysis can be done.
Lots and lots of plots...that tell our story.
Depending on what our story is so far, we'll attempt to match our data story with an appropriate model.
Keep in mind, what plots we make and what models we choose is also subjective. We'll embrace it.
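Returning to the "lots of plots" idea, here is one small sketch of an exploratory plot (simulated positions, not the project's data or code):

```r
# Simulated toy tracks for three individuals, plotted with ggplot2
library(ggplot2)
set.seed(1)

toy <- do.call(rbind, lapply(1:3, function(i) {
  data.frame(shark_id = factor(i),
             lon = cumsum(rnorm(50, sd = 0.01)),
             lat = cumsum(rnorm(50, sd = 0.01)))
}))

ggplot(toy, aes(x = lon, y = lat, color = shark_id)) +
  geom_path(alpha = 0.7) +
  labs(title = "Simulated 5-minute tracks",
       x = "Longitude", y = "Latitude")
```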
Simple Linear Regression (SLR)
Basic setup:
$y = \beta_0 + \beta_1 x + \epsilon$
$\epsilon \sim N(0, \sigma^2)$
Altogether, we have
$f(y \mid x) \sim N(\beta_0 + \beta_1 x, \sigma^2)$
Interpretation of the parameters $\beta_0, \beta_1, \sigma^2$:
$\beta_0$ = the expected value of $y$ when $x = 0$
$\beta_1$ = the expected change in $y$ when you increase the value of $x$ by 1
$\sigma^2$ = a measure of the variability around the mean ($\beta_0 + \beta_1 x$)
What does it actually mean?
The means under different values of x are connected (they're friends on a line.)
As with all statistical models, one of the best ways to learn about a model is to simulate from it.
Simulating Data from a SLR
Set values for $\beta_0$, $\beta_1$, and $\sigma^2$, then simulate:

```r
# R Code
set.seed(17)
beta0 <- 1
beta1 <- 2
std.dev <- 5
x <- rt(n = 150, df = 3)
y <- beta0 + beta1*x + rnorm(n = 150, mean = 0, sd = std.dev)
```
Simulating data from a SLR when the sample size is $N = 50, 200, 500$
Simulating data from a SLR with a sample size of $N = 200$ and parameter values $\beta_0 = 1$, $\beta_1 = -2$, $\sigma = 5$
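To close the loop on that simulation (a minimal sketch, not shown in the original slides), we can fit the model back to the simulated data and check that the estimates land near the values we chose:

```r
# Fit the SLR to the simulated (x, y) from the code above;
# estimates should be close to beta0 = 1, beta1 = 2, sigma = 5.
fit <- lm(y ~ x)
summary(fit)$coefficients   # estimates and standard errors for beta0, beta1
sigma(fit)                  # residual standard deviation, estimates sigma
confint(fit)                # uncertainty intervals for beta0 and beta1
```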
Poisson Log-Linear Model
Here, we imply that:
$y \mid x_i \sim \mathrm{Poisson}(\lambda_i)$
$\log(\lambda_i) = \beta_0 + \beta_1 x_i$
A special property of the Poisson distribution is that $E(y \mid x_i) = \mathrm{Var}(y \mid x_i) = \lambda_i$.
Poisson distribution with $\lambda = 5$:
For every value of $x$, we expect to see observations $y$ that are generated according to the Poisson distribution with $\lambda_x = e^{\beta_0 + \beta_1 x}$.
```r
pbeta0 <- 0.1
pbeta1 <- 0.2
px <- runif(n = 100, min = -5, max = 5)
xlambda <- exp(pbeta0 + pbeta1*px)
py <- rpois(n = 100, lambda = xlambda)
```
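And likewise here (a sketch, not part of the original slides): fitting a Poisson GLM to the simulated px and py should roughly recover pbeta0 and pbeta1:

```r
# Fit the Poisson log-linear model to the simulated data above;
# estimates should be near pbeta0 = 0.1 and pbeta1 = 0.2.
pfit <- glm(py ~ px, family = poisson(link = "log"))
summary(pfit)$coefficients   # estimates, standard errors, z values
exp(coef(pfit))              # multiplicative effects on the mean lambda
```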
One example:
$y_{ij} \sim \mathrm{Poisson}(\lambda_{ij})$
$\log(\lambda_{ij}) = \beta_{0,i} + \beta_{1,i} x_{ij}$
$\beta_{0,i} \overset{iid}{\sim} N(\mu_0, \sigma_0^2)$
$\beta_{1,i} \overset{iid}{\sim} N(\mu_1, \sigma_1^2)$
Similar to the Poisson Log-Linear Model from before, but now we assume that some parameters vary across groups. However(!), the parameters $\beta_{0,i}$ and $\beta_{1,i}$ are linked across groups!
```r
## 10 individuals
pb0h <- rnorm(n = 10, mean = 0.1, sd = 0.1)
pb1h <- rnorm(n = 10, mean = 0.2, sd = 0.1)
pxh <- matrix(data = NA, nrow = 10, ncol = 20)
pyh <- matrix(data = NA, nrow = 10, ncol = 20)
for(j in 1:10){
  pxh[j,] <- runif(n = 20, min = -5, max = 5)
  pyh[j,] <- rpois(n = 20, lambda = exp(pb0h[j] + pb1h[j]*pxh[j,]))
}
```
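One way to fit this kind of model (my choice of tool, not something specified in the talk) is lme4's glmer with a random intercept and slope per individual; a sketch using the simulated pyh and pxh:

```r
# Hedged sketch: fit the hierarchical Poisson model with lme4 (an assumption;
# the talk does not name a fitting tool). Uses pyh and pxh simulated above.
library(lme4)

dat <- data.frame(y  = as.vector(t(pyh)),   # row j of pyh = individual j
                  x  = as.vector(t(pxh)),
                  id = factor(rep(1:10, each = 20)))

hfit <- glmer(y ~ x + (1 + x | id), data = dat, family = poisson)
summary(hfit)   # fixed effects ~ (mu0, mu1); random-effect SDs ~ (sigma0, sigma1)
```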
Advantages of simulation:
No data needed.
We can learn about what our models actually do, e.g., how different parameter values affect the outcome.
We build intuition about what different models imply.
If after simulation it's hard to understand what these parameters mean in the overall context...
we can always go back and do more simulations!
It'll be hard to interpret the parameters if we don't have a good idea of what they actually mean for our process.
It's one thing to think about the wonderful stories that models can tell. And another thing to face the reality of trying to fit these models to data.
Some models are well-behaved with small data sets...
Some models are data hungry...
Even if your story matches perfectly with a model's story...a fairytale ending isn't certain.
Models need a certain amount of information before they reveal secrets about the data...
(in proper statistical terms: estimability/identifiability)
Good practice:
What's the sample size? How many replicates are there? Number of individuals?
Simulate data from the model that matches your data specifics
Fit the model to the simulated data. What do you see? How much uncertainty is there in the parameter estimates?
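A minimal sketch of that good practice (not from the slides), reusing the Poisson log-linear example: simulate at the sample size you actually expect to collect, fit, and look at how much uncertainty the estimates carry:

```r
# Simulate at a realistic (small) sample size, fit, and inspect uncertainty.
# Parameter values reuse the Poisson example (pbeta0 = 0.1, pbeta1 = 0.2).
set.seed(42)
n  <- 30                                   # e.g., the number of observations you expect
sx <- runif(n, min = -5, max = 5)
sy <- rpois(n, lambda = exp(0.1 + 0.2 * sx))

sfit <- glm(sy ~ sx, family = poisson(link = "log"))
summary(sfit)$coefficients                 # are the standard errors usable at this n?
```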
Once we fit a model to our data (but not before constructing uncertainty intervals!), we begin to examine the patterns that are revealed.
Can we interpret the parameters?
Are the results in line with what was expected?
I'm sidestepping the idea of 'significance' on purpose. In favor of...
How do the values of my parameters jointly tell the story?
For Bayesian inference, check out Stan.
Mostly done in the context of 'posterior predictive checks' when conducting Bayesian inference, this idea falls into the class of simulation-based model assessment.
Simulate data from the fitted model: $y^{rep}$
Draw values of the parameters $\theta_i$ from their distributions
Plug $\theta_i$ into the model to simulate values of $y_i^{rep}$
Repeat the above steps multiple times (say, 100 times, $i = 1, \ldots, 100$)
Compare $y^{rep}$ to $y$
The key idea is: if our model is the data generating mechanism, could it generate data like ours?
Of course, there are still residuals and other traditional measures of model evaluation.
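A rough non-Bayesian sketch of the same idea (my adaptation, not code from the talk): simulate replicate data sets from a fitted Poisson GLM and compare a summary of the replicates with the same summary of the observed data:

```r
# Simulation-based check: if the fitted model were the data-generating
# mechanism, could it produce data that look like ours?
set.seed(7)
cx   <- runif(100, min = -5, max = 5)
cy   <- rpois(100, lambda = exp(0.1 + 0.2 * cx))   # "observed" data (simulated here)
cfit <- glm(cy ~ cx, family = poisson(link = "log"))

lambda_hat <- fitted(cfit)                              # fitted means
yrep <- replicate(100, rpois(length(cy), lambda_hat))   # 100 replicate data sets

# Compare an observed summary (the variance) to its replicate distribution
obs_var  <- var(cy)
rep_vars <- apply(yrep, 2, var)
mean(rep_vars >= obs_var)   # proportion of replicates at least as dispersed as the data
```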
We need to make sure we interpret the results in the context of:
How we quantified the story we want to tell.
What information was collected.
What information was NOT collected.
What kinds of data can our model produce? And what does that imply about the real world?
What structure did we not capture? What data not collected could affect the outcomes?
The biggest impact we can have is when we make discoveries
about the world we live in.
Something to always keep in mind: the stories we tell come from our perspectives, shaped by our experiences and driven by our personal choices.
Being open and transparent about this is part of telling an impactful story.
Reproducible research:
Can anyone reproduce your workflow?
Do the results reproduce?
Data availability:
Research that's accessible to everyone:
Why did the story turn out the way it did?
What was tried?
What failed?
Allowing insights into your research process.
Legends are entertaining and part of our culture. Examples: la llorona, el cucuy, la mano peluda, el callejón del beso. But there is no way to verify what actually happened.
Legends in science:
Grand claims
Inaccessible data
Inaccessible code
Subjective decisions not declared as such
Stories in science:
Claims based on how the problem was quantified
Accessible data
Accessible code that runs
They comment on the perspective and subjectivity of the analysis
Let's keep moving forward with fewer legends and more stories.