class: title-slide-custom, top, center
background-image: url("galapagos_playa2.jpeg")

AME - Escuela de Ciencia de Datos

September 2019

# **Stories with Impact Through**
# **Data Science**

### [Vianey Leos(-)Barajas](https://vleosbarajas.com)

<img src="Twitter_Logo_Blue.png" width = "40" height="30"/> [@vianey_lb](https://twitter.com/vianey_lb)

Depts of Forestry & Envir. Resources and Statistics

North Carolina State University

---
class: top, center, middle

# What is data science?

Maybe a combination of Statistics, Computer Science and Mathematics?

--

What about machine learning and big data?

--

In this talk we'll ignore any pre-conceived notions of what data science is and focus on:

--

## **<< how to tell stories with data >>**

---
class: top

# Stories with Impact

--

.center[
**The biggest impact we can have is when we make `discoveries` about the world we live in.**
]

--

<img src="realworld_abstract_chart.png">

---

# Stories with Impact

Something to always keep in mind: `the stories we tell come from our perspectives, shaped by our experiences and driven by our personal choices.`

Being open and transparent about this is part of telling an impactful story.

--

.pull-left[
`Reproducible research:`

- Can anyone reproduce your workflow?
- Do the results reproduce?
- [RMarkdown](https://rmarkdown.rstudio.com/)

`Data availability:`

- Making the data available
]

--

.pull-right[
`Research that's accessible to everyone:`

- Why did the story turn out the way it did?
- What was tried?
- What failed?
- Allowing insights into your research process.
]

---
class: inverse, center, middle

# Story Time

---
class: top, middle

# What Story Do We Want to Tell?

This is usually the easier part: there are so many cool questions we can try to answer with data!

_Some examples_:

- **How do white sharks react to tourism boats?**

.center[
<img src="https://media.giphy.com/media/aVcLhnUF0tAB2/giphy.gif" width="200" height="100" />
]

- **How can we personalize (clothing/videos/products)?** Check out [StitchFix's algorithms team](https://www.wired.com/story/stitch-fix-shop-your-looks/?utm_brand=wired&mbid=social_fb&utm_medium=social&utm_social-type=owned&utm_campaign=wired&utm_source=facebook&fbclid=IwAR0-zvkGPDNB4ICC9Qkc5t2MJU77V16OK4O2KozlxV5dvVRMO2Z085n9aDQ).

.center[
<img src="https://media.giphy.com/media/Ig59V6d8nU4Fi/giphy.gif" width="200" height="150" />
]

---
class: top, left

# How do we tell our story with data?

This is one of the hardest parts and one of the most important steps. **Domain expertise comes in!** [*]

1. Quantifying the behavior we want to observe. Evaluating and re-evaluating.
2. Is it feasible to collect that information?
3. What information can we not collect?
4. Re-examining our original question and re-formulating it with what's possible.

_______________________________________________

--

_Data are objective_ `\(\rightarrow\)` **FALSE**

We make decisions about what data to use to answer questions, decisions about how to collect the data, etc. Let's fully embrace subjectivity in every step of the process.

--

.footnote[[*] Domain expertise is a must. Not using it can lead to extremely irresponsible practices.]

---
class: top, left

.center[
# Quantifying the behavior we want to observe
]

.pull-left[
.center[
**Real World** (of Mexican sharks)
]

<img src="https://media.giphy.com/media/mGPcTYEJFDwF0Ok5tl/giphy.gif" />

Full Video: [Pelagios Kakunjá YouTube Channel](https://www.youtube.com/watch?v=kb9CMhB4vuY)

Source: [Pelagios Kakunjá](http://pelagioskakunja.org/)
]

--

.pull-right[
.center[
**Quantified World**
]

- May be able to detect the shark's presence (if tagged)
- Cannot record the presence of untagged animals
- Surface currents may be available
- Temperature
- A lot of information cannot be recorded
]

---
class: top

# Quantifying the behavior we want to observe.
.center[
`\(\rightarrow\)` `This sets in motion the type of analysis that will be used.`
]

For instance, let's look at this question:

.center[
**How do white sharks react to tourism boats?**
]

--

- What do we mean by _react_? `What type of behavior are we looking for?`

--

- We need to quantify their movements. `Whole body movement, longitude/latitude`?

--

- What was done: location data were collected approximately every 5 minutes.

--

When we talk about 'white shark' behaviors for this project, we mean "movement patterns observed at a 5-minute temporal scale" based on positional data.

--

Even the phrase "movement patterns" can be and is quite subjective.

---
class: top, inverse

# White Shark Movements

.center[
<img src="sarika_indtracks.gif" />
]

Data collected by Alison Towner.

---
class: top, inverse

# Bull Shark Movements

Work with [Pelagios Kakunjá](http://pelagioskakunja.org/).

For more shark work: check out [MigraMar](http://migramar.org/hi/en/) and [Alex Hearn](http://migramar.org/hi/en/assembly-members/hearn/).

---
class: top, left

# Data Organization and Manipulation

**Organization**

Organizing and structuring your data set is one of the most important aspects of any project.

.center[
<img src="data_org.JPG"/>
]

**Manipulation**

Writing scripts to go from the _raw_ data set to the formatted data set for analysis. In `R`, take a look at the [tidyverse](https://www.tidyverse.org/).

---
class: middle, center

# What Next?

Once we've quantified the question we'd like to answer and collected data, exploratory data analysis can be done.

.center[
**Lots and lots of plots...that tell our story.**
]

Depending on what our story is so far, we'll attempt to match our data story with an appropriate model.

.footnote[Keep in mind that the plots we make and the models we choose are also subjective. We'll embrace it.]

---
background-position: 100%
class: top, center, inverse

# What models are seen as...
---
background-position: 100%
class: top, center, inverse

# How we should think of models...

---

# Our old friend: Simple Linear Regression

Basic setup:

$$ y = \beta_0 + \beta_1x + \epsilon $$

$$ \epsilon \sim N(0, \sigma^2) $$

--

Altogether, we have

$$ y|x \sim N(\beta_0 + \beta_1x, \sigma^2) $$

--

Interpretation of the parameters `\(\beta_0, \beta_1, \sigma^2\)`:

- `\(\beta_0\)` = the expected value of y when x=0
- `\(\beta_1\)` = the expected change in y when you increase the value of x by 1
- `\(\sigma^2\)` = a measure of the variability around the mean ( `\(\beta_0 + \beta_1x\)` )

---
background-position: 100%
class: top

# The story of SLR

**What does it actually mean?**

.center[
<img src="normals.png" width="500" height="400" />
]

The means under different values of `\(x\)` are connected (they're friends on a line).

---

# SLR is an abstract concept

Like all statistical models.

One of the best ways to learn about a model is to simulate from it.

.pull-left[
### Deterministic (fixed, unknowns):

- `\(\beta_0, \beta_1\)`
- `\(\sigma^2\)`
]

.pull-right[
### Stochastic (drawn from a distribution):

- `\(\epsilon \sim N(0, \sigma^2)\)`
]

```r
# R Code
set.seed(17)
beta0 <- 1      # intercept
beta1 <- 2      # slope
std.dev <- 5    # residual standard deviation (sigma)
x <- rt(n = 150, df = 3)   # covariate values drawn from a t distribution
y <- beta0 + beta1*x + rnorm(n=150, mean = 0, sd=std.dev)
```

---
class: middle

**Simulating Data from a SLR**

```r
# R Code
set.seed(17)
beta0 <- 1
beta1 <- 2
std.dev <- 5
x <- rt(n = 150, df = 3)
# The SLR itself: linear mean plus normal noise
y <- beta0 + beta1*x + rnorm(n=150, mean = 0, sd=std.dev)
```

---

Simulating data from a SLR with sample sizes `\(N = 50, 200, 500\)`

---

Simulating data from a SLR with a sample size of `\(N = 200\)` and parameter values of `\(\beta_0 = 1, \beta_1 = -2, \sigma=5\)`

---

# Generalized Linear Model: Poisson Log-Linear Model

Here, we assume that:

$$ y|x_i \sim Poisson (\lambda_i) $$

$$ \log(\lambda_i) = \beta_0 + \beta_1 x_i $$

A special property of the Poisson distribution is that
`\(E(y|x_i) = Var(y|x_i) = \lambda_i\)`.

**Poisson distribution with `\(\lambda = 5\)`:**

---

# The story of the Poisson Log-Linear Model

For every value of `\(x\)`, we expect to see observations `\(y\)` that are generated according to the Poisson distribution with `\(\lambda_x = e^{\beta_0 + \beta_1 x}\)`.

---

# Simulating from a Poisson Log-Linear Model

```r
pbeta0 <- 0.1   # intercept on the log scale
pbeta1 <- 0.2   # slope on the log scale
px <- runif(n=100, min=-5, max=5)
xlambda <- exp(pbeta0 + pbeta1*px)   # Poisson mean for each observation
py <- rpois(n=100, lambda=xlambda)
```

---

# Hierarchical Model

**One example**

`$$y_{ij} \sim Poisson(\lambda_{ij})$$`

`$$\log(\lambda_{ij}) = \beta_{0,i} + \beta_{1,i} x_{ij}$$`

`$$\beta_{0,i} \overset{iid}{\sim} N(\mu_0, \sigma_0^2)$$`

`$$\beta_{1,i} \overset{iid}{\sim} N(\mu_1, \sigma_1^2)$$`

### Story

Similar to the Poisson Log-Linear Model from before, but now we assume that some parameters vary across groups.

However(!), the parameters `\(\beta_{0,i}\)` and `\(\beta_{1,i}\)` are linked across groups: within each set, they are all drawn from the same distribution!

---

# Simulating from a Hierarchical Model

```r
## 10 individuals, 20 observations each
## (the loop index j plays the role of the group index i in the model)
pb0h <- rnorm(n = 10, mean = 0.1, sd = 0.1)   # individual intercepts
pb1h <- rnorm(n=10, mean=0.2, sd=0.1)         # individual slopes
pxh <- matrix(data=NA, nrow=10, ncol=20)
pyh <- matrix(data=NA, nrow=10, ncol=20)
for(j in 1:10){
  pxh[j,] <- runif(n=20, min=-5, max=5)
  pyh[j,] <- rpois(n=20, lambda=exp(pb0h[j] + pb1h[j]*pxh[j,]))
}
```

---

# Simulation -- The Underused Tool

Advantages of simulation:

- No data needed.
- We can learn what our models actually do, e.g. how different parameter values affect the outcome.
- We build intuition about what different models imply.

--

If after simulation it's hard to understand what these parameters mean in the overall context...
--

.center[
**we can always go back and do more simulations!**
]

--

*It'll be hard to interpret the parameters if we don't have a good idea of what they actually mean for our process.*

---
class: inverse, top

# Choosing candidate models

.center[
## **Remember: every model tells a story.**
]

--

.center[
## **How does the model's story tie into the story we want to tell?**
]

--

.center[
## **When we simulate data from various models, does it look like the data we have?**
]

---
class: center, inverse

# Fitting models to data

It's one thing to think about the wonderful stories that models tell.

And another thing to face the reality of trying to fit these models to data.

--

**Some models are well-behaved with small data sets...**

<img src="https://media.giphy.com/media/3ndAvMC5LFPNMCzq7m/giphy.gif" width="170" height="150"/>

--

**Some models are data hungry...**

<img src="https://media.giphy.com/media/xT0xeMA62E1XIlup68/giphy.gif" width="200" height="150" />

---
class: top

# Back to Simulation

Even if your story matches perfectly with a model's story...a fairytale ending isn't certain.

--

.center[
Models need a certain amount of information before they reveal secrets about the data...
]

.center[
_(in proper statistical terms: `estimability/identifiability`)_
]

--

______________________________

**Good practice:**

- Put together information about your data:

.center[
*What's the sample size?* *How many replicates are there?* *Number of individuals?*
]

--

- Simulate data from the model that matches your data specifics

--

- Fit the model to the simulated data. What do you see? How much uncertainty is there in the parameter estimates?

---
class: top

# Fitting a Model

Once we have fit a model to our data and constructed uncertainty intervals (and not before), we begin to examine the patterns that are revealed.

--

- Can we interpret the parameters?

--

- Are the results in line with what was expected?

--

- _I'm sidestepping the idea of 'significance' on purpose._ In favor of...
--

- How do the values of my parameters jointly tell the story?

--

.footnote[
For Bayesian inference, check out [Stan](https://mc-stan.org/).
]

---

# Assessing/Interpreting a Model

Mostly done in the context of 'posterior predictive checks' when conducting Bayesian inference, this idea falls into the class of simulation-based model assessment.

`Simulate data from the fitted model`: `\(\mathbf{y}_{rep}\)`

- Draw values of the parameters `\(\boldsymbol{\theta}^i\)` from their distributions
- Plug `\(\boldsymbol{\theta}^i\)` into the model to simulate values of `\(\mathbf{y}_{rep}^i\)`
- Repeat the above steps multiple times (say, 100 times, `\(i = 1, \ldots, 100\)`)
- Compare `\(\mathbf{y}_{rep}\)` to `\(\mathbf{y}\)`

**The key idea is: if our model is the _data generating mechanism_, could it generate data like ours?**

Of course, there are still residuals and other traditional measures of model evaluation.

---

# Interpreting our Model Results

We need to make sure we interpret the results in the context of:

--

- How we quantified the story we want to tell.

--

- What information was collected.

--

- `What information was NOT collected`

--

- What kinds of data can our model produce? And what does that imply about the real world?

--

- What structure did we not capture? What data not collected could affect the outcomes?

---
class: top

# Stories with Impact

--

.center[
**The biggest impact we can have is when we make `discoveries` about the world we live in.**
]

<img src="realworld_abstract_chart.png">

---

# Stories with Impact

Something to always keep in mind: `the stories we tell come from our perspectives, shaped by our experiences and driven by our personal choices.`

Being open and transparent about this is part of telling an impactful story.

--

.pull-left[
`Reproducible research:`

- Can anyone reproduce your workflow?
- Do the results reproduce?
- [RMarkdown](https://rmarkdown.rstudio.com/)

`Data availability:`

- Making the data available
]

--

.pull-right[
`Research that's accessible to everyone:`

- Why did the story turn out the way it did?
- What was tried?
- What failed?
- Allowing insights into your research process.
]

---
class: top

# Legends vs. Stories

Legends are entertaining and part of the culture. Examples: La Llorona, El Cucuy, La Mano Peluda, El Callejón del Beso.

But there is no way to verify what actually happened.

--

.pull-left[
`Legends in science`:

- Grand claims
- Inaccessible data
- Inaccessible code
- Subjective decisions not declared as such
]

--

.pull-right[
`Stories in science`:

- Claims grounded in how the problem was quantified
- Accessible data
- Code that is accessible and runs
- Commentary on the perspective and subjectivity of the analysis
]

--

.center[
**Let's move forward with fewer legends and more stories.**
]

---
class: center, middle

# Thank you!

Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).
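
---

# Appendix: Simulate, Fit, Check

The simulate-then-fit practice and the simulation-based model check described in the talk could be sketched roughly as follows. This is not code from the talk. It reuses the parameter values of the Poisson log-linear simulation (`pbeta0`, `pbeta1`); the sample size `n`, the use of `glm()`, the Wald-type intervals, the normal approximation used to draw parameter values, and the proportion-of-zeros summary are all assumptions made for illustration.

```r
# R Code (illustrative sketch, not from the original slides)
set.seed(17)

# 1. Simulate data that match the data specifics (assumed here: n = 100)
n      <- 100
pbeta0 <- 0.1
pbeta1 <- 0.2
px <- runif(n = n, min = -5, max = 5)
py <- rpois(n = n, lambda = exp(pbeta0 + pbeta1 * px))

# 2. Fit the model to the simulated data and look at the uncertainty
fit <- glm(py ~ px, family = poisson(link = "log"))
est <- unname(coef(fit))
se  <- unname(sqrt(diag(vcov(fit))))
# Wald-type 95% intervals (an approximation, chosen for simplicity)
ci <- cbind(lower = est - 1.96 * se, upper = est + 1.96 * se)

# 3. Simulation-based check: draw parameters, simulate y_rep, compare to y.
#    Parameters are drawn from a normal approximation centred at the
#    estimates (a stand-in for posterior draws in a Bayesian fit).
n_rep <- 100
L <- t(chol(vcov(fit)))            # L %*% t(L) equals vcov(fit)
prop_zero_rep <- numeric(n_rep)
for (i in seq_len(n_rep)) {
  theta_i <- est + as.vector(L %*% rnorm(length(est)))
  y_rep   <- rpois(n = n, lambda = exp(theta_i[1] + theta_i[2] * px))
  prop_zero_rep[i] <- mean(y_rep == 0)   # summary of replicate i
}

# Same summary for the "observed" data, next to the replicated spread
prop_zero_obs <- mean(py == 0)
prop_zero_obs
quantile(prop_zero_rep, probs = c(0.025, 0.5, 0.975))
```

If the observed summary sits far outside the spread of the replicated summaries, the model is probably not telling the same story as the data. Any other summary (mean, maximum, variance) can be swapped in, depending on which feature of the data matters for the story.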