class: center, middle, inverse, title-slide # Intro to Bayes ### Daniel Anderson ### Week 7 --- layout: true <script> feather.replace() </script> <div class="slides-footer"> <span> <a class = "footer-icon-link" href = "https://github.com/datalorax/mlm2/raw/main/static/slides/w7p1.pdf"> <i class = "footer-icon" data-feather="download"></i> </a> <a class = "footer-icon-link" href = "https://mlm2.netlify.app/slides/w7p1.html"> <i class = "footer-icon" data-feather="link"></i> </a> <a class = "footer-icon-link" href = "https://mlm2-2021.netlify.app"> <i class = "footer-icon" data-feather="globe"></i> </a> <a class = "footer-icon-link" href = "https://github.com/datalorax/mlm2"> <i class = "footer-icon" data-feather="github"></i> </a> </span> </div> --- # Agenda .gray[[We're not going to get through it all]] * Review Homework 2 * Some equation practice * Introduce Bayes theorem + Go through an example with estimating a mean * Discuss Bayes in the context of regression modeling --- class: inverse-red middle # Homework 2 Review --- class: inverse-red middle # Equation practice --- # The data From an example in the Mplus manual. I made up the column names.
```r mplus_d <- read_csv(here::here("data", "mplus920.csv")) mplus_d ``` ``` ## # A tibble: 7,500 x 6 ## score baseline sch_treatment dist_ses schid distid ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 5.559216 1.383101 1 -0.642262 1 1 ## 2 -0.107394 -0.789654 1 -0.642262 1 1 ## 3 0.049476 -0.760867 1 -0.642262 1 1 ## 4 -2.387703 -0.798527 1 -0.642262 1 1 ## 5 1.180393 -0.411377 1 -0.642262 1 1 ## 6 3.959005 -0.987154 1 -0.642262 2 1 ## 7 -0.895792 -1.966773 1 -0.642262 2 1 ## 8 2.879087 0.42117 1 -0.642262 2 1 ## 9 5.611088 1.67047 1 -0.642262 2 1 ## 10 2.828119 0.001154 1 -0.642262 2 1 ## # … with 7,490 more rows ``` --- # Model 1 ### Fit the following model
$$ `\begin{aligned} \operatorname{score}_{i} &\sim N \left(\alpha_{j[i]}, \sigma^2 \right) \\ \alpha_{j} &\sim N \left(\mu_{\alpha_{j}}, \sigma^2_{\alpha_{j}} \right) \text{, for distid j = 1,} \dots \text{,J} \end{aligned}` $$ -- ```r lmer(score ~ 1 + (1|distid), data = mplus_d) ``` --- # Model 2 ### Fit the following model
$$ `\begin{aligned} \operatorname{score}_{i} &\sim N \left(\alpha_{j[i],k[i]} + \beta_{1}(\operatorname{baseline}), \sigma^2 \right) \\ \alpha_{j} &\sim N \left(\mu_{\alpha_{j}}, \sigma^2_{\alpha_{j}} \right) \text{, for schid j = 1,} \dots \text{,J} \\ \alpha_{k} &\sim N \left(\mu_{\alpha_{k}}, \sigma^2_{\alpha_{k}} \right) \text{, for distid k = 1,} \dots \text{,K} \end{aligned}` $$ -- ```r lmer(score ~ baseline + (1|schid) + (1|distid), data = mplus_d) ``` --- # Model 3 ### Fit the following model
$$ `\begin{aligned} \operatorname{score}_{i} &\sim N \left(\alpha_{j[i],k[i]} + \beta_{1j[i]}(\operatorname{baseline}), \sigma^2 \right) \\ \left( \begin{array}{c} \begin{aligned} &\alpha_{j} \\ &\beta_{1j} \end{aligned} \end{array} \right) &\sim N \left( \left( \begin{array}{c} \begin{aligned} &\mu_{\alpha_{j}} \\ &\mu_{\beta_{1j}} \end{aligned} \end{array} \right) , \left( \begin{array}{cc} \sigma^2_{\alpha_{j}} & \rho_{\alpha_{j}\beta_{1j}} \\ \rho_{\beta_{1j}\alpha_{j}} & \sigma^2_{\beta_{1j}} \end{array} \right) \right) \text{, for schid j = 1,} \dots \text{,J} \\ \alpha_{k} &\sim N \left(\gamma_{0}^{\alpha} + \gamma_{1}^{\alpha}(\operatorname{dist\_ses}) + \gamma_{2}^{\alpha}(\operatorname{baseline} \times \operatorname{dist\_ses}), \sigma^2_{\alpha_{k}} \right) \text{, for distid k = 1,} \dots \text{,K} \end{aligned}` $$ -- ```r lmer(score ~ baseline * dist_ses + (baseline|schid) + (1|distid), data = mplus_d) ``` --- # Model 4 ### Fit the following model
$$ `\begin{aligned} \operatorname{score}_{i} &\sim N \left(\alpha_{j[i],k[i]} + \beta_{1j[i]}(\operatorname{baseline}), \sigma^2 \right) \\ \left( \begin{array}{c} \begin{aligned} &\alpha_{j} \\ &\beta_{1j} \end{aligned} \end{array} \right) &\sim N \left( \left( \begin{array}{c} \begin{aligned} &\gamma_{0}^{\alpha} + \gamma_{1}^{\alpha}(\operatorname{sch\_treatment}) \\ &\mu_{\beta_{1j}} \end{aligned} \end{array} \right) , \left( \begin{array}{cc} \sigma^2_{\alpha_{j}} & \rho_{\alpha_{j}\beta_{1j}} \\ \rho_{\beta_{1j}\alpha_{j}} & \sigma^2_{\beta_{1j}} \end{array} \right) \right) \text{, for schid j = 1,} \dots \text{,J} \\ \alpha_{k} &\sim N \left(\gamma_{0}^{\alpha} + \gamma_{1}^{\alpha}(\operatorname{dist\_ses}), \sigma^2_{\alpha_{k}} \right) \text{, for distid k = 1,} \dots \text{,K} \end{aligned}` $$ -- ```r lmer(score ~ baseline + sch_treatment + dist_ses + (baseline|schid) + (1|distid), data = mplus_d) ``` --- # Model 5 ### Fit the following model
$$ `\begin{aligned} \operatorname{score}_{i} &\sim N \left(\alpha_{j[i],k[i]} + \beta_{1j[i]}(\operatorname{baseline}), \sigma^2 \right) \\ \left( \begin{array}{c} \begin{aligned} &\alpha_{j} \\ &\beta_{1j} \end{aligned} \end{array} \right) &\sim N \left( \left( \begin{array}{c} \begin{aligned} &\gamma_{0}^{\alpha} + \gamma_{1k[i]}^{\alpha}(\operatorname{sch\_treatment}) \\ &\gamma^{\beta_{1}}_{0} + \gamma^{\beta_{1}}_{1}(\operatorname{sch\_treatment}) \end{aligned} \end{array} \right) , \left( \begin{array}{cc} \sigma^2_{\alpha_{j}} & \rho_{\alpha_{j}\beta_{1j}} \\ \rho_{\beta_{1j}\alpha_{j}} & \sigma^2_{\beta_{1j}} \end{array} \right) \right) \text{, for schid j = 1,} \dots \text{,J} \\ \left( \begin{array}{c} \begin{aligned} &\alpha_{k} \\ &\gamma_{1k} \end{aligned} \end{array} \right) &\sim N \left( \left( \begin{array}{c} \begin{aligned} &\gamma_{0}^{\alpha} + \gamma_{1}^{\alpha}(\operatorname{dist\_ses}) \\ &\mu_{\gamma_{1k}} \end{aligned} \end{array} \right) , \left( \begin{array}{cc} \sigma^2_{\alpha_{k}} & \rho_{\alpha_{k}\gamma_{1k}} \\ \rho_{\gamma_{1k}\alpha_{k}} & \sigma^2_{\gamma_{1k}} \end{array} \right) \right) \text{, for distid k = 1,} \dots \text{,K} \end{aligned}` $$ -- ```r lmer(score ~ baseline * sch_treatment + dist_ses + (baseline|schid) + (sch_treatment|distid), data = mplus_d) ``` --- # Model 6 ### Fit the following model
$$ \small `\begin{aligned} \operatorname{score}_{i} &\sim N \left(\alpha_{j[i],k[i]} + \beta_{1j[i],k[i]}(\operatorname{baseline}), \sigma^2 \right) \\ \left( \begin{array}{c} \begin{aligned} &\alpha_{j} \\ &\beta_{1j} \end{aligned} \end{array} \right) &\sim N \left( \left( \begin{array}{c} \begin{aligned} &\gamma_{0}^{\alpha} + \gamma_{1k[i]}^{\alpha}(\operatorname{sch\_treatment}) \\ &\gamma^{\beta_{1}}_{0} + \gamma^{\beta_{1}}_{1k[i]}(\operatorname{sch\_treatment}) \end{aligned} \end{array} \right) , \left( \begin{array}{cc} \sigma^2_{\alpha_{j}} & \rho_{\alpha_{j}\beta_{1j}} \\ \rho_{\beta_{1j}\alpha_{j}} & \sigma^2_{\beta_{1j}} \end{array} \right) \right) \text{, for schid j = 1,} \dots \text{,J} \\ \left( \begin{array}{c} \begin{aligned} &\alpha_{k} \\ &\beta_{1k} \\ &\gamma_{1k} \\ &\gamma^{\beta_{1}}_{1k} \end{aligned} \end{array} \right) &\sim N \left( \left( \begin{array}{c} \begin{aligned} &\gamma_{0}^{\alpha} + \gamma_{1}^{\alpha}(\operatorname{dist\_ses}) \\ &\mu_{\beta_{1k}} \\ &\mu_{\gamma_{1k}} \\ &\mu_{\gamma^{\beta_{1}}_{1k}} \end{aligned} \end{array} \right) , \left( \begin{array}{cccc} \sigma^2_{\alpha_{k}} & 0 & 0 & 0 \\ 0 & \sigma^2_{\beta_{1k}} & 0 & 0 \\ 0 & 0 & \sigma^2_{\gamma_{1k}} & 0 \\ 0 & 0 & 0 & \sigma^2_{\gamma^{\beta_{1}}_{1k}} \end{array} \right) \right) \text{, for distid k = 1,} \dots \text{,K} \end{aligned}` $$ -- ```r lmer(score ~ baseline * sch_treatment + dist_ses + (baseline|schid) + (baseline * sch_treatment||distid), data = mplus_d) ``` --- # Final one ### Fit the following model
$$ \scriptsize `\begin{aligned} \operatorname{score}_{i} &\sim N \left(\alpha_{j[i],k[i]} + \beta_{1j[i],k[i]}(\operatorname{baseline}), \sigma^2 \right) \\ \left( \begin{array}{c} \begin{aligned} &\alpha_{j} \\ &\beta_{1j} \end{aligned} \end{array} \right) &\sim N \left( \left( \begin{array}{c} \begin{aligned} &\gamma_{0}^{\alpha} + \gamma_{1}^{\alpha}(\operatorname{sch\_treatment}) \\ &\gamma^{\beta_{1}}_{0} + \gamma^{\beta_{1}}_{1}(\operatorname{sch\_treatment}) \end{aligned} \end{array} \right) , \left( \begin{array}{cc} \sigma^2_{\alpha_{j}} & \rho_{\alpha_{j}\beta_{1j}} \\ \rho_{\beta_{1j}\alpha_{j}} & \sigma^2_{\beta_{1j}} \end{array} \right) \right) \text{, for schid j = 1,} \dots \text{,J} \\ \left( \begin{array}{c} \begin{aligned} &\alpha_{k} \\ &\beta_{1k} \end{aligned} \end{array} \right) &\sim N \left( \left( \begin{array}{c} \begin{aligned} &\gamma_{0}^{\alpha} + \gamma_{1}^{\alpha}(\operatorname{dist\_ses}) + \gamma_{2}^{\alpha}(\operatorname{dist\_ses} \times \operatorname{sch\_treatment}) \\ &\gamma^{\beta_{1}}_{0} + \gamma^{\beta_{1}}_{1}(\operatorname{dist\_ses}) + \gamma^{\beta_{1}}_{2}(\operatorname{dist\_ses} \times \operatorname{sch\_treatment}) \end{aligned} \end{array} \right) , \left( \begin{array}{cc} \sigma^2_{\alpha_{k}} & \rho_{\alpha_{k}\beta_{1k}} \\ \rho_{\beta_{1k}\alpha_{k}} & \sigma^2_{\beta_{1k}} \end{array} \right) \right) \text{, for distid k = 1,} \dots \text{,K} \end{aligned}` $$ -- ```r lmer(score ~ baseline * sch_treatment * dist_ses + (baseline|schid) + (baseline|distid), data = mplus_d) ``` --- class: inverse-blue middle # Bayes --- # A disclaimer * There is **no** chance we'll really be able to do Bayes justice in this class * The hope for today is that you'll get an introduction * By the end you should be able to fit the models you already can, but in a Bayes framework * Hopefully you also recognize the tradeoffs, and potential extensions --- class: inverse-red middle # Bayes theorem --- # In equation form You'll see this presented many different ways, perhaps most commonly as $$ p(B \mid A) = \frac{ p(A \mid B) \times p(B)}{p(A)} $$ where `\(\mid\)` is read as "given" and `\(p\)` is the probability -- I prefer to give `\(A\)` and `\(B\)` more meaningful names -- $$ p(\text{prior} \mid \text{data}) = \frac{ p(\text{data} \mid \text{prior}) \times p(\text{prior})}{p(\text{data})} $$ --- # A real example Classifying a student with a learning disability. We want to know $$ p(\text{LD} \mid \text{Test}_p) = \frac{ p(\text{Test}_p \mid \text{LD}) \times p(\text{LD})}{p(\text{Test}_p)} $$ -- Notice, this means we need to know: -- * True positive rate of the test, `\(p(\text{Test}_p \mid \text{LD})\)` -- * Base rate for learning disabilities, `\(p(\text{LD})\)` -- * Base rate for testing positive, `\(p(\text{Test}_p)\)` --- # Estimating If we have these things, we can estimate the probability that a student has a learning disability, given a positive test. -- ## Let's assume: * `\(p(\text{Test}_p \mid \text{LD}) = 0.90\)` * `\(p(\text{LD}) = 0.10\)` * `\(p(\text{Test}_p) = 0.20\)` --- $$ p(\text{LD} \mid \text{Test}_p) = \frac{ .90 \times .10}{.20} $$ -- $$ p(\text{LD} \mid \text{Test}_p) = 0.45 $$ -- A bit less than you might have expected? Probability is hard... -- When we see things like "90% true positive rate" we want to interpret it as `\(p(\text{LD} \mid \text{Test}_p)\)`, when it's actually `\(p(\text{Test}_p \mid \text{LD})\)` --- # Pieces If we know all the pieces, we can apply Bayes theorem directly.
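-- Plugging in the rates we just assumed - a quick sketch in R (the object names are just for illustration):

```r
# Bayes theorem, using the rates assumed on the previous slides
p_test_given_ld <- 0.90 # true positive rate, p(Test_p | LD)
p_ld            <- 0.10 # base rate of learning disabilities, p(LD)
p_test          <- 0.20 # base rate of a positive test, p(Test_p)

(p_test_given_ld * p_ld) / p_test # = 0.45
```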
-- ![](https://media2.giphy.com/media/2ZVrNVOtaM2q1zluYs/giphy.gif) Unfortunately this is almost never the case... --- # Alternative view $$ \text{updated beliefs} = \frac{\text{likelihood of data} \times \text{prior information}}{\text{average likelihood}} $$ -- How do we calculate the likelihood of the data? We have to assume some distribution. --- # Example with IQ .footnote[This example borrowed from [TJ Mahr](https://cdn.rawgit.com/tjmahr/Psych710_BayesLecture/55f446a0/bayes_slides.html)] ```r #install.packages("carData") iqs <- carData::Burt$IQbio iqs ``` ``` ## [1] 82 80 88 108 116 117 132 71 75 93 95 88 111 63 77 86 83 93 97 87 94 96 112 113 106 107 98 ``` -- IQ scores are generally assumed to be generated from a distribution that looks like this: $$ IQ_i \sim N(100, 15) $$ -- ![](w7p1_files/figure-html/unnamed-chunk-18-1.png)<!-- --> --- # Likelihood What's the likelihood of a score of 80, assuming this distribution? -- ![](w7p1_files/figure-html/unnamed-chunk-19-1.png)<!-- --> -- ```r dnorm(80, mean = 100, sd = 15) ``` ``` ## [1] 0.010934 ``` --- # Likelihood of the data We multiply the individual likelihoods to get the overall likelihood of the data. However, this quickly leads to very small numbers. Computationally, it's easier to sum the *log* of these likelihoods instead. -- ```r dnorm(iqs, mean = 100, sd = 15, log = TRUE) ``` ``` ## [1] -4.346989 -4.515878 -3.946989 -3.769211 -4.195878 -4.269211 -5.902544 -5.495878 -5.015878 -3.735878 -3.682544 -3.946989 -3.895878 -6.669211 -4.802544 ## [16] -4.062544 -4.269211 -3.735878 -3.646989 -4.002544 -3.706989 -3.662544 -3.946989 -4.002544 -3.706989 -3.735878 -3.635878 ``` -- ```r sum(dnorm(iqs, mean = 100, sd = 15, log = TRUE)) ``` ``` ## [1] -114.3065 ``` --- # Alternative distributions What if we assumed the data were generated from an alternative distribution, say `\(IQ_i \sim N(115, 5)\)`? -- ```r sum(dnorm(iqs, mean = 115, sd = 5, log = TRUE)) ``` ``` ## [1] -416.3662 ``` -- The value is *much* lower. In most models, we are estimating `\(\mu\)` and `\(\sigma\)`, and trying to find values that *maximize* the sum of the log likelihoods. --- # Visually The real data generating distribution ![](w7p1_files/figure-html/unnamed-chunk-24-1.png)<!-- --> --- # Visually The poorly fitting one ![](w7p1_files/figure-html/unnamed-chunk-25-1.png)<!-- --> --- # Non-Bayesian In a frequentist regression model, we would find parameters that *maximize* the likelihood. Note - the distributional mean is often conditional. -- This is part of why I've come to prefer notation that emphasizes the data generating process. --- # Example I know we've talked about this before, but a simple linear regression model like this ```r m <- lm(IQbio ~ class, data = carData::Burt) ``` -- is generally displayed like this $$ \normalsize `\begin{aligned} \operatorname{IQbio} &= \alpha + \beta_{1}(\operatorname{class}_{\operatorname{low}}) + \beta_{2}(\operatorname{class}_{\operatorname{medium}}) + \epsilon \end{aligned}` $$ -- But we could display the same thing like this $$ `\begin{align} \operatorname{IQbio} &\sim N(\widehat{\mu}, \widehat{\sigma}) \\ \widehat{\mu} &= \alpha + \beta_{1}(\operatorname{class}_{\operatorname{low}}) + \beta_{2}(\operatorname{class}_{\operatorname{medium}}) \end{align}` $$ --- class: inverse-red middle # Priors --- # Bayesian posterior $$ \text{posterior} = \frac{ \text{likelihood} \times \text{prior}}{\text{average likelihood}} $$ -- The above is how we estimate with Bayes.
-- In words, it states that our updated beliefs (posterior) depend on the evidence from our data (likelihood) and our prior knowledge/conceptions/information (prior). -- Our prior will shift in accordance with the evidence from the data --- # Basic example Let's walk through a basic example where we're just estimating a mean. We'll assume we somehow magically know the variance. Please follow along. -- First, generate some data ```r set.seed(123) true_data <- rnorm(50, 5, 1) ``` --- # Grid search We're now going to specify a grid of possible means for our data. Let's search anywhere from -3 to 12 in 0.1 intervals. -- ```r grid <- tibble(possible_mean = seq(-3, 12, 0.1)) ``` -- Next, we'll specify a *prior distribution*. That is - how likely do we *think* each of these possible means is? -- Let's say our best guess is `\(\mu = 2\)`. Values on either side of `\(2\)` should be less likely. -- ```r prior <- dnorm(grid$possible_mean, mean = 2, sd = 1) ``` --- # Plot our prior ```r grid %>% mutate(prior = prior) %>% ggplot(aes(possible_mean, prior)) + geom_line() ``` ![](w7p1_files/figure-html/unnamed-chunk-31-1.png)<!-- --> Note that the *strength* of our prior depends on the standard deviation - a smaller standard deviation implies a stronger (more informative) prior. This would be our best guess as to where the data would fall *before* observing the data. --- # Look at other priors ```r grid %>% mutate(prior1 = dnorm(possible_mean, mean = 2, sd = 1), prior2 = dnorm(possible_mean, mean = 2, sd = 2), prior3 = dnorm(possible_mean, mean = 2, sd = 3)) %>% ggplot(aes(possible_mean)) + geom_line(aes(y = prior1)) + geom_line(aes(y = prior2), color = "cornflowerblue") + geom_line(aes(y = prior3), color = "firebrick") ``` ![](w7p1_files/figure-html/unnamed-chunk-32-1.png)<!-- --> --- # Set prior * Let's go with a fairly conservative prior, with `\(\mu = 2, \sigma = 3\)`.
* We also need to normalize it so the probability sums to 1.0 -- ```r grid <- grid %>% mutate(prior = dnorm(possible_mean, mean = 2, sd = 3), prior = prior / sum(prior)) # normalize ``` --- # Observe 1 data point ```r grid <- grid %>% mutate(likelihood = dnorm(true_data[1], possible_mean, 2)) grid ``` ``` ## # A tibble: 151 x 3 ## possible_mean prior likelihood ## <dbl> <dbl> <dbl> ## 1 -3 0.003477802 0.0001973758 ## 2 -2.9 0.003674439 0.0002374240 ## 3 -2.8 0.003877883 0.0002848850 ## 4 -2.7 0.004088046 0.0003409800 ## 5 -2.6 0.004304813 0.0004071013 ## 6 -2.5 0.004528041 0.0004848308 ## 7 -2.4 0.004757554 0.0005759600 ## 8 -2.3 0.004993151 0.0006825094 ## 9 -2.2 0.005234594 0.0008067505 ## 10 -2.1 0.005481619 0.0009512268 ## # … with 141 more rows ``` --- # Compute posterior ```r grid <- grid %>% mutate(posterior = likelihood * prior, posterior = posterior / sum(posterior)) # normalize ``` --- # Plot ```r grid %>% pivot_longer(-possible_mean) %>% ggplot(aes(possible_mean, value)) + geom_line(aes(color = name)) ``` ![](w7p1_files/figure-html/unnamed-chunk-36-1.png)<!-- --> --- # Observe a second data point The old posterior becomes our new prior ```r grid <- grid %>% mutate(likelihood = dnorm(true_data[2], possible_mean, 2), posterior = likelihood * posterior, posterior = posterior / sum(posterior)) ``` --- # Plot ```r grid %>% pivot_longer(-possible_mean) %>% ggplot(aes(possible_mean, value)) + geom_line(aes(color = name)) ``` ![](w7p1_files/figure-html/unnamed-chunk-38-1.png)<!-- --> --- # Observe a third data point ```r grid <- grid %>% mutate(likelihood = dnorm(true_data[3], possible_mean, 2), posterior = likelihood * posterior, posterior = posterior / sum(posterior)) ``` --- # Plot ```r grid %>% pivot_longer(-possible_mean) %>% ggplot(aes(possible_mean, value)) + geom_line(aes(color = name)) ``` ![](w7p1_files/figure-html/unnamed-chunk-40-1.png)<!-- --> --- # All the data ```r grid <- grid %>% mutate(prior = dnorm(grid$possible_mean, mean = 2, sd = 3), prior = prior / sum(prior), posterior = prior) # best guess before seeing data for(i in seq_along(true_data)) { grid <- grid %>% mutate(likelihood = dnorm(true_data[i], possible_mean, 2), posterior = likelihood * posterior, posterior = posterior / sum(posterior)) } ``` --- ```r grid %>% pivot_longer(-possible_mean) %>% ggplot(aes(possible_mean, value)) + geom_line(aes(color = name)) ``` ![](w7p1_files/figure-html/unnamed-chunk-42-1.png)<!-- --> --- # Posterior * We can summarize our posterior distribution * This is a fundamental difference between Bayesian & frequentist approaches + In Bayes, our data is assumed fixed, our parameters random + In frequentist, our data is assumed random, our parameters fixed --- # Most likely? ```r grid %>% filter(posterior == max(posterior)) ``` ``` ## # A tibble: 1 x 4 ## possible_mean prior likelihood posterior ## <dbl> <dbl> <dbl> <dbl> ## 1 5 0.008459494 0.1992979 0.1416204 ``` --- # Sampling * Now that we have a posterior distribution, we can sample from it to help us with inference * Each possible mean should be sampled in accordance with its probability specified by the posterior. 
-- * Let's draw 10,000 samples ```r posterior_samples <- sample(grid$possible_mean, size = 10000, replace = TRUE, prob = grid$posterior) ``` --- # Inference First, let's plot the samples -- ```r ggplot(data.frame(sample = posterior_samples), aes(sample)) + geom_histogram(bins = 100) ``` ![](w7p1_files/figure-html/unnamed-chunk-45-1.png)<!-- --> --- # Central tendency ```r mean(posterior_samples) ``` ``` ## [1] 5.00441 ``` ```r median(posterior_samples) ``` ``` ## [1] 5 ``` -- ## Spread ```r sd(posterior_samples) ``` ``` ## [1] 0.278466 ``` --- # Credible intervals Let's compute an 80% credible interval ```r tibble(posterior_samples) %>% summarize(ci_80 = quantile(posterior_samples, c(0.1, 0.9))) ``` ``` ## # A tibble: 2 x 1 ## ci_80 ## <dbl> ## 1 4.6 ## 2 5.4 ``` -- ## What's the chance the "true" mean is less than 4.8? ```r sum(posterior_samples < 4.8) / length(posterior_samples) * 100 ``` ``` ## [1] 18.05 ``` --- # Ranges What's the probability the "true" mean is between 5.2 and 5.5? -- ```r sum(posterior_samples >= 5.2 & posterior_samples <= 5.5) / length(posterior_samples) * 100 ``` ``` ## [1] 27.94 ``` -- Greater than 4.5? ```r sum(posterior_samples > 4.5) / length(posterior_samples) * 100 ``` ``` ## [1] 95.05 ``` -- Note this is much more natural than frequentist statistics --- # Change our prior Let's try again with a tighter prior ```r grid <- grid %>% mutate(prior = dnorm(grid$possible_mean, mean = 2, sd = 0.1), prior = prior / sum(prior), posterior = prior) # best guess before seeing data for(i in seq_along(true_data)) { grid <- grid %>% mutate(likelihood = dnorm(true_data[i], possible_mean, 2), posterior = likelihood * posterior, posterior = posterior / sum(posterior)) } ``` --- ```r grid %>% pivot_longer(-possible_mean) %>% ggplot(aes(possible_mean, value)) + geom_line(aes(color = name)) ``` ![](w7p1_files/figure-html/unnamed-chunk-53-1.png)<!-- --> ```r grid %>% filter(posterior == max(posterior)) ``` ``` ## # A tibble: 1 x 4 ## possible_mean prior likelihood posterior ## <dbl> <dbl> <dbl> <dbl> ## 1 2.3 0.004431848 0.08476010 0.3915258 ``` --- # More data Same thing, but this time with tons of data --- ```r true_data <- rnorm(5000, 5, 1) grid <- grid %>% mutate(prior = dnorm(grid$possible_mean, mean = 2, sd = 0.1), prior = prior / sum(prior), posterior = prior) # best guess before seeing data for(i in seq_along(true_data)) { grid <- grid %>% mutate(likelihood = dnorm(true_data[i], possible_mean, 2), posterior = likelihood * posterior, posterior = posterior / sum(posterior)) } ``` --- ```r grid %>% pivot_longer(-possible_mean) %>% ggplot(aes(possible_mean, value)) + geom_line(aes(color = name)) ``` ![](w7p1_files/figure-html/unnamed-chunk-56-1.png)<!-- --> ```r grid %>% filter(posterior == max(posterior)) ``` ``` ## # A tibble: 1 x 4 ## possible_mean prior likelihood posterior ## <dbl> <dbl> <dbl> <dbl> ## 1 4.8 2.277577e-171 0.1994389 0.9678310 ``` --- # Taking a step back * The purpose of the prior is to include *what you already know* into your analysis * The strength of your prior should depend on your prior research * Larger samples will overwhelm priors quicker, particularly if they are diffuse * Think through the lens of updating your prior beliefs * This whole framework is quite different, but also gives us a lot of advantages in terms of probability interpretation, as we'll see --- class: inverse-blue middle # Bayes for regression --- # More complicated * Remember our posterior is defined by $$ \text{posterior} = \frac{ \text{likelihood} \times 
\text{prior}}{\text{average likelihood}} $$ -- When we just had a single parameter to estimate, `\(\mu\)`, this was tractable with grid search. -- With even simple linear regression, however, we have three parameters: `\(\alpha\)`, `\(\beta\)`, and `\(\sigma\)` -- Our Bayesian model then becomes considerably more complicated: $$ P(\alpha, \beta, \sigma \mid x) = \frac{ P(x \mid \alpha, \beta, \sigma) \, P(\alpha, \beta, \sigma)}{\iiint \, P(x \mid \alpha, \beta, \sigma) \, P(\alpha, \beta, \sigma) \,d\alpha \,d\beta \,d\sigma} $$ --- # Estimation Rather than trying to compute the integrals, we *simulate* observations from the joint posterior distribution. -- This sounds a bit like magic - how do we do this? -- Multiple different algorithms, but all use some form of Markov chain Monte Carlo (MCMC) sampling --- # Conceptually * Imagine the posterior as a hill * We start with the parameters set to random numbers + Estimate the posterior with these values (our spot on the hill) * Use information from the previous sample to determine whether and how to change the current parameter values * Try to "walk" around in a way that builds a complete "picture" of the hill from the samples * Use these samples as your posterior distribution --- # Metropolis-Hastings The one we will use is called the Metropolis-Hastings algorithm: * Compute a candidate "step" for the parameters * Calculate an acceptance probability `\(\alpha = \min(1, f(x')/f(x_t))\)`. This will fall between 0 and 1. * Generate a number, `\(u\)`, from a random uniform distribution between 0 and 1 + if `\(u \leq \alpha\)`, accept the candidate + if `\(u > \alpha\)`, reject the candidate -- Complicated, but conceptually we're trying to sample the joint posterior distribution in a way that conforms with the probability density of the posterior (i.e., the shape) --- # Script * The `mh-alg.R` script works through the Metropolis-Hastings algorithm to get samples from the joint posterior of a simple linear regression model. * The data are simulated, so we know the true values * It's complicated, and not required at all, but it's there for you if you want more info * A much smaller sketch of the same idea, applied to the single-mean example from earlier, is on the bonus slide at the very end of this deck --- class: inverse-green middle # Next time Continue discussing Bayes - more emphasis on model results & plotting Fit and interpret multilevel logistic regression models
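---

# Bonus: a minimal Metropolis-Hastings sketch

This is *not* the `mh-alg.R` script - just a compact sketch of the same idea, applied to the single-mean example from earlier (same simulated data, `\(N(2, 3)\)` prior, and known `\(\sigma = 2\)` for the likelihood). The proposal standard deviation, number of iterations, and warm-up length are arbitrary choices for illustration.

```r
set.seed(123)
true_data <- rnorm(50, 5, 1) # same simulated data as before

# Log posterior, up to a constant: log likelihood + log prior
log_post <- function(mu) {
  sum(dnorm(true_data, mean = mu, sd = 2, log = TRUE)) +
    dnorm(mu, mean = 2, sd = 3, log = TRUE)
}

n_iter <- 5000
mu_samples <- numeric(n_iter)
current <- 0 # arbitrary starting value

for (i in seq_len(n_iter)) {
  candidate <- rnorm(1, current, 0.5) # propose a "step" near the current value
  # Accept with probability min(1, posterior(candidate) / posterior(current)),
  # evaluated on the log scale
  if (log(runif(1)) < log_post(candidate) - log_post(current)) {
    current <- candidate
  }
  mu_samples[i] <- current
}

# Drop a short warm-up, then summarize; this should land near the
# grid-search posterior mean from earlier (around 5)
mean(mu_samples[-(1:500)])
```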