More BayesAnd multilevel binomial logistic regressionDaniel AndersonWeek 81 / 128

Agenda

More equation practice
Finishing up our intro to Bayes
- Specific focus on MCMC
Implementation with {brms}
Logistic regression review
Extending to multilevel logistic regression models with lme4

2 / 128

Equation practice

3 / 128

The data

Read in the following data

library(tidyverse)
library(lme4)
d <- read_csv(here::here("data", "three-lev.csv"))
d

## # A tibble: 7,230 x 12
##    schid       sid  size lowinc mobility female black hispanic retained grade  year   math
##    <dbl>     <dbl> <dbl>  <dbl>    <dbl>  <dbl> <dbl>    <dbl>    <dbl> <dbl> <dbl>  <dbl>
##  1  2020 273026452   380   40.3     12.5      0     0        1        0     2   0.5  1.146
##  2  2020 273026452   380   40.3     12.5      0     0        1        0     3   1.5  1.134
##  3  2020 273026452   380   40.3     12.5      0     0        1        0     4   2.5  2.3  
##  4  2020 273030991   380   40.3     12.5      0     0        0        0     0  -1.5 -1.303
##  5  2020 273030991   380   40.3     12.5      0     0        0        0     1  -0.5  0.439
##  6  2020 273030991   380   40.3     12.5      0     0        0        0     2   0.5  2.43 
##  7  2020 273030991   380   40.3     12.5      0     0        0        0     3   1.5  2.254
##  8  2020 273030991   380   40.3     12.5      0     0        0        0     4   2.5  3.873
##  9  2020 273059461   380   40.3     12.5      0     0        1        0     0  -1.5 -1.384
## 10  2020 273059461   380   40.3     12.5      0     0        1        0     1  -0.5  0.338
## # … with 7,220 more rows

4 / 128

Model 1

Fit the following model

01:30

$\begin{aligned} {math}_{i} & \sim N (α_{j [i]}, σ^{2}) \\ α_{j} & \sim N (γ_{0}^{α} + γ_{1}^{α} (mobility), σ_{α_{j}}^{2}), for schid j = 1, \dots,J \end{aligned}$

5 / 128

Model 1

Fit the following model

01:30

$\begin{aligned} {math}_{i} & \sim N (α_{j [i]}, σ^{2}) \\ α_{j} & \sim N (γ_{0}^{α} + γ_{1}^{α} (mobility), σ_{α_{j}}^{2}), for schid j = 1, \dots,J \end{aligned}$

lmer(math ~ mobility + (1|schid), data = d)

5 / 128

Model 2

Fit the following model

01:30

$\begin{aligned} {math}_{i} & \sim N (α_{j [i], k [i]} + β_{1 j [i]} (year), σ^{2}) \\ (\begin{matrix} \begin{aligned} α_{j} \\ β_{1 j} \end{aligned} \end{matrix}) & \sim N ((\begin{matrix} \begin{aligned} γ_{0}^{α} + γ_{1}^{α} (female) \\ μ_{β_{1 j}} \end{aligned} \end{matrix}), (\begin{array}{cc} σ_{α_{j}}^{2} & ρ_{α_{j} β_{1 j}} \\ ρ_{β_{1 j} α_{j}} & σ_{β_{1 j}}^{2} \end{array})), for sid j = 1, \dots,J \\ α_{k} & \sim N (γ_{0}^{α} + γ_{1}^{α} (mobility), σ_{α_{k}}^{2}), for schid k = 1, \dots,K \end{aligned}$

6 / 128

Model 2

Fit the following model

01:30

lmer(math ~ year + female + mobility + 
             (year|sid) +  (1|schid), 
     data = d)

6 / 128

Model 3

Fit the following model

01:30

$\begin{aligned} {math}_{i} & \sim N (α_{j [i], k [i]} + β_{1 j [i]} (year), σ^{2}) \\ (\begin{matrix} \begin{aligned} α_{j} \\ β_{1 j} \end{aligned} \end{matrix}) & \sim N ((\begin{matrix} \begin{aligned} γ_{0}^{α} + γ_{1}^{α} (black) + γ_{2}^{α} (hispanic) + γ_{3}^{α} (female) \\ γ_{0}^{β_{1}} + γ_{1}^{β_{1}} (black) + γ_{2}^{β_{1}} (hispanic) \end{aligned} \end{matrix}), (\begin{array}{cc} σ_{α_{j}}^{2} & ρ_{α_{j} β_{1 j}} \\ ρ_{β_{1 j} α_{j}} & σ_{β_{1 j}}^{2} \end{array})), for sid j = 1, \dots,J \\ α_{k} & \sim N (γ_{0}^{α} + γ_{1}^{α} (mobility), σ_{α_{k}}^{2}), for schid k = 1, \dots,K \end{aligned}$

7 / 128

Model 3

Fit the following model

01:30

$\begin{aligned} {math}_{i} & \sim N (α_{j [i], k [i]} + β_{1 j [i]} (year), σ^{2}) \\ (\begin{matrix} \begin{aligned} α_{j} \\ β_{1 j} \end{aligned} \end{matrix}) & \sim N ((\begin{matrix} \begin{aligned} γ_{0}^{α} + γ_{1}^{α} (black) + γ_{2}^{α} (hispanic) + γ_{3}^{α} (female) \\ γ_{0}^{β_{1}} + γ_{1}^{β_{1}} (black) + γ_{2}^{β_{1}} (hispanic) \end{aligned} \end{matrix}), (\begin{array}{cc} σ_{α_{j}}^{2} & ρ_{α_{j} β_{1 j}} \\ ρ_{β_{1 j} α_{j}} & σ_{β_{1 j}}^{2} \end{array})), for sid j = 1, \dots,J \\ α_{k} & \sim N (γ_{0}^{α} + γ_{1}^{α} (mobility), σ_{α_{k}}^{2}), for schid k = 1, \dots,K \end{aligned}$

lmer(math ~ year * black + year * hispanic + female + mobility + 
             (year|sid) +  (1|schid), 
     data = d)

7 / 128

Model 4

Fit the following model

01:30

$\begin{aligned} {math}_{i} & \sim N (α_{j [i], k [i]} + β_{1 j [i], k [i]} (year), σ^{2}) \\ (\begin{matrix} \begin{aligned} α_{j} \\ β_{1 j} \end{aligned} \end{matrix}) & \sim N ((\begin{matrix} \begin{aligned} γ_{0}^{α} + γ_{1 k [i]}^{α} (female) \\ γ_{0}^{β_{1}} + γ_{1}^{β_{1}} (female) \end{aligned} \end{matrix}), (\begin{array}{cc} σ_{α_{j}}^{2} & ρ_{α_{j} β_{1 j}} \\ ρ_{β_{1 j} α_{j}} & σ_{β_{1 j}}^{2} \end{array})), for sid j = 1, \dots,J \\ (\begin{matrix} \begin{aligned} α_{k} \\ β_{1 k} \\ γ_{1 k} \end{aligned} \end{matrix}) & \sim N ((\begin{matrix} \begin{aligned} γ_{0}^{α} + γ_{1}^{α} (mobility) + γ_{2}^{α} (lowinc) \\ γ_{0}^{β_{1}} + γ_{1}^{β_{1}} (mobility) \\ μ_{γ_{1 k}} \end{aligned} \end{matrix}), (\begin{array}{ccc} σ_{α_{k}}^{2} & ρ_{α_{k} β_{1 k}} & ρ_{α_{k} γ_{1 k}} \\ ρ_{β_{1 k} α_{k}} & σ_{β_{1 k}}^{2} & ρ_{β_{1 k} γ_{1 k}} \\ ρ_{γ_{1 k} α_{k}} & ρ_{γ_{1 k} β_{1 k}} & σ_{γ_{1 k}}^{2} \end{array})), for schid k = 1, \dots,K \end{aligned}$

8 / 128

Model 4

Fit the following model

01:30

lmer(math ~ year * female + year * mobility + lowinc +
             (year|sid) +  (year + female|schid), 
     data = d)

8 / 128

Model 5

Fit the following model

Don't worry if you run into convergene warnings

01:30

$\begin{aligned} {math}_{i} & \sim N (α_{j [i], k [i]} + β_{1 j [i], k [i]} (year), σ^{2}) \\ (\begin{matrix} \begin{aligned} α_{j} \\ β_{1 j} \end{aligned} \end{matrix}) & \sim N ((\begin{matrix} \begin{aligned} γ_{0}^{α} + γ_{1 k [i]}^{α} (female) \\ γ_{1 k [i 0}^{β_{1}} + γ_{1 k [i]}^{β_{1}} (female) \end{aligned} \end{matrix}), (\begin{array}{cc} σ_{α_{j}}^{2} & ρ_{α_{j} β_{1 j}} \\ ρ_{β_{1 j} α_{j}} & σ_{β_{1 j}}^{2} \end{array})), for sid j = 1, \dots,J \\ (\begin{matrix} \begin{aligned} α_{k} \\ β_{1 k} \\ γ_{1 k} \\ γ_{1 k}^{β_{1}} \end{aligned} \end{matrix}) & \sim N ((\begin{matrix} \begin{aligned} γ_{0}^{α} + γ_{1}^{α} (mobility) + γ_{2}^{α} (lowinc) \\ μ_{β_{1 k}} \\ μ_{γ_{1 k}} \\ μ_{γ_{1 k}^{β_{1}}} \end{aligned} \end{matrix}), (\begin{array}{cccc} σ_{α_{k}}^{2} & 0 & 0 & 0 \\ 0 & σ_{β_{1 k}}^{2} & 0 & 0 \\ 0 & 0 & σ_{γ_{1 k}}^{2} & 0 \\ 0 & 0 & 0 & σ_{γ_{1 k}^{β_{1}}}^{2} \end{array})), for schid k = 1, \dots,K \end{aligned}$

9 / 128

Model 5

Fit the following model

Don't worry if you run into convergene warnings

01:30

$\begin{aligned} {math}_{i} & \sim N (α_{j [i], k [i]} + β_{1 j [i], k [i]} (year), σ^{2}) \\ (\begin{matrix} \begin{aligned} α_{j} \\ β_{1 j} \end{aligned} \end{matrix}) & \sim N ((\begin{matrix} \begin{aligned} γ_{0}^{α} + γ_{1 k [i]}^{α} (female) \\ γ_{1 k [i 0}^{β_{1}} + γ_{1 k [i]}^{β_{1}} (female) \end{aligned} \end{matrix}), (\begin{array}{cc} σ_{α_{j}}^{2} & ρ_{α_{j} β_{1 j}} \\ ρ_{β_{1 j} α_{j}} & σ_{β_{1 j}}^{2} \end{array})), for sid j = 1, \dots,J \\ (\begin{matrix} \begin{aligned} α_{k} \\ β_{1 k} \\ γ_{1 k} \\ γ_{1 k}^{β_{1}} \end{aligned} \end{matrix}) & \sim N ((\begin{matrix} \begin{aligned} γ_{0}^{α} + γ_{1}^{α} (mobility) + γ_{2}^{α} (lowinc) \\ μ_{β_{1 k}} \\ μ_{γ_{1 k}} \\ μ_{γ_{1 k}^{β_{1}}} \end{aligned} \end{matrix}), (\begin{array}{cccc} σ_{α_{k}}^{2} & 0 & 0 & 0 \\ 0 & σ_{β_{1 k}}^{2} & 0 & 0 \\ 0 & 0 & σ_{γ_{1 k}}^{2} & 0 \\ 0 & 0 & 0 & σ_{γ_{1 k}^{β_{1}}}^{2} \end{array})), for schid k = 1, \dots,K \end{aligned}$

lmer(math ~ year * female + mobility + lowinc +
       (year|sid) +  (year * female||schid), 
     data = d)

9 / 128

Model 6

Fit the following model

01:30

$\begin{aligned} {math}_{i} & \sim N (α_{j [i]} + β_{1 j [i]} (year), σ^{2}) \\ (\begin{matrix} \begin{aligned} α_{j} \\ β_{1 j} \end{aligned} \end{matrix}) & \sim N ((\begin{matrix} \begin{aligned} γ_{0}^{α} + γ_{1}^{α} (female) \\ μ_{β_{1 j}} \end{aligned} \end{matrix}), (\begin{array}{cc} σ_{α_{j}}^{2} & ρ_{α_{j} β_{1 j}} \\ ρ_{β_{1 j} α_{j}} & σ_{β_{1 j}}^{2} \end{array})), for sid j = 1, \dots,J \\ γ_{1 k} & \sim N (μ_{γ_{1 k}}, σ_{γ_{1 k}}^{2}), for schid k = 1, \dots,K \end{aligned}$

10 / 128

Model 6

Fit the following model

01:30

lmer(math ~ year + female + mobility * size +
             (year|sid) +  (0 + mobility|schid), 
     data = d)

10 / 128

Model 7

Fit the following model

01:30

$\begin{aligned} {math}_{i} & \sim N (α_{j [i]} + β_{1 j [i]} (year), σ^{2}) \\ (\begin{matrix} \begin{aligned} α_{j} \\ β_{1 j} \end{aligned} \end{matrix}) & \sim N ((\begin{matrix} \begin{aligned} γ_{0}^{α} + γ_{1}^{α} (female) + γ_{2}^{α} (mobility) + γ_{3}^{α} (size) \\ μ_{β_{1 j}} \end{aligned} \end{matrix}), (\begin{array}{cc} σ_{α_{j}}^{2} & ρ_{α_{j} β_{1 j}} \\ ρ_{β_{1 j} α_{j}} & σ_{β_{1 j}}^{2} \end{array})), for sid j = 1, \dots,J \end{aligned}$

11 / 128

Model 7

Fit the following model

01:30

$\begin{aligned} {math}_{i} & \sim N (α_{j [i]} + β_{1 j [i]} (year), σ^{2}) \\ (\begin{matrix} \begin{aligned} α_{j} \\ β_{1 j} \end{aligned} \end{matrix}) & \sim N ((\begin{matrix} \begin{aligned} γ_{0}^{α} + γ_{1}^{α} (female) + γ_{2}^{α} (mobility) + γ_{3}^{α} (size) \\ μ_{β_{1 j}} \end{aligned} \end{matrix}), (\begin{array}{cc} σ_{α_{j}}^{2} & ρ_{α_{j} β_{1 j}} \\ ρ_{β_{1 j} α_{j}} & σ_{β_{1 j}}^{2} \end{array})), for sid j = 1, \dots,J \end{aligned}$

lmer(math ~ year + female + mobility + size +
       (year|sid), 
     data = d)

11 / 128

Model 8

Last one

01:30

$\begin{aligned} {math}_{i} & \sim N (α_{j [i], k [i]} + β_{1 j [i], k [i]} (year), σ^{2}) \\ (\begin{matrix} \begin{aligned} α_{j} \\ β_{1 j} \end{aligned} \end{matrix}) & \sim N ((\begin{matrix} \begin{aligned} γ_{0}^{α} + γ_{1}^{α} (female) \\ μ_{β_{1 j}} \end{aligned} \end{matrix}), (\begin{array}{cc} σ_{α_{j}}^{2} & ρ_{α_{j} β_{1 j}} \\ ρ_{β_{1 j} α_{j}} & σ_{β_{1 j}}^{2} \end{array})), for sid j = 1, \dots,J \\ (\begin{matrix} \begin{aligned} α_{k} \\ β_{1 k} \end{aligned} \end{matrix}) & \sim N ((\begin{matrix} \begin{aligned} γ_{0}^{α} + γ_{1}^{α} (lowinc) \\ γ_{0}^{β_{1}} + γ_{1}^{β_{1}} (lowinc) \end{aligned} \end{matrix}), (\begin{array}{cc} σ_{α_{k}}^{2} & ρ_{α_{k} β_{1 k}} \\ ρ_{β_{1 k} α_{k}} & σ_{β_{1 k}}^{2} \end{array})), for schid k = 1, \dots,K \end{aligned}$

12 / 128

Model 8

Last one

01:30

$\begin{aligned} {math}_{i} & \sim N (α_{j [i], k [i]} + β_{1 j [i], k [i]} (year), σ^{2}) \\ (\begin{matrix} \begin{aligned} α_{j} \\ β_{1 j} \end{aligned} \end{matrix}) & \sim N ((\begin{matrix} \begin{aligned} γ_{0}^{α} + γ_{1}^{α} (female) \\ μ_{β_{1 j}} \end{aligned} \end{matrix}), (\begin{array}{cc} σ_{α_{j}}^{2} & ρ_{α_{j} β_{1 j}} \\ ρ_{β_{1 j} α_{j}} & σ_{β_{1 j}}^{2} \end{array})), for sid j = 1, \dots,J \\ (\begin{matrix} \begin{aligned} α_{k} \\ β_{1 k} \end{aligned} \end{matrix}) & \sim N ((\begin{matrix} \begin{aligned} γ_{0}^{α} + γ_{1}^{α} (lowinc) \\ γ_{0}^{β_{1}} + γ_{1}^{β_{1}} (lowinc) \end{aligned} \end{matrix}), (\begin{array}{cc} σ_{α_{k}}^{2} & ρ_{α_{k} β_{1 k}} \\ ρ_{β_{1 k} α_{k}} & σ_{β_{1 k}}^{2} \end{array})), for schid k = 1, \dots,K \end{aligned}$

lmer(math ~ year * lowinc + female + 
       (year|sid) + (year |schid), 
     data = d)

12 / 128

Finishing up w/Bayes intro

13 / 128

Updated beliefs

Remember our posterior is defined by

$posterior = \frac{likelihood \times prior}{average likelihood}$

14 / 128

Updated beliefs

Remember our posterior is defined by

$posterior = \frac{likelihood \times prior}{average likelihood}$ When we just had a single parameter to estimate, $μ$ , this was tractable with grid search.

14 / 128

Updated beliefs

Remember our posterior is defined by

$posterior = \frac{likelihood \times prior}{average likelihood}$ When we just had a single parameter to estimate, $μ$ , this was tractable with grid search.

With even simple linear regression, however, we have three parameters: $α$ , $β$ , and $σ$

14 / 128

Updated beliefs

Remember our posterior is defined by

$posterior = \frac{likelihood \times prior}{average likelihood}$ When we just had a single parameter to estimate, $μ$ , this was tractable with grid search.

With even simple linear regression, however, we have three parameters: $α$ , $β$ , and $σ$

Our Bayesian model then becomes considerably more complicated:

$P (α, β, σ ∣ x) = \frac{P (x ∣ α, β, σ) P (α, β, σ)}{∭ P (x ∣ α, β, σ) P (α, β, σ) d α d β d σ}$

14 / 128

Estimation

Rather than trying to compute the integrals, we draw samples from the joint posterior distribution.

15 / 128

Estimation

Rather than trying to compute the integrals, we draw samples from the joint posterior distribution.

This sounds a bit like magic - how do we do this?

15 / 128

Estimation

Rather than trying to compute the integrals, we draw samples from the joint posterior distribution.

This sounds a bit like magic - how do we do this?

Multiple different algorithms, but all use some form of Markov-Chain Monte-Carlo sampling

15 / 128

Conceptually

Imagine the posterior as a hill
We start with the parameters set to random numbers
- Estimate the posterior with this values (our spot on the hill)
Try to "walk" around in a way that we get a complete "picture" of the hill from the samples
Use these samples as your posterior distribution

16 / 128

An example

Example from Ravenzwaaij et al.

Let's go back to an example where we're trying to estimate the mean with a know standard deviation.

17 / 128

An example

Example from Ravenzwaaij et al.

Let's go back to an example where we're trying to estimate the mean with a know standard deviation. $N (μ, 15)$

17 / 128

An example

Example from Ravenzwaaij et al.

Let's go back to an example where we're trying to estimate the mean with a know standard deviation. $N (μ, 15)$ Start with an initial guess, say 110

17 / 128

MCMC

One algorithm:

Generate a proposal by taking a sample from your best guess according to some distribution, e.g., $proposal \sim N (110, 10)$ . Suppose we get a value of 108

18 / 128

MCMC

One algorithm:

Generate a proposal by taking a sample from your best guess according to some distribution, e.g., $proposal \sim N (110, 10)$ . Suppose we get a value of 108
Observe the actual data. Let's say we have only one point: 100

18 / 128

MCMC

One algorithm:

Generate a proposal by taking a sample from your best guess according to some distribution, e.g., $proposal \sim N (110, 10)$ . Suppose we get a value of 108
Observe the actual data. Let's say we have only one point: 100

Compare the "height" of the proposal distribution to height of the current distribution, relative to observed data: $N (100 | 110, 15)$ , $N (100 | 108, 15)$

18 / 128

In code

dnorm(110, 100, 15)

## [1] 0.02129653

dnorm(108, 100, 15)

## [1] 0.02307026

19 / 128

If probability of target distribution is higher, accept
If probability is lower, randomly select between the two with probability equal to the "heights" (probability of the two distributions)

20 / 128

If probability of target distribution is higher, accept
If probability is lower, randomly select between the two with probability equal to the "heights" (probability of the two distributions)
If proposal is accepted - it's the next sample in the chain
Otherwise, the next sample is a copy of current sample

20 / 128

If probability of target distribution is higher, accept
If probability is lower, randomly select between the two with probability equal to the "heights" (probability of the two distributions)
If proposal is accepted - it's the next sample in the chain
Otherwise, the next sample is a copy of current sample

This completes one iteration

Next iteration starts by generating a new proposal distribution

20 / 128

If probability of target distribution is higher, accept
If probability is lower, randomly select between the two with probability equal to the "heights" (probability of the two distributions)
If proposal is accepted - it's the next sample in the chain
Otherwise, the next sample is a copy of current sample

This completes one iteration

Next iteration starts by generating a new proposal distribution

Stop when you have enough samples to have an adequate "picture"

20 / 128

Example in code

Let's take 5000 samples

set.seed(42) # for reproducibility
samples <- c(
  110, # initial guess
  rep(NA, 4999) # space for subsequent samples to fill in
)

21 / 128

Loop

for(i in 2:5000) {
  # generate proposal distribution
  proposal <- rnorm(1, mean = samples[i - 1], sd = 10)
  # calculate current/proposal distribution likelihoood
  prob_current <- dnorm(samples[i - 1], 100, 15)
  prob_proposal <- dnorm(proposal, 100, 15)
  # compute the probability ratio
  prob_ratio <- prob_proposal / prob_current
  # Determine which to select
  if(prob_ratio > runif(1)) {
    samples[i] <- proposal # accept
  } else {
    samples[i] <- samples[i - 1] # reject
  }
}

22 / 128

Plot

Iteration history

tibble(iteration = 1:5000,
       value = samples) %>% 
  ggplot(aes(iteration, value)) +
  geom_line() +
  geom_hline(yintercept = 100, 
             color = "magenta",
             size = 3)

23 / 128

Plot

Density

tibble(iteration = 1:5000,
       value = samples) %>% 
  ggplot(aes(value)) +
  geom_density(fill = "#00b4f5",
               alpha = 0.7) +
  geom_vline(xintercept = 100, 
             color = "magenta",
             size = 3)

24 / 128

Bayes Update Script

Last week I promised a script to let you play around with priors. I'll show that now ("bayes-update-plotting.R")

There's also a MH script I'll quickly show.

[demo]

25 / 128

Implementation with {brms}

Luckily, we don't have to program the MCMC algorithm ourselves

26 / 128

What is it?

bayesian regression modeling with stan
Uses stan as the model backend - basically writes the model code for you then sends it to stan
Allows model syntax similar to lme4
Simple specification of priors - defaults are flat
Provides many methods for post-model fitting inference

27 / 128

Fit a basic model

Let's start with the default (uninformative) priors, and fit a standard, simple-linear regression model

library(brms)
sleep_m0 <- brm(Reaction ~ Days, data = lme4::sleepstudy)

## 
## SAMPLING FOR MODEL 'a3c55247ada09ee979052a314fc29ad7' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 2.2e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.22 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.050609 seconds (Warm-up)
## Chain 1:                0.017419 seconds (Sampling)
## Chain 1:                0.068028 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'a3c55247ada09ee979052a314fc29ad7' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 6e-06 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.06 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.048009 seconds (Warm-up)
## Chain 2:                0.021626 seconds (Sampling)
## Chain 2:                0.069635 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'a3c55247ada09ee979052a314fc29ad7' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 1.2e-05 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.12 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.041524 seconds (Warm-up)
## Chain 3:                0.020323 seconds (Sampling)
## Chain 3:                0.061847 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'a3c55247ada09ee979052a314fc29ad7' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 7e-06 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.07 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 0.048884 seconds (Warm-up)
## Chain 4:                0.020719 seconds (Sampling)
## Chain 4:                0.069603 seconds (Total)
## Chain 4:

28 / 128

Model summary

summary(sleep_m0)

##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: Reaction ~ Days 
##    Data: lme4::sleepstudy (Number of observations: 180) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   251.49      6.64   238.26   264.74 1.00     4168     3040
## Days         10.45      1.25     7.98    12.90 1.00     4155     2967
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    47.96      2.51    43.34    53.22 1.00     4069     2959
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

29 / 128

View "fixed" effect

Let's look at our estimated relation between Days and Reaction

30 / 128

View "fixed" effect

Let's look at our estimated relation between Days and Reaction

conditional_effects(sleep_m0)

30 / 128

Wrong model

Of course, this is the wrong model, we have a multilevel structure

31 / 128

Multilevel model

Notice the syntax is essentially equivalent to lme4

sleep_m1 <- brm(Reaction ~ Days + (Days | Subject), data = lme4::sleepstudy)
summary(sleep_m1)

##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: Reaction ~ Days + (Days | Subject) 
##    Data: lme4::sleepstudy (Number of observations: 180) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Group-Level Effects: 
## ~Subject (Number of levels: 18) 
##                     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)          27.06      7.15    15.59    42.69 1.00     1529     2353
## sd(Days)                6.55      1.49     4.19    10.04 1.00     1144     1440
## cor(Intercept,Days)     0.08      0.29    -0.47     0.64 1.01     1125     1844
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   251.42      7.17   237.21   265.47 1.00     2093     2543
## Days         10.45      1.71     7.23    13.95 1.00     1660     2090
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    25.91      1.54    23.13    29.17 1.00     3581     2712
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

32 / 128

Fixed effect

The uncertainty has increased

conditional_effects(sleep_m1)

33 / 128

Checking your model

pp_check(sleep_m1)

34 / 128

More checks

plot(sleep_m1)

35 / 128

Even more

launch_shinystan(sleep_m1)

36 / 128

Model comparison

See here for probably the best paper on this topic to date (imo)

Two primary (modern) methods
- Leave-one-out Cross-Validation (LOO)
- Widely Applicable Information Criterion (WAIC)

37 / 128

Model comparison

See here for probably the best paper on this topic to date (imo)

Two primary (modern) methods
- Leave-one-out Cross-Validation (LOO)
- Widely Applicable Information Criterion (WAIC)

Both provide estimates of the out-of-sample predictive accuracy of the model. LOO is similar to K-fold CV, while WAIC is similar to AIC/BIC (but an improved version for Bayes models)

37 / 128

Model comparison

See here for probably the best paper on this topic to date (imo)

Two primary (modern) methods
- Leave-one-out Cross-Validation (LOO)
- Widely Applicable Information Criterion (WAIC)

Both provide estimates of the out-of-sample predictive accuracy of the model. LOO is similar to K-fold CV, while WAIC is similar to AIC/BIC (but an improved version for Bayes models)

Both can be computed using brms. LOO is approximated (not actually refit each time).

37 / 128

LOO: M0

loo(sleep_m0)

## 
## Computed from 4000 by 180 log-likelihood matrix
## 
##          Estimate   SE
## elpd_loo   -953.3 10.5
## p_loo         3.2  0.5
## looic      1906.5 21.0
## ------
## Monte Carlo SE of elpd_loo is 0.0.
## 
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.

38 / 128

LOO: M1

loo(sleep_m1)

## Warning: Found 3 observations with a pareto_k > 0.7 in model 'sleep_m1'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for
## problematic observations.

## 
## Computed from 4000 by 180 log-likelihood matrix
## 
##          Estimate   SE
## elpd_loo   -861.7 22.7
## p_loo        34.7  8.8
## looic      1723.3 45.5
## ------
## Monte Carlo SE of elpd_loo is NA.
## 
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     175   97.2%   682       
##  (0.5, 0.7]   (ok)         2    1.1%   2189      
##    (0.7, 1]   (bad)        1    0.6%   14        
##    (1, Inf)   (very bad)   2    1.1%   9         
## See help('pareto-k-diagnostic') for details.

39 / 128

Plot

plot(loo(sleep_m1), label_points = TRUE)

## Warning: Found 3 observations with a pareto_k > 0.7 in model 'sleep_m1'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for
## problematic observations.

40 / 128

Interpretation

Large values here (anything above 0.7) is an indication of model misspecification

Smaller values do not guarantee a well-specified model, however

41 / 128

LOO Compare

This is a fairly complicated topic, and we won't spend a lot of time on it. See here for a bit more information specifically on the functions

The best fitting model will be on top. Use the standard error within your interpretation

loo_compare(loo(sleep_m0), loo(sleep_m1))

## Warning: Found 3 observations with a pareto_k > 0.7 in model 'sleep_m1'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for
## problematic observations.

##          elpd_diff se_diff
## sleep_m1   0.0       0.0  
## sleep_m0 -91.6      21.4

42 / 128

WAIC

Similar to other information criteria.

waic(sleep_m0)

## 
## Computed from 4000 by 180 log-likelihood matrix
## 
##           Estimate   SE
## elpd_waic   -953.3 10.5
## p_waic         3.2  0.5
## waic        1906.5 21.0

waic(sleep_m1)

## Warning: 
## 14 (7.8%) p_waic estimates greater than 0.4. We recommend trying loo instead.

## 
## Computed from 4000 by 180 log-likelihood matrix
## 
##           Estimate   SE
## elpd_waic   -860.2 22.3
## p_waic        33.2  8.4
## waic        1720.5 44.7
## 
## 14 (7.8%) p_waic estimates greater than 0.4. We recommend trying loo instead.

43 / 128

Compare

You can use waic within loo_compare()

loo_compare(waic(sleep_m0), waic(sleep_m1))

## Warning: 
## 14 (7.8%) p_waic estimates greater than 0.4. We recommend trying loo instead.

##          elpd_diff se_diff
## sleep_m1   0.0       0.0  
## sleep_m0 -93.0      21.0

44 / 128

Another model

kidney_m0 <- brm(time ~ age + sex, data = kidney)
pp_check(kidney_m0, type = "ecdf_overlay")

45 / 128

Fixing this

We need to change the assumptions of our model - specifically that the outcome is not normally distributed

46 / 128

Fixing this

We need to change the assumptions of our model - specifically that the outcome is not normally distributed

Plot the raw data

ggplot(kidney, aes(time)) +
  geom_histogram(alpha = 0.7)

46 / 128

Maybe Poisson?

(I realize we haven't talked about these types of models yet)

kidney_m1 <- brm(time ~ age + sex, 
                 data = kidney, 
                 family = poisson())

47 / 128

Nope

pp_check(kidney_m1, type = "ecdf_overlay")

48 / 128

Gamma w/log link

kidney_m2 <- brm(time ~ age + sex, 
                 data = kidney, 
                 family = Gamma("log"))
pp_check(kidney_m2, type = "ecdf_overlay")

49 / 128

Specifying priors

Let's sample from only our priors to see what kind of predictions we get.

Here, we're specifying that our beta coefficient prior is $β \sim N (0, 0.5)$

kidney_m3 <- brm(
  time ~ age + sex, 
  data = kidney,
  family = Gamma("log"),
  prior = prior(normal(0, 0.5), class = "b"),
  sample_prior = "only"
)

50 / 128

kidney_m3

## Warning: There were 2253 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See http://mc-stan.org/misc/warnings.html#divergent-
## transitions-after-warmup

##  Family: gamma 
##   Links: mu = log; shape = identity 
## Formula: time ~ age + sex 
##    Data: kidney (Number of observations: 76) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     4.23     22.42   -39.94    46.55 1.01      629      991
## age          -0.02      0.51    -0.97     0.97 1.01      629      946
## sexfemale     0.01      0.50    -0.98     0.97 1.01      667     1196
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## shape     0.42      3.64     0.00     2.35 1.01      436      711
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

51 / 128

Prior predictions

Random sample of 100 points

52 / 128

Why?

It seemed like our prior was fairly tight

53 / 128

Why?

It seemed like our prior was fairly tight

The exploding prior happens because of the log transformation

Age is coded in years
Imagine a coef of 1 (2 standard deviations above our prior)
Prediction for a 25 year old would be $\exp (25)$ = 7.2004899\times 10^{10}

53 / 128

A note on prior specifications

It's hard
I don't have a ton of good advice
Be particularly careful when you're using distributions that have anything other than an identity link (e.g., log link, as we are here)

54 / 128

One more model

Let's fit a model we've fit previously

55 / 128

One more model

Let's fit a model we've fit previously

In Week 4, we fit this model

library(lme4)
popular <- read_csv(here::here("data", "popularity.csv"))
m_lmer <- lmer(popular ~ extrav + (extrav|class), popular,
               control = lmerControl(optimizer = "bobyqa"))

55 / 128

One more model

Let's fit a model we've fit previously

In Week 4, we fit this model

library(lme4)
popular <- read_csv(here::here("data", "popularity.csv"))
m_lmer <- lmer(popular ~ extrav + (extrav|class), popular,
               control = lmerControl(optimizer = "bobyqa"))

Try fitting the same model with {brms} with the default, diffuse priors

02:00

55 / 128

Bayesian verision

m_brms <- brm(popular ~ extrav + (extrav|class), popular)

##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: popular ~ extrav + (extrav | class) 
##    Data: popular (Number of observations: 2000) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Group-Level Effects: 
## ~class (Number of levels: 100) 
##                       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)             1.72      0.17     1.41     2.08 1.00     1165     2052
## sd(extrav)                0.16      0.03     0.11     0.22 1.01      724     1755
## cor(Intercept,extrav)    -0.95      0.03    -1.00    -0.89 1.01      358      564
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     2.46      0.20     2.06     2.84 1.00      994     1951
## extrav        0.49      0.03     0.44     0.54 1.00     1510     2474
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.95      0.02     0.92     0.98 1.00     4787     2392
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

56 / 128

lme4 model

arm::display(m_lmer)

## lmer(formula = popular ~ extrav + (extrav | class), data = popular, 
##     control = lmerControl(optimizer = "bobyqa"))
##             coef.est coef.se
## (Intercept) 2.46     0.20   
## extrav      0.49     0.03   
## 
## Error terms:
##  Groups   Name        Std.Dev. Corr  
##  class    (Intercept) 1.73           
##           extrav      0.16     -0.97 
##  Residual             0.95           
## ---
## number of obs: 2000, groups: class, 100
## AIC = 5791.4, DIC = 5762
## deviance = 5770.7

57 / 128

Add priors

Multiple ways to do this
Generally, I allow the intercept to just be what it is
For "fixed" effects, you might consider regularizing priors.
You can also set priors for individual parameters
In my experience, you will mostly want to set priors for different "class"es of parameters, e.g.: Intercept, b, sd, sigma, cor

58 / 128

Half-Cauchy distribution

This is the distribution most often used for standard deviation priors

59 / 128

Specify some new priors

Let's specify some regularizing priors for the fixed effects and standard deviations

60 / 128

Specify some new priors

Let's specify some regularizing priors for the fixed effects and standard deviations

priors <- c(
  prior(normal(0, 0.5), class = b),
  prior(cauchy(0, 1), class = sd)
)
m_brms2 <- brm(popular ~ extrav + (extrav|class), 
               data = popular,
               prior = priors)

60 / 128

Almost no difference in this case

m_brms2

##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: popular ~ extrav + (extrav | class) 
##    Data: popular (Number of observations: 2000) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Group-Level Effects: 
## ~class (Number of levels: 100) 
##                       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)             1.70      0.17     1.39     2.06 1.01     1358     1824
## sd(extrav)                0.16      0.03     0.11     0.22 1.00      871     1683
## cor(Intercept,extrav)    -0.95      0.03    -1.00    -0.88 1.01      437      871
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     2.48      0.20     2.08     2.86 1.00     1157     1810
## extrav        0.49      0.03     0.44     0.54 1.00     1960     2642
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.95      0.02     0.92     0.98 1.00     3976     2977
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

61 / 128

Speed

We'll never reach the lme4::lmer() speeds, but we can make it faster.

Parallelize (not always as effective as you might hope)
Use the cmdstan backend
- This requires a little bit of additional work, but is probably worth it if you're fitting bigger models.

62 / 128

Installing cmdstan

First, install the {cmdstanr} package

install.packages(
  "cmdstanr",
  repos = c(
    "https://mc-stan.org/r-packages/", 
    getOption("repos")
  )
)

63 / 128

Installing cmdstan

First, install the {cmdstanr} package

install.packages(
  "cmdstanr",
  repos = c(
    "https://mc-stan.org/r-packages/", 
    getOption("repos")
  )
)

Then, install cmdstan

cmdstanr::install_cmdstan()

63 / 128

Timings

I'm not evaluating the below, but the timings were 112.646, 84.561, and 42.934 seconds, respectively.

library(tictoc)
tic()
m_brms <- brm(popular ~ extrav + (extrav|class), 
              data = popular)
toc()
# cmdstan
tic()
m_brms2 <- brm(popular ~ extrav + (extrav|class), 
               data = popular,
               backend = "cmdstanr")
toc()
# cmdstan w/parallelization
tic()
m_brms3 <- brm(popular ~ extrav + (extrav|class), 
               data = popular,
               backend = "cmdstanr",
               cores = 4)
toc()

64 / 128

Wrapping up our Bayes intro

65 / 128

Advantages to Bayes

Opportunity to incorporate prior knowledge into the modeling process (you don't really have to - could just set wide priors)
Natural interpretation of uncertainty
Can often allow you to estimate models that are difficult if not impossible with frequentist methods

66 / 128

Advantages to Bayes

Opportunity to incorporate prior knowledge into the modeling process (you don't really have to - could just set wide priors)
Natural interpretation of uncertainty
Can often allow you to estimate models that are difficult if not impossible with frequentist methods

Disadvatages

Generally going to be slower in implementation
You may run into pushback from others - particularly with respect to priors

66 / 128

Notes on the posterior

The posterior is the distribution of the parameters, given the data
Think of it as the distribution of what we don't know, but are interested in (model parameters), given what we know or have observed (the data), and our prior beliefs
Gives a complete picture of parameter uncertainty
We can do lots of things with the posterior that is hard to get otherwise

67 / 128

Break

05:00

68 / 128

Review of logistic regression

69 / 128

Data generating distribution

Up to this point, we've been assuming the data were generated from a normal distribution.

70 / 128

Data generating distribution

Up to this point, we've been assuming the data were generated from a normal distribution.

For example, we might fit a simple linear regression model to the wages data like this

$\begin{aligned} {wages}_{i} & \sim N (\hat{y}, σ^{2}) \\ \hat{y} & = α + β_{1} (exper) \end{aligned}$

where we're embedding a linear model into the mean

70 / 128

Data generating distribution

Up to this point, we've been assuming the data were generated from a normal distribution.

For example, we might fit a simple linear regression model to the wages data like this

$\begin{aligned} {wages}_{i} & \sim N (\hat{y}, σ^{2}) \\ \hat{y} & = α + β_{1} (exper) \end{aligned}$

where we're embedding a linear model into the mean In code

wages <- read_csv(here::here("data", "wages.csv")) %>% 
  mutate(hourly_wage = exp(lnw))
wages_lm <- lm(hourly_wage ~ exper, data = wages)

70 / 128

Graphically

ggplot(wages, aes(exper, hourly_wage)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

71 / 128

Graphically

ggplot(wages, aes(exper, hourly_wage)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Aside - you see why the log of wages was modeled instead?

71 / 128

Log wages

ggplot(wages, aes(exper, lnw)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

72 / 128

Move to binary model

Let's split the wages data into a binary classification based on whether it's above or below the mean.

73 / 128

Move to binary model

Let's split the wages data into a binary classification based on whether it's above or below the mean.

wages <- wages %>% 
  mutate(
    high_wage = ifelse(
      hourly_wage > mean(hourly_wage, na.rm = TRUE), 1, 0
    )
  )
wages %>% 
  select(id, hourly_wage, high_wage)

## # A tibble: 6,402 x 3
##       id hourly_wage high_wage
##    <dbl>       <dbl>     <dbl>
##  1    31    4.441535         0
##  2    31    4.191254         0
##  3    31    4.344888         0
##  4    31    5.748851         0
##  5    31    6.896403         0
##  6    31    5.523435         0
##  7    31    8.052640         1
##  8    31    8.406456         1
##  9    36    7.257243         0
## 10    36    6.037560         0
## # … with 6,392 more rows

73 / 128

Plot

means <- wages %>% 
  group_by(high_wage) %>% 
  summarize(mean = mean(exper))
ggplot(wages, aes(exper, high_wage)) +
  geom_point(alpha = 0.01) +
  geom_point(aes(x = mean), data = means, 
             shape = 23, 
             fill = "cornflowerblue")

74 / 128

We could fit a linear model

m_lpm <- lm(high_wage ~ exper, data = wages)
arm::display(m_lpm)

## lm(formula = high_wage ~ exper, data = wages)
##             coef.est coef.se
## (Intercept) 0.14     0.01   
## exper       0.06     0.00   
## ---
## n = 6402, k = 2
## residual sd = 0.45, R-Squared = 0.11

75 / 128

We could fit a linear model

m_lpm <- lm(high_wage ~ exper, data = wages)
arm::display(m_lpm)

## lm(formula = high_wage ~ exper, data = wages)
##             coef.est coef.se
## (Intercept) 0.14     0.01   
## exper       0.06     0.00   
## ---
## n = 6402, k = 2
## residual sd = 0.45, R-Squared = 0.11

This is referred to as a linear probability model (LPM) and they are pretty hotly contested, with proponents and detractors

75 / 128

Plot

ggplot(wages, aes(exper, high_wage)) +
  geom_point(alpha = 0.01) +
  geom_smooth(method = "lm", se = FALSE)

76 / 128

Prediction

What if somebody has an experience of 25 years?

77 / 128

Prediction

What if somebody has an experience of 25 years?

predict(m_lpm, newdata = data.frame(exper = 25))

##        1 
## 1.535723

🤨

Our prediction goes outside the range of our data.

77 / 128

Prediction

What if somebody has an experience of 25 years?

predict(m_lpm, newdata = data.frame(exper = 25))

##        1 
## 1.535723

🤨

Our prediction goes outside the range of our data.

As a rule, the assumed data generating process should match the boundaries of the data.

77 / 128

Prediction

What if somebody has an experience of 25 years?

predict(m_lpm, newdata = data.frame(exper = 25))

##        1 
## 1.535723

🤨

Our prediction goes outside the range of our data.

As a rule, the assumed data generating process should match the boundaries of the data.

Of course, you could truncate after the fact. I think that's less than ideal (see here or here for discussions contrasting LPM with logistic regression).

77 / 128

The binomial model

$y_{i} \sim Binomial (n, p_{i})$

78 / 128

The binomial model

$y_{i} \sim Binomial (n, p_{i})$ Think in terms of coin flips

78 / 128

The binomial model

$y_{i} \sim Binomial (n, p_{i})$ Think in terms of coin flips

$n$ is the number of coin flips, while $p_{i}$ is the probability of heads

78 / 128

Example

Flip 1 coin 10 times

set.seed(42)
rbinom(
  n = 10, # number of trials
  size = 1, # number of coins
  prob = 0.5 # probability of heads
)

##  [1] 1 1 0 1 1 1 1 0 1 1

79 / 128

Example

Flip 1 coin 10 times

set.seed(42)
rbinom(
  n = 10, # number of trials
  size = 1, # number of coins
  prob = 0.5 # probability of heads
)

##  [1] 1 1 0 1 1 1 1 0 1 1

Side note - a binomial model with size = 1 (or $n = 1$ in equation form) is equivalent to a Bernoulli distribution

79 / 128

Example

Flip 1 coin 10 times

set.seed(42)
rbinom(
  n = 10, # number of trials
  size = 1, # number of coins
  prob = 0.5 # probability of heads
)

##  [1] 1 1 0 1 1 1 1 0 1 1

Side note - a binomial model with size = 1 (or $n = 1$ in equation form) is equivalent to a Bernoulli distribution

Flip 10 coins 1 time

rbinom(n = 1, size = 10, prob = 0.5)

## [1] 5

79 / 128

Modeling

We now build a linear model for $p$ , just like we previously built a linear model for $μ$ .

80 / 128

Modeling

We now build a linear model for $p$ , just like we previously built a linear model for $μ$ .

A problem

Probability is bounded $[0, 1]$

We need to ensure that our model respects these bounds

80 / 128

Link functions

We solve this problem by using a link function

$\begin{aligned} y_{i} & \sim Binomial (n, p_{i}) \\ f (p_{i}) & = α + β (x_{i}) \end{aligned}$

81 / 128

Link functions

We solve this problem by using a link function

$\begin{aligned} y_{i} & \sim Binomial (n, p_{i}) \\ f (p_{i}) & = α + β (x_{i}) \end{aligned}$

Instead of modeling $p_{i}$ directly, we model $f (p_{i})$

81 / 128

Link functions

We solve this problem by using a link function

$\begin{aligned} y_{i} & \sim Binomial (n, p_{i}) \\ f (p_{i}) & = α + β (x_{i}) \end{aligned}$

Instead of modeling $p_{i}$ directly, we model $f (p_{i})$
The specific $f$ is the link function

81 / 128

Link functions

We solve this problem by using a link function

$\begin{aligned} y_{i} & \sim Binomial (n, p_{i}) \\ f (p_{i}) & = α + β (x_{i}) \end{aligned}$

Instead of modeling $p_{i}$ directly, we model $f (p_{i})$
The specific $f$ is the link function
Link functions map the linear space of the model to the non-linear parameters (like probability)

81 / 128

Link functions

We solve this problem by using a link function

$\begin{aligned} y_{i} & \sim Binomial (n, p_{i}) \\ f (p_{i}) & = α + β (x_{i}) \end{aligned}$

Instead of modeling $p_{i}$ directly, we model $f (p_{i})$
The specific $f$ is the link function
Link functions map the linear space of the model to the non-linear parameters (like probability)
The log and logit links are most common

81 / 128

Binomial logistic regression

If we only have two categories, we commonly assume a binomial distribution, with a logit link.

$\begin{aligned} y_{i} & \sim Binomial (n, p_{i}) \\ logit (p_{i}) & = α + β (x_{i}) \end{aligned}$

82 / 128

Binomial logistic regression

If we only have two categories, we commonly assume a binomial distribution, with a logit link.

$\begin{aligned} y_{i} & \sim Binomial (n, p_{i}) \\ logit (p_{i}) & = α + β (x_{i}) \end{aligned}$ where the logit link is defined by the log-odds

$logit (p_{i}) = \log [\frac{p_{i}}{1 - p_{i}}]$

82 / 128

Binomial logistic regression

If we only have two categories, we commonly assume a binomial distribution, with a logit link.

$\begin{aligned} y_{i} & \sim Binomial (n, p_{i}) \\ logit (p_{i}) & = α + β (x_{i}) \end{aligned}$ where the logit link is defined by the log-odds

$logit (p_{i}) = \log [\frac{p_{i}}{1 - p_{i}}]$

$\log [\frac{p_{i}}{1 - p_{i}}] = α + β (x_{i})$

82 / 128

Inverse link

What we probably want to interpret is probability

83 / 128

Inverse link

What we probably want to interpret is probability

We can transform the log-odds to probability by exponentiating

$p_{i} = \frac{\exp (α + β (x_{i}))}{1 + \exp (α + β (x_{i}))}$

83 / 128

Inverse link

What we probably want to interpret is probability

We can transform the log-odds to probability by exponentiating

$p_{i} = \frac{\exp (α + β (x_{i}))}{1 + \exp (α + β (x_{i}))}$ This is the logistic function, or the inverse-logit.

83 / 128

Example logistic regression model

m_glm <- glm(high_wage ~ exper, 
             data = wages, 
             family = binomial(link = "logit"))
arm::display(m_glm)

## glm(formula = high_wage ~ exper, family = binomial(link = "logit"), 
##     data = wages)
##             coef.est coef.se
## (Intercept) -1.65     0.05  
## exper        0.25     0.01  
## ---
##   n = 6402, k = 2
##   residual deviance = 7655.2, null deviance = 8345.8 (difference = 690.5)

84 / 128

Coefficient interpretation

The coefficients are reported on the log-odds scale. Other than that, interpretation is the same.

85 / 128

Coefficient interpretation

The coefficients are reported on the log-odds scale. Other than that, interpretation is the same.

For example:

The log-odds of a participant with zero years experience being in the high wage category was -1.65.

For every one year of additional experience, the log-odds of being in the high wage category increased by 0.25.

85 / 128

Note

Outside of scientific audiences, almost nobody is going to understand the previous slide
You cannot just transform the coefficients and interpret them as probabilities (because it is non-linear on the probability scale).

86 / 128

Log odds scale

tibble(exper = 0:25) %>% 
  mutate(pred = predict(m_glm, newdata = .)) %>% 
  ggplot(aes(exper, pred)) +
  geom_line()

87 / 128

Log odds scale

tibble(exper = 0:25) %>% 
  mutate(pred = predict(m_glm, newdata = .)) %>% 
  ggplot(aes(exper, pred)) +
  geom_line()

Perfectly straight line - change in log-odds are modeled as a linear function of experience

87 / 128

Probability scale

tibble(exper = 0:25) %>% 
  mutate(pred = predict(m_glm, 
                        newdata = ., 
                        type = "response")) %>% 
  ggplot(aes(exper, pred)) +
  geom_line()

88 / 128

Probability scale

tibble(exper = 0:25) %>% 
  mutate(pred = predict(m_glm, 
                        newdata = ., 
                        type = "response")) %>% 
  ggplot(aes(exper, pred)) +
  geom_line()

Our model parameters map to probability non-linearly, and it is bound to $[0, 1]$

88 / 128

Probability predictions

Let's make the predictions from the previous slide "by hand"

89 / 128

Probability predictions

Let's make the predictions from the previous slide "by hand"

$p_{i} = \frac{\exp (α + β (x_{i}))}{1 + \exp (α + β (x_{i}))}$

89 / 128

Probability predictions

Let's make the predictions from the previous slide "by hand"

$p_{i} = \frac{\exp (α + β (x_{i}))}{1 + \exp (α + β (x_{i}))}$ And our coefficients are

coef(m_glm)

## (Intercept)       exper 
##  -1.6467568   0.2536815

89 / 128

Intercept

$(p_{i} | exper = 0) = \frac{\exp (- 1.65 + 0.25 (0))}{1 + \exp (- 1.65 + 0.25 (0))}$

90 / 128

Intercept

$(p_{i} | exper = 0) = \frac{\exp (- 1.65 + 0.25 (0))}{1 + \exp (- 1.65 + 0.25 (0))}$ $(p_{i} | exper = 0) = \frac{\exp (- 1.65)}{1 + \exp (- 1.65)}$

90 / 128

Intercept

$(p_{i} | exper = 0) = \frac{\exp (- 1.65 + 0.25 (0))}{1 + \exp (- 1.65 + 0.25 (0))}$ $(p_{i} | exper = 0) = \frac{\exp (- 1.65)}{1 + \exp (- 1.65)}$ $(p_{i} | exper = 0) = \frac{0.19}{1.19} = 0.16$

90 / 128

Five years experience

Notice the exponentiation happens after adding the coefficients together

$(p_{i} | exper = 5) = \frac{\exp (- 1.65 + 0.25 (5))}{1 + \exp (- 1.65 + 0.25 (5))}$

91 / 128

Five years experience

Notice the exponentiation happens after adding the coefficients together

$(p_{i} | exper = 5) = \frac{\exp (- 1.65 + 0.25 (5))}{1 + \exp (- 1.65 + 0.25 (5))}$ $(p_{i} | exper = 5) = \frac{\exp (- 0.4)}{1 + \exp (- 0.4)}$

91 / 128

Five years experience

Notice the exponentiation happens after adding the coefficients together

$(p_{i} | exper = 5) = \frac{\exp (- 1.65 + 0.25 (5))}{1 + \exp (- 1.65 + 0.25 (5))}$ $(p_{i} | exper = 5) = \frac{\exp (- 0.4)}{1 + \exp (- 0.4)}$ $(p_{i} | exper = 5) = \frac{0.67}{1.67} = 0.40$

91 / 128

Fifteen years experience

$(p_{i} | exper = 15) = \frac{\exp (- 1.65 + 0.25 (15))}{1 + \exp (- 1.65 + 0.25 (15))}$

92 / 128

Fifteen years experience

$(p_{i} | exper = 15) = \frac{\exp (- 1.65 + 0.25 (15))}{1 + \exp (- 1.65 + 0.25 (15))}$ $(p_{i} | exper = 15) = \frac{\exp (2.1)}{1 + \exp (2.1)}$

92 / 128

Fifteen years experience

$(p_{i} | exper = 15) = \frac{\exp (- 1.65 + 0.25 (15))}{1 + \exp (- 1.65 + 0.25 (15))}$ $(p_{i} | exper = 15) = \frac{\exp (2.1)}{1 + \exp (2.1)}$ $(p_{i} | exper = 15) = \frac{8.16}{9.16} = 0.89$

92 / 128

More interpretation

Let's try to make this so more people might understand.

93 / 128

More interpretation

Let's try to make this so more people might understand.

First, show the logistic curve on the probability scale! Can help prevent linear interpretations

93 / 128

More interpretation

Let's try to make this so more people might understand.

First, show the logistic curve on the probability scale! Can help prevent linear interpretations

Discuss probabilities at different $x$ values

93 / 128

More interpretation

Let's try to make this so more people might understand.

First, show the logistic curve on the probability scale! Can help prevent linear interpretations

Discuss probabilities at different $x$ values

The "divide by 4" rule

93 / 128

Divide by 4

Logistic curve is steepest when $p_{i} = 0.5$
The slope of the logistic curve (its derivative) is maximized at $β / 4$
Aside from the intercept, we can say that the change is no more than $β / 4$

94 / 128

Divide by 4

Logistic curve is steepest when $p_{i} = 0.5$
The slope of the logistic curve (its derivative) is maximized at $β / 4$
Aside from the intercept, we can say that the change is no more than $β / 4$

Example

$\frac{0.25}{4} = 0.0625$

94 / 128

Divide by 4

Logistic curve is steepest when $p_{i} = 0.5$
The slope of the logistic curve (its derivative) is maximized at $β / 4$
Aside from the intercept, we can say that the change is no more than $β / 4$

Example

$\frac{0.25}{4} = 0.0625$ So, a one year increase in experience corresponds to, at most, a 6% increase in being in the high wage category

94 / 128

Example writeup

There was a 16 percent chance that a participant with zero years experience would be in the high wage category. The logistic function, which mapped years of experience to the probability of being in the high-wage category, was non-linear, as shown in Figure 1. At its steepest point, a one year increase in experience corresponded with approximately a 6% increase in the probability of being in the high-wage category. For individuals with 5, 10, and 15 years of experience, the probability increased to a 41, 71, and 90 percent chance, respectively.

95 / 128

Other distributions

96 / 128

Figure from McElreath

97 / 128

Multilevel logistic regression

98 / 128

The data

Polling data from the 1988 election.

polls <- rio::import(here::here("data", "polls.dta"),
                     setclass = "tbl_df")
polls

## # A tibble: 13,544 x 10
##      org  year survey  bush state   edu   age female black weight
##    <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>
##  1     1     1      1     1     7     2     2      1     0   1403
##  2     1     1      1     1    33     4     3      0     0    778
##  3     1     1      1     0    20     2     1      1     0   1564
##  4     1     1      1     1    31     3     2      1     0   1055
##  5     1     1      1     1    18     3     1      1     0   1213
##  6     1     1      1     1    31     4     2      0     0    910
##  7     1     1      1     1    40     1     3      0     0    735
##  8     1     1      1     1    33     4     2      1     0    410
##  9     1     1      1     0    22     4     2      1     0    410
## 10     1     1      1     1    22     4     3      0     0    778
## # … with 13,534 more rows

99 / 128

About the data

Collected one week before the election
Nationally representative sample
Should use post-stratification to control for non-response, but we'll hold off on that for now (See Gelman & Hill, Chapter 14 for more details)

100 / 128

Baseline probability

Let's assume we want to estimate the probability that Bush will be elected.

101 / 128

Baseline probability

Let's assume we want to estimate the probability that Bush will be elected.

We could fit a single level model like this

$\begin{aligned} bush & \sim Binomial (n = 1, {prob}_{bush = 1} = \hat{P}) \\ \log [\frac{\hat{P}}{1 - \hat{P}}] & = α \end{aligned}$

101 / 128

In code

Can you write the code for the previous model?

01:00

102 / 128

In code

Can you write the code for the previous model?

01:00

bush_sl <- glm(bush ~ 1, 
               data = polls,
               family = binomial(link = "logit"))

102 / 128

In code

Can you write the code for the previous model?

01:00

bush_sl <- glm(bush ~ 1, 
               data = polls,
               family = binomial(link = "logit"))

arm::display(bush_sl)

## glm(formula = bush ~ 1, family = binomial(link = "logit"), data = polls)
##             coef.est coef.se
## (Intercept) 0.25     0.02   
## ---
##   n = 11566, k = 1
##   residual deviance = 15858.1, null deviance = 15858.1 (difference = 0.0)

102 / 128

In code

Can you write the code for the previous model?

01:00

bush_sl <- glm(bush ~ 1, 
               data = polls,
               family = binomial(link = "logit"))

arm::display(bush_sl)

## glm(formula = bush ~ 1, family = binomial(link = "logit"), data = polls)
##             coef.est coef.se
## (Intercept) 0.25     0.02   
## ---
##   n = 11566, k = 1
##   residual deviance = 15858.1, null deviance = 15858.1 (difference = 0.0)

So, $p_{i} = \frac{\exp (0.25)}{1 + \exp (0.25)} = 0.56$

102 / 128

State-level variability

To estimate state-level variability, we just specify a distribution for the intercept variability.

103 / 128

State-level variability

To estimate state-level variability, we just specify a distribution for the intercept variability.

103 / 128

State-level variability

To estimate state-level variability, we just specify a distribution for the intercept variability.

$\begin{aligned} bush & \sim Binomial (n = 1, {prob}_{bush = 1} = \hat{P}) \\ \log [\frac{\hat{P}}{1 - \hat{P}}] & = α_{j [i]} \\ α_{j} & \sim N (μ_{α_{j}}, σ_{α_{j}}^{2}), for state j = 1, \dots,J \end{aligned}$ Notice we're still specifying that the intercept variability is generated by a normal distribution.

103 / 128

State-level variability

To estimate state-level variability, we just specify a distribution for the intercept variability.

What does this variability actually represent?

103 / 128

State-level variability

To estimate state-level variability, we just specify a distribution for the intercept variability.

What does this variability actually represent?

Variance in the log-odds

103 / 128

Fitting the model

If we're using {lme4}, we just swap lmer() for glmer() and specify the family and link function.

104 / 128

Fitting the model

If we're using {lme4}, we just swap lmer() for glmer() and specify the family and link function.

library(lme4)
m0 <- glmer(bush ~ 1 + (1|state),
           data = polls,
           family = binomial(link = "logit"))
arm::display(m0)

## glmer(formula = bush ~ 1 + (1 | state), data = polls, family = binomial(link = "logit"))
## coef.est  coef.se 
##     0.25     0.06 
## 
## Error terms:
##  Groups   Name        Std.Dev.
##  state    (Intercept) 0.34    
##  Residual             1.00    
## ---
## number of obs: 11566, groups: state, 49
## AIC = 15697, DIC = 15450.4
## deviance = 15571.7

104 / 128

Interpretation

The average log odds of supporting bush was 0.25
This average varied between states with a standard deviation of 0.34

105 / 128

State-level variation

library(broom.mixed)
m0_tidied <- tidy(m0, effects = "ran_vals", conf.int = TRUE)
m0_tidied

## # A tibble: 49 x 8
##    effect   group level term           estimate  std.error    conf.low   conf.high
##    <chr>    <chr> <chr> <chr>             <dbl>      <dbl>       <dbl>       <dbl>
##  1 ran_vals state 1     (Intercept)  0.5201836  0.1526301   0.2210341   0.8193332 
##  2 ran_vals state 3     (Intercept)  0.09438107 0.1424292  -0.1847750   0.3735371 
##  3 ran_vals state 4     (Intercept)  0.06905650 0.1737493  -0.2714859   0.4095989 
##  4 ran_vals state 5     (Intercept)  0.03522421 0.05570977 -0.07396493  0.1444133 
##  5 ran_vals state 6     (Intercept)  0.1045418  0.1513626  -0.1921233   0.4012070 
##  6 ran_vals state 7     (Intercept) -0.1015410  0.1546380  -0.4046260   0.2015440 
##  7 ran_vals state 8     (Intercept) -0.3824312  0.2376744  -0.8482644   0.08340195
##  8 ran_vals state 9     (Intercept) -0.6229779  0.2931982  -1.197636   -0.04832002
##  9 ran_vals state 10    (Intercept)  0.3027001  0.07975137  0.1463903   0.4590099 
## 10 ran_vals state 11    (Intercept)  0.09282107 0.1173420  -0.1371651   0.3228072 
## # … with 39 more rows

106 / 128

Fancified Plot Code

m0_tidied %>% 
  mutate(level = forcats::fct_reorder(level, estimate)) %>% 
  ggplot(aes(estimate, level)) +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
  geom_point(aes(color = estimate)) +
  geom_vline(xintercept = 0, color = "gray40", size = 3) +
  labs(x = "Log-Odds Estimate", y = "") +
  colorspace::scale_color_continuous_diverging(palette = "Blue-Red 2") +
  guides(color = "none") +
  theme(panel.grid.major.y = element_blank(),
        axis.text.y = element_blank())

107 / 128

Fancified Plot

108 / 128

Extending the model

Let's add some predictors

Include age, female and black as fixed effect predictors.

109 / 128

Extending the model

Let's add some predictors

Include age, female and black as fixed effect predictors.
We should center age so the intercept represents the sample average age.

109 / 128

Extending the model

Let's add some predictors

Include age, female and black as fixed effect predictors.
We should center age so the intercept represents the sample average age.

You try first

03:00

109 / 128

polls <- polls %>% 
  mutate(age_c = age - mean(age, na.rm = TRUE))
m1 <- glmer(bush ~ age_c + female + black + (1|state),
            data = polls,
            family = binomial(link = "logit"))
arm::display(m1)

## glmer(formula = bush ~ age_c + female + black + (1 | state), 
##     data = polls, family = binomial(link = "logit"))
##             coef.est coef.se
## (Intercept)  0.43     0.07  
## age_c       -0.08     0.02  
## female      -0.11     0.04  
## black       -1.84     0.09  
## 
## Error terms:
##  Groups   Name        Std.Dev.
##  state    (Intercept) 0.41    
##  Residual             1.00    
## ---
## number of obs: 11566, groups: state, 49
## AIC = 15137.1, DIC = 14856.7
## deviance = 14991.9

110 / 128

Varying slopes