Intro to Bayes

Daniel Anderson

Week 7

1 / 65

Agenda

[We're not going to get through it all]

  • Review Homework 2

  • Some equation practice

  • Introduce Bayes theorem

    • Go through an example with estimating a mean
  • Discuss Bayes in the context of regression modeling

2 / 65

Homework 2 Review

3 / 65

Equation practice

4 / 65

The data

From an example in the Mplus manual. I made up the column names.

01:30
mplus_d <- read_csv(here::here("data", "mplus920.csv"))
mplus_d
## # A tibble: 7,500 x 6
## score baseline sch_treatment dist_ses schid distid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5.559216 1.383101 1 -0.642262 1 1
## 2 -0.107394 -0.789654 1 -0.642262 1 1
## 3 0.049476 -0.760867 1 -0.642262 1 1
## 4 -2.387703 -0.798527 1 -0.642262 1 1
## 5 1.180393 -0.411377 1 -0.642262 1 1
## 6 3.959005 -0.987154 1 -0.642262 2 1
## 7 -0.895792 -1.966773 1 -0.642262 2 1
## 8 2.879087 0.42117 1 -0.642262 2 1
## 9 5.611088 1.67047 1 -0.642262 2 1
## 10 2.828119 0.001154 1 -0.642262 2 1
## # … with 7,490 more rows
5 / 65

Model 1

Fit the following model

01:30

$$
\begin{aligned}
\operatorname{score}_i &\sim N(\alpha_{j[i]}, \sigma^2) \\
\alpha_j &\sim N(\mu_{\alpha_j}, \sigma^2_{\alpha_j}), \text{ for distid } j = 1, \dots, J
\end{aligned}
$$

lmer(score ~ 1 + (1|distid), data = mplus_d)
6 / 65

Model 2

Fit the following model

01:00

$$
\begin{aligned}
\operatorname{score}_i &\sim N(\alpha_{j[i],k[i]} + \beta_1(\operatorname{baseline}), \sigma^2) \\
\alpha_j &\sim N(\mu_{\alpha_j}, \sigma^2_{\alpha_j}), \text{ for schid } j = 1, \dots, J \\
\alpha_k &\sim N(\mu_{\alpha_k}, \sigma^2_{\alpha_k}), \text{ for distid } k = 1, \dots, K
\end{aligned}
$$

lmer(score ~ baseline + (1|schid) + (1|distid),
     data = mplus_d)
7 / 65

Model 3

Fit the following model

01:30

$$
\begin{aligned}
\operatorname{score}_i &\sim N(\alpha_{j[i],k[i]} + \beta_{1j[i]}(\operatorname{baseline}), \sigma^2) \\
\begin{pmatrix} \alpha_j \\ \beta_{1j} \end{pmatrix} &\sim N\left(\begin{pmatrix} \mu_{\alpha_j} \\ \mu_{\beta_{1j}} \end{pmatrix}, \begin{pmatrix} \sigma^2_{\alpha_j} & \rho_{\alpha_j\beta_{1j}} \\ \rho_{\beta_{1j}\alpha_j} & \sigma^2_{\beta_{1j}} \end{pmatrix}\right), \text{ for schid } j = 1, \dots, J \\
\alpha_k &\sim N(\gamma_{\alpha_0} + \gamma_{\alpha_1}(\operatorname{dist\_ses}) + \gamma_{\alpha_2}(\operatorname{baseline} \times \operatorname{dist\_ses}), \sigma^2_{\alpha_k}), \text{ for distid } k = 1, \dots, K
\end{aligned}
$$

lmer(score ~ baseline * dist_ses +
       (baseline|schid) + (1|distid),
     data = mplus_d)
8 / 65

Model 4

Fit the following model

01:30

$$
\begin{aligned}
\operatorname{score}_i &\sim N(\alpha_{j[i],k[i]} + \beta_{1j[i]}(\operatorname{baseline}), \sigma^2) \\
\begin{pmatrix} \alpha_j \\ \beta_{1j} \end{pmatrix} &\sim N\left(\begin{pmatrix} \gamma_{\alpha_0} + \gamma_{\alpha_1}(\operatorname{sch\_treatment}) \\ \mu_{\beta_{1j}} \end{pmatrix}, \begin{pmatrix} \sigma^2_{\alpha_j} & \rho_{\alpha_j\beta_{1j}} \\ \rho_{\beta_{1j}\alpha_j} & \sigma^2_{\beta_{1j}} \end{pmatrix}\right), \text{ for schid } j = 1, \dots, J \\
\alpha_k &\sim N(\gamma_{\alpha_0} + \gamma_{\alpha_1}(\operatorname{dist\_ses}), \sigma^2_{\alpha_k}), \text{ for distid } k = 1, \dots, K
\end{aligned}
$$

lmer(score ~ baseline + sch_treatment + dist_ses +
       (baseline|schid) + (1|distid),
     data = mplus_d)
9 / 65

Model 5

Fit the following model

01:30

$$
\begin{aligned}
\operatorname{score}_i &\sim N(\alpha_{j[i],k[i]} + \beta_{1j[i]}(\operatorname{baseline}), \sigma^2) \\
\begin{pmatrix} \alpha_j \\ \beta_{1j} \end{pmatrix} &\sim N\left(\begin{pmatrix} \gamma_{\alpha_0} + \gamma_{\alpha_1 k[i]}(\operatorname{sch\_treatment}) \\ \gamma_{\beta_{10}} + \gamma_{\beta_{11}}(\operatorname{sch\_treatment}) \end{pmatrix}, \begin{pmatrix} \sigma^2_{\alpha_j} & \rho_{\alpha_j\beta_{1j}} \\ \rho_{\beta_{1j}\alpha_j} & \sigma^2_{\beta_{1j}} \end{pmatrix}\right), \text{ for schid } j = 1, \dots, J \\
\begin{pmatrix} \alpha_k \\ \gamma_{1k} \end{pmatrix} &\sim N\left(\begin{pmatrix} \gamma_{\alpha_0} + \gamma_{\alpha_1}(\operatorname{dist\_ses}) \\ \mu_{\gamma_{1k}} \end{pmatrix}, \begin{pmatrix} \sigma^2_{\alpha_k} & \rho_{\alpha_k\gamma_{1k}} \\ \rho_{\gamma_{1k}\alpha_k} & \sigma^2_{\gamma_{1k}} \end{pmatrix}\right), \text{ for distid } k = 1, \dots, K
\end{aligned}
$$

lmer(score ~ baseline * sch_treatment + dist_ses +
       (baseline|schid) + (sch_treatment|distid),
     data = mplus_d)
10 / 65

Model 6

Fit the following model

01:30

$$
\begin{aligned}
\operatorname{score}_i &\sim N(\alpha_{j[i],k[i]} + \beta_{1j[i],k[i]}(\operatorname{baseline}), \sigma^2) \\
\begin{pmatrix} \alpha_j \\ \beta_{1j} \end{pmatrix} &\sim N\left(\begin{pmatrix} \gamma_{\alpha_0} + \gamma_{\alpha_1 k[i]}(\operatorname{sch\_treatment}) \\ \gamma_{\beta_{10} k[i]} + \gamma_{\beta_{11} k[i]}(\operatorname{sch\_treatment}) \end{pmatrix}, \begin{pmatrix} \sigma^2_{\alpha_j} & \rho_{\alpha_j\beta_{1j}} \\ \rho_{\beta_{1j}\alpha_j} & \sigma^2_{\beta_{1j}} \end{pmatrix}\right), \text{ for schid } j = 1, \dots, J \\
\begin{pmatrix} \alpha_k \\ \beta_{1k} \\ \gamma_{1k} \\ \gamma_{\beta_{11} k} \end{pmatrix} &\sim N\left(\begin{pmatrix} \gamma_{\alpha_0} + \gamma_{\alpha_1}(\operatorname{dist\_ses}) \\ \mu_{\beta_{1k}} \\ \mu_{\gamma_{1k}} \\ \mu_{\gamma_{\beta_{11} k}} \end{pmatrix}, \begin{pmatrix} \sigma^2_{\alpha_k} & 0 & 0 & 0 \\ 0 & \sigma^2_{\beta_{1k}} & 0 & 0 \\ 0 & 0 & \sigma^2_{\gamma_{1k}} & 0 \\ 0 & 0 & 0 & \sigma^2_{\gamma_{\beta_{11} k}} \end{pmatrix}\right), \text{ for distid } k = 1, \dots, K
\end{aligned}
$$

lmer(score ~ baseline * sch_treatment + dist_ses +
       (baseline|schid) +
       (baseline * sch_treatment||distid),
     data = mplus_d)
11 / 65

Final one

Fit the following model

01:30

$$
\begin{aligned}
\operatorname{score}_i &\sim N(\alpha_{j[i],k[i]} + \beta_{1j[i],k[i]}(\operatorname{baseline}), \sigma^2) \\
\begin{pmatrix} \alpha_j \\ \beta_{1j} \end{pmatrix} &\sim N\left(\begin{pmatrix} \gamma_{\alpha_0} + \gamma_{\alpha_1}(\operatorname{sch\_treatment}) \\ \gamma_{\beta_{10}} + \gamma_{\beta_{11}}(\operatorname{sch\_treatment}) \end{pmatrix}, \begin{pmatrix} \sigma^2_{\alpha_j} & \rho_{\alpha_j\beta_{1j}} \\ \rho_{\beta_{1j}\alpha_j} & \sigma^2_{\beta_{1j}} \end{pmatrix}\right), \text{ for schid } j = 1, \dots, J \\
\begin{pmatrix} \alpha_k \\ \beta_{1k} \end{pmatrix} &\sim N\left(\begin{pmatrix} \gamma_{\alpha_0} + \gamma_{\alpha_1}(\operatorname{dist\_ses}) + \gamma_{\alpha_2}(\operatorname{dist\_ses} \times \operatorname{sch\_treatment}) \\ \gamma_{\beta_{10}} + \gamma_{\beta_{11}}(\operatorname{dist\_ses}) + \gamma_{\beta_{12}}(\operatorname{dist\_ses} \times \operatorname{sch\_treatment}) \end{pmatrix}, \begin{pmatrix} \sigma^2_{\alpha_k} & \rho_{\alpha_k\beta_{1k}} \\ \rho_{\beta_{1k}\alpha_k} & \sigma^2_{\beta_{1k}} \end{pmatrix}\right), \text{ for distid } k = 1, \dots, K
\end{aligned}
$$

lmer(score ~ baseline * sch_treatment * dist_ses +
       (baseline|schid) + (baseline|distid),
     data = mplus_d)
12 / 65

Bayes

13 / 65

A disclaimer

  • There is no chance we'll really be able to do Bayes justice in this class

  • The hope for today is that you'll get an introduction

  • By the end you should be able to fit the models you already can, but in a Bayes framework

  • Hopefully you also recognize the tradeoffs, and potential extensions

14 / 65

Bayes theorem

15 / 65

In equation form

You'll see this presented many different ways, perhaps most commonly as

$$p(B \mid A) = \frac{p(A \mid B) \times p(B)}{p(A)}$$

where $\mid$ is read as "given" and $p$ is a probability.

I prefer to give $A$ and $B$ more meaningful names:

$$p(\text{prior} \mid \text{data}) = \frac{p(\text{data} \mid \text{prior}) \times p(\text{prior})}{p(\text{data})}$$

16 / 65

A real example

Classifying a student with a learning disability. We want to know

$$p(\text{LD} \mid \text{Test}_p) = \frac{p(\text{Test}_p \mid \text{LD}) \times p(\text{LD})}{p(\text{Test}_p)}$$

Notice, this means we need to know:

  • True positive rate of the test, $p(\text{Test}_p \mid \text{LD})$

  • Base rate for learning disabilities, $p(\text{LD})$

  • Base rate for testing positive, $p(\text{Test}_p)$

17 / 65

Estimating

If we have these things, we can estimate the probability that a student has a learning disability, given a positive test.

Let's assume:

  • $p(\text{Test}_p \mid \text{LD}) = 0.90$

  • $p(\text{LD}) = 0.10$

  • $p(\text{Test}_p) = 0.20$
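
As a quick check, the same calculation in R (nothing here beyond the three numbers above; the object names are just illustrative):

p_test_ld <- 0.90 # p(Test_p | LD), true positive rate of the test
p_ld      <- 0.10 # p(LD), base rate for learning disabilities
p_test    <- 0.20 # p(Test_p), base rate for testing positive

# Bayes theorem: p(LD | Test_p)
(p_test_ld * p_ld) / p_test
## [1] 0.45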

18 / 65

$$p(\text{LD} \mid \text{Test}_p) = \frac{.90 \times .10}{.20}$$

$$p(\text{LD} \mid \text{Test}_p) = 0.45$$

A bit less than you might have expected? Probability is hard...

When we see things like "90% true positive rate," we want to interpret it as $p(\text{LD} \mid \text{Test}_p)$, when it's actually $p(\text{Test}_p \mid \text{LD})$.

19 / 65

Pieces

If we know all the pieces, we can estimate Bayes theorem directly.

Unfortunately this is almost never the case...

20 / 65

Alternative view

$$\text{updated beliefs} = \frac{\text{likelihood of data} \times \text{prior information}}{\text{average likelihood}}$$

How do we calculate the likelihood of the data? We have to assume some distribution.

21 / 65

Example with IQ

This example borrowed from TJ Mahr

#install.packages("carData")
iqs <- carData::Burt$IQbio
iqs
## [1] 82 80 88 108 116 117 132 71 75 93 95 88 111 63 77 86 83 93 97 87 94 96 112 113 106 107 98

IQ scores are generally assumed to be generated from a distribution that looks like this:

$$\text{IQ}_i \sim N(100, 15)$$

22 / 65

Likelihood

What's the likelihood of a score of 80, assuming this distribution?

dnorm(80, mean = 100, sd = 15)
## [1] 0.010934
23 / 65

Likelihood of the data

We multiply the individual likelihoods to get the overall likelihood of the data. However, this product quickly becomes a very small number. Computationally, it's easier to work with the sum of the log likelihoods instead.

dnorm(iqs, mean = 100, sd = 15, log = TRUE)
## [1] -4.346989 -4.515878 -3.946989 -3.769211 -4.195878 -4.269211 -5.902544 -5.495878 -5.015878 -3.735878 -3.682544 -3.946989 -3.895878 -6.669211 -4.802544
## [16] -4.062544 -4.269211 -3.735878 -3.646989 -4.002544 -3.706989 -3.662544 -3.946989 -4.002544 -3.706989 -3.735878 -3.635878
sum(dnorm(iqs, mean = 100, sd = 15, log = TRUE))
## [1] -114.3065
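
To see why we work on the log scale, compare the raw product of the densities to the same quantity recovered from the summed logs. With only 27 scores the product is already tiny; with a much larger sample it would underflow to zero entirely. A quick sketch:

prod(dnorm(iqs, mean = 100, sd = 15))                  # raw product of likelihoods: on the order of 1e-50
exp(sum(dnorm(iqs, mean = 100, sd = 15, log = TRUE)))  # the same value, recovered from the log scale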
24 / 65

Alternative distributions

What if we assumed the data were generated from an alternative distribution, say $\text{IQ}_i \sim N(115, 5)$?

sum(dnorm(iqs, mean = 115, sd = 5, log = TRUE))
## [1] -416.3662

The value is much lower. In most models, we are estimating $\mu$ and $\sigma$, and trying to find the values that maximize the sum of the log likelihoods.
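
For instance, a minimal sketch of that idea with optim() (illustrative code, not part of the original example): define the negative log likelihood and search for the $\mu$ and $\sigma$ that minimize it.

# negative log likelihood for a normal model of the IQ scores
nll <- function(par, x) {
  if (par[2] <= 0) return(Inf) # keep the standard deviation positive
  -sum(dnorm(x, mean = par[1], sd = par[2], log = TRUE))
}

# search from rough starting values; the estimates should land
# close to mean(iqs) and sd(iqs)
optim(par = c(mu = 90, sigma = 10), fn = nll, x = iqs)$par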

25 / 65

Visually

The real data generating distribution

26 / 65

Visually

The poorly fitting one
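
A sketch of the kind of plot on this slide and the previous one (illustrative code, assuming ggplot2 is loaded as elsewhere in the slides): overlay each assumed distribution on the observed scores.

library(ggplot2)

ggplot(data.frame(iq = iqs), aes(iq)) +
  geom_histogram(aes(y = after_stat(density)), bins = 15, alpha = 0.5) +
  stat_function(fun = dnorm, args = list(mean = 100, sd = 15)) + # the real data generating distribution
  stat_function(fun = dnorm, args = list(mean = 115, sd = 5),    # the poorly fitting one
                color = "firebrick")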

27 / 65

Non-Bayesian

In a frequentist regression model, we would find parameters that maximize the likelihood. Note - the distributional mean is often conditional.

This is part of why I've come to prefer notation that emphasizes the data generating process.

28 / 65

Example

I know we've talked about this before, but a simple linear regression model like this

m <- lm(IQbio ~ class, data = carData::Burt)

is generally displayed like this

$$\text{IQbio} = \alpha + \beta_1(\text{class}_{\text{low}}) + \beta_2(\text{class}_{\text{medium}}) + \epsilon$$

But we could display the same thing like this:

$$
\begin{aligned}
\text{IQbio} &\sim N(\hat{\mu}, \hat{\sigma}) \\
\hat{\mu} &= \alpha + \beta_1(\text{class}_{\text{low}}) + \beta_2(\text{class}_{\text{medium}})
\end{aligned}
$$
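
A small sketch of what that buys us: $\hat{\mu}$ and $\hat{\sigma}$ can be pulled straight from the fitted lm object.

coef(m)         # alpha, beta_1 (classlow), beta_2 (classmedium): the pieces of mu-hat
sigma(m)        # sigma-hat, the residual standard deviation
head(fitted(m)) # mu-hat for the first few cases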

29 / 65

Priors

30 / 65

Bayesian posterior

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{average likelihood}}$$

The above is how we estimate with Bayes.

In words, it states that our updated beliefs (posterior) depend on the evidence from our data (likelihood) and our prior knowledge/conceptions/information (prior).

Our prior will shift in accordance with the evidence from the data

31 / 65

Basic example

Let's walk through a basic example where we're just estimating a mean. We'll assume we somehow magically know the variance. Please follow along.

First, generate some data

set.seed(123)
true_data <- rnorm(50, 5, 1)
32 / 65

Grid search

We're now going to specify a grid of possible means for our data. Let's search anywhere from -3 to 12 in 0.1 intervals.

grid <- tibble(possible_mean = seq(-3, 12, 0.1))

Next, we'll specify a prior distribution. That is - how likely do we think each of these possible means are?

Let's say our best guess is $\mu = 2$. Values on either side of 2 should be less likely.

prior <- dnorm(grid$possible_mean, mean = 2, sd = 1)
33 / 65

Plot our prior

grid %>%
mutate(prior = prior) %>%
ggplot(aes(possible_mean, prior)) +
geom_line()

Note that the strength of our prior depends on the standard deviation

This would be our best guess as to where the data would fall before observing the data.

34 / 65

Look at other priors

grid %>%
mutate(prior1 = dnorm(possible_mean, mean = 2, sd = 1),
prior2 = dnorm(possible_mean, mean = 2, sd = 2),
prior3 = dnorm(possible_mean, mean = 2, sd = 3)) %>%
ggplot(aes(possible_mean)) +
geom_line(aes(y = prior1)) +
geom_line(aes(y = prior2), color = "cornflowerblue") +
geom_line(aes(y = prior3), color = "firebrick")

35 / 65

Set prior

  • Let's go with a fairly conservative prior, with $\mu = 2, \sigma = 3$.

  • We also need to normalize it so the probability sums to 1.0

grid <- grid %>%
mutate(prior = dnorm(possible_mean, mean = 2, sd = 3),
prior = prior / sum(prior)) # normalize
36 / 65

Observe 1 data point

grid <- grid %>%
mutate(likelihood = dnorm(true_data[1], possible_mean, 2))
grid
## # A tibble: 151 x 3
## possible_mean prior likelihood
## <dbl> <dbl> <dbl>
## 1 -3 0.003477802 0.0001973758
## 2 -2.9 0.003674439 0.0002374240
## 3 -2.8 0.003877883 0.0002848850
## 4 -2.7 0.004088046 0.0003409800
## 5 -2.6 0.004304813 0.0004071013
## 6 -2.5 0.004528041 0.0004848308
## 7 -2.4 0.004757554 0.0005759600
## 8 -2.3 0.004993151 0.0006825094
## 9 -2.2 0.005234594 0.0008067505
## 10 -2.1 0.005481619 0.0009512268
## # … with 141 more rows
37 / 65

Compute posterior

grid <- grid %>%
mutate(posterior = likelihood * prior,
posterior = posterior / sum(posterior)) # normalize
38 / 65

Plot

grid %>%
pivot_longer(-possible_mean) %>%
ggplot(aes(possible_mean, value)) +
geom_line(aes(color = name))

39 / 65

Observe a second data point

The old posterior becomes our new prior

grid <- grid %>%
mutate(likelihood = dnorm(true_data[2], possible_mean, 2),
posterior = likelihood * posterior,
posterior = posterior / sum(posterior))
40 / 65

Plot

grid %>%
pivot_longer(-possible_mean) %>%
ggplot(aes(possible_mean, value)) +
geom_line(aes(color = name))

41 / 65

Observe a third data point

grid <- grid %>%
mutate(likelihood = dnorm(true_data[3], possible_mean, 2),
posterior = likelihood * posterior,
posterior = posterior / sum(posterior))
42 / 65

Plot

grid %>%
pivot_longer(-possible_mean) %>%
ggplot(aes(possible_mean, value)) +
geom_line(aes(color = name))

43 / 65

All the data

grid <- grid %>%
mutate(prior = dnorm(grid$possible_mean, mean = 2, sd = 3),
prior = prior / sum(prior),
posterior = prior) # best guess before seeing data
for(i in seq_along(true_data)) {
grid <- grid %>%
mutate(likelihood = dnorm(true_data[i], possible_mean, 2),
posterior = likelihood * posterior,
posterior = posterior / sum(posterior))
}
44 / 65
grid %>%
pivot_longer(-possible_mean) %>%
ggplot(aes(possible_mean, value)) +
geom_line(aes(color = name))

45 / 65

Posterior

  • We can summarize our posterior distribution

  • This is a fundamental difference between Bayesian & frequentist approaches

    • In Bayes, our data is assumed fixed, our parameters random

    • In frequentist, our data is assumed random, our parameters fixed

46 / 65

Most likely?

grid %>%
filter(posterior == max(posterior))
## # A tibble: 1 x 4
## possible_mean prior likelihood posterior
## <dbl> <dbl> <dbl> <dbl>
## 1 5 0.008459494 0.1992979 0.1416204
47 / 65

Sampling

  • Now that we have a posterior distribution, we can sample from it to help us with inference

  • Each possible mean should be sampled in accordance with its probability specified by the posterior.

  • Let's draw 10,000 samples
posterior_samples <- sample(grid$possible_mean,
size = 10000,
replace = TRUE,
prob = grid$posterior)
48 / 65

Inference

First, let's plot the samples

ggplot(data.frame(sample = posterior_samples), aes(sample)) +
geom_histogram(bins = 100)

49 / 65

Central tendency

mean(posterior_samples)
## [1] 5.00441
median(posterior_samples)
## [1] 5

Spread

sd(posterior_samples)
## [1] 0.278466
50 / 65

Credible intervals

Let's compute an 80% credible interval

tibble(posterior_samples) %>%
summarize(ci_80 = quantile(posterior_samples, c(0.1, 0.9)))
## # A tibble: 2 x 1
## ci_80
## <dbl>
## 1 4.6
## 2 5.4

What's the chance the "true" mean is less than 4.8?

sum(posterior_samples < 4.8) / length(posterior_samples) * 100
## [1] 18.05
51 / 65

Ranges

What's the probability the "true" mean is between 5.2 and 5.5?

sum(posterior_samples >= 5.2 & posterior_samples <= 5.5) /
length(posterior_samples) * 100
## [1] 27.94

Greater than 4.5?

sum(posterior_samples > 4.5) / length(posterior_samples) * 100
## [1] 95.05

Note this is much more natural than frequentist statistics

52 / 65

Change our prior

Let's try again with a tighter prior

grid <- grid %>%
mutate(prior = dnorm(grid$possible_mean, mean = 2, sd = 0.1),
prior = prior / sum(prior),
posterior = prior) # best guess before seeing data
for(i in seq_along(true_data)) {
grid <- grid %>%
mutate(likelihood = dnorm(true_data[i], possible_mean, 2),
posterior = likelihood * posterior,
posterior = posterior / sum(posterior))
}
53 / 65
grid %>%
pivot_longer(-possible_mean) %>%
ggplot(aes(possible_mean, value)) +
geom_line(aes(color = name))

grid %>%
filter(posterior == max(posterior))
## # A tibble: 1 x 4
## possible_mean prior likelihood posterior
## <dbl> <dbl> <dbl> <dbl>
## 1 2.3 0.004431848 0.08476010 0.3915258
54 / 65

More data

Same thing, but this time with tons of data

55 / 65
true_data <- rnorm(5000, 5, 1)
grid <- grid %>%
mutate(prior = dnorm(grid$possible_mean, mean = 2, sd = 0.1),
prior = prior / sum(prior),
posterior = prior) # best guess before seeing data
for(i in seq_along(true_data)) {
grid <- grid %>%
mutate(likelihood = dnorm(true_data[i], possible_mean, 2),
posterior = likelihood * posterior,
posterior = posterior / sum(posterior))
}
56 / 65
grid %>%
pivot_longer(-possible_mean) %>%
ggplot(aes(possible_mean, value)) +
geom_line(aes(color = name))

grid %>%
filter(posterior == max(posterior))
## # A tibble: 1 x 4
## possible_mean prior likelihood posterior
## <dbl> <dbl> <dbl> <dbl>
## 1 4.8 2.277577e-171 0.1994389 0.9678310
57 / 65

Taking a step back

  • The purpose of the prior is to include what you already know into your analysis

  • The strength of your prior should depend on your prior research

  • Larger samples will overwhelm priors quicker, particularly if they are diffuse

  • Think through the lens of updating your prior beliefs

  • This whole framework is quite different, but also gives us a lot of advantages in terms of probability interpretation, as we'll see

58 / 65

Bayes for regression

59 / 65

More complicated

  • Remember our posterior is defined by

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{average likelihood}}$$

When we just had a single parameter to estimate, $\mu$, this was tractable with grid search.

With even simple linear regression, however, we have three parameters: $\alpha$, $\beta$, and $\sigma$.

Our Bayesian model then becomes considerably more complicated:

$$P(\alpha, \beta, \sigma \mid x) = \frac{P(x \mid \alpha, \beta, \sigma)\,P(\alpha, \beta, \sigma)}{\iiint P(x \mid \alpha, \beta, \sigma)\,P(\alpha, \beta, \sigma)\,d\alpha\,d\beta\,d\sigma}$$

60 / 65

Estimation

Rather than trying to compute the integrals, we simulate observations from the joint posterior distribution.

This sounds a bit like magic - how do we do this?

There are multiple different algorithms, but all use some form of Markov chain Monte Carlo (MCMC) sampling.

61 / 65

Conceptually

  • Imagine the posterior as a hill

  • We start with the parameters set to random numbers

    • Estimate the posterior with these values (our spot on the hill)
  • Use information from the previous sample to determine whether and how to change the current parameter values

  • Try to "walk" around in a way that builds a complete "picture" of the hill from the samples

  • Use these samples as your posterior distribution

62 / 65

Metropolis-Hastings

The one we will use is called the Metropolis-Hastings algorithm:

  • Compute a candidate "step" for the parameters

  • Calculate an acceptance ratio, $\alpha = \min\left(1, f(x')/f(x_t)\right)$, where $x'$ is the candidate and $x_t$ is the current value. This will fall between 0 and 1.

  • Generate a number, $u$, from a uniform distribution between 0 and 1

    • if $u \le \alpha$, accept the candidate

    • if $u > \alpha$, reject the candidate

Complicated, but conceptually we're trying to sample the joint posterior distribution in a way that conforms with the probability density of the posterior (i.e., the shape)
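
Below is a minimal sketch of the algorithm for the earlier problem of estimating a single mean (standard deviation again treated as known at 2, with the same prior as before). It's a simplified stand-in for what mh-alg.R does for the full regression model; the proposal standard deviation and number of iterations are just made-up tuning values.

set.seed(123)
true_data <- rnorm(50, 5, 1)

# log posterior (up to a constant): log likelihood + log prior
log_post <- function(mu, x) {
  sum(dnorm(x, mean = mu, sd = 2, log = TRUE)) +
    dnorm(mu, mean = 2, sd = 3, log = TRUE)
}

n_iter <- 5000
mu_samples <- numeric(n_iter)
current <- 0 # arbitrary starting value

for (i in seq_len(n_iter)) {
  candidate <- current + rnorm(1, sd = 0.5)    # propose a candidate "step"
  log_alpha <- log_post(candidate, true_data) -
    log_post(current, true_data)               # acceptance ratio, on the log scale
  if (log(runif(1)) <= log_alpha) {
    current <- candidate                       # accept; otherwise keep the current value
  }
  mu_samples[i] <- current
}

mean(mu_samples) # should land close to the grid-search posterior (around 5)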

63 / 65

Script

  • The mh-alg.R script works through the Metropolis-Hastings algorithm to get samples from the joint posterior of a simple linear regression model.

  • The data are simulated, so we know the true values

  • It's complicated, and not required at all, but it's there for you if you want more info

64 / 65

Next time

Continue discussing Bayes - more emphasis on model results & plotting

Fit and interpret multilevel logistic regression models

65 / 65
