# Bayes Workshop Part 2

Author

Harm

Published

October 4, 2023

## Intro

The first part of the tutorial contrasted the basics of Frequentist statistics with Bayesian statistics. We talked about what the practical advantages of each method are and why Bayesian statistics might be a useful tool for Accounting researchers. In part two we will illustrate how to fit Bayesian models. First we will cover some basic intuition behind MCMC, then we will fit some models to examine the role of team training to reduce internal control deficiency.

## Conceptual understanding of fitting Bayesian models

### Posterior distributions are difficult

Let’s say you have some data $$y$$ and want to compute the posterior of some parameter $$\theta$$ after seeing $$y$$. This is difficult.

$P(\theta \mid y)=\frac{P(y\mid\theta)P(\theta)}{P(y)}$

As the bayesian updating formula illustrates, first we would need to multiply two probability density functions and then scaled it by another probability density function. The result is often analytically intractable. We cannot compute the resulting function in most cases. There are exceptions, combinations of likelihood and so-called conjugate priors, where the resulting form of the posterior is the same as the prior. Conjugate priors do not exist in every situation however, and they limit the range of possible priors to use. Hence, in modern statistics, we need to resort to sampling methods.

### The idea behind MCMC

Sampling methods are an amazing and powerful innovation. They allow us to sample from any unknown distribution. It works well in the situation where we have a distribution $$P(z) = \frac{1}{Z_P}p(z)$$, where it is easy to compute $$p(z)$$ but hard to compute $$Z_P$$. This is the case for $$P(\theta \mid y)=\frac{P(y\mid\theta)P(\theta)}{P(y)}$$. We can often compute $$P(y\mid\theta)P(\theta)$$ reasonably well. The problem is the scaler $$P(y)$$, which is very difficult to compute (As it is technically the integral of $$P(y\mid\theta)$$ over all possible values of $$\theta$$). However, if we set up the sampling method the right way, we can avoid having to compute $$P(y)$$.

The simplest sampling method, the Metropolis Markov Chain Monte Carlo Sampler is a good way of explaining the intuition behind most sampling methods. Compared to more modern samplers Metropolis MCMC is quite inefficient. The tools we are going to use later use a sampler called Hamiltonian MCMC. We won’t really cover it here because we don’t have time. However, if you are seriously considering using Bayesian methods, you should familiarize yourself a bit with it. For complicated models, you might need to tune the sampler a bit and for that you need to understand how it works. A good starting resource (in general) are the incredibly good teaching materials by Richard McElreath on YouTube.

Back to Metropolis Hastings MCMC. The following is adapted from wikipedia: Let $$f(\theta)$$ be a function that is proportional to the desired posterior probability density function $$P(\theta \mid y)$$. We will repeat the following steps for $$n$$ number of steps. And we call the series of $$n$$ steps a chain.

1. To begin: Choose an arbitrary point $$\theta_0$$ to be the first “sample”. Choose an arbitrary probability density $$g( \theta_t \mid \theta_{t-1} )$$. The requirement for this is a function that proposes new “candidates” for the next sample $$\theta_t$$, given the previous sample value. A common choice is a Gaussian distribution centered at $$\theta_{t-1}$$, so that points closer to $$\theta_{t-1}$$ are more likely to be visited next, making the sequence of samples into a random walk. We call this the proposal density function (jumping function)
2. For each iteration t (the $$n$$ steps):
• Generate a candidate $$\theta_t'$$ for the next sample using $$g( \theta_t \mid \theta_{t-1} )$$.
• Calculate the acceptance ratio $$\alpha =f(\theta_t')/f(\theta_{t-1})$$, which will be used to decide whether to accept or reject the candidate.
• Generate a uniform random number $$u \in [ 0 , 1 ]$$.
• If $$u\leq \alpha$$, then accept the candidate by setting $$\theta_t = \theta_t'$$, else reject the candidate and set $$\theta_t = \theta_{t-1}$$.

Computing the acceptance ratio is how we avoid having to compute $$P(Y)$$. Because $$f$$ is essentially P(y)P(). Which means:

$a = \frac{f(\theta_t')}{f(\theta_{t-1})} = \frac{P(y\mid\theta_t')P(\theta_t')}{P(y\mid\theta_{t-1})P(\theta_{t-1})}=\frac{P(\theta_t'\mid y)}{P(\theta_{t-1}\mid y)}$ This algorithm proceeds by randomly attempting to move about the sample space, sometimes accepting the moves and sometimes remaining in place. Note that the acceptance ratio $$\alpha$$ indicates how probable the new proposed sample is with respect to the current sample, according to the posterior distribution. If we attempt to move to a point that is more probable than the existing point (i.e. a point in a higher-density region of $$P(\theta\mid y)$$ corresponding to an $$\alpha >1\geq u}$$, we will always accept the move.

Because we start at a random spot in parameter space, it might take some time for the chain to “find” the posterior distribution (loosely speaking). Often we run not one of these chains, but many (default is to use 4) to see whether they have converged to traversing the posterior region. To visualize how this all looks, here is an animation borrowed from this great post by Maxwell B. Joseph

## The Case

For the rest of part two we will use the following made-up research case:

You want to examine whether team-level training of operating teams helps in reducing internal control deficiencies. Via a national research center, you managed to find firms willing to participate in a field experiment where you randomly assign and conduct team-level training with the help of an advisory firm. You do this for a few months and record the level of internal control deficiencies as reported by internal audits.

Our question of interest is thus whether team-training is effective at reducing internal control weakness. In addition, we also want to know additional characteristics of any potential team-training effect. Bayesian analysis is particularly suited for these type of questions and we’ll use this simple (and unfortunately unrealistic example) to illustrate how to perform Bayesian stats with R.

## Setup

library(ggplot2)
library(bayesplot)
library(brms)
box::use(
collapse[qsu]
)
options(brms.backend = "cmdstanr")
source("hhs-ggtheme.R")

# seed for R's pseudo-RNGs, not Stan's
set.seed(1123)

## The data

dta <- readRDS('data/contrdef.rds')
str(dta)
'data.frame':   120 obs. of  9 variables:
$firm_id : int 1 1 1 1 1 1 1 1 1 1 ...$ quarter   : int  1 2 3 4 5 6 7 8 9 10 ...
$firm_size : num 21189 21189 21189 21189 21189 ...$ firm_int  : num  -9.52 -9.52 -9.52 -9.52 -9.52 ...
$firm_train: num 0.0864 0.0864 0.0864 0.0864 0.0864 ...$ training  : num  6 2 5 3 4 3 0 0 2 8 ...
$noise : num 0.1363 0.0823 0.1308 0.1747 0.1745 ...$ eta       : num  0.962 0.617 0.876 0.703 0.789 ...
$contrdef : num 1 1 5 2 2 0 1 2 3 0 ... We have access to the following important fields: • firm_id: The unique firm identifier • quarter: The number of the quarter in which the training was conducted • firm_size: The amount of revenues in that quarter in T$
• training: How many trainings were conducted in that firm in that quarter
• contrdef: The count of internal control deficiencies found in a 90 day window after the end of the quarter

### Descriptives of the data

c(
"n firms" = length(unique(dta$firm_id)), "n quarters" = length(unique(dta$quarter))
)
   n firms n quarters
10         12 
qsu(dta)
              N     Mean          SD      Min      Max
firm_id     120      5.5      2.8843        1       10
quarter     120      6.5      3.4665        1       12
firm_size   120  87464.5  51260.4165    17902   166954
firm_int    120  -9.5598      0.0934  -9.7066  -9.3488
firm_train  120  -0.3374      0.3344   -0.984   0.1061
training    120    2.275      1.8469        0        8
noise       120   0.1058      0.0443  -0.0097   0.1889
eta         120   0.8819      1.2766  -3.5323   2.6158
contrdef    120    4.675      8.4894        0       49
ggplot(dta, aes(x = contrdef)) +
geom_bar()

## Modeling the data DGP

### Starting point is our dependent variable

We have count data. That means we can think of various choices for processes generating count data. We can go with a normal distribution which basically means a normal linear regression. However, normal distributions are really more suitable for continuous data. We have count data—integer numbers that are distinct. One candidate distribution we could use to model the data generation is a Poisson distribution. We need to be careful here though. Poisson random variables have a special property: their variance equals the mean. Real-world count data very frequently has a variance that is higher than the mean. We call this over-dispersed’’ data. There are some ways to model this and we will use them. For didactic purposes we’ll start with poisson and test for over-dispersion.

### The rest of the model

Given that we have chosen a Poisson regression as our main distribution, we define the likelihood as follows (For firm $$i = 1,\dots,10$$ at time (quarter) $$t = 1,\dots,12$$):

\begin{align*} \textrm{contrdef}_{i,t} & \sim \textrm{Poisson}(\lambda_{i,t}) \\ \lambda_{i,t} & = \exp{(\eta_{i,t})} \\ \eta_{i,t} &= b_0 + b_1 \, \textrm{training}_{i,t} \end{align*}

## Fitting a simple Bayesian model

For more complicated models we prefer to code the model directly in the stan language and use cmdstanr to fit it. The models in this tutorial can all be fit using simple formulas and using the awesome brms package.

fit_simple <- brm(
contrdef ~ 1 + training,
family = poisson,
data = dta,
prior = c(
prior("normal(0, 10)", class = "Intercept"),
prior("normal(0, 10)", class = "b", coef = "training")
),
chains = 4, cores = 4,
refresh = 0
)

### Fit summary

summary(fit_simple)
 Family: poisson
Formula: contrdef ~ 1 + training
Data: dta (Number of observations: 120)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000

Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     1.73      0.06     1.61     1.86 1.00     3403     2649
training     -0.09      0.02    -0.14    -0.04 1.00     2678     2389

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
prior_summary(fit_simple)
prior class coef group resp dpar nlpar lb ub source
b default
normal(0, 10) b training user
normal(0, 10) Intercept user

The summary statistics are based on the empirical histogram of the four markov chains we fit. Here is how they looked:

### Checking the sampling

mcmc_hist(fit_simple, pars = c("b_Intercept", "b_training"))

We should also have a quick look at the chains themselves:

bayesplot::mcmc_trace(fit_simple)

### Checking the fit

Above we talked about the issue of over-dispersion. Let’s take a look at our model “fit” versus the actual data

y_pred <- posterior_predict(fit_simple)
ppc_dens_overlay(y = dtacontrdef, yrep = y_pred[1:200,]) We are underpredicting zero counts, overpredicting middle range counts and maybe slightly underpredicting large counts. That looks like our poisson model with its property that the variance should equal the mean has issues fitting well. $$\lambda$$, controls both the expected counts and the variance of these counts. We can fix that. ## Modelling over-dispersion A common DGP model for fitting overdispersed data is the negative-binomial distribution. We also add a two control variables. \begin{align*} \textrm{contrdef}_{i,t} & \sim \textrm{Neg-Binomial}(\lambda_{i,t}, \phi) \\ \lambda_{i,t} & = \exp{(\eta_{i,t})} \\ \eta_{i,t} &= b_0 + b_1 \,\textrm{training}_{i,t} + \textrm{log\_firm\_size}_{i} \end{align*} To understand what is going on here, just note that the negative binomial pdf is parameterized in terms of its log-mean, $$\eta$$, and it has a precision, $$\phi$$, that affects it’s variance. The mean and variance of $$y = \textrm{contrdef}$$ is thus: $\mathbb{E}[y] \, = \lambda = \exp(\eta)$ $\text{Var}[y] = \lambda + \lambda^2/\phi = \exp(\eta) + \exp(\eta)^2 / \phi.$ As $$\phi$$ gets larger the term $$\lambda^2 / \phi$$ approaches zero and so the variance of the negative-binomial approaches $$\lambda$$, i.e., the negative-binomial gets closer and closer to the Poisson. We include $$\textrm{log\_firm\_size}_{i}$$ as an exposure term (note, it does not have a coefficient in front). Our previous poisson model’s mean parameter is a rate of deficiencies in the next quarter (90 days). However, in a way the “deficiency process” also plays out over a firm’s size (bigger, more complex firms have more opportunities for deficiencies to occur). We have revenues in T as a measure of firm size. If we multiply $$\lambda$$ by $$\textrm{firm\_size}_{i}$$, we can interpret our coefficients as shifting a rate of deficiencies per T$revenues per next 90 days. The last trick is to log firm size in order to put it into $$\eta$$. ## Fitting a negative binomial model fit_negbin <- brm( contrdef ~ 1 + training + offset(log(firm_size)), family = negbinomial, data = dta, prior = c( prior("normal(0, 10)", class = "Intercept"), prior("normal(0, 10)", class = "b", coef = "training") ), chains = 4, cores = 4, refresh = 0 ) ### Fit summary summary(fit_negbin)  Family: negbinomial Links: mu = log; shape = identity Formula: contrdef ~ 1 + training + offset(log(firm_size)) Data: dta (Number of observations: 120) Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; total post-warmup draws = 4000 Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS Intercept -9.77 0.22 -10.18 -9.33 1.00 3588 2547 training 0.00 0.07 -0.14 0.14 1.00 3840 2733 Family Specific Parameters: Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS shape 0.54 0.09 0.39 0.73 1.00 3370 2670 Draws were sampled using sample(hmc). For each parameter, Bulk_ESS and Tail_ESS are effective sample size measures, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1). The coefficients have changed. Note that partly this is also because we are now looking at the rate of deficiencies per 90 days per T$ in revenues

prior_summary(fit_negbin)
prior class coef group resp dpar nlpar lb ub source
b default
normal(0, 10) b training user
normal(0, 10) Intercept user
gamma(0.01, 0.01) shape 0 default

### Checking the sampling

mcmc_hist(fit_negbin, pars = c("b_Intercept", "b_training", "shape"))

We should also have a quick look at the chains themselves:

bayesplot::mcmc_trace(fit_negbin)

### Checking the model fit

Above we talked about the issue of over-dispersion. Let’s take a look at our model “fit” versus the actual data

y_pred2 <- posterior_predict(fit_negbin)

## Fitting a varying intercepts varying slopes model

fit_varslopes <- brm(
contrdef ~ 1 + training + offset(log(firm_size)) + (1 + training| firm_id),
family = negbinomial,
data = dta,
prior = c(
prior("normal(0, 10)", class = "Intercept"),
prior("normal(0, 10)", class = "b", coef = "training"),
prior("normal(0, 1)", class = "sd", coef = "Intercept", group = "firm_id"),
prior(lkj(2), class = "cor")
),
chains = 4, cores = 4,
refresh = 0
)

### Fit summary

summary(fit_varslopes)
Warning: There were 1 divergent transitions after warmup. Increasing
adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
 Family: negbinomial
Links: mu = log; shape = identity
Formula: contrdef ~ 1 + training + offset(log(firm_size)) + (1 + training | firm_id)
Data: dta (Number of observations: 120)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000

Group-Level Effects:
~firm_id (Number of levels: 10)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept)               0.26      0.20     0.01     0.74 1.00     2018
sd(training)                0.44      0.15     0.22     0.81 1.00      980
cor(Intercept,training)    -0.07      0.43    -0.83     0.77 1.00      576
Tail_ESS
sd(Intercept)               1599
sd(training)                2002
cor(Intercept,training)     1236

Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -9.69      0.20   -10.06    -9.28 1.00     3357     3136
training     -0.26      0.16    -0.61     0.04 1.00     1449     1623

Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     1.06      0.21     0.70     1.53 1.00     4304     2996

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
prior_summary(fit_varslopes)
prior class coef group resp dpar nlpar lb ub source
b default
normal(0, 10) b training user
normal(0, 10) Intercept user
lkj_corr_cholesky(2) L user
L firm_id default
student_t(3, 0, 2.5) sd 0 default
sd firm_id default
normal(0, 1) sd Intercept firm_id user
sd training firm_id default
gamma(0.01, 0.01) shape 0 default

The summary statistics are based on the empirical histogram of the four markov chains we fit. Here is how they looked:

### Visualizing varying slopes

parnames(fit_varslopes)
Warning: 'parnames' is deprecated. Please use 'variables' instead.
  "b_Intercept"                      "b_training"
 "sd_firm_id__Intercept"            "sd_firm_id__training"
 "cor_firm_id__Intercept__training" "shape"
 "r_firm_id[1,Intercept]"           "r_firm_id[2,Intercept]"
 "r_firm_id[3,Intercept]"           "r_firm_id[4,Intercept]"
 "r_firm_id[5,Intercept]"           "r_firm_id[6,Intercept]"
 "r_firm_id[7,Intercept]"           "r_firm_id[8,Intercept]"
 "r_firm_id[9,Intercept]"           "r_firm_id[10,Intercept]"
 "r_firm_id[1,training]"            "r_firm_id[2,training]"
 "r_firm_id[3,training]"            "r_firm_id[4,training]"
 "r_firm_id[5,training]"            "r_firm_id[6,training]"
 "r_firm_id[7,training]"            "r_firm_id[8,training]"
 "r_firm_id[9,training]"            "r_firm_id[10,training]"
 "lprior"                           "lp__"                            
mcmc_areas(fit_varslopes,
pars = paste0("r_firm_id[", 1:10,",training]"),
prob = 0.95)

### Checking the sampling

mcmc_hist(
fit_varslopes,
pars = c("b_Intercept", "b_training", "shape", "sd_firm_id__Intercept",
"cor_firm_id__Intercept__training", "sd_firm_id__training")
)

We should also have a quick look at the chains themselves:

bayesplot::mcmc_trace(
fit_varslopes,
pars = c("b_Intercept", "b_training", "shape", "sd_firm_id__Intercept",
"cor_firm_id__Intercept__training", "sd_firm_id__training")
)

### Checking the fit

Above we talked about the issue of over-dispersion. Let’s take a look at our model “fit” versus the actual data

y_pred <- posterior_predict(fit_varslopes)
ppc_dens_overlay(y = dta\$contrdef, yrep = y_pred[1:200,])

## Model comparison

fit_simple <- add_criterion(fit_simple, "loo")
Warning: Found 2 observations with a pareto_k > 0.7 in model 'fit_simple'. It
is recommended to set 'moment_match = TRUE' in order to perform moment matching
for problematic observations.
fit_negbin <- add_criterion(fit_negbin, "loo")
Warning: Found 1 observations with a pareto_k > 0.7 in model 'fit_negbin'. It
is recommended to set 'moment_match = TRUE' in order to perform moment matching
for problematic observations.
fit_varint <- add_criterion(fit_varint, "loo")
Warning: Found 1 observations with a pareto_k > 0.7 in model 'fit_varint'. It
is recommended to set 'moment_match = TRUE' in order to perform moment matching
for problematic observations.
fit_varslopes <- add_criterion(fit_varslopes, "loo")
Warning: Found 1 observations with a pareto_k > 0.7 in model 'fit_varslopes'.
It is recommended to set 'moment_match = TRUE' in order to perform moment
matching for problematic observations.
loo_compare(fit_simple, fit_negbin, fit_varint, fit_varslopes)
              elpd_diff se_diff
fit_varslopes    0.0       0.0
fit_varint      -9.6       5.1
fit_negbin     -24.8       7.9
fit_simple    -431.3     118.3