```
library(ggplot2)
library(bayesplot)
library(brms)
box::use(
collapse[qsu]
)
options(brms.backend = "cmdstanr")
source("hhs-ggtheme.R")
# seed for R's pseudo-RNGs, not Stan's
set.seed(1123)
```

The first part of the tutorial contrasted the basics of Frequentist statistics with Bayesian statistics. We talked about the practical advantages of each method and why Bayesian statistics might be a useful tool for accounting researchers. In part two, we will illustrate how to fit Bayesian models. First we will cover some basic intuition behind MCMC, then we will fit some models to examine the role of team training in reducing internal control deficiencies.

Let’s say you have some data y and want to compute the posterior p(θ | y) of some parameter θ after seeing y:

p(θ | y) = p(y | θ) p(θ) / p(y)

This is difficult.

As the Bayesian updating formula illustrates, we would first need to multiply two probability density functions (the likelihood p(y | θ) and the prior p(θ)) and then scale the result by another one (the marginal likelihood p(y)). The result is often analytically intractable; we cannot compute the resulting function in most cases. There are exceptions: combinations of a likelihood and a so-called conjugate prior, where the resulting posterior takes the same functional form as the prior. Conjugate priors do not exist for every situation, however, and they limit the range of possible priors to use. Hence, in modern statistics, we need to resort to sampling methods.

Sampling methods are an amazing and powerful innovation. They allow us to draw samples from a distribution we cannot compute in full. They work well in situations where we have a distribution whose unnormalized part is easy to compute but whose normalizing constant is hard to compute. This is the case for the posterior p(θ | y): we can often compute the product p(y | θ) p(θ) reasonably well. The problem is the scalar p(y), which is very difficult to compute (as it is technically the integral of p(y | θ) p(θ) over all possible values of θ). However, if we set up the sampling method the right way, we can avoid having to compute p(y).
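To make the cancellation concrete, here is a small hypothetical beta-binomial example (7 successes in 10 trials with a Beta(2, 2) prior), chosen because its posterior happens to be known in closed form, so we can verify that ratios of the unnormalized product match ratios of the true posterior:

```r
# Unnormalized posterior: likelihood times prior, missing the constant p(y)
f <- function(theta) dbinom(7, 10, theta) * dbeta(theta, 2, 2)

# Conjugacy gives the exact posterior here: Beta(2 + 7, 2 + 3) = Beta(9, 5)
posterior <- function(theta) dbeta(theta, 9, 5)

theta1 <- 0.6
theta2 <- 0.3
f(theta1) / f(theta2)                  # ratios agree, since p(y) cancels
posterior(theta1) / posterior(theta2)
```

Any method that only ever needs ratios of posterior densities can therefore work with f alone.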

The simplest sampling method, the Metropolis Markov chain Monte Carlo (MCMC) sampler, is a good way of explaining the intuition behind most sampling methods. Compared to more modern samplers, Metropolis MCMC is quite inefficient. The tools we are going to use later rely on a sampler called Hamiltonian Monte Carlo (HMC). We won’t really cover it here because we don’t have time. However, if you are seriously considering using Bayesian methods, you should familiarize yourself with it a bit. For complicated models, you might need to tune the sampler, and for that you need to understand how it works. A good starting resource (in general) is the incredibly good set of teaching materials by Richard McElreath on YouTube.

Back to Metropolis-Hastings MCMC. The following is adapted from Wikipedia: let f(θ) be a function that is proportional to the desired posterior probability density function p(θ | y). We will repeat the following steps a large number of times, and we call the resulting series of steps a *chain*.

- To begin: Choose an arbitrary point θ₀ to be the first “sample”. Choose an arbitrary probability density g(θ′ | θₜ) that proposes new “candidates” θ′ for the next sample, given the previous sample value θₜ. A common choice is a Gaussian distribution centered at θₜ, so that points closer to θₜ are more likely to be visited next, making the sequence of samples a random walk. We call this the proposal density (or jumping) function.
- For each iteration t (the steps):

- Generate a candidate θ′ for the next sample using g(θ′ | θₜ).
- Calculate the acceptance ratio α = f(θ′) / f(θₜ), which will be used to decide whether to accept or reject the candidate.
- Generate a uniform random number u ∈ [0, 1].
- If u ≤ α, accept the candidate by setting θₜ₊₁ = θ′; else reject the candidate and set θₜ₊₁ = θₜ.

Computing the acceptance ratio α is how we avoid having to compute p(y). Because f(θ) is essentially p(y | θ) p(θ), the unknown normalizing constant cancels:

α = f(θ′) / f(θₜ) = [p(y | θ′) p(θ′) / p(y)] / [p(y | θₜ) p(θₜ) / p(y)] = [p(y | θ′) p(θ′)] / [p(y | θₜ) p(θₜ)]

This algorithm proceeds by randomly attempting to move about the sample space, sometimes accepting the moves and sometimes remaining in place. Note that the acceptance ratio indicates how probable the newly proposed sample is relative to the current sample, according to the posterior distribution. If we attempt to move to a point that is more probable than the existing point (i.e., a point in a higher-density region of p(θ | y), corresponding to an α > 1), we will always accept the move.
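The steps above can be sketched in a few lines of R. As a stand-in for an unnormalized likelihood-times-prior product, the target f here is an unnormalized standard normal, so the result is easy to check against known moments; everything in this toy example is for illustration only:

```r
f <- function(theta) exp(-theta^2 / 2)  # unnormalized N(0, 1) density

metropolis <- function(f, theta0, n_steps, prop_sd = 1) {
  draws <- numeric(n_steps)
  theta <- theta0
  for (t in seq_len(n_steps)) {
    candidate <- rnorm(1, mean = theta, sd = prop_sd)  # proposal g(. | theta_t)
    alpha <- f(candidate) / f(theta)                   # acceptance ratio
    if (runif(1) <= alpha) theta <- candidate          # accept, or stay put
    draws[t] <- theta
  }
  draws
}

set.seed(1)
chain <- metropolis(f, theta0 = 5, n_steps = 20000)
# After discarding warmup draws, the chain should have mean ~0 and sd ~1
mean(chain[-(1:1000)])
sd(chain[-(1:1000)])
```

Note that the deliberately bad starting point (θ₀ = 5) is why we discard the first stretch of draws, mirroring the warmup phase used by Stan.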

Because we start at a random spot in parameter space, it might take some time for the chain to “find” the posterior distribution (loosely speaking). Often we run not one of these chains but several (the default in brms is four) to check whether they have all converged to traversing the same posterior region. To visualize how this all looks, here is an animation borrowed from this great post by Maxwell B. Joseph.

For the rest of part two we will use the following made-up research case:

You want to examine whether team-level training of operating teams helps in reducing internal control deficiencies. Via a national research center, you managed to find firms willing to participate in a field experiment where you randomly assign and conduct team-level training with the help of an advisory firm. You do this for a few months and record the level of internal control deficiencies as reported by internal audits.

Our question of interest is thus whether team training is effective at reducing internal control weaknesses. In addition, we also want to learn more about the characteristics of any potential team-training effect. Bayesian analysis is particularly suited for these types of questions, and we’ll use this simple (and unfortunately unrealistic) example to illustrate how to perform Bayesian statistics with R.


```
dta <- readRDS('data/contrdef.rds')
str(dta)
```

```
'data.frame': 120 obs. of 9 variables:
$ firm_id : int 1 1 1 1 1 1 1 1 1 1 ...
$ quarter : int 1 2 3 4 5 6 7 8 9 10 ...
$ firm_size : num 21189 21189 21189 21189 21189 ...
$ firm_int : num -9.52 -9.52 -9.52 -9.52 -9.52 ...
$ firm_train: num 0.0864 0.0864 0.0864 0.0864 0.0864 ...
$ training : num 6 2 5 3 4 3 0 0 2 8 ...
$ noise : num 0.1363 0.0823 0.1308 0.1747 0.1745 ...
$ eta : num 0.962 0.617 0.876 0.703 0.789 ...
$ contrdef : num 1 1 5 2 2 0 1 2 3 0 ...
```

We have access to the following important fields:

`firm_id`

: The unique firm identifier

`quarter`

: The number of the quarter in which the training was conducted

`firm_size`

: The amount of revenues in that quarter in T$

`training`

: How many trainings were conducted in that firm in that quarter

`contrdef`

: The count of internal control deficiencies found in a 90-day window after the end of the quarter

```
c(
  "n firms" = length(unique(dta$firm_id)),
  "n quarters" = length(unique(dta$quarter))
)
```

```
n firms n quarters
10 12
```

`qsu(dta)`

```
N Mean SD Min Max
firm_id 120 5.5 2.8843 1 10
quarter 120 6.5 3.4665 1 12
firm_size 120 87464.5 51260.4165 17902 166954
firm_int 120 -9.5598 0.0934 -9.7066 -9.3488
firm_train 120 -0.3374 0.3344 -0.984 0.1061
training 120 2.275 1.8469 0 8
noise 120 0.1058 0.0443 -0.0097 0.1889
eta 120 0.8819 1.2766 -3.5323 2.6158
contrdef 120 4.675 8.4894 0 49
```

```
ggplot(dta, aes(x = contrdef)) +
  geom_bar()
```

We have count data. That means we can think about various processes that generate count data. We could go with a normal distribution, which basically means a normal linear regression. However, normal distributions are really more suitable for continuous data, and we have counts: distinct integer numbers. One candidate distribution we could use to model the data generation is the Poisson distribution. We need to be careful here, though. Poisson random variables have a special property: their variance equals their mean. Real-world count data very frequently has a variance that is higher than the mean; we call this “over-dispersed” data. There are ways to model this and we will use them. For didactic purposes, we’ll start with a Poisson model and test for over-dispersion.
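To see what over-dispersion looks like, here is a quick, made-up simulation (numbers are illustrative, not from our data): Poisson draws have a variance roughly equal to their mean, while negative binomial draws with a small shape parameter have a much larger variance.

```r
set.seed(42)
pois_draws <- rpois(1e5, lambda = 4)
# size is the shape/precision parameter; small values mean heavy over-dispersion
nb_draws <- rnbinom(1e5, mu = 4, size = 0.5)

c(mean = mean(pois_draws), var = var(pois_draws))  # both close to 4
c(mean = mean(nb_draws), var = var(nb_draws))      # variance far above the mean
```

A gap like the one in the second line is the telltale sign that a plain Poisson model will struggle.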

Given that we have chosen a Poisson regression as our main distribution, we define the likelihood as follows (for firm i at time (quarter) t):

contrdef_it ~ Poisson(λ_it)
log(λ_it) = β0 + β1 training_it

For more complicated models, we would prefer to code the model directly in the Stan language and fit it with cmdstanr. The models in this tutorial, however, can all be fit with simple formulas using the awesome brms package.

```
fit_simple <- brm(
  contrdef ~ 1 + training,
  family = poisson,
  data = dta,
  prior = c(
    prior("normal(0, 10)", class = "Intercept"),
    prior("normal(0, 10)", class = "b", coef = "training")
  ),
  chains = 4, cores = 4,
  refresh = 0
)
```

`summary(fit_simple)`

```
Family: poisson
Links: mu = log
Formula: contrdef ~ 1 + training
Data: dta (Number of observations: 120)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 1.73 0.06 1.61 1.86 1.00 3403 2649
training -0.09 0.02 -0.14 -0.04 1.00 2678 2389
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

`prior_summary(fit_simple)`

| prior | class | coef | group | resp | dpar | nlpar | lb | ub | source |
|---|---|---|---|---|---|---|---|---|---|
| | b | | | | | | | | default |
| normal(0, 10) | b | training | | | | | | | user |
| normal(0, 10) | Intercept | | | | | | | | user |

The summary statistics are based on the empirical histograms of the four Markov chains we fit. Here is how they looked:

`mcmc_hist(fit_simple, pars = c("b_Intercept", "b_training"))`

We should also have a quick look at the chains themselves:

`bayesplot::mcmc_trace(fit_simple)`

Above we talked about the issue of over-dispersion. Let’s take a look at our model “fit” versus the actual data:

```
y_pred <- posterior_predict(fit_simple)
ppc_dens_overlay(y = dta$contrdef, yrep = y_pred[1:200,])
```

We are underpredicting zero counts, overpredicting middle-range counts, and maybe slightly underpredicting large counts. It looks like our Poisson model, with its property that the variance must equal the mean, has issues fitting well: the single parameter λ controls both the expected counts and the variance of those counts. We can fix that.

A common DGP model for fitting over-dispersed data is the negative binomial distribution. We also add firm size as an exposure term.

To understand what is going on here, just note that the negative binomial is parameterized in terms of its log-mean, log(μ), and a precision (shape) parameter, φ, that affects its variance. The mean and variance of an observation y are thus:

E[y] = μ
Var[y] = μ + μ²/φ

As φ gets larger, the term μ²/φ approaches zero, and so the variance of the negative binomial approaches μ, i.e., the negative binomial gets closer and closer to the Poisson.
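A quick numeric check of this limit (μ = 5 is an arbitrary choice):

```r
mu <- 5
phi <- c(0.5, 2, 10, 100, 1e4)
variance <- mu + mu^2 / phi  # negative binomial variance formula
data.frame(phi, variance)    # falls toward the Poisson variance mu = 5
```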

We include log(firm_size) as an *exposure term* (note, it does not have a coefficient in front). Our previous Poisson model’s mean parameter is a rate of deficiencies over the next quarter (90 days). However, in a way the “deficiency process” also plays out over a firm’s size (bigger, more complex firms have more opportunities for deficiencies to occur). We have revenues in T$ as a measure of firm size. If we multiply the rate by firm_size, we can interpret our coefficients as shifting a rate of deficiencies per T$ of revenues per next 90 days. The last trick is to log firm_size in order to put it on the scale of the log link.
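A one-liner shows why the offset acts as an exposure: adding log(firm_size) inside the log link is the same as multiplying the per-T$ rate by firm_size outside it. The coefficient values below are made up purely for illustration:

```r
beta0 <- -9.5; beta1 <- -0.1  # hypothetical coefficient values
training <- 3
firm_size <- 21189

rate_per_td <- exp(beta0 + beta1 * training)  # deficiencies per T$ per 90 days
expected_count <- exp(beta0 + beta1 * training + log(firm_size))
all.equal(expected_count, firm_size * rate_per_td)  # TRUE
```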

```
fit_negbin <- brm(
  contrdef ~ 1 + training + offset(log(firm_size)),
  family = negbinomial,
  data = dta,
  prior = c(
    prior("normal(0, 10)", class = "Intercept"),
    prior("normal(0, 10)", class = "b", coef = "training")
  ),
  chains = 4, cores = 4,
  refresh = 0
)
```

`summary(fit_negbin)`

```
Family: negbinomial
Links: mu = log; shape = identity
Formula: contrdef ~ 1 + training + offset(log(firm_size))
Data: dta (Number of observations: 120)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -9.77 0.22 -10.18 -9.33 1.00 3588 2547
training 0.00 0.07 -0.14 0.14 1.00 3840 2733
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape 0.54 0.09 0.39 0.73 1.00 3370 2670
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

The coefficients have changed. Note that this is partly because we are now looking at the rate of deficiencies per 90 days per T$ of revenues.

`prior_summary(fit_negbin)`

| prior | class | coef | group | resp | dpar | nlpar | lb | ub | source |
|---|---|---|---|---|---|---|---|---|---|
| | b | | | | | | | | default |
| normal(0, 10) | b | training | | | | | | | user |
| normal(0, 10) | Intercept | | | | | | | | user |
| gamma(0.01, 0.01) | shape | | | | | | 0 | | default |

`mcmc_hist(fit_negbin, pars = c("b_Intercept", "b_training", "shape"))`

We should also have a quick look at the chains themselves:

`bayesplot::mcmc_trace(fit_negbin)`

Above we talked about the issue of over-dispersion. Let’s take a look at our model “fit” versus the actual data:

```
y_pred2 <- posterior_predict(fit_negbin)
ppc_dens_overlay(y = dta$contrdef, yrep = y_pred2[1:200,])
```

This looks like a much better fit.

We are not necessarily done here. We purposefully framed the question as a field experiment so that we do not have to worry about confounding issues; training is essentially randomized. We can still try to get a more precise estimate, and we can also see whether there is meaningful variation in the effectiveness of training across firms. Especially the last question is something that multilevel models employing priors can really help with.

We will first build a model that only includes firm-specific intercepts. Afterwards, we’ll add firm-specific slopes for training as well.

```
fit_varint <- brm(
  contrdef ~ 1 + training + offset(log(firm_size)) + (1 | firm_id),
  family = negbinomial,
  data = dta,
  prior = c(
    prior("normal(0, 10)", class = "Intercept"),
    prior("normal(0, 10)", class = "b", coef = "training"),
    prior("normal(0, 1)", class = "sd", coef = "Intercept", group = "firm_id")
  ),
  chains = 4, cores = 4,
  refresh = 0
)
```

`summary(fit_varint)`

```
Family: negbinomial
Links: mu = log; shape = identity
Formula: contrdef ~ 1 + training + offset(log(firm_size)) + (1 | firm_id)
Data: dta (Number of observations: 120)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Group-Level Effects:
~firm_id (Number of levels: 10)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.90 0.26 0.50 1.51 1.00 1086 1870
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -9.60 0.37 -10.33 -8.88 1.00 1379 1866
training -0.20 0.08 -0.34 -0.05 1.00 3333 3113
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape 0.81 0.15 0.56 1.14 1.00 3631 2712
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

There is some variation explained by varying intercepts. But it does not seem like it’s doing much in terms of improving the model fit.

`prior_summary(fit_varint)`

| prior | class | coef | group | resp | dpar | nlpar | lb | ub | source |
|---|---|---|---|---|---|---|---|---|---|
| | b | | | | | | | | default |
| normal(0, 10) | b | training | | | | | | | user |
| normal(0, 10) | Intercept | | | | | | | | user |
| student_t(3, 0, 2.5) | sd | | | | | | 0 | | default |
| | sd | | firm_id | | | | | | default |
| normal(0, 1) | sd | Intercept | firm_id | | | | | | user |
| gamma(0.01, 0.01) | shape | | | | | | 0 | | default |

The summary statistics are based on the empirical histograms of the four Markov chains we fit. Here is how they looked:

`parnames(fit_varint)`

`Warning: 'parnames' is deprecated. Please use 'variables' instead.`

```
[1] "b_Intercept" "b_training"
[3] "sd_firm_id__Intercept" "shape"
[5] "r_firm_id[1,Intercept]" "r_firm_id[2,Intercept]"
[7] "r_firm_id[3,Intercept]" "r_firm_id[4,Intercept]"
[9] "r_firm_id[5,Intercept]" "r_firm_id[6,Intercept]"
[11] "r_firm_id[7,Intercept]" "r_firm_id[8,Intercept]"
[13] "r_firm_id[9,Intercept]" "r_firm_id[10,Intercept]"
[15] "lprior" "lp__"
```

`mcmc_hist(fit_varint)`

We should also have a quick look at the chains themselves:

```
bayesplot::mcmc_trace(
  fit_varint,
  pars = c("b_Intercept", "b_training", "shape", "sd_firm_id__Intercept")
)
```

Above we talked about the issue of over-dispersion. Let’s take a look at our model “fit” versus the actual data:

```
y_pred <- posterior_predict(fit_varint)
ppc_dens_overlay(y = dta$contrdef, yrep = y_pred[1:200,])
```

```
fit_varslopes <- brm(
  contrdef ~ 1 + training + offset(log(firm_size)) + (1 + training | firm_id),
  family = negbinomial,
  data = dta,
  prior = c(
    prior("normal(0, 10)", class = "Intercept"),
    prior("normal(0, 10)", class = "b", coef = "training"),
    prior("normal(0, 1)", class = "sd", coef = "Intercept", group = "firm_id"),
    prior(lkj(2), class = "cor")
  ),
  chains = 4, cores = 4,
  refresh = 0
)
```

`summary(fit_varslopes)`

```
Warning: There were 1 divergent transitions after warmup. Increasing
adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
```

```
Family: negbinomial
Links: mu = log; shape = identity
Formula: contrdef ~ 1 + training + offset(log(firm_size)) + (1 + training | firm_id)
Data: dta (Number of observations: 120)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Group-Level Effects:
~firm_id (Number of levels: 10)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept) 0.26 0.20 0.01 0.74 1.00 2018
sd(training) 0.44 0.15 0.22 0.81 1.00 980
cor(Intercept,training) -0.07 0.43 -0.83 0.77 1.00 576
Tail_ESS
sd(Intercept) 1599
sd(training) 2002
cor(Intercept,training) 1236
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -9.69 0.20 -10.06 -9.28 1.00 3357 3136
training -0.26 0.16 -0.61 0.04 1.00 1449 1623
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape 1.06 0.21 0.70 1.53 1.00 4304 2996
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

`prior_summary(fit_varslopes)`

| prior | class | coef | group | resp | dpar | nlpar | lb | ub | source |
|---|---|---|---|---|---|---|---|---|---|
| | b | | | | | | | | default |
| normal(0, 10) | b | training | | | | | | | user |
| normal(0, 10) | Intercept | | | | | | | | user |
| lkj_corr_cholesky(2) | L | | | | | | | | user |
| | L | | firm_id | | | | | | default |
| student_t(3, 0, 2.5) | sd | | | | | | 0 | | default |
| | sd | | firm_id | | | | | | default |
| normal(0, 1) | sd | Intercept | firm_id | | | | | | user |
| | sd | training | firm_id | | | | | | default |
| gamma(0.01, 0.01) | shape | | | | | | 0 | | default |

The summary statistics are based on the empirical histograms of the four Markov chains we fit. Here is how they looked:

`parnames(fit_varslopes)`

`Warning: 'parnames' is deprecated. Please use 'variables' instead.`

```
[1] "b_Intercept" "b_training"
[3] "sd_firm_id__Intercept" "sd_firm_id__training"
[5] "cor_firm_id__Intercept__training" "shape"
[7] "r_firm_id[1,Intercept]" "r_firm_id[2,Intercept]"
[9] "r_firm_id[3,Intercept]" "r_firm_id[4,Intercept]"
[11] "r_firm_id[5,Intercept]" "r_firm_id[6,Intercept]"
[13] "r_firm_id[7,Intercept]" "r_firm_id[8,Intercept]"
[15] "r_firm_id[9,Intercept]" "r_firm_id[10,Intercept]"
[17] "r_firm_id[1,training]" "r_firm_id[2,training]"
[19] "r_firm_id[3,training]" "r_firm_id[4,training]"
[21] "r_firm_id[5,training]" "r_firm_id[6,training]"
[23] "r_firm_id[7,training]" "r_firm_id[8,training]"
[25] "r_firm_id[9,training]" "r_firm_id[10,training]"
[27] "lprior" "lp__"
```

```
mcmc_areas(
  fit_varslopes,
  pars = paste0("r_firm_id[", 1:10, ",training]"),
  prob = 0.95
)
```

```
mcmc_hist(
  fit_varslopes,
  pars = c("b_Intercept", "b_training", "shape", "sd_firm_id__Intercept",
           "cor_firm_id__Intercept__training", "sd_firm_id__training")
)
```

We should also have a quick look at the chains themselves:

```
bayesplot::mcmc_trace(
  fit_varslopes,
  pars = c("b_Intercept", "b_training", "shape", "sd_firm_id__Intercept",
           "cor_firm_id__Intercept__training", "sd_firm_id__training")
)
```

```
y_pred <- posterior_predict(fit_varslopes)
ppc_dens_overlay(y = dta$contrdef, yrep = y_pred[1:200,])
```

`fit_simple <- add_criterion(fit_simple, "loo")`

```
Warning: Found 2 observations with a pareto_k > 0.7 in model 'fit_simple'. It
is recommended to set 'moment_match = TRUE' in order to perform moment matching
for problematic observations.
```

`fit_negbin <- add_criterion(fit_negbin, "loo")`

```
Warning: Found 1 observations with a pareto_k > 0.7 in model 'fit_negbin'. It
is recommended to set 'moment_match = TRUE' in order to perform moment matching
for problematic observations.
```

`fit_varint <- add_criterion(fit_varint, "loo")`

```
Warning: Found 1 observations with a pareto_k > 0.7 in model 'fit_varint'. It
is recommended to set 'moment_match = TRUE' in order to perform moment matching
for problematic observations.
```

`fit_varslopes <- add_criterion(fit_varslopes, "loo")`

```
Warning: Found 1 observations with a pareto_k > 0.7 in model 'fit_varslopes'.
It is recommended to set 'moment_match = TRUE' in order to perform moment
matching for problematic observations.
```

`loo_compare(fit_simple, fit_negbin, fit_varint, fit_varslopes)`

```
elpd_diff se_diff
fit_varslopes 0.0 0.0
fit_varint -9.6 5.1
fit_negbin -24.8 7.9
fit_simple -431.3 118.3
```

You can find the sheet in PDF format in this repo.

It’s the first draft of a cheat sheet (but not a full cheat sheet) for the ratio analysis part of the lecture. It contains the basic structure and leaves room for notes and comments.

Our idea is that you (our students) can fill the rest of the cheat sheet as you want. That’s the main point. It is not a ready-made cheat sheet, but it contains crucial logic that hopefully helps you make a good cheat sheet for yourself. And obviously you should not use it to cheat on an exam. Instead, making cheat sheets is useful because it’s a great way to distill and learn the material.

We hope to improve it over time based on feedback. But whether that’s gonna happen will obviously depend on how well it is received by the 2022 FSA cohort. I’m quite curious whether this will work.

Consider a model with unobserved heterogeneity a_i:

y_it = a_i + a1 x_it + e_it

a_i is meant to denote unobservable subject-specific heterogeneity that is confounding the relation of interest between x_it and y_it (because cor(x_it, a_i) ≠ 0). Fixed effects models, as commonly used in econometrics, remove between-subject variation and estimate the coefficient of interest (a1) using only within-variation. In the remainder of this short note, I assume that readers are familiar with these types of fixed effects models.
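As a reminder of what fixed effects models do, the within transformation can be done by hand on a few lines of simulated data (all numbers hypothetical): demeaning y and x by group wipes out any group-level confounder.

```r
set.seed(7)
g <- rep(1:50, each = 6)
alpha <- rnorm(50, 0, 3)[g]      # group-level confounder
x <- rnorm(300) + alpha          # x is correlated with alpha
y <- 2 * x + alpha + rnorm(300)  # true slope on x is 2

within <- function(v) v - ave(v, g)  # demean by group
coef(lm(within(y) ~ 0 + within(x)))  # close to 2 despite the confound
```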

Bayesian multi-level models can also account for a_i, but one needs to consider that Bayesian multi-level models partially pool. Meaning, if one does not explicitly tell the model that there is a correlation between a_i and x_it, then it will partially pool the confounded between-variation and the within-variation. Incorporating this correlation is easy to do, however. This note shows how.

```
library(cmdstanr) # for fitting Bayesian models
library(fixest) # Fitting fixed effects models
```

For illustration purposes, I simulate data from the following data generating process (writing Normal(mean, sd)):

a_i ~ Normal(2, 4)
x_it = 1.4 a_i + Normal(0, 4)
y_it = a_i + 1 x_it + Normal(0, 4)

So, the true value of a1 will be one in this simulation.

```
set.seed(753)
n_firms <- 100
periods <- 7
firm_id <- rep(1:n_firms, each = periods)
firm_mean <- rnorm(n_firms, 2, 4)
d <- data.frame(
  firm_id = firm_id,
  period = rep(1:periods, times = n_firms),
  firm_mean = rep(firm_mean, each = periods)
)
# x is a function of the unobserved firm means
d$x <- rnorm(n_firms * periods, 0, 4) + 1.4 * d$firm_mean
# true coefficient of interest on x is 1
d$y <- d$firm_mean + 1 * d$x + rnorm(n_firms * periods, 0, 4)
# data container for the Bayesian models
input_data <- list(
  N = nrow(d),
  J = max(d$firm_id),
  GroupID = d$firm_id,
  y = d$y,
  x = d$x,
  xm = aggregate(x ~ firm_id, data = d, FUN = mean)$x
)
```

Unsurprisingly, normal OLS will yield biased estimates because cor(x_it, a_i) ≠ 0:

`summary(lm(y ~ x, data = d))$coefficients`

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6940142 0.19296977 3.596492 3.452875e-04
x 1.4855348 0.02437638 60.941562 1.120002e-281
```

For comparison purposes with the Bayesian models, here are the estimates of standard fixed effects models:

Using the `fixest` package:

```
olsfe <- feols(y ~ x | firm_id, data = d)
summary(olsfe)
```

```
OLS estimation, Dep. Var.: y
Observations: 700
Fixed-effects: firm_id: 100
Standard-errors: Clustered (firm_id)
Estimate Std. Error t value Pr(>|t|)
x 1.02745 0.044027 23.3367 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 3.75528 Adj. R2: 0.878907
Within R2: 0.534057
```

The above fixed effects model yields an estimate of 1.027, close to the true value of 1. Just as expected.

In contrast to fixed effects models, which throw away between-subject variation, Bayesian multi-level models model a_i directly. In addition, these models have the advantage of applying adaptive shrinkage via partial pooling. This results in more stable and robust estimates of the a_i.

The model below does partial pooling by assuming that the a_i follow a common distribution.

But it does **not** consider that a_i and x_it might be correlated. As a result, the coefficient on x will be a partially pooled average of the between- and within-subject variation:

```
baymod_nocorr <- "
data {
  int<lower=1> N;                   // num obs
  int<lower=1> J;                   // num groups
  int<lower=1, upper=J> GroupID[N]; // group id for obs, e.g. FirmID or Industry-YearID
  vector[N] y;                      // response
  vector[N] x;                      // predictor
}
parameters {
  vector[J] z;         // standard normal sampler
  real mu_a;           // hyperprior mean of intercepts
  real<lower=0> sig_u; // error-term scale
  real<lower=0> sig_a; // intercept scale
  real a1;
}
transformed parameters {
  vector[J] ai; // intercept vector
  ai = mu_a + z * sig_a;
}
model {
  z ~ normal(0, 1);
  mu_a ~ normal(0, 10);
  sig_u ~ exponential(1.0 / 8);
  sig_a ~ exponential(1.0 / 4);
  a1 ~ normal(0, 10);
  y ~ normal(ai[GroupID] + a1 * x, sig_u);
}
"
stanmod_nocorr <- cmdstan_model(write_stan_file(baymod_nocorr))
bayfit_nocorr <- stanmod_nocorr$sample(
  data = input_data,
  iter_sampling = 1000,
  iter_warmup = 1000,
  chains = 4,
  parallel_chains = 4,
  seed = 1234,
  refresh = 0
)
```

```
Running MCMC with 4 parallel chains...
Chain 1 finished in 0.7 seconds.
Chain 2 finished in 0.7 seconds.
Chain 3 finished in 0.7 seconds.
Chain 4 finished in 0.8 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.7 seconds.
Total execution time: 0.9 seconds.
```

`bayfit_nocorr$summary(variables = c("a1", "mu_a", "sig_a", "sig_u"))`

```
# A tibble: 4 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1 a1 1.32 1.32 0.0499 0.0500 1.24 1.40 1.00 765. 1330.
2 mu_a 1.24 1.23 0.319 0.315 0.730 1.77 1.00 1243. 2228.
3 sig_a 2.25 2.26 0.399 0.403 1.60 2.90 1.00 658. 902.
4 sig_u 4.25 4.25 0.137 0.138 4.04 4.48 1.00 1528. 2219.
```

Here is a visual presentation of the posterior of `a1`:

`bayesplot::mcmc_areas(bayfit_nocorr$draws("a1")) + ggplot2::xlim(0, 2)`

This model gets rid of the problem simply by explicitly modelling the correlation between a_i and the between-firm variation in x (the group means xm) as:

ai = mu_a + b * xm + z * sig_a

`b` will be the estimate of the between association, while `a1` estimates the within effect. We still get shrinkage of the a_i around the between variation.
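This group-mean device has a well-known frequentist cousin (the Mundlak regression): adding the group mean of x as a control separates the within effect from the between association. A quick sanity check on freshly simulated data with the same structure as the DGP above (variable names are mine):

```r
set.seed(99)
g <- rep(1:100, each = 7)
alpha <- rnorm(100, 2, 4)[g]         # unobserved group heterogeneity
x <- rnorm(700, 0, 4) + 1.4 * alpha  # x correlated with alpha
y <- alpha + 1 * x + rnorm(700, 0, 4)
xm <- ave(x, g)                      # group means of x

coef(lm(y ~ x))["x"]       # biased upward by the confound
coef(lm(y ~ x + xm))["x"]  # close to the true within effect of 1
```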

```
baymod_corr <- "
data {
  int<lower=1> N;                   // num obs
  int<lower=1> J;                   // num groups
  int<lower=1, upper=J> GroupID[N]; // group id for obs, e.g. FirmID or Industry-YearID
  vector[N] y;                      // response
  vector[N] x;                      // predictor
  vector[J] xm;                     // mean of x per group
}
parameters {
  vector[J] z;         // standard normal sampler
  real mu_a;           // hyperprior mean of intercepts
  real<lower=0> sig_u; // error-term scale
  real<lower=0> sig_a; // intercept scale
  real a1;
  real b;
}
transformed parameters {
  vector[J] ai; // intercept vector
  ai = mu_a + b * xm + z * sig_a;
}
model {
  z ~ normal(0, 1);
  mu_a ~ normal(0, 10);
  sig_u ~ exponential(1.0 / 8);
  sig_a ~ exponential(1.0 / 8);
  a1 ~ normal(0, 10);
  b ~ normal(0, 10);
  y ~ normal(ai[GroupID] + a1 * x, sig_u);
}
"
stanmod_corr <- cmdstan_model(write_stan_file(baymod_corr))
bayfit_corr <- stanmod_corr$sample(
  data = input_data,
  iter_sampling = 1000,
  iter_warmup = 1000,
  chains = 4,
  parallel_chains = 4,
  seed = 123334,
  refresh = 0
)
```

```
Running MCMC with 4 parallel chains...
Chain 4 finished in 1.1 seconds.
Chain 2 finished in 1.1 seconds.
Chain 1 finished in 1.2 seconds.
Chain 3 finished in 1.2 seconds.
All 4 chains finished successfully.
Mean chain execution time: 1.1 seconds.
Total execution time: 1.3 seconds.
```

`bayfit_corr$summary(variables = c("a1", "b", "mu_a", "sig_a", "sig_u"))`

```
# A tibble: 5 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1 a1 1.03 1.03 0.0398 0.0392 0.961 1.09 1.00 3064. 2765.
2 b 0.652 0.652 0.0490 0.0490 0.572 0.732 1.00 3228. 2923.
3 mu_a 0.0616 0.0622 0.191 0.188 -0.251 0.375 1.00 3717. 2651.
4 sig_a 0.631 0.644 0.311 0.332 0.0939 1.13 1.00 749. 1487.
5 sig_u 4.08 4.07 0.119 0.118 3.88 4.27 1.00 3050. 2722.
```

`a1` is now centered on the true value.

`bayesplot::mcmc_areas(bayfit_corr$draws("a1")) + ggplot2::xlim(0, 2)`

The next piece of code more closely resembles the causal flow as I envision it. It encodes xm as a function of a_i, i.e., xm ~ normal(b0 + b1 * ai, sig_x). It fully recovers the simulation above (see the estimates of `sig_a`). The divergent transitions suggest that this model is harder to fit well, though. I am not sure why that is. Something to figure out another day.

```
baymod_corr2 <- "
data {
  int<lower=1> N;                   // num obs
  int<lower=1> J;                   // num groups
  int<lower=1, upper=J> GroupID[N]; // group id for obs, e.g. FirmID or Industry-YearID
  vector[N] y;                      // response
  vector[N] x;                      // predictor
  vector[J] xm;                     // mean of x per group
}
parameters {
  vector[J] z;         // standard normal sampler
  real mu_a;           // hyperprior mean of intercepts
  real<lower=0> sig_u; // error-term scale
  real<lower=0> sig_a; // intercept scale
  real<lower=0> sig_x; // scale of xm around its mean
  real a1;
  real b1;
  real b0;
}
transformed parameters {
  vector[J] ai; // intercept vector
  ai = mu_a + z * sig_a;
}
model {
  z ~ normal(0, 1);
  mu_a ~ normal(0, 10);
  sig_u ~ exponential(1.0 / 8);
  sig_a ~ exponential(1.0 / 8);
  a1 ~ normal(0, 10);
  b1 ~ normal(0, 10);
  b0 ~ normal(0, 10);
  sig_x ~ exponential(1.0 / 7);
  xm ~ normal(b0 + b1 * ai, sig_x);
  y ~ normal(ai[GroupID] + a1 * x, sig_u);
}
"
stanmod_corr2 <- cmdstan_model(write_stan_file(baymod_corr2))
bayfit_corr2 <- stanmod_corr2$sample(
  data = input_data,
  iter_sampling = 1000,
  iter_warmup = 1000,
  adapt_delta = 0.9,
  chains = 4,
  parallel_chains = 4,
  seed = 123244,
  refresh = 0
)
```

```
Running MCMC with 4 parallel chains...
Chain 2 finished in 1.3 seconds.
Chain 1 finished in 1.4 seconds.
Chain 4 finished in 1.8 seconds.
Chain 3 finished in 2.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 1.6 seconds.
Total execution time: 2.0 seconds.
```

`bayfit_corr2$summary(variables = c("a1", "b0", "b1", "mu_a", "sig_a", "sig_u", "sig_x"))`

```
# A tibble: 7 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1 a1 1.03 1.03 0.0395 0.0399 0.968 1.10 1.00 1068. 1501.
2 b0 0.00162 0.00387 0.296 0.286 -0.481 0.496 1.01 1381. 1993.
3 b1 1.52 1.51 0.119 0.120 1.34 1.72 1.00 682. 523.
4 mu_a 2.17 2.15 0.445 0.464 1.45 2.91 1.04 134. 444.
5 sig_a 3.97 3.96 0.399 0.398 3.36 4.66 1.01 333. 661.
6 sig_u 4.07 4.07 0.113 0.115 3.88 4.26 1.01 1188. 1738.
7 sig_x 1.08 1.08 0.368 0.398 0.507 1.70 1.06 119. 168.
```

`bayesplot::mcmc_areas(bayfit_corr2$draws("a1")) + ggplot2::xlim(0, 2)`

Two reasons:

- Precision of the estimates of the group intercepts a_i
- Often multi-level models are used for more complicated situations than the one described here, mostly for modelling heterogeneity in responses. In those situations it can still be important to acknowledge correlations with the unobserved a_i.

```
a_i_summary <- bayfit_corr$summary(variables = c("ai"))
a_i_summary2 <- bayfit_corr2$summary(variables = c("ai"))
```

```
yrange <- range(c(fixef(olsfe)$firm_id, a_i_summary$mean, a_i_summary2$mean))
wblue <- rgb(85, 130, 169, alpha = 90, maxColorValue = 255)
par(mfrow = c(1, 3), pty="s", tck = -.02, oma = c(1, 1, 1, 1), mar = c(4, 3, 1, 1))
plot(firm_mean, fixef(olsfe)$firm_id, pch = 19, col = wblue, ylim = yrange,
ylab = "OLS FE estimate", xlab = expression("True "*a[i]))
abline(a = 0, b = 1, col = "grey")
plot(firm_mean, a_i_summary$mean, pch = 19, col = wblue, ylim = yrange,
ylab = "Bayes posterior mean 1", xlab = expression("True "*a[i]))
abline(a = 0, b = 1, col = "grey")
plot(firm_mean, a_i_summary2$mean, pch = 19, col = wblue, ylim = yrange,
ylab = "Bayes posterior mean 2", xlab = expression("True "*a[i]))
abline(a = 0, b = 1, col = "grey")
```

As we can see, the Bayesian estimates are less noisy than the OLS ones.
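The reason is partial pooling: the hierarchical prior shrinks noisy per-group estimates toward the overall mean, trading a little bias for a lot of variance reduction. Here is a self-contained toy sketch of that mechanism (hypothetical numbers, not the simulated data from above):

```r
# Toy shrinkage example (hypothetical numbers): 200 groups, 5 obs each
set.seed(42)
J <- 200; n <- 5
sig_a <- 2  # sd of true group means
sig_e <- 4  # sd of observation noise
a_true <- rnorm(J, 0, sig_a)
y <- matrix(rnorm(J * n, rep(a_true, each = n), sig_e), nrow = J, byrow = TRUE)
raw <- rowMeans(y)  # per-group "fixed effect" estimate
# Shrinkage factor toward the grand mean (known variances, for illustration)
B <- sig_a^2 / (sig_a^2 + sig_e^2 / n)
shrunk <- mean(raw) + B * (raw - mean(raw))
rmse <- function(est) sqrt(mean((est - a_true)^2))
c(raw = rmse(raw), shrunk = rmse(shrunk))  # shrunk estimates are less noisy
```

In the Bayesian model the shrinkage is not plugged in like this but estimated jointly with everything else, which is why the posterior means cluster closer to the 45-degree line than the OLS fixed effects.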

We can also see the same thing by comparing the differences between the estimates and the true firm mean values:

```
data.frame(
bay1 = a_i_summary$mean - firm_mean,
bay2 = a_i_summary2$mean - firm_mean,
ols = fixef(olsfe)$firm_id - firm_mean
) |>
summary()
```

```
bay1 bay2 ols
Min. :-1.732067 Min. :-1.830365 Min. :-2.858362
1st Qu.:-0.647746 1st Qu.:-0.728263 1st Qu.:-1.155375
Median : 0.022125 Median :-0.004072 Median :-0.008425
Mean :-0.003507 Mean :-0.028118 Mean :-0.005805
3rd Qu.: 0.480510 3rd Qu.: 0.487135 3rd Qu.: 0.903697
Max. : 2.447100 Max. : 2.382485 Max. : 4.523085
```

This is all about incorporating and appreciating the uncertainty in your model forecasts. Time and time again I see students and young professionals make overconfident valuation calls because they do not appreciate the large uncertainty in their model assumptions.

Scenario analyses, if done well, help of course. But it is very hard to incorporate more than a few sources of uncertainty into a scenario analysis. Simulations, on the other hand, can incorporate many sources of uncertainty, and I have found them to be a nice teaching tool for visualizing how fast uncertainty can accumulate. Here is a simple example in case someone else might find it useful.
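As a minimal sketch of how fast uncertainty accumulates (hypothetical numbers, not the model below): three inputs that are each only about 10% uncertain already produce roughly 17% uncertainty in their product.

```r
# Hypothetical forecast: revenue = price * volume * share,
# each input uncertain with ~10% relative standard deviation
set.seed(1)
n <- 1e5
price  <- rnorm(n, mean = 10,  sd = 1)
volume <- rnorm(n, mean = 100, sd = 10)
share  <- rnorm(n, mean = 0.3, sd = 0.03)
revenue <- price * volume * share
sd(revenue) / mean(revenue)  # ~0.17, larger than any single input's 0.10
```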

A few months ago, I heard about the Johnson Quantile-Parameterized Distribution while watching a Stan tutorial.

The Johnson Quantile-Parameterized Distribution (J-QPD) is a flexible distribution system that is parameterised by a symmetric percentile triplet of quantile values (typically the 10th-50th-90th) along with known support bounds for the distribution. The J-QPD system was developed by Hadlock and Bickel (2017) doi:10.1287/deca.2016.0343.

It is a great tool for quantifying subjective beliefs and it is what we will use for quantifying our uncertainty about key parts of the financial model.

```
library(rjqpd)
library(ggplot2)
library(patchwork)
library(matrixStats)
set.seed(0234)
col_lgold = "#C3BCB2"
theme_set(theme_minimal())
```

Assume we want to estimate yearly volume for a product. We do this by modelling market size and market share. We model the market size per year based on the eventual maximum number of households that have demand for the product, HH_max, and the degree of saturation, sat, at which this maximum market size is reached. Market share is denoted as share.

We further want to model the saturation rate as a function of the number of years, N, it takes to fully saturate the market and a rate parameter m that dictates the shape of the rate of progress.

```
gen_trend_series <- function(start, end, nr_years, m = 0.5, max_years = 10) {
i <- seq_len(nr_years) - 1
# this is just to make sure that we always consider the same amount of years
full_series <- rep.int(end, max_years)
# m controls how fast saturation happens (see plot)
trend <- start + (end - start) * ( i / (nr_years - 1) )^m
full_series[1:nr_years] <- trend
return(full_series)
}
data.frame(
x = c(seq_len(10), seq_len(10)),
y = c(gen_trend_series(0, 1, 9, m = 0.5), gen_trend_series(0, 1, 9, m = 1.2)),
lab = c(rep.int("m = 0.5", 10), rep.int("m = 1.2", 10))
) |>
ggplot(aes(x = x, y = y, group = lab, color = lab)) +
geom_point() +
labs(y = "Market saturation", x = "Year", color = "Rate")
```

Of course, this is not a full, or even a good, model. For a proper model we would break this down further. For example, the market share should obviously be time-varying. Depending on the business, it would be a function of our assessment of the firm’s marketing prowess, product characteristics versus competing products, competitors’ actions, etc. And market saturation will obviously depend on these actions as well (it’s not MECE, as my consulting friends are fond of saying). I’m not going to model share further here because I just want to focus on showing the coding for the simulation approach. But everything here can–and probably should–be applied at the lower input levels of your model too.

In this setup there are already four things we are uncertain about:

- The maximum number of HHs that represents the ceiling for this market, HH_max
- The number of years, N, it takes the market players to saturate the market
- Whether most of the saturation happens early on, or later (rate parameter m)
- The market share of the product, share

Let’s say we did our research on these four inputs and have formed beliefs about possible values. We will now try to express those beliefs in a sort of “multiple universes” way.

Say we want to simulate 1,000 universes.

`n_sims <- 1000`

Which is to say we want to have 1,000 draws from different random number generators. Let’s start with HH_max:

Assume that after researching the market for a while, we expect the maximum addressable market will be about 100 million HHs. But we are unsure. It could be more or less. Say we believe there is only a small chance (1 in 20) that the maximum market size is lower than 90 million and only a 1-in-20 chance that it is higher than 120 million. We can express this belief via a J-QPD distribution. It takes as inputs three quantiles/percentiles (which I sort of gave you above) and draws a distribution around those constraints. Once we have a J-QPD random number generator (which is in the `rjqpd` package) and calibrated it according to our beliefs, we can draw 1,000 times from it. Here is how you would do that in R:

```
params_HHmax <- jqpd(
c(90, 100, 120), # the three benchmark points to our quantiles
lower = 0, upper = 200, # worst case (lower): zero market, (upper): all HH in the area (200)
alpha = 0.05 # bounds are 5%, 50%, 95% percentiles
)
samples_HHmax <- rjqpd(n = n_sims, params_HHmax)
ggplot(data.frame(x = samples_HHmax), aes(x = x)) +
geom_histogram(binwidth = 5, color = "white", fill = col_lgold) +
labs(y = "Count of draws",
x = expression(HH[max]),
subtitle = "J-QPD(5% = 90, 50% = 100, 95% = 120)"
) +
theme(plot.subtitle = element_text(size = 7), panel.grid.minor = element_blank()) +
scale_x_continuous(breaks = seq(0, 200, 5))
```

What I also like about this approach is that plots like the one above make it very easy to check whether the simulation adequately reflects your beliefs! Here you can see that in the majority of cases (universes) HH_max will be about 100 million. And you can see the distribution of the other universes and check whether that is really what you believe (e.g., the long right tail: do we really believe that there is a small chance of 145 million HHs? If not, we need to adjust this.)
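Besides eyeballing the histogram, you can check calibration numerically: the empirical quantiles of your draws should land near the beliefs you put in. A base-R sketch (using normal draws as a stand-in for the J-QPD samples, so it runs without the `rjqpd` package):

```r
# Stand-in for samples_HHmax: normal draws instead of J-QPD draws
set.seed(2)
draws <- rnorm(1e5, mean = 100, sd = 9.1)
# Empirical 5%/50%/95% percentiles, to compare against the stated beliefs
quantile(draws, c(0.05, 0.5, 0.95))
# Probability of a "surprise" scenario, e.g. a market above 115 million HHs
mean(draws > 115)
```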

We can do exactly the same for share and m.

```
params_share <- jqpd(
c(0.2, 0.3, 0.4),
lower = 0, upper = 1,
alpha = 0.05 # bounds are 5%, 50%, 95%
)
samples_share <- rjqpd(n = n_sims, params_share)
ggplot(data.frame(x = samples_share), aes(x = x)) +
geom_histogram(binwidth = 0.05, color = "white", fill = col_lgold) +
labs(y = "Count of draws",
x = expression(share),
subtitle = "J-QPD(5% = 0.2, 50% = 0.3, 95% = 0.4)"
) +
theme(plot.subtitle = element_text(size = 7), panel.grid.minor = element_blank()) +
scale_x_continuous(breaks = seq(0, 1, 0.05)) +
coord_cartesian(xlim = c(0, 1))
```

```
params_m <- jqpd(
c(0.6, 1.0, 1.4),
lower = 0.1, upper = 2,
alpha = 0.05 # bounds are 5%, 50%, 95%
)
samples_m <- rjqpd(n = n_sims, params_m)
ggplot(data.frame(x = samples_m), aes(x = x)) +
geom_histogram(binwidth = 0.1, color = "white", fill = col_lgold) +
labs(y = "Count of draws",
x = expression(m),
subtitle = "J-QPD(5% = 0.6, 50% = 1.0, 95% = 1.4)"
) +
theme(plot.subtitle = element_text(size = 7), panel.grid.minor = element_blank()) +
scale_x_continuous(breaks = seq(0, 2, 0.1)) +
coord_cartesian(xlim = c(0, 2))
```

Finally, we need to quantify our uncertainty about the number of years, N, it takes the market to become saturated. Just to showcase another sampling function, let’s assume that we think the market will be saturated in x years according to the following probabilities:

```
N <- c("5" = 0.1, "6" = 0.2, "7" = 0.4, "8" = 0.2, "9" = 0.1)
N
```

```
5 6 7 8 9
0.1 0.2 0.4 0.2 0.1
```

```
samples_N <- sample(5:9, size = n_sims, replace = TRUE, prob = N)
ggplot(data.frame(x = samples_N), aes(x = x)) +
geom_histogram(bins = 5, color = "white", fill = col_lgold) +
labs(y = "Count of draws",
x = expression(N),
subtitle = "Discrete sampling"
) +
theme(plot.subtitle = element_text(size = 7), panel.grid.minor = element_blank()) +
scale_x_continuous(breaks = seq(1, 10, 1)) +
coord_cartesian(xlim = c(1, 10))
```

Now we can put it all together. According to

volume_t = HH_max × sat_t × share

we now just need to calculate the saturation series sat_t and then multiply our samples (because of independent sampling) like this:

```
sat_grade <-
mapply(function(x, y) gen_trend_series(0, 1, nr_years = x, m = y),
x = samples_N,
y = samples_m)
sat_grade[, 1:3]
```

```
[,1] [,2] [,3]
[1,] 0.0000000 0.0000000 0.00000000
[2,] 0.3114973 0.1798650 0.04958143
[3,] 0.4891172 0.3313905 0.13496237
[4,] 0.6368562 0.4737893 0.24244274
[5,] 0.7680183 0.6105671 0.36737225
[6,] 0.8880883 0.7433117 0.50712011
[7,] 1.0000000 0.8729282 0.65993753
[8,] 1.0000000 1.0000000 0.82455547
[9,] 1.0000000 1.0000000 1.00000000
[10,] 1.0000000 1.0000000 1.00000000
```

To make the code more understandable, here is the rest in a for loop:

```
vol_series <- matrix(NA, nrow = 10, ncol = n_sims)
for (i in seq_len(n_sims)) {
vol_series[, i] <- samples_HHmax[[i]] * sat_grade[, i] * samples_share[[i]]
}
vol_series[, 1:3]
```

```
[,1] [,2] [,3]
[1,] 0.00000 0.000000 0.000000
[2,] 8.72598 6.223404 1.904877
[3,] 13.70165 11.466248 5.185140
[4,] 17.84026 16.393303 9.314445
[5,] 21.51451 21.125872 14.114131
[6,] 24.87803 25.718889 19.483126
[7,] 28.01302 30.203674 25.354241
[8,] 28.01302 34.600410 31.678724
[9,] 28.01302 34.600410 38.419154
[10,] 28.01302 34.600410 38.419154
```

And now, we can plot the result

```
plot_data <-
as.data.frame(vol_series) |>
stack()
plot_data$Year <- rep(1:10, times = n_sims)
ci_data <- data.frame(
Year = 1:10,
Median = rowMedians(vol_series),
Q05 = rowQuantiles(vol_series, prob = 0.05),
Q25 = rowQuantiles(vol_series, prob = 0.25),
Q75 = rowQuantiles(vol_series, prob = 0.75),
Q95 = rowQuantiles(vol_series, prob = 0.95)
)
```

```
p1 <-
ggplot(plot_data, aes(x = Year, y = values, group = ind)) +
geom_line(alpha = 0.10) +
labs(subtitle = paste(n_sims, "individual volume trends"))
p2 <-
ggplot(ci_data, aes(x = Year)) +
geom_ribbon(aes(ymin = Q05, ymax = Q95), alpha = 0.5, fill = col_lgold) +
geom_ribbon(aes(ymin = Q25, ymax = Q75), alpha = 0.5, fill = col_lgold) +
geom_line(aes(y = Median), color = col_lgold) +
labs(subtitle = "Summary via credible intervals") +
annotate("text",
x = c(5, 3, 8), y = c(60, 50, 10),
label = c("5%-95% probability", "25%-75% probability", "Median"),
color = "gray20", size = 3) +
annotate("segment",
x = c(5, 3, 8), y = c(60-1, 50-1, 10+1),
xend = c(8, 5, 7), yend = c(ci_data[8, "Q95"], ci_data[5, "Q75"], ci_data[7, "Median"]),
color = "gray40")
fig1 <- p1 + p2 +
plot_annotation(
tag_levels = 'A',
title = 'Quantifying beliefs via simulation',
subtitle = paste(n_sims, 'random draws (universes)')
) &
scale_x_continuous(breaks = seq(1, 10, 1)) &
labs(x = "Year", y = "Sales volume") &
coord_cartesian(ylim = c(0, 70)) &
theme(panel.grid.minor = element_blank())
fig1
```

And here is the result: we have propagated our uncertainty and beliefs about the input parameters into an empirical distribution. This empirical distribution quantifies our resulting beliefs about which values of yearly sales volume are more likely than others. Essentially, we can derive subjective probabilities from these 1,000 draws.
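Mechanically, deriving such probabilities from draws is just counting. A self-contained sketch with hypothetical draws (not the vol_series from above):

```r
# 1,000 hypothetical draws of one year's sales volume (in millions)
set.seed(3)
draws <- rlnorm(1000, meanlog = 3.3, sdlog = 0.25)
# Subjective probability that volume exceeds 30 million:
# the share of universes in which it does
mean(draws > 30)
# Central 50% credible interval
quantile(draws, c(0.25, 0.75))
```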

And as we can see, the uncertainty in this model is actually quite big. For example, in year 7, we believe that with 50% probability sales volume will be between 23.96 and 33.45 million. But there is also a 5% chance that it will be lower than 17.38 million.

`round(ci_data, 2)`

```
Year Median Q05 Q25 Q75 Q95
1 1 0.00 0.00 0.00 0.00 0.00
2 2 5.13 1.97 3.54 7.36 11.99
3 3 10.20 4.90 7.72 13.40 19.57
4 4 15.15 7.94 12.23 19.22 26.98
5 5 20.12 11.36 16.82 25.18 34.83
6 6 24.83 14.51 20.65 30.17 39.38
7 7 28.40 17.38 23.96 33.45 41.59
8 8 29.66 19.05 25.10 34.23 42.14
9 9 30.00 19.36 25.40 34.41 42.23
10 10 30.00 19.36 25.40 34.41 42.23
```

What I like about this approach:

- It is so easy to visualize how fast uncertainty propagates and exactly what form it takes. And you really want to know your forecast uncertainty before making any serious decisions (in my humble opinion). I have not found a better way to communicate this uncertainty as precisely and transparently as is done here.
- It is very flexible. You can do your normal modelling and just add one dimension: instead of putting in a single value for an input assumption, you put in a number of samples drawn from a distribution that quantifies your subjective beliefs.
- It is a good guide for highlighting where more research is needed to reduce uncertainty. And conversely, where you simply cannot get more precise forecast given the resources you have.
- I found that the J-QPD is quite helpful for teaching. It is often hard to vocalize your beliefs in detail. But one can often say, “well, at the median I would expect this value, and I would be highly surprised (e.g., a 1/20 or 1/100 chance) if it were lower than x or higher than y”.

The idea behind valuing firms via multiples is simple: if firm B has identical characteristics to firm A, then it should also be worth as much as firm A. The tricky part is to find comparable firms. One of the most often asked questions in FSA classrooms is: “How do I choose valuation multiples?” Which characteristics matter to declare firm A as “comparable”? The good news is that valuation models can tell us clearly which characteristics matter (spoiler: the usual suspects).

Take a simple example. Firm A earns earnings per share (eps) of €10 and is worth €100 per share. Firm B earns eps of €5 and has an identical growth, risk, and profitability profile. What is B worth, even though it is smaller? The answer is €50: we apply the same price-earnings multiple (P/E, €100/€10 = 10, for A) to B (€5 × 10 = €50).
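In code, the arithmetic from this example is just two lines:

```r
pe_A <- 100 / 10     # firm A's price-earnings multiple: price / eps
price_B <- 5 * pe_A  # apply A's multiple to firm B's eps
price_B              # 50 euros per share
```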

Why is that? If you agree that a company should be worth the present value of the future cash flows it generates (and you allow me to use the clean surplus equation), then we can start from the dividend discount model and express the share price via the residual income model:

$$ P_0 = BV_0 + \sum_{t=1}^{\infty} \frac{NI_t - r \cdot BV_{t-1}}{(1+r)^t} $$

This simple reformulation of the dividend discount model says the price is equal to the book value of equity (BV_0) plus the present value of “abnormal” or “residual” earnings (NI_t − r·BV_{t−1}). NI is net income and r the cost of equity capital. From this we can derive a simple representation of the price-earnings ratio.

Assume for simplicity that we have a forecast for next year’s earnings NI_1 and we assume a fixed yearly growth rate g for residual earnings. The formula then simplifies to:

$$ P_0 = BV_0 + \frac{NI_1 - r \cdot BV_0}{r - g} $$

Dividing both sides by NI_1 and rearranging terms yields the following expression for the price-earnings ratio:

$$ \frac{P_0}{NI_1} = \frac{1}{r - g}\left(1 - \frac{g}{RoE}\right) $$

where RoE = NI_1 / BV_0 is the (forward) return on equity.

The previous equation shows that the usual value drivers determine the P/E multiple. The same goes for other multiples; one can do very similar reformulations for EV/EBIT, etc. It is also intuitive: a good candidate for a comparable company should have a similar risk (r), growth (in abnormal profitability, g), and profitability (RoE, return on equity) profile.

The formula above does not have a very intuitive form though, so let’s plot it for reasonable ranges of r, g, and RoE:

```
library(ggplot2)
library(dplyr)
library(viridisLite)
```

```
pe <- function(r, g, RoE){
(1 / (r - g)) * (1 - (g / RoE))
}
# some reasonable range of value driver values
value_grid <- expand.grid(r = seq(0.11, 0.15, 0.001),
g = seq(0, 0.08, 0.001),
RoE = seq(0.05, 0.3, 0.025)
)
pe_data <- as.data.frame(value_grid)
pe_data$PE = pe(pe_data$r, pe_data$g, pe_data$RoE)
pe_data$RoE <- factor(
paste("RoE:", scales::percent(pe_data$RoE)),
levels = c("RoE: 5.0%", "RoE: 7.5%", "RoE: 10.0%", "RoE: 12.5%",
"RoE: 15.0%", "RoE: 17.5%", "RoE: 20.0%", "RoE: 22.5%",
"RoE: 25.0%", "RoE: 27.5%", "RoE: 30.0%")
)
```

```
pe_plots <-
pe_data %>%
filter(
is.infinite(PE) == FALSE,
PE > 0
) %>%
ggplot(aes(x = r, y = g, fill = PE)) +
geom_raster() +
facet_wrap(~RoE) +
labs(
x = expression(r[equity]),
y = expression(g[RI]),
fill = expression(P/E)
) +
scale_fill_viridis_c(option = "A") +
scale_x_continuous(labels = function(x) paste0(round(x*100, 0), "%")) +
scale_y_continuous(labels=scales::percent) +
theme_minimal() +
theme(axis.text = element_text(size = 6),
panel.grid.minor = element_blank())
pe_plots
```

And because this picture is not as helpful as it could be (and I wanted an excuse to fiddle with the fabulous rayshader package), here is the same plot in 3D:

The main points are:

- Generally, P/E is highest for high growth relative to risk (the upper left corner). Profitability has a positive moderating effect: higher return on equity “bends” the plane further upwards in the upper left corner. The only exceptions (unsurprisingly) are those facets where RoE < r. In this case residual income–the abnormal profit–is negative, which means the company is destroying value. Higher growth just means the company is destroying value faster, which is why the P/E plane curves down with higher growth for the RoE < r cases.
- Back to the original question: finding similar firms doesn’t mean just taking firms in the same industry per se. It means finding firms with a similar risk (r), growth (g), and profitability (RoE) profile. Naturally, firms in the same market will likely have the same market risk, market growth opportunities, etc., so those are good candidates. But you’d still want to think about this further. For example, if two firms compete, they could have different competitive advantages and different profitability characteristics. One firm’s growth could come at the expense of the other.
- Note that different combinations of r, g, and RoE can give you similar multiples. For example, both r = 13%, g = 2.5%, RoE = 25.0% and r = 14%, g = 7%, RoE = 17.5% lead to a P/E of 8.6. But even though they have the same multiple right now, since they sit on different slopes in this valuation space, the multiples will **change differently** if things change. This just reiterates that you want to look for comparables with the same r, g, RoE profile, because only then will changes in multiples be similar, if things change.
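You can verify the last point directly with the pe() function used for the plots (repeated here so the snippet is self-contained):

```r
# P/E as a function of the value drivers (same formula as above)
pe <- function(r, g, RoE) (1 / (r - g)) * (1 - (g / RoE))
pe(r = 0.13, g = 0.025, RoE = 0.250)  # first driver combination
pe(r = 0.14, g = 0.070, RoE = 0.175)  # second driver combination
# both evaluate to ~8.57, the "8.6" multiple from the text
```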

Multiple valuation means finding similar firms and applying their valuation–just following the logic that similar assets should be valued similarly. The tricky part is finding similar firms. What this small note hopefully showed is that similar firms are those with similar expected risk, growth, and profitability profiles. Quantifying those in the first place is of course not trivial (and if you do this in enough detail, you probably do not need to resort to multiples anymore). For example, growth in abnormal profits is hard to reason about–compared to, say, growth in sales. But even just ballparking whether the business models are comparable in terms of general risk, growth, and profitability prospects will already prevent clearly wrong comparables from entering the valuation.

```
library(ggplot2)
library(dplyr)
library(viridisLite)
library(rayshader)
library(rgl)
pe <- function(r, g, RoE){
(1 / (r - g)) * (1 - (g / RoE))
}
value_grid <- expand.grid(r = seq(0.11, 0.15, 0.001),
g = seq(0, 0.08, 0.001),
RoE = seq(0.05, 0.3, 0.025)
)
pe_data <- as.data.frame(value_grid)
pe_data$PE = pe(pe_data$r, pe_data$g, pe_data$RoE)
pe_data$RoE <- factor(
paste("RoE:", scales::percent(pe_data$RoE)),
levels = c("RoE: 5.0%", "RoE: 7.5%", "RoE: 10.0%", "RoE: 12.5%",
"RoE: 15.0%", "RoE: 17.5%", "RoE: 20.0%", "RoE: 22.5%",
"RoE: 25.0%", "RoE: 27.5%", "RoE: 30.0%")
)
pe_plots <-
pe_data %>%
filter(
is.infinite(PE) == FALSE,
PE > 0
) %>%
ggplot(aes(x = r, y = g, fill = PE)) +
geom_raster() +
facet_wrap(~RoE) +
labs(
x = expression(r[equity]),
y = expression(g[RI]),
fill = expression(P/E)
) +
scale_fill_viridis_c(option = "A") +
scale_x_continuous(labels = function(x) paste0(round(x*100, 0), "%")) +
scale_y_continuous(labels=scales::percent) +
theme_minimal() +
theme(axis.text = element_text(size = 6),
panel.grid.minor = element_blank())
pe_plots
plot_gg(
pe_plots,
multicore = TRUE,
width = 6,
height = 6,
scale = 250,
windowsize = c(1400, 866),
zoom = 0.45, theta = -40, phi = 30
)
render_movie(
filename = 'pe',
type = "orbit",
phi = 30,
zoom = 0.45,
theta = -40
)
```

It was a great event, hosted by the Frankfurt School of Finance and Management. The theme of the meeting was the “Influence of digitalisation on research methods”. The first session was about how recent developments in society and the scientific toolkit changed what data we as scientists have access to. The second session was an introductory discussion about new ways to measure latent concepts that we as accounting researchers are interested in (think: information processing cost, transparency, etc.)

You can find the workshop slides on the workshop’s homepage here.

You can find a video of my talk and the other, fantastic talks of the PhD forum here.

A github repo with the slides and code to generate all figures can be found here on my github page.