Introductory Bayesian Course Notes - 2014


The purpose of these course notes is to introduce participants to Bayesian analysis with R, RStudio and JAGS. It is assumed that participants are familiar with R and RStudio as covered in the Introductory R Course notes at

The notes are also available as a pdf.


The notes, which are released under a CC BY 4.0 license, are a draft of the material to be presented at the Introductory Bayesian Course in Kelowna on November 20th-21st, 2014. They were written by Dr. Joseph Thorley R.P.Bio.. The Rmd file to generate these notes is available on Github together with R/BUGS code to answer to the exercises.


If you haven’t already done so, download the base distribution binary of R 3.1.2 for your platform from and install using the default options. Next download and install the most recent version of RStudio from using the default options. Then, download JAGS 3.4.0 from (right-click and open this link in a new window) and install with the default options.

To install all the required packages execute the following code at the R console.

If everything was successful, when you run the following code

you should see something like

Analysis converged (rhat:1)


samples  chains 
   1500       3 


         estimate     lower    upper       sd error significance
bLambda 0.8771658 0.7151765 1.062476 0.088936    20            0

R Script Headers

During the course you should begin R scripts with

to load the required packages in the search path and to print numbers to four significant digits and

to clean up the workspace and close any graphics windows.

Bayesian and Frequentist Statistical Analysis

Statistical analysis uses one or more probability distributions to provide bounded estimates of parameter values (\(\theta\)) from the data (\(y\)).

There are two primary approaches to statistical analysis: Bayesian and frequentist. As far as a frequentist is concerned the best estimates of \(\theta\) are those values that maximise the likelihood which is the probability of the data given the estimates, i.e., \(p(y|\theta)\). A Bayesian on the other hand chooses the values with the highest posterior probability - that is to say the probability of the estimates given the data , i.e., \(p(\theta|y)\).

Coin Flips

Consider the case where \(n = 10\) flips of a coin produce \(y = 3\) tails. We can model this using a binomial distribution.

\[y \sim dbin(\theta, n)\]

where \(\theta\) is the probability of throwing a head.

Maximum Likelihood

The likelihood for the binomial model is given by the following equation

\[ p(y|\theta) = {n \choose y} \theta^{y}(1-\theta)^{n-y}\].

The likelihood values for different values of \(\theta\) are therefore as follows

The frequentist point estimate (\(\hat{\theta}\)) is the value of \(\theta\) with the maximum likelihood (ML) value, which in this case is 0.3.

A 95% confidence interval (CI) can then be calculated using the asymptotic normal approximation

\[\hat{\theta} \pm 1.96 \frac{1}{\sqrt{I(\hat{\theta})}}\]

where \(I(\hat{\theta})\) is the expected second derivative of the log-likelihood at the estimate. This calculation is based on the assumption that the sample size is of sufficient size that the likelihood is normally distributed.

In the current case,

\[I(\hat{\theta}) = \frac{n}{\hat{\theta}(1 - \hat{\theta})}\]

which gives a point estimate of 0.3 and lower and upper 95% confidence intervals of 0.02 and 0.58 respectively.

Posterior Probability

The posterior probability on the other hand is given by Bayes rule which states that

\[p(\theta|y) \propto p(y|\theta)p(\theta)\]

where \(p(\theta)\) is the prior probability.

Bayesians consider the necessity to define prior probabilities an advantage because prior information can be incorporated into an analysis while frequentists consider it subjective. In most cases Bayesians use low-information priors which have little to no influence on the posteriors such as a uniform distribution with a lower limit of 0 and an upper limit of 1 for a probability. As it is generally not possible to calculate the posterior probability, the posterior probability distribution is sampled using Markov Chain Monte Carlo (MCMC) algorithms such as Gibbs Sampling.

Gibbs Sampling

Consider the case where the parameters \(\theta = (\theta_1, \theta_2, \ldots, \theta_k)\) then Gibbs Sampling proceed as follows

Step 1

Choose starting initial values for \(\theta_1^{(0)}\) and \(\theta_2^{(0)}\)

Step 2

Sample \(\theta_1^{(1)}\) from \(p(\theta_1|\theta_2^{(0)}, y)\)

Sample \(\theta_2^{(1)}\) from \(p(\theta_2|\theta_1^{(1)}, y)\)

Step 3

Iterate step 2 thousands (or millions) of times to obtain a sample from \(p(\theta|y)\).

Typically this is performed for two or more independent chains.

JAGS and the BUGS Language

Programming an efficient MCMC algorithm for a particular model is outside the scope of most research projects. Fortunately, JAGS (which stands for Just Another Gibbs Sampler) can take a dataset and a model specified in the simple but flexible BUGS language (which stands for Bayesian Analysis Using Gibbs Sampling) and perform MCMC sampling for us. In order to do this we will use the jaggernaut package to talk to the standalone JAGS program via the rjags package.

First we need to specify the underlying probability model in the BUGS language and save it as an object of class jags_model.

then we call JAGS using jaggernaut (in the default report mode) to generate a total of \(\geq 10^3\) samples using three chains from \(\theta\)’s posterior probability distribution.

##       estimate  lower upper     sd error significance
## theta   0.3301 0.1079 0.595 0.1287    74            0

The model output indicates that the point estimate (in this case the mean of the samples) is 0.33 and the 95% credible interval (in this case the 2.25th and 97.75th percentiles) is 0.11 to 0.59. The model output also indicates that the posterior probability distribution has a standard deviation (sd) of 0.13. The significance and error values are discussed later.

Exercise 1 Previous studies indicate that the coin was definitely biased towards tails. Modify the prior distribution accordingly and rerun the above model. How does the posterior distribution change?

Black Cherry Trees

The trees data set in the dataset package provides information on the girth and volume of 31 black cherry trees.

Algebraically, the linear regression of Volume against Girth can be defined as follows

\[Volume_{i} = \alpha + \beta * Girth_{i} + \epsilon_{i}\]

where \(\alpha\) is the intercept and \(\beta\) is the slope and the error terms (\(\epsilon_{i}\)) are independently drawn from a normal distribution with an standard deviation of \(\sigma\).

The model can be defined as follows in the BUGS language where <- indicates a deterministic as opposed to stochastic node (which is indicated by ~).


The standard deviations of the normal distributions are raised to the power of -2 because (for historical reasons) Bayesians quantify variation in terms of the precision (\(\tau\)) as opposed to the variance (\(\sigma^2\)) or standard deviation (\(\sigma\)) where \(\tau = 1/\sigma^2\).

Parallel Chains

To reduce the analysis time, MCMC chains can be run on parallel processes. In jaggernaut this achieved using the registerDoParallel function and by setting the parallel option to TRUE. This only needs to be done once at the start of a session. I add it to my R script header.

The resultant trace plots and coefficients for the trees analysis are as follows.

##       estimate   lower   upper     sd error significance
## alpha  -36.157 -42.713 -29.699 3.2638    18            0
## beta     5.008   4.533   5.481 0.2395     9            0
## sigma    4.476   3.444   5.982 0.6202    28            0

Exercise 2 What do you notice about the trace plots? The output of auto_corr(analysis1) and cross_cor(analysis1) might give you some clues.


The \(\hat{R}\) (pronounced R-hat) convergence metric uses the within-chain and between-chain variances to quantify the extent to which the chains have converged on the same distribution. Lack of convergence suggests that the MCMC samples may not be representative of the posterior distributions. Although a value of 1.0 indicates full convergence, \(\hat{R} \leq 1.05\) is typically considered sufficient for most papers and \(\leq 1.10\) for most reports.

Exercise 3 What is the R-hat value for each of the parameters in the current model? Clue: type ?convergence.


Convergence can often be improved by simply increasing the number of iterations.

Exercise 4 What is the R-hat value for each of the parameters in the current model with 10,000 iterations?

jaggernaut Options

With the the default options_jaggernaut(mode = "report") settings:

To see the current mode and option settings type options_jaggernaut() (or opts_jagr() for short) and for more information type ?opts_jagr.

Chain Mixing

Cross-correlations between parameters, which cause poor chain mixing, i.e., high auto-correlation, can sometimes be eliminated or reduced by reparameterising the model. In the current model, Girth can be centered, i.e., Girth - mean(Girth), in the BUGS code or by setting the select_data argument to be select_data(model1) <- c("Volume", "Girth+").

Exercise 5 What is the effect of centering Girth on the trace plots with 1,000 iterations?

Note if you ever want to examine the actual data being passed to JAGS set the modify_data term of your jags_model object to be a simple function that prints and returns its one argument

Derived Parameters

Many researchers estimate fitted values and residuals, generate predictions and perform posterior predictive checks by monitoring additional nodes in their model code.

The disadvantages of this approach are that:

jaggernaut overcomes these problems by allowing derived parameters to be defined in a separate chunk of BUGS code as demonstrated below.


To better understand a model and/or inform management decisions it is generally useful to plot a model’s predictions.

In the following example, the predict function is used to estimate the Volume with 95% CRIs and 95% Prediction Intervals (PRIs) across the range of the observed values of Girth.

Exercise 6 The newdata argument in the predict function can also take a data.frame. What is the 95% Prediction Interval for the Volume of a tree of Girth 8?


The model assumes that the error terms are independently drawn from a normal distribution. It is possible to assess the adequacy of this assumption by plotting the residual variation.

Exercise 7 What does the current residual plot suggest to you about model adequacy?

Posterior Predictive Checks

A complementary approach to assessing model adequacy is to simulate data given the model parameters and compare it to the observed data. Any systematic differences indicate potential failings of the model.

##             estimate  lower upper   sd error significance
## discrepancy   0.5787 -10.27 12.39 5.65  1958       0.9042

Exercise 8 What does the posterior predictive check suggest to you about model adequacy?


As discussed in the R course notes the relationship between Volume and Girth is expected to be allometric because the cross-sectional area at an given point scales to the square of the girth (circumference).

Expressed as an allometric relationship the model becomes

\[Volume_{i} = \alpha * Girth_{i}^\beta * \epsilon_{i}\]

which can be reparameterised as a linear regression by log transforming Volume and Girth.

\[log(Volume_{i}) = \alpha + \beta * log(Girth_{i}) + \epsilon_{i}\]

Variables can be log transformed in the model code or in the select_data argument, i.e., select_data(model1) <- c("log(Volume)", "log(Girth)+").

Exercise 9 Fit the linear regression of the allometric model to the trees data set. Is the model fit improved?

Exercise 10 Is there any support for adding log(Height)+ to the model?

Significance Values

The significance value in the jaggernaut table of coefficients is twice the probability that the posterior distribution spans zero. As such it represents the Bayesian equivalent of a frequentist two-sided p-value (Greenland and Poole 2013). By definition, parameters that represent standard deviations will have a significance value of 0 (as they must be greater than zero).

Error Values

The error value in the table of coefficients is the percent relative error, which is half the credible interval as a percent of the point estimate. Standard deviations with a uniform prior distribution that is not updated by the data have error values of 0.95. As a general rule, I question the informativeness of parameters representing standard deviations which have an error value \(\geq 0.8\).

Tooth Growth

The basic ANCOVA (analysis of covariance) model includes one categorical and one continous predictor variable without interactions. It can be expressed algebraically as follows

\[y_{ij} = \alpha_{i} + \beta * x_{j} + \epsilon_{ij}\]

where \(\alpha_{i}\) is the intercept for the ith group mean and \(\beta\) is the slope and the error terms (\(\epsilon_{ij}\)) are independently drawn from a normal distribution with standard deviation \(\sigma\).

The following code fits the basic ANCOVA model to the ToothGrowth data and plots the model’s predictions and residuals.

##          estimate lower  upper     sd error significance
## alpha[1]    9.307 6.742 11.968 1.3750    28            0
## alpha[2]    5.627 2.989  8.258 1.3549    47            0
## beta        9.730 7.911 11.538 0.9281    19            0
## sigma       4.326 3.625  5.221 0.4004    18            0

Exercise 11 What is the significance value for the effect of OJ versus VC?

Exercise 12 What does the residual plot suggest about the model fit?

Multiple Models

The linear regression on dose is given by

The combine function allows multiple jags_model objects which can have unique mode_id’s to be combined into a single object.

##          estimate lower  upper     sd error significance
## alpha[1]    9.218 6.767 11.761 1.2754    27            0
## alpha[2]    5.540 3.032  8.176 1.2861    46            0
## beta        9.799 8.109 11.512 0.8637    17            0
## sigma       4.315 3.623  5.227 0.4192    19            0
## $regression
##       estimate lower  upper     sd error significance
## alpha    7.258 4.867  9.884 1.2902    35            0
## beta     9.889 7.986 11.666 0.9691    19            0
## sigma    4.716 3.890  5.702 0.4601    19            0

Exercise 13 Fit 1) the ANCOVA, 2) the linear regression, 3) the ANOVA and 4) the ANCOVA with an interaction between dose and supp models and plot their predictions. Which model do you prefer?

Exercise 14 How might you further improve your preferred model?

Effects Size

Often the results of an analysis are easier to understand when they are presented in terms of the percent change in the response (Bradford, Korman, and Higgins 2005). The following code predicts and plots the percent change in len relative to 0.75 mg of Vitamin C.

Exercise 15 Plot the percent change in len for all four models relative to 0.5 mg of Vitamin C

Peregrine Falcon Population Abundance

Consider the peregrine falcon population data from Kery and Schuab (2011). The following code regresses the number of reproductive Pairs on Year.

##       estimate  lower  upper     sd error significance
## alpha   90.565 84.694 96.606 3.0722     7            0
## beta     4.728  4.188  5.244 0.2726    11            0
## sigma   19.430 15.591 23.981 2.1940    22            0

Exercise 16 What happens to the model’s predictions if Year isn’t centred?

When a response variable cannot take negative values, the use of a log-link function ensures that the expected value must be positive.

Exercise 17 Replace ePairs[i] <- ... with log(ePairs[i]) <- .... How does the log-link function alter the model?

Poisson Distribution

The Poisson distribution models counts about a positive mean expected value. It can only assume discrete non-negative values and has a variance equal to the mean.

Exercise 18 Next, replace Pairs[i] ~ dnorm(ePairs[i], sigma^-2) with Pairs[i] ~ dpois(ePairs[i]). How does the assumption of Poisson distributed counts alter the model?

Polynomial Regression

Curvature in a predictor variable can be modelled by allowing the influence of a variable to vary with the variable. For example, the following model code fits a second-order (quadratic) polynomial regression on Year.

Exercise 19 How does the second-order polynomial change the model’s predictions?

Exercise 20 Fit a third-order (cubic) polynomial regression (... + Year[i]^2 + beta3 * Year[i]^3). How does it alter the model?

Exercise 21 Use the third-order polynomial regression to predict the number of breeding pairs in 2006. How confident are you in your answer?

State-Space Population Growth Model

An alternative to a polynomial regression would be to fit a state-space population growth model which explicity estimates the step changes in the underlying population

\[log(N_{t+1}) = log(N_{t}) + r_{t}\] \[r_{t} \sim dnorm(\bar{r}, \sigma_{r})\]

The specification of r and logN as random effect means that by default they are excluded from the trace plots and table of coefficients.

Exercise 22 Plot the state-space population growth rate model predictions in terms of the percent change since 1970. What is the estimated percent change in the population in 2003 compared to 1970?

Exercise 23 What is the probability that the population in 2008 will be less than that in 2003? Note you can produce the projections by simply appending five years of missing counts to the dataset.

Overdispersed Poisson

The Poisson distribution assumes that the variation in the counts about the expected value is due to independent events. Often however this assumption is violated and the variance is greater than the mean. In this case an overdispersed Poisson distribution is required.

The following model uses a gamma distribution where the scale and shape parameters are both 1 / sDispersion^2 to generate the extra-Poisson multiplier because such a distribution has a mean of 1, a standard deviation of sDispersion and is always positive.

##             estimate      lower   upper      sd error significance
## logN1        3.54960  3.3016252 3.78656 0.12538     7       0.0000
## mean_r       0.04292 -0.0004407 0.08252 0.02068    97       0.0547
## sDispersion  0.02684  0.0017635 0.07634 0.02063   139       0.0000
## sd_r         0.12741  0.0888883 0.17948 0.02361    36       0.0000

Exercise 24 Is the inclusion of a Plot the state-space population growth rate model predictions in terms of the percent change since 1970. What is the estimated percent change in the population in 2003 compared to 1970?

Peregrine Falcon Breeding Success

Now let us consider the proportion of peregrine Pairs that successfully reproduced (R.pairs/Pairs).

Exercise 25 How does a second-order polynomial regression alter the model’s predictions?

When a response variable is a probability, the use of the logistic-link function ensures that the expected values must lie between 0 and 1.

\[logit(p) = log(\frac{p}{1-p})\]

Exercise 26 How does adding the logistic-link function, i.e., logit(eProportion[i]) <- ... alter the model’s predictions?

Binomial Distribution

The binomial distribution models the number of successes given the expected probability of successes and the number of independent trials.

Exercise 27 How does replacing the normal distribution Proportion[i] ~ dnorm(eProportion[i], sigma^-2) with the binomial distribution R.Pairs ~ dbin(eProportion[i], Pairs[i]) alter the model?

Autoregressive Smoothing

Earlier we used an autoregressive term to explicitly model population change. Autoregressive models can also be used simply for their smoothing properties.

Here we model the differences in the log-odds probability of breeding success from one year to the next as a normally distributed random effect.

Exercise 28 What happens to the model if you replace line 5 with theta[i] <- theta[i-1]?

Overdispersed Binomial Distribution

The binomial distribution assumes independent trials. Often this is not the case. Overdispersion can be simply modelled through a random effect on the expected probability of success for each sample.

Exercise 29 What effect does a normally distributed random effect on logit(eProportion[i]) have on the model? Are the data overdispersed?

Further Information

The JAGS manual is definitely worth a read to familiarise yourself with the available distributions and functions.

All the R and BUGS code for the analyses Poisson Consulting Ltd. staff perform is available under MIT open source licences from GitHub.

The following blog plot articulates: the joy and martyrdom of trying to be a Bayesian.

And finally, to avoid contributing to the statistical crisis in science please

In other words allow the scientific community to fully understand and replicate what you have done.


Bradford, Michael J, Josh Korman, and Paul S Higgins. 2005. “Using Confidence Intervals to Estimate the Response of Salmon Populations (Oncorhynchus Spp.) to Experimental Habitat Alterations.” Canadian Journal of Fisheries and Aquatic Sciences 62 (12): 2716–26.

Greenland, Sander, and Charles Poole. 2013. “Living with P Values: Resurrecting a Bayesian Perspective on Frequentist Statistics.” Epidemiology 24 (1): 62–68.

Kery, Marc, and Michael Schaub. 2011. Bayesian Population Analysis Using WinBUGS : A Hierarchical Perspective. Boston: Academic Press.