# 12. Glossary

**Autocorrelation**: Autocorrelation is the correlation of a signal
with a lagged copy of itself. Conceptually, you can think of it as how
similar observations are as a function of the time lag between them.
Large autocorrelation is a concern in MCMC samples as it reduces the
effective sample size.
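As a minimal sketch of this idea, the snippet below estimates the lag-1 autocorrelation of white noise and of a strongly correlated AR(1) series. The helper `autocorr` and the simulated series are illustrative assumptions, not something defined elsewhere in this book.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of a 1-D array at a non-negative integer lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    if lag == 0:
        return 1.0
    # Correlate the series with a copy of itself shifted by `lag` positions
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

rng = np.random.default_rng(0)
noise = rng.normal(size=5000)

# AR(1) series: each value keeps 90% of the previous one plus fresh noise
ar1 = np.empty_like(noise)
ar1[0] = noise[0]
for t in range(1, len(noise)):
    ar1[t] = 0.9 * ar1[t - 1] + noise[t]

rho_noise = autocorr(noise, 1)  # close to 0 for independent draws
rho_ar1 = autocorr(ar1, 1)      # close to 0.9 for the correlated series
```

A chain that behaves like the second series carries much less information per draw than one that behaves like the first.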

**Aleatoric Uncertainty**: Aleatoric uncertainty is related to the
notion that there are some quantities that affect a measurement or
observation that are intrinsically unknowable or random. For example,
even if we were able to exactly replicate conditions such as direction,
altitude and force when shooting an arrow with a bow, the arrow would
still not hit the same point, because there are other conditions that we
do not control, like fluctuations of the atmosphere or vibrations of the
arrow shaft, that are random.

**Bayesian Inference**: Bayesian Inference is a particular form of
statistical inference based on combining probability distributions in
order to obtain other probability distributions. In other words, it is
the formulation and computation of conditional probabilities or
probability densities,
\(p(\boldsymbol{\theta} \mid \boldsymbol{Y}) \propto p(\boldsymbol{Y} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta})\).
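A small grid approximation can make the proportionality concrete. This is only a sketch under assumed inputs: the data (6 successes in 9 Bernoulli trials) and the flat prior are made up for illustration, with `theta` playing the role of \(\boldsymbol{\theta}\) and the counts playing the role of \(\boldsymbol{Y}\).

```python
import numpy as np

# Hypothetical data: 6 successes in 9 Bernoulli trials
successes, trials = 6, 9

theta = np.linspace(0, 1, 1001)          # grid over the parameter
dx = theta[1] - theta[0]

prior = np.ones_like(theta)              # flat prior p(theta)
likelihood = theta**successes * (1 - theta) ** (trials - successes)  # p(Y | theta)

# Posterior is proportional to likelihood times prior; normalize numerically
unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dx)

posterior_mean = float((theta * posterior).sum() * dx)
posterior_mode = float(theta[np.argmax(posterior)])
```

For this conjugate case the exact posterior is a Beta(7, 4) distribution, so the grid results can be checked against its mean 7/11 and mode 6/9.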

**Bayesian workflow**: Designing a good enough model for a given problem
requires significant statistical and domain knowledge expertise. Such
design is typically carried out through an iterative process called
Bayesian workflow. This process includes the three steps of model
building [17]: inference, model checking/improvement, and model
comparison. In this context the purpose of model comparison is not
necessarily restricted to picking the *best* model, but more importantly
to better understand the models.

**Causal inference**: Also known as observational causal inference. The
procedures and tools used to estimate the impact of a treatment (or
intervention) in some system without testing the intervention, that is,
from observational data instead of experimental data.

**Covariance Matrix and Precision Matrix**: The covariance matrix is a
square matrix that contains the covariance between each pair of elements
of a collection of random variables. The diagonal of the covariance
matrix contains the variances of the random variables. The precision
matrix is the matrix inverse of the covariance matrix.
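The relation between the two matrices can be sketched with NumPy. The two-dimensional Gaussian samples and the values in `true_cov` are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
true_cov = np.array([[2.0, 0.6],
                     [0.6, 1.0]])
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=20_000)

# Each column is one random variable, so rows are observations
cov = np.cov(samples, rowvar=False)

# The precision matrix is the inverse of the covariance matrix
precision = np.linalg.inv(cov)

# The diagonal of the covariance matrix holds the variances
variances = np.diag(cov)
```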

**Design Matrix**: In the context of regression analysis, a design
matrix is a matrix of values of the explanatory variables. Each row
represents an individual observation, with the successive columns
corresponding to the variables and their specific values for that
observation. It can contain indicator variables (ones and zeros)
indicating group membership, or it can contain continuous values.
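As an illustration, the following sketch builds a small design matrix by hand. The covariates `age` and `group` are hypothetical, and the intercept column is a common convention rather than a requirement.

```python
import numpy as np

# Hypothetical observations: one continuous covariate and one group label
age = np.array([23.0, 35.0, 41.0, 58.0])
group = np.array(["a", "b", "a", "b"])

X = np.column_stack([
    np.ones_like(age),              # intercept column
    age,                            # continuous values
    (group == "b").astype(float),   # indicator: 1 if the row belongs to group "b"
])
```

Each row of `X` is one observation and each column is one explanatory variable.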

**Decision tree**: A decision tree is a flowchart-like structure in
which each internal node represents a “test” on an attribute (e.g.
whether a coin flip comes up heads or tails), each branch represents the
outcome of the test, and each leaf node represents a class label
(decision taken after computing all attributes). The paths from root to
leaf represent classification rules. The values at the leaf nodes can be
continuous if the tree is used for regression.

**dse**: The standard error of the component-wise differences of
`elpd_loo` between two models. This error is smaller than the standard
error (`se` in `az.compare`) for individual models. The reason is that
some observations are generally as easy or as hard to predict for all
models, which introduces correlations.

**d_loo**: The difference in `elpd_loo` for two models. If more than
two models are compared, the difference is computed relative to the
model with the highest `elpd_loo`.
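The sketch below shows why the standard error of the difference can be much smaller than the standard error of each model. The pointwise values are simulated, not produced by a real LOO computation, and the formula used for the standard error of a sum (square root of the number of observations times the pointwise variance) is a common estimate that is assumed here rather than taken from any particular library.

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs = 200

# Made-up pointwise elpd values. The shared `difficulty` term mimics
# observations that are easy or hard to predict for both models.
difficulty = rng.normal(-1.0, 1.0, size=n_obs)
elpd_i_a = difficulty + rng.normal(0.0, 0.1, size=n_obs)
elpd_i_b = difficulty - 0.05 + rng.normal(0.0, 0.1, size=n_obs)

def se_of_sum(pointwise):
    """Standard error of a sum of pointwise contributions."""
    return float(np.sqrt(len(pointwise) * np.var(pointwise)))

diff_i = elpd_i_a - elpd_i_b       # component-wise differences
d_loo = float(diff_i.sum())        # difference in total elpd
dse = se_of_sum(diff_i)            # standard error of the difference
se_a = se_of_sum(elpd_i_a)         # standard error for model a alone
se_b = se_of_sum(elpd_i_b)         # standard error for model b alone
```

Because the shared difficulty cancels in `diff_i`, `dse` ends up far smaller than either `se_a` or `se_b`.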

**Epistemic Uncertainty**: Epistemic uncertainty is related to the lack
of knowledge of the states of a system by some observer. It is related
to the knowledge that we could have in principle but not in practice, and
not to the intrinsically unknowable quantities of nature (contrast with
aleatoric uncertainty). For example, we may be uncertain of the weight of
an item because we do not have a scale at hand, so we estimate the
weight by lifting it, or we may have a scale but with a precision
limited to the kilogram. We could also have epistemic uncertainty if we
design an experiment or perform a computation ignoring factors. For
example, to estimate how much time we will have to drive to another
city, we may omit the time spent at tolls, or we may assume excellent
weather or road conditions. In other words, epistemic uncertainty is
about ignorance and, in contrast to aleatoric uncertainty, we can in
principle reduce it by obtaining more information.

**Statistic**: A statistic (not plural) or sample statistic is any
quantity computed from a sample. Sample statistics are computed for
several reasons including estimating a population (or data generating
process) parameter, describing a sample, or evaluating a hypothesis. The
sample mean (also known as empirical mean) is a statistic, the sample
variance (or empirical variance) is another example. When a statistic is
used to estimate a population (or data generating process) parameter,
the statistic is called an estimator. Thus, the sample mean can be an
estimator and the posterior mean can be another estimator.

**ELPD**: Expected Log-pointwise Predictive Density (or expected log
pointwise predictive probability for discrete models). This quantity is
generally estimated by cross-validation or using methods such as WAIC
(`elpd_waic`) or LOO (`elpd_loo`). As probability densities can be
smaller or larger than 1, the ELPD can be negative or positive for
continuous variables. For discrete variables, probabilities are never
larger than 1, so the ELPD is non-positive.
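The sign of a log density or log probability follows from simple arithmetic, sketched below with hand-written formulas. The helper functions and the chosen parameter values are illustrative assumptions. A density can exceed 1, so its log can be positive, whereas a probability mass never exceeds 1, so its log is never positive.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def poisson_logpmf(k, rate):
    """Log probability mass of a Poisson distribution."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)

# Narrow normal: the density at the mean is about 3.99, so its log is positive
narrow = normal_logpdf(0.0, mu=0.0, sigma=0.1)

# Wide normal: the density at the mean is about 0.04, so its log is negative
wide = normal_logpdf(0.0, mu=0.0, sigma=10.0)

# Discrete case: a probability is at most 1, so its log is at most 0
count = poisson_logpmf(3, rate=2.5)
```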

**Exchangeability**: A sequence of random variables is exchangeable if
their joint probability distribution does not change when the positions
in the sequence are altered. Exchangeable random variables are not
necessarily iid, but iid random variables are exchangeable.

**Exploratory Analysis of Bayesian Models**: The collection of tasks
necessary to perform a successful Bayesian data analysis that are not
the inference itself. This includes: diagnosing the quality of the
inference results obtained using numerical methods; model criticism,
including evaluations of both model assumptions and model predictions;
comparison of models, including model selection or model averaging; and
preparation of the results for a particular audience.

**Hamiltonian Monte Carlo**: Hamiltonian Monte Carlo (HMC) is a Markov
chain Monte Carlo (MCMC) method that uses the gradient to efficiently
explore a probability distribution function. In Bayesian statistics this
is most commonly used to obtain samples from the posterior distribution.
HMC methods are instances of the Metropolis–Hastings algorithm, where
the proposed new points are computed from a Hamiltonian. This allows the
method to propose new states that are far from the current one with high
acceptance probability. The evolution of the system is simulated using a
time-reversible and volume-preserving numerical integrator (most
commonly the leapfrog integrator). The efficiency of the HMC method is
highly dependent on certain hyperparameters of the method. Thus, the
most useful methods in Bayesian statistics are adaptive dynamics
versions of HMC that can adjust those hyperparameters automatically
during the warm-up or tuning phase.
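A bare-bones version of the method can be sketched in a few lines. This is not an adaptive sampler: the step size and number of leapfrog steps are fixed by hand, the target is assumed to be a two-dimensional standard normal, and all function names are illustrative.

```python
import numpy as np

def neg_log_density(q):
    """Potential energy of a standard normal target (up to a constant)."""
    return 0.5 * np.dot(q, q)

def grad_neg_log_density(q):
    """Gradient of the potential energy."""
    return q

def leapfrog(q, p, step_size, n_steps):
    """Time-reversible, volume-preserving integration of Hamiltonian dynamics."""
    q, p = q.copy(), p.copy()
    p -= 0.5 * step_size * grad_neg_log_density(q)   # half step for momentum
    for _ in range(n_steps - 1):
        q += step_size * p                           # full step for position
        p -= step_size * grad_neg_log_density(q)     # full step for momentum
    q += step_size * p
    p -= 0.5 * step_size * grad_neg_log_density(q)   # final half step
    return q, p

def hmc_step(q, rng, step_size=0.2, n_steps=10):
    """One HMC transition: simulate a trajectory, then accept or reject it."""
    p = rng.normal(size=q.shape)                     # resample the momentum
    h_current = neg_log_density(q) + 0.5 * np.dot(p, p)
    q_new, p_new = leapfrog(q, p, step_size, n_steps)
    h_new = neg_log_density(q_new) + 0.5 * np.dot(p_new, p_new)
    # Metropolis-Hastings correction for the integration error
    if np.log(rng.uniform()) < h_current - h_new:
        return q_new, True
    return q, False

rng = np.random.default_rng(3)
q = np.zeros(2)
draws, accepted = [], 0
for _ in range(2000):
    q, ok = hmc_step(q, rng)
    draws.append(q)
    accepted += ok
draws = np.array(draws)
acceptance_rate = accepted / len(draws)
```

With these hand-picked settings most proposals are accepted even though each one moves far from the current state. Picking such settings automatically is what the adaptive versions do during tuning.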

**Heteroscedasticity**: A sequence of random variables is
heteroscedastic if its random variables do not have the same variance,
i.e. if they are not homoscedastic. This is also known as heterogeneity
of variance.

**Homoscedasticity**: A sequence of random variables is homoscedastic if
all its random variables have the same finite variance. This is also
known as homogeneity of variance. The complementary notion is called
heteroscedasticity.

**iid**: Independent and identically distributed. A collection of random
variables is independent and identically distributed if each random
variable has the same probability distribution as the others and all are
mutually independent. If a collection of random variables is iid it is
also exchangeable, but the converse is not necessarily true.

**Individual Conditional Expectation (ICE)**: An ICE plot shows the
dependence between the response variable and a covariate of interest.
This is done for each sample separately, with one line per sample. This
contrasts with PDPs, where the average effect of the covariate is
represented.

**Inference**: Colloquially, inference is reaching a conclusion based on
evidence and reasoning. In this book, when we refer to inference we
generally mean Bayesian inference, which has a more restricted and
precise definition. Bayesian inference is the process of conditioning
models on the available data and obtaining posterior distributions.
Thus, in order to reach a conclusion based on evidence and reasoning, we
need to perform more steps than mere Bayesian inference. Hence the
importance of discussing Bayesian analysis in terms of exploratory
analysis of Bayesian models or, more generally, in terms of Bayesian
workflows.

**Imputation**: Replacing missing data values through a method of
choice. Common methods include filling in the most common value or
interpolating from other (present) observed data.
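Both strategies can be sketched with the standard library and NumPy. The small arrays below are made-up examples, and linear interpolation is only one of several possible interpolation schemes.

```python
from collections import Counter

import numpy as np

# Most common occurrence, here for a categorical variable with gaps
colors = ["red", None, "blue", "red", None]
most_common = Counter(c for c in colors if c is not None).most_common(1)[0][0]
colors_imputed = [most_common if c is None else c for c in colors]

# Linear interpolation, here for an ordered numeric series with gaps
y = np.array([1.0, np.nan, 3.0, 4.0, np.nan, 8.0])
missing = np.isnan(y)
idx = np.arange(len(y))

y_imputed = y.copy()
y_imputed[missing] = np.interp(idx[missing], idx[~missing], y[~missing])
```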

**KDE**: Kernel Density Estimation. A non-parametric method to estimate
the probability density function of a random variable from a finite set
of samples. We often use the term KDE to talk about the estimated
density and not the method.
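A Gaussian-kernel version of the method fits in a few lines. This is a sketch, not a replacement for a library implementation: the function name `kde_gaussian` is hypothetical, and the bandwidth uses a simple normal-reference rule of thumb chosen for illustration.

```python
import numpy as np

def kde_gaussian(samples, grid, bandwidth):
    """Average of Gaussian kernels centered on each sample, evaluated on a grid."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(4)
samples = rng.normal(size=500)
grid = np.linspace(-6, 6, 601)

# Rule-of-thumb bandwidth (an assumption; other choices are possible)
bandwidth = 1.06 * samples.std(ddof=1) * len(samples) ** (-1 / 5)

density = kde_gaussian(samples, grid, bandwidth)
area = float(density.sum() * (grid[1] - grid[0]))  # should be close to 1
```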

**LOO**: Short for Pareto smoothed importance sampling leave-one-out
cross-validation (PSIS-LOO-CV). In the literature “LOO” may be
restricted to mean leave-one-out cross-validation.

**Maximum a Posteriori (MAP)**: An estimator of an unknown quantity that
equals the mode of the posterior distribution. The MAP estimator
requires optimization of the posterior, unlike the posterior mean which
requires integration. If the priors are flat, or in the limit of
infinite sample size, the MAP estimator is equivalent to the Maximum
Likelihood estimator.
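For a conjugate Beta-Binomial model the three estimators have closed forms, which makes the comparison easy to sketch. The data (6 successes in 9 trials), the prior parameters, and the function name are assumptions made for this example.

```python
def beta_binomial_summaries(successes, trials, a, b):
    """MAP, posterior mean and maximum likelihood estimate for a Beta(a, b) prior."""
    a_post = a + successes
    b_post = b + trials - successes
    # Mode of a Beta distribution, valid when both parameters exceed 1
    map_estimate = (a_post - 1) / (a_post + b_post - 2)
    posterior_mean = a_post / (a_post + b_post)
    mle = successes / trials
    return map_estimate, posterior_mean, mle

# Flat prior Beta(1, 1): the MAP coincides with the maximum likelihood estimate
flat = beta_binomial_summaries(6, 9, a=1, b=1)

# Informative prior Beta(5, 5): the MAP is pulled toward 0.5
informative = beta_binomial_summaries(6, 9, a=5, b=5)
```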

**Odds**: A measure of the likelihood of a particular outcome. They are
calculated as the ratio of the number of events that produce that
outcome to the number that do not. Odds are commonly used in gambling.
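The definition translates directly into arithmetic. The function names below are illustrative, and the die example is an assumption added for concreteness.

```python
def odds_from_counts(favorable, unfavorable):
    """Odds as the ratio of events that produce the outcome to those that do not."""
    return favorable / unfavorable

def odds_from_probability(p):
    """Equivalent definition in terms of the probability of the outcome."""
    return p / (1 - p)

def probability_from_odds(odds):
    """Convert odds back into a probability."""
    return odds / (1 + odds)

# Rolling a six with a fair die: 1 favorable face, 5 unfavorable faces
six_odds = odds_from_counts(1, 5)
same_odds = odds_from_probability(1 / 6)
back_to_probability = probability_from_odds(six_odds)
```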

**Overfitting**: A model overfits when it produces predictions that
agree too closely with the dataset used for fitting the model, failing
to fit new datasets. In terms of the number of parameters, an overfitted
model contains more parameters than can be justified by the data. An
arbitrarily over-complex model will fit not only the data but also the
noise, leading to poor predictions.

**Partial Dependence Plots (PDP)**: A PDP shows the dependence between
the response variable and a set of covariates of interest. This is done
by marginalizing over the values of all other covariates. Intuitively,
we can interpret the partial dependence as the expected value of the
response variable as a function of the covariates of interest.
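The marginalization can be sketched by brute force: fix the covariate of interest at each grid value for every sample, predict, and average. The function `model` below is a hypothetical stand-in for any fitted predictor, and the simulated covariates are assumptions. The per-sample curves computed along the way are the ICE curves.

```python
import numpy as np

def model(X):
    """Hypothetical fitted model with an interaction between two covariates."""
    return 2.0 * X[:, 0] + X[:, 0] * X[:, 1]

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 2))
grid = np.linspace(-2, 2, 21)        # values for the covariate of interest

ice = np.empty((len(X), len(grid)))  # one row (curve) per sample
for j, value in enumerate(grid):
    X_mod = X.copy()
    X_mod[:, 0] = value              # fix the covariate of interest
    ice[:, j] = model(X_mod)         # keep all other covariates as observed

pdp = ice.mean(axis=0)               # average over samples: partial dependence
```

Averaging hides the interaction that the individual curves reveal, which is why the two kinds of plots are often inspected together.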

**Pareto k estimates** \(\hat k\): A diagnostic for Pareto smoothed
importance sampling (PSIS), which is used by LOO. The Pareto k
diagnostic estimates how far an individual leave-one-out observation is
from the full distribution. If leaving out an observation changes the
posterior too much then importance sampling is not able to give reliable
estimates. If \(\hat k < 0.5\), then the corresponding component of
`elpd_loo` is estimated with high accuracy. If \(0.5 < \hat k < 0.7\)
the accuracy is lower, but still useful in practice. If
\(\hat k > 0.7\), then importance sampling is not able to provide a
useful estimate for that observation. The \(\hat k\) values are also
useful as a measure of influence of an observation. Highly influential
observations have high \(\hat k\) values. Very high \(\hat k\)
values often indicate model misspecification, outliers, or mistakes in
the data processing.

**Point estimate**: A single value, generally but not necessarily in
parameter space, used as a summary or *best estimate* of an unknown
quantity. A point estimate can be contrasted with an interval estimate
like highest density intervals, which provides a range or interval of
values describing the unknown quantity. We can also contrast a point
estimate with distributional estimates, like the posterior distribution
or its marginals.

**p_loo**: The difference between `elpd_loo` and the
non-cross-validated log posterior predictive density. It describes how
much more difficult it is to predict future data than the observed data.
Asymptotically under certain regularity conditions, `p_loo` can be
interpreted as the effective number of parameters. In well-behaved
cases `p_loo` should be lower than the number of parameters in the model
and smaller than the number of observations in the data. If not, this is
an indication that the model has very weak predictive capability and may
thus indicate a severe model misspecification. See also high Pareto k
diagnostic values.

**Probabilistic Programming Language**: A programming syntax composed of
primitives that allows one to define Bayesian models and perform
inference automatically. Typically a Probabilistic Programming Language
also includes functionality to generate prior or posterior predictive
samples or even to analyze the results of inference.

**Prior predictive distribution**: The expected distribution of the data
according to the model (prior and likelihood). That is, the data the
model is expecting to see before seeing any data. See Equation
(1.7). The prior predictive
distribution can be used for prior elicitation, as it is generally
easier to think in terms of the observed data, than to think in terms of
model parameters.

**Posterior predictive distribution**: This is the distribution of
(future) data according to the posterior, which in turn is a consequence
of the model (prior and likelihood) and observed data. In other words,
these are the model’s predictions. See Equation
(1.8). Besides generating
predictions, the posterior predictive distribution can be used to assess
the model fit, by comparing it with the observed data.
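Both predictive distributions can be sampled in two stages: draw parameters, then draw data given those parameters. The sketch below assumes a Beta-Binomial model with a flat prior and made-up data (6 successes in 9 trials), for which the posterior is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(6)
successes, trials = 6, 9     # hypothetical observed data
a, b = 1.0, 1.0              # flat Beta(1, 1) prior
n_draws = 5000

# Prior predictive: parameters come from the prior
theta_prior = rng.beta(a, b, size=n_draws)
prior_predictive = rng.binomial(trials, theta_prior)

# Posterior predictive: parameters come from the posterior Beta(a + y, b + n - y)
theta_post = rng.beta(a + successes, b + trials - successes, size=n_draws)
posterior_predictive = rng.binomial(trials, theta_post)
```

The prior predictive draws spread over all possible counts, while the posterior predictive draws concentrate around counts compatible with the observed data.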

**Residuals**: The difference between an observed value and the
estimated value of the quantity of interest. If a model assumes that the
variance is finite and the same for all residuals, we say we have
homoscedasticity. If instead the variance can change, we say we have
heteroscedasticity.

**Sufficient statistics**: A statistic is sufficient with respect to a
model parameter if no other statistic computed from the same sample
provides any additional information about that parameter. In other words,
that statistic is *sufficient* to summarize your samples without losing
information. For example, given a sample of independent values from a
normal distribution with expected value \(\mu\) and known finite variance,
the sample mean is a sufficient statistic for \(\mu\). Notice that the mean
says nothing about the dispersion, thus it is only sufficient with
respect to the parameter \(\mu\). It is known that for iid data the only
distributions with a sufficient statistic with dimension equal to the
dimension of the parameter vector \(\theta\) are the distributions from
the exponential family. For other distributions, the dimension of the
sufficient statistic increases with the sample size.
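The normal example can be checked numerically. In the sketch below, two made-up samples share the same size and sample mean but differ in spread. With known variance, the change in log-likelihood between any two candidate values of \(\mu\) is the same for both, so the sample mean carries all the information about \(\mu\).

```python
import numpy as np

def normal_loglik(mu, data, sigma=1.0):
    """Log-likelihood of iid normal data with known standard deviation."""
    data = np.asarray(data, dtype=float)
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                        - (data - mu) ** 2 / (2 * sigma**2)))

data_a = [1.0, 2.0, 3.0]   # sample mean 2, small spread
data_b = [0.0, 2.0, 4.0]   # sample mean 2, larger spread

# How much the log-likelihood prefers mu=2.5 over mu=1.0 for each sample
diff_a = normal_loglik(2.5, data_a) - normal_loglik(1.0, data_a)
diff_b = normal_loglik(2.5, data_b) - normal_loglik(1.0, data_b)
```

The two log-likelihoods differ only by a constant that does not depend on \(\mu\), so `diff_a` and `diff_b` agree.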

**Synthetic data**: Also known as fake data, it refers to data generated
from a model instead of being gathered from experimentation or
observation. Samples from the posterior/prior predictive distributions
are examples of synthetic data.

**Timestamp**: A timestamp is encoded information that identifies when a
certain event happens. Usually a timestamp is written in the format of
date and time of day, with fractions of a second added when more
accuracy is necessary.
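As a small illustration, Python's standard library can encode and decode such a value. The specific date and time below are arbitrary.

```python
from datetime import datetime, timezone

# Date, time of day, and microseconds, expressed in UTC
event = datetime(2021, 3, 14, 15, 9, 26, 535897, tzinfo=timezone.utc)

text = event.isoformat()                # ISO 8601 string representation
parsed = datetime.fromisoformat(text)   # round trip back to a datetime
seconds_since_epoch = event.timestamp() # seconds since 1970-01-01 UTC
```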

**Turing-complete**: In colloquial usage, this term is used to mean that
any real-world general-purpose computer or computer language can
approximately simulate the computational aspects of any other real-world
general-purpose computer or computer language.