Article Id: WHEBN0000027576

Title: Statistical model  
Author: World Heritage Encyclopedia
Language: English
Subject: Optimal design, Mathematical model, Linear least squares (mathematics), Analysis of variance, Identifiability
Collection: Mathematical Modeling, Scientific Modeling, Statistical Models, Statistical Theory
Publisher: World Heritage Encyclopedia

Statistical model

A statistical model is a set of assumptions concerning the generation of the observed data, and similar data from a larger population. The model represents, often in considerably idealized form, the data-generating process.

A statistical model is used to describe a set of probability distributions, some of which are assumed to reasonably approximate the distribution from which a particular data set is sampled. The model is formally specified by relationships among one or more random variables and other (non-random) variables; the relationships are usually given as mathematical equations. Herman Adèr quotes Kenneth Bollen as saying, "A model is a formal representation of a theory".[1]

All statistical tests can be described in the form of statistical models. For example, the Student's t-test for comparing the means of two sets can be formulated as determining if an estimated parameter in the model is different from 0. Similarly, all statistical estimates, such as confidence intervals, are derived from statistical models.
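As a sketch of this point (with made-up illustrative numbers, not data from the article), the two-sample t-test can be computed directly from the model view: the data are modelled as group mean plus Gaussian noise, and the test asks whether the group-difference parameter is distinguishable from 0.

```python
import numpy as np

# Hypothetical samples (illustrative values only).
a = np.array([5.1, 4.9, 5.3, 5.0, 5.2])
b = np.array([5.6, 5.8, 5.5, 5.9, 5.7])

# Model: observation = group mean + Gaussian noise. The t-test asks whether
# the parameter delta = mean(b) - mean(a) differs from 0.
na, nb = len(a), len(b)
delta = b.mean() - a.mean()

# Pooled variance estimate under the equal-variance model.
sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
t_stat = delta / np.sqrt(sp2 * (1 / na + 1 / nb))
print(delta, t_stat)  # estimated parameter and its t statistic
```

A large |t| relative to the t distribution with na + nb - 2 degrees of freedom indicates that the difference parameter is unlikely to be 0.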


Contents

  • Formal definition
  • An example
  • Dimension of a model
  • Nested models
  • Model comparison
  • Notes
  • References

Formal definition

In mathematical terms, a statistical model is usually thought of as a pair (S, \mathcal{P}), where S is the set of possible observations, i.e. the sample space, and \mathcal{P} is a set of probability distributions on S.[2]

The intuition behind this definition is as follows. It is assumed that there is a "true" probability distribution that generates the observed data. We choose \mathcal{P} to represent a set (of distributions) which contains a distribution that adequately approximates the true distribution. Note that we do not require that \mathcal{P} contains the true distribution, and in practice that is rarely the case. Indeed, as Burnham & Anderson state, "A model is a simplification or approximation of reality and hence will not reflect all of reality".[3]

The set \mathcal{P} is usually parameterized: \mathcal{P}=\{P_{\theta} : \theta \in \Theta\}. The set \Theta defines the parameters of the model.

An example

Height and age are each probabilistically distributed over humans. They are stochastically related: knowing that a person is aged 10 influences the chance of that person being 6 feet tall. We could formalize the relationship in a linear regression model of the following form: height_i = b_0 + b_1 age_i + ε_i, where b_0 is the intercept, b_1 is the coefficient by which age is multiplied to obtain a prediction of height, ε_i is the error term, and i identifies the person. This means that height is predicted by age, with some error.

A model must fit all the data points. Thus, the straight line (height_i = b_0 + b_1 age_i) cannot, by itself, be a model of the data unless all the data points lie exactly on the line. The error term, ε_i, must be included in the model, so that the model can account for all the data points.

To do statistical inference, we would first need to assume some probability distributions for the ε_i. For instance, we might assume that the ε_i are i.i.d. Gaussian, with zero mean. In this instance, the model would have 3 parameters (or, equivalently, one 3-dimensional parameter): b_0, b_1, and the variance of the Gaussian distribution.

We can formally specify the model in the form (S, \mathcal{P}) as follows. The sample space, S, of our model comprises the set of all possible pairs (age, height). Each possible value of the parameter \theta = (b_0, b_1, \sigma^2) determines a distribution on S; denote that distribution by P_{\theta}. If \Theta is the set of all possible values of \theta, then we have \mathcal{P}=\{P_{\theta} : \theta \in \Theta\}.
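A minimal sketch of fitting this three-parameter model, using synthetic data (all parameter values below are hypothetical assumptions, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic (hypothetical) data: height in cm as a linear function of age plus noise.
n = 200
age = rng.uniform(2, 18, size=n)
true_b0, true_b1, true_sigma = 75.0, 6.0, 5.0
height = true_b0 + true_b1 * age + rng.normal(0.0, true_sigma, size=n)

# Least-squares estimates of theta = (b0, b1, sigma^2).
X = np.column_stack([np.ones_like(age), age])
b0_hat, b1_hat = np.linalg.lstsq(X, height, rcond=None)[0]
residuals = height - (b0_hat + b1_hat * age)
sigma2_hat = residuals.var()  # maximum-likelihood estimate of the error variance
print(b0_hat, b1_hat, sigma2_hat)
```

Each fitted value of θ = (b_0, b_1, σ²) picks out one distribution P_θ on the sample space of (age, height) pairs.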

Dimension of a model

Suppose that we have a statistical model (S, \mathcal{P}) with \mathcal{P}=\{P_{\theta} : \theta \in \Theta\}. The model is said to be parametric if \Theta has a finite dimension. In notation, we write that \Theta \subseteq \mathbb{R}^d where d is a positive integer (\mathbb{R} denotes the real numbers; other sets can be used, in principle). Here, d is the dimension of the model.

As an example, if we assume that data arise from a univariate Gaussian distribution, then we are assuming that

\mathcal{P}=\{P_{\mu,\sigma }(x) \equiv \frac{1}{\sqrt{2 \pi} \sigma} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2}\right) : \mu \in \mathbb{R}, \sigma > 0\}.

In this example, the dimension, d, equals 2. Similarly, if we assume that data are distributed according to a straight line with i.i.d. Gaussian residuals, then the dimension is 3: the intercept of the line, the slope of the line, and the variance of the distribution of the residuals (the mean of the distribution of the residuals is zero).
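A short sketch of the d = 2 case, using a synthetic sample (the true mean and standard deviation below are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(10.0, 2.0, size=1000)  # hypothetical sample

# The univariate Gaussian family has a 2-dimensional parameter theta = (mu, sigma).
mu_hat = x.mean()
sigma_hat = x.std()  # maximum-likelihood estimate uses the 1/n variance
theta_hat = (mu_hat, sigma_hat)
print(theta_hat, len(theta_hat))  # the model's dimension d = 2
```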

A statistical model is nonparametric if the parameter set \Theta is infinite dimensional. A statistical model is semiparametric if its parameter has both a finite-dimensional and an infinite-dimensional component. Formally, if d is the dimension of \Theta and n is the number of samples, both semiparametric and nonparametric models have d \rightarrow \infty as n \rightarrow \infty. If d/n \rightarrow 0 as n \rightarrow \infty, then the model is semiparametric; otherwise, the model is nonparametric.

Parametric models are by far the most commonly used statistical models. Regarding semiparametric and nonparametric models, Sir David Cox has said, "These typically involve fewer assumptions of structure and distributional form but usually contain strong assumptions about independencies".[4]

Nested models

Two statistical models are said to be nested if the first model can be transformed into the second model by imposing constraints on the parameters of the first model. For example, the set of all Gaussian distributions has, nested within it, the set of zero-mean Gaussian distributions: we constrain the mean in the set of all Gaussian distributions to get the zero-mean distributions. In this example, the first model has a higher dimension than the second model (the zero-mean model has dimension 1); that is usually, but not always, the case. As a different example, the set of positive-mean Gaussian distributions, which has dimension 2, is nested within the set of all Gaussian distributions.

Model comparison

It is assumed that there is a "true" probability distribution that generates the observed data. The main goal of model selection is to make statements about which elements of \mathcal{P} are most likely to adequately approximate the true distribution.

Models can be compared to each other, in either an exploratory data analysis or a confirmatory data analysis. In an exploratory analysis, we formulate all the models we can think of and see which of them describes the data best. In a confirmatory analysis, we check which of the models that we described before the data were collected fits the data best, or we test whether our only model fits the data.

Common tools for comparing models include R², the Bayes factor, and the likelihood-ratio test together with its generalization, the relative likelihood.
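As one illustration of such a comparison, the likelihood-ratio test can compare the nested Gaussian models discussed above (mean constrained to zero vs. free mean). The data below are a synthetic, assumed sample; the ~3.84 threshold is the usual 5% cutoff for a chi-squared distribution with 1 degree of freedom.

```python
import numpy as np

def gauss_loglik(x, mu, sigma):
    # Log-likelihood of an i.i.d. Gaussian sample.
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2))

rng = np.random.default_rng(2)
x = rng.normal(0.5, 1.0, size=500)  # hypothetical data with a small nonzero mean

# Full model (dimension 2): free mean and variance.
mu_full, sigma_full = x.mean(), x.std()
ll_full = gauss_loglik(x, mu_full, sigma_full)

# Nested model (dimension 1): mean constrained to 0, variance free.
sigma_0 = np.sqrt(np.mean(x**2))  # maximum-likelihood sigma under mu = 0
ll_0 = gauss_loglik(x, 0.0, sigma_0)

# Likelihood-ratio statistic; if the nested model were adequate, this would be
# roughly chi-squared with 1 degree of freedom, so values far above ~3.84
# favour the fuller model.
lr = 2 * (ll_full - ll_0)
print(lr)
```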

Konishi & Kitagawa state: "The majority of the problems in statistical inference can be considered to be problems related to statistical modeling. They are typically formulated as comparisons of several statistical models."[5]

Notes
  1. ^ Adèr
  2. ^ McCullagh
  3. ^ Burnham & Anderson, §1.2.5
  4. ^ Cox, p.2
  5. ^ Konishi & Kitagawa, p.75

References


  • Adèr H.J. (2008), "Modelling". In H.J. Adèr & G.J. Mellenbergh (editors), Advising on Research Methods: a consultant's companion (Chapter 12: p.271-304). Huizen, The Netherlands: Johannes van Kessel Publishing.
  • Burnham K.P., Anderson D.R. (2002), Model Selection and Multimodel Inference, Springer.
  • Cox D.R. (2006), Principles of Statistical Inference, Cambridge University Press.
  • Konishi S., Kitagawa G. (2008), Information Criteria and Statistical Modeling, Springer.
  • McCullagh P. (2002), "What is a statistical model?", Annals of Statistics, 30: 1225-1310.
