Mallows's Cp

In statistics, Mallows's Cp,[1][2] named for Colin Lingwood Mallows, is used to assess the fit of a regression model that has been estimated using ordinary least squares. It is applied in the context of model selection, where a number of predictor variables are available for predicting some outcome, and the goal is to find the best model involving a subset of these predictors.

Mallows's Cp has been shown to be equivalent to the Akaike information criterion in the special case of Gaussian linear regression.[3]

Definition and properties

Mallows's Cp addresses the issue of overfitting, in which model selection statistics such as the residual sum of squares always get smaller as more variables are added to a model. Thus, if we aim to select the model giving the smallest residual sum of squares, the model including all variables would always be selected. The Cp statistic calculated on a sample of data estimates the mean squared prediction error (MSPE) as its population target

E\sum_j (\hat{Y}_j - E(Y_j\mid X_j))^2/\sigma^2,

where \hat{Y}_j is the fitted value from the regression model for the jth case, E(Y_j \mid X_j) is the expected value for the jth case, and \sigma^2 is the error variance (assumed constant across the cases). The MSPE will not automatically get smaller as more variables are added. The optimum model under this criterion is a compromise influenced by the sample size, the effect sizes of the different predictors, and the degree of collinearity between them.
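The overfitting problem described above, that the residual sum of squares cannot increase when a regressor is added, can be checked numerically. The following is a minimal sketch on simulated data; the data-generating model, the seed, and the helper name `sse` are illustrative assumptions rather than part of the original treatment.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from an OLS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

# Simulated data (assumption): an intercept, one real predictor, four noise columns.
rng = np.random.default_rng(1)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 5))])
y = 3.0 + 1.5 * X[:, 1] + rng.normal(size=n)  # columns 2..5 are pure noise

# Fit a nested sequence of models using the first p columns, p = 1..6.
sse_path = [sse(X[:, :p], y) for p in range(1, X.shape[1] + 1)]

for p, value in enumerate(sse_path, start=1):
    print(f"columns used: {p}  SSE: {value:.3f}")
```

Each added column leaves the residual sum of squares unchanged or lower, even though four of the five predictors are unrelated to the outcome, which is why the smallest residual sum of squares always points to the largest model.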

If P regressors are selected from a set of K > P, the Cp statistic for that particular set of regressors is defined as:

C_p={SSE_p \over S^2} - N + 2P,

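where

  • SSE_p = \sum_{i=1}^N (Y_i - \hat{Y}_{pi})^2 is the error sum of squares for the model with P regressors,
  • \hat{Y}_{pi} is the predicted value of the ith observation of Y from the P regressors,
  • S^2 is the residual mean square after regression on the complete set of K regressors, which can be estimated by the mean square error (MSE), and
  • N is the sample size.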
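The definition of Cp can be sketched in a few lines of NumPy. The function names `residual_sum_of_squares` and `mallows_cp` and the simulated data are hypothetical. The sketch assumes that every column of the design matrix counts as a regressor, including an intercept column if one is present, and that S^2 is taken from the full K-column fit.

```python
import numpy as np

def residual_sum_of_squares(X, y):
    """SSE from an OLS fit of y on the columns of X (X is used as given)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def mallows_cp(X_full, y, subset):
    """Cp = SSE_p / S^2 - N + 2P for the columns of X_full listed in `subset`."""
    n, k = X_full.shape
    p = len(subset)
    # S^2: residual mean square of the full model with all K columns.
    s2 = residual_sum_of_squares(X_full, y) / (n - k)
    sse_p = residual_sum_of_squares(X_full[:, list(subset)], y)
    return sse_p / s2 - n + 2 * p

# Simulated data (assumption): only column 1 has a real effect.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = 1.0 + 2.0 * X[:, 1] + rng.normal(size=n)

cp_full = mallows_cp(X, y, [0, 1, 2, 3])   # all K = 4 columns
cp_small = mallows_cp(X, y, [0, 1])        # intercept plus the real predictor
print(cp_full, cp_small)
```

With S^2 computed from the full model, the full model's Cp equals K by construction, since SSE_K / S^2 = N - K. Cp is therefore only informative for comparing proper subsets.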

Practical use

The Cp statistic is often used as a stopping rule for various forms of stepwise regression. Mallows proposed the statistic as a criterion for selecting among many alternative subset regressions. Under a model not suffering from appreciable lack of fit (bias), Cp has expectation nearly equal to P; otherwise the expectation is roughly P plus a positive bias term. Nevertheless, even though it has expectation greater than or equal to P, there is nothing to prevent Cp < P or even Cp < 0 in extreme cases. It is suggested that one should choose a subset that has Cp approaching P,[4] from above, for a list of subsets ordered by increasing P. In practice, the positive bias can be adjusted for by selecting a model from the ordered list of subsets, such that Cp < 2P.
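The subset comparison described above can be sketched as an exhaustive enumeration. This is an illustrative sketch, not Mallows's own procedure: the helper names `sse` and `cp_table`, the simulated data, and the choice to keep an intercept column in every subset (and to count it in P) are all assumptions.

```python
from itertools import combinations

import numpy as np

def sse(X, y):
    """Residual sum of squares from an OLS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def cp_table(X, y):
    """Return (columns, P, Cp) for every subset that keeps column 0, ordered by P."""
    n, k = X.shape
    s2 = sse(X, y) / (n - k)  # S^2 from the full K-column model
    rows = []
    for size in range(k):  # number of non-intercept columns in the subset
        for extra in combinations(range(1, k), size):
            cols = (0,) + extra
            p = len(cols)
            cp = sse(X[:, list(cols)], y) / s2 - n + 2 * p
            rows.append((cols, p, cp))
    return rows

# Simulated data (assumption): columns 1 and 2 matter, column 3 does not.
rng = np.random.default_rng(2)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = 0.5 + 2.0 * X[:, 1] - 1.0 * X[:, 2] + rng.normal(size=n)

rows = cp_table(X, y)
for cols, p, cp in rows:
    flag = "Cp < 2P" if cp < 2 * p else ""
    print(f"columns={cols}  P={p}  Cp={cp:8.2f}  {flag}")
```

Reading the table in order of increasing P, subsets whose Cp is far above P show appreciable lack of fit, while those with Cp near P are candidates. Exhaustive enumeration grows as 2^(K-1) here, so it is only practical for a small number of candidate predictors.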

Since the sample-based Cp statistic is an estimate of the MSPE, using Cp for model selection does not completely guard against overfitting. For instance, it is possible that the selected model will be one in which the sample Cp was a particularly severe underestimate of the MSPE.

Model selection statistics such as Cp are generally not used blindly; rather, information about the field of application, the intended use of the model, and any known biases in the data is taken into account in the process of model selection.

References

  • Hocking, R. R. (1976). "The Analysis and Selection of Variables in  
  1. ^ Mallows, C. L. (1973). "Some Comments on Cp". Technometrics 15 (4): 661–675.
  2. ^ Gilmour, Steven G. (1996). "The interpretation of Mallows's Cp-statistic". Journal of the Royal Statistical Society, Series D 45 (1): 49–56.  
  3. ^ Boisbunon, Aurélie; Canu, Stephane; Fourdrinier, Dominique; Strawderman, William; Wells, Martin T. (2014-05-27). "AIC, Cp and estimators of loss for elliptically symmetric distributions".
  4. ^ Daniel, C.; Wood, F. (1980). Fitting Equations to Data (Rev. ed.). New York: Wiley & Sons, Inc. 