Multicollinearity: Why Your Regression Model's Coefficients Are Making Things Up

Your regression model ran. The R-squared looks great. Individual coefficients, though, are doing something strange: one of your obviously important variables is showing up with a negative sign, another has a standard error wider than a highway, and the whole thing shifts dramatically every time you add or remove a predictor. Congratulations. You have a multicollinearity problem.

Most people learn about multicollinearity as a footnote in a stats course. It gets a paragraph, a warning, and then everyone moves on to cross-validation. That's a mistake, because multicollinearity doesn't just make your model ugly. It makes the coefficients actively misleading while leaving the overall fit completely intact.

What's Actually Happening

Regression works by isolating the contribution of each predictor, holding everything else constant. That phrase "holding everything else constant" is doing enormous work. When two predictors move together almost perfectly, the math cannot separate their individual effects. The algebra becomes unstable. Small changes in the data produce wild swings in the estimated coefficients.

Think about trying to figure out how much of a car's fuel efficiency comes from engine size versus vehicle weight. Those two variables are correlated. Heavier cars tend to have bigger engines. If you only ever observe cars where weight and engine size increase together, you have no information about what happens when one changes and the other doesn't. The regression will give you an answer anyway. That answer will be garbage.
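To see the instability rather than take it on faith, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the data are synthetic, the two predictors stand in for something like weight and engine size, both true slopes are set to 1.0, and the function names are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ intercept + X (intercept first)."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def coefficient_spread(rho, n=200, n_sims=500, seed=0):
    """Std. dev. of the two slope estimates across simulated datasets.

    The two predictors have correlation `rho`; both true slopes are 1.0.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    slopes = np.empty((n_sims, 2))
    for i in range(n_sims):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = X[:, 0] + X[:, 1] + rng.normal(size=n)  # noise sd = 1
        slopes[i] = fit_ols(X, y)[1:]               # drop the intercept
    return slopes.std(axis=0)

print("slope spread, rho=0.00:", coefficient_spread(0.0))
print("slope spread, rho=0.99:", coefficient_spread(0.99))
```

Same sample size, same noise, same true effects. The only thing that changes is how tightly the predictors move together, and the spread of the estimated slopes should come out several times larger in the correlated case.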

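Here's the part that quietly ruins analyses: the predictions themselves are often fine. Multicollinearity inflates coefficient variance without necessarily degrading predictive accuracy. The model can pin down the combined effect of the correlated predictors even when it cannot split that effect between them. So it can generalize reasonably well to new data, provided the predictors keep moving together the way they did in training, while the individual coefficients tell a story with no reliable interpretation.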
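A rough sketch of why predictions survive: with two nearly identical predictors, each slope on its own is poorly determined, but their sum is not. The simulation below uses synthetic data with true slopes of 1 and 1 and a noise standard deviation of 1. The function name and all settings are illustrative assumptions, and the test data are drawn with the same correlation as the training data.

```python
import numpy as np

def simulate(rho, n_train=100, n_test=1000, n_sims=300, seed=1):
    """Return (sd of slope 1, sd of slope 1 + slope 2, mean test RMSE)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])

    def draw(n):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = X[:, 0] + X[:, 1] + rng.normal(size=n)
        return np.column_stack([np.ones(n), X]), y

    slopes, rmses = [], []
    for _ in range(n_sims):
        A_train, y_train = draw(n_train)
        A_test, y_test = draw(n_test)
        coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
        slopes.append(coef[1:])
        rmses.append(np.sqrt(np.mean((y_test - A_test @ coef) ** 2)))
    slopes = np.array(slopes)
    return slopes[:, 0].std(), slopes.sum(axis=1).std(), float(np.mean(rmses))

slope_sd, sum_sd, rmse = simulate(rho=0.999)
print("sd of one slope:      ", slope_sd)
print("sd of the slope sum:  ", sum_sd)
print("mean out-of-sample RMSE:", rmse)
```

The individual slope should swing wildly from dataset to dataset, the sum should barely move, and the test RMSE should sit close to the noise level of 1. The fit is fine. The attribution is not.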

The Variance Inflation Factor

The standard diagnostic is the Variance Inflation Factor, or VIF. For each predictor, you regress it on all the other predictors and take the R-squared of that auxiliary regression. The VIF is 1 / (1 - R-squared). A VIF of 1 means no linear relationship with other predictors. A VIF of 5 means the variance of that coefficient is five times larger than it would be if the predictor were uncorrelated with everything else. Above 10 is widely treated as a serious problem, though some analysts flag anything above 5.
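The calculation is short enough to write by hand. This is a sketch using only NumPy, not any particular library's implementation: the `vif` helper is hypothetical, and the three columns are synthetic stand-ins (two related predictors and one unrelated one).

```python
import numpy as np

def vif(X):
    """VIF for each column of X (n_samples x n_predictors)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Auxiliary regression: predictor j on an intercept plus all the others.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        ss_res = np.sum((target - others @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        out[j] = 1.0 / (1.0 - r_squared)
    return out

rng = np.random.default_rng(0)
n = 500
weight = rng.normal(size=n)
engine_size = weight + 0.3 * rng.normal(size=n)  # moves with weight
unrelated = rng.normal(size=n)                   # independent of both
X = np.column_stack([weight, engine_size, unrelated])

vifs = vif(X)
print(dict(zip(["weight", "engine_size", "unrelated"], vifs.round(2))))
```

The two related columns should land above the usual thresholds and the unrelated one close to 1. As a sanity check, the same numbers are the diagonal of the inverse of the predictors' correlation matrix.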

```mermaid
graph TD
    A[Fit full regression model] --> B{Check VIF for each predictor}
    B --> C[VIF under 5]
    B --> D[VIF 5 to 10]
    B --> E[VIF over 10]
    C --> F(Coefficients likely stable)
    D --> G(Investigate correlation structure)
    E --> H{Consider dropping, combining, or regularizing}
```

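VIF is useful, and it catches something a correlation matrix misses: two predictors can have a low pairwise correlation and still be part of a near-linear dependence that only shows up when you consider the whole predictor matrix. But VIF gives you one number per predictor. It doesn't tell you how close to singular the design matrix is as a whole, so always look at the condition number of the design matrix too, computed on scaled columns so that units of measurement don't dominate. A condition number above 30 is a common rule-of-thumb signal that something is structurally wrong with the predictor space.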
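Here is a small sketch of that situation, on synthetic data: four independent predictors plus a fifth that is almost exactly their sum. No single pairwise correlation looks alarming, but the matrix as a whole is nearly rank-deficient. The scaling convention used here (add an intercept, scale every column to unit length) is one common choice and an assumption of this example, as is the helper name.

```python
import numpy as np

def scaled_condition_number(X):
    """Condition number of [1, X] after scaling every column to unit length."""
    X = np.asarray(X, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])
    design = design / np.linalg.norm(design, axis=0)
    # Ratio of the largest to the smallest singular value.
    return np.linalg.cond(design)

rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=(n, 4))                           # four independent predictors
combo = base.sum(axis=1) + 0.05 * rng.normal(size=n)     # nearly their sum
X = np.column_stack([base, combo])

corr = np.corrcoef(X, rowvar=False)
max_pairwise = np.max(np.abs(corr - np.eye(X.shape[1])))  # largest off-diagonal entry

print("largest pairwise correlation:", round(max_pairwise, 2))
print("condition number without combo:", round(scaled_condition_number(base), 1))
print("condition number with combo:   ", round(scaled_condition_number(X), 1))
```

A correlation heatmap of these five columns would look unremarkable. The condition number with the fifth column included should land well past 30.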

Why Dropping Variables Isn't Always the Answer

The reflexive response is to drop one of the correlated predictors. Sometimes that's right. Often it trades one problem for another: omitted variable bias. If you drop a variable that genuinely belongs in the model, the remaining coefficients absorb its effect in distorted ways. You've fixed the variance problem by introducing a bias problem.
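The trade is easy to reproduce. In this sketch the data are synthetic: two predictors with correlation 0.9, each with a true slope of 1.0. Dropping the second one pushes its effect onto the first, and in this setup theory says the surviving slope should drift toward 1 + 0.9 = 1.9. The `slopes` helper and every number here are illustrative assumptions.

```python
import numpy as np

def slopes(X, y):
    """OLS slopes for y ~ intercept + X (intercept dropped from the result)."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]

rng = np.random.default_rng(0)
n, rho = 5000, 0.9
x1 = rng.normal(size=n)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)  # corr(x1, x2) is about rho
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

full = slopes(np.column_stack([x1, x2]), y)  # both predictors in the model
short = slopes(x1.reshape(-1, 1), y)         # x2 dropped

print("slope on x1, full model:", round(full[0], 2))
print("slope on x1, x2 dropped:", round(short[0], 2))
```

The short model's coefficient is stable, tight, and wrong as a measure of what x1 does on its own. It is now reporting the effect of x1 plus everything that travels with it.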

Ridge regression is frequently a better tool. By adding a penalty on the squared size of the coefficients (the L2 penalty), ridge shrinks them toward zero and trades a little bias for a substantial reduction in variance. The coefficients you get are not unbiased, but they're stable. If interpretation matters less than stability and prediction, ridge is often the right call.
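A minimal sketch of that trade, using the closed-form ridge solution on centered data rather than any library's estimator. The data are synthetic (two predictors with correlation 0.99, true slopes of 1.0), and the penalty value of 10 is picked by hand for illustration. In practice it would usually be chosen by cross-validation, and the predictors would be standardized first; here they already have unit variance.

```python
import numpy as np

def ridge_slopes(X, y, lam):
    """Closed-form ridge slopes on centered data; lam=0 is ordinary least squares."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

def compare(rho=0.99, lam=10.0, n=200, n_sims=500, seed=2):
    """Mean and sd of OLS and ridge slopes across simulated datasets."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    ols, ridge = [], []
    for _ in range(n_sims):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = X[:, 0] + X[:, 1] + rng.normal(size=n)
        ols.append(ridge_slopes(X, y, 0.0))
        ridge.append(ridge_slopes(X, y, lam))
    ols, ridge = np.array(ols), np.array(ridge)
    return ols.mean(axis=0), ols.std(axis=0), ridge.mean(axis=0), ridge.std(axis=0)

ols_mean, ols_sd, ridge_mean, ridge_sd = compare()
print("OLS   mean, sd:", ols_mean.round(2), ols_sd.round(2))
print("ridge mean, sd:", ridge_mean.round(2), ridge_sd.round(2))
```

The ridge slopes should average slightly below the true value of 1.0 (that's the bias) while varying far less from one dataset to the next (that's what you bought with it).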

Principal component regression is another path. Decompose the correlated predictors into orthogonal components, keep the ones that carry most of the variance, and regress on those. You lose the original coefficient interpretations entirely, but the math is no longer fighting itself.
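As a sketch of the mechanics, the version below standardizes the predictors, takes an SVD, keeps the leading components, and regresses on their scores. The `pcr_fit` helper is hypothetical, the data are synthetic (two nearly identical predictors plus one independent one), and keeping two components is a choice made by hand for this example.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Principal component regression via SVD of the standardized predictors.

    Returns (scores, gamma, beta_std): component scores, coefficients on the
    components, and the implied coefficients on the standardized predictors.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)  # singular values sorted descending
    V_k = Vt[:n_components].T                         # loadings of the kept components
    scores = Z @ V_k                                  # orthogonal by construction
    # Scores are orthogonal, so each coefficient is a separate simple regression.
    gamma = (scores.T @ (y - y.mean())) / s[:n_components] ** 2
    beta_std = V_k @ gamma
    return scores, gamma, beta_std

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)              # independent
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + 0.5 * x3 + rng.normal(size=n)

scores, gamma, beta_std = pcr_fit(X, y, n_components=2)
print("coefficients on components:", gamma.round(3))
print("implied standardized coefficients:", beta_std.round(3))
```

The component dropped here is essentially the difference between x1 and x2, which is the direction the data say almost nothing about. The implied coefficients on x1 and x2 should come out nearly equal, because the method has stopped trying to tell them apart.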

The Real Cost

Data science culture is obsessed with model fit metrics. R-squared, RMSE, AUC: these get reported, tracked, celebrated. Coefficient stability gets ignored. That asymmetry has consequences when models are used to make decisions.

Imagine a marketing mix model where spend on two highly correlated channels (say, display and paid social) shows up with a negative coefficient for one. An analyst reads that coefficient and concludes cutting display spend would improve revenue. The prediction surface of the model might be perfectly reasonable. That one coefficient is not.

Multicollinearity doesn't announce itself loudly. It produces numbers that look like results. The discipline is knowing when to distrust the output even when the fit statistics look clean. R-squared is not absolution. Your coefficients can still be lying to you, confidently, and sometimes with statistical significance attached.
