Analysts often include too many independent variables in their models to improve model quality and find statistical patterns. We demonstrate how this sometimes leads to inaccurate findings.
Let’s use an example from the fast-moving consumer goods industry, similar to the one in the previous post. But instead of regressing volume against household income alone, we include other variables.
We test various variables (in itself a dubious practice because variable choice should be driven by theory) and find that population, unemployment rate, price, marketing spend and distribution appear to correlate* with volume.
We therefore regress volume against household income and these variables, and get an R² of 0.88. The p-values, key numbers that statisticians typically examine**, indicate that the independent variables’ coefficients are statistically significant. However, there is a problem.
The income coefficient shows that a 1% increase in income would lead to an increase of over 9% in volume. Similarly, every 1% increase in population would result in a 30% decrease in volume! These coefficients are nonsensical even though they are statistically significant. This is an example of junk statistics. Why does it happen?
The table below indicates that some of the independent variables correlate highly with each other, a problem known as collinearity. Both the population and unemployment rate variables correlate strongly with household income. Marketing spend correlates highly with price.
With such collinearity, the regression cannot separate the effects of the correlated variables, so the individual coefficients are not meaningful.
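A quick way to screen for collinearity is to look at the pairwise correlations and at the variance inflation factor (VIF) of each variable. The sketch below uses made-up data, not the data behind this post: the variable names, the sample size and the strength of the income/population link are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic, made-up data: population tracks income closely, price is independent.
income = rng.normal(size=n)
population = 0.95 * income + 0.1 * rng.normal(size=n)
price = rng.normal(size=n)

X = np.column_stack([income, population, price])
names = ["income", "population", "price"]

# Pairwise correlation matrix (each column is one variable).
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
for name, v in zip(names, vifs):
    print(f"{name}: VIF = {v:.1f}")
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a warning sign. In this synthetic setup, income and population should be flagged while price should not.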
One way to correct for collinearity is to remove independent variables that correlate highly with others. We run a new regression, and the result gives a lower R² but more accurate coefficient estimates. In other words, we can now say with more confidence that a price decrease of 1% leads to 1.5% growth in volume, holding the other independent variables in the model constant.
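The effect of dropping a collinear variable can be illustrated with a small simulation. This is a sketch on invented data, not a reproduction of the regression in this post: the "true" coefficients (1.0 for income, -1.5 for price, none for population) and the noise levels are assumptions, and the variables are meant to be read as logs so that coefficients act like elasticities.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Made-up data, read as logs so that coefficients are elasticities.
income = rng.normal(size=n)
population = 0.95 * income + 0.1 * rng.normal(size=n)  # nearly a copy of income
price = rng.normal(size=n)
# Assumed "true" relationship: population has no effect of its own.
volume = 1.0 * income - 1.5 * price + 0.5 * rng.normal(size=n)

def ols(columns, y):
    """Return OLS coefficients and their standard errors (intercept first)."""
    A = np.column_stack([np.ones(len(y))] + columns)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sigma2 = resid @ resid / (len(y) - A.shape[1])  # residual variance
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    return coef, se

coef_full, se_full = ols([income, population, price], volume)
coef_red, se_red = ols([income, price], volume)

print("income, full model:   ", round(coef_full[1], 2), "+/-", round(se_full[1], 2))
print("income, reduced model:", round(coef_red[1], 2), "+/-", round(se_red[1], 2))
print("price, reduced model: ", round(coef_red[2], 2), "+/-", round(se_red[2], 2))
```

In the full model the standard error of the income coefficient is several times larger, because income and population carry almost the same information. Note that the simulation assumes the dropped variable has no effect of its own; removing a variable that genuinely matters would bias the remaining coefficients instead.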
In conclusion, adding more independent variables to a regression does not guarantee a better model or meaningful statistical patterns. It is crucial to inspect the variables for problems such as collinearity, which causes the model results to be inaccurate.
Another problem with adding many independent variables is that R² never decreases when a variable is added, so a long list of variables will always push R² up. In fact, as the number of variables approaches the number of observations, R² automatically approaches 1.
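This mechanical rise in R² is easy to reproduce. The sketch below regresses pure noise on more and more pure-noise variables; the sample size of 20 and the seed are arbitrary assumptions, and nothing in the data has any real relationship to anything else.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
y = rng.normal(size=n)                 # outcome is pure noise
noise = rng.normal(size=(n, n - 1))    # candidate regressors, also pure noise

def r_squared(X, y):
    """R^2 of an OLS fit of y on X with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    return 1 - (resid @ resid) / (centered @ centered)

# Fit nested models that use the first k noise columns.
r2_by_k = {k: r_squared(noise[:, :k], y) for k in (1, 5, 10, 15, 19)}
for k, r2 in r2_by_k.items():
    print(f"{k:2d} noise variables: R^2 = {r2:.3f}")
```

With 19 variables plus an intercept there are as many estimated parameters as observations, so the fit passes through every point and R² reaches 1 even though the variables are meaningless.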
Usually, a simple model with only a few variables gives the most reliable results.
Lesson 3: A model should be parsimonious. That is, it should use as few variables as possible to eliminate issues like collinearity and imaginary model fit.
* Correlation is a statistical measure of how two variables move in relation to each other. Positive correlation implies that as one variable moves, the other will move in the same direction. Negative correlation implies that the two variables will move in opposite directions.
** We typically look for a p-value of less than 0.05. The higher the p-value, the more likely it is that the estimated relationship between volume and an independent variable is just a random occurrence.