Multiple Regression


 

 


In multiple regression, there is more than one predictor variable (more than one x variable, in the terms of our previous discussion). The coefficient of multiple correlation (symbolized as R) is the Pearson correlation between the criterion variable and a linear combination of the predictors. You have seen linear combinations before (L in Hyperstat). A major difference in multiple regression is the coefficients of that combination. They are not neat grouping values; they are weights chosen to produce the best possible correlation between the linear combination and the criterion. These are the regression coefficients, and they come in two forms, just as with bivariate regression: standardized weights and unstandardized weights. When the weights are standardized, the predictors and the criterion are expressed as z scores.

 

A major difference between bivariate regression and multiple regression is that the beta weights (the standardized regression coefficients) are not equal to the correlation (r) between the predictor and the criterion unless there is zero correlation between predictors. Zero correlation between predictors is rare, and so the standardized regression coefficients depend on all the variables and the intercorrelations among them. The standardized regression coefficients satisfy the equation shown below for k predictor variables, where each r is the correlation between the corresponding predictor and the criterion.

 

$R^2 = \beta_1 r_1 + \beta_2 r_2 + \cdots + \beta_k r_k$
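As a minimal sketch of this identity for two predictors, the Python below computes the standardized weights from three correlations and checks that R squared equals the sum of the beta-times-r products. The correlation values and function names are hypothetical, chosen only for illustration, and the two-predictor formula used for the betas is the standard closed-form solution rather than anything given in this lesson.

```python
# Hypothetical correlations (illustrative values, not from the lesson).
R_Y1 = 0.60  # predictor 1 with the criterion
R_Y2 = 0.50  # predictor 2 with the criterion
R_12 = 0.40  # predictor 1 with predictor 2


def standardized_betas(r_y1, r_y2, r_12):
    """Standardized weights for two predictors (standard closed form)."""
    denom = 1.0 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    return beta1, beta2


def r_squared(r_y1, r_y2, r_12):
    """R squared as the sum of each beta times its predictor-criterion r."""
    beta1, beta2 = standardized_betas(r_y1, r_y2, r_12)
    return beta1 * r_y1 + beta2 * r_y2


beta1, beta2 = standardized_betas(R_Y1, R_Y2, R_12)
print("betas:", round(beta1, 3), round(beta2, 3))
print("R squared:", round(r_squared(R_Y1, R_Y2, R_12), 3))

# With uncorrelated predictors each beta reduces to its own r.
print("betas when r_12 = 0:", standardized_betas(R_Y1, R_Y2, 0.0))
```

With these assumed values the betas come out near .48 and .31, both smaller than the corresponding correlations of .60 and .50, and R squared is about .44.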

 

We learned in our lesson on bivariate correlation and regression that the square of Pearson r is the proportion of variance in one variable that is accounted for by variance in the other variable. It is the proportion of variance shared between the two variables. You should be able to see from the equation above that beta is r when the predictors are not correlated with one another.

 

The Venn diagrams (http://en.wikipedia.org/wiki/John_Venn) shown below are a useful way of depicting shared variance graphically. In the figure below, the yellow area is $r^2$. If we consider the area of the circles for the variables to be one by definition, then the yellow area is the proportion of the variance shared between the variables. If we had written a bivariate regression equation for these two variables, then the yellow area would be $\beta^2$ for the regression equation predicting the criterion variable y from the predictor variable x.

 

$z'_y = \beta z_x$

 

 

 

In multiple regression, there is more than one predictor. A standardized regression equation for two predictors is shown below:

 

$z'_y = \beta_1 z_{x_1} + \beta_2 z_{x_2}$
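To make the two-predictor equation concrete, here is a small sketch that builds the predicted z scores from made-up data and then correlates them with the criterion, which is how R was defined at the start of this lesson. The six scores per variable, the helper names, and the closed-form expressions for the two betas are all assumptions for illustration; the lesson itself does not supply data or a solution method.

```python
from math import sqrt

# Hypothetical scores for six people (illustrative data, not from the lesson).
x1 = [2.0, 4.0, 5.0, 7.0, 8.0, 10.0]
x2 = [1.0, 3.0, 2.0, 6.0, 4.0, 8.0]
y = [3.0, 5.0, 4.0, 8.0, 9.0, 9.0]


def z_scores(values):
    """Convert raw scores to z scores using the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def pearson_r(a, b):
    """Pearson correlation between two lists of scores."""
    za, zb = z_scores(a), z_scores(b)
    return sum(p * q for p, q in zip(za, zb)) / (len(a) - 1)


r_y1, r_y2, r_12 = pearson_r(x1, y), pearson_r(x2, y), pearson_r(x1, x2)

# Standardized weights for two predictors (standard closed-form solution).
beta1 = (r_y1 - r_y2 * r_12) / (1.0 - r_12 ** 2)
beta2 = (r_y2 - r_y1 * r_12) / (1.0 - r_12 ** 2)

# Predicted z scores: the linear combination of the standardized predictors.
z_pred = [beta1 * a + beta2 * b for a, b in zip(z_scores(x1), z_scores(x2))]

# R is the Pearson correlation between the criterion and that combination.
multiple_r = pearson_r(z_pred, y)

print("r with x1:", round(r_y1, 3), " r with x2:", round(r_y2, 3))
print("multiple R:", round(multiple_r, 3))
```

Whatever data are used, the multiple R from this combination should be at least as large as either predictor's own correlation with the criterion, since the weights are the ones that maximize it.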

 

 

The Venn diagram below represents the two-predictor equation. The square of the correlation between $z_{x_1}$ and $z_{x_2}$ is the sum of the purple and blue areas. The blue area represents the variance shared among both predictor variables and the criterion, and neither predictor can claim to predict this variance better than the other.

 

 

 

 

 

 

 The yellow area in the figure above is the variance in the criterion accounted for by Predictor 1 alone -- without Predictor 1, this shared variance would not be there. Note that with highly correlated predictors, it is possible for all predictors to have relatively small unique contributions, yet the combination can be a very good predictor. The Venn diagram below shows this situation with three variables. Again, the yellow areas are contributions of the individual predictors over and above the variance shared with other predictors (variance not shared by any other predictors).

 

 

 

Notice that the total amount of variance in the criterion which is accounted for by the linear combination of the three predictors (the sum of the blue and yellow areas) is very large (obviously more than 50%). At the same time, the individual betas for the predictors are going to be relatively small in order for the equation $R^2 = \beta_1 r_1 + \beta_2 r_2 + \cdots + \beta_k r_k$ to remain true. That is, none of the variables uniquely predicts much variance in the criterion, and none of them can claim any share of the blue area exclusively. You should be able to see that the standardized regression coefficient thus represents the relative "importance" of the predictor variable in the equation. That is, the yellow area is all that is lost in our ability to predict the criterion if the variable is removed from the analysis. In fact, in some uses of multiple regression, we make use of that idea.
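The pattern of small unique contributions alongside a large R squared can be sketched numerically. The sketch below assumes three hypothetical predictors that each correlate .70 with the criterion and .80 with one another, and it treats each predictor's unique contribution as the drop in R squared when that predictor is removed, which is one common reading of the yellow areas. The function names and the small equation solver are illustrative, not part of the lesson.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(r_xx, r_xy):
    """R squared as the sum of beta times r over all predictors."""
    betas = solve(r_xx, r_xy)
    return sum(beta * r for beta, r in zip(betas, r_xy))


def unique_contributions(r_xx, r_xy):
    """Drop in R squared when each predictor is removed in turn."""
    full = r_squared(r_xx, r_xy)
    k = len(r_xy)
    drops = []
    for j in range(k):
        keep = [i for i in range(k) if i != j]
        sub_xx = [[r_xx[p][q] for q in keep] for p in keep]
        sub_xy = [r_xy[p] for p in keep]
        drops.append(full - r_squared(sub_xx, sub_xy))
    return full, drops


# Hypothetical correlations (illustrative values only).
predictor_corrs = [[1.0, 0.8, 0.8],
                   [0.8, 1.0, 0.8],
                   [0.8, 0.8, 1.0]]
criterion_corrs = [0.7, 0.7, 0.7]

full_r2, drops = unique_contributions(predictor_corrs, criterion_corrs)
print("betas:", [round(b, 3) for b in solve(predictor_corrs, criterion_corrs)])
print("R squared:", round(full_r2, 3))
print("unique contributions:", [round(d, 3) for d in drops])
```

For these assumed correlations R squared is about .57 and each beta is about .27, yet removing any one predictor lowers R squared by only about .02.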

 

The figure below represents the situation where there is zero correlation between the predictors, and therefore the betas are equal to the Pearson r between each predictor and the criterion. There are no "contested" blue areas, and the yellow area is the total shared variance between a predictor and the criterion, just as in the first figure in this lesson.

 

 

 

 

 

 

 

 

Testing the significance of standardized regression coefficients.

 

It is possible to test the beta values for each predictor to determine if they are significantly different from zero (if they have any predictive ability at all). David Lane's Hyperstat covers this method, but the computation for the standard error is not given. The standard error for the beta test is given below:

 

 

$s_{\beta_j} = \left[ \frac{1 - R^2_y}{(1 - R^2_j)(N - K - 1)} \right]^{1/2}$

 

where $R^2_y$ is the $R^2$ for the analysis, and $R^2_j$ is the $R^2$ for predicting variable j from the rest of the predictor variables. N is the total number of participants, and K is the number of predictor variables. The t value for testing the significance of the beta is thus the beta value divided by the standard error above, and the critical value is t for N-K-1 df.
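A minimal sketch of this test in Python follows. The input values are hypothetical: two predictors that correlate .40 with each other (so the R squared for predicting either one from the other is .16), an overall R squared of about .44, a beta of about .476, and N = 50. The function names are illustrative.

```python
from math import sqrt


def beta_standard_error(r2_y, r2_j, n, k):
    """Standard error of a standardized weight, per the formula above."""
    return sqrt((1.0 - r2_y) / ((1.0 - r2_j) * (n - k - 1)))


def beta_t_test(beta, r2_y, r2_j, n, k):
    """Return the t value and its degrees of freedom (N - K - 1)."""
    t_value = beta / beta_standard_error(r2_y, r2_j, n, k)
    return t_value, n - k - 1


# Hypothetical inputs (illustrative values only).
r2_overall = 0.37 / 0.84   # R squared for the whole analysis, about .44
r2_predictor = 0.40 ** 2   # predictor j predicted from the other predictor
beta_j = 0.40 / 0.84       # standardized weight being tested, about .476

se = beta_standard_error(r2_overall, r2_predictor, n=50, k=2)
t_value, df = beta_t_test(beta_j, r2_overall, r2_predictor, n=50, k=2)
print("standard error:", round(se, 3))
print("t:", round(t_value, 2), "df:", df)
```

Here the standard error is about .119 and t is 4.0 on 47 df, which exceeds the two-tailed .05 critical value of roughly 2.01.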

 

The three major categories of multiple regression analyses.

 

The three major categories of multiple regression analyses are

 

1) standard or simultaneous

 

2) sequential

 

3) stepwise

 

Although we will not go into the details of these methods, you should know what they are all about.

 

 

Standard or simultaneous

 

In this approach, the beta weights are computed and tested for significance. The interpretation is mainly the size of $R^2$ and the magnitudes and significance of the various betas. This is the basic form of analysis which produces an equation for prediction.

 

 

Sequential

 

In the sequential method, a number of analyses are run. Predictor variables are added to (or removed from) the equation and the analysis is repeated. In this way, the researcher can test hypotheses about the gains or losses in predictive ability when a predictor variable is included or excluded. The method uses a test of significance for the change in $R^2$ after a variable is added or deleted. The researcher decides the order of the addition or deletion of the variables based on theory. For instance, I might be interested in predicting graduate GPA from undergraduate GPA, GRE, and entrance exam scores. I may want to know if GRE is really necessary, so I would first do a regression with the predictors of undergraduate GPA and entrance exam scores. Then I would do another analysis with GRE in addition to those original variables. The statistical test would be for a change (increase) in $R^2$ between the two analyses. If $R^2$ increases significantly, then GRE adds to the prediction of graduate GPA.
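The test for the change in R squared is commonly computed as an F ratio. The sketch below uses that standard formula, which this lesson does not spell out, with hypothetical R squared values for the GPA example. The function name and all numbers are assumptions for illustration.

```python
def r2_change_f(r2_reduced, r2_full, n, k_reduced, k_full):
    """F ratio for the gain in R squared when predictors are added.

    Returns the F value with its numerator and denominator df.
    """
    df_num = k_full - k_reduced   # number of predictors added
    df_den = n - k_full - 1       # residual df for the larger model
    f_value = ((r2_full - r2_reduced) / df_num) / ((1.0 - r2_full) / df_den)
    return f_value, df_num, df_den


# Hypothetical results (illustrative values only):
# step 1 uses undergraduate GPA and entrance exam scores,
# step 2 adds GRE as a third predictor.
f_value, df_num, df_den = r2_change_f(
    r2_reduced=0.30, r2_full=0.37, n=100, k_reduced=2, k_full=3
)
print("F:", round(f_value, 2), "df:", df_num, df_den)
```

With these made-up values F is about 10.67 on 1 and 96 df, which is larger than the .05 critical value of about 3.94, so GRE would be said to add to the prediction.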

 

 

Stepwise 

In the stepwise approach, the same system of adding and deleting predictor variables is used, but the decisions as to which variables to add or delete are made from the correlations between the variable and the criterion. This method seems to provide an easy way to dump in a lot of predictors and let the computer tell you which ones are important. The method has severe problems. It should never be used except with very large samples because the error variance of the correlations can lead to serious errors. The most important variables could be thrown out based on sampling error for the correlation, and they may never be added.

 

 

Conclusion

 

Multiple regression is the first technique in our series which takes us away from any practical hand computation. With three or more predictors, matrix algebra is required to solve for betas. So, for those who think I am a lost man in a lost world for teaching computational methods, we are finally at the end of that road. Welcome to the middle of the twentieth century (when this stuff was all worked out on mainframes!). If you are interested in solving a multiple regression by hand, I did one during one of the more compulsive periods of my life:

 

http://www.jamesstacks.com/stat/multregr.ppt