Multiple Regression with Continuous and Categorical Variables

(MR Technique for ANCOVA)

 

 

       Categorical (nominal) variables can be coded for multiple regression analyses in several ways, three of which we will examine here. The multiple regression approach to ANOVA is a complex topic, so to keep this introduction manageable I will limit our discussion to cases where every category of the categorical variable (every group) has the same number of subjects (n). We will discuss three types of coding: (1) dummy coding, (2) effect coding, and (3) orthogonal contrast coding. We will then explore how the multiple regression analysis is run and interpreted for each type of coding.

 

       When we add a set of coded categorical variables to an equation that already has an interval-level predictor, we are essentially doing ANCOVA, with the interval predictor as the covariate. For this reason, I have included ANCOVA output alongside the regression output for the coding methods below to show how they relate.

 

In coding nominal variables, the first thing to do is to determine the number of categories the nominal variable has. For our purposes, we will assume we have a nominal variable for zip code, and we will assume our total sample has 10 people from each of 4 zip codes (n=10, N=40).

 

       For all types of categorical coding, the number of coded variables needed for the regression analysis is the number of categories (groups) minus 1. Since we have 4 categories in our variable, we will need 3 recoded regression variables to represent the one nominal variable of zip code. For each participant, we have a score of some type, which is the subject of our analysis (the criterion variable), and a rating of some type, which is an interval-level predictor variable.

 

The data for this example is shown below:

 

 

 

 

 

 

Participant #   Zip Code   Label   Rating   Score
 1              10023      1       32       11
 2              10023      1       51       14
 3              10023      1       33       13
 4              10023      1       39       13
 5              10023      1       29        8
 6              10023      1       23        8
 7              10023      1       48       13
 8              10023      1       77       20
 9              10023      1       21       13
10              10023      1       42       13
11              43229      2       47       22
12              43229      2       35       19
13              43229      2       78       28
14              43229      2       22       20
15              43229      2       42       21
16              43229      2       28       17
17              43229      2       33       18
18              43229      2       77       26
19              43229      2       48       26
20              43229      2       63       25
21              82673      3       52       12
22              82673      3       38        9
23              82673      3       85       17
24              82673      3       44       11
25              82673      3       45       12
26              82673      3       53       17
27              82673      3       50       11
28              82673      3       15       11
29              82673      3       63       14
30              82673      3       41        9
31              75428      4       55       13
32              75428      4       56       17
33              75428      4       28        8
34              75428      4       34       12
35              75428      4       26       10
36              75428      4       28       12
37              75428      4       26        9
38              75428      4       60       17
39              75428      4       69       18
40              75428      4       40        9

 

 

 

Dummy Coding

 

       1. Coding

 

       In dummy coding for our data, we have k = 4 categories for zip code, so we need k - 1 = 3 dummy variables. We will name these variables D1, D2, and D3. Each of the four zip code categories except one is assigned to one of the three dummy variables. It makes no difference how this assignment is made, other than the choice of the reference category. The unassigned category is the reference category, and the multiple regression results are easiest to interpret as comparisons of each of the other three categories with the reference category. Each dummy variable is then coded "1" for the cases in its assigned category and "0" for all other cases. In our case, I have chosen to make the reference category Commerce (75428, Label 4), so that our MR results will allow us to compare the other zip codes to our own zip code. The dummy variable coding for this analysis is shown below:

 

 

 

 

 

 

 

                                                    Dummy Variables
Participant #   Zip Code   Label   Rating   Score   D1   D2   D3
 1              10023      1       32       11      1    0    0
 2              10023      1       51       14      1    0    0
 3              10023      1       33       13      1    0    0
 4              10023      1       39       13      1    0    0
 5              10023      1       29        8      1    0    0
 6              10023      1       23        8      1    0    0
 7              10023      1       48       13      1    0    0
 8              10023      1       77       20      1    0    0
 9              10023      1       21       13      1    0    0
10              10023      1       42       13      1    0    0
11              43229      2       47       22      0    1    0
12              43229      2       35       19      0    1    0
13              43229      2       78       28      0    1    0
14              43229      2       22       20      0    1    0
15              43229      2       42       21      0    1    0
16              43229      2       28       17      0    1    0
17              43229      2       33       18      0    1    0
18              43229      2       77       26      0    1    0
19              43229      2       48       26      0    1    0
20              43229      2       63       25      0    1    0
21              82673      3       52       12      0    0    1
22              82673      3       38        9      0    0    1
23              82673      3       85       17      0    0    1
24              82673      3       44       11      0    0    1
25              82673      3       45       12      0    0    1
26              82673      3       53       17      0    0    1
27              82673      3       50       11      0    0    1
28              82673      3       15       11      0    0    1
29              82673      3       63       14      0    0    1
30              82673      3       41        9      0    0    1
31              75428      4       55       13      0    0    0
32              75428      4       56       17      0    0    0
33              75428      4       28        8      0    0    0
34              75428      4       34       12      0    0    0
35              75428      4       26       10      0    0    0
36              75428      4       28       12      0    0    0
37              75428      4       26        9      0    0    0
38              75428      4       60       17      0    0    0
39              75428      4       69       18      0    0    0
40              75428      4       40        9      0    0    0
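The coding rule in the table above can be sketched in a few lines of Python. This is only an illustration of the rule, not part of the SPSS analysis; the function name dummy_code and its arguments are hypothetical names chosen for this sketch.

```python
# Sketch of dummy coding for a nominal variable with k categories.
# dummy_code is an illustrative helper, not an SPSS or library function.

def dummy_code(label, categories, reference):
    """Return the k-1 dummy values (0/1) for one case."""
    coded = [c for c in categories if c != reference]
    return [1 if label == c else 0 for c in coded]

categories = [1, 2, 3, 4]  # Label values for the four zip codes
reference = 4              # Commerce (75428) is the reference category

# Print the D1, D2, D3 pattern for each label.
for label in categories:
    d1, d2, d3 = dummy_code(label, categories, reference)
    print(label, d1, d2, d3)
```

Every case in the reference category receives 0 on all three dummy variables, which is why the other coefficients are read as comparisons with that category.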

 

 

 

 

 

       2. Analysis

 

       The analysis is carried out as a sequential regression. The predictors are entered in two separate blocks in the SPSS Multiple Linear Regression menu. The "dependent" variable is SCORE. The RATING variable is placed in the first block, then all three dummy variables (D1, D2, & D3) are placed in the second block. Also, be sure to request "R-square Change" and Descriptives. The output is shown below.

 

 

Regression

Notes
Output Created: 15-OCT-2006 03:28:37
Comments:
Input
    Filter: <none>
    Weight: <none>
    Split File: <none>
    N of Rows in Working Data File: 78
Missing Value Handling
    Definition of Missing: User-defined missing values are treated as missing.
    Cases Used: Statistics are based on cases with no missing values for any variable used.
Syntax
    REGRESSION
    /DESCRIPTIVES MEAN STDDEV CORR SIG N
    /MISSING LISTWISE
    /STATISTICS COEFF OUTS R ANOVA CHANGE
    /CRITERIA=PIN(.05) POUT(.10)
    /NOORIGIN
    /DEPENDENT score
    /METHOD=ENTER rating /METHOD=ENTER D1 D2 D3 .
Resources
    Elapsed Time: 0:00:00.03
    Memory Required: 2396 bytes
    Additional Memory Required for Residual Plots: 0 bytes

[DataSet0]

 

Descriptive Statistics

          Mean      Std. Deviation   N
score     14.9000    5.41034         40
rating    44.4000   17.35423         40
D1          .2500     .43853         40
D2          .2500     .43853         40
D3          .2500     .43853         40

 

Correlations

                               score    rating   D1       D2       D3
Pearson Correlation   score    1.000     .571    -.249     .789    -.281
                      rating    .571    1.000    -.165     .098     .142
                      D1       -.249    -.165    1.000    -.333    -.333
                      D2        .789     .098    -.333    1.000    -.333
                      D3       -.281     .142    -.333    -.333    1.000
Sig. (1-tailed)       score     .        .000     .061     .000     .040
                      rating    .000     .        .154     .274     .192
                      D1        .061     .154     .        .018     .018
                      D2        .000     .274     .018     .        .018
                      D3        .040     .192     .018     .018     .
N                     score    40       40       40       40       40
                      rating   40       40       40       40       40
                      D1       40       40       40       40       40
                      D2       40       40       40       40       40
                      D3       40       40       40       40       40

 

Variables Entered/Removed(b)

Model   Variables Entered   Variables Removed   Method
1       rating(a)           .                   Enter
2       D2, D1, D3(a)       .                   Enter

a All requested variables entered.
b Dependent Variable: score

 

Model Summary

                                                                      Change Statistics
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .571(a)   .326       .309                4.49894                      .326              18.402     1     38    .000
2       .940(b)   .883       .870                1.95367                      .557              55.504     3     35    .000

a Predictors: (Constant), rating
b Predictors: (Constant), rating, D2, D1, D3

 

 

ANOVA(c)

Model                Sum of Squares   df   Mean Square   F        Sig.
1       Regression    372.462          1   372.462       18.402   .000(a)
        Residual      769.138         38    20.240
        Total        1141.600         39
2       Regression   1008.011          4   252.003       66.024   .000(b)
        Residual      133.589         35     3.817
        Total        1141.600         39

a Predictors: (Constant), rating
b Predictors: (Constant), rating, D2, D1, D3
c Dependent Variable: score

 

Coefficients(a)

                     Unstandardized Coefficients   Standardized Coefficients
Model                B         Std. Error          Beta                        t         Sig.
1       (Constant)    6.993    1.976                                            3.540    .001
        rating         .178     .042                .571                        4.290    .000
2       (Constant)    5.627     .994                                            5.659    .000
        rating         .163     .018                .522                        8.821    .000
        D1             .540     .875                .044                         .617    .541
        D2            8.869     .879                .719                       10.093    .000
        D3           -1.242     .882               -.101                       -1.409    .168

a Dependent Variable: score

 

 

Excluded Variables(b)

                                                               Collinearity Statistics
Model        Beta In    t         Sig.   Partial Correlation   Tolerance
1       D1   -.159(a)   -1.181    .245   -.191                 .973
        D2    .740(a)   12.375    .000    .897                 .990
        D3   -.369(a)   -3.025    .005   -.445                 .980

a Predictors in the Model: (Constant), rating
b Dependent Variable: score

 



Univariate Analysis of Variance (ANCOVA with RATING as a covariate, included for illustration of adjusted means)

Notes
Output Created: 15-OCT-2006 01:19:00
Comments:
Input
    Filter: <none>
    Weight: <none>
    Split File: <none>
    N of Rows in Working Data File: 78
Missing Value Handling
    Definition of Missing: User-defined missing values are treated as missing.
    Cases Used: Statistics are based on all cases with valid data for all variables in the model.
Syntax
    UNIANOVA
    score BY label WITH rating
    /METHOD = SSTYPE(3)
    /INTERCEPT = INCLUDE
    /EMMEANS = TABLES(label) WITH(rating=MEAN)
    /CRITERIA = ALPHA(.05)
    /DESIGN = rating label .
Resources
    Elapsed Time: 0:00:00.03

 

 

 

Between-Subjects Factors

                N
label   1.00    10
        2.00    10
        3.00    10
        4.00    10

 

Tests of Between-Subjects Effects
Dependent Variable: score

Source            Type III Sum of Squares   df   Mean Square   F        Sig.
Corrected Model    1008.011(a)               4   252.003       66.024   .000
Intercept           292.471                  1   292.471       76.627   .000
rating              297.011                  1   297.011       77.816   .000
label               635.549                  3   211.850       55.504   .000
Error               133.589                 35     3.817
Total             10022.000                 40
Corrected Total    1141.600                 39

a R Squared = .883 (Adjusted R Squared = .870)

 

Estimated Marginal Means

label
Dependent Variable: score

                                   95% Confidence Interval
label   Mean        Std. Error   Lower Bound   Upper Bound
1.00    13.398(a)   .624         12.130        14.666
2.00    21.728(a)   .620         20.469        22.987
3.00    11.616(a)   .623         10.352        12.880
4.00    12.858(a)   .619         11.601        14.115

a Covariates appearing in the model are evaluated at the following values: rating = 44.4000.

 

 

 

The significant R² change (.557, F change = 55.504, p < .001) that results from adding the block of dummy variables to the equation can be interpreted to mean that zip code adds significantly to the prediction of SCORE. The dummy variable D2 has a significant regression coefficient (t = 10.093, p < .001). To me, that means that knowing whether you are in Commerce versus zip 43229 (Label 2) adds significantly to the prediction. Notice that the unstandardized regression coefficient for D2 is 8.869. Now look at the difference between the estimated marginal means (given in the illustrative ANCOVA output) for the group with Label 2 (the group coded 1 on D2) and Commerce (Label 4): 21.728 - 12.858 = 8.870, which matches the coefficient within rounding. In general, the unstandardized regression coefficient for each dummy variable is the difference between the adjusted ANCOVA mean of that dummy variable's group and the adjusted mean of the reference group, controlling for RATING.
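To see where the adjusted means come from, one can plug the mean of RATING into the Model 2 regression equation. The sketch below is an illustration only: it uses the rounded coefficients printed in the output above, so the results agree with the ANCOVA estimated marginal means only to about two decimal places. The function name predict is a hypothetical helper, not an SPSS command.

```python
# Model 2 coefficients copied (rounded) from the dummy-coded regression output.
constant = 5.627
b_rating = 0.163
b_dummy = {"D1": 0.540, "D2": 8.869, "D3": -1.242}
rating_mean = 44.4  # mean of RATING from the Descriptive Statistics table

def predict(rating, d1=0, d2=0, d3=0):
    """Predicted SCORE from the Model 2 equation (illustrative helper)."""
    return (constant + b_rating * rating
            + b_dummy["D1"] * d1 + b_dummy["D2"] * d2 + b_dummy["D3"] * d3)

# Predicted SCORE for each group when RATING is held at its mean.
predicted = {
    1: predict(rating_mean, d1=1),
    2: predict(rating_mean, d2=1),
    3: predict(rating_mean, d3=1),
    4: predict(rating_mean),  # reference group: all dummies are 0
}

for label, value in predicted.items():
    print(label, round(value, 2))
```

Because the reference group has 0 on every dummy variable, its predicted value is the constant plus the RATING term, and each other group differs from it by exactly its dummy coefficient.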

 

None of the other comparisons with Commerce (75428, Label 4) is significant (D1: p = .541; D3: p = .168). From looking at the adjusted means, it is clear why: the means for Labels 1 and 3 are close to the mean for Commerce. Note that this method of coding allows us to determine whether zip code as a whole adds to prediction, but beyond that it only allows us to see whether each zip code differs from Commerce (75428, Label 4).
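The relationship between the dummy coefficients and the adjusted means can be checked for all three dummy variables with a few lines of arithmetic. This is a small sketch using the values printed in the output above; the variable names are illustrative only.

```python
# Estimated marginal (adjusted) means from the ANCOVA output, keyed by Label.
adjusted_means = {1: 13.398, 2: 21.728, 3: 11.616, 4: 12.858}
# Unstandardized B values for D1, D2, D3 (Model 2), keyed by the Label each codes.
b_dummy = {1: 0.540, 2: 8.869, 3: -1.242}
reference = 4  # Commerce

# Each B should equal (adjusted mean of the group) - (adjusted mean of reference).
differences = {
    label: adjusted_means[label] - adjusted_means[reference]
    for label in b_dummy
}

for label, diff in differences.items():
    print(f"Label {label}: difference = {diff:.3f}, B = {b_dummy[label]:.3f}")
```

The small discrepancy for D2 (8.870 versus 8.869) comes from rounding in the printed output.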

 

 

Effect Coding

 

       1. Coding

 

       The process for effect coding is the same as for dummy coding, except that the cases in the last category (the one that served as the reference category in dummy coding) are coded -1 on all of the effect code variables instead of 0. This produces a situation where the unstandardized regression coefficient is the difference between the mean criterion score for the group coded 1 on that variable and the overall mean criterion score, controlling for the variance which RATING contributes to prediction. An alternative way of looking at this is that the unstandardized regression coefficient is the adjusted mean for the group, from an ANCOVA where RATING is a covariate, minus the grand mean. Effect coding also allows for further examination of differences between groups (which are conceptualized as differences between regression coefficient values for the effect code variables).

 

 

 

 

 

 

 

                                                    Effect Code Variables
Participant #   Zip Code   Label   Rating   Score   E1   E2   E3
 1              10023      1       32       11       1    0    0
 2              10023      1       51       14       1    0    0
 3              10023      1       33       13       1    0    0
 4              10023      1       39       13       1    0    0
 5              10023      1       29        8       1    0    0
 6              10023      1       23        8       1    0    0
 7              10023      1       48       13       1    0    0
 8              10023      1       77       20       1    0    0
 9              10023      1       21       13       1    0    0
10              10023      1       42       13       1    0    0
11              43229      2       47       22       0    1    0
12              43229      2       35       19       0    1    0
13              43229      2       78       28       0    1    0
14              43229      2       22       20       0    1    0
15              43229      2       42       21       0    1    0
16              43229      2       28       17       0    1    0
17              43229      2       33       18       0    1    0
18              43229      2       77       26       0    1    0
19              43229      2       48       26       0    1    0
20              43229      2       63       25       0    1    0
21              82673      3       52       12       0    0    1
22              82673      3       38        9       0    0    1
23              82673      3       85       17       0    0    1
24              82673      3       44       11       0    0    1
25              82673      3       45       12       0    0    1
26              82673      3       53       17       0    0    1
27              82673      3       50       11       0    0    1
28              82673      3       15       11       0    0    1
29              82673      3       63       14       0    0    1
30              82673      3       41        9       0    0    1
31              75428      4       55       13      -1   -1   -1
32              75428      4       56       17      -1   -1   -1
33              75428      4       28        8      -1   -1   -1
34              75428      4       34       12      -1   -1   -1
35              75428      4       26       10      -1   -1   -1
36              75428      4       28       12      -1   -1   -1
37              75428      4       26        9      -1   -1   -1
38              75428      4       60       17      -1   -1   -1
39              75428      4       69       18      -1   -1   -1
40              75428      4       40        9      -1   -1   -1
 

 

       2. Analysis   

 

       The analysis is carried out in the same way as with dummy coding. First, RATING is entered as the continuous predictor, and then all effect code variables (E1, E2, & E3) are added as a block in a sequential test. The results from SPSS are shown below.

 

Regression

Notes
Output Created: 15-OCT-2006 01:06:02
Comments:
Input
    Filter: <none>
    Weight: <none>
    Split File: <none>
    N of Rows in Working Data File: 78
Missing Value Handling
    Definition of Missing: User-defined missing values are treated as missing.
    Cases Used: Statistics are based on cases with no missing values for any variable used.
Syntax
    REGRESSION
    /DESCRIPTIVES MEAN STDDEV CORR SIG N
    /MISSING LISTWISE
    /STATISTICS COEFF OUTS R ANOVA CHANGE
    /CRITERIA=PIN(.05) POUT(.10)
    /NOORIGIN
    /DEPENDENT score
    /METHOD=ENTER rating /METHOD=ENTER E1 E2 E3 .
Resources
    Elapsed Time: 0:00:00.05
    Memory Required: 2396 bytes
    Additional Memory Required for Residual Plots: 0 bytes

 

 

 

Descriptive Statistics

          Mean      Std. Deviation   N
score     14.9000    5.41034         40
rating    44.4000   17.35423         40
E1          .0000     .71611         40
E2          .0000     .71611         40
E3          .0000     .71611         40

 

Correlations

                               score    rating   E1       E2       E3
Pearson Correlation   score    1.000     .571     .007     .642    -.013
                      rating    .571    1.000    -.056     .105     .132
                      E1        .007    -.056    1.000     .500     .500
                      E2        .642     .105     .500    1.000     .500
                      E3       -.013     .132     .500     .500    1.000
Sig. (1-tailed)       score     .        .000     .484     .000     .468
                      rating    .000     .        .366     .259     .208
                      E1        .484     .366     .        .001     .001
                      E2        .000     .259     .001     .        .001
                      E3        .468     .208     .001     .001     .
N                     score    40       40       40       40       40
                      rating   40       40       40       40       40
                      E1       40       40       40       40       40
                      E2       40       40       40       40       40
                      E3       40       40       40       40       40

 

Variables Entered/Removed(b)

Model   Variables Entered   Variables Removed   Method
1       rating(a)           .                   Enter
2       E1, E2, E3(a)       .                   Enter

a All requested variables entered.
b Dependent Variable: score

 

Model Summary

                                                                      Change Statistics
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .571(a)   .326       .309                4.49894                      .326              18.402     1     38    .000
2       .940(b)   .883       .870                1.95367                      .557              55.504     3     35    .000

a Predictors: (Constant), rating
b Predictors: (Constant), rating, E1, E2, E3

 

 

ANOVA(c)

Model                Sum of Squares   df   Mean Square   F        Sig.
1       Regression    372.462          1   372.462       18.402   .000(a)
        Residual      769.138         38    20.240
        Total        1141.600         39
2       Regression   1008.011          4   252.003       66.024   .000(b)
        Residual      133.589         35     3.817
        Total        1141.600         39

a Predictors: (Constant), rating
b Predictors: (Constant), rating, E1, E2, E3
c Dependent Variable: score

 

Coefficients(a)

                     Unstandardized Coefficients   Standardized Coefficients
Model                B         Std. Error          Beta                        t         Sig.
1       (Constant)    6.993    1.976                                            3.540    .001
        rating         .178     .042                .571                        4.290    .000
2       (Constant)    7.669     .876                                            8.754    .000
        rating         .163     .018                .522                        8.821    .000
        E1           -1.502     .543               -.199                       -2.768    .009
        E2            6.828     .538                .904                       12.698    .000
        E3           -3.284     .541               -.435                       -6.075    .000

a Dependent Variable: score

 

 

Excluded Variables(b)

                                                               Collinearity Statistics
Model        Beta In    t         Sig.   Partial Correlation   Tolerance
1       E1    .039(a)     .286    .777    .047                 .997
        E2    .588(a)    6.182    .000    .713                 .989
        E3   -.090(a)    -.667    .509   -.109                 .983

a Predictors in the Model: (Constant), rating
b Dependent Variable: score

 

 

Univariate Analysis of Variance (ANCOVA with RATING as a covariate, included for illustration of adjusted means)

Notes
Output Created: 15-OCT-2006 01:19:00
Comments:
Input
    Filter: <none>
    Weight: <none>
    Split File: <none>
    N of Rows in Working Data File: 78
Missing Value Handling
    Definition of Missing: User-defined missing values are treated as missing.
    Cases Used: Statistics are based on all cases with valid data for all variables in the model.
Syntax
    UNIANOVA
    score BY label WITH rating
    /METHOD = SSTYPE(3)
    /INTERCEPT = INCLUDE
    /EMMEANS = TABLES(label) WITH(rating=MEAN)
    /CRITERIA = ALPHA(.05)
    /DESIGN = rating label .
Resources
    Elapsed Time: 0:00:00.03

 

 

 

Between-Subjects Factors

                N
label   1.00    10
        2.00    10
        3.00    10
        4.00    10

 

Tests of Between-Subjects Effects
Dependent Variable: score

Source            Type III Sum of Squares   df   Mean Square   F        Sig.
Corrected Model    1008.011(a)               4   252.003       66.024   .000
Intercept           292.471                  1   292.471       76.627   .000
rating              297.011                  1   297.011       77.816   .000
label               635.549                  3   211.850       55.504   .000
Error               133.589                 35     3.817
Total             10022.000                 40
Corrected Total    1141.600                 39

a R Squared = .883 (Adjusted R Squared = .870)

 

Estimated Marginal Means

label
Dependent Variable: score

                                   95% Confidence Interval
label   Mean        Std. Error   Lower Bound   Upper Bound
1.00    13.398(a)   .624         12.130        14.666
2.00    21.728(a)   .620         20.469        22.987
3.00    11.616(a)   .623         10.352        12.880
4.00    12.858(a)   .619         11.601        14.115

a Covariates appearing in the model are evaluated at the following values: rating = 44.4000.

 



 

 

 

       Note the adjusted marginal means in the last table of the ANCOVA output. Now, go back and jot down the grand mean for SCORE, as well as the unstandardized regression coefficients for E1, E2 & E3 from the regression output. Those values are noted below:

 

Grand Mean for SCORE = 14.9

B(E1) = -1.502

B(E2) =  6.828

B(E3) = -3.284

 

Note that we can get the adjusted means for the zip code groups (verified in the ANCOVA output above) by adding each B value from the regression to the grand mean for SCORE:

 

ZIP1 = 14.9 - 1.502 = 13.40

ZIP2 = 14.9 + 6.828 = 21.73

ZIP3 = 14.9 - 3.284 = 11.62

 

Since the groups are the same size, (ZIP1 + ZIP2 + ZIP3 + ZIP4)/4 = 14.9, so we know that

 

ZIP4 = (4)14.9 - ZIP1 - ZIP2 - ZIP3

 

ZIP4 = 59.6 - 13.40 - 21.73 - 11.62 = 12.85, which agrees with the ANCOVA value of 12.858 within rounding.
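The same arithmetic can be sketched in Python using the unrounded B values. This is an illustration only, with variable names chosen for this sketch; it relies on the fact stated above that, with equal group sizes, the four adjusted means average to the grand mean, so the effect for the group coded -1 is minus the sum of the other three B values.

```python
# Grand mean of SCORE and the effect code coefficients from Model 2.
grand_mean = 14.9
b_effect = {"ZIP1": -1.502, "ZIP2": 6.828, "ZIP3": -3.284}

# Adjusted mean = grand mean + B for the groups coded 1 on an effect variable.
adjusted = {name: grand_mean + b for name, b in b_effect.items()}

# The group coded -1 on every variable gets minus the sum of the other effects.
adjusted["ZIP4"] = grand_mean - sum(b_effect.values())

for name, mean in adjusted.items():
    print(name, round(mean, 3))
```

Working from the unrounded coefficients gives 12.858 for ZIP4, the value reported in the ANCOVA output; the 12.85 above reflects rounding of the intermediate means.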

 

The following equation can now be used to compare the four groups for significant differences:

 

 

 

 

 

Here the means in the numerator are the adjusted means for the zip code groups computed above; "a" is the number of groups (4 in our case); MS'wg is the residual mean square from regression Model 2; Y-Y is the difference between the means for RATING in the two groups (the means can easily be obtained using the MEANS procedure in SPSS*, as shown below); SSwg(y) is the residual sum of squares from a regression of RATING on the effect code variables (easily obtained by running that regression at the same time); and the "c" values are contrast coefficients, as we have used for post-hoc comparisons before.

 

*Report (group means for rating)

label   Mean      N    Std. Deviation
1.00    39.5000   10   16.46714
2.00    47.3000   10   19.63019
3.00    48.6000   10   17.94560
4.00    42.2000   10   16.29451
Total   44.4000   40   17.35423

 

 

 

       You should also note that with this analysis we get exactly the same R² change (.557) when we enter the effect codes as we got in the first analysis when we entered the dummy codes. The overall variance accounted for (R² = .883) does not change when we change the coding method.
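The F change reported for Model 2 can be reproduced from the sums of squares in the ANOVA table, which are identical under both coding methods. The sketch below uses the standard F test for the increment in explained variance when a block of predictors is added; the variable names are illustrative.

```python
# Sums of squares and degrees of freedom from the ANOVA table (same for
# dummy and effect coding).
ss_total = 1141.600
ss_reg_model1 = 372.462   # rating only
ss_reg_model2 = 1008.011  # rating plus the three coded variables
ss_res_model2 = 133.589
df_added = 3              # number of coded variables added in block 2
df_residual = 35          # residual df for Model 2

# Increment in explained sum of squares due to the coded variables.
ss_added = ss_reg_model2 - ss_reg_model1

# R-square change and the F test for that change.
r2_change = ss_added / ss_total
f_change = (ss_added / df_added) / (ss_res_model2 / df_residual)

print(round(r2_change, 3))
print(round(f_change, 3))
```

The increment of 635.549 is also the Type III sum of squares for label in the ANCOVA table, which is why the ANCOVA F for label equals the F change.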

 

       For groups that are significantly different, the final task is to create a separate regression equation for each group or each set of homogeneous groups.

 

 

 

 

Orthogonal Contrast Coding

 

       1. Coding

 

       In orthogonal contrast coding, we create contrasts with the k-1 regression variables such that every pair of contrasts satisfies the orthogonality criterion below,

\[ \sum_{j=1}^{k} c_{1j}\, c_{2j} = 0 \]

where the c values are the contrast coefficients of the two contrasts being compared and j indexes the groups. Two contrasts are considered orthogonal if the sum of the products of their contrast coefficients across all groups is zero. Orthogonal contrasts are statistically independent contrasts, meaning each provides unique information not dependent on the other comparisons. Orthogonal contrasts fit naturally with our system of assigning values to k-1 variables when we have k groups, because any set of mutually orthogonal contrasts has at most k-1 comparisons.
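The criterion is easy to check by hand, but a short Python sketch makes the rule concrete. The function name is_orthogonal is hypothetical. The effect code coefficients are taken from the effect coding table earlier in this handout (ZIP1 through ZIP4 in order); the second set is a Helmert-style set added here purely as an illustrative assumption of what an orthogonal set can look like, not necessarily the set used later in this handout.

```python
from itertools import combinations

def is_orthogonal(c1, c2):
    """True if the sum of products of the coefficients across groups is zero."""
    return sum(a * b for a, b in zip(c1, c2)) == 0

# Coefficients for groups ZIP1, ZIP2, ZIP3, ZIP4 from the effect coding table.
effect_codes = {
    "E1": [1, 0, 0, -1],
    "E2": [0, 1, 0, -1],
    "E3": [0, 0, 1, -1],
}

# Illustrative assumption: a Helmert-style set of contrasts for four groups.
helmert_style = {
    "C1": [3, -1, -1, -1],
    "C2": [0, 2, -1, -1],
    "C3": [0, 0, 1, -1],
}

for name, contrasts in [("effect codes", effect_codes),
                        ("Helmert-style", helmert_style)]:
    for (n1, c1), (n2, c2) in combinations(contrasts.items(), 2):
        print(name, n1, n2, is_orthogonal(c1, c2))
```

Each pair of effect codes has a sum of products of 1 (from the shared -1 for ZIP4), so none of those pairs is orthogonal, while every pair in the Helmert-style set sums to zero.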

 

       Let's consider our effect code contrasts to determine whether all possible pairs of the three contrasts we used are orthogonal. We can compare E1 & E2, E1 & E3, and E2 & E3. The contrast coefficients used in effect coding are given below:

 

 

E1

E2

E3

ZIP1

1

0

0

ZIP2

0

1

0