
2024 Curriculum CFA Program Level II Quantitative Methods

Introduction

Multiple linear regression uses two or more independent variables to describe the variation of the dependent variable, rather than the single independent variable used in simple linear regression. It allows the analyst to estimate more complex models with multiple explanatory variables and, if used correctly, may lead to better predictions, better portfolio construction, or better understanding of the drivers of security returns. If used incorrectly, however, multiple linear regression may yield spurious relationships, lead to poor predictions, and offer a poor understanding of relationships.

The analyst must first specify the model and make several decisions in this process. The analyst must answer the following questions: What is the dependent variable of interest? What independent variables are important? What form should the model take? What is the goal of the model—prediction or understanding of the relationship?

The analyst specifies the dependent and independent variables and then employs software to estimate the model and produce related statistics. The good news is that the software does the estimation (see Exhibit 1). The analyst’s primary tasks are to specify the model and interpret the output from this software, which are the main subjects of this content.

Software    Programs/Functions
Excel       Data Analysis > Regression
Python      scipy.stats.linregress, statsmodels.api.OLS, sklearn.linear_model.LinearRegression
R           lm
SAS         PROC REG, PROC GLM
STATA       regress
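As a rough illustration of the estimation step that the software performs, the sketch below fits a two-variable regression by ordinary least squares using NumPy. The data are simulated, and the names `x1`, `x2`, and `design`, the sample size, and the "true" coefficients (1.0, 2.0, -0.5) are assumptions chosen for illustration only; they do not come from the curriculum. The packages listed above should return the same coefficient estimates along with fuller output such as standard errors.

```python
import numpy as np

# Simulated data (illustrative assumption, not curriculum data):
# Y = 1.0 + 2.0*X1 - 0.5*X2 + error
rng = np.random.default_rng(42)
n = 250
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
eps = rng.normal(scale=0.5, size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + eps

# Design matrix: a column of ones for the intercept, then one column per regressor.
design = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares: choose coefficients that minimize the sum of squared residuals.
b_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
b0, b1, b2 = b_hat

print(f"intercept b0 = {b0:.3f}")
print(f"slope b1     = {b1:.3f}")
print(f"slope b2     = {b2:.3f}")
```

With a sample of this size, the estimates should land close to the simulated coefficients, although they will not match them exactly because of the random error term.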

Learning Outcomes

The member should be able to:

  • describe the types of investment problems addressed by multiple linear regression and the regression process;
  • formulate a multiple linear regression model, describe the relation between the dependent variable and several independent variables, and interpret estimated regression coefficients; and
  • explain the assumptions underlying a multiple linear regression model and interpret residual plots indicating potential violations of these assumptions.

Summary

  • Multiple linear regression is used to model the linear relationship between one dependent variable and two or more independent variables. 
  • In practice, multiple regressions are used to explain relationships between financial variables, to test existing theories, or to make forecasts. 
  • The regression process covers several decisions the analyst must make, such as identifying the dependent and independent variables, selecting the appropriate regression model, testing if the assumptions behind linear regression are satisfied, examining goodness of fit, and making needed adjustments.
  • We have presented the multiple linear regression model and discussed violations of regression assumptions, model specification and misspecification, and models with qualitative variables.
  • A multiple regression model is represented by the following equation:

Yi = b0 + b1X1i + b2X2i + b3X3i + … + bkXki + εi, i = 1, 2, 3, …, n,

  where Y is the dependent variable, Xs are the independent variables from 1 to k, and the model is estimated using n observations.

  • Coefficient b0 is the model’s “intercept,” representing the expected value of Y if all independent variables are zero.
  • Parameters b1 to bk are the slope coefficients (or partial regression coefficients) for independent variables X1 to Xk. Slope coefficient bj describes the impact of independent variable Xj on Y, holding all the other independent variables constant.
  • Five main assumptions underlying multiple regression models must be satisfied: (1) linearity, (2) homoskedasticity, (3) independence of errors, (4) normality, and (5) independence of independent variables.
  • Diagnostic plots can help detect whether these assumptions are satisfied. Scatterplots of the dependent variable versus the independent variables are useful for detecting nonlinear relationships, while residual plots are useful for detecting violations of homoskedasticity and independence of errors.
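The points above on interpreting slope coefficients and on diagnostic plots can be sketched in a few lines of Python. This is a minimal illustration on simulated data, not a prescribed procedure: the helper names `fit_ols`, `predict`, and `plot_diagnostics` are hypothetical, the simulated coefficients are arbitrary, and the plotting function assumes that matplotlib is installed.

```python
import numpy as np

def fit_ols(X, y):
    """Regress y on the columns of X plus an intercept; return coefficients, fitted values, residuals."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    return coef, fitted, y - fitted

def predict(coef, x):
    """Predicted Y for one observation x = (X1, ..., Xk)."""
    return coef[0] + np.dot(coef[1:], x)

def plot_diagnostics(X, y, fitted, residuals):
    """Scatterplots of Y against each X, plus residuals against fitted values and observation order."""
    import matplotlib.pyplot as plt  # imported lazily; only needed when plotting
    k = X.shape[1]
    fig, axes = plt.subplots(1, k + 2, figsize=(4 * (k + 2), 3.5))
    for j in range(k):
        axes[j].scatter(X[:, j], y, s=10)
        axes[j].set(xlabel=f"X{j + 1}", ylabel="Y", title="Linearity")
    axes[k].scatter(fitted, residuals, s=10)
    axes[k].axhline(0.0, color="grey")
    axes[k].set(xlabel="Fitted Y", ylabel="Residual", title="Homoskedasticity")
    axes[k + 1].plot(residuals, marker="o", linestyle="none", markersize=3)
    axes[k + 1].axhline(0.0, color="grey")
    axes[k + 1].set(xlabel="Observation order", ylabel="Residual", title="Independence of errors")
    fig.tight_layout()
    return fig

# Simulated data (illustrative assumption): Y = 1.0 + 2.0*X1 - 0.5*X2 + error
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

coef, fitted, residuals = fit_ols(X, y)

# Partial effect: raise X1 by one unit while holding X2 constant.
base = np.array([0.3, -1.2])
bumped = base + np.array([1.0, 0.0])
effect = predict(coef, bumped) - predict(coef, base)
print(f"b1 = {coef[1]:.3f}, change in predicted Y = {effect:.3f}")

# plot_diagnostics(X, y, fitted, residuals)  # uncomment to draw the plots
```

The change in predicted Y equals the estimated b1 regardless of the starting values chosen for X1 and X2, which is what "holding all the other independent variables constant" means in a linear model. In the residual plots, a fan or funnel shape against the fitted values would suggest heteroskedasticity, and a visible pattern against observation order would suggest that the errors are not independent.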