2019 Curriculum CFA Program Level II Quantitative Methods
Multiple Regression and Machine Learning
As financial analysts, we often need to use more-sophisticated statistical methods than correlation analysis or regression involving a single independent variable. For example, a trading desk interested in the costs of trading NASDAQ stocks might want information on the determinants of the bid–ask spread on the NASDAQ. A mutual fund analyst might want to know whether returns to a technology mutual fund behaved more like the returns to a growth stock index or like the returns to a value stock index. An investor might be interested in the factors that determine whether analysts cover a stock. We can answer these questions using linear regression with more than one independent variable—multiple linear regression.
In Sections 2 and 3, we introduce and illustrate the basic concepts and models of multiple regression analysis. These models rest on assumptions that are sometimes violated in practice. In Section 4, we discuss three commonly occurring violations of regression assumptions. We address practical concerns such as how to diagnose an assumption violation and what remedial steps to take when a model assumption has been violated. Section 5 outlines some guidelines for building good regression models and discusses ways that analysts sometimes go wrong in this endeavor. In Section 6, we discuss a class of models whose dependent variable is qualitative in nature. These models are useful when the concern is over the occurrence of some event, such as whether a stock has analyst coverage or not.
The candidate should be able to:
- formulate a multiple regression equation to describe the relation between a dependent variable and several independent variables and determine the statistical significance of each independent variable;
- interpret estimated regression coefficients and their p-values;
- formulate a null and an alternative hypothesis about the population value of a regression coefficient, calculate the value of the test statistic, and determine whether to reject the null hypothesis at a given level of significance;
- interpret the results of hypothesis tests of regression coefficients;
- calculate and interpret 1) a confidence interval for the population value of a regression coefficient and 2) a predicted value for the dependent variable, given an estimated regression model and assumed values for the independent variables;
- explain the assumptions of a multiple regression model;
- calculate and interpret the F-statistic, and describe how it is used in regression analysis;
- distinguish between and interpret the R2 and adjusted R2 in multiple regression;
- evaluate how well a regression model explains the dependent variable by analyzing the output of the regression equation and an ANOVA table;
- formulate a multiple regression equation by using dummy variables to represent qualitative factors and interpret the coefficients and regression results;
- explain the types of heteroskedasticity and how heteroskedasticity and serial correlation affect statistical inference;
- describe multicollinearity and explain its causes and effects in regression analysis;
- describe how model misspecification affects the results of a regression analysis and describe how to avoid common forms of misspecification;
- describe models with qualitative dependent variables;
- evaluate and interpret a multiple regression model and its results;
- distinguish between supervised and unsupervised machine learning;
- describe machine learning algorithms used in prediction, classification, clustering, and dimension reduction;
- describe the steps in model training.
In this reading, we have presented the multiple linear regression model and discussed violations of regression assumptions, model specification and misspecification, and models with qualitative variables.
- The general form of a multiple linear regression model is Yi = b0 + b1X1i + b2X2i + … + bkXki + εi
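As a minimal sketch of how such a model's coefficients can be estimated by ordinary least squares (the data below are hypothetical and constructed so the true coefficients are known; NumPy's least-squares routine stands in for a regression package):

```python
import numpy as np

# Hypothetical data: six observations, k = 2 independent variables.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y = 1.0 + 0.5 * X1 - 0.25 * X2   # constructed so b0 = 1.0, b1 = 0.5, b2 = -0.25

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(X1), X1, X2])

# OLS estimates minimize the sum of squared residuals ||Y - Xb||^2.
b_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(b_hat)  # approximately [1.0, 0.5, -0.25]
```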
- We conduct hypothesis tests concerning the population values of regression coefficients using t-tests of the form t = (estimated coefficient − hypothesized population value)/(standard error of the coefficient), with n − (k + 1) degrees of freedom.
- The lower the p-value reported for a test, the more significant the result.
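A numerical sketch of such a t-test; the coefficient estimate, its standard error, and the degrees of freedom are all hypothetical (2.447 is the standard two-tailed t critical value for six degrees of freedom at the 0.05 level):

```python
# Test H0: b1 = 0 against Ha: b1 != 0 at the 0.05 significance level.
b1_hat = 0.48    # hypothetical estimated slope coefficient
s_b1 = 0.15      # hypothetical standard error of the estimate
b1_null = 0.0    # hypothesized population value under H0

t_stat = (b1_hat - b1_null) / s_b1   # about 3.2
t_crit = 2.447   # two-tailed critical value, alpha = 0.05, df = n - (k + 1) = 6

reject_h0 = abs(t_stat) > t_crit
print(t_stat, reject_h0)  # about 3.2, True
```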
- The assumptions of the classical normal multiple linear regression model are as follows:
- A linear relation exists between the dependent variable and the independent variables.
- The independent variables are not random. Also, no exact linear relation exists between two or more of the independent variables.
- The expected value of the error term, conditioned on the independent variables, is 0.
- The variance of the error term is the same for all observations.
- The error term is uncorrelated across observations.
- The error term is normally distributed.
- To make a prediction using a multiple linear regression model, we take the following three steps:
- Obtain estimates of the regression coefficients.
- Determine the assumed values of the independent variables.
- Compute the predicted value of the dependent variable.
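The three steps above can be sketched directly; the estimated coefficients and assumed values below are hypothetical:

```python
# Step 1: hypothetical estimated coefficients [b0, b1, b2].
b = [1.0, 0.5, -0.25]
# Step 2: assumed values of the independent variables (leading 1 pairs with b0).
x = [1.0, 4.0, 2.0]
# Step 3: predicted value of the dependent variable.
y_pred = sum(bj * xj for bj, xj in zip(b, x))
print(y_pred)  # 1.0 + 0.5*4.0 - 0.25*2.0 = 2.5
```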
- When predicting the dependent variable using a linear regression model, we encounter two types of uncertainty: uncertainty in the regression model itself, as reflected in the standard error of estimate, and uncertainty about the estimates of the regression coefficients.
- The F-test is reported in an ANOVA table. The F-statistic is used to test whether at least one of the slope coefficients on the independent variables is significantly different from 0. Under the null hypothesis that all the slope coefficients are jointly equal to 0, this test statistic has a distribution of Fk,n−(k+1), where the regression has n observations and k independent variables. The F-test measures the overall significance of the regression.
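With hypothetical ANOVA figures, the F-statistic is the ratio of the mean square regression to the mean square error:

```python
# Hypothetical ANOVA output: n = 20 observations, k = 3 independent variables.
n, k = 20, 3
rss = 120.0   # regression (explained) sum of squares
sse = 40.0    # residual (unexplained) sum of squares

msr = rss / k               # mean square regression, df = k
mse = sse / (n - (k + 1))   # mean square error, df = n - (k + 1) = 16
f_stat = msr / mse          # compared with an F(3, 16) critical value
print(f_stat)  # (120/3) / (40/16) = 16.0
```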
- R2 is nondecreasing in the number of independent variables, so it is less reliable as a measure of goodness of fit in a regression with more than one independent variable than in a one-independent-variable regression. Analysts often choose to use adjusted R2 because it does not necessarily increase when one adds an independent variable.
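The adjustment can be computed directly from R2, n, and k; the R2 value here is hypothetical:

```python
# Adjusted R^2 = 1 - [(n - 1) / (n - (k + 1))] * (1 - R^2)
n, k = 20, 3
r2 = 0.75   # hypothetical R-squared from a regression

adj_r2 = 1 - ((n - 1) / (n - (k + 1))) * (1 - r2)
print(adj_r2)  # 1 - (19/16) * 0.25 = 0.703125
```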
- Dummy variables in a regression model can help analysts determine whether a particular qualitative independent variable explains the model’s dependent variable. A dummy variable takes on the value of 0 or 1. If we need to distinguish among n categories, the regression should include n − 1 dummy variables. The intercept of the regression measures the average value of the dependent variable of the omitted category, and the coefficient on each dummy variable measures the average incremental effect of that dummy variable on the dependent variable.
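A small sketch of this interpretation, using hypothetical returns for two categories (so one dummy variable is needed); with a single dummy and an intercept, the OLS estimates equal the group means computed below:

```python
# Hypothetical returns for two categories, so the regression uses
# 2 - 1 = 1 dummy variable (dummy = 1 for growth, 0 for value).
value_returns = [2.0, 3.0, 4.0]    # omitted (base) category
growth_returns = [5.0, 6.0, 7.0]

# Intercept b0: average dependent-variable value for the omitted category.
b0 = sum(value_returns) / len(value_returns)
# Dummy coefficient b1: average incremental effect of the growth category.
b1 = sum(growth_returns) / len(growth_returns) - b0
print(b0, b1)  # 3.0 3.0
```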
- If a regression shows significant conditional heteroskedasticity, the standard errors and test statistics computed by regression programs will be incorrect unless they are adjusted for heteroskedasticity.
- One simple test for conditional heteroskedasticity is the Breusch–Pagan test. Breusch and Pagan showed that, under the null hypothesis of no conditional heteroskedasticity, nR2 (from the regression of the squared residuals on the independent variables from the original regression) will be a χ2 random variable with the number of degrees of freedom equal to the number of independent variables in the regression.
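A numerical sketch of the test statistic; the auxiliary-regression R2 and sample size are hypothetical, and 3.841 is the standard chi-square critical value for one degree of freedom at the 0.05 level:

```python
# Suppose the regression of squared residuals on the k = 1 original
# independent variable yields R^2 = 0.12 with n = 100 observations.
n = 100
r2_resid = 0.12

bp_stat = n * r2_resid   # distributed chi-square with k df under H0
chi2_crit = 3.841        # chi-square critical value, alpha = 0.05, df = 1

print(bp_stat, bp_stat > chi2_crit)  # about 12.0, True -> reject H0
```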
- The principal effect of serial correlation in a linear regression is that the standard errors and test statistics computed by regression programs will be incorrect unless adjusted for serial correlation. Positive serial correlation typically inflates the t-statistics of estimated regression coefficients as well as the F-statistic for the overall significance of the regression.
- The most commonly used test for serial correlation is based on the Durbin–Watson statistic. If the Durbin–Watson statistic differs sufficiently from 2, then the regression errors have significant serial correlation.
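The statistic itself is simple to compute from regression residuals (the series below is hypothetical); values well below 2 point to positive serial correlation:

```python
# Hypothetical residuals from a time-series regression.
e = [0.5, 0.7, 0.6, 0.9, 0.8, 1.0]

# DW = sum of squared changes in residuals / sum of squared residuals.
num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
den = sum(et ** 2 for et in e)
dw = num / den
print(dw)  # well below 2, consistent with positive serial correlation
```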
- Multicollinearity occurs when two or more independent variables (or combinations of independent variables) are highly (but not perfectly) correlated with each other. With multicollinearity, the regression coefficients may not be individually statistically significant even when the overall regression is significant as judged by the F-statistic.
- Model specification refers to the set of variables included in the regression and the regression equation’s functional form. The following principles can guide model specification:
- The model should be grounded in cogent economic reasoning.
- The functional form chosen for the variables in the regression should be appropriate given the nature of the variables.
- The model should be parsimonious.
- The model should be examined for violations of regression assumptions before being accepted.
- The model should be tested and be found useful out of sample before being accepted.
- If a regression is misspecified, then statistical inference using OLS is invalid and the estimated regression coefficients may be inconsistent.
- Assuming that a model has the correct functional form, when in fact it does not, is one example of misspecification. There are several ways this assumption may be violated:
- One or more important variables could be omitted from the regression.
- One or more of the regression variables may need to be transformed before estimating the regression.
- The regression model pools data from different samples that should not be pooled.
- Another type of misspecification occurs when independent variables are correlated with the error term. This is a violation of Regression Assumption 3, that the error term has a mean of 0, and causes the estimated regression coefficients to be biased and inconsistent. Three common problems that create this type of time-series misspecification are:
- including lagged dependent variables as independent variables in regressions with serially correlated errors;
- including a function of the dependent variable as an independent variable, sometimes as a result of the incorrect dating of variables; and
- independent variables that are measured with error.
- Probit and logit models estimate the probability of a discrete outcome (the value of a qualitative dependent variable, such as whether a company enters bankruptcy) given the values of the independent variables used to explain that outcome. The probit model, which is based on the normal distribution, estimates the probability that Y = 1 (a condition is fulfilled) given the values of the independent variables. The logit model is identical, except that it is based on the logistic distribution rather than the normal distribution.
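For the logit model, the probability that Y = 1 follows from applying the logistic function to a linear combination of the independent variables; the coefficients and values below are hypothetical:

```python
import math

b = [-2.0, 0.8]   # hypothetical [intercept, slope] estimates
x = [1.0, 3.0]    # leading 1 pairs with the intercept

z = sum(bj * xj for bj, xj in zip(b, x))   # linear index, about 0.4
p = 1.0 / (1.0 + math.exp(-z))             # logistic CDF gives P(Y = 1)
print(round(p, 4))  # approximately 0.5987
```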
- Supervised learning is machine learning that makes use of labelled training data; it contrasts with unsupervised learning, which does not make use of labelled data.
- Focuses of data analytics include correlation, prediction, causal inference, classification, clustering, and dimension reduction. Supervised machine learning is typically used for prediction and classification, while unsupervised machine learning is used for clustering and dimension reduction.
- Penalized regression is a computationally efficient technique used in prediction problems. CART (classification and regression trees) is a common supervised ML technique that can be applied to predict either a categorical or continuous target variable. Neural networks are applied to a variety of tasks characterized by nonlinearities and interactions among variables. Neural networks consist of three layers: an input layer, hidden layer(s), and an output layer. The K-means algorithm is a simple, bottom-up clustering algorithm based on concepts of geometric distance from points called centroids. PCA is an unsupervised learning algorithm that supplies a lower-dimensional view of the structure of the volatility in data.
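A minimal sketch of one K-means iteration on hypothetical one-dimensional data, showing the two alternating steps (assign each point to the nearest centroid, then move each centroid to the mean of its assigned points):

```python
points = [1.0, 1.5, 5.0, 5.5]   # hypothetical data
centroids = [1.0, 5.0]          # K = 2 initial centroids

# Assignment step: index of the nearest centroid for each point.
assignments = [min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
               for p in points]
print(assignments)  # [0, 0, 1, 1]

# Update step: each centroid moves to the mean of its assigned points.
new_centroids = [sum(p for p, a in zip(points, assignments) if a == c) /
                 sum(1 for a in assignments if a == c)
                 for c in range(len(centroids))]
print(new_centroids)  # [1.25, 5.25]
```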
- The process to train ML models involves the following steps:
- Specify the ML technique/algorithm
- Specify the associated hyperparameters
- Divide data into training and validation samples
- Evaluate learning with performance measure P, using the validation sample, and tune the hyperparameters
- Repeat the training cycle the specified number of times or until the required performance level is obtained.
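The training cycle above can be sketched schematically; the hyperparameter grid and the performance measure P below are stand-ins, not a specific algorithm:

```python
# Hypothetical performance measure P: stands in for "fit the algorithm on the
# training sample, then evaluate it on the validation sample".
def performance_P(hyperparam):
    return -(hyperparam - 3) ** 2   # assumed: performance peaks at hyperparam = 3

# Tuning loop: repeat the cycle over candidate hyperparameter values and
# keep the value with the best validation performance.
candidate_hyperparams = [1, 2, 3, 4, 5]
best = max(candidate_hyperparams, key=performance_P)
print(best)  # 3
```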