Regression analysis is a statistical technique used to fit a model expressed in terms of one or more variables to some data. There are many types of regression analysis techniques, however one of the most widely used is based on fitting data to a linear model, and using an approach called Least Squares. The Morpheus API currently supports 3 variations of Linear Least Squares regression, namely Ordinary Least Squares (OLS), Weighted Least Squares (WLS) and Generalized Least Squares (GLS). The following article reviews some OLS regression theory and provides an example of how to use the Morpheus API to apply this technique.

Regression analysis is a statistical technique used to fit a model expressed in terms of one or more variables to some data. In particular, it allows one to analyze the relationship of a dependent variable (also referred to as the regressand) on one or more independent or predictor variables (also referred to as regressors), and assess how influential each of these are.

There are many types of regression analysis techniques, however one of the most commonly used is based on fitting data to a linear model, and in particular using an approach called Least Squares. The Morpheus API currently supports3 variations of Linear Least Squares regression, namely Ordinary Least Squares (OLS), Weighted Least Squares (WLS) and Generalized Least Squares (GLS). The following section reviews some OLS regression theory, and provides an example of how to use the Morhpeus API to apply this technique.

A linear regression model in matrix form can be expressed as:

$$ Y = X \beta + \epsilon \ \ \ \ where \ E[\epsilon]=0 \ \ \ \ Var[\epsilon] = \sigma^2 I $$

Y represents an **nx1** vector of regressands, and X represents an **nxp** design matrix, where n is the number of observations in the data, and p represents the number of parameters to estimate. If the model includes an intercept term, the first column in the design matrix is populated with 1's and therefore the first entry in the **px1** β vector would represent the intercept value. The **nx1** ϵ vector represents the error or disturbance term, which is assumed to have a conditional mean of 0 and also to be free of any serial correlation(see section below on the Gauss Markov assumptions).

A least squares regression model estimates β so as to minimize the sum of the squared error, or ϵTϵ. We can use the model equation above to express \(\epsilon^T\epsilon\) in terms of Y and Xβ, differentiate this by β, set the result to zero since we wish to minimize the squared error, and solve for β as follows:

$$ \begin{align}

\epsilon^T\epsilon &= (Y - X \beta)^T(Y - X \beta) \\\\

&= Y^TY - \beta^TX^TY - Y^TX\beta + \beta^T X^T X \beta \\\\

&= Y^TY - 2\beta^T X^T Y + \beta^T X^T X \beta

\end{align} $$

We now differentiate this expression with respect to β and set it to zero which yields the following:

$$ \frac{d\epsilon^T\epsilon}{d\beta} = -2X^T Y + 2X^T X \beta = 0 $$

This expression can be re-arranged to solve for β and yields the following equation:

$$ \beta = (X^T X)^{-1} X^T Y $$

The value of β can only be **estimated** given some sample data drawn from a population or data generating process, the *true* value is unknown. We usually refer to the estimate as \(\hat{\beta}\) (beta "hat") in order to distinguish it from the true population value. In addition, when expressing the model in terms of \(\hat{\beta}\), the stochastic term is referred to as *residuals* rather than *errors*, and while they are conceptually related, they are not the same thing. Errors represent the deviation of observations from the unknown population line, while residuals represent the deviations from the estimation line, a subtle difference and easily confused. In terms of \(\hat{\beta}\) and residuals **u**, the model is expressed in the usual way.

$$ Y = X \hat{\beta} + u \ \ \ \ where \ E[u]=0 \ \ \ \ Var[u] = \sigma^2 I $$

The residuals are of course still assumed to be consistent with the Gauss Markov assumptions, and so the estimator for \(\hat{\beta}\) is:

$$ \hat{\beta} = (X^T X)^{-1} X^T Y $$

We can solve for \(\hat{\beta}\) directly by calculating the right hand side of the above equation, which is what the Morpheus library does by default. There are situations in which this can present numerical stability issues, in which case it may be preferable to solve for \(\hat{\beta}\) by factorizing the design matrix X using a QR decomposition where Q represents an orthogonal matrix such that \(Q^T Q = I\) and thus \(Q^T = Q^{-1}\), and where R is an upper triangular matrix. The QR decomposition approach is based on the following reasoning.

$$ \begin{align}

X^T Y & = X^T X \hat{\beta} \\\\

\big( QR \big)^T Y & = \big( QR \big)^T \big( QR \big) \hat{\beta} \\\\

R^T Q^T Y & = R^T Q^T Q R \hat{\beta} \\\\

R^T Q^T Y & = R^T R \hat{\beta} \\\\

(R^T)^{-1} R^T Q^T Y & = \big( R^T \big)^{-1} R^T R \hat{\beta} \\\\

Q^T Y & = R \hat{\beta}

\end{align} $$

This can be solved efficiently by backward substitution because the R matrix is upper triangular, and therefore no inversion of the design matrix is required. The Morpheus library also supports estimating \(\hat{\beta}\) using this technique, which can be configured on a case by case basis via the withSolver() method on the DataFrameLeastSquares interface.

Just like any other statistical technique, regression analysis is susceptible to sampling error, and it is therefore common to compute the variance of the parameter estimates, as well as their standard error, which can then be used for inferential purposes. In the context of a linear regression, the null hypotheses H0, is generally that the model parameters are zero. The parameter standard errors can be used to calculate a t-statistic and corresponding p-value in order to decide if \(H_{0}\) can be rejected in favor of the alternative hypothesis, and thus assert that the estimates are statistically significantly different from zero.

The variance of \(\hat{\beta}\) with respect to the true population value can be expressed as follows:

$$ Var(\hat{\beta}) = E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)^T] $$

We substitute our population model Y=Xβ+ϵ in our sample estimator \(\hat{\beta} = (X^T X)^{-1} X^T Y \) as follows:

$$ \begin{align}

\hat{\beta} &= (X^T X)^{-1} X^T ( X \beta + \epsilon ) \\\\

\hat{\beta} &= (X^T X)^{-1} X^T X \beta + (X^T X)^{-1} X^T \epsilon \\\\

\hat{\beta} &= \beta + (X^T X)^{-1} X^T \epsilon \\\\

\hat{\beta} - \beta &= (X^T X)^{-1} X^T \epsilon \\\\

\end{align} $$

With the above expression for \(\hat{\beta}\)−β we can now solve for the variance of the OLS estimator as follows:

$$ \begin{align}

Var(\hat{\beta}) &= E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)^T] \\\\

& = E[((X^T X)^{-1} X^T \epsilon) ((X^T X)^{-1} X^T \epsilon)^T] \\\\

& = E[(X^T X)^{-1} X^T \epsilon \epsilon^T X (X^T X)^{-1} ]

\end{align} $$

Given that the design matrix X is non-stochastic and the \(E[\epsilon \epsilon^T] = \sigma^2 I \):

$$ \begin{align}

Var(\hat{\beta}) & = (X^T X)^{-1} X^T E[\epsilon \epsilon^T] X (X^T X)^{-1} \\\\

& = (X^T X)^{-1} X^T (\sigma^2 I) X (X^T X)^{-1} \\\\

& = \sigma^2 I (X^T X)^{-1} X^T X (X^T X)^{-1} \\\\

& = \sigma^2 I (X^T X)^{-1} \\\\

& = \sigma^2 (X^T X)^{-1}

\end{align} $$

Other regression diagnostics that are calculated include the coefficient of determination or \(R^2\) which is a number that indicates the proportion of variance in the dependent variable that is explained by the independent variables, and is documented in the table below. The parameter variance estimates as calculated above are used to compute the standard errors and a corresponding t-statistic which can be used for statistical inference. Other diagnostics include the following:

$$ RSS = \sum_{i=1}^n \big(y_{i} - \hat{y_{i}} \big)^2 = \sum_{i=1}^n \epsilon_{i}^2 = \epsilon^T \epsilon $$

$$ TSS = \sum_{i=1}^{n} \big(y_{i} - \overline{y}\big)^2 \, \textrm{where} \; \overline{y} = \frac{1}{n} \sum_{i=1}^n y_{i} $$

$$ ESS = \sum_{i=1}^{n} \big(\hat{y_{i}} - \overline{y}\big)^2 \, \textrm{where} \; \overline{y} = \frac{1}{n} \sum_{i=1}^n y_{i} $$

$$ R^2 = 1 - \frac{RSS}{TSS} $$

$$ R^2_{adj} = 1 - \frac{ RSS * \big( n - 1 \big) }{ TSS * \big( n - p \big)} $$

$$ SE = \sqrt{ \frac{RSS}{ n - p }} $$

$$ Var(\hat{\beta}) = SE^2( X^T X )^{-1} $$

$$ SE(\hat{\beta_{i}}) = \sqrt{ Var(\hat{\beta_{i}})} $$

$$ T_{\beta_{i}} = \frac{\hat{ \beta_{i}}}{ SE( \hat{ \beta_{i} } ) } $$

The Gauss Markov Theorem states that Ordinary Least Squares is the Best Linear Unbiased and Efficient (BLUE) estimator of β, conditional on a certain set of assumptions being met. In this context, "best" means that there are no other unbiased estimators with a smaller sampling variance than OLS. Unbiased means that that the expectation of \(\hat{\beta}\) is equal to the population β, or otherwise stated \(E[ \hat{\beta} ] = \beta \). The assumptions that must hold for OLS to be BLUE are as follows:

- The regression model is linear in parameters, and therefore well specified.
- The regressors are linearly independent, and therefore do not exhibit perfect multicollinearity.
- The errors in the regression have a conditional mean of zero.
- The errors are homoscedastic, which means they exhibit constant variance.
- The errors are uncorrelated between observations.
- The errors are normally distributed, and independent and identically distributed (iid).

The first assumption regarding linearity suggests that the dependent variable is a linear function of the independent variables. This does not imply that there is a linear relationship between the independent and dependent variables, it only states the the model is linear in parameters. For example, a model of the form \(y = \alpha + \beta x^2 \) qualifies as being linear in parameters, while \(y = \alpha + \beta x^2 \) does not. If the functional form of a model under investigation is not linear in parameters, it can often be transformed so as to render it linear.

The second assumption that there is no perfect multicollinearity between the regressors is important, as if it exists, the OLS estimator cannot be calculated. Another way of expressing this condition is that one of the independent variables cannot be a function of any of the other independent variables, and therefore the design matrix X must be non-singular, and therefore have full rank.

The third assumption above states that the disturbance term averages out to zero for any given instance of X, which implies that no observations of the independent variables convey any information about the error. Mathematically this is stated as \( E[ \epsilon | X ] = 0 \). This assumption is violated if the independent variables are stochastic in nature, which can arise as a result of measurement error, or if there is endogeneity in the model.

The fourth and fifth assumptions relate to the covariance matrix of the error term, and specifically states that \(E[ \epsilon \epsilon^T | X] = \sigma^2 I \). There are two key concepts embedded in this statement, the first is that the disturbance term has uniform variance of \(\sigma^2\) regardless of the values of the independent variables, and is referred to as homoscedasticity. In addition, the off diagonal terms of the covariance matrix are assumed to be zero, which suggests there is no serial correlation between errors. Either or both of these assumptions may not hold in the real world, in which case a WLS or GLS estimator may prove to be a better, unbiased linear estimator.

The Morpheus API defines an interface called DataFrameRegression which exposes a number of methods that support different linear regression techniques, namely OLS, WLS and GLS. There are overloaded methods that take one or more regressors in order to conveniently support simple and multiple linear regression. The regression interface, which operates on the column data in a DataFrame, can be accessed by calling the regress() method on the frame. If a regression in the row dimension is required, simply call transpose() on the frame before calling regress().

To illustrate an example, consider the same motor vehicle dataset introduced earlier, which can be loaded with the code below. The first 10 rows of this DataFrame is also included for inspection, and in this exercise we are going to be interested in the **EngineSize** and **Horsepower** columns.

Based on our understanding of the factors that influence the power of an internal combustion engine, one might hypothesize that there is a positive and linear relationship between the size of an engine and how much horsepower it produces. . We can use the car dataset to test this hypothesis. First we can use the Morpheus library to generate a scatter plot of the data, with **EngineSize** on the x-axis and **Horsepower** on the y-axis as follows:

The scatter plot certainly appears to suggest that there is a positive relationship between **EngineSize** and **Horsepower**. In addition, it seems somewhat linear, however the dispersion appears to get more significant for larger engine sizes, which would be a violation of one of the Gauss Markov assumptions, namely of homoscedastic errors. Nevertheless, let us proceed to regress **Horsepower** on**EngineSize** and see what the results look like. The code below runs a single variable regression and simply prints the model results to standard-out for inspection.

The regression results yield a slope coefficient of **36.96**, suggesting that for every additional liter of engine capacity, we can expect to add another36.96 horsepower. While the p-value associated with the slope coefficient suggests that it is statistically significantly different from zero, it does not tell us about how relevant the parameter is in the regression. In this case we can reasonably surmise that engine size is relevant given our understanding of how an internal combustion engine works and what factors affect output power. Having said that, the coefficient of determination is perhaps lower than one might expect, and the heteroscedasticity of the residuals provides some hint that the model may be incomplete. In particular, omitted-variable bias may be at play here, in the sense that there are other important factors that influence an engine's horsepower that we have not included.

While the code example above simply prints the model results to standard out, the illustration below demonstrates how to access all the relevant model outputs via the API. The **ols()** method takes a lambda parameter that consumes the regression model, which is an instance of the **DataFrameLeastSquares** interface, and provides all the relevant hooks to access the model inputs and outputs.

Finally, the chart below adds the OLS trendline to the initial scatter plot to get a better sense of how the solution fits the data.

The code to generate this chart is as follows:

An Ordinary Least Squares estimator is said to be unbiased in the sense that if you run regressions on many samples of data generated from the same population process, the coefficient estimates from all these samples would be centered on the true population values. To demonstrate this empirically, we can define a population process in 2D space with a known slope and intercept coefficient, and then proceed to generate many samples from this process while adding Gaussian noise to the dependent variable in order to simulate the error term. The code below defines a data generating function that returns a DataFrame of X and Y values initialized from the population coefficients, while adding white noise scaled according to the standard deviation specified by the sigma parameter.

To get a sense of the nature of the dataset generated by this function for some chosen set of parameters, we can plot a number of samples. The code below plots 4 random samples of this population process with beta = 1.45, alpha = 4.15 and a sigma value of 20.

Given this data generating function, we can produce many samples from a known population process and then proceed to run OLS regressions on these samples. For each run we capture the coefficient estimates, and then plot a histogram of all the recorded estimates to confirm that the coefficients are indeed centered on the known population values. The following code performs this procedure for 100,000 regressions, and is followed by the resulting plots.

The alpha and beta histogram plots above clearly show that the distribution of the 100000 estimates of each coefficient are centered on the known population values. In the case of the slope coefficient, the known population value is 1.45 and the mean value over the 100000 estimates is a good match. Similarly, the intercept estimate mean 4.1266 is very close to the known population value of 4.15.

An OLS estimator is said to be consistent in the sense that as the sample size increases, the variance in the coefficient estimates should decrease. This can be demonstrated empirically once again using the data generation function introduced earlier. In this experiment we run a certain number of regressions based on samples generated from a known population process, but we would need to do this multiple times with increasing sample sizes. The code below implements this by running 100,000 regressions for sample sizes ranging from 100 to 500 in steps of 100, and captures the coefficient estimates for each run. It then plots histograms for the beta and intercept estimates to illustrate then arrowing variance as sample size increases.

Thank you for your interest! We will be in touch shortly.