When fitting an OLS regression model, it may become apparent that there is an inconsistent variance in the residuals, which is known as heteroscedasticity. This is a violation of one of the Gauss Markov assumptions, and therefore OLS is no longer the Best Linear Unbiased Estimator (BLUE). An OLS model in such circumstances is still expected to be unbiased and consistent, however it will not be the most efficient. More concerning is that OLS is likely to yield biased estimates of the standard errors of the coefficients, making statistical interference unreliable. This article introduces Weighted Least Squares regression which is an appropriate model in such circumstances.

When fitting an OLS regression model, it may become apparent that there is an inconsistent variance in the residuals, which is known as heteroscedasticity. As discussed earlier, this is a violation of one of the Gauss Markov assumptions, and therefore OLS is no longer the Best Linear UnbiasedEstimator (BLUE). An OLS model in such circumstances is still expected to be unbiased and consistent, however it will not be the most efficient estimator which suggests there are other estimators that exhibit lower variance in parameter estimates. More concerning is that OLS is likely to yield biased estimates of the standard errors of the coefficients, making statistical interference unreliable.

To address the issue of heteroscedasticity, a Weighted Least Squares (WLS) regression model is used in favor of OLS, which in effect applies a transformation to the original model so that the transformed model does in fact exhibit homoscedastic errors. To demonstrate, consider the usual form of the regression model, but in this case we assume the variance in the error term follows a form of \( Var[\epsilon] = \sigma^2 D\) where D is a diagonal matrix of factors that scales the variance appropriately.

$$ Y = X \beta + \epsilon \ \ \ \ where \ E[\epsilon]=0 \ \ \ \ Var[\epsilon] = \sigma^2 D $$

Now let us assume there exists a matrix P that can transform the above model such that the resulting error term becomes homoscedastic, and therefore satisfies the Gauss Markov assumption of constant variance.

$$ P Y = P X \beta + P \epsilon \ \ \ \ where \ E[P\epsilon]=0 \ \ \ \ Var(P \epsilon) = \sigma^2 I $$

Since the Gauss Markov assumptions are satisfied for the transformed system, we can apply OLS in the usual fashion. The next logical question therefore is, how do we come up with P? It turns out that we can solve for Pin terms of D, the latter of which we can estimate reasonably well from the sample data. Consider the following derivation, where the key assumption is that P is a symmetric matrix and therefore \( P^T = P\).

$$ \begin{align}

Var(P \epsilon) & = \sigma^2 I \\\\

P Var(\epsilon) P^T & = \sigma^2 I \\\\

\sigma^2 P D P^T & = \sigma^2 I \\\\

P D P^ T & = I \\\\

D P^T & = P^{-1} \\\\

D & = (P^2)^{-1} \\\\

P & = D^{-\frac{1}{2}}

\end{align} $$

Now let us re-write the transformed model in terms of Z, B and E as follows:

$$ Z = B \beta + E \ \ \ \ where \ Z = P Y, \ \ B = PX, \ \ and \\ E = P \epsilon $$

We can use the standard OLS estimator for this since we can assume it satisfies the Gauss Markov assumptions.

$$ \hat{\beta}_{ols} = (B^T B)^{-1} B^T Z $$

Knowing that \(B=PX\) and that \( P = D^{-\frac{1}{2}}\) we can write the WLS estimator as follows:

$$ \hat{\beta}_{wls} = (X^T P^T P X)^{-1} X^T P^T P Y $$

$$ \hat{\beta}_{wls} = (X^T D^{-1} X)^{-1} X^T D^{-1} Y $$

We know that D is a diagonal matrix, so the inverse is also a diagonal matrix where the elements are simply the reciprocal of the non-zero elements of D, which we can think of as weights. This makes intuitive sense as the higher the variance, the lower the weight applied to those observations. We can therefore declare that the diagonal weight matrix \( W = D^{-1} \) and can proceed to write the estimator as follows:

$$ \hat{\beta}_{wls} = (X^T W X)^{-1} X^T W Y $$

In Regression Analysis By Example by Chatterjee, Hadi & Price, one of the demonstrations involves a study of 27 industrial companies of various sizes, where the number of workers and number of supervisors was recorded. The study attempts to fit a linear model to this data to assess the nature of the relationship between these two types of employee. As it turns out, when you plot the data using a scatter plot, it becomes apparent that the variance in the dependent variable(number of supervisors) gets larger as the independent variable (number of workers) gets larger. The Morpheus code below can be used to load and plot this data as follows:

If the increasing variance in the dependent variable is not obvious from the scatter plot, it can be useful to plot the residuals from an OLS regression against the fitted values. The code below shows how to do this, and is followed by the resulting plot.

While it is pretty clear from the above plots that heteroscedasticity is present in the data, which is a violation of one of the Gauss Markov assumptions, let us first proceed by running an OLS regression in the usual fashion. The OLS estimates will be useful for comparison with those generated by a WLS regression, but more importantly, we can use the residuals of the former to estimate the weight matrix required by the latter. The code below runs the OLS regression and the prints the results to standard out.

The next step is to use the residuals from the OLS model in order to estimate the weight matrix for WLS. There is no hard rule on how to do this, and each case will require the researcher to come up with an approach for estimating W that makes sense given the sample dataset under investigation. In this particular case however, it is clear that the variance in the dependent variable is proportional to the level of the independent variable, so we could fit a linear model to the OLS residuals against the independent variable and use that as a proxy for the change in variance. Obviously the variance cannot be negative, so we could regress the absolute value of the residuals or the residuals squared.The code below generates a plot of the absolute value of the residuals against the predictor, and fits and OLS model to the dataset.

We can now use the reciprocal of the fitted values of this regression as the elements of the diagonal weight matrix, W. The code below generates this array of weights which can then be passed to the DataFrame.regress().wls() method.

Now that we have the diagonal elements of the W matrix that describes how much weight we wish to apply to each observation of our original dataset, we can run a Weighted Least Squares regression. The code below does exactly this, and prints the model summary results to standard out.

There are a few notable differences in the output between the OLS and WLS models as follows:

- The WLS model yields lower parameter standard errors then OLS.
- The WLS model has narrower confidence intervals for the parameters then the OLS.
- The WLS model has a higher \(R^2\) than OLS

Finally, the code below generates the scatter plot of the data while adding a regression line for both the OLS (red line) and WLS (green line)estimates to see how they compare. It is clear that the WLS regression line is being pulled closer to observations associated with lower values of the independent variable compared to the OLS line, which equally weights all observations.

Conditional on the Gauss Markov assumptions being met for the transformed linear system as described earlier, parameter estimates are expected to be unbiased. Similar to the OLS example, we can demonstrate this empirically by creating many randomly generated 2-D data samples with the required characteristics given known population parameters for the slope and intercept.

The function below returns a Morpheus DataFrame with two columns representing X and Y values, where the Y values are generated by a linear process with given parameters α and β. White noise is added to the Y values with increasing variance as X increases, which will yield a dataset with the desired characteristics(i.e. heteroscedasticity).

To get a sense of what this data looks like, the code below generates 4 random samples of the dataset with 100 observations and plots themon a scatter chart with a fitted OLS regression line. It should be clear from these charts that the dispersion in the dependent variable isincreasing for larger values of the independent variable.

Now that we have a function that can generate data samples given some population parameters and with the appropriate variance characteristics, we can proceed to run regressions on these samples and record the estimates and assess whether they are indeed centered on the known population values. The code below executes 100,000 regressions, collects the sample estimates for each run in a DataFrame, and then plots a histogram of the results.

Obviously the exact estimates for this will vary each time given the stochastic term in our model, however the plots below over a very large number of regressions clearly demonstrate that the distribution of our estimates is centered on the known population value. In fact, for the results printed below, the **mean** slope and intercept coefficient was an exact match to the known population value, at least to two decimal places.

Earlier it was suggested that in the presence of heteroscedasticity, Ordinary Least Squares(OLS) was still expected to be **unbiased** and **consistent** but less **efficient** thanWeighted Least Squares (WLS) in that the variance of parameter estimates are likely to be higher for OLS. This section attempts to show this empirically using the Morpheus library.

To demonstrate the relative efficiency of the two estimators, we can use the same data generating function introduced earlier, and simply run both OLS and WLS regressions on the datasets to see how they compare. In each case, we can capture the parameter estimates from each regression, and then plot the resulting frequency distributions.

The code below runs OLS and WLS regressions on 100,000 samples created by our data generation function, and plots the frequency distributions of the resulting β and α estimates. These results should ideally allow us to conclude that WLS estimates have lower estimation variance than OLS, and we should also be able to confirm that OLS is still **unbiased** in the presence of heteroscedasticity, just not BLUE. The following code executes these regressions, and then generates the plots.

Inspection of the histograms below of the α and β estimates over the 100,000 regressions in the simulation clearly demonstrate that both OLS and WLS are **unbiased** in that they are centered on the known population values for each coefficient. The efficiency of each model however is clearly different, and as expected, the WLS regression model yields much lower variance in both the intercept and slope estimates compared to OLS.

Having empirically demonstrated the unbiasedness and efficiency of the WLS estimator, this section attempts to assess the consistency using the same data generating function introduced earlier. A consistent estimator is one that has a property such that as the sample size of the data being fitted increases, the variance of the estimates around the *true* value decreases. In other words, the estimated coefficients converge in probability n the true population values

The code below runs 100,000 regressions with a *true* beta of 4 and an intercept or alpha of 20, except in this case the process is repeated 5 times with sample sizes of 100, 200, 300, 400 and 500 (so 500,000 regressions in all). The slope and intercept coefficients for all these runs are captured in a DataFrame, which is then used to generate a histogram in order to be able to visualize the variance in the estimates for each scenario.

The plots below demonstrate a clear pattern of decreasing variance in the slope estimate as the sample size increases. With regard to the intercept estimate, the evidence is less obvious, and the chart suggests only a marginal reduction in variance.

Thank you for your interest! We will be in touch shortly.