Robust Linear Models¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
Estimation¶
Load data:
[3]:
data = sm.datasets.stackloss.load(as_pandas=False)
data.exog = sm.add_constant(data.exog)
Huber’s T norm with the (default) median absolute deviation scaling
[4]:
huber_t = sm.RLM(data.endog, data.exog, M=sm.robust.norms.HuberT())
hub_results = huber_t.fit()
print(hub_results.params)
print(hub_results.bse)
print(hub_results.summary(yname='y',
xname=['var_%d' % i for i in range(len(hub_results.params))]))
[-41.02649835 0.82938433 0.92606597 -0.12784672]
[9.79189854 0.11100521 0.30293016 0.12864961]
Robust linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 21
Model: RLM Df Residuals: 17
Method: IRLS Df Model: 3
Norm: HuberT
Scale Est.: mad
Cov Type: H1
Date: Mon, 24 Feb 2020
Time: 22:49:06
No. Iterations: 19
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
var_0 -41.0265 9.792 -4.190 0.000 -60.218 -21.835
var_1 0.8294 0.111 7.472 0.000 0.612 1.047
var_2 0.9261 0.303 3.057 0.002 0.332 1.520
var_3 -0.1278 0.129 -0.994 0.320 -0.380 0.124
==============================================================================
If the model instance has been used for another fit with different fit parameters, then the fit options might not be the correct ones anymore .
Huber’s T norm with ‘H2’ covariance matrix
[5]:
hub_results2 = huber_t.fit(cov="H2")
print(hub_results2.params)
print(hub_results2.bse)
[-41.02649835 0.82938433 0.92606597 -0.12784672]
[9.08950419 0.11945975 0.32235497 0.11796313]
Andrew’s Wave norm with Huber’s Proposal 2 scaling and ‘H3’ covariance matrix
[6]:
andrew_mod = sm.RLM(data.endog, data.exog, M=sm.robust.norms.AndrewWave())
andrew_results = andrew_mod.fit(scale_est=sm.robust.scale.HuberScale(), cov="H3")
print('Parameters: ', andrew_results.params)
Parameters: [-40.8817957 0.79276138 1.04857556 -0.13360865]
See help(sm.RLM.fit)
for more options and module sm.robust.scale
for scale options
Comparing OLS and RLM¶
Artificial data with outliers:
[7]:
nsample = 50
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, (x1-5)**2))
X = sm.add_constant(X)
sig = 0.3 # smaller error variance makes OLS<->RLM contrast bigger
beta = [5, 0.5, -0.0]
y_true2 = np.dot(X, beta)
y2 = y_true2 + sig*1. * np.random.normal(size=nsample)
y2[[39,41,43,45,48]] -= 5 # add some outliers (10% of nsample)
Example 1: quadratic function with linear truth¶
Note that the quadratic term in OLS regression will capture outlier effects.
[8]:
res = sm.OLS(y2, X).fit()
print(res.params)
print(res.bse)
print(res.predict())
[ 5.10884646 0.53153982 -0.01493959]
[0.43932799 0.06782631 0.00600158]
[ 4.73535665 5.01080071 5.28126698 5.54675546 5.80726615 6.06279905
6.31335416 6.55893148 6.79953101 7.03515275 7.2657967 7.49146286
7.71215123 7.92786181 8.1385946 8.3443496 8.54512681 8.74092623
8.93174786 9.1175917 9.29845775 9.474346 9.64525647 9.81118915
9.97214404 10.12812114 10.27912045 10.42514197 10.56618569 10.70225163
10.83333978 10.95945014 11.08058271 11.19673748 11.30791447 11.41411367
11.51533508 11.6115787 11.70284452 11.78913256 11.87044281 11.94677526
12.01812993 12.08450681 12.14590589 12.20232719 12.2537707 12.30023641
12.34172434 12.37823448]
Estimate RLM:
[9]:
resrlm = sm.RLM(y2, X).fit()
print(resrlm.params)
print(resrlm.bse)
[ 4.99505385 0.52677011 -0.00604195]
[0.15225709 0.02350644 0.00207996]
Draw a plot to compare OLS estimates to the robust estimates:
[10]:
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111)
ax.plot(x1, y2, 'o',label="data")
ax.plot(x1, y_true2, 'b-', label="True")
prstd, iv_l, iv_u = wls_prediction_std(res)
ax.plot(x1, res.fittedvalues, 'r-', label="OLS")
ax.plot(x1, iv_u, 'r--')
ax.plot(x1, iv_l, 'r--')
ax.plot(x1, resrlm.fittedvalues, 'g.-', label="RLM")
ax.legend(loc="best")
[10]:
<matplotlib.legend.Legend at 0x7fdea84ea550>

Example 2: linear function with linear truth¶
Fit a new OLS model using only the linear term and the constant:
[11]:
X2 = X[:,[0,1]]
res2 = sm.OLS(y2, X2).fit()
print(res2.params)
print(res2.bse)
[5.71100351 0.38214389]
[0.3860745 0.03326574]
Estimate RLM:
[12]:
resrlm2 = sm.RLM(y2, X2).fit()
print(resrlm2.params)
print(resrlm2.bse)
[5.25790817 0.46873405]
[0.12445114 0.01072321]
Draw a plot to compare OLS estimates to the robust estimates:
[13]:
prstd, iv_l, iv_u = wls_prediction_std(res2)
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(x1, y2, 'o', label="data")
ax.plot(x1, y_true2, 'b-', label="True")
ax.plot(x1, res2.fittedvalues, 'r-', label="OLS")
ax.plot(x1, iv_u, 'r--')
ax.plot(x1, iv_l, 'r--')
ax.plot(x1, resrlm2.fittedvalues, 'g.-', label="RLM")
legend = ax.legend(loc="best")
