Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import statsmodels.api as sm

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
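The draws above are random, so the numbers in the outputs below will differ from run to run. For reproducible results one could seed NumPy's global random number generator before drawing y (a minimal sketch; the seed value is arbitrary):

np.random.seed(12345)  # fix the seed before drawing so y is reproducible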

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.983
Model:                            OLS   Adj. R-squared:                  0.982
Method:                 Least Squares   F-statistic:                     888.8
Date:                Mon, 24 Feb 2020   Prob (F-statistic):           1.03e-40
Time:                        22:49:06   Log-Likelihood:                -1.5789
No. Observations:                  50   AIC:                             11.16
Df Residuals:                      46   BIC:                             18.81
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.8491      0.089     54.642      0.000       4.671       5.028
x1             0.5219      0.014     38.134      0.000       0.494       0.549
x2             0.4503      0.054      8.369      0.000       0.342       0.559
x3            -0.0214      0.001    -17.806      0.000      -0.024      -0.019
==============================================================================
Omnibus:                        2.777   Durbin-Watson:                   1.657
Prob(Omnibus):                  0.249   Jarque-Bera (JB):                1.644
Skew:                           0.163   Prob(JB):                        0.440
Kurtosis:                       2.174   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
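The printed summary is convenient for reading, but every quantity in it is also available programmatically on the results object. A minimal sketch using standard attributes of the statsmodels results API:

print(olsres.params)      # estimated coefficients
print(olsres.bse)         # their standard errors
print(olsres.rsquared)    # R-squared of the fit
print(olsres.conf_int())  # 95% confidence intervals (alpha=0.05 by default)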

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.31422266  4.78974504  5.22877456  5.60677211  5.90805457  6.1283716
  6.2756039   6.36846844  6.43344333  6.50041753  6.5977802   6.7477564
  6.96275562  7.24333291  7.57809768  7.94558516  8.31778302  8.66473365
  8.95945619  9.18237935  9.32455694  9.38913771  9.39084801  9.35357229
  9.30642897  9.27898633  9.29640456  9.37530201  9.52102481  9.72676857
  9.97469625 10.23886881 10.48950687 10.69788338 10.84104345 10.90557568
 10.88981571 10.80411958 10.66916236 10.51254119 10.36424123 10.25170915
 10.19534371 10.20514433 10.27906819 10.40336586 10.55484031 10.7046581
 10.82308664 10.88437917]
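predict returns point predictions only. If standard errors or intervals are needed as well, the results object offers get_prediction, whose summary_frame collects everything in one table (a sketch; alpha sets the interval level):

pred = olsres.get_prediction(X)
print(pred.summary_frame(alpha=0.05).head())
# columns: mean, mean_se, mean_ci_lower/upper (CI for the mean), obs_ci_lower/upper (prediction interval)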

Create a new sample of explanatory variables Xnew, predict out of sample, and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.85680717 10.70864515 10.45755087 10.14376412  9.82025463  9.53975322
  9.34184147  9.24326075  9.23381337  9.27885931]
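Since the data are artificial, the true coefficients are known, and we can measure how far the out-of-sample predictions drift from the noiseless curve (a sketch reusing the beta defined above):

ynew_true = np.dot(Xnew, beta)               # noiseless values at the new points
print(np.max(np.abs(ynewpred - ynew_true)))  # worst-case extrapolation error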

Plot comparison

[7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
[Figure: observed data, the true curve, and the OLS in-sample and out-of-sample predictions]

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform, i.e., we do not want any formula-expansion magic from using ** on formula terms; inside I() the expression is evaluated as plain arithmetic.
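To see exactly which columns the formula generates, one can build the design matrix directly with patsy, the formula engine behind statsmodels.formula.api (a minimal sketch):

import patsy

dm = patsy.dmatrix("x1 + np.sin(x1) + I((x1 - 5)**2)", data)
print(dm.design_info.column_names)     # ['Intercept', 'x1', 'np.sin(x1)', 'I((x1 - 5) ** 2)']
print(np.allclose(np.asarray(dm), X))  # same matrix we built by hand above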

[9]:
res.params
[9]:
Intercept           4.849149
x1                  0.521925
np.sin(x1)          0.450267
I((x1 - 5) ** 2)   -0.021397
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are created automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    10.856807
1    10.708645
2    10.457551
3    10.143764
4     9.820255
5     9.539753
6     9.341841
7     9.243261
8     9.233813
9     9.278859
dtype: float64
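get_prediction works with the formula interface as well: pass the untransformed variable and the design matrix and intervals are computed automatically (a sketch):

pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))  # out-of-sample predictions with confidence and prediction intervals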