Class 3: Estimation

In classes 1 and 2, we learned some basics of Stata. In this class, we shall get a hold of the basics of Stata estimation commands. The automobile data will be used as the example data set, and regress will be used to demonstrate the main features of many estimation and post-estimation commands. Major differences arise only for systems of equations, such as simultaneous equations or the Heckman model. Panel, time-series, or survey data will only differ in the amount of time one would need to spend setting up the data for estimation.

All of the estimation commands feature the same overall approach. You need to specify the dependent variable, the explanatory variables (if any), the subsample (if/in), and some estimation options (such as corrections for heteroskedasticity or clustering, maximum likelihood convergence criteria, two-step vs. FIML estimation in the Heckman model, etc.). Stata gives you the estimation output where you can check the basic results, such as the overall significance and the significance of individual explanatory variables. After estimation, you can test linear and nonlinear hypotheses, estimate linear and nonlinear combinations, and get predicted values, predicted probabilities, residuals, and some other observation-level statistics. The results of the estimation are stored in Stata memory until the next estimation command, or until erased explicitly.

OK, let's start:
. cd your favorite location
. log using class3, replace
. sysuse auto

Linear regression

Linear regression is a standard starting point in many analyses (although it really is just a starting point, as the analysis usually moves well beyond the scope of this simple model pretty fast). As we found out in the first class, the Stata command to perform a linear regression is regress. Its full syntax deserves a careful look; let us go slowly through it, exemplifying the concepts as we go.

The basic part of the syntax is
. regress depvar [varlist]
which means a regression where the dependent variable is depvar, and the list of regressors is in varlist. Let's try it out:
. reg pri mpg wei for

In the upper left corner of the regression output is the ANOVA table, i.e., a decomposition of the total sum of squares of the dependent variable. If you don't have a clue about ANOVA, you can leave it at that; suffice it to say that the R2 is computed as the ratio of the Model SS to the Total SS. On the right are the overall summaries of the regression: the number of observations, the F-test for overall significance, R2 and adjusted R2, and the root mean squared error of the residuals (Root MSE).

The most important information is in the lower pane of the regression output. Here we have the names of the dependent variable (price) and of the explanatory variables (mpg, weight, foreign; _cons is the constant term, or the intercept, of the regression. To omit it, specify the nocons option of regress; to indicate to Stata that you already have a constant term among the regressors, which is helpful in some panel data regressions, specify the hascons option). The next column contains the estimated coefficients β = (X'X)^(-1)X'Y. The third column contains the estimated standard errors, the square roots of the diagonal entries of the estimated covariance matrix, which in this case is s^2 (X'X)^(-1) (it is going to be different for some other models). The t-statistics in the next column test the null hypothesis of no linear relation between the respective explanatory variable and the dependent variable conditional on the other explanatory variables (i.e., that the respective β is equal to zero). The p-value for this test is provided in the next column. Finally, the confidence interval for the coefficient is given in the last two columns.

A little bit longer explanation of what Stata shows after regress is available at UCLA Academic Technologies Services Stata website.

From the estimation results, we can infer that weight and price are positively related, that foreign cars are on average more expensive by $3673, and that mileage is not significantly related to price in this model. The R2 is reasonable at 0.50, but not particularly high. Let us refer to this model as Model 1. We can save the estimation results in Stata, too:
. estimates store Model1

As with most Stata commands (and as with all estimation commands), you can add if/in qualifiers to restrict the estimation subsample:
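For instance (a reconstruction: judging by the discussion of Model 4 below, the restriction was to the observations with non-missing rep78):
. reg price mpg weight foreign if rep78 < .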

This will be Model 2. A linear regression model is overly simplistic and very limited, and you as a researcher should get a feeling for whether it is appropriate or not. The minimal thing to do (and it is very easy to do in Stata) is to correct the covariance matrix of the estimates for heteroskedasticity (non-constant variance). Economists know this correction as the White estimator; statisticians, as the Huber estimator; survey researchers, as the linearization estimator. Other names it goes under are the robust estimator (although an argument can be raised against this term, since the meaning of robust methods is somewhat different in the classic texts on robust statistics) and the sandwich estimator (probably the most appropriate and neutral name, since the analytic form of this estimator does resemble a sandwich of several matrices). The option is robust:
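A sketch of the command that belongs here (as noted below, it runs on the full sample):
. reg price mpg weight foreign, robust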
Let us call this Model 3:
. estimates store Model3
Note that we have the same subsample as in Model 1. The point estimates are the same, but the standard errors are slightly larger, except for the foreign variable. Such an increase is a typical outcome of the correction.

In general, the robust correction is available for most models estimated by maximum likelihood; see help for _robust. Linear regression is an example of such a model if one assumes normality of the residuals. The robust correction in this case gives consistent estimates of the standard errors if the distribution of the residuals is different from the normal with fixed parameters. For models such as probit, the correction is for violations of the assumed probability distribution (say, the underlying link between Xβ and the probability of a positive outcome is logit rather than probit).

If you have reasons to believe that some observations are grouped together by the data generating mechanism, you can specify cluster() option to account for this effect:
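For instance (with rep78 as the grouping variable, as the discussion below implies):
. reg price mpg weight foreign, cluster(rep78)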

Let this be Model 4. Note that the estimation subsample was restricted to the observations where rep78 is not missing, so the point estimates are identical to those of Model 2. Some of the standard errors, however, have increased substantially, almost by a factor of two, due to clustering. This usually suggests that observations within clusters have similar values of the corresponding variable. Note also that Stata mentioned that there were 5 clusters, according to the unique values of rep78. If we had more regressors than clusters, Stata would run out of degrees of freedom to assess the overall significance:
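For example (a sketch: any specification with more coefficients than the five clusters will do):
. reg price mpg weight foreign length turn displacement gear_ratio, cluster(rep78)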
Models where the number of clusters is less than the number of explanatory variables produce singular cluster-corrected covariance matrices of the coefficient estimates. This is not as drastic as the under-identification of a linear regression model with fewer observations than explanatory variables, but it is still a rather unpleasant effect. For instance, it means that some combinations of the coefficient estimates will have zero estimated variance, an effect similar to collinearity among the regressors.

The correction for clustering is a must for survey data where the clustering reflects the multi-stage sampling design. The correction to be made then is for clustering at the very first level of sampling.
Self Check: Run the regression of mpg on weight, foreign and length. Do the results conform to your expectation? What is the R2 of the model? Which variables are significant?

Post-estimation commands

Saved results

As was mentioned before, Stata saves estimation results in its memory between two estimation commands. One can redisplay the results in their original form by typing the estimation command again, without any arguments:
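For regress, simply:
. regress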
In this case, the last estimated model was the one with the insufficient number of clusters (or, rather, the excessive number of variables). Let us fall back to Model 3 with "robust" standard errors, the reason being that it is believed to give the most realistic estimates of the variance:
. estimates restore Model3
The list of all saved information can be retrieved through
. ereturn list
And it can be fairly long and technical...

What we can recognize here are the number of observations in the subsample e(N), the degrees of freedom of the model e(df_m) and of the residuals e(df_r), the value of R2 e(r2)... and maybe some other stuff. The vector of the estimated coefficients is stored in e(b), and we can access it by
. matrix list e(b)
(we won't go into many details about the matrix commands at this point; you can look up help for matrix if you are curious). Also, the covariance matrix of the coefficient estimates is stored as e(V), and there is a shortcut to access it:
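That shortcut is the vce command (estat vce in newer versions of Stata):
. vce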
Finally, the estimation subsample is marked inside Stata (for stored estimates, as a 0/1 variable _est_modelname), and the e(sample) function can be used to perform operations exactly on those observations that were used in estimation:
. count if e(sample)
In fact, other Stata commands also save something in Stata memory that can be used occasionally:
Try
. sum ...
. ret li
to see how one can access the results of summarize, for instance.

There is also subscripting notation for the coefficients and their standard errors:
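For instance, for the weight coefficient:
. display _b[weight]
. display _se[weight]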

For multiple equations, the notation will be [eqname]_b[varname].

Hypotheses testing

Let us first replicate the results of hypothesis testing with our powerful hand calculator... which is the display command, together with the saved results and a few statistical functions.
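For instance, for the foreign coefficient (the Self Check below asks you to repeat this for weight; ttail() returns the upper-tail probability of the t-distribution):
. display _b[foreign]/_se[foreign]
. display 2*ttail(e(df_r), abs(_b[foreign]/_se[foreign]))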
Self Check: Conduct the test of hypothesis that _b[weight]=0. You would need one display command to get the statistic, and another one, to compute the p-value. See help on probability distributions to find the necessary function that returns the CDF/tail probability of the normal or t-distribution.
Well, that was the hard way to do it; the simple way is, of course, with an obvious command:
. test testexp ...
where testexp, optionally put into parentheses, is a variable name, or linexp = #, where linexp is a linear combination of parameters / variable names.
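Along these lines (a reconstruction; the conditions in the last command are purely illustrative):
. test mpg
. test mpg weight
. test weight = mpg
. test (weight + 2*mpg = 0) (foreign = 3673)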

The first test command shows the basic syntax, and the results (in terms of the p-value) should be identical to the results in the corresponding column of the regression output. The joint test for two parameters is performed in the second test. Test of equality of the parameters only requires specifying a linear combination, as in the third test command. Note how Stata reformulated the specified condition by taking mpg to the left hand side of the equality. Finally, some crazy combinations of parameters can be specified, and multiple tests conducted by wrapping individual conditions with parentheses.

For multiple equation systems, you would need to specify the name of the equation as explained above.

The nonlinear hypotheses can be tested with testnl:
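A sketch of the condition discussed below:
. testnl _b[weight]/_b[mpg] = 1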

The problem with nonlinear tests, however, is that they are sensitive to the particular specification (to be precise, this is a problem with Wald-type tests). Note that the condition is essentially the same as test weight = mpg, but the p-value is different from the one above (0.6030 vs. 0.8192).
Self Check: From your previous regression of mpg on weight, foreign and length, test all the pairwise combinations of parameters. Do you need to test all of the parameters together? If yes, do so. If not, explain where you can find the test results.

Predictions

To get observation-level estimation results, such as predicted values, their standard errors, residuals, predicted probabilities, etc., use predict. It is a universal post-estimation command of the form
. predict [type] newvarname [, option]
where the options vary from one estimation command to another. regress probably has the greatest number of predict options across the board. One can predict the fitted values (the default, or the xb option) and their standard errors (stdp; another option, stdf, gives the standard error of the forecast, i.e., of the combination of the predicted value and the yet unobserved error term); the residuals -- regular (residuals), standardized (rstandard: a residual over its standard deviation) and studentized (rstudent: the difference between the observed value and the prediction obtained from the regression where the current observation is excluded) -- as well as the standard error of the regular residual (stdr); the probability of an interval (pr(...)); and a number of influence diagnostics.
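For instance (the new variable names are arbitrary):
. predict pricehat, xb
. predict ehat, residuals
. predict sehat, stdp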

For a general combination of the estimated coefficients, you can use lincom and nlcom to get linear and nonlinear combinations of the coefficients, respectively, along with their standard errors and confidence intervals.
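For instance (purely illustrative combinations):
. lincom foreign + 100*weight
. nlcom _b[weight]/_b[mpg]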
Self Check: From your previous regression of mpg on weight, foreign and length, predict residuals and fitted values. summarize them along with mpg and comment on the results.

Post-estimation commands specific for regress

There is much more to linear regression than is typically taught even in the most advanced econometrics classes. See the help on regression diagnostics, and get a hold of the theoretical foundations with books like Draper & Smith, Applied Regression Analysis (3rd edition, Wiley, 1998) or Belsley, Kuh and Welsch, Regression Diagnostics (Wiley, 1980). Let's have a look at a couple of regression diagnostic plots to safeguard against any obvious faults. Some of them are only available after the basic formulation without robust, so let us
. est restore Model1
Most of those diagnostic tests are based on the regression residuals e_i under the null hypothesis that heteroskedasticity is not a problem. At any rate, we shall not be performing statistical inference for the original regression, and the residuals are the same in Models 1 and 3.
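A couple of the standard ones (a sketch; in newer versions of Stata, hettest and ovtest are spelled estat hettest and estat ovtest): rvfplot graphs the residuals against the fitted values; hettest performs the Cook-Weisberg test for heteroskedasticity; ovtest performs the Ramsey RESET test, which can pick up neglected nonlinearities.
. rvfplot
. hettest
. ovtest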
Self Check: From your previous regression of mpg on weight, foreign and length,
  1. Run tests of nonlinearity and heteroskedasticity. Is the correction for heteroskedasticity needed in this case? Compare the output with and without the robust option.
  2. Check the residuals graphically with the commands explained above.

Generalizations

What was said above about regress remains valid for most estimation commands:
  1. They all store the estimation results in memory, at least temporarily, or for longer periods with estimates store
  2. You can use test, testnl, lincom, nlcom to form combinations of coefficients and test hypotheses about them
  3. You can get the var-cov matrix by vce
  4. You can predict something based on the estimation results
Most of the estimation commands will have some unique predict options, as well as some specification tests specific to those commands. They will be mentioned as we go along and explore them.

Self Check: Estimate the probit regression of foreign on mpg, weight and log of price. Test hypotheses about individual coefficients, pairs of coefficients, and the model as a whole. Predict the probabilities of a positive outcome.
A note on interpretation: economists tend to think about LHS variables as endogenous ones, so that there really is a process that generates the LHS variables conditional on the values of the RHS variables. This is not the only possible interpretation of regression models, including the probit model. Think about classification problems: if you are given a set of characteristics, can you recognize which category the object belongs to? Applications of such problems range from spam filtering to credit card approval. Here, given the values of mpg, weight and log of price, is it more likely that a given car was made in the US or abroad?


Let us stop here and review what we've done in this class:

More on regression: web book at UCLA Academic Technologies Server

Your further steps in mastering Stata estimation commands would be to check the following:


On to the next class, to learn about how repeating things in Stata does not have to be boring.

Back to the list of classes and Stata resources


Questions, comments? E-mail me!
Stas Kolenikov