All of the estimation commands share the same overall approach. You specify the dependent variable, the explanatory variables (if any), the subsample (if/in), and estimation options (such as corrections for heteroskedasticity or clustering, maximum likelihood convergence criteria, two-step vs. FIML estimation in the Heckman model, etc.). Stata gives you the estimation output, where you can check the basic results such as overall significance and the significance of individual explanatory variables. After estimation, you can test linear and nonlinear hypotheses, estimate linear and nonlinear combinations, and get predicted values, predicted probabilities, residuals, and other observation-level statistics. The results of the estimation are stored in Stata memory until the next estimation command, or until erased explicitly.
OK, let's start:
. cd your favorite location
. log using class3, replace
. sysuse auto
Linear regression
Linear regression is a standard starting point in many analyses (although it
really is a starting point, as the analysis usually moves much beyond the scope
of this simple model pretty fast). As we found out in the first class, the Stata command
to perform a linear regression is regress. Its full syntax deserves to be
written out here.
Let us go slowly through it exemplifying the concepts as we go.
The basic part of the syntax is
. regress depvar [varlist]
which means a regression where the dependent variable is depvar, and the list
of regressors is in varlist. Let's try it out:
. reg pri mpg wei for
The most important information is in the lower pane of the regression output.
Here we have the names of the dependent variable (price) and of the explanatory variables
(mpg, weight, foreign; _cons is the constant term, or the intercept
of the regression. To omit the constant, specify the nocons option of regress.
To indicate to Stata that you already have a constant term among regressors (this is
helpful in some panel data regressions), specify hascons option). The next
column contains the estimated coefficients β = (X'X)^(-1) X'Y.
The third column holds the estimated standard errors, which are the square roots of the diagonal
entries of the estimated covariance matrix, which in this case is s^2 (X'X)^(-1)
(it is going to be different for some other models). The t statistics in the next
column are the ones to test the null hypothesis of no linear relation between the explanatory
and the dependent variable conditional on the other explanatory variables (i.e., that the respective
β is equal to zero). The p-value for the test is provided in the next column.
Finally, the confidence interval for the coefficient is given in the last two columns.
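The coefficient formula above can be reproduced by hand with Stata's matrix commands. This is only an illustrative sketch for the regression at hand (matrix accum forms X'X and appends the constant term automatically; vecaccum puts the dependent variable first):
. matrix accum XX = mpg weight foreign
. matrix vecaccum yX = price mpg weight foreign
. matrix b = yX * invsym(XX)
. matrix list b
The resulting row vector b should match the coefficient column of the regress output.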
A somewhat longer explanation of what Stata shows after regress is available
at the UCLA Academic Technologies Services
Stata website.
From the estimation results, we can infer that weight and price are positively related,
that foreign cars are on average more expensive by $3673, and that mileage is not
significantly related to price in this model. The R² is reasonable at 0.50, but
not particularly high. Let us refer to this model as Model 1. We can save the estimation
results in Stata, too:
. estimates store Model1
As with most Stata commands (and as with all estimation commands), you can add
if/in qualifiers to restrict the estimation subsample:
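For instance, one could restrict attention to the domestic cars, or to the first half of the sample (the particular subsamples here are arbitrary illustrations; note that foreign is dropped from the first regression since it does not vary among domestic cars):
. reg price mpg weight if foreign==0
. reg price mpg weight foreign in 1/37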
In general, the robust correction is available for most models estimated by maximum likelihood;
see help for _robust.
Linear regression is an example of such a model if one assumes normality of the residuals. The
robust correction in this case gives consistent estimates of the standard errors
when the distribution of the residuals differs from the normal with fixed parameters.
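For the regression at hand, the correction is requested with the robust option (the name Model2 is just an assumed label, used for later comparisons):
. reg pri mpg wei for, robust
. estimates store Model2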
For models such as probit, the correction is for violations of the assumed
probability distribution (say the underlying link between Xβ and the probability of a
positive outcome is logit rather than probit).
If you have reasons to believe that some observations are grouped together by the data
generating mechanism, you can specify the cluster() option to account for this effect:
The correction for clustering is a must for survey data where the clustering reflects the
multi-stage sampling design. The correction to be made then is for clustering at the very first
level of sampling.
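A sketch of such a call, assuming the repair record rep78 is the grouping variable (an arbitrary choice for illustration only; note that observations with a missing cluster variable are dropped from the estimation):
. reg pri mpg wei for, cluster(rep78)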
. estimates store Model3
Note that we have the same subsample as in Model 1. The point estimates
are the same, but the standard errors are slightly larger, except for the foreign variable.
This increase is a typical outcome of the correction.
Self Check: Run the regression of mpg on weight, foreign and length. Do the results conform to your expectations? What is the R² of the model? Which variables are significant?
There is also subscripting notation for the coefficients and their standard errors:
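For example, with the last regression in memory (these commands display the stored results for the weight coefficient):
. display _b[weight]
. display _se[weight]
. display _b[weight]/_se[weight]
The last ratio reproduces the t statistic from the regression output.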
Self Check: Conduct the test of the hypothesis that _b[weight]=0. You will need one display command to get the statistic, and another one to compute the p-value. See the help on probability distributions to find the function that returns the CDF/tail probability of the normal or t distribution.
The first test command shows the basic syntax, and the results (in terms of the p-value) should be identical to the results in the corresponding column of the regression output. The joint test for two parameters is performed in the second test. Test of equality of the parameters only requires specifying a linear combination, as in the third test command. Note how Stata reformulated the specified condition by taking mpg to the left hand side of the equality. Finally, some crazy combinations of parameters can be specified, and multiple tests conducted by wrapping individual conditions with parentheses.
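The commands described in the previous paragraph might look as follows (the particular hypotheses are illustrative, based on the price regression above):
. test mpg
. test mpg weight
. test mpg = weight
. test (mpg = weight) (foreign = 0)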
For multiple equation systems, you would need to specify the name of the equation as explained above.
The nonlinear hypotheses can be tested with testnl:
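An illustrative call (the hypothesis itself is made up for demonstration purposes):
. testnl _b[mpg]/_b[weight] = 1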
Self Check: From your previous regression of mpg on weight, foreign and length, test all the pairwise combinations of parameters. Do you need to test all of the parameters together? If yes, do so. If not, explain where you can find the test results.
For a general combination of estimated coefficients, you can use lincom and nlcom to get linear and nonlinear combinations of the coefficients, respectively, along with their standard errors and confidence intervals.
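Illustrative calls, again based on the price regression (the particular combinations are arbitrary):
. lincom mpg + weight
. nlcom _b[mpg]/_b[weight]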
Self Check: From your previous regression of mpg on weight, foreign and length, predict the residuals and fitted values. summarize them along with mpg and comment on the results.
Self Check: Estimate the probit regression of foreign on mpg, weight and log of price. Test hypotheses about individual coefficients, pairs of coefficients, and the model as a whole. Predict the probabilities of a positive outcome.
A note on interpretation: economists tend to think about LHS variables as endogenous ones, so that there really is a process that generates the LHS variables conditional on the values of the RHS variables. This is not the only possible interpretation of regression models, including the probit model. Think about classification problems: given a set of characteristics, could you recognize which category an object belongs to? Applications of such problems range from spam filtering to credit card applications. Here, given the values of mpg, weight and log of price, is it more likely that a given car was made in the US or abroad?
Let us stop here and review what we've done in this class:
More on regression: web book at UCLA Academic Technologies Server
Your further steps in mastering Stata estimation commands would be to check the following:
On to the next class, to learn about how repeating things in Stata does not have to be boring.
Back to the list of classes and Stata resources
Questions, comments? E-mail me!
Stas Kolenikov