Class 1: Basics of Stata
A few words...
Stata is a general-purpose statistical package
with a main focus on econometrics (including panel models, limited dependent variables,
and systems of equations), biometrics (including survival analysis), and survey methods.
We shall use a variety of resources to learn it, but the most
important component is your own active participation.
To start Stata, locate the Stata icon on your desktop, or go to
Start > Programs > Stata (it may be Stata 8, Small Stata, Intercooled Stata 8, or something
else that has the word "Stata" in it!). On UNIX machines, type
xstata
or
stata
at the command prompt to start the graphical or the text interface of Stata, respectively.
You can change the font in a specific window by left-clicking the top left corner
of the window where the font is to be changed.
You might want to save your preferences for later sessions.
To exit Stata, type
. exit
at the command prompt, or click the X button at the top right corner, or double-click the
top left corner icon. If there are any unsaved data, Stata will
resist exiting.
To search for a concept, go to Help menu > Search:
or type something like
. search linear regression
at the command prompt. By default, search goes over the built-in
help files available locally on your machine. If you want to also search over
the Internet (which will bring you some goodies from other Stata users as well), enter
. search phrase, all
or
. findit phrase
To find help on a command you already know, go to Help menu > Command:
or type
. help regress
in the command prompt for the help file to show in the Results window, or
. whelp regress
for the help file to show in the Viewer window (more convenient).
All of the Stata help files are duplicated on the web.
Self Check:
What command performs linear regression?
What command performs logistic regression?

Stata has a somewhat idiosyncratic system of references to its manuals. At the bottom of pretty much
every help file you'll see something like this:
The blue entries are clickable (well, not on this page, but back in Stata), and the Manual refers
to various pieces of Stata documentation:
 [U] refers to the User's Guide;
 [R] refers to the four-volume Reference Manual;
 [P] refers to the Programming Reference Manual;
 [G] refers to the Graphics Reference Manual;
 [XT] refers to the Cross-Sectional Time-Series Reference Manual;
 [ST] refers to the Survival Analysis and Epidemiological Tables Manual;
 [TS] refers to the Time-Series Reference Manual;
 [SVY] refers to the Survey Data Reference Manual.
Where am I?
Stata shows the name of the current directory in the status line. To display it in the Results window, type
. pwd
(short for present working directory). To change it, type
. cd h:/stata
(short for change directory).
More on working directories later, with the discussion of do-files.
Do you still remember?..
Make sure you keep a log of what you are doing! Stata has two sorts of logs:
. log using filename
to copy everything that goes to the Results window into a specified file, and
. cmdlog using filename
to copy the commands only (useful for converting an interactive session into a do-file; more on do-files
later). See help log
for the useful options replace and append, and for the formats smcl
(hyperlinked) and text (plain text)! You can also use the streetlight button
on the toolbar to control the log flow. The status bar at the bottom of the Results window
shows the current log status.
You can close the log file explicitly by typing
. log close
but Stata will also close the file automatically when you exit.
Advice: open a command log in the startup script, and open / close a log file in every do-file you write.
More on that later.
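As a minimal sketch of that advice, a do-file might wrap its work in a log as shown here. The file name class1 is made up for illustration; replace and text are the options mentioned above.

```stata
* Open a plain-text log, overwriting any previous copy
* (class1 is a hypothetical file name)
log using class1, replace text

sysuse auto, clear
describe

* Close the log explicitly at the end of the do-file
log close
```

With the text option, the log ends up in class1.log in the current directory, so it can be opened in any editor.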
Most Stata commands have the following syntax:
command [variable(s)] [if expression] [in obs. range] [[weights]] [using filename] [, options]
The command is performed on the variables if applicable, a particular
subset of observations may be selected by the if and in modifiers, and the specific
ways the command should behave are controlled by options. The most complex and powerful
commands may have several dozen different options.
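To make the pieces of the syntax concrete, here is a small sketch on the auto data set that ships with Stata. The particular variables and cut-offs are my own choice for illustration only.

```stata
sysuse auto, clear

* command: list; variables: make price mpg;
* if: mpg > 25; in: observations 1 through 50
list make price mpg if mpg > 25 in 1/50

* options come after the comma; here detail asks for extra statistics
summarize price if foreign == 1, detail
```

Everything before the comma says what to work on; everything after it says how.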
See also a note at UCLA Academic Technologies
Services Stata website.
Stata has its own format of data files. Stata data files have extension .dta. To load
a data file into memory, type
. use auto
Stata will look for the file named auto.dta in the current directory.
In fact, this is a file that comes with the distribution of Stata
and is available in Stata system directory. Try this out:
. sysuse auto
use is a more versatile command than it may seem, as it allows loading a portion of the
data set with if and in modifiers (more on that later), as well as loading
the data over the Internet:
. use http://wps.aw.com/wps/media/objects/284/291498/caschool.dta
You can come across some errors while loading the data. Stata will show error messages
in red, and the typical messages will be:
no; data in memory would be lost
You tried to load data on top of existing data that were not saved. If you are
sure you want to lose the changes you made to your current data, type
. clear
to clear the memory, or specify the clear option, as in use ..., clear.
no room to add more observations
Stata did not have enough memory to load all of the observations. The next few lines
give some suggestions on how to proceed, and the most important one is how to set the amount
of the memory that Stata should request from the operating system.
Memory issues
To see how much memory Stata uses now, and how this memory is being used, type
. memory
To change the amount of memory Stata can address, type
. set memory 40m
to request 40 Mbytes of memory.
You can usually figure out how much memory Stata may need by looking at the sizes
of the data files you are planning to work with. You can do that by typing
. ls
or better
. ls *.dta
to request a listing of .dta files.
Stata also needs some overhead for programs and temporary objects; most of the time,
adding some 15 to 25% is enough. Keep in mind, however, that when the amount of memory reaches
the physical memory of the computer (256 or 512 Mbytes on contemporary PCs), the operating system begins swapping
the data into temporary files on the hard disk, and the operation slows down by a factor of
about 100 to 1000. (This is not Stata's fault; this is the way virtual memory is organized
in Windows. UNIX operation is usually much smoother.)
There is also a neat trick of loading only the data you need. Recalling the general syntax
of a Stata command, you might try
. use varlist if exp using filename
to load only those observations and variables you actually need. It does save you
a lot of trouble.
Another trick you might want to explore is to take a subsample of your data with
. sample # [, count]
and design your analysis with this subsample. Once you have a clear plan of what you
want to do, write a do-file (i.e., a sequence of Stata commands that can be run from
within Stata) and leave it overnight, or over the weekend... or over vacation if you are
going to have any at all :).
To save the data, type
. save filename
You might need to specify the replace option, as in save ..., replace, if the specified file
already exists. USE WITH CAUTION! You must keep your original data
in a safe place, and be sure never, ever to overwrite it.
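The partial-loading and subsampling tricks can be tried out on a small scale. This is only a sketch: myauto is a hypothetical working copy of the auto data, and the seed value is arbitrary.

```stata
sysuse auto, clear
save myauto, replace       // myauto is a hypothetical working copy

* Load four variables, and only the foreign cars
use make price mpg foreign if foreign == 1 using myauto, clear
describe

* Take a reproducible 50% subsample of what is in memory
set seed 12345
sample 50
count
```

Setting the seed first means the same subsample is drawn every time the lines are rerun.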
See also UCLA Academic Technologies
Services Stata web site.
Converting the data from other formats
Unfortunately, Stata cannot read formats other than its own .dta or text files.
However, you can conveniently convert the data between different formats with third-party utilities
such as StatTransfer or
DBMS/COPY.
Stata supposedly also works with SAS XPT files (see
fdause), but I've never tried it.
Sometimes, the raw data comes in a plain text format, either
a fixed format (where the data from a particular column should
form a variable), or comma/tab separated format (where the data for a
single observation are listed in a line and are separated by a comma or
a tab character). Such data can be read into Stata by
. infile variables using filename [, clear]
This is also the way to export data from Excel: the data should be saved
as plain text (comma-separated format, csv), and then they can be read
into Stata by infile or insheet (infix is meant for fixed-format files).
See more extensive description at
UCLA Academic Technologies Services web site.
From Stata, you can also save the data into a text file (to be exported to other
applications):
. outfile [variables] using filename [, replace]
Stata comes with a number of toy data sets that are often used for educational purposes. You
can retrieve the list of those data sets by
. sysuse dir
and load the most popular data set on a few cars and their characteristics by
. sysuse auto
Let us
. sysuse auto
to begin our session with the data. What kind of data is there? Type
. describe
to get an overall picture. In fact, it suffices to type
. d
for Stata to recognize that this is an abbreviation of describe. We shall denote the minimal
abbreviation by underlining it; for describe, the minimal abbreviation is d:
. describe
Aside: Stata colors
Stata uses several different colors in its output, each with a consistent meaning.
White text shows the input Stata receives from the user (or from do-files).
Green text shows the fixed text Stata outputs. Yellow text shows the variable text
Stata outputs (e.g., variable names, numbers, file names, etc.). Red text is used for errors.
Blue text shows clickable elements: by clicking on it, you will invoke a certain Stata command,
mostly displaying a help file, or launching a browser if the highlighted element is a URL.
Aside: Stata guts: storage types
One important piece of information we can take from the above description of the data set
is the storage types of the variables. Stata distinguishes discrete,
floating point (roughly speaking, continuous) and string data types. Within each of those
types, there are also distinctions by the range of the values and accuracy. Most importantly,
for many applications the accuracy of the float type is insufficient, and it is a good
idea to use double precision in your data files and programs.
Also, the unique identifiers in microdata sets
tend to be huge numbers, and it might be a good idea to store those as
long. (Technically speaking, those are the data types available in the C programming language
in which the core of Stata is written.)
See online help for more details.
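A quick way to see why float accuracy can be a problem is to compare a number with its float-rounded version. This is a small sketch that uses the built-in float() function; the number 0.1 is just a convenient example.

```stata
* 0.1 has no exact binary representation, so rounding it to float
* precision changes its value relative to the double-precision 0.1
display 0.1 == float(0.1)       // shows 0 (false)
display %20.18f 0.1
display %20.18f float(0.1)

* The storage type of each variable is shown in the second column
sysuse auto, clear
describe
```

The two formatted displays differ after the first several digits, which is exactly the accuracy lost by storing a value as float.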
Self Check:
Which of the following will produce an accurate answer, and which one will not?
Taking a square root of 2, and then raising it to the second power
Raising 2 to power 2, and then taking a square root
Check in Stata. (You would need to find appropriate functions by search.)

As one of the memory considerations, you can
. compress
your data by bringing all the variables to the minimum needed type (say from float
to byte if it is just a dummy variable).
Aside: missing values
Stata has a concept of a missing value that is used when a particular variable is not
recorded in a given observation. In fact, it has 27 different missing values, and those are
denoted by ., .a, ..., .z. They are essentially
equal to plus infinity, and this may be a catch for some if expressions (more on those later).
All arithmetic operations involving a missing value (adding a number to it, or multiplying it by a number) return
a missing value. The exceptions are the comparison operations.
Note that rep78 variable has a few missing values. We'll see how this may be of importance
in data handling and analysis.
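The behavior of missing values can be checked directly with display. This is a small sketch; the number 1000000 is an arbitrary "large" value picked for illustration.

```stata
* Arithmetic with a missing value gives a missing value
display . + 1            // shows .

* Comparisons are the exception: missing behaves like a very large number
display . > 1000000      // shows 1 (true)
display .a > .           // shows 1: the missing values are ordered . < .a < ... < .z

* rep78 has fewer non-missing observations than there are cars in the data
sysuse auto, clear
summarize rep78
```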
Can we have a sneak preview on what is in there? Folks who are more used to a command line would type
. list
to get a listing of existing variables and observations. Note the more
prompt at the bottom: you need to hit Enter for the next line, and hit Space or click the
more prompt for the next screen. To stop the listing, press Q or Ctrl+C or Ctrl+Break (Windows),
or hit the Break button.
Looks like too much information, huh?
Try this:
. li make price wei
Or maybe this:
. li make price wei in 1/10
We used two components of the general Stata command syntax: we have specified
certain variables, and we have indicated the desired range of observations. Note that we can abbreviate
variable names, too: I typed wei, and Stata unabbreviated that into weight. Handy as this
is, it may cause some complications:
. li m
You must either specify an unambiguous abbreviation, or a list of variables with wildcards:
. li m*
Self Check:
How many variables are there in each of the following varlists?
tr*
make-head
t* l*
Check in Stata. (You would need to find appropriate functions by search.)

Condition qualifiers
Note how we selected the first ten observations three commands ago. The range is specified
either as a single observation number, or as a begin/end interval.
Another way to select specific observations is by the if qualifier. It takes the form
if logical expression.
A logical expression is an expression that can be either true or false, and usually it constitutes
a comparison of two numeric expressions. If it sounds too complicated, consider this:
. li make mpg pri if mpg>25
Here is the notation for the available comparison operators. Note the notation for the
equality comparison. We shall see a bit later that the single equal sign, =, is used
for assigning the values to new variables, specifying weights, in some cycle operators, etc.,
while the double equal sign introduced now is used exclusively in the if expressions.
Symbol      Meaning
>           Strictly greater
<           Strictly less
>=          Greater or equal
<=          Less or equal
==          Equal
!= or ~=    Not equal

Logical expressions can be combined:
. li make mpg pri if mpg>25 & weight > 2500
The logical operations that can be performed are logical AND (the result is true
only if both arguments are true; denoted in Stata by &), logical OR (the result
is true if any of the arguments is true; denoted in Stata by |), and logical NOT
(works on a single argument, changing true to false and vice versa; denoted in Stata by ~
or by !).
Small print (but still useful to know): in fact, Stata uses numeric values for true and
false, and that again comes from the C programming language (where, in turn, it reflects the basic
electric impulses the computer operates with). Logical true is denoted by 1, and logical false, by 0.
So with those numeric values, we can set up a table that shows the results of logical operations.
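One way to build such a table is to ask Stata itself. The sketch below uses display with the operators &, | and ! on the values 1 and 0.

```stata
* Logical AND: true only if both arguments are true
display 1 & 1    // 1
display 1 & 0    // 0
display 0 & 0    // 0

* Logical OR: true if any of the arguments is true
display 1 | 0    // 1
display 0 | 0    // 0

* Logical NOT: flips true and false
display !1       // 0
display !0       // 1
```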
Even smaller print: in fact, Stata can interpret any numeric expression
as a logical (true/false) one. If an expression evaluates to zero, then Stata takes
it as a "false"; if it is a nonzero, Stata takes it as a "true". Here's an important catch:
if expression evaluates to missing, it is still not zero, and Stata takes it as a "true".
Also, since the missing values are greater than any number, "greater" and "greater or
equal" comparisons select missing observations, too.
Hence, in all of your comparison commands you might want to explicitly filter out the
missing values with the function mi(exp) that returns 1 when
exp evaluates to missing, and zero otherwise. Compare the results of
. li make rep78 if rep78 >= 5
and
. li make rep78 if rep78 >= 5 & !mi(rep78)
The first one obviously selected too many undesired observations.
Self Check:
What is the value of the variable trunk in 37th observation?
Of the variable rep78 in 45th observation?
In what observations is the variable rep78 missing? What are
the makes of the corresponding automobiles?
What does the command li make if for do? Unabbreviate everything,
and explain how Stata interprets this.
Check in Stata.

If you type
. browse
or hit the Browse button on the toolbar,
Stata will show you the current data. What else might we want to take from this?
We see that observations are represented by rows, and variables, by columns. Those are two
important concepts Stata operates with. Sometimes you might
hear the jargon about "row-wise" and "column-wise" operations. We shall also see those concepts
in action when we deal with panel data later in the course and talk about
"wide" and "long" formats.
Stata is very fast when
it needs to perform an operation on the whole column (such as creating a new variable,
which we'll do in a second, or running a regression that uses several variables, which, as
you would recognize, involves some matrix operations), and it is painfully slow when you
instruct Stata to do operations on single observations in the in 1, in 2, etc.
manner. It is never a good idea to do the latter unless it is absolutely unavoidable. Most of the
time, there should be a trick or two to do what you need!
For many operations, like computing the means or running a regression, the sort order of
the data is pretty much irrelevant. There are occasions, however, where you do want one
observation to follow another in a strict order. An obvious example is time series data,
where the sort ordering is simply by time. Other, somewhat less obvious examples, might be
panel data, where you would want to have observations pertaining to the same individual, but
differing in time, to be next to each other; or welfare distribution analysis, where measures
like poverty indices or the Gini coefficient of inequality require sorting by income (or another
welfare measure).
Stata's command for sorting is sort. Have a look through this sequence:
. li pri make for in 1/5
. sort price
. li pri make for in 1/5
. sort for make
. li pri make for in 1/5
sort sorts the data sequentially in the ascending order by the first variable specified, then
within the range of the same values of the first variable, by the second variable, etc.
If you need a descending order, the command you would want to use is gsort. If you need
to order the variables in the Variable window, use aorder to order the variables
alphabetically, or order to move specific variables to the front.
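For descending order, a short sketch with gsort may help. A minus sign in front of a variable name requests descending order for that variable; the choice of variables here is mine.

```stata
sysuse auto, clear

* Most expensive cars first
gsort -price
list make price in 1/5

* Ascending by foreign, then descending by mpg within each group
gsort foreign -mpg
list make foreign mpg in 1/5
```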
Small print: what will Stata do if the values of the sorting variable are not unique?
The order of the observations for non-unique values is going to be random. Most of the time,
this is an undesirable feature. We'll learn how to deal with that when we know how to create new
variables. So far, let's just concentrate on dealing with the existing ones.
Summary statistics
OK, we now know how to have a look at the data set. Can we get a more condensed summary
of the information contained in it? Here you go:
. summarize
summarize is a fairly powerful command.
It can take variable lists and if/in qualifiers with it. And it can do much
more with the detail option; see the
UCLA Academic Technologies Services Stata website.
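A few variations on summarize, as a sketch; the variables are picked arbitrarily from the auto data.

```stata
sysuse auto, clear

* All variables at once
summarize

* A variable list combined with an if qualifier
summarize mpg weight if foreign == 0

* detail adds percentiles, skewness and kurtosis
summarize mpg, detail
```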
Self Check:
What is the median price of a car in this data set?
What is the mean price of a foreign car?
Which of the variables is least skewed?

So, means, medians and such are measures for a single variable. What might we be interested in for
two variables? Of course, a correlation:
. correlate wei len
Is it reasonable to think that longer vehicles are heavier? It looks so, at least judging
by the sign and the magnitude of the correlation coefficient.
If we look at the help file for correlate, we shall find a number of suggestions
at the bottom (you will always find something under the See also section of a help file,
but now I'd like to draw your attention to it), among them the pwcorr command. Unlike
correlate, which first restricts the data set to the observations with no missing values in any of the variables, and then
computes the correlations (complete case scenario), pwcorr computes correlations
for the maximum number of available observations for each pair (pairwise scenario). Compare
. corr rep78 wei len
and
. pwcorr rep78 wei len
You can ask pwcorr to tell you how many observations it used, and to test the null
hypothesis of zero correlation, by specifying the obs and sig options, respectively.
See some more details at
UCLA Academic Technologies Services Stata website.
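Here is a sketch of the two commands side by side, with the obs and sig options spelled out.

```stata
sysuse auto, clear

* correlate drops every observation with a missing value
* in any of the listed variables
correlate rep78 weight length

* pwcorr uses all observations available for each pair;
* obs prints the pair counts, sig prints p-values for zero correlation
pwcorr rep78 weight length, obs sig
```

The counts printed by obs show which pairs were affected by the missing values of rep78.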
Self Check:
Which pair of variables has the highest correlation?
(Think carefully if you would want to use corr or pwcorr for that.)
Is there a significant correlation between rep78 and price?
Between weight and rep78 among the foreign cars only?

To really learn something about your data, however, you would want to plot it, as
a picture is worth a thousand words. We shall start with plots for a single variable,
and two plots can be suggested: a histogram and a box-and-whisker plot.
You can find everything you need about Stata graphics in the help file for graph.
Type
. whelp graph
... and see how easily one can get lost in it. Histograms are not even mentioned there,
but if you search for histogram or just make an educated guess, you'd find the histogram
command to perform what you need:
Take a couple of minutes to see the options of the histogram in the help file to
figure out how my command worked. (Try
. hist price
first to see how the default options perform.) From this graph, we see that the
price variable is rather heavily skewed to the right, with a mode near $5000.
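If you want a starting point for experimenting with the options, the following sketch may do. It is not necessarily the command that produced the graph discussed above; the number of bins and the title are my own choices.

```stata
sysuse auto, clear

* 20 bins, counts rather than densities on the vertical axis,
* with a normal density overlaid for comparison
histogram price, bin(20) frequency normal title("Car prices")
```

Dropping the options one at a time is a quick way to see what each of them does.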
Self Check:
Make a histogram with 10 bins of the variable mpg.
What should a sensible histogram for rep78 or foreign look like?
Make one in Stata.

Another type of single-variable graph that may be even more useful in summarizing
the data is the box-and-whisker plot. It may go under different names in the literature, but
the idea is to bound the inner half of the distribution (i.e., the range between
the 25th and the 75th percentiles) by the box, and then give some
idea how far the remainder of the distribution is likely to extend (whiskers; the most
common use is to have them three times as long as the distance between the median and the
lower/upper quartile).
Here, an important option that some other Stata commands (not necessarily graphic ones)
also admit is by(). The by-variables specify the distinct groups over which a command
is to be performed. In this example, we can compare the distributions of prices for domestic
and imported cars. We may note that, say, the median price of a foreign car in this data set
is higher than that of a domestic car.
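A comparison of this kind can be requested along the following lines. This is a minimal sketch that assumes the plain graph box command with the by() option, without any cosmetic options.

```stata
sysuse auto, clear

* One box plot of price per value of foreign (domestic vs. foreign cars)
graph box price, by(foreign)
```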
Self Check:
Make side-by-side boxplots of car prices for different values of the repair record
variable. Are there any notable differences in prices?

Other, sometimes interesting and valuable, summaries are provided by inspect and lv
commands.
For two variables, one of the best graphical summaries is a scatter plot:
There are so many options controlling the look of the graph that we can spend a week learning
just that. Getting through the dense help files on Stata graphics is not easy, but the results
may be quite rewarding. Suppose you wanted to figure out if there is any structure according
to the country of origin of a car. Prior experience with graph box will prompt you to try
. scatter pri mpg , by(for)
and that is legal syntax (try it now!).
There is more versatility to those graphic commands, however, and a much nicer plot may be produced
as this:
Note the complex syntax here: twoway is a wrapper command for two scatter commands
(see more at the UCLA ATS Stata website about twoway graphs and ways to combine them);
one of the scatter commands has an option controlling the shape of the marker (m(T)
produces a triangle rather than a circle; see
help symbolstyle on that, and also a note at the UCLA ATS Stata website on making
scatter plots that look cool); and the scatter command had some options controlling the legend
(see help legend_option on that). The two
scatter commands also differed in their scope due to the if qualifier (self-check:
what does it refer to?). Note also that the command became so long that Stata had to wrap it to the next line,
which is shown with the > sign. (Really interesting and informative scatter commands
are likely to run to five or so lines, and there is simply no chance to get them straight the first
time; it took me 12 trials to produce the above graph!)
One of the nice twists you can make to your graphs is to put some identifying information by the side
of the observation. This is done by the mlabel(variable) option; mlabsize()
controls the size of the labels. xsc() makes sure there is enough space on the left side of the
graph for the most economical Volkswagen on the roster.
For your educational entertainment, you might want to delete some of the options in the above command
and see what changes.
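If you would like something to start deleting options from, here is a sketch in the same spirit. It is my own reconstruction rather than the exact command behind the graph: the legend text, the label size and the axis range are assumptions, and the /// line continuation works in do-files only, not at the command prompt.

```stata
sysuse auto, clear

* Two overlaid scatter plots that differ in their if qualifiers;
* foreign cars get triangles and are labeled with their make
twoway (scatter price mpg if foreign == 0)                         ///
       (scatter price mpg if foreign == 1, msymbol(T)              ///
            mlabel(make) mlabsize(vsmall)),                        ///
       legend(order(1 "Domestic" 2 "Foreign"))                     ///
       xscale(range(10 50))
```

The xscale(range()) option only widens the axis so that the labels at the edge of the graph are not cut off.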
Let us stop here and review again what we've learnt in this tutorial.
 Starting and stopping Stata
 Logging one's Stata sessions
 Using help and search facilities (see more at the UCLA ATS Stata website)
 Reading the data into Stata
 Listing the data
 Getting basic statistical summaries (see more at the UCLA ATS Stata website)
 Making basic plots (see more at the UCLA ATS Stata website)
 Selecting specific sets of observations for a Stata command to work on (see more
at the UCLA ATS Stata website,
but be aware that their tutorial was written for an old version of Stata that only had one missing
value, the dot, so commands like if rep78!=. are better entered as
if !mi(rep78).)
 Developing an understanding of the numeric accuracy and missing values
On to the next class, to learn about the ways of modifying the data
and some Stata-specific tricks to quickly solve seemingly difficult data management problems.
Questions, comments? Email me!
Stas Kolenikov