Stata is a statistical software package widely used in the social sciences. The interface is user-friendly, and most tasks can be done by point and click. However, like many other software programs, it relies on packages provided by contributors. Of course, you can write programs for your own purposes, but this requires sophisticated knowledge of matrix algebra. Compared with R, Stata offers a more limited working environment; for example, you “can’t work with multiple data sets all at the same time” (cf. SAS, which allows this, though beginners may find it quite difficult to use).
To get started, you can purchase the program from StataCorp, downloadable at http://www.stata.com/. I recommend using the IC (Intercooled) version rather than the SE (Special Edition) or MP version. Stata is compatible with both Windows and Mac OS.
Part I. Setting up Notepad++ as a better do-file editor for Stata.
Normally, Stata includes a do-file editor on the menu bar. However, I recommend using Notepad++ as your do-file editor. Once the plugin function is set up in Notepad++, you do not have to open Stata yourself: when you run a line in Notepad++, the Stata output window opens automatically.
Instructions for integrating Notepad++ with Stata (cited from http://opensourceeconomics.wordpress.com/2009/10/13/notepad-and-stata-a-better-do-file-editor/):
Step 1) Download and install Notepad++ (http://notepad-plus-plus.org/). This is free software.
Step 2) To get syntax highlighting, download the Stata XML file from http://notepad-plus.sourceforge.net/commun/userDefinedLang/userDefineLang_stata.xml . Click Start, then Run, type (or paste in) %APPDATA%\Notepad++, and click OK. Assuming you just installed Notepad++, copy the file you downloaded to this directory, delete the existing userDefineLang.xml, and rename your file to userDefineLang.xml.
Step 3) To enable running code from Notepad++, follow steps 1-4 of the following page:
These steps look very complicated, but they really are not. Try them out!
[Screenshot: a sample screen from my computer]
Part II. How to Use Stata for Econometric Concepts
1) Monte Carlo Simulation for a better understanding of the Central Limit Theorem (Larger Samples’ advantage).
set more off
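* set mem is only needed in Stata 11 and earlier; newer versions manage memory automatically and ignore it
set mem 300m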
* Monte Carlo Experiments for the Bivariate Ordinary Least Square (OLS)
* Yi = B1 + B2*Xi + ui, where ui~N(0,9)
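** Programming begins [Step 1 – Step 7]
* Drop any earlier definition so the do-file can be re-run without an error
capture program drop olssim
program olssim, rclass
*Step 1. Clear any data left in memory, then create an empty sample of 10 observations
* (without drop _all, the second replication fails because sample_x already exists)
drop _all
set obs 10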
*Step 2. Draw uniformly distributed pseudo-random numbers on the interval [80,260) <- This range is an example from Gujarati’s book.
*int() truncates the draws to integers, standing in for weekly consumption expenditure in $.
*runiform() is the current name for the older uniform() function.
gen sample_x = int(runiform()*(260-80)+80)
*Step 3. Check properties such as the max and min (for example with “summarize sample_x” when running the steps interactively) / You may also want to use the “list” command to inspect the whole sample.
*Step 4. Draw normally distributed random numbers “u” with mean 0 and standard deviation 3 (i.e. variance=9)
*rnormal(0, 3) draws from the same distribution as invnormal(uniform())*3
gen u = rnormal(0, 3)
*Step 5. Now run the DGP (Data Generating Process) Yi = B1 + B2*Xi + ui with true parameter values B1=25 and B2=0.5
gen y = 25 + 0.5 *sample_x + u
*Step 6. Estimate B1_hat and B2_hat by regressing y on sample_x
regress y sample_x
*Step 7. Return B1_hat and B2_hat
return scalar B1_hat = _coef[_cons]
return scalar B2_hat = _coef[sample_x]
end
** Program ends here
** After you set up this program [Step 1 – Step 7], you need to replicate it 100 times
** This produces a new dataset with 100 observations of B1_hat and B2_hat
** The returned values are accessed as r(B1_hat), for example. The full expression is the following:
simulate B1_hat = r(B1_hat) B2_hat = r(B2_hat), reps(100): olssim
** Now use descriptive statistics such as summarize and histogram
hist B1_hat, xtitle("B1_hat") xline(25, lcolor(red)) saving(part1, replace)
hist B2_hat, xtitle("B2_hat") xline(0.5, lcolor(red)) saving(part2, replace)
graph combine part1.gph part2.gph
/* Try reps (100) and reps (1000). You will see the mean of the B1_hat and B2_hat sampling distributions getting closer
to the true B1(=25) and B2(=0.5), respectively */
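The experiment above varies the number of replications. To see the advantage of larger samples directly, one option is to vary the sample size instead. The following is only a sketch under stated assumptions: olssim_n is a hypothetical variant of the program that takes the number of observations as an argument, and the seed and the 500 replications are arbitrary choices, not values from the original exercise.

```stata
* Hypothetical variant of the simulation program: the sample size is an argument
capture program drop olssim_n
program olssim_n, rclass
    args n                                   // n = observations per simulated sample
    drop _all                                // start each replication from an empty dataset
    set obs `n'
    generate sample_x = int(runiform()*(260-80) + 80)
    generate u = rnormal(0, 3)               // error term with variance 9
    generate y = 25 + 0.5*sample_x + u       // same DGP: B1 = 25, B2 = 0.5
    regress y sample_x
    return scalar B2_hat = _b[sample_x]
end

set seed 12345                               // arbitrary seed, for reproducibility only
foreach n in 10 100 1000 {
    simulate B2_hat = r(B2_hat), reps(500) nodots: olssim_n `n'
    display "Sample size n = `n'"
    summarize B2_hat                         // compare Std. Dev. across sample sizes
}
```

If this runs as intended, the mean of B2_hat should stay near 0.5 for every sample size, while its standard deviation should shrink as n grows.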
Part III. Data Transformation (Handy Commands – link to http://www.ls3.soziologie.uni-muenchen.de/downloads/lehre/lehre_alt/statacommands.pdf)
I found the linked document very useful for thinking about how to save time and get a dataset ready for analysis.
Part IV. How to Take Average Values for Panel Data
I often find it helpful to transform my data file into average values over certain spans of years. This is also quite a popular way to reduce missing-data problems (at least in comparative politics, as far as I am aware). Let’s say I want 5-year average values for countries’ Trade (% of GDP). How do I do this? Do I open an Excel file and do it one by one? Well, that is possible, but with a huge dataset it will probably consume a whole day or two, on top of the errors you might make along the way. So I recommend using a simple program to do this in Stata.
First, let’s use a dataset from the World Development Indicators. My sample dataset looks as follows:
This looks like a typical panel dataset format in your studies. Suppose you are interested in 5-year average values (Trade is my example variable). You can write a short script like the following in your do-file:
Notice that I use the “if” qualifier to create year2 (a 5-year period identifier). Then I use the “collapse” command to calculate the 5-year average. The same command serves many other purposes; for the standard deviation, for example, it should say “collapse (sd) trade”, which may help econ students who need to figure out what affects the volatility of trade. In my do-file, pay close attention to the option after the comma in line 20. Notice that I use year2, the variable created in Step 2. Year2 works like a new group identifier (coded 1, 2, 3, 4). This tells Stata to take an average by id over year2. Since year2 spans an interval of 5 years, you will have 5-year average values for trade after running the do-file above. You will need to modify the script according to the range of years in your data, but the rationale remains the same. If you have run into no trouble so far, check your dataset to see how things changed after running the do-file.
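The steps described above can be sketched roughly as follows. This is not my original do-file, only a minimal illustration under assumptions: the variable names id, year, and trade, the 1990–2009 year range, and the toy panel of random numbers (which stands in for real World Development Indicators values so the sketch runs on its own) are all hypothetical.

```stata
* Toy panel for illustration only: 2 countries x 20 years, random placeholder values
clear
set obs 40
generate id = ceil(_n/20)                    // hypothetical country identifier (1, 2)
bysort id: generate year = 1989 + _n         // years 1990-2009 within each country
generate trade = runiform()*100              // placeholder for Trade (% of GDP)

* Create the 5-year period identifier with the "if" qualifier
generate year2 = .
replace year2 = 1 if year >= 1990 & year <= 1994
replace year2 = 2 if year >= 1995 & year <= 1999
replace year2 = 3 if year >= 2000 & year <= 2004
replace year2 = 4 if year >= 2005 & year <= 2009
drop if missing(year2)                       // drop years outside the assumed range

* Average trade within each country and 5-year period
collapse (mean) trade, by(id year2)
list, sepby(id)
```

With real data, skip the toy-panel lines and adjust the year cutoffs to your own range. Replacing (mean) with (sd) would give the within-period standard deviation instead.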
As you can tell, the dataset now shows 5-year averages for trade. Use this trick to produce average values for your large datasets.