This guide provides instructions to generate basic figures/graphs using Stata that are useful for exploratory data analysis.
You can type commands in the Stata Command window or run them from a do-file.
If you use a do-file, set your working directory by typing the following:
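A minimal sketch of this step, assuming a hypothetical project folder (the path is a placeholder to replace with your own):

```stata
* Hypothetical folder: replace it with a directory that exists on your machine
cd "C:/Users/yourname/Documents/stata_graphs"

* Confirm which directory Stata is now using
pwd
```

Forward slashes in the path work on Windows, macOS, and Linux, which keeps the do-file portable.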
After setting the working directory, open a do-file by clicking the "New Do-file Editor" icon in the Stata window.
Save the do-file for later use.
First, open the Stata data file by typing the following command:
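As a sketch only: country_panel.dta below is a hypothetical filename, not the name of the guide's actual file. The file is assumed to sit in the working directory and to contain the variables used later (country, year, unemp, unempf, unempm, gdppc, polity2, trade).

```stata
* Hypothetical filename: substitute the actual .dta file you are working with
* The clear option discards any data currently in memory
use "country_panel.dta", clear
```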
Inspect the data to get a sense of its structure and contents. Type:
browse
describe
summarize
Generate a line graph for the variables unemp unempf unempm for the United States. Type:
line unemp unempf unempm year if country=="United States"
Stata will give us the following line graph:
The graph does not look good. Let's check the variables more carefully. Type:
summarize unemp unempf unempm
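The summary shows that each variable has a minimum of 0. A 0% unemployment rate is implausible, so these zeros most likely stand in for missing values and distort the graph. Let us recode the zeros as missing in each variable to get a cleaner graph. Type: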
replace unemp=. if unemp==0
replace unempf=. if unempf==0
replace unempm=. if unempm==0
Check the summary statistics again. Type:
summarize unemp unempf unempm
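The three replace commands can also be written as a single step. A minimal sketch, assuming the same dataset is in memory and using Stata's mvdecode command, which converts the listed numeric values to missing:

```stata
* Equivalent to the three replace commands: recode 0 as system missing (.)
mvdecode unemp unempf unempm, mv(0)

* Confirm that no zeros remain in any of the three variables
foreach v of varlist unemp unempf unempm {
    count if `v' == 0
}
```

Running this after the replace commands is harmless: with no zeros left, mvdecode changes nothing and each count reports 0.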
Let us generate the line graph again using the cleaner data. Type:
line unemp unempf unempm year if country=="United States"
Stata now gives us a better-looking graph.
Let us add a graph title, legend labels, line patterns, and a y-axis title to make the graph easier to read. Type the following commands:
line unemp unempf unempm year if country=="United States", ///
title("Unemployment rate in the US, 1980-2012") ///
legend(label(1 "Total") label(2 "Females") label(3 "Males")) ///
lpattern(solid dash dash_dot) ///
ytitle("Percentage")
Stata will give us the following graph.
We can also draw the same graph with a marker symbol (circle, diamond, square, etc.) at each data point on the lines. Type:
twoway connected unemp unempf unempm year if country=="United States", ///
title("Unemployment rate in the US, 1980-2012") ///
legend(label(1 "Total") label(2 "Females") label(3 "Males")) ///
msymbol(circle diamond square) ///
ytitle("Percentage")
Stata will give us the following graph.
Line Graphs by Country Names
We can use Stata's twoway connected command to draw a separate line graph for each of a selected set of countries. To do this, select the countries with an if condition and add the by(country) option. Use the following commands:
twoway connected unemp year if country=="United States" | ///
country=="United Kingdom" | ///
country=="Australia" | ///
country=="Qatar", ///
by(country, title("Unemployment Rate")) ///
msymbol(circle_hollow)
Stata will give us the following graph.
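The chain of country=="..." conditions joined by | can be shortened with Stata's inlist() function. A minimal sketch, assuming the same dataset and the same four countries:

```stata
* Same four-country selection as above, written with inlist()
twoway connected unemp year ///
    if inlist(country, "United States", "United Kingdom", "Australia", "Qatar"), ///
    by(country, title("Unemployment Rate")) ///
    msymbol(circle_hollow)
```

For string values, inlist() accepts only a limited number of arguments, so a long list of countries may need to be split across several inlist() calls joined by |.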
We can also plot the lines for all four countries in a single graph. Type the following commands:
twoway (connected unemp year if country=="United States", msymbol(diamond_hollow)) ///
(connected unemp year if country=="United Kingdom", msymbol(triangle_hollow)) ///
(connected unemp year if country=="Australia", msymbol(square_hollow)) ///
(connected unemp year if country=="Qatar", msymbol(circle_hollow)), ///
title("Unemployment Rate") ///
legend(label(1 "USA") label(2 "UK") label(3 "Australia") label(4 "Qatar"))
Stata will give us the following graph.
Let us now generate a similar by-country graph for the variable gdppc, restricted to observations where gdppc is greater than 40,000. Type:
twoway connected gdppc year if gdppc>40000, by(country) msymbol(diamond)
Stata will give us the following graph.
We can plot more than one line in each panel of a panel graph. Let us create two new variables, gdppc_mean and gdppc_median, which hold the mean and median of gdppc across all countries in each year. Type:
bysort year: egen gdppc_mean=mean(gdppc)
bysort year: egen gdppc_median=median(gdppc)
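Because the egen commands run within each year, every country gets the same mean and median value in a given year. A minimal sketch for checking this, assuming the two variables were created as above:

```stata
* Yearly mean and median of gdppc, for comparison with the new variables
tabstat gdppc, by(year) statistics(mean median)

* The egen result should be constant within each year
bysort year (gdppc_mean): assert gdppc_mean[1] == gdppc_mean[_N]
```

The assert line prints nothing when the check passes and stops with an error otherwise.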
Let us now plot gdppc together with gdppc_mean for selected countries. Type:
twoway connected gdppc gdppc_mean year if country=="United States" | ///
country=="United Kingdom" | ///
country=="Australia" | ///
country=="Qatar", ///
by(country, title("GDP pc (PPP, 2005=100)")) ///
legend(label(1 "GDP-PC") label(2 "Mean GDP-PC")) ///
msymbol(circle_hollow)
Stata will give us the following graph.
Line Graphs by Country Names in Panel Data Setting
To declare the dataset as panel data, type:
xtset country year
Running this command produces an error because country is a string variable, and xtset requires a numeric panel identifier. To create a numeric version of country, type:
encode country, gen(country1)
Now declare the dataset as a panel using the new numeric variable. Type:
xtset country1 year
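To see what encode and xtset produced, the following minimal sketch may help. It assumes the commands above ran without error, in which case encode stores the country names in a value label that, by default, has the same name as the new variable.

```stata
* Show which numeric code encode assigned to each country name
label list country1

* Summarize the panel structure (number of countries, years covered, gaps)
xtdescribe
```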
Now, let us create a line graph for the countries with per capita GDP greater than $35,000. Type:
xtline gdppc if gdppc>35000, overlay ///
title("Per Capita GDP for the Richest Countries")
Stata will give us the following graph.
NOTE: To see the available marker symbols, line patterns, and color shades, type:
palette symbolpalette
palette linepalette
palette color green
help palette
This section describes how to generate bar graphs.
First, get the data. Type:
Let us create a horizontal bar graph for the variable gdppc for each country in the dataset. Type:
graph hbar (mean) gdppc, over(country, sort(1) descending label(labsize(*0.50)))
Stata will give us the following graph.
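Country names in the graph are hard to read because there are too many bars. To make the graph clearer, we can restrict it to observations with per capita GDP greater than $18,000. Note that the if condition filters country-year observations before the mean is computed. To do this, type: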
graph hbar (mean) gdppc if gdppc>18000, ///
over(country, sort(1) descending label(labsize(*0.7))) ///
bar(1, color(ebblue))
Stata will give us the following graph.
For the countries with per capita GDP less than $1500, type:
graph hbar (mean) gdppc if gdppc<1500, ///
over(country, sort(1) descending label(labsize(*0.6))) ///
bar(1, color(ebblue))
Stata will give us the following graph.
We can compare the mean and the median per capita GDP for observations with gdppc>18000. Type:
graph hbar (mean) gdppc (median) gdppc if gdppc>18000, ///
over(country, sort(1) descending label(labsize(*0.8))) ///
legend(label(1 "GDPpc (mean)") label(2 "GDPpc (median)")) ///
bar(1, color(blue)) ///
bar(2, color(brown))
Stata will give us the following graph.
For more information about bar graphs, type:
help graph bar
A boxplot is a valuable tool for detecting outliers in a dataset. This sub-section provides instructions on creating basic boxplots using Stata.
First, open a Stata data file. Type:
Let us create a basic boxplot for the variable gdppc. Type:
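A minimal sketch of this step, assuming a horizontal boxplot to match the graph hbox command used in the next step (graph box gdppc would draw the vertical equivalent):

```stata
* Basic horizontal boxplot of per capita GDP across all observations
graph hbox gdppc
```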
Stata will give us the following graph.
In the above graph, we see many outliers where gdppc is greater than 40,000. We can restrict the plot to observations below that value to get a clearer view of the minimum, maximum, median, and quartile values. To do this, type:
graph hbox gdppc if gdppc<40000
Stata will give us the following graph.
We will now create a boxplot for the variable gdppc with respect to a categorical variable. Let us recode the polity2 variable into a categorical variable, regime. Currently, polity2 ranges from -10 to 10. We will create regime with three categories: Autocracy for scores from -10 to -6, Anocracy for scores from -5 to 6, and Democracy for scores from 7 to 10. Any other value is set to missing. Use the following commands:
tab polity2
recode polity2 (-10/-6=1 "Autocracy") ///
(-5/6=2 "Anocracy") ///
(7/10=3 "Democracy") ///
(else=.), ///
gen(regime) label(polity_rec)
To inspect the newly created regime variable, type:
tab regime
tab regime, nolabel
tab country regime
tab country regime, row
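A further check is to confirm that each regime category covers exactly the intended range of polity2. A minimal sketch, assuming regime was created by the recode command above:

```stata
* Minimum, maximum, and count of polity2 within each regime category
tabstat polity2, by(regime) statistics(min max n)

* Stop with an error if any observation falls outside its category's range
assert inrange(polity2, -10, -6) if regime == 1
assert inrange(polity2, -5, 6) if regime == 2
assert inrange(polity2, 7, 10) if regime == 3
```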
To generate a boxplot for gdppc with respect to the categorical variable regime, type:
graph box gdppc, over(regime) yline(9482.966) ///
title("Regime Type and Per capita GDP")
Stata will give us the following graph.
Note: we used mean gdppc to plot the dotted y line.
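Rather than typing the mean by hand, the value can be taken from the results that summarize leaves behind in r(mean). A minimal sketch, assuming the same dataset; the local macro name mean_gdppc is chosen here only for illustration:

```stata
* Compute the mean of gdppc and keep it in a local macro
quietly summarize gdppc
local mean_gdppc = r(mean)

* Draw the reference line at the computed mean instead of a typed-in number
graph box gdppc, over(regime) yline(`mean_gdppc') ///
    title("Regime Type and Per capita GDP")
```

Local macros exist only while the do-file (or the selected block of lines) is running, so these lines should be run together.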
We can draw the same graph as a horizontal boxplot. Type:
graph box gdppc, over(regime) horizontal yline(9482.966) ///
title("Regime Type and Per capita GDP")
Stata will give us the following graph.
We can create a boxplot for two numerical variables (gdppc and trade) with respect to a categorical variable (regime). Because the two variables are measured on very different scales, take their logs first so that both boxes are readable on a common axis. Type:
gen log_gdppc = log(gdppc)
gen log_trade = log(trade)
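Stata's log() returns a missing value for zero or negative input, so observations with such values drop out of the boxplot without any warning. A minimal sketch for checking how many observations are affected, assuming the two log variables were created as above:

```stata
* Observations with zero or negative values, which cannot be logged
count if gdppc <= 0
count if trade <= 0

* Non-missing originals that became missing after taking logs
count if missing(log_gdppc) & !missing(gdppc)
count if missing(log_trade) & !missing(trade)
```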
To make the boxplot, type:
graph box log_gdppc log_trade, over(regime) ///
title("Regime Type, Per capita GDP, and International Trade")
Stata will give us the following graph.
For more information about boxplots, type:
help graph box