How to read this lecture...

Code should execute sequentially if run in a Jupyter notebook

  • See the set up page to install Jupyter, Julia (1.0+) and all necessary libraries
  • Please direct feedback to contact@quantecon.org or the discourse forum
  • For some notebooks, enable content with "Trust" on the command tab of Jupyter lab
  • If using QuantEcon lectures for the first time on a computer, execute ] add InstantiateFromURL inside of a notebook or the REPL

Data and Statistics Packages

Overview

This lecture explores some of the key packages for working with data and doing statistics in Julia

In particular, we will examine the DataFrame object in detail (i.e., construction, manipulation, querying, visualization, and nuances like missing data)

While Julia is not an ideal language for pure cookie-cutter statistical analysis, it has many useful packages to provide those tools as part of a more general solution

Examples include GLM.jl and FixedEffectModels.jl, which we discuss

This list is not exhaustive, and others can be found in organizations such as JuliaStats, JuliaData, and QueryVerse

Setup

In [1]:
using InstantiateFromURL

# activate the QuantEcon environment
activate_github("QuantEcon/QuantEconLectureAllPackages", tag = "v0.9.5");
In [2]:
using LinearAlgebra, Statistics, Compat
using DataFrames, RDatasets, DataFramesMeta, CategoricalArrays, Query, VegaLite
using DataVoyager, GLM, RegressionTables, FixedEffectModels

DataFrames

A useful package for working with data is DataFrames.jl

The most important data type provided is a DataFrame, a two-dimensional table for storing heterogeneous data

Although data can be heterogeneous within a DataFrame, the contents of the columns must be homogeneous (of the same type)

This is analogous to a data.frame in R, a DataFrame in Pandas (Python) or, more loosely, a spreadsheet in Excel
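As a rough illustration of this layout, the sketch below stands in for a DataFrame with a named tuple of vectors from Base Julia

This is an assumption made for the example rather than a description of how DataFrames.jl is implemented, and the name table_sketch is hypothetical

```julia
# A column-oriented table sketched as a named tuple of equal-length vectors.
# This is only a stand-in for a DataFrame, not its actual implementation.
table_sketch = (commod = ["crude", "gas", "gold"],
                price  = [4.2, 11.3, missing])

# each column is homogeneous: one element type per column
for name in keys(table_sketch)
    println(name, " => ", eltype(table_sketch[name]))
end

# a "row" mixes types because it takes one entry from each column
row2 = map(col -> col[2], table_sketch)
@show row2
```

The element types differ across columns, while every entry within a column shares one type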

There are a few different ways to create a DataFrame

Constructing a DataFrame

The first is to set up columns and construct a DataFrame by assigning names

In [3]:
using DataFrames, RDatasets  # RDatasets provides good standard data examples from R

# note use of missing
commodities = ["crude", "gas", "gold", "silver"]
last_price = [4.2, 11.3, 12.1, missing]
df = DataFrame(commod = commodities, price = last_price)
Out[3]:

4 rows × 2 columns

     commod   price
     String   Float64⍰
1    crude    4.2
2    gas      11.3
3    gold     12.1
4    silver   missing

Columns of the DataFrame can be accessed by name, using either a symbol, as in df[:price], or struct-style field access, as in df.price

In [4]:
df[:price]
Out[4]:
4-element Array{Union{Missing, Float64},1}:
  4.2     
 11.3     
 12.1     
   missing
In [5]:
df.price
Out[5]:
4-element Array{Union{Missing, Float64},1}:
  4.2     
 11.3     
 12.1     
   missing

Note that the element type of this array is Union{Missing, Float64}, since it was created with a missing value

In [6]:
df.commod
Out[6]:
4-element Array{String,1}:
 "crude" 
 "gas"   
 "gold"  
 "silver"

The DataFrames.jl package provides a number of methods for acting on DataFrames, such as describe

In [7]:
describe(df)
Out[7]:

2 rows × 8 columns

     variable   mean     min     median   max      nunique   nmissing   eltype
     Symbol     Union…   Any     Union…   Any      Union…    Union…     DataType
1    commod              crude            silver   4                    String
2    price      9.2      4.2     11.3     12.1               1          Float64

While data will often be generated all at once, or read from a file, you can add a row to a DataFrame by pushing a named tuple whose fields match the column names

In [8]:
nt = (commod = "nickel", price= 5.1)
push!(df, nt)
Out[8]:

5 rows × 2 columns

     commod   price
     String   Float64⍰
1    crude    4.2
2    gas      11.3
3    gold     12.1
4    silver   missing
5    nickel   5.1

Named tuples can also be used to construct a DataFrame, and have it properly deduce all types

In [9]:
nt = (t = 1, col1 = 3.0)
df2 = DataFrame([nt])
push!(df2, (t=2, col1 = 4.0))
Out[9]:

2 rows × 2 columns

     t       col1
     Int64   Float64
1    1       3.0
2    2       4.0
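One way to see why the column names and types can be deduced is to inspect the named tuple itself, using only Base Julia

The name nt_sketch is hypothetical, and the sketch says nothing about how the DataFrame constructor is implemented internally

```julia
nt_sketch = (t = 1, col1 = 3.0)

# the field names and the concrete field types are part of the named tuple's type,
# which is the information needed to choose column names and element types
@show typeof(nt_sketch)
@show keys(nt_sketch)
@show values(nt_sketch)
```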

Working with Missing

As we discussed in fundamental types, the semantics of missing are that mathematical operations will not silently ignore it
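A minimal sketch of these semantics, using only Base Julia, is

```julia
# arithmetic with missing yields missing rather than dropping the value
@show 1 + missing
@show sum([1.0, missing, 3.0])

# comparisons propagate too, so use ismissing (or isequal) to test for it
@show missing == missing
@show ismissing(missing)
@show isequal(missing, missing)

# logical operators follow three-valued logic
@show true | missing    # true regardless of the unknown value
@show false & missing   # false regardless of the unknown value
@show true & missing    # unknown, hence missing
```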

In order to allow missing in a column, you can create/load the DataFrame from a source with missing values, or call allowmissing! on a column

In [10]:
allowmissing!(df2, :col1) # necessary to allow a missing in col1
push!(df2, (t=3, col1 = missing))
push!(df2, (t=4, col1 = 5.1))
Out[10]:

4 rows × 2 columns

     t       col1
     Int64   Float64⍰
1    1       3.0
2    2       4.0
3    3       missing
4    4       5.1
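The reason the column must be widened first can be seen with a plain vector

The sketch below uses only Base Julia, with hypothetical names v and w, and is meant as an analogy for what allowmissing! does to a column rather than as its implementation

```julia
v = [3.0, 4.0]                       # Vector{Float64}: no room for missing
try
    push!(v, missing)                # fails: missing cannot be converted to Float64
catch err
    println("push! failed with ", typeof(err))
end

# widen the element type first, then the missing value can be stored
w = convert(Vector{Union{Missing, Float64}}, v)
push!(w, missing)
@show eltype(w)
@show w
```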

We can see the propagation of missing to caller functions, as well as a way to efficiently calculate with non-missing data

In [11]:
@show mean(df2.col1)
@show mean(skipmissing(df2.col1))
mean(df2.col1) = missing
mean(skipmissing(df2.col1)) = 4.033333333333333
Out[11]:
4.033333333333333
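The same pattern works with other reductions, as in this sketch on a plain vector (the name col is hypothetical)

```julia
using Statistics

col = [3.0, 4.0, missing, 5.1]

# skipmissing returns a lazy iterator, so no intermediate copy of the data is made
nonmissing = skipmissing(col)
@show sum(nonmissing)
@show mean(nonmissing)
@show count(ismissing, col)     # number of missing entries
@show collect(nonmissing)       # materialize only if an array is needed
```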

And to replace the missing values

In [12]:
df2.col1 .= coalesce.(df2.col1, 0.0) # replace all missing with 0.0
Out[12]:
4-element Array{Union{Missing, Float64},1}:
 3.0
 4.0
 0.0
 5.1
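Note that the output above still has element type Union{Missing, Float64}, because .= writes into the existing array

A short sketch on a plain vector (hypothetical names col and filled) contrasts this with broadcasting into a new array

```julia
col = [3.0, 4.0, missing, 5.1]

# coalesce returns its first non-missing argument
@show coalesce(missing, 0.0)
@show coalesce(2.5, 0.0)

# broadcasting into a new array lets the element type narrow to Float64
filled = coalesce.(col, 0.0)
@show eltype(filled)

# broadcasting in place with .= reuses the existing array, so the element type is unchanged
col .= coalesce.(col, 0.0)
@show eltype(col)
```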

Manipulating and Transforming DataFrames

One way to do an additional calculation with a DataFrame is to use the @transform macro from DataFramesMeta.jl

In [13]:
using DataFramesMeta
f(x) = x^2
df2 = @transform(df2, col2 = f.(:col1))
Out[13]:

4 rows × 3 columns

     t       col1       col2
     Int64   Float64⍰   Float64
1    1       3.0        9.0
2    2       4.0        16.0
3    3       0.0        0.0
4    4       5.1        26.01
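The broadcast f.(:col1) applies f element by element to the column

A minimal sketch of the same elementwise computation on a plain vector, without DataFramesMeta and with hypothetical names col1_sketch and col2_sketch, shows what would happen had the missing value not been replaced first

```julia
f(x) = x^2

col1_sketch = [3.0, 4.0, missing, 5.1]

# the dot applies f to each element; missing entries stay missing
col2_sketch = f.(col1_sketch)
@show col2_sketch
```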

Categorical Data

For data that is categorical, use a CategoricalArray from CategoricalArrays.jl

In [14]:
using CategoricalArrays
id = [1, 2, 3, 4]
y = ["old", "young", "young", "old"]
y = CategoricalArray(y)
df = DataFrame(id=id, y=y)
Out[14]:

4 rows × 2 columns

     id      y
     Int64   Categorical…
1    1       old
2    2       young
3    3       young
4    4       old
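Conceptually, a categorical column can be thought of as a pool of distinct levels plus one integer code per observation

The sketch below illustrates that idea in Base Julia with hypothetical names y_raw, lvls and codes, and is not the actual implementation in CategoricalArrays.jl

```julia
y_raw = ["old", "young", "young", "old"]

# pool of distinct values, sorted to give a default ordering of the levels
lvls = sort(unique(y_raw))

# each observation is stored as an integer code pointing into the pool
codes = [findfirst(==(v), lvls) for v in y_raw]

@show lvls
@show codes
@show lvls[codes] == y_raw   # the codes and levels together recover the data
```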
In [15]:
levels(df.y)
Out[15]:
2-element Array{String,1}:
 "old"  
 "young"

Visualization, Querying, and Plots

The DataFrame (and similar types that fulfill a standard generic interface) can fit into a variety of packages

One set of them is the QueryVerse

Note: The QueryVerse, in the same spirit as R’s tidyverse, makes heavy use of the pipeline syntax |>

In [16]:
x = 3.0
f(x) = x^2
g(x) = log(x)

@show g(f(x))
@show x |> f |> g; # pipes nest function calls
g(f(x)) = 2.1972245773362196
(x |> f) |> g = 2.1972245773362196
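A few further properties of the pipe operator in Base Julia, sketched with the same f and g

```julia
f(x) = x^2
g(x) = log(x)

x = 3.0

# a pipeline is equivalent to composing the functions with ∘
@show (x |> f |> g) == (g ∘ f)(x)

# anonymous functions can appear as pipeline stages
@show x |> f |> (z -> z + 1)

# .|> broadcasts each stage over a collection
@show [1.0, 2.0, 3.0] .|> f
```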

To give an example directly from the source of the LINQ-inspired Query.jl

In [17]:
using Query

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])

x = @from i in df begin
    @where i.age>50
    @select {i.name, i.children}
    @collect DataFrame
end
Out[17]:

1 rows × 2 columns

     name     children
     String   Int64
1    Kirk     2
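For comparison, the same filter-and-select logic can be sketched in Base Julia over a vector of named tuples

The names people and selected are hypothetical, and this is only an analogy for the query above, not how Query.jl evaluates it

```julia
# the same three records as a vector of named tuples (one per row)
people = [(name = "John",  age = 23.0, children = 3),
          (name = "Sally", age = 42.0, children = 5),
          (name = "Kirk",  age = 59.0, children = 2)]

# the if clause plays the role of @where,
# and the named tuple in front plays the role of @select
selected = [(name = p.name, children = p.children) for p in people if p.age > 50]
@show selected
```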

While it is possible to just use the Plots.jl library, there may be better options for displaying tabular data – such as VegaLite.jl

In [18]:
using RDatasets, VegaLite
iris = dataset("datasets", "iris")

iris |> @vlplot(
    :point,
    x=:PetalLength,
    y=:PetalWidth,
    color=:Species
)
Out[18]:
[Figure: scatter plot of PetalWidth (y-axis) against PetalLength (x-axis), with points colored by Species (setosa, versicolor, virginica)]

Another useful tool for exploring tabular data is DataVoyager.jl

using DataVoyager
iris |> Voyager()

The Voyager() function creates a separate window for analysis

Statistics and Econometrics

While Julia is not intended as a replacement for R, Stata, and similar specialty languages, it has a growing number of packages aimed at statistics and econometrics

Many of the packages live in the JuliaStats organization

A few to point out

  • StatsBase has basic statistical functions such as geometric and harmonic means, auto-correlations, robust statistics, etc.
  • StatsFuns has a variety of mathematical functions and constants such as pdf and cdf of many distributions, softmax, etc.
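To make a few of these quantities concrete, here are their textbook definitions written with the Statistics standard library

The function names ending in _sketch are hypothetical and are not the APIs of StatsBase or StatsFuns

```julia
using Statistics

# textbook definitions, written out for illustration only
geomean_sketch(v) = exp(mean(log.(v)))      # geometric mean of positive values
harmmean_sketch(v) = inv(mean(inv.(v)))     # harmonic mean of positive values

# softmax, shifted by the maximum for numerical stability
function softmax_sketch(v)
    w = exp.(v .- maximum(v))
    return w ./ sum(w)
end

@show geomean_sketch([1.0, 4.0])
@show harmmean_sketch([1.0, 3.0])
@show softmax_sketch([1.0, 2.0, 3.0])
```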

Generalized Linear Models

To run linear regressions and similar statistics, use the GLM package

In [19]:
using GLM

x = randn(100)
y = 0.9 .* x + 0.5 * rand(100)
df = DataFrame(x=x, y=y)
ols = lm(@formula(y ~ x), df) # R-style notation
┌ Warning: In the future eachcol will have names argument set to false by default
│   caller = evalcontrasts(::DataFrame, ::Dict{Any,Any}) at modelframe.jl:124
└ @ StatsModels /home/quantecon/.julia/packages/StatsModels/AYB2E/src/modelframe.jl:124
Out[19]:
StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x

Coefficients:
             Estimate Std.Error t value Pr(>|t|)
(Intercept)  0.241572 0.0143601 16.8225   <1e-29
x            0.903883 0.0142488 63.4359   <1e-80
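The intercept estimate is close to 0.25 because the noise term 0.5 * rand(100) is uniform on [0, 0.5) and so has mean 0.25

As a sketch of what the regression computes, the coefficients and conventional standard errors can be reproduced with LinearAlgebra alone

The data below are a deterministic stand-in (an assumption made so the example is reproducible), and the code is not how GLM is implemented

```julia
using LinearAlgebra

n = 100
x = collect(range(-2, stop = 2, length = n))
u = 0.1 .* sin.(1:n)                 # deterministic stand-in for noise
y = 0.25 .+ 0.9 .* x .+ u

X = [ones(n) x]                      # design matrix with an intercept column

β_qr = X \ y                         # least squares via a QR factorization
β_ne = (X' * X) \ (X' * y)           # same estimate from the normal equations

# conventional standard errors under homoskedasticity
resid = y - X * β_qr
σ2 = dot(resid, resid) / (n - size(X, 2))
se = sqrt.(diag(σ2 * inv(X' * X)))

@show β_qr
@show se
```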

To display the results in useful tables for LaTeX and the REPL, use RegressionTables for output similar to the Stata package esttab and the R package stargazer

In [20]:
using RegressionTables
regtable(ols)
# regtable(ols,  renderSettings = latexOutput()) # for LaTeX output
----------------------
                  y   
              --------
                   (1)
----------------------
(Intercept)   0.242***
               (0.014)
x             0.904***
               (0.014)
----------------------
Estimator          OLS
----------------------
N                  100
R2               0.976
----------------------


Fixed Effects

While Julia may be overkill for estimating a simple linear regression, fixed-effects estimation with dummies for multiple variables is much more computationally intensive

For a two-way fixed-effects model, we take the example directly from the documentation, using cigarette consumption data

In [21]:
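using FixedEffectModels
cigar = dataset("plm", "Cigar")
# convert the identifiers to categorical so they can be used as fixed effects
cigar.StateCategorical = categorical(cigar.State)
cigar.YearCategorical = categorical(cigar.Year)
# state and year fixed effects, population weights, standard errors clustered by state
fixedeffectresults = reg(cigar, @model(Sales ~ NDI, fe = StateCategorical + YearCategorical,
                            weights = Pop, vcov = cluster(StateCategorical)))
regtable(fixedeffectresults)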
----------------------------
                     Sales  
                   ---------
                         (1)
----------------------------
NDI                -0.005***
                     (0.001)
----------------------------
StateCategorical         Yes
YearCategorical          Yes
----------------------------
Estimator                OLS
----------------------------
N                      1,380
R2                     0.804
----------------------------
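The computational point can be illustrated with a one-way fixed effect: including a dummy per group gives the same slope as first demeaning the variables within each group, which avoids building the dummy columns

The sketch below uses hypothetical toy data and only standard libraries, and is not the algorithm used by FixedEffectModels.jl

```julia
using LinearAlgebra, Statistics

# hypothetical toy panel: 3 groups with 4 observations each
g = repeat(1:3, inner = 4)
x = Float64.(1:12) .+ 0.5 .* sin.(1:12)
α = [1.0, -2.0, 0.5]                           # group-specific intercepts
y = α[g] .+ 0.7 .* x .+ 0.1 .* cos.(1:12)

# (a) dummy-variable regression: one indicator column per group
D = Float64.(g .== (1:3)')                     # 12 × 3 matrix of dummies
β_dummy = ([x D] \ y)[1]

# (b) within transformation: subtract group means, then regress without dummies
demean(v) = v .- [mean(v[g .== k]) for k in g]
β_within = dot(demean(x), demean(y)) / dot(demean(x), demean(x))

@show β_dummy
@show β_within
```

With many groups the dummy matrix grows with the number of groups, while the demeaned regression keeps the same small number of columns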


To explore the data, use the interactive DataVoyager and VegaLite

In [22]:
# cigar |> Voyager()

cigar |> @vlplot(
    :point,
    x=:Price,
    y=:Sales,
    color=:Year,
    size=:NDI
)
Out[22]:
[Figure: scatter plot of Sales (y-axis) against Price (x-axis), with point color given by Year and point size by NDI]