How to read this lecture...

Code should execute sequentially if run in a Jupyter notebook

The Need for Speed


Higher level languages such as Python are optimized for humans

This means that the programmer can leave many details to the runtime environment

  • specifying variable types
  • memory allocation/deallocation, etc.

The upside is that, compared to low-level languages, Python is typically faster to write, less error prone and easier to debug

The downside is that Python is harder to optimize — that is, turn into fast machine code — than languages like C or Fortran

Indeed, the standard implementation of Python (called CPython) cannot match the speed of compiled languages such as C or Fortran

Does that mean that we should just switch to C or Fortran for everything?

The answer is no, no and one hundred times no

High productivity languages should be chosen over high speed languages for the great majority of scientific computing tasks

This is because

  1. Of any given program, relatively few lines are ever going to be time-critical
  2. For those lines of code that are time-critical, we can achieve C-like speed with minor modifications

This lecture will walk you through some of the most popular options for implementing this last step

(A number of other useful options are mentioned below)

Where are the Bottlenecks?

Let’s start by trying to understand why high level languages like Python are slower than compiled code

Dynamic Typing

Consider this Python operation

a, b = 10, 10
a + b

Even for this simple operation, the Python interpreter has a fair bit of work to do

For example, in the statement a + b, the interpreter has to know which operation to invoke

If a and b are strings, then a + b requires string concatenation

a, b = 'foo', 'bar'
a + b

If a and b are lists, then a + b requires list concatenation

a, b = ['foo'], ['bar']
a + b
['foo', 'bar']

(We say that the operator + is overloaded — its action depends on the type of the objects on which it acts)

As a result, Python must check the type of the objects and then call the correct operation

This involves substantial overheads

Static Types

Compiled languages avoid these overheads with explicit, static types

For example, consider the following C code, which sums the integers from 1 to 10

#include <stdio.h>

int main(void) {
    int i;
    int sum = 0;
    for (i = 1; i <= 10; i++) {
        sum = sum + i;
    printf("sum = %d\n", sum);
    return 0;

The variables i and sum are explicitly declared to be integers

Hence, the meaning of addition here is completely unambiguous

Data Access

Another drag on speed for high level languages is data access

To illustrate, let’s consider the problem of summing some data — say, a collection of integers

Summing with Compiled Code

In C or Fortran, these integers would typically be stored in an array, which is a simple data structure for storing homogeneous data

Such an array is stored in a single contiguous block of memory

  • In modern computers, memory addresses are allocated to each byte (one byte = 8 bits)
  • For example, a 64 bit integer is stored in 8 bytes of memory
  • An array of \(n\) such integers occupies \(8n\) consecutive memory slots

Moreover, the compiler is made aware of the data type by the programmer

  • In this case 64 bit integers

Hence, each successive data point can be accessed by shifting forward in memory space by a known and fixed amount

  • In this case 8 bytes

Summing in Pure Python

Python tries to replicate these ideas to some degree

For example, in the standard Python implementation (CPython), list elements are placed in memory locations that are in a sense contiguous

However, these list elements are more like pointers to data rather than actual data

Hence, there is still overhead involved in accessing the data values themselves

This is a considerable drag on speed

In fact, it’s generally true that memory traffic is a major culprit when it comes to slow execution

Let’s look at some ways around these problems


Vectorization is about sending batches of related operations to native machine code

  • The machine code itself is typically compiled from carefully optimized C or Fortran

This can greatly accelerate many (but not all) numerical computations

Operations on Arrays

First let’s run some imports

import random
import numpy as np
import quantecon as qe

Now let’s try this non-vectorized code

qe.util.tic()   # Start timing
n = 100_000
sum = 0
for i in range(n):
    x = random.uniform(0, 1)
    sum += x**2
qe.util.toc()   # End timing
TOC: Elapsed: 0.055155277252197266 seconds.

Now compare this vectorized code

n = 100_000
x = np.random.uniform(0, 1, n)
TOC: Elapsed: 0.0016531944274902344 seconds.

The second code block — which achieves the same thing as the first — runs much faster

The reason is that in the second implementation we have broken the loop down into three basic operations

  1. draw n uniforms
  2. square them
  3. sum them

These are sent as batch operators to optimized machine code

Apart from minor overheads associated with sending data back and forth, the result is C or Fortran-like speed

When we run batch operations on arrays like this, we say that the code is vectorized

Vectorized code is typically fast and efficient

It is also surprisingly flexible, in the sense that many operations can be vectorized

The next section illustrates this point

Universal Functions

Many functions provided by NumPy are so-called universal functions — also called ufuncs

This means that they

  • map scalars into scalars, as expected
  • map arrays into arrays, acting element-wise

For example, np.cos is a ufunc:

np.cos(np.linspace(0, 1, 3))
array([ 1.,  0.87758256,  0.54030231])

By exploiting ufuncs, many operations can be vectorized

For example, consider the problem of maximizing a function \(f\) of two variables \((x,y)\) over the square \([-a, a] \times [-a, a]\)

For \(f\) and \(a\) let’s choose

\[f(x,y) = \frac{\cos(x^2 + y^2)}{1 + x^2 + y^2} \quad \text{and} \quad a = 3\]

Here’s a plot of \(f\)


To maximize it, we’re going to use a naive grid search:

  1. Evaluate \(f\) for all \((x,y)\) in a grid on the square
  2. Return the maximum of observed values

Here’s a non-vectorized version that uses Python loops

def f(x, y):
    return np.cos(x**2 + y**2) / (1 + x**2 + y**2)

grid = np.linspace(-3, 3, 1000)
m = -np.inf

for x in grid:
    for y in grid:
        z = f(x, y)
        if z > m:
            m = z

TOC: Elapsed: 2.508265256881714 seconds.

And here’s a vectorized version

def f(x, y):
    return np.cos(x**2 + y**2) / (1 + x**2 + y**2)

grid = np.linspace(-3, 3, 1000)
x, y = np.meshgrid(grid, grid)

np.max(f(x, y))
TOC: Elapsed: 0.048872947692871094 seconds.

In the vectorized version, all the looping takes place in compiled code

As you can see, the second version is much faster

(We’ll make it even faster again below, when we discuss Numba)

Pros and Cons of Vectorization

At its best, vectorization yields fast, simple code

However, it’s not without disadvantages

One issue is that it can be highly memory intensive

For example, the vectorized maximization routine above is far more memory intensive than the non-vectorized version that preceded it

Another issue is that not all algorithms can be vectorized

In these kinds of settings, we need to go back to loops

Fortunately, there are nice ways to speed up Python loops


One exciting development in this direction is Numba

Numba aims to automatically compile functions to native machine code instructions on the fly

The process isn’t flawless, since Numba needs to infer type information on all variables to generate pure machine instructions

Such inference isn’t possible in every setting

But for simple routines Numba infers types very well

Moreover, the “hot loops” at the heart of our code that we need to speed up are often such simple routines


If you followed our set up instructions, then Numba should be installed

Make sure you have the latest version of Anaconda by running conda update anaconda from a terminal (Mac, Linux) / cmd prompt (Windows)

An Example

Let’s consider some problems that are difficult to vectorize

One is generating the trajectory of a difference equation given an initial condition

Let’s take the difference equation to be the quadratic map

\[x_{t+1} = 4 x_t (1 - x_t)\]

Here’s the plot of a typical trajectory, starting from \(x_0 = 0.1\), with \(t\) on the x-axis


First let’s import Numba’s jit function

from numba import jit

Now, here’s a function to generate a trajectory of a given length from a given initial condition

def qm(x0, n):
    x = np.empty(n+1)
    x[0] = x0
    for t in range(n):
        x[t+1] = 4 * x[t] * (1 - x[t])
    return x

To speed this up using Numba is trivial

qm_numba = jit(qm)  # qm_numba is now a 'compiled' version of qm

Let’s time and compare identical function calls across these two versions:

qm(0.1, int(10**5))
time1 = qe.util.toc()
TOC: Elapsed: 0.07170653343200684 seconds.
qm_numba(0.1, int(10**5))
time2 = qe.util.toc()
TOC: Elapsed: 0.06515693664550781 seconds.

The first execution is relatively slow because of JIT compilation (see below)

Next time and all subsequent times it runs much faster:

qm_numba(0.1, int(10**5))
time2 = qe.util.toc()
TOC: Elapsed: 0.0003921985626220703 seconds.
time1 / time2  # Calculate speed gain

That’s a speed increase of two orders of magnitude!

Your mileage will of course vary depending on hardware and so on

Nonetheless, two orders of magnitude is huge relative to how simple and clear the implementation is

Decorator Notation

If you don’t need a separate name for the “numbafied” version of qm, you can just put @jit before the function

def qm(x0, n):
    x = np.empty(n+1)
    x[0] = x0
    for t in range(n):
        x[t+1] = 4 * x[t] * (1 - x[t])
    return x

This is equivalent to qm = jit(qm)

How and When it Works

Numba attempts to generate fast machine code using the infrastructure provided by the LLVM Project

It does this by inferring type information on the fly

As you can imagine, this is easier for simple Python objects (simple scalar data types, such as floats, integers, etc.)

Numba also plays well with NumPy arrays, which it treats as typed memory regions

In an ideal setting, Numba can infer all necessary type information

This allows it to generate native machine code, without having to call the Python runtime environment

In such a setting, Numba will be on par with machine code from low level languages

When Numba cannot infer all type information, some Python objects are given generic object status, and some code is generated using the Python runtime

In this second setting, Numba typically provides only minor speed gains — or none at all

Hence, it’s prudent when using Numba to focus on speeding up small, time-critical snippets of code

This will give you much better performance than blanketing your Python programs with @jit statements

A Gotcha: Global Variables

Consider the following example

a = 1

def add_x(x):
    return a + x

a = 2


Notice that changing the global had no effect on the value returned by the function

When Numba compiles machine code for functions, it treats global variables as constants to ensure type stability

Numba for vectorization

Numba can also be used to create custom ufuncs with the @vectorize decorator

To illustrate the advantage of using Numba to vectorize a function, we return to a maximization problem discussed above

from numba import vectorize

def f_vec(x, y):
    return np.cos(x**2 + y**2) / (1 + x**2 + y**2)
np.max(f_vec(x, y))  # Run once to compile

np.max(f_vec(x, y))
TOC: Elapsed: 0.011091232299804688 seconds.

This is faster than our vectorized version using NumPy’s ufuncs

Why should that be? After all, anything vectorized with NumPy will be running in fast C or Fortran code

The reason is that it’s much less memory intensive

For example, when NumPy computes np.cos(x**2 + y**2) it first creates the intermediate arrays x**2 and y**2, then it creates the array np.cos(x**2 + y**2)

In our @vectorize version using Numba, the entire operator is reduced to a single vectorized process and none of these intermediate arrays are created

We can gain further speed improvements using Numba’s automatic parallelization feature by specifying target=’parallel’

In this case, we need to specify the types of our inputs and outputs

@vectorize('float64(float64, float64)', target='parallel')
def f_vec(x, y):
    return np.cos(x**2 + y**2) / (1 + x**2 + y**2)
np.max(f_vec(x, y))  # Run once to compile

np.max(f_vec(x, y))
TOC: Elapsed: 0.008301734924316406 seconds.

This is a striking speed up with very little effort


Like Numba, Cython provides an approach to generating fast compiled code that can be used from Python

As was the case with Numba, a key problem is the fact that Python is dynamically typed

As you’ll recall, Numba solves this problem (where possible) by inferring type

Cython’s approach is different — programmers add type definitions directly to their “Python” code

As such, the Cython language can be thought of as Python with type definitions

In addition to a language specification, Cython is also a language translator, transforming Cython code into optimized C and C++ code

Cython also takes care of building language extentions — the wrapper code that interfaces between the resulting compiled code and Python

Important Note:

In what follows code is executed in a Jupyter notebook

This is to take advantage of a Cython cell magic that makes Cython particularly easy to use

Some modifications are required to run the code outside a notebook

A First Example

Let’s start with a rather artificial example

Suppose that we want to compute the sum \(\sum_{i=0}^n \alpha^i\) for given \(\alpha, n\)

Suppose further that we’ve forgotten the basic formula

\[\sum_{i=0}^n \alpha^i = \frac{1 - \alpha^{n+1}}{1 - \alpha}\]

for a geometric progression and hence have resolved to rely on a loop

Python vs C

Here’s a pure Python function that does the job

def geo_prog(alpha, n):
    current = 1.0
    sum = current
    for i in range(n):
        current = current * alpha
        sum = sum + current
    return sum

This works fine but for large \(n\) it is slow

Here’s a C function that will do the same thing

double geo_prog(double alpha, int n) {
    double current = 1.0;
    double sum = current;
    int i;
    for (i = 1; i <= n; i++) {
        current = current * alpha;
        sum = sum + current;
    return sum;

If you’re not familiar with C, the main thing you should take notice of is the type definitions

  • int means integer
  • double means double precision floating point number
  • the double in double geo_prog(... indicates that the function will return a double

Not surprisingly, the C code is faster than the Python code

A Cython Implementation

Cython implementations look like a convex combination of Python and C

We’re going to run our Cython code in the Jupyter notebook, so we’ll start by loading the Cython extension in a notebook cell

%load_ext Cython

In the next cell, we execute the following

def geo_prog_cython(double alpha, int n):
    cdef double current = 1.0
    cdef double sum = current
    cdef int i
    for i in range(n):
        current = current * alpha
        sum = sum + current
    return sum

Here cdef is a Cython keyword indicating a variable declaration, and is followed by a type

The %%cython line at the top is not actually Cython code — it’s a Jupyter cell magic indicating the start of Cython code

After executing the cell, you can now call the function geo_prog_cython from within Python

What you are in fact calling is compiled C code with a Python call interface

geo_prog(0.99, int(10**6))
TOC: Elapsed: 0.11026620864868164 seconds.
geo_prog_cython(0.99, int(10**6))
TOC: Elapsed: 0.038515567779541016 seconds.

Example 2: Cython with NumPy Arrays

Let’s go back to the first problem that we worked with: generating the iterates of the quadratic map

\[x_{t+1} = 4 x_t (1 - x_t)\]

The problem of computing iterates and returning a time series requires us to work with arrays

The natural array type to work with is NumPy arrays

Here’s a Cython implemention that initializes, populates and returns a NumPy array

import numpy as np

def qm_cython_first_pass(double x0, int n):
    cdef int t
    x = np.zeros(n+1, float)
    x[0] = x0
    for t in range(n):
        x[t+1] = 4.0 * x[t] * (1 - x[t])
    return np.asarray(x)

If you run this code and time it, you will see that it’s performance is disappointing — nothing like the speed gain we got from Numba

qm_cython_first_pass(0.1, int(10**5))
TOC: Elapsed: 0.03628730773925781 seconds.
qm_numba(0.1, int(10**5))
TOC: Elapsed: 0.0003917217254638672 seconds.

The reason is that working with NumPy arrays incurs substantial Python overheads

We can do better by using Cython’s typed memoryviews, which provide more direct access to arrays in memory

When using them, the first step is to create a NumPy array

Next, we declare a memoryview and bind it to the NumPy array

Here’s an example:

import numpy as np
from numpy cimport float_t

def qm_cython(double x0, int n):
    cdef int t
    x_np_array = np.zeros(n+1, dtype=float)
    cdef float_t [:] x = x_np_array
    x[0] = x0
    for t in range(n):
        x[t+1] = 4.0 * x[t] * (1 - x[t])
    return np.asarray(x)


  • cimport pulls in some compile-time information from NumPy
  • cdef float_t [:] x = x_np_array creates a memoryview on the NumPy array x_np_array
  • the return statement uses np.asarray(x) to convert the memoryview back to a NumPy array

Let’s time it:

qm_cython(0.1, int(10**5))
TOC: Elapsed: 0.0007178783416748047 seconds.

This is fast, although still slightly slower than qm_numba


Cython requires more expertise than Numba, and is a little more fiddly in terms of getting good performance

In fact, it’s surprising how difficult it is to beat the speed improvements provided by Numba


  • Cython is a very mature, stable and widely used tool
  • Cython can be more useful than Numba when working with larger, more sophisticated applications


Perhaps, like us, you sometimes run a long computation that simulates a model at a given set of parameters — to generate a figure, say, or a table

20 minutes later you realize that you want to tweak the figure and now you have to do it all again

What caching will do is automatically store results at each parameterization

Ideally, results are compressed and stored on file, and automatically served back up to you when you repeat the calculation

This is a more traditional and generic way to speed up code that can nonetheless be very useful for economic modeling


Our caching will use the joblib library, which you need to install to run the code below

This can be done at a terminal / cmd prompt by typing

!pip install joblib

An Example

Let’s look at a toy example, related to the quadratic map model discussed above

Let’s say we want to generate a long trajectory from a certain initial condition \(x_0\) and see what fraction of the sample is below 0.1

(We’ll omit JIT compilation or other speed ups for simplicity)

Here’s our code

from joblib import Memory

memory = Memory(cachedir='./joblib_cache')

def qm(x0, n):
    x = np.empty(n+1)
    x[0] = x0
    for t in range(n):
        x[t+1] = 4 * x[t] * (1 - x[t])
    return np.mean(x < 0.1)

We are using joblib to cache the result of calling qm at a given set of parameters

With the argument cachedir=’./joblib_cache’, any call to this function results in both the input values and output values being stored a subdirectory joblib_cache of the present working directory

(In UNIX shells, . refers to the present working directory)

The first time we call the function with a given set of parameters we see some extra output that notes information being cached

n = int(1e7)
qm(0.2, n)
[Memory] Calling __main__ [truncated]
______________________________________qm - 6.9s, 0.1min

TOC: Elapsed: 6.922961473464966 seconds.

The next time we call the function with the same set of parameters, the result is returned almost instantaneously

n = int(1e7)
qm(0.2, n)
TOC: Elapsed: 0.0009872913360595703 seconds.

Other Options

There are in fact many other approaches to speeding up your Python code

We mention only a few of the most popular methods

Interfacing with Fortran

If you are comfortable writing Fortran you will find it very easy to create extention modules from Fortran code using F2Py

F2Py is a Fortran-to-Python interface generator that is particularly simple to use

Robert Johansson provides a very nice introduction to F2Py, among other things

Recently, a Jupyter cell magic for Fortran has been developed — you might want to give it a try

Parallel and Cloud Computing

This is a big topic that we won’t address in detail yet

However, you might find the following links a useful starting point


Exercise 1

Later we’ll learn all about finite state Markov chains

For now, let’s just concentrate on simulating a very simple example of such a chain

Suppose that the volatility of returns on an asset can be in one of two regimes — high or low

The transition probabilities across states are as follows


For example, let the period length be one month, and suppose the current state is high

We see from the graph that the state next month will be

  • high with probability 0.8
  • low with probability 0.2

Your task is to simulate a sequence of monthly volatility states according to this rule

Set the length of the sequence to n = 100000 and start in the high state

Implement a pure Python version, a Numba version and a Cython version, and compare speeds

To test your code, evaluate the fraction of time that the chain spends in the low state

If your code is correct, it should be about 2/3


Exercise 1

We let

  • 0 represent “low”
  • 1 represent “high”
p, q = 0.1, 0.2  # Prob of leaving low and high state respectively

Here’s a pure Python version of the function

def compute_series(n):
    x = np.empty(n, dtype=int)
    x[0] = 1  # Start in state 1
    U = np.random.uniform(0, 1, size=n)
    for t in range(1, n):
        current_x = x[t-1]
        if current_x == 0:
            x[t] = U[t] < p
            x[t] = U[t] > q
    return x

Let’s run this code and check that the fraction of time spent in the low state is about 0.666

n = 100000
x = compute_series(n)
print(np.mean(x == 0))  # Fraction of time x is in state 0

Now let’s time it

TOC: Elapsed: 0.07770729064941406 seconds.

Next let’s implement a Numba version, which is easy

compute_series_numba = jit(compute_series)

Let’s check we still get the right numbers

x = compute_series_numba(n)
print(np.mean(x == 0))

Let’s see the time

TOC: Elapsed: 0.0017528533935546875 seconds.

This is a nice speed improvement for one line of code

Now let’s implement a Cython version

%load_ext Cython
import numpy as np
from numpy cimport int_t, float_t

def compute_series_cy(int n):
    # == Create NumPy arrays first == #
    x_np = np.empty(n, dtype=int)
    U_np = np.random.uniform(0, 1, size=n)
    # == Now create memoryviews of the arrays == #
    cdef int_t [:] x = x_np
    cdef float_t [:] U = U_np
    # == Other variable declarations == #
    cdef float p = 0.1
    cdef float q = 0.2
    cdef int t
    # == Main loop == #
    x[0] = 1
    for t in range(1, n):
        current_x = x[t-1]
        if current_x == 0:
            x[t] = U[t] < p
            x[t] = U[t] > q
    return np.asarray(x)
array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
x = compute_series_cy(n)
print(np.mean(x == 0))
TOC: Elapsed: 0.0025839805603027344 seconds.

The Cython implementation is fast, but not as fast as Numba

Appendix — Other Options

There are other important projects aimed at speeding up Python

These include but are not limited to

  • Pythran : A Python to C++ compiler
  • PyPy : Runtime environment using just-in-time compiler
  • Nuitka : Another Python compiler
  • Pyjion : JIT compilation, under development