Code should execute sequentially if run in a Jupyter notebook

# The Need for Speed¶

## Overview¶

Computer scientists often classify programming languages according to the following two categories

High level languages aim to maximize productivity by

• being easy to read, write and debug
• automating standard tasks (e.g., memory management)
• being interactive, etc.

Low level languages aim for speed and control, which they achieve by

• being closer to the metal (direct access to CPU, memory, etc.)
• requiring a relatively large amount of information from the user (e.g., all data types must be specified)

Traditionally, the choice between them has been seen as a trade-off:

• high productivity or high performance
• optimized for humans or optimized for machines

One of the great strengths of Julia is that it pushes out the curve, achieving both high productivity and high performance with relatively little fuss

The word “relatively” is important here, however...

In simple programs, excellent performance is often trivial to achieve

For longer, more sophisticated programs, you need to be aware of potential stumbling blocks

This lecture covers the key points

### Requirements¶

You should read our earlier lecture on types, methods and multiple dispatch before this one

## Foundations¶

Let’s think about how quickly code runs, taking as given

• hardware configuration
• algorithm (i.e., set of instructions to be executed)

We’ll start by discussing the kinds of instructions that machines understand

### Machine Code¶

All instructions for computers end up as machine code

Writing fast code — expressing a given algorithm so that it runs quickly — boils down to producing efficient machine code

You can do this yourself, by hand, if you want to

Typically this is done by writing assembly, which is a symbolic representation of machine code

Here’s some assembly code implementing a function that takes arguments $a, b$ and returns $2a + 8b$

pushq   %rbp
movq    %rsp, %rbp
leaq    (%rdi,%rsi,8), %rax
addq    %rdi, %rax
popq    %rbp
retq
nopl    (%rax)


Note that this code is specific to one particular piece of hardware that we use — different machines require different machine code

If you ever feel tempted to start rewriting your economic model in assembly, please restrain yourself

It’s far more sensible to give these instructions in a language like Julia, where they can be easily written and understood

function f(a, b)
    y = 2a + 8b
    return y
end


or Python

def f(a, b):
    y = 2 * a + 8 * b
    return y


or even C

int f(int a, int b) {
    int y = 2 * a + 8 * b;
    return y;
}


In any of these languages we end up with code that is much easier for humans to write, read, share and debug

We leave it up to the machine itself to turn our code into machine code

How exactly does this happen?

### Generating Machine Code¶

The process for turning high level code into machine code differs across languages

Let’s look at some of the options and how they differ from one another

#### AOT Compiled Languages¶

Traditional compiled languages like Fortran, C and C++ are a reasonable option for writing fast code

Indeed, the standard benchmark for performance is still well-written C or Fortran

These languages compile down to efficient machine code because users are forced to provide a lot of detail on data types and how the code will execute

The compiler therefore has ample information for building the corresponding machine code ahead of time (AOT) in a way that

• organizes the data optimally in memory and
• implements efficient operations as required for the task in hand

At the same time, the syntax and semantics of C and Fortran are verbose and unwieldy when compared to something like Julia

Moreover, these low level languages lack the interactivity that’s so crucial for scientific work

#### Interpreted Languages¶

Interpreted languages like Python generate machine code “on the fly”, during program execution

This allows them to be flexible and interactive

Moreover, programmers can leave many tedious details to the runtime environment, such as

• specifying variable types
• memory allocation/deallocation, etc.

But all this convenience and flexibility comes at a cost: it’s hard to turn instructions written in these languages into efficient machine code

For example, consider what happens when Python adds a long list of numbers together

Typically the runtime environment has to check the type of these objects one by one before it figures out how to add them

There are also significant overheads associated with accessing the data values themselves, which might not be stored contiguously in memory

The resulting machine code is often complex and slow

#### Just-in-time compilation¶

Just-in-time (JIT) compilation is an alternative approach that marries some of the advantages of AOT compilation and interpreted languages

The basic idea is that functions for specific tasks are compiled as requested

As long as the compiler has enough information about what the function does, it can in principle generate efficient machine code

In some instances, all the information is supplied by the programmer

In other cases, the compiler will attempt to infer missing information on the fly based on usage

Through this approach, computing environments built around JIT compilers aim to

• provide all the benefits of high level languages discussed above and, at the same time,
• produce efficient instruction sets when functions are compiled down to machine code
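The compile-on-first-call behavior is easy to observe in Julia itself. Here's a minimal sketch (the function `h` and the timing pattern are our own illustration; exact timings are machine-dependent):

```julia
# A toy function — nothing about it is special
h(x) = x^2 + 1

# The first call triggers JIT compilation of a version of `h`
# specialized for Int arguments, so it appears slow
@time h(2)

# The second call reuses the cached machine code and is far faster
@time h(2)
```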

## JIT Compilation in Julia¶

JIT compilation is the approach used by Julia

In an ideal setting, all information necessary to generate efficient native machine code is supplied or inferred

In such a setting, Julia will be on par with machine code from low level languages

### An Example¶

Consider the function

function f(a, b)
    y = (a + 8b)^2
    return 7y
end


Suppose we call f with integer arguments (e.g., z = f(1, 2))

The JIT compiler now knows the types of a and b

Moreover, it can infer types for other variables inside the function

• e.g., y will also be an integer

It then compiles a specialized version of the function to handle integers and stores it in memory
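One way to confirm that inference succeeded is the @code_warntype macro, which prints the inferred type of every variable (a quick sketch; the printed format varies across Julia versions):

```julia
function f(a, b)
    y = (a + 8b)^2
    return 7y
end

# With Int arguments, every variable is annotated with the concrete
# type Int64 — no Union/Any warnings, so inference succeeded
@code_warntype f(1, 2)
```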

We can view the corresponding machine code using the @code_native macro

@code_native f(1, 2)

pushq   %rbp
movq    %rsp, %rbp
leaq    (%rdi,%rsi,8), %rdi
movabsq $power_by_squaring, %rax
movl    $2, %esi
callq   %rax
imulq   $7, %rax, %rax
popq    %rbp
retq
nop


If we now call f again, but this time with floating point arguments, the JIT compiler will once more infer types for the other variables inside the function

• e.g., y will also be a float

It then compiles a new version to handle this type of argument

@code_native f(1.0, 2.0)

pushq   %rbp
movq    %rsp, %rbp
movabsq $139613711993752, %rax  # imm = 0x7EFA59B58B98
mulsd   (%rax), %xmm1
mulsd   %xmm1, %xmm1
movabsq $139613711993760, %rax  # imm = 0x7EFA59B58BA0
mulsd   (%rax), %xmm1
movapd  %xmm1, %xmm0
popq    %rbp
retq
nop


Subsequent calls using either floats or integers are now routed to the appropriate compiled code

### Potential Problems¶

In some senses, what we saw above was a best case scenario

Sometimes the JIT compiler produces messy, slow machine code

This happens when type inference fails or the compiler has insufficient information to optimize effectively

The next section looks at situations where these problems arise and how to get around them

## Fast and Slow Julia Code¶

To summarize what we’ve learned so far, Julia provides a platform for generating highly efficient machine code with relatively little effort by combining

1. JIT compilation
2. Optional type declarations and type inference to pin down the types of variables and hence compile efficient code
3. Multiple dispatch to facilitate specialization and optimization of compiled code for different data types

But the process is not flawless, and hiccups can occur

The purpose of this section is to highlight potential issues and show you how to circumvent them

### Global Variables¶

Global variables are names assigned to values outside of any function or type definition

They are convenient, and novice programmers typically use them with abandon

But global variables are also dangerous, especially in medium to large size programs, since

• they can affect what happens in any part of your program
• they can be changed by any function

This makes it much harder to be certain about what some small part of a given piece of code actually does

Here’s a useful discussion on the topic

When it comes to JIT compilation, global variables create further problems

The reason is that the compiler can never be sure of the type of the global variable, or even that the type will stay constant while a given function runs

To illustrate, consider this code, where b is global

b = 1.0
function g(a)
    for i in 1:1_000_000
        tmp = a + b
    end
end


The code executes relatively slowly and uses a huge amount of memory

@time g(1.0)

0.023039 seconds (2.00 M allocations: 30.518 MB, 12.90% gc time)


If you look at the corresponding machine code you will see that it’s a mess

@code_native g(1.0)

pushq   %rbp
movq    %rsp, %rbp
pushq   %r15
pushq   %r14
pushq   %r13
pushq   %r12
pushq   %rbx
subq    $56, %rsp
movsd   %xmm0, -88(%rbp)
movq    %fs:0, %r15
addq    $-2672, %r15            # imm = 0xFFFFFFFFFFFFF590
xorpd   %xmm0, %xmm0
movupd  %xmm0, -64(%rbp)
movq    $0, -48(%rbp)
movq    $6, -80(%rbp)
movq    (%r15), %rax
movq    %rax, -72(%rbp)
leaq    -80(%rbp), %rax
movq    %rax, (%r15)
movabsq $140242067578672, %r12  # imm = 0x7F8CA69EDF30
movl    $1000000, %ebx          # imm = 0xF4240
leaq    5596992(%r12), %r13
movabsq $jl_apply_generic, %r14
nop
movq    75966904(%r12), %rax
movq    %rax, -48(%rbp)
movq    %r13, -64(%rbp)
movl    $1432, %esi             # imm = 0x598
movl    $16, %edx
movq    %r15, %rdi
movabsq $jl_gc_pool_alloc, %rax
callq   %rax
movq    %r12, -8(%rax)
movsd   -88(%rbp), %xmm0        # xmm0 = mem[0],zero
movsd   %xmm0, (%rax)
movq    %rax, -56(%rbp)
movl    $3, %esi
leaq    -64(%rbp), %rdi
callq   %r14
decq    %rbx
jne     L112
movq    -72(%rbp), %rax
movq    %rax, (%r15)
addq    $56, %rsp
popq    %rbx
popq    %r12
popq    %r13
popq    %r14
popq    %r15
popq    %rbp
retq
nopw    %cs:(%rax,%rax)


If we eliminate the global variable like so

function g(a, b)
    for i in 1:1_000_000
        tmp = a + b
    end
end


then execution speed improves dramatically

@time g(1.0, 1.0)

0.002876 seconds (1.31 k allocations: 61.374 KB)

@time g(1.0, 1.0)

0.000001 seconds (4 allocations: 160 bytes)


Note that the second run was dramatically faster than the first

That’s because the first call included the time for JIT compilation

Notice also how small the memory footprint of the execution is

Also, the machine code is simple and clean

@code_native g(1.0, 1.0)

pushq   %rbp
movq    %rsp, %rbp
popq    %rbp
retq
nopw    %cs:(%rax,%rax)


Now the compiler is certain of types throughout execution of the function and hence can optimize accordingly

#### The const keyword¶

Another way to stabilize the code above is to maintain the global variable but prepend it with const

const b_const = 1.0
function g(a)
    for i in 1:1_000_000
        tmp = a + b_const
    end
end


Now the compiler can again generate efficient machine code

We’ll leave you to experiment with it

### Composite Types with Abstract Field Types¶

Another scenario that trips up the JIT compiler is when composite types have fields with abstract types

We met this issue earlier, when we discussed AR(1) models

Let’s experiment, using, respectively,

• an untyped field
• a field with abstract type, and
• parametric typing

As we’ll see, the last of these options gives us the best performance, while still maintaining significant flexibility

Here’s the untyped case

struct Foo_generic
    a
end


Here’s the case of an abstract type on the field a

struct Foo_abstract
    a::Real
end


Finally, here’s the parametrically typed case

struct Foo_concrete{T <: Real}
    a::T
end


Now we generate instances

fg = Foo_generic(1.0)
fa = Foo_abstract(1.0)
fc = Foo_concrete(1.0)


In the last case, concrete type information for the fields is embedded in the object

typeof(fc)

Foo_concrete{Float64}


This is significant because such information is detected by the compiler

#### Timing¶

Here’s a function that uses the field a of our objects

function f(foo)
    for i in 1:1_000_000
        tmp = i + foo.a
    end
end


Let’s try timing our code, starting with the generic case:

@time f(fg)

0.029499 seconds (2.00 M allocations: 30.510 MB, 21.81% gc time)


The timing is not very impressive

Here’s the nasty looking machine code

@code_native f(fg)

pushq   %rbp
movq    %rsp, %rbp
pushq   %r15
pushq   %r14
pushq   %r13
pushq   %r12
pushq   %rbx
subq    $56, %rsp
movq    %rdi, %r15
movq    %fs:0, %rcx
addq    $-2672, %rcx            # imm = 0xFFFFFFFFFFFFF590
movq    %rcx, -88(%rbp)
xorps   %xmm0, %xmm0
movups  %xmm0, -64(%rbp)
movq    $0, -48(%rbp)
movq    $6, -80(%rbp)
movq    (%rcx), %rax
movq    %rax, -72(%rbp)
leaq    -80(%rbp), %rax
movq    %rax, (%rcx)
movl    $1, %ebx
movabsq $140242073175664, %r13  # imm = 0x7F8CA6F44670
movabsq $jl_box_int64, %r12
movabsq $jl_apply_generic, %r14
movq    (%r15), %rax
movq    %rax, -48(%rbp)
movq    %r13, -64(%rbp)
movq    %rbx, %rdi
leaq    1(%rbx), %rbx
callq   %r12
movq    %rax, -56(%rbp)
movl    $3, %esi
leaq    -64(%rbp), %rdi
callq   %r14
cmpq    $1000001, %rbx          # imm = 0xF4241
jne     L112
movq    -72(%rbp), %rax
movq    -88(%rbp), %rcx
movq    %rax, (%rcx)
addq    $56, %rsp
popq    %rbx
popq    %r12
popq    %r13
popq    %r14
popq    %r15
popq    %rbp
retq
nopl    (%rax,%rax)


The abstract case is similar

@time f(fa)

0.030892 seconds (2.00 M allocations: 30.585 MB, 7.18% gc time)


Note the large memory footprint

The machine code is also long and complex, although we omit details

Finally, let’s look at the parametrically typed version

@time f(fc)

0.002850 seconds (1.45 k allocations: 67.642 KB)


Some of this time is JIT compilation, and one more execution gets us down to

0.000001 seconds (3 allocations: 144 bytes)


Here’s the corresponding machine code

@code_native f(fc)

pushq   %rbp
movq    %rsp, %rbp
popq    %rbp
retq
nopw    %cs:(%rax,%rax)


Much nicer...

### Abstract Containers¶

Another way we can run into trouble is with abstract container types

Consider the following function, which essentially does the same job as Julia’s sum() function but acts only on floating point data

function sum_float_array(x::Array{Float64, 1})
    sum = 0.0
    for i in 1:length(x)
        sum += x[i]
    end
    return sum
end


Calls to this function run very quickly

x = range(0, stop = 1, length = 1_000_000)
x = collect(x)
typeof(x)

Array{Float64,1}

@time sum_float_array(x)

0.005524 seconds (1.74 k allocations: 82.486 KB)


When Julia compiles this function, it knows that the data passed in as x will be an array of 64 bit floats

Hence it’s known to the compiler that the relevant method for + is always addition of floating point numbers

Moreover, the data can be arranged into contiguous 64 bit blocks of memory to simplify memory access

Finally, data types are stable — for example, the local variable sum starts off as a float and remains a float throughout
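By contrast, here is a hypothetical variant where the accumulator is type-unstable — it starts life as an Int and may become a Float64 (our own illustrative example, not from the lecture):

```julia
function unstable_sum(x)
    s = 0            # s is an Int here...
    for i in 1:length(x)
        s += x[i]    # ...but becomes a Float64 when x holds floats
    end
    return s
end

# `@code_warntype unstable_sum(rand(10))` flags `s` as
# Union{Float64, Int64} — a classic type-stability warning
```

Initializing with `s = zero(eltype(x))` instead keeps the accumulator's type stable.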

#### Type Inference¶

Here’s the same function minus the type annotation in the function signature

function sum_array(x)
    sum = 0.0
    for i in 1:length(x)
        sum += x[i]
    end
    return sum
end


When we run it with the same array of floating point numbers it executes at a similar speed as the function with type information

@time sum_array(x)

0.005556 seconds (1.61 k allocations: 75.002 KB)


The reason is that when sum_array() is first called on a vector of a given data type, a newly compiled version of the function is produced to handle that type

In this case, since we’re calling the function on a vector of floats, we get a compiled version of the function with essentially the same internal representation as sum_float_array()

#### An Abstract Container¶

Things get tougher for the compiler when the data type within the array is imprecise

For example, the following snippet creates an array where the element type is Any

x = Any[1/i for i in 1:1e6];

eltype(x)

Any


Now summation is much slower and memory management is less efficient

@time sum_array(x)

0.021680 seconds (1.00 M allocations: 15.332 MB)


Here are some final comments on performance

### Explicit Typing¶

Writing fast Julia code amounts to writing Julia from which the compiler can generate efficient machine code

For this, Julia needs to know about the type of data it’s processing as early as possible

We could hard code the type of all variables and function arguments but this comes at a cost

Our code becomes more cumbersome and less generic

We are starting to lose the advantages that drew us to Julia in the first place

Moreover, explicitly typing everything is not necessary for optimal performance

The Julia compiler is smart and can often infer types perfectly well, without any performance cost

What we really want to do is

• keep our code simple, elegant and generic
• help the compiler out in situations where it’s liable to get tripped up

### Summary and Tips¶

Use functions to segregate operations into logically distinct blocks

Data types will be determined at function boundaries

If types are not supplied then they will be inferred

If types are stable and can be inferred effectively your functions will run fast
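One practical pattern that follows from these tips is the function barrier: when types are uncertain in outer code, push the work into an inner function, so that concrete types get pinned down at the call boundary (our own sketch, not from the lecture itself):

```julia
# Inner kernel — compiled separately for each concrete element type
function kernel(x)
    s = zero(eltype(x))
    for v in x
        s += v
    end
    return s
end

# Outer code might build data whose type is hard to infer; the call
# to `kernel` is the barrier where a specialized method is selected
data = collect(1.0:100.0)
result = kernel(data)
```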