Multiple Dispatch#

In this notebook, we show the usage of static multiple dispatch using a simple example. Static multiple dispatch is used in BackpropTools to provide different implementations of the primitives used in deep learning, reinforcement learning, etc. that are tailored to the particular device the code is executed on. In contrast to the Julia programming language, which popularized dynamic/runtime multiple dispatch, in BackpropTools the dispatch to a particular implementation happens at compile time, enabling the compiler to heavily optimize the code for the particular device.

First, we set up the environment as detailed in Containers:

[2]:
#include <backprop_tools/operations/cpu.h>
[3]:
namespace bpt = backprop_tools;
using DEVICE = bpt::devices::DefaultCPU;
using T = float;
using TI = typename DEVICE::index_t;

Additionally, we create a new device (a hypothetical Accelerator) that is derived from the default CPU device. As long as we do not overload any functions for this device, using it will automatically dispatch to the functions of the original device.

[4]:
struct Accelerator: DEVICE{};
[5]:
DEVICE device_1;
Accelerator device_2;
TI seed = 1;
auto rng = bpt::random::default_engine(DEVICE::SPEC::RANDOM(), seed);
bpt::MatrixDynamic<bpt::matrix::Specification<T, TI, 3, 3>> m;

We allocate the matrix using the first device. In this case it makes no difference whether we allocate it using the first or the second device, but if we were, e.g., using a GPU with separate, on-device memory, we would have to allocate containers on the particular device they will be used on. After allocating containers on different devices, they can be copied between devices using backprop_tools::copy (a sketch of this follows the next cell).

[6]:
bpt::malloc(device_1, m);
bpt::set_all(device_1, m, 0);
bpt::print(device_1, m);
    0.000000     0.000000     0.000000
    0.000000     0.000000     0.000000
    0.000000     0.000000     0.000000
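
As mentioned above, containers allocated on different devices can be exchanged with backprop_tools::copy. The following is a minimal sketch; the argument order (source device and matrix before their target counterparts) is an assumption and may differ between library versions:

bpt::MatrixDynamic<bpt::matrix::Specification<T, TI, 3, 3>> m_copy;
bpt::malloc(device_2, m_copy);
// copy the contents of m (associated with device_1) into m_copy (associated with device_2);
// since both devices are CPU variants here, this amounts to a plain memory copy
bpt::copy(device_1, device_2, m, m_copy);
bpt::free(device_2, m_copy);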

Now we define a new operation that takes a matrix and increments its first element by 10000000. Note that this function can deal with both device_1 and device_2, since Accelerator is derived from DEVICE. Additionally, because template metaprogramming lets us pass around the matrix’s Specification at compile time, we can use static_assert to make sure the operation cannot be used on smaller matrices. This shows how static multiple dispatch allows for bounds checking at compile time without any run-time overhead. On another note, we use the index type TI for the loop counter because in BackpropTools we never hardcode any integer or floating-point types, so that the optimal ones can be used depending on the device we are compiling for.

[7]:
template <typename SPEC>
void increment_first_element(DEVICE& device, backprop_tools::Matrix<SPEC>& matrix){
    using TI = DEVICE::index_t;
    static_assert(SPEC::ROWS >= 1);
    static_assert(SPEC::COLS >= 1);
    for(TI i=0; i < 10000000; i++){
        bpt::increment(matrix, 0, 0, 1);
    }
}
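
To see the compile-time bounds check in action, consider the following sketch: a matrix whose Specification satisfies the static_asserts compiles, while a shape that violates them is rejected before the program ever runs. The 1x1 and 0x3 shapes are chosen purely for illustration:

bpt::MatrixDynamic<bpt::matrix::Specification<T, TI, 1, 1>> single; // smallest shape the operation accepts
bpt::malloc(device_1, single);
bpt::set_all(device_1, single, 0);
increment_first_element(device_1, single); // compiles: ROWS >= 1 and COLS >= 1 hold
// bpt::MatrixDynamic<bpt::matrix::Specification<T, TI, 0, 3>> too_small;
// increment_first_element(device_1, too_small); // would fail to compile: static_assert(SPEC::ROWS >= 1)
bpt::free(device_1, single);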

Now we can benchmark the runtime of this admittedly horribly inefficient implementation of the increment operation:

[8]:
#include <chrono>
#include <iostream>
auto start = std::chrono::high_resolution_clock::now();
increment_first_element(device_1, m);
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> elapsed = end - start;
std::cout << "Time taken: " << elapsed.count() << " ms\n";
Time taken: 117.943163 ms

We can overload the previous implementation for the new Accelerator device and provide an implementation that is tailored to its hardware. In this hypothetical case we simply provide a more efficient implementation of the same increment operation:

[9]:
template <typename SPEC>
void increment_first_element(Accelerator& device, backprop_tools::Matrix<SPEC>& matrix){
    using TI = DEVICE::index_t;
    static_assert(SPEC::ROWS >= 1);
    static_assert(SPEC::COLS >= 1);
    bpt::increment(matrix, 0, 0, 10000000);
}

Executing this implementation on the same data structure, but using device_2, yields a significant speedup:

[10]:
bpt::set_all(device_2, m, 0);
auto start = std::chrono::high_resolution_clock::now();
increment_first_element(device_2, m);
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> elapsed = end - start;
std::cout << "Time taken: " << elapsed.count() << " ms\n";
Time taken: 0.000480 ms

Now that we have seen how primitives can be tailored to particular devices using multiple dispatch and specialized implementations, we want to use them in higher-level, more abstract algorithms that are agnostic to the hardware they run on, as long as the primitive operations behave the same:

[11]:
template <typename DEVICE, typename SPEC>
void algorithm(DEVICE& device, backprop_tools::Matrix<SPEC>& matrix){
    using TI = typename DEVICE::index_t;
    for(TI i=0; i < 5; i++){
        increment_first_element(device, matrix);
    }
}
[12]:
auto start = std::chrono::high_resolution_clock::now();
algorithm(device_1, m);
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> elapsed = end - start;
std::cout << "Time taken: " << elapsed.count() << " ms\n";
Time taken: 588.734812 ms
[13]:
auto start = std::chrono::high_resolution_clock::now();
algorithm(device_2, m);
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> elapsed = end - start;
std::cout << "Time taken: " << elapsed.count() << " ms\n";
Time taken: 0.000806 ms

In practice, we provide generic implementations in pure C++ for all deep learning and reinforcement learning operators that do not depend on specific device capabilities. These naive implementations can be quite slow (e.g. for large matrix multiplications), hence we provide specialized implementations that are dispatched to by including the appropriate operations for the device and then calling all algorithms with the specialized device type. For example, the Intel MKL library provides implementations of matrix multiplication that are tailored to Intel processors and their vector extensions (e.g. AVX). In that case we would include #include <backprop_tools/operations/cpu_mkl.h>, which uses all the generic and CPU implementations available in BackpropTools but overloads the forward and backward passes of neural networks to dispatch to the fast matrix multiplication implementations. Moreover, it also overloads backprop_tools::malloc to align the container memory to 64-byte boundaries, which makes loading from and storing to memory more efficient through the aggregation of multiple loads and stores.
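
The following sketch illustrates how such a specialized build might look in a separate translation unit. The include path is taken from the text above, while the device alias DefaultCPU_MKL is an assumption and may be named differently depending on the library version:

#include <backprop_tools/operations/cpu_mkl.h> // used instead of operations/cpu.h; pulls in the MKL-backed overloads
namespace bpt = backprop_tools;
using DEVICE_MKL = bpt::devices::DefaultCPU_MKL; // assumed name of the MKL-backed CPU device
int main(){
    DEVICE_MKL device;
    bpt::MatrixDynamic<bpt::matrix::Specification<float, DEVICE_MKL::index_t, 3, 3>> m;
    bpt::malloc(device, m); // overloaded for this device: memory aligned to 64-byte boundaries
    bpt::set_all(device, m, 0);
    bpt::free(device, m);
    return 0;
}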

We prefer static multiple dispatch in the way shown above over C++ method overriding because the latter requires an implicit method lookup through the virtual method table (VMT) at runtime. In contrast, static multiple dispatch lets the compiler resolve the dispatch at compile time, incurring no runtime overhead while also allowing it to aggressively inline and optimize across the dispatched functions. This is especially beneficial on accelerators like GPUs.
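
For a minimal, self-contained illustration of the difference (not library code): with virtual methods, every call goes through the virtual method table at runtime, whereas with tag-based overloads the compiler resolves the call at compile time and can inline it like any other free function:

#include <cstddef>

// dynamic dispatch: the call site only sees DeviceBase and has to look up scale in the VMT at runtime
struct DeviceBase{
    virtual void scale(float* data, std::size_t n, float factor) const = 0;
    virtual ~DeviceBase() = default;
};
struct GenericCPU: DeviceBase{
    void scale(float* data, std::size_t n, float factor) const override {
        for(std::size_t i = 0; i < n; i++){ data[i] *= factor; }
    }
};

// static dispatch: the device is a plain tag type, the overload is chosen at compile time,
// and a derived tag (AcceleratorTag) falls back to the base overload unless a more specific one is provided
struct GenericCPUTag{};
struct AcceleratorTag: GenericCPUTag{};
inline void scale(const GenericCPUTag&, float* data, std::size_t n, float factor){
    for(std::size_t i = 0; i < n; i++){ data[i] *= factor; }
}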