CPU Acceleration

The generic implementation of matrix multiplication using a triple nested loop is reasonably fast because the design of BackpropTools allows the compiler to heavily optimize the code (especially for smaller matrices). For larger matrices it is beneficial to use the CPU's SIMD instructions (e.g. AVX). This can be done through BLAS libraries like Intel MKL or OpenBLAS. We found Intel MKL to be the fastest, but it does not work reliably in Cling (the C/C++ interpreter used as the backend of the Jupyter notebook for this tutorial). Hence we use OpenBLAS in this and the following examples. BackpropTools has a multiplexing header cpu_mux.h that automatically includes the CPU operations dispatching to the available backend. The backend is selected by defining e.g. BACKPROP_TOOLS_BACKEND_ENABLE_OPENBLAS or BACKPROP_TOOLS_BACKEND_ENABLE_MKL (these options can also be set during the configuration phase of CMake when using BackpropTools natively by passing them on the CLI using -D).

[1]:
#define BACKPROP_TOOLS_BACKEND_ENABLE_OPENBLAS
#include <backprop_tools/operations/cpu_mux.h>
namespace bpt = backprop_tools;
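
For reference, selecting a different backend only requires changing the define before including cpu_mux.h. This is a sketch of what the first cell might look like when targeting Intel MKL instead (assuming an MKL installation is available):

#define BACKPROP_TOOLS_BACKEND_ENABLE_MKL // sketch: assumes Intel MKL is installed
#include <backprop_tools/operations/cpu_mux.h>
namespace bpt = backprop_tools;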

We also include iostream to print the computation times later on (and chrono to measure them):

[2]:
#include <iostream>
#include <chrono>

OpenBLAS contains pre-compiled matrix multiplication kernels and hence needs to be linked (or loaded in case of the Cling interpreter):

[3]:
#pragma cling load("openblas")

We define two devices to compare the computation time later on. All CPU_* devices use the main memory to store containers, hence matrices allocated on one of them are compatible with the others. For maximum performance it is recommended to allocate the matrices with the device they are used on afterwards, because e.g. Intel MKL allows allocating chunks of memory aligned to certain boundaries, which enables faster loads using SIMD instructions. This is integrated into the particular bpt::malloc specialization, which is dispatched to by simply calling it with the desired device.

Because we don’t know the outcome of the backend detection (or rather we don’t want to hard-code it), cpu_mux.h provides a DEVICE_FACTORY template that generates the detected CPU_* device when passed a CPU specification. In this case (since we know OpenBLAS is available and BACKPROP_TOOLS_BACKEND_ENABLE_OPENBLAS is defined) this is equal to using DEVICE_BLAS = bpt::devices::CPU_OPENBLAS<bpt::devices::DefaultCPUSpecification>.

You can play around with commenting out #define BACKPROP_TOOLS_BACKEND_ENABLE_OPENBLAS in the first cell and re-running the notebook. In that case you will see that the factory falls back to the plain CPU device for DEVICE_BLAS, resulting in equally slow computation times for both devices in the benchmark later on.

[4]:
using DEVICE = bpt::devices::DefaultCPU;
using DEVICE_BLAS = bpt::DEVICE_FACTORY<bpt::devices::DefaultCPUSpecification>;
using T = double;
using TI = typename DEVICE::index_t;
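
To convince ourselves that the factory indeed resolved to the OpenBLAS device, we could add a compile-time check along these lines (a sketch, assuming BACKPROP_TOOLS_BACKEND_ENABLE_OPENBLAS is defined as in the first cell; requires <type_traits> and C++17):

#include <type_traits>
// check that DEVICE_FACTORY resolved to the OpenBLAS device rather than the generic CPU
static_assert(std::is_same_v<DEVICE_BLAS, bpt::devices::CPU_OPENBLAS<bpt::devices::DefaultCPUSpecification>>, "DEVICE_FACTORY did not resolve to CPU_OPENBLAS");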

We declare \(A\), \(B\), \(C\) and \(C_{blas}\) matrices to evaluate the computation:

\[C = A \cdot B\]
[5]:
DEVICE device;
DEVICE_BLAS device_blas;
constexpr TI SIZE = 500;
bpt::MatrixDynamic<bpt::matrix::Specification<T, TI, SIZE, SIZE>> A, B, C, C_blas;
auto rng = bpt::random::default_engine(typename DEVICE::SPEC::RANDOM(), 1);

We allocate all the matrices and fill \(A\) and \(B\) with random numbers:

[6]:
bpt::malloc(device_blas, A);
bpt::malloc(device_blas, B);
bpt::malloc(device_blas, C);
bpt::malloc(device_blas, C_blas);
bpt::randn(device, A, rng);
bpt::randn(device, B, rng);

Now we can benchmark the matrix multiplication using the generic implementation:

[7]:
{
    auto start = std::chrono::high_resolution_clock::now();
    bpt::multiply(device, A, B, C);
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration<double>(end - start);
    std::cout << "Time Cling (C/C++ interpreter): " << duration.count() << " seconds" << std::endl;
}
Time Cling (C/C++ interpreter): 2.21825 seconds

Equivalently we can run the same computation using OpenBLAS and get the result in much less time:

[8]:
{
    auto start = std::chrono::high_resolution_clock::now();
    bpt::multiply(device_blas, A, B, C_blas);
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration<double>(end - start);
    std::cout << "Time Cling (C/C++ interpreter): " << duration.count() << " seconds" << std::endl;
}
Time Cling (C/C++ interpreter): 0.0061241 seconds

Now we check the resulting matrices against each other:

[9]:
std::cout << "Absolute difference between the resulting matrices: " << bpt::abs_diff(device, C, C_blas) << std::endl;
Absolute difference between the resulting matrices: 2.9357e-09

Depending on the machine, compilers and library (versions) used, we might find that the results are exactly equal, but this is not necessarily the case. With T changed to float the deviations should be larger, but even for double they can occur because of floating-point inaccuracies: the same mathematical expression does not necessarily yield the same result if the (finite precision) computations are reordered.
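
As a small, self-contained illustration of this reordering effect (independent of BackpropTools), floating-point addition is not associative; with IEEE 754 doubles the following two groupings of the same sum print different results:

// the same sum, grouped differently, yields different results in finite precision
std::cout << (1.0 + 1e16) - 1e16 << std::endl; // prints 0: the 1.0 is lost when added to 1e16
std::cout << 1.0 + (1e16 - 1e16) << std::endl; // prints 1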

Another sanity check is printing the matrices. This is infeasible at their full size, hence we only print a subview of size \(5 \times 5\). This can be done using the bpt::view operator, which yields a submatrix that is a view of the original matrix. The view is an ordinary bpt::Matrix and can hence be used in all operations that take matrices as input. Since it is a view, it is cheap (zero-copy): at runtime it only carries a single pointer, plus compile-time information about the shape (dimensions and pitch).

[10]:
constexpr TI VIEW_SIZE = 5;
using VIEW = bpt::matrix::ViewSpec<VIEW_SIZE, VIEW_SIZE>;
auto vC = bpt::view(device, C, VIEW{}, 0, 0);
auto vC_blas = bpt::view(device, C_blas, VIEW{}, 0, 0);

std::cout << "Result comparison (top-left " << VIEW_SIZE << "x" << VIEW_SIZE << " for brevity)" << std::endl;
std::cout << std::endl << "C: " << std::endl;
bpt::print(device, vC);
std::cout << std::endl << "C_blas: " << std::endl;
bpt::print(device, vC_blas);
Result comparison (top-left 5x5 for brevity)

C:
   -6.968835    20.802622    -5.761290    -4.823612   -12.391784
    3.728964    -7.716708    23.316995    11.880904     0.222639
   -4.089002    17.673674    36.733070     6.680188    17.680281
    7.351404    12.769181   -28.025761    26.561350    15.772016
  -53.675959    10.501975   -33.569589    21.157145    57.310697

C_blas:
   -6.968835    20.802622    -5.761290    -4.823612   -12.391784
    3.728964    -7.716708    23.316995    11.880904     0.222639
   -4.089002    17.673674    36.733070     6.680188    17.680281
    7.351404    12.769181   -28.025761    26.561350    15.772016
  -53.675959    10.501975   -33.569589    21.157145    57.310697
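
Since a view is an ordinary matrix, the comparison from above can also be restricted to the two subviews, for example (a sketch reusing vC and vC_blas from the previous cell):

std::cout << "Absolute difference (top-left 5x5 blocks only): " << bpt::abs_diff(device, vC, vC_blas) << std::endl;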