# CPU Acceleration#

The generic implementation of matrix multiplication using a triple nested loop is reasonably fast due to the design of **BackpropTools** allowing the compiler to heavily optimizer the code (especially for smaller matrices). For larger matrices it is beneficial to use the CPUs SIMD instructions (like e.g. AVX). This can be done through BLAS libraries like
Intel MKL or OpenBLAS. We found Intel MKL to be the fastest, but it does not work reliably in Cling (the C/C++ interpreter as a backend of the Jupyter notebook used for this tutorial). Hence we use OpenBLAS in this and the following examples. **BackpropTools** has a multiplexing header `cpu_mux.h`

that automatically includes the CPU operations dispatching to the available backend. The available backend is selected by defining e.g. `BACKPROP_TOOLS_BACKEND_ENABLE_OPENBLAS`

or
`BACKPROP_TOOLS_BACKEND_ENABLE_MKL`

(these options can also be used during the configuration phase of CMake when using **BackpropTools** natively by passing them on the CLI using `-D`

).

```
[1]:
```

```
#define BACKPROP_TOOLS_BACKEND_ENABLE_OPENBLAS
#include <backprop_tools/operations/cpu_mux.h>
namespace bpt = backprop_tools;
```

We also include `iostream`

to print the computation times later on

```
[2]:
```

```
#include <iostream>
```

OpenBLAS contains pre-compiled matrix multiplication kernels and hence needs to be linked (or loaded in case of the Cling interpreter):

```
[3]:
```

```
#pragma cling load("openblas")
```

We define two devices to compare the computation time later on. All `CPU_*`

devices use the main memory to store containers and allocated matrices are hence compatible between them. For maximum performance it is recommended to allocate the matrices with the device that they are used on afterwards because e.g. Intel MKL allows to allocate chunks of memory at certain boundaries which allow faster loading using SIMD instructions. This is integrated into the particular `bpt::malloc`

specialization which is dispatched to by simply calling with the desired device.

Because we don’t know the outcome of the (or rather we don’t want to hard-code it) the `cpu_mux.h`

returns a `DEVICE_FACTORY`

template that generates the found `CPU_*`

device when passing a `CPU`

specification. In this case (since we know OpenBLAS is available and that `BACKPROP_TOOLS_BACKEND_ENABLE_OPENBLAS`

is defined) this is equal to `using DEVICE_BLAS = bpt::devices::CPU_OPENBLAS<bpt::devices::DefaultCPUSpecification>`

.

You can play around with commenting `#define BACKPROP_TOOLS_BACKEND_ENABLE_OPENBLAS`

out in the first cell and re-running the notebook. In that case, you can see that it will use a normal `CPU`

for `DEVICE_BLAS`

and result in equally slow computation times for both devices in the benchmark lateron.

```
[4]:
```

```
using DEVICE = bpt::devices::DefaultCPU;
using DEVICE_BLAS = bpt::DEVICE_FACTORY<bpt::devices::DefaultCPUSpecification>;
using T = double;
using TI = typename DEVICE::index_t;
```

We allocate \(A\), \(B\), \(C\) and \(C_{blas}\) matrices to evaluate the computation:

```
[5]:
```

```
DEVICE device;
DEVICE_BLAS device_blas;
constexpr TI SIZE = 500;
bpt::MatrixDynamic<bpt::matrix::Specification<T, TI, SIZE, SIZE>> A, B, C, C_blas;
auto rng = bpt::random::default_engine(typename DEVICE::SPEC::RANDOM(), 1);
```

We allocate all the matrices and fill \(A\) and \(B\) with random numbers:

```
[6]:
```

```
bpt::malloc(device_blas, A);
bpt::malloc(device_blas, B);
bpt::malloc(device_blas, C);
bpt::malloc(device_blas, C_blas);
bpt::randn(device, A, rng);
bpt::randn(device, B, rng);
```

Now we can benchmark the matrix multiplication using the generic implementation:

```
[7]:
```

```
{
auto start = std::chrono::high_resolution_clock::now();
bpt::multiply(device, A, B, C);
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration<double>(end - start);
std::cout << "Time Cling (C/C++ interpreter): " << duration.count() << " seconds" << std::endl;
}
```

```
Time Cling (C/C++ interpreter): 2.21825 seconds
```

Equivalently we can run the same computation using OpenBLAS and get the result in much less time:

```
[8]:
```

```
{
auto start = std::chrono::high_resolution_clock::now();
bpt::multiply(device_blas, A, B, C_blas);
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration<double>(end - start);
std::cout << "Time Cling (C/C++ interpreter): " << duration.count() << " seconds" << std::endl;
}
```

```
Time Cling (C/C++ interpreter): 0.0061241 seconds
```

Now we check the resulting matrices against each other:

```
[9]:
```

```
std::cout << "Absolute difference between the resulting matrices: " << bpt::abs_diff(device, C, C_blas) << std::endl;
```

```
Absolute difference between the resulting matrices: 2.9357e-09
```

Depending on the machine, compilers and library (versions) used, we might find that they are exactly equal but this is not necessarily the case. By changing `T`

to `float`

the deviations should be bigger but also for `double`

this could happen because of floating point inaccuracies which entail that the same mathematical expression does not necessarily lead to the same result if you reorder the (finite precision) computations.

Another sanity check is printing the matrices, which is infeasible for their full size, hence we only print a subview of size \(5\). This can be done using the `bpt::view`

operator which yields a submatrix that is a view of the original matrix. The view is an ordinary `bpt::Matrix`

and hence can be used in all operations that take matrices as input. Since it is a view it is cheap (zero-copy) because it only carries a single pointer at runtime as well as compile-time information about the
shape (dimensions and pitch).

```
[10]:
```

```
constexpr TI VIEW_SIZE = 5;
using VIEW = bpt::matrix::ViewSpec<VIEW_SIZE, VIEW_SIZE>;
auto vC = bpt::view(device, C, VIEW{}, 0, 0);
auto vC_blas = bpt::view(device, C_blas, VIEW{}, 0, 0);
std::cout << "Result comparison (top-left " << VIEW_SIZE << "x" << VIEW_SIZE << " for brevity)" << std::endl;
std::cout << std::endl << "C: " << std::endl;
bpt::print(device, vC);
std::cout << std::endl << "C_blas: " << std::endl;
bpt::print(device, vC_blas);
```

```
Result comparison (top-left 5x5 for brevity)
C:
-6.968835 20.802622 -5.761290 -4.823612 -12.391784
3.728964 -7.716708 23.316995 11.880904 0.222639
-4.089002 17.673674 36.733070 6.680188 17.680281
7.351404 12.769181 -28.025761 26.561350 15.772016
-53.675959 10.501975 -33.569589 21.157145 57.310697
C_blas:
-6.968835 20.802622 -5.761290 -4.823612 -12.391784
3.728964 -7.716708 23.316995 11.880904 0.222639
-4.089002 17.673674 36.733070 6.680188 17.680281
7.351404 12.769181 -28.025761 26.561350 15.772016
-53.675959 10.501975 -33.569589 21.157145 57.310697
```