Deep Learning#

Because of the static multiple dispatch paradigm laid out in Multiple Dispatch, we first need to include the primitive operations for the device(s) we intend to use, so that the algorithms (and data structures) for deep learning that we include afterwards can use them.

[1]:
#include <backprop_tools/operations/cpu.h>
[2]:
#include <backprop_tools/nn/layers/dense/operations_cpu.h>

We set up the environment as described in Containers:

[3]:
namespace bpt = backprop_tools;
using DEVICE = bpt::devices::DefaultCPU;
using T = float;
using TI = typename DEVICE::index_t;
DEVICE device;
TI seed = 1;
auto rng = bpt::random::default_engine(DEVICE::SPEC::RANDOM(), seed);

As justified by our analysis of the reinforcement learning for continuous control landscape (in the paper), BackpropTools initially only supports fully connected neural networks. But we are planning to add more architectures (especially recurrent neural networks) in the future.

We can instantiate a simple layer by first defining its hyperparameters (which are compile-time constexpr values and types):

[4]:
constexpr TI INPUT_DIM = 5;
constexpr TI OUTPUT_DIM = 5;
constexpr auto ACTIVATION_FUNCTION = bpt::nn::activation_functions::RELU;
using PARAMETER_TYPE = bpt::nn::parameters::Plain;

We will explain the role of the PARAMETER_TYPE later on.

These hyperparameters and other options are combined into a specification type. This makes it easier to pass them around and means we do not have to spell out all hyperparameters and options as template parameters whenever a function takes the data structure as an argument:

[5]:
using LAYER_SPEC = bpt::nn::layers::dense::Specification<T, TI, INPUT_DIM, OUTPUT_DIM, ACTIVATION_FUNCTION, PARAMETER_TYPE>;

Using this specification we can declare an actual layer:

[6]:
bpt::nn::layers::dense::Layer<LAYER_SPEC> layer;

A fully connected neural network consists of layers, each implementing:

\[y = f(Wx + b)\]

where \(x\) is the input (external or from the previous layer), \(W\) and \(b\) are the weight matrix and biases respectively, and \(f\) is an element-wise non-linear function. Hence the data structure of a layer should contain at least \(W\) and \(b\). Because these parameters are containers, they need to be allocated:

[7]:
bpt::malloc(device, layer);

Now that the memory is allocated we need to initialize it (because it may contain arbitrary values). We use the standard Kaiming initialization scheme:

[8]:
bpt::init_kaiming(device, layer, rng);

We can print \(W\) and \(b\):

[9]:
bpt::print(device, layer.weights.parameters)
   -0.329563     0.228620    -0.036984     0.029308    -0.251371
    0.159981     0.160368     0.388801    -0.104199     0.017367
   -0.416291    -0.399396     0.026565     0.153081    -0.440328
   -0.387428    -0.073803     0.167055     0.079583     0.384994
    0.024086    -0.364958     0.137669    -0.075132     0.179950
[10]:
bpt::print(device, layer.biases.parameters)
   -0.447207    -0.405136     0.296024    -0.104276     0.309621

Now that the layer is initialized we can run inference using a random input. We first declare and allocate input and output matrices and then randomly initialize the input:

[11]:
constexpr TI BATCH_SIZE = 1;
bpt::MatrixDynamic<bpt::matrix::Specification<T, TI, BATCH_SIZE, INPUT_DIM>> input;
bpt::MatrixDynamic<bpt::matrix::Specification<T, TI, BATCH_SIZE, OUTPUT_DIM>> output;
bpt::malloc(device, input);
bpt::malloc(device, output);
bpt::randn(device, input, rng);
bpt::print(device, input);
    0.175199    -0.863064     1.316539     0.942564    -0.589718

Now we can evaluate the output of the layer:

[12]:
bpt::evaluate(device, layer, input, output);
bpt::print(device, output);
    0.000000     0.000000     1.006727     0.000000     0.633133
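
For intuition, the following sketch spells out what this evaluation computes element by element, mirroring \(y = f(Wx + b)\). It is not the library's implementation, and it assumes (for illustration only) that layer.weights.parameters is stored as an OUTPUT_DIM x INPUT_DIM matrix:

// Sketch for intuition only: y = f(Wx + b) for the RELU layer above.
// Assumption: the weight matrix is stored as OUTPUT_DIM x INPUT_DIM.
for(TI out_i = 0; out_i < OUTPUT_DIM; out_i++){
    T acc = bpt::get(layer.biases.parameters, 0, out_i);                                   // b
    for(TI in_i = 0; in_i < INPUT_DIM; in_i++){
        acc += bpt::get(layer.weights.parameters, out_i, in_i) * bpt::get(input, 0, in_i); // + Wx
    }
    acc = acc > 0 ? acc : 0;                                                               // f = RELU
    // acc should now match bpt::get(output, 0, out_i) from the cell above
}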

Now we revisit the PARAMETER_TYPE template argument. For inference, storing \(W\) and \(b\) is sufficient, but for training we at least also need to store the gradient of the loss \(L\) wrt. \(W\) and \(b\): \(\frac{\mathrm{d}L}{\mathrm{d}W}\) and \(\frac{\mathrm{d}L}{\mathrm{d}b}\). Because, depending on the optimizer type, we might need to store even more information per parameter (like the first and second-order moments in the case of Adam), we abstract the storage for the weights and biases using a PARAMETER_TYPE that can e.g. be Plain, Gradient, Adam or any other type extended by the user. For this illustration we use Gradient:

[13]:
using PARAMETER_TYPE_2 = bpt::nn::parameters::Gradient;
using LAYER_2_SPEC = bpt::nn::layers::dense::Specification<T, TI, INPUT_DIM, OUTPUT_DIM, ACTIVATION_FUNCTION, PARAMETER_TYPE_2>;
bpt::nn::layers::dense::LayerBackwardGradient<LAYER_2_SPEC> layer_2;
bpt::malloc(device, layer_2);
bpt::copy(device, device, layer_2, layer);
bpt::zero_gradient(device, layer_2);

Note that we now use the bpt::nn::layers::dense::LayerBackwardGradient data structure, which is supported by the functions implementing the backpropagation algorithm. Additionally, similar to PyTorch, we set the gradient to zero because it is accumulated over subsequent backward passes.

Now we can backpropagate the derivative of the loss wrt. the output to calculate the derivative of the loss wrt. the input. Hence the derivative of the loss wrt. the output, d_output, is actually an input to the bpt::backward operator. The operator also accumulates the derivative of the loss wrt. the weights and biases in the layer. We first allocate containers for d_input and d_output and randomly set d_output (standing in for a gradient arriving from subsequent layers):

[14]:
bpt::MatrixDynamic<bpt::matrix::Specification<T, TI, BATCH_SIZE, OUTPUT_DIM>> d_output;
bpt::MatrixDynamic<bpt::matrix::Specification<T, TI, BATCH_SIZE, INPUT_DIM>> d_input;
bpt::malloc(device, d_input);
bpt::malloc(device, d_output);
bpt::randn(device, d_output, rng);
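
Before executing it, it is worth recalling what backpropagation through a single dense layer computes. Writing the pre-activation as \(z = Wx + b\) and denoting the element-wise product by \(\odot\), for a single input \(x\):

\[\frac{\mathrm{d}L}{\mathrm{d}z} = \frac{\mathrm{d}L}{\mathrm{d}y} \odot f'(z), \qquad \frac{\mathrm{d}L}{\mathrm{d}W} = \frac{\mathrm{d}L}{\mathrm{d}z} x^T, \qquad \frac{\mathrm{d}L}{\mathrm{d}b} = \frac{\mathrm{d}L}{\mathrm{d}z}, \qquad \frac{\mathrm{d}L}{\mathrm{d}x} = W^T \frac{\mathrm{d}L}{\mathrm{d}z}\]

For the RELU used here, \(f'(z)\) is \(1\) where the pre-activation is positive and \(0\) otherwise.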

Now we execute the backpropagation and display the gradient of the loss wrt. the input:

[15]:
bpt::backward(device, layer_2, input, d_output, d_input);
bpt::print(device, d_input);
   -0.383490     0.504066    -0.357525     0.152552    -0.412863

This also accumulates the gradient in the weights and biases:

[16]:
bpt::print(device, layer_2.weights.gradient);
    0.150751    -0.742629     1.132824     0.811034    -0.507426
   -0.081795     0.402941    -0.614657    -0.440058     0.275324
    0.000000     0.000000     0.000000     0.000000     0.000000
    0.000000     0.000000     0.000000     0.000000     0.000000
   -0.183485     0.903885    -1.378809    -0.987145     0.617611
[17]:
bpt::print(device, layer_2.biases.gradient);
    0.860456    -0.466873     0.000000     0.000000    -1.047298
[18]:
// bpt::free(device, layer);
// bpt::free(device, layer_2);
// bpt::free(device, input);
// bpt::free(device, output);
// bpt::free(device, d_input);
// bpt::free(device, d_output);

Until now we showed the behavior of a single, fully connected layer. BackpropTools contains a Multilayer Perceptron (MLP) that conveniently integrates an arbitrary number of layers into a single data structure with algorithms to perform forward passes and backpropagation across the whole model. The MLP is located in the namespace backprop_tools::nn_models, hence we include its CPU operations:

[19]:
#include <backprop_tools/nn_models/operations_cpu.h>

Next we define the hyperparameters:

[20]:
constexpr TI INPUT_DIM_MLP = 5;
constexpr TI OUTPUT_DIM_MLP = 1;
constexpr TI NUM_LAYERS = 3;
constexpr TI HIDDEN_DIM = 10;
constexpr auto ACTIVATION_FUNCTION_MLP = bpt::nn::activation_functions::RELU;
constexpr auto OUTPUT_ACTIVATION_FUNCTION_MLP = bpt::nn::activation_functions::IDENTITY;

Note that the MLP supports architectures of arbitrary depth, but all hidden layers have to have the same dimensionality. This is because they are stored in an array and hence all need to have the same type. If we allowed different hidden dimensions, we would have to give up on arbitrary depth.
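
If different hidden dimensionalities are needed, one workaround is to compose single dense layers by hand, as in the first part of this tutorial. The following is a minimal sketch (the dimensions are chosen arbitrarily for illustration; device and rng are reused from above):

// Sketch: manually chaining two dense layers with different hidden dimensionalities.
using CHAIN_LAYER_1_SPEC = bpt::nn::layers::dense::Specification<T, TI, 5, 16, bpt::nn::activation_functions::RELU, bpt::nn::parameters::Plain>;
using CHAIN_LAYER_2_SPEC = bpt::nn::layers::dense::Specification<T, TI, 16, 2, bpt::nn::activation_functions::IDENTITY, bpt::nn::parameters::Plain>;
bpt::nn::layers::dense::Layer<CHAIN_LAYER_1_SPEC> chain_layer_1;
bpt::nn::layers::dense::Layer<CHAIN_LAYER_2_SPEC> chain_layer_2;
bpt::MatrixDynamic<bpt::matrix::Specification<T, TI, 1, 5>> chain_input;
bpt::MatrixDynamic<bpt::matrix::Specification<T, TI, 1, 16>> chain_hidden;
bpt::MatrixDynamic<bpt::matrix::Specification<T, TI, 1, 2>> chain_output;
bpt::malloc(device, chain_layer_1);
bpt::malloc(device, chain_layer_2);
bpt::malloc(device, chain_input);
bpt::malloc(device, chain_hidden);
bpt::malloc(device, chain_output);
bpt::init_kaiming(device, chain_layer_1, rng);
bpt::init_kaiming(device, chain_layer_2, rng);
bpt::randn(device, chain_input, rng);
bpt::evaluate(device, chain_layer_1, chain_input, chain_hidden);  // first layer: 5 -> 16
bpt::evaluate(device, chain_layer_2, chain_hidden, chain_output); // second layer: 16 -> 2

The MLP below removes this boilerplate for the common case of equal hidden dimensions.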

We aggregate the hyperparameters into a specification again (first just for the structure, later for the full network, incorporating the structure):

[21]:
using STRUCTURE_SPEC = bpt::nn_models::mlp::StructureSpecification<T, DEVICE::index_t, INPUT_DIM_MLP, OUTPUT_DIM_MLP, NUM_LAYERS, HIDDEN_DIM, ACTIVATION_FUNCTION_MLP, OUTPUT_ACTIVATION_FUNCTION_MLP, BATCH_SIZE>;

We use the default Adam parameters (taken from TensorFlow) and set up the optimizer type using these parameters. Moreover, we create a full network specification for a network that can be trained with Adam, which takes the structure specification as an input. Finally, we define the full network type:

[22]:
using OPTIMIZER_PARAMETERS = bpt::nn::optimizers::adam::DefaultParametersTF<T>;
using OPTIMIZER = bpt::nn::optimizers::Adam<OPTIMIZER_PARAMETERS>;
using MODEL_SPEC = bpt::nn_models::mlp::AdamSpecification<STRUCTURE_SPEC>;
using MODEL_TYPE = bpt::nn_models::mlp::NeuralNetworkAdam<MODEL_SPEC>;

Using these type definitions we can now declare the optimizer and the model. All the optimizer state is contained in the PARAMETER_TYPE of the model (plus an additional age integer in the model in the case of Adam). In contrast to PyTorch, which stores the optimizer state in the optimizer, we prefer to store the first and second-order moments next to the parameters, just like the gradient is stored next to the parameters anyway (in PyTorch as well). Hence the optimizer is stateless in this case (it does not need to be for user-defined optimizers) and we only need to allocate the model.
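
For reference, the standard Adam update that bpt::update applies later in this tutorial maintains, for every parameter \(\theta\) with gradient \(g_t\), exponentially decaying estimates of the first and second moment and uses them to update the parameter:

\[m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2\]

\[\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\]

Here \(m\) and \(v\) are the per-parameter state held by the Adam PARAMETER_TYPE, the step count \(t\) corresponds to the age integer mentioned above, and \(\alpha\), \(\beta_1\), \(\beta_2\) and \(\epsilon\) are given by the OPTIMIZER_PARAMETERS.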

The backpropagation algorithm needs to store the intermediate gradients. To save memory we do not add a d_input or d_output to each layer but rather use a double buffer with the maximum size of the hidden representation needed.

[23]:
OPTIMIZER optimizer;
MODEL_TYPE model;
typename MODEL_TYPE::Buffers<BATCH_SIZE> buffers;

We allocate the model and initialize its weights randomly, as in the case of the single layer. We again zero the gradient of all parameters of all layers, as well as resetting the optimizer state of all parameters of all layers (e.g. in the case of Adam the first and second-order moments are set to zero). Finally, we also allocate the buffers:

[24]:
bpt::malloc(device, model);
bpt::init_weights(device, model, rng); // recursively initializes all layers using kaiming initialization
bpt::zero_gradient(device, model); // recursively zeros all gradients in the layers
bpt::reset_optimizer_state(device, model, optimizer);
bpt::malloc(device, buffers);

In this example we showcase an MLP with a five-dimensional input and a one-dimensional output (remember that OUTPUT_ACTIVATION_FUNCTION_MLP is IDENTITY, so it can also output negative values). For these new shapes we declare and allocate the input and output containers:

[25]:
bpt::MatrixDynamic<bpt::matrix::Specification<T, TI, BATCH_SIZE, INPUT_DIM_MLP>> input_mlp, d_input_mlp;
bpt::MatrixDynamic<bpt::matrix::Specification<T, TI, BATCH_SIZE, OUTPUT_DIM_MLP>> d_output_mlp;
bpt::malloc(device, input_mlp);
bpt::malloc(device, d_input_mlp);
bpt::malloc(device, d_output_mlp);

Now, as in the case of the single layer, we can run a forward pass using the input. Because the model is an Adam model (which is a subclass of bpt::nn_models::mlp::NeuralNetworkBackwardGradient), it stores the intermediate (and final) outputs.

[26]:
bpt::randn(device, input_mlp, rng);
bpt::forward(device, model, input_mlp);
T output_value = get(model.output_layer.output, 0, 0);
output_value
[26]:
0.128710f

Now imagine we want the output of the model (for this input) to be \(1\). We calculate the error and feed it back through the model using backpropagation. d_output_mlp should contain the derivative of the loss function, which points in the direction of the output that would increase the loss. Our error is the opposite: if we moved the output in the direction of the error, we would get closer to the target value and hence decrease the loss. Because of this, we feed back -error. This procedure corresponds to using a squared loss, because -error is (up to a constant factor) the derivative of the squared loss wrt. the output.
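
As a small worked step, with a squared-error loss the quantity we feed back is:

\[L = \frac{1}{2}\left(y_{\text{target}} - y\right)^2, \qquad \frac{\mathrm{d}L}{\mathrm{d}y} = -\left(y_{\text{target}} - y\right) = -\text{error}\]

which is exactly the value written into d_output_mlp below.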

[27]:
T target_output_value = 1;
T error = target_output_value - output_value;
bpt::set(d_output_mlp, 0, 0, -error);
bpt::backward(device, model, input_mlp, d_output_mlp, d_input_mlp, buffers);

The backward pass populates the gradient of all parameters of the model. Using this gradient we can apply the bpt::update operator, which updates the first and second-order moment estimates of the gradient of all parameters and then applies the Adam update rule (see the equations above) to update the parameters:

[28]:
bpt::update(device, model, optimizer);

Now the next forward pass should be closer to the target value:

[29]:
bpt::forward(device, model, input_mlp);
get(model.output_layer.output, 0, 0)
[29]:
0.139336f

Next we will train the network to actually approximate a function (not just to output a constant value as before). With the following training loop we train it to behave like the bpt::max operator, which outputs the maximum of the five inputs. We run the forward and backward pass for \(32\) iterations while accumulating the gradient, which effectively amounts to a batch size of \(32\):

[30]:
for(TI i=0; i < 10000; i++){
    bpt::zero_gradient(device, model);
    T mse = 0;
    for(TI batch_i=0; batch_i < 32; batch_i++){
        bpt::randn(device, input_mlp, rng);
        bpt::forward(device, model, input_mlp);
        T output_value = get(model.output_layer.output, 0, 0);
        T target_output_value = bpt::max(device, input_mlp);
        T error = target_output_value - output_value;
        bpt::set(d_output_mlp, 0, 0, -error);
        bpt::backward(device, model, input_mlp, d_output_mlp, d_input_mlp, buffers);
        mse += error * error;
    }
    bpt::update(device, model, optimizer);
    if(i % 1000 == 0){
        std::cout << "Squared error: " << mse/32 << std::endl;
    }
}
Squared error: 1.693208
Squared error: 0.081613
Squared error: 0.018353
Squared error: 0.004015
Squared error: 0.014354
Squared error: 0.017132
Squared error: 0.003833
Squared error: 0.014308
Squared error: 0.002764
Squared error: 0.003838

Now we can test the model using some arbitrary input (which should lie within the distribution of input values), and the model should output a value close to the maximum of the five input values:

[31]:
set(input_mlp, 0, 0, +0.0);
set(input_mlp, 0, 1, -0.1);
set(input_mlp, 0, 2, +0.5);
set(input_mlp, 0, 3, -0.4);
set(input_mlp, 0, 4, +0.1);

bpt::forward(device, model, input_mlp);
bpt::get(model.output_layer.output, 0, 0)
[31]:
0.458271f

We can also automatically test it with \(10\) random inputs:

[32]:
for(TI i=0; i < 10; i++){
    bpt::randn(device, input_mlp, rng);
    bpt::forward(device, model, input_mlp);
    std::cout << "max: " << bpt::max(device, input_mlp) << " output: " << bpt::get(model.output_layer.output, 0, 0) << std::endl;
}
max: 0.439409 output: 0.358949
max: 1.676039 output: 1.596896
max: 0.937396 output: 0.907515
max: 0.384358 output: 0.364056
max: 0.184520 output: 0.141506
max: -0.185583 output: -0.205863
max: 1.878006 output: 1.819072
max: 0.818198 output: 0.815126
max: 0.072115 output: 0.435097
max: 1.914209 output: 1.873126

If the values are not close, the model might need some more training iterations.