Deep Reinforcement Learning


In this chapter we combine the previously demonstrated deep learning capabilities of BackpropTools with an (inverted) pendulum simulator, equivalent to Pendulum-v1 in gym/gymnasium, to train a swing-up control policy. For the training we use TD3, an off-policy deep RL algorithm. TD3 and the supporting data structures and algorithms it requires are integrated in BackpropTools.


Note that the training time in the animation refers to bare-metal training, not to training with the Cling interpreter that these notebooks use. As the training output later on shows, Cling is much slower than optimized, bare-metal code, even when dispatching to a BLAS library. See the repository for more information on how to run the training directly on your hardware. You can also try the WASM-based training in your browser at

First, as before, we include the necessary primitive operations (dispatching matrix multiplications to OpenBLAS). We also use the neural network operations (dense layer forward and backward pass) that take advantage of OpenBLAS through the nn/operations_cpu_mux.h multiplexer. The accelerated forward and backward passes are used automatically if the higher-level operations (forward/backward pass on the full model) are called with the OpenBLAS device (coming from the DEVICE_FACTORY). To make the accelerated routines available to the higher-level functions, nn_models/operations_cpu.h has to be included after nn/operations_cpu_mux.h.

The pendulum environment is implemented in pure C++ without dependencies. Hence it contains only generic operations and can be included through the collective header rl/environments/operations_generic.h, which includes the generic functions of all available environments.

For TD3 and all its related data structures and algorithms we just need to include rl/operations_generic.h, because all of these operations are higher-level and dispatch to the lower-level primitives imported beforehand. The RL operations call functions to interact with the environment and perform forward and backward passes on the neural network model, which in turn call the dense layer operations.

We also include the Xeus UI for the pendulum (to be rendered in the notebook when it is run live). Furthermore, we include rl/utils/evaluation.h so that we can easily execute deterministic rollouts (without exploration noise) and get average rewards in fixed intervals to monitor the training progress. The evaluation function can also take the UI as an input and render a live animation of the pendulum.

#include <chrono>
#include <iostream>
#include <backprop_tools/operations/cpu_mux.h>
#include <backprop_tools/nn/operations_cpu_mux.h>
#include <backprop_tools/rl/environments/operations_generic.h>
#include <backprop_tools/nn_models/operations_cpu.h>
#include <backprop_tools/rl/operations_generic.h>
#include <backprop_tools/rl/environments/pendulum/ui_xeus.h>
#include <backprop_tools/rl/utils/evaluation.h>
namespace bpt = backprop_tools;
#pragma cling load("openblas")
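The deterministic evaluation mentioned above amounts to running a fixed number of episodes without exploration noise and summarizing the per-episode returns. The following is a minimal, library-independent sketch of that summary step; `Summary` and `summarize_returns` are hypothetical names and not part of BackpropTools.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary of evaluation returns (not the BackpropTools result type).
struct Summary {
    double mean;
    double std;
};

// Computes the mean and the population standard deviation of per-episode returns.
inline Summary summarize_returns(const std::vector<double>& returns) {
    assert(!returns.empty());
    double sum = 0;
    for (double r : returns) sum += r;
    const double mean = sum / returns.size();
    double squared_deviation = 0;
    for (double r : returns) squared_deviation += (r - mean) * (r - mean);
    return {mean, std::sqrt(squared_deviation / returns.size())};
}
```

Averaging over several episodes matters here because the pendulum starts from a random initial state, so a single episode return is a noisy estimate of the policy quality.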

As before, we set up the major types. float is usually much faster than double while still being sufficient for deep learning and reinforcement learning. You can try switching to double and re-running the notebook to see the difference in training time.

using T = float;
using DEVICE = bpt::DEVICE_FACTORY<bpt::devices::DefaultCPUSpecification>;
using TI = typename DEVICE::index_t;

Next, we define the ENVIRONMENT type, which acts as a compile-time interface between simulations and RL algorithms. In BackpropTools, environments share a common interface that is similar to the gym/gymnasium interface but, for example, has the observation and state dimensionality as compile-time constants so that the compiler can maximally optimize each part of the code. The RL algorithms and the following training procedure are agnostic to the type of environment used, as long as it exposes the required interface.

using ENVIRONMENT_PARAMETERS = bpt::rl::environments::pendulum::DefaultParameters<T>;
using ENVIRONMENT_SPEC = bpt::rl::environments::pendulum::Specification<T, TI, ENVIRONMENT_PARAMETERS>;
using ENVIRONMENT = bpt::rl::environments::Pendulum<ENVIRONMENT_SPEC>;
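To illustrate the idea of a compile-time environment interface, here is a small, self-contained sketch. It does not use the BackpropTools API: `ToyEnvironment`, `observe`, `step` and `rollout` are hypothetical names chosen for this example. The point is only that dimensions exposed as `static constexpr` members let generic code use fixed-size containers.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical toy environment: a 1D point that should be driven to the origin.
// It is NOT the BackpropTools pendulum; it only mimics the idea of exposing
// dimensions as compile-time constants.
struct ToyEnvironment {
    static constexpr std::size_t OBSERVATION_DIM = 1;
    static constexpr std::size_t ACTION_DIM = 1;
    using Observation = std::array<float, OBSERVATION_DIM>;
    using Action = std::array<float, ACTION_DIM>;

    float position = 1.0f;

    Observation observe() const { return {position}; }
    // Applies the action and returns the reward (negative squared distance).
    float step(const Action& action) {
        position += action[0];
        return -position * position;
    }
};

// Generic rollout: works for any type exposing the same members.
// All array sizes are known at compile time, so nothing is allocated dynamically.
template <typename ENV, typename POLICY>
float rollout(ENV& env, POLICY policy, std::size_t steps) {
    float episode_return = 0;
    for (std::size_t i = 0; i < steps; i++) {
        typename ENV::Observation observation = env.observe();
        typename ENV::Action action = policy(observation);
        episode_return += env.step(action);
    }
    return episode_return;
}
```

Because `rollout` only relies on the member names, swapping in another environment type requires no change to the algorithm code, which mirrors the agnosticism described above.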

Next we define some hyperparameters to train the pendulum swing-up. Note the very low STEP_LIMIT, which reflects that TD3 is relatively sample efficient (e.g. in comparison to PPO):

struct TD3_PENDULUM_PARAMETERS: bpt::rl::algorithms::td3::DefaultParameters<T, TI>{
    constexpr static typename DEVICE::index_t CRITIC_BATCH_SIZE = 100;
    constexpr static typename DEVICE::index_t ACTOR_BATCH_SIZE = 100;
};
constexpr TI STEP_LIMIT = 10000;
constexpr TI EPISODE_STEP_LIMIT = 200;
constexpr TI ACTOR_NUM_LAYERS = 3;
constexpr TI ACTOR_HIDDEN_DIM = 64;
constexpr TI CRITIC_NUM_LAYERS = 3;
constexpr TI CRITIC_HIDDEN_DIM = 64;
constexpr auto ACTOR_ACTIVATION_FUNCTION = bpt::nn::activation_functions::RELU;
constexpr auto CRITIC_ACTIVATION_FUNCTION = bpt::nn::activation_functions::RELU;
constexpr auto ACTOR_ACTIVATION_FUNCTION_OUTPUT = bpt::nn::activation_functions::TANH;
constexpr auto CRITIC_ACTIVATION_FUNCTION_OUTPUT = bpt::nn::activation_functions::IDENTITY;

In the following, these hyperparameters are used to set up the actor and critic types and combine them into an actor-critic type that is used in the TD3 implementation. Furthermore, we define an off-policy runner type that contains a replay buffer and interacts with the environment. Initially, we hid this complexity in the actor-critic structure, but we found that exposing it is beneficial because the user has more agency and can swap out parts more easily. For example, the actor and critic network types can be any type for which bpt::forward and bpt::backward operations exist (these functions should be included before the RL operations, as described above).

using OPTIMIZER_PARAMETERS = typename bpt::nn::optimizers::adam::DefaultParametersTorch<T>;

using OPTIMIZER = bpt::nn::optimizers::Adam<OPTIMIZER_PARAMETERS>;
using ACTOR_NETWORK_SPEC = bpt::nn_models::mlp::AdamSpecification<ACTOR_STRUCTURE_SPEC>;
using ACTOR_NETWORK_TYPE = bpt::nn_models::mlp::NeuralNetworkAdam<ACTOR_NETWORK_SPEC>;

using ACTOR_TARGET_NETWORK_SPEC = bpt::nn_models::mlp::InferenceSpecification<ACTOR_STRUCTURE_SPEC>;
using ACTOR_TARGET_NETWORK_TYPE = bpt::nn_models::mlp::NeuralNetwork<ACTOR_TARGET_NETWORK_SPEC>;

using CRITIC_NETWORK_SPEC = bpt::nn_models::mlp::AdamSpecification<CRITIC_STRUCTURE_SPEC>;
using CRITIC_NETWORK_TYPE = bpt::nn_models::mlp::NeuralNetworkAdam<CRITIC_NETWORK_SPEC>;

using CRITIC_TARGET_NETWORK_SPEC = bpt::nn_models::mlp::InferenceSpecification<CRITIC_STRUCTURE_SPEC>;
using CRITIC_TARGET_NETWORK_TYPE = bpt::nn_models::mlp::NeuralNetwork<CRITIC_TARGET_NETWORK_SPEC>;

using ACTOR_CRITIC_TYPE = bpt::rl::algorithms::td3::ActorCritic<TD3_SPEC>;

using OFF_POLICY_RUNNER_SPEC = bpt::rl::components::off_policy_runner::Specification<
using OFF_POLICY_RUNNER_TYPE = bpt::rl::components::OffPolicyRunner<OFF_POLICY_RUNNER_SPEC>;

In this tutorial we assume the actor and critic batch sizes are equal:
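A compile-time check is one way to encode such an assumption so that a later change of one batch size cannot silently break it. The sketch below uses a hypothetical stand-in struct, `ExampleParameters`, instead of the actual TD3 parameter types.

```cpp
#include <cstddef>

// Hypothetical stand-in for the parameter struct defined earlier;
// only the two batch sizes matter for this check.
struct ExampleParameters {
    static constexpr std::size_t CRITIC_BATCH_SIZE = 100;
    static constexpr std::size_t ACTOR_BATCH_SIZE = 100;
};

// Compilation fails with this message if the two batch sizes ever diverge.
static_assert(ExampleParameters::ACTOR_BATCH_SIZE == ExampleParameters::CRITIC_BATCH_SIZE,
              "actor and critic batch sizes are assumed to be equal");
```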


Next we instantiate the elementary data structures:

DEVICE device;
OPTIMIZER optimizer;
auto rng = bpt::random::default_engine(typename DEVICE::SPEC::RANDOM{}, 1);
bool ui = false; // this is used later to signal the bpt::evaluate to not use a UI

Next we declare and initialize the actor-critic structure (containing the actors and critics). bpt::init recursively initializes all submodules (e.g. the MLP using the Kaiming initialization):

ACTOR_CRITIC_TYPE actor_critic;
bpt::malloc(device, actor_critic);
bpt::init(device, actor_critic, optimizer, rng);
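For readers unfamiliar with Kaiming initialization, the following sketch shows its normal-distribution variant, where weights are drawn with standard deviation sqrt(2 / fan_in). Which variant bpt::init uses is not stated here, so treat the formula as an assumption; `kaiming_normal` is a hypothetical helper, not a BackpropTools function.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Fills a (fan_out x fan_in) weight matrix, stored row-major, with samples
// from a normal distribution with mean 0 and variance 2 / fan_in.
// This is the "normal" Kaiming variant; the library may use a different one.
inline std::vector<float> kaiming_normal(std::size_t fan_in, std::size_t fan_out, std::mt19937& rng) {
    assert(fan_in > 0);
    std::normal_distribution<float> dist(0.0f, std::sqrt(2.0f / static_cast<float>(fan_in)));
    std::vector<float> weights(fan_in * fan_out);
    for (float& w : weights) w = dist(rng);
    return weights;
}
```

Scaling the spread with the number of inputs keeps the magnitude of the activations roughly constant from layer to layer, which is why it pairs well with the RELU hidden layers configured above.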

Furthermore the off-policy runner is instantiated and initialized with a single environment. Note that the off-policy runner contains the replay buffer which is allocated recursively with the bpt::malloc call.

OFF_POLICY_RUNNER_TYPE off_policy_runner;
bpt::malloc(device, off_policy_runner);
ENVIRONMENT envs[decltype(off_policy_runner)::N_ENVIRONMENTS];
bpt::init(device, off_policy_runner, envs);
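A replay buffer is commonly implemented as a fixed-capacity ring buffer that overwrites the oldest entries once it is full. The sketch below illustrates that idea only; `RingReplayBuffer` is a hypothetical type and makes no claim about how the BackpropTools replay buffer is laid out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed-capacity ring buffer. Storage is sized at compile time
// and reused; once full, each new item overwrites the oldest one.
template <typename T, std::size_t CAPACITY>
struct RingReplayBuffer {
    static_assert(CAPACITY > 0, "capacity must be positive");
    std::array<T, CAPACITY> data{};
    std::size_t position = 0; // index of the next write
    std::size_t size = 0;     // number of valid entries

    void add(const T& item) {
        data[position] = item;
        position = (position + 1) % CAPACITY;
        if (size < CAPACITY) size++;
    }
};
```

With a capacity equal to the total number of training steps, as is plausible for a run this short, no transition would ever be overwritten.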

We like to avoid memory allocations during the training, hence we pre-allocate batch containers for the actor and critic as well as two buffers for each. The *_training_buffers contain pre-allocated containers used during the training step in the TD3 algorithm. The *_buffers are used to hold intermediate results during the forward and backward pass of the MLP.

bpt::rl::algorithms::td3::CriticTrainingBuffers<ACTOR_CRITIC_TYPE::SPEC> critic_training_buffers;
bpt::malloc(device, critic_batch);
bpt::malloc(device, critic_training_buffers);
bpt::malloc(device, critic_buffers);

bpt::rl::algorithms::td3::ActorTrainingBuffers<ACTOR_CRITIC_TYPE::SPEC> actor_training_buffers;
bpt::malloc(device, actor_batch);
bpt::malloc(device, actor_training_buffers);
bpt::malloc(device, actor_buffers_eval);
bpt::malloc(device, actor_buffers);
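The pre-allocation pattern itself is independent of the library. As a minimal sketch with hypothetical names (`ScratchBuffer`, `square_into`), a buffer is allocated once before the loop and every iteration writes into the same memory:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scratch buffer that is allocated once, before the training loop.
struct ScratchBuffer {
    std::vector<float> values;
    explicit ScratchBuffer(std::size_t n) : values(n) {}
};

// Writes the element-wise square of `input` into the pre-allocated buffer.
// No allocation happens here; the sizes are required to match.
inline void square_into(const std::vector<float>& input, ScratchBuffer& out) {
    assert(input.size() == out.values.size());
    for (std::size_t i = 0; i < input.size(); i++) {
        out.values[i] = input[i] * input[i];
    }
}
```

Calling `square_into` repeatedly leaves the address of the underlying storage unchanged, which is the property that keeps the hot loop free of allocator calls.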

For the pendulum training we don’t use observation normalization but the bpt::evaluate function expects mean and standard deviation so we initialize them to an isotropic, standard normal distribution:

bpt::MatrixDynamic<bpt::matrix::Specification<T, TI, 1, ENVIRONMENT::OBSERVATION_DIM>> observations_mean;
bpt::MatrixDynamic<bpt::matrix::Specification<T, TI, 1, ENVIRONMENT::OBSERVATION_DIM>> observations_std;
bpt::malloc(device, observations_mean);
bpt::malloc(device, observations_std);
bpt::set_all(device, observations_mean, 0);
bpt::set_all(device, observations_std, 1);
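Assuming the usual normalization (x - mean) / std, a mean of 0 and a standard deviation of 1 leave every observation unchanged. The following sketch demonstrates this with a hypothetical `normalize` helper that is not part of BackpropTools:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: normalizes each observation component as (x - mean) / std.
template <std::size_t DIM>
std::array<float, DIM> normalize(const std::array<float, DIM>& observation,
                                 const std::array<float, DIM>& mean,
                                 const std::array<float, DIM>& std_dev) {
    std::array<float, DIM> result{};
    for (std::size_t i = 0; i < DIM; i++) {
        result[i] = (observation[i] - mean[i]) / std_dev[i];
    }
    return result;
}
```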

Now we can finally train the pendulum swing-up. We iterate over STEP_LIMIT steps. Every 1000 steps we evaluate the average return of the current policy (using deterministic rollouts without exploration noise). On every iteration we call bpt::step, which uses the off-policy runner to execute one step with the current policy and saves it in its internal replay buffer. After some warmup steps we start training the actor and critic models. To train a critic, we first sample the target action noise (so that the training step itself is deterministic), then sample a batch from the replay buffer and train the critic on it. This is done for each critic individually. On every other step we use the current critic to train the actor on another batch sampled from the replay buffer. We also update the target critics and the target actor on every other step. For more details on the TD3 training procedure you can look into the called functions and refer to the TD3 paper.

auto start_time = std::chrono::high_resolution_clock::now();
for(int step_i = 0; step_i < STEP_LIMIT; step_i+=OFF_POLICY_RUNNER_SPEC::N_ENVIRONMENTS){
    // Taking the training time and evaluating the agent
    if(step_i % 1000 == 0 || step_i == STEP_LIMIT - 1){
        auto current_time = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> elapsed_seconds = current_time - start_time;
        // The policy argument is assumed to be actor_critic.actor (rolled out without exploration noise)
        auto result = bpt::evaluate(device, envs[0], ui, actor_critic.actor, bpt::rl::utils::evaluation::Specification<10, EPISODE_STEP_LIMIT>(), observations_mean, observations_std, rng);
        std::cout << "Step: " << step_i << "/" << (STEP_LIMIT-1) << " mean return: " << result.mean << " (" << elapsed_seconds.count() << "s)" << std::endl;
    }
    // One environment step (saved in the replay buffer); the policy argument is assumed to be actor_critic.actor
    bpt::step(device, off_policy_runner, actor_critic.actor, actor_buffers_eval, rng);

    // TD3 training using the replay buffer
    if(step_i > N_WARMUP_STEPS){
        // Critic training (each critic gets its own noise sample and batch)
        for(int critic_i = 0; critic_i < 2; critic_i++){
            bpt::target_action_noise(device, actor_critic, critic_training_buffers.target_next_action_noise, rng);
            bpt::gather_batch(device, off_policy_runner, critic_batch, rng);
            bpt::train_critic(device, actor_critic, critic_i == 0 ? actor_critic.critic_1 : actor_critic.critic_2, critic_batch, optimizer, actor_buffers, critic_buffers, critic_training_buffers);
        }
        // Actor training and target updates (every other step)
        if(step_i % 2 == 0){
            bpt::gather_batch(device, off_policy_runner, actor_batch, rng);
            bpt::train_actor(device, actor_critic, actor_batch, optimizer, actor_buffers, critic_buffers, actor_training_buffers);

            bpt::update_critic_targets(device, actor_critic);
            bpt::update_actor_target(device, actor_critic);
        }
    }
}

Step: 0/9999 mean return: -1479.62 (5.3447e-05s)
Step: 1000/9999 mean return: -1544.22 (29.8179s)
Step: 2000/9999 mean return: -1487.76 (62.066s)
Step: 3000/9999 mean return: -1340.32 (99.5004s)
Step: 4000/9999 mean return: -1065.59 (137.757s)
Step: 5000/9999 mean return: -897.391 (191.689s)
Step: 6000/9999 mean return: -805.275 (234.109s)
Step: 7000/9999 mean return: -770.334 (279.03s)
Step: 8000/9999 mean return: -274.956 (310.019s)
Step: 9000/9999 mean return: -173.292 (343.105s)
Step: 9999/9999 mean return: -214.823 (378.395s)
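The update schedule of the loop above (warmup, two critic updates per step, delayed actor and target updates) can be isolated from the learning code. The following sketch only replays the control flow and counts the updates; `UpdateCounts` and `count_updates` are hypothetical names introduced for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical counters for the update schedule of the training loop.
struct UpdateCounts {
    std::size_t critic_updates = 0;
    std::size_t actor_updates = 0;
    std::size_t target_updates = 0;
};

// Replays only the control flow of the training loop; no learning happens here.
inline UpdateCounts count_updates(std::size_t step_limit, std::size_t warmup_steps) {
    UpdateCounts counts;
    for (std::size_t step = 0; step < step_limit; step++) {
        if (step > warmup_steps) {
            counts.critic_updates += 2;  // both critics are trained on every step
            if (step % 2 == 0) {         // delayed policy and target updates
                counts.actor_updates++;
                counts.target_updates++;
            }
        }
    }
    return counts;
}
```

For example, with 10 steps and 3 warmup steps, training happens on steps 4 to 9, which gives 12 critic updates and 3 actor and target updates. The delayed actor update is the "delayed" part of TD3's name.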

In the case of the pendulum, a mean return of around -200 means that the policy has learned to swing the pendulum up from any initial condition and stabilize it in the upright position.
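To put these numbers in context, the sketch below implements the per-step reward of gymnasium's Pendulum-v1. It assumes that the BackpropTools pendulum uses the same reward, which the stated equivalence suggests but which is not verified here; `pendulum_reward` is a hypothetical helper.

```cpp
#include <cassert>
#include <cmath>

// Per-step reward of gymnasium's Pendulum-v1, assuming `angle` is already
// normalized to [-pi, pi] with 0 being the upright position.
inline double pendulum_reward(double angle, double angular_velocity, double torque) {
    return -(angle * angle
             + 0.1 * angular_velocity * angular_velocity
             + 0.001 * torque * torque);
}
```

Under this reward, a pendulum that hangs at rest (angle pi, no velocity, no torque) for 200 steps accumulates a return of roughly -1974, while a perfectly balanced pendulum collects a reward of zero per step. A return around -200 therefore consists mostly of the cost incurred during the swing-up phase.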

Bonus (this only works when you are running this tutorial live because this draws to a temporary canvas)

We implemented a UI (pendulum::ui::Xeus) that can render to a canvas element in this notebook:

using UI_SPEC = bpt::rl::environments::pendulum::ui::xeus::Specification<T, TI, 400, 100>; // float type, index type, size, playback speed (in %)
using UI = bpt::rl::environments::pendulum::ui::xeus::UI<UI_SPEC>;

We declare it and give the canvas as the output value of the cell (last statement) to be displayed:

UI ui;

We can now pass this UI to the bpt::evaluate function which populates it with the state and renders it into the displayed canvas:

// The policy argument is assumed to be actor_critic.actor
bpt::evaluate(device, envs[0], ui, actor_critic.actor, bpt::rl::utils::evaluation::Specification<1, EPISODE_STEP_LIMIT>(), observations_mean, observations_std, rng);

The simulation runs in the kernel and pushes updates to the notebook, so, depending on the network speed, the maximum playback speed might be slower than real time. The indicator at the bottom shows how much torque the policy applies to the joint. You can re-run this cell to run another episode with a different, random initial state.