September 3, 2025

Hello, Neural Network 2! Follow the White Rabbit.

Last time, we ended with a working naive implementation of a simple neural network. We also briefly touched on the performance difference between floats and doubles. But as you may remember, the ultimate goal is to train the network on a GPU using Vulkan Compute.

In Vulkan Compute shaders (I’ll use GLSL as the shader language), we don’t have anything comparable to the C++ standard library. In fact, we have very little - just basic arithmetic operations and a small set of built-in functions. Because of that, we need something that makes porting the code easier.

As you know, GPUs are highly parallel devices, and we want to use algorithms that are embarrassingly parallelizable or at least close to that. Matrix multiplication maps well onto GPU architectures. As a preparation step, I’ll transform the naive implementation into one that uses basic linear algebra rules. For now, it will still run on the CPU, but to make it as fast as (or even faster than) the original, I’ll use the Eigen library, which makes extensive use of SIMD.

The “Follow the White Rabbit” part in the post title is a reference to The Matrix movie - because we’ll be working with matrices. Alternatively, it can also be read as a nod to Lewis Carroll’s Alice’s Adventures in Wonderland, where Alice literally follows a white rabbit into a fantastical world. Figuratively, it means to pursue an unknown clue, embark on an adventure, or explore a different reality.


Let’s start from the beginning

In a previous post, we discovered that the data layout in Neuron.cpp caused a significant performance loss. The first step, therefore, is to completely remove this intermediate object. Now, the weights and biases are stored directly in the Layer class.

#include <Eigen/Dense>

#include <cstddef>
#include <functional>

// Note: the exact headers in the project may differ; common::Float is the
// project's floating-point alias (float or double).

class Layer {
public:
    using MatrixX = Eigen::Matrix<common::Float, Eigen::Dynamic, Eigen::Dynamic>;
    using VectorX = Eigen::Matrix<common::Float, Eigen::Dynamic, 1>;

public:
    Layer() noexcept = default;

    Layer(size_t neuronCount, size_t inputCount) noexcept;

public:
    [[nodiscard]] auto activate(Layer const& prevLayer,  //
                                std::function<auto(common::Float)->common::Float> const& activationFunction  //
                                ) noexcept -> bool;

    [[nodiscard]] auto update(Layer const& prevLayer,  //
                              common::Float learningRate,  //
                              MatrixX const& delta  //
                              ) noexcept -> bool;

    size_t size() const noexcept;

public:
    MatrixX weights{};  // one row per neuron, one column per neuron in the previous layer
    MatrixX biases{};   // single column: one bias per neuron
    MatrixX values{};   // single column: the neuron activations
};

Here, the weights form a regular rectangular matrix - one row per neuron in this layer and one column per neuron in the previous layer - while the biases and values are single-column matrices (or simply vectors). The weight matrix of a layer always describes the connections to the previous layer.
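To make the shapes concrete, here is a minimal sketch of what the constructor could look like. The random initialization is my simplification, not necessarily what the real project does:

Layer::Layer(size_t neuronCount, size_t inputCount) noexcept
    // weights: one row per neuron in this layer, one column per neuron in the previous layer
    : weights{MatrixX::Random(neuronCount, inputCount)},
      // biases and values: one entry per neuron, stored as single-column matrices
      biases{MatrixX::Random(neuronCount, 1)},
      values{MatrixX::Zero(neuronCount, 1)} {}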


Let’s see how we would calculate the values for the first hidden layer, $h0$.

$$ a^{h0}_0 = f(z^{h0}_0) $$

$$ a^{h0}_1 = f(z^{h0}_1) $$

$$ a^{h0}_2 = f(z^{h0}_2) $$

where $f(z_i)$ is the activation function - in our case, the sigmoid.

$$ z_0^{h0} = a_0^i \cdot w_\text{0_0}^{h0} + a_1^i \cdot w_\text{0_1}^{h0} + b_0^{h0} $$

$$ z_1^{h0} = a_0^i \cdot w_\text{1_0}^{h0} + a_1^i \cdot w_\text{1_1}^{h0} + b_1^{h0} $$

$$ z_2^{h0} = a_0^i \cdot w_\text{2_0}^{h0} + a_1^i \cdot w_\text{2_1}^{h0} + b_2^{h0} $$

Here, the superscript denotes the layer. The subscript in $a_n$ refers to the value of the $n$-th neuron, while the subscript in $w_\text{n_k}$ denotes the weight connecting the $n$-th neuron of the current layer to the $k$-th neuron of the previous layer.

Now, if we look closely, we can see a pattern that looks like a matrix multiplication followed by an addition. If we define:

$$ W^{h0} = \begin{bmatrix} w_\text{0_0}^{h0} & w_\text{0_1}^{h0} \\
w_\text{1_0}^{h0} & w_\text{1_1}^{h0} \\
w_\text{2_0}^{h0} & w_\text{2_1}^{h0} \end{bmatrix} $$

$$ A^{i} = \begin{bmatrix} a_0^{i} \\
a_1^{i} \end{bmatrix} $$

$$ B^{h0} = \begin{bmatrix} b_0^{h0} \\
b_1^{h0} \\
b_2^{h0} \end{bmatrix} $$

$$ Z^{h0} = \begin{bmatrix} z_0^{h0} \\
z_1^{h0} \\
z_2^{h0} \end{bmatrix} $$

then the layer activation becomes:

$$ Z^{h0} = W^{h0} \cdot A^{i} + B^{h0} $$

$$ A^{h0} = f(Z^{h0}) $$

which maps perfectly onto Eigen’s arithmetic operations:

// Layer::activate()
// Z = W * A_prev (noalias() avoids creating a temporary)
values.noalias() = weights * prevLayer.values;
// Z = Z + B
values += biases;

// A = f(Z), applied element-wise
values = values.unaryExpr(activationFunction);

Now we need to do this for every layer, starting from the second one (skipping the input layer), i.e., propagating the values forward - hence the name, forward propagation.
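To make the loop explicit, here is a rough sketch of a forward pass at the network level. The layers vector and the sigmoid lambda are my assumptions about the surrounding code, not the project’s exact API:

// Sketch of a forward pass, assuming the network keeps its layers in a std::vector<Layer>.
// layers[0] is the input layer: its 'values' are filled directly from the input sample.
std::function<auto(common::Float)->common::Float> const sigmoid{
    [](common::Float z) -> common::Float {
        return common::Float{1} / (common::Float{1} + std::exp(-z));  // needs <cmath>
    }};

for (size_t i = 1; i < layers.size(); ++i) {
    if (!layers[i].activate(layers[i - 1], sigmoid)) {
        // handle the error, e.g. abort this training step
    }
}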

NOTE: In the beginning, I stored weights and biases in a single matrix, with biases placed in the last column. Layer values had an additional row with a constant value of 1. With this setup, I could perform both the multiplication and the bias addition in a single operation instead of two. However, the code quickly became messy, and the extra arithmetic operations still had to be executed. For a small neural network, these operations are negligible, but for huge networks with billions of parameters, the overhead could matter. I don’t plan to multiply matrices of that scale, but I decided to split the operations anyway for the sake of code clarity.
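To illustrate what I mean by a single operation, the augmented layout computes the multiplication and the bias addition at once:

$$ \begin{bmatrix} W & B \end{bmatrix} \cdot \begin{bmatrix} A \\ 1 \end{bmatrix} = W \cdot A + B $$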

Backpropagation

With matrix notation, the forward pass becomes trivially simple. For me, the biggest challenge was understanding how to apply a similar notation to backpropagation. As mentioned earlier, I’m not going to explain what backpropagation is or how it works in detail. Instead, I’ll just show it in matrix form. Personally, solving a small neural network from the image above by hand - with pen and paper - helped me much more than watching YouTube videos or asking ChatGPT for a ready-made solution. If you really want to follow my thinking process, and if this post collects 1,000 likes, I’ll create a separate post explaining it step by step.

After the forward pass, we have values in the output layer, which we compare against the expected results. From this, we calculate the layer’s delta (more on that shortly) and use it to update the layer:

auto Layer::update(Layer const& prevLayer,  //
                   common::Float learningRate,  //
                   MatrixX const& delta  //
                   ) noexcept -> bool {
    if (static_cast<size_t>(delta.rows()) != size()) {
        return false;  // Mismatch in delta size
    }

    // Gradient of the cost with respect to the weights: the outer product of
    // this layer's delta and the previous layer's values.
    MatrixX const gradient{delta * prevLayer.values.transpose()};

    // Gradient descent step.
    weights -= learningRate * gradient;
    biases -= learningRate * delta;

    return true;
}

We treat weights and biases slightly differently, but this still maps perfectly onto matrix math. The same operation can be written in mathematical form as:

$$ G^n = \delta^n \cdot T(A^{n-1}) $$

where $T(A^{n-1})$ denotes the transpose of the values from the previous layer. In other words, it converts a column vector into a row vector - this is necessary to satisfy the rules of matrix multiplication.

$$ W^n = W^n - \text{learningRate} \cdot G^n $$

$$ B^n = B^n - \text{learningRate} \cdot \delta^n $$

We repeat this process for every layer, starting from the last one (skipping the first layer, since it’s the input).

Calculating this $\delta^n$ is, in my opinion, the trickiest part - partly because it is computed differently for different layers. For the last layer (the one we start backpropagation with), it is calculated as:

$$ \delta_i^o = \frac{\partial \text{C}}{\partial a_i} \cdot \frac{d\sigma}{dz_i} $$

Please refer to the previous post where I explained what these partial derivatives represent.
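To give this a concrete shape, here is a sketch of how the output-layer delta could be computed with Eigen. I’m assuming a quadratic (mean-squared-error) cost, so that $\frac{\partial C}{\partial a_i} = a_i - y_i$, and I use the fact that the sigmoid’s derivative can be expressed through the activation itself, $\frac{d\sigma}{dz} = a \cdot (1 - a)$. The helper name and the expected vector are mine, not the project’s:

// Hypothetical helper: delta of the output layer.
// Assumes a quadratic cost (dC/da = a - y) and the sigmoid activation (sigma'(z) = a * (1 - a)).
auto computeOutputDelta(Layer const& outputLayer, Layer::MatrixX const& expected)
    -> Layer::MatrixX {
    Layer::MatrixX const costDerivative{outputLayer.values - expected};
    Layer::MatrixX const sigmoidDerivative{
        (outputLayer.values.array() * (common::Float{1} - outputLayer.values.array())).matrix()};
    return costDerivative.cwiseProduct(sigmoidDerivative);
}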

For the other layers, it is calculated as:

$$ \delta^n = (T(W^{n+1}) \cdot \delta^{n+1}) \odot \text{sigmoidDerivative}(A^n) $$

Here, $T(W^{n+1})$ is the transpose of the next layer’s weights (with our storage scheme, the weights connecting layer $n$ to layer $n+1$ are stored in layer $n+1$), $\delta^{n+1}$ is the delta from that next layer (e.g., for the second-to-last layer, it is the delta of the last layer - $\delta^o$), $\odot$ denotes component-wise multiplication (not to be confused with standard matrix multiplication), and $\text{sigmoidDerivative}(A^n)$ is the derivative of the sigmoid function applied to each value in the layer.
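And a matching sketch for a hidden layer - again, the helper is hypothetical; nextLayer is the layer whose delta we have already computed:

// Hypothetical helper: delta of a hidden layer n.
// nextLayer holds the weights connecting layer n to layer n + 1, and nextDelta is its delta.
auto computeHiddenDelta(Layer const& layer, Layer const& nextLayer, Layer::MatrixX const& nextDelta)
    -> Layer::MatrixX {
    Layer::MatrixX const sigmoidDerivative{
        (layer.values.array() * (common::Float{1} - layer.values.array())).matrix()};
    return (nextLayer.weights.transpose() * nextDelta).cwiseProduct(sigmoidDerivative);
}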

As you can see, a layer calculates its $\delta^n$ using the $\delta$ from another layer - hence the term backpropagation.
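Putting it all together, one backward pass could look roughly like this. The layers vector, expected, and learningRate are again my assumptions, and the helpers are the sketches from above:

// Sketch of one backward pass, reusing the hypothetical helpers above.
// layers.back() is the output layer; layers[0] is the input layer and is never updated.
Layer::MatrixX delta{computeOutputDelta(layers.back(), expected)};

for (size_t i = layers.size() - 1; i > 0; --i) {
    // Compute the delta of the layer below before this layer's weights are modified.
    Layer::MatrixX nextDelta{};
    if (i > 1) {
        nextDelta = computeHiddenDelta(layers[i - 1], layers[i], delta);
    }

    if (!layers[i].update(layers[i - 1], learningRate, delta)) {
        // handle the error
    }

    delta = nextDelta;
}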

I understand that this can all be complicated, and I want to emphasize once more that the goal of these posts is not to teach how backpropagation works, but to show how performance is affected by different implementation approaches.

Results

As a reminder, the naive implementation with 20 epochs, a single hidden layer, and float as the data type took 73’071.55 ms on my machine, or roughly 3’653 ms per epoch.

The same configuration using double was actually faster - 56’971.18 ms, or about 2’849 ms per epoch.

The new Eigen-based approach takes 30’624.57 ms, or roughly 1’531 ms per epoch for float, which is a significant improvement! The double data type behaves more predictably this time, taking 74’144.21 ms, or about 3’707 ms per epoch.

But wait - we used the -O3 compiler optimization flag, yet didn’t specify any vectorization flags. Let’s see what Eigen is using under the hood. I added the following code in main.cpp to print this information:

#if defined(EIGEN_VECTORIZE_AVX512)
    fmt::println("Using AVX-512");
#elif defined(EIGEN_VECTORIZE_AVX2)
    fmt::println("Using AVX2");
#elif defined(EIGEN_VECTORIZE_AVX)
    fmt::println("Using AVX\n");
#elif defined(EIGEN_VECTORIZE_SSE)
    fmt::println("Using SSE");
#else
    fmt::println("No vectorization");
#endif

By default, the program prints "Using SSE", which means Eigen uses 128-bit SSE registers. One such register can hold 4 float values or 2 double values.

My CPU supports AVX2 instructions, which operate on 256-bit registers - meaning one register can hold 8 float values or 4 double values at a time. I recompiled the project with the -mavx2 flag, and here are the results:

  • float - 25’830.90 ms, or ~1’291 ms per epoch

  • double - 56’573.02 ms, or ~2’828 ms per epoch

I also tried an extreme - but extremely interesting - case with a 4-layer network. It used the same input and output as before, but each of the two hidden layers had 1’000 neurons. The naive float version took ~134’640 ms per epoch. With EIGEN_VECTORIZE_AVX2 and float, it now takes “only” ~88’930 ms per epoch! Not bad.

Summary

| Configuration | Time per Epoch (ms) | Total Time for 20 Epochs (ms) | Speedup |
| --- | --- | --- | --- |
| Naive float, 1 hidden layer, 100 neurons | 3’653 | 73’071 | - |
| Naive double, 1 hidden layer, 100 neurons | 2’849 | 56’971 | 1.28× |
| Eigen float, 1 hidden layer, 100 neurons | 1’531 | 30’625 | 2.38× |
| Eigen double, 1 hidden layer, 100 neurons | 3’707 | 74’144 | 0.99× |
| Eigen float + AVX2, 1 hidden layer, 100 neurons | 1’291 | 25’830 | 2.83× |
| Eigen double + AVX2, 1 hidden layer, 100 neurons | 2’828 | 56’573 | 1.29× |

Extreme case:

| Configuration | Time per Epoch (ms) | Speedup |
| --- | --- | --- |
| Naive float, 4-layer network, 1’000 neurons per hidden layer | 134’640 | - |
| Eigen float + AVX2, 4-layer network, 1’000 neurons per hidden layer | 88’930 | 1.514× |

Conclusion

Today, we achieved significant progress - we more than doubled performance by utilizing SIMD. Can we push it even further by training the network on a GPU?

The project’s source code is here.

I am using a static site generator (Hugo) to build this site, so there is no comment section here. As a personal experiment, I published a short post on LinkedIn pointing to this article. If you have a question, you can ask it there, and if you want updates, you can follow me there as well.

