September 26, 2025

Hello, Neural Network 3! GPU time.

Today is the day we finally move training of our simple model onto the GPU. As I’ve mentioned before, I’ll be using the Vulkan API for that. The shaders - the actual programs that run on the GPU - will be written in GLSL, a simple C-like language. Vulkan is famous for its complexity and verbosity, but in return it offers an incredible amount of control. Writing GPU code is tough, and writing GPU code in Vulkan is even tougher. For comparison: the Eigen version of this project took just 377 lines of code (LOC), while the Vulkan version ended up at 3’309 LOC - most of it boilerplate API calls unrelated to shaders or the neural network itself. But hey, these kinds of challenges are what make it interesting. This article will dive into technical details, so I assume the reader already knows at least the basics of Vulkan.

I decided to use plain Vulkan 1.0 without any extensions. No helper libraries or frameworks - just pure, handwritten Vulkan. The current Vulkan version as of this writing is 1.4, and the API has evolved a lot since 1.0. Many features were added to make a programmer’s life easier. But my main motivation was portability: I wanted the code to run on any device that supports Vulkan, even very old ones. That means no descriptor indexing, no new synchronization model, no VK_KHR_maintenanceN features, etc. The result is admittedly uglier code, but that was a deliberate decision. At least the performance remains consistent across versions.

For data types, I’m sticking to single-precision floats only. In my earlier implementations, I allowed a choice between doubles and floats, but here it’s unnecessary. I’m testing on a regular consumer GPU - an NVIDIA RTX A5500 Laptop GPU - and such hardware has poor double-precision throughput. Using doubles would just waste time and memory while being guaranteed to run slower than floats.


Intro

The basic structure stays the same - I still use Layer and NeuralNetwork classes - but now they’re enhanced with Vulkan-specific objects.

For testing, as you may remember, I used a simple 3-layer neural network with 784 input neurons, 100 hidden neurons, and 10 output neurons. The diagram below shows how everything is connected. The layers are Input, Hidden, and Output.

There are two main steps: forward propagation and backward propagation. Backward propagation itself is split into two sub-phases: the gradient calculation phase and the update phase.


  • $A^i$ - the input, representing a hand-written digit. During training, this is the data we use to teach the network. It comes from outside the model and is never modified in our code.

  • $A^h$, $W^h$, $B^h$ - the activations, weights, and biases of the hidden layer. These values are updated continuously during every iteration of each epoch.

  • $\delta^o$ - the gradient of the output layer. This is also updated at every iteration of each epoch.

  • $\delta^h$ - the gradient of the hidden layer. Like the others, it is updated at every iteration of each epoch.

The arrows and colors in the diagram are not arbitrary - they represent data dependencies. For example, to calculate $A^h$ we need $A^i$, $W^h$, and $B^h$. These dependencies are crucial, and I will show why later. Keep that diagram in mind.

To store data on the GPU, I use storage buffers - buffers created with the VK_BUFFER_USAGE_STORAGE_BUFFER_BIT flag. Memory is allocated during the instantiation of the NeuralNetwork class: each layer allocates the necessary buffers, and additional buffers are allocated for the network as a whole. Memory is released manually once training is finished, the GPU is idle, and it is safe to destroy the buffers.
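To give an idea of what this looks like, here is a minimal sketch of creating one such buffer and backing it with device memory (device, sizeInBytes and the findMemoryType() helper are placeholders for the real code, and error handling is omitted):

// Create a buffer that compute shaders can read and write as a storage buffer.
VkBufferCreateInfo bufferInfo{};
bufferInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
bufferInfo.size = sizeInBytes;
bufferInfo.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT;
bufferInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

VkBuffer buffer = VK_NULL_HANDLE;
vkCreateBuffer(device, &bufferInfo, nullptr, &buffer);

// Back the buffer with device-local memory and bind it.
VkMemoryRequirements memReq{};
vkGetBufferMemoryRequirements(device, buffer, &memReq);

VkMemoryAllocateInfo allocInfo{};
allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
allocInfo.allocationSize = memReq.size;
allocInfo.memoryTypeIndex = findMemoryType(memReq.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT);

VkDeviceMemory memory = VK_NULL_HANDLE;
vkAllocateMemory(device, &allocInfo, nullptr, &memory);
vkBindBufferMemory(device, buffer, memory, 0);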

There are four compute shaders in total, one for each corresponding step:

  • forward.comp - computes the activations $A$ using the following formulas:

    $$ Z^{n} = W^{n} \cdot A^{n-1} + B^{n} $$

    $$ A^{n} = \sigma(Z^{n}) $$

    Here, $n$ denotes the current layer being calculated, $n-1$ is the previous layer, and $\sigma()$ is the activation function - in this case, the sigmoid.

  • delta.comp - the compute shader that calculates the deltas for both the output and hidden layers:

    $$ \delta_i^o = \frac{\partial \text{C}}{\partial a_i} \cdot \frac{d\sigma}{dz_i} $$

    $$ \delta^n = (T(W^{n+1}) \cdot \delta^{n+1}) \odot sigmoidDerivative(A^n) $$

    For a refresher on the underlying formulas, see the previous post.

  • update.comp - the compute shader that takes the deltas and updates the weights and biases for all layers (except the input layer, of course):

    $$ W^n = W^n - learningRate \cdot \delta^n \cdot T(A^{n-1}) $$

    $$ B^n = B^n - learningRate \cdot \delta^n $$

    Here, $T(A^{n-1})$ denotes the transpose of the previous layer’s activations. Since these activations are stored as a single-row matrix (a vector), the transpose is effectively a no-op.

I mentioned that there are four shaders in total. The fourth shader is special, and I will explain it in more detail later.

The shaders themselves are fairly straightforward. I use shared memory to store a tile of a matrix in fast memory - a standard GPU optimization. The shaders are also optimized for vector operations - if you examine the formulas closely, one of the operands in each product is a vector. I won’t go into the details here, but the source code is linked at the end of the post.

Next, let’s recall how the input data is organized. The MNIST dataset consists of the following parts:

  • 60’000 training images, or batches (each image is processed as its own batch).
  • 60’000 training targets - that is, each image has a corresponding “correct” label, a single number (for example, 5).
  • 10’000 test images.
  • 10’000 test targets.

This means that for training, we perform the same steps 60’000 times per epoch, and we have multiple epochs. For my timing measurements, I used 20 epochs.

With all this information, I began the implementation.

Denial

In my first attempt, I tried to replicate the approach used in the Eigen implementation. The algorithm can be summarized in the following pseudo-code:

for every epoch
    for every batch
        allocate command buffer
        upload input batch to the GPU
        upload target to the GPU
        forward()
        backward()
        submit()
        queue wait idle
        free command buffer

    shuffle batches

forward()
    for each layer except input
        activate layer

backward()
    calculate output delta

    for each hidden layer
        calculate hidden delta

    for each layer except input
        update layer

Now, let’s get more Vulkan-specific. For each batch, I allocated a command buffer. I made the input and target buffers host-visible - primarily to avoid unnecessary copies, since these buffers are short-lived.

I was a bit lazy when it came to descriptor sets, so I created a single set with five VK_DESCRIPTOR_TYPE_STORAGE_BUFFER bindings - enough to hold all the buffers any of the shaders needs. Some shaders use all the bindings, while others use only a subset.
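For reference, creating such a layout is just a handful of identical bindings - roughly like this (a sketch; device is assumed to be the logical device):

// Five identical storage-buffer bindings, all visible to the compute stage.
VkDescriptorSetLayoutBinding bindings[5] = {};
for (uint32_t i = 0; i < 5; ++i)
{
    bindings[i].binding = i;
    bindings[i].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
    bindings[i].descriptorCount = 1;
    bindings[i].stageFlags = VK_SHADER_STAGE_COMPUTE_BIT;
}

VkDescriptorSetLayoutCreateInfo layoutInfo{};
layoutInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
layoutInfo.bindingCount = 5;
layoutInfo.pBindings = bindings;

VkDescriptorSetLayout setLayout = VK_NULL_HANDLE;
vkCreateDescriptorSetLayout(device, &layoutInfo, nullptr, &setLayout);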

By the way, in GLSL it’s possible to do the following:

layout(set = 0, binding = 2) readonly buffer ExpectedOutput
{
    // used only by output layer
    float expectedOutput[]; // neuronCount x 1
};

// uses the same binding slot. Depending on dispatch either this or the other will be bound
layout(set = 0, binding = 2) readonly buffer NeighbourDelta
{
    // used only by hidden layer
    float neighbourDelta[]; // neighbourLayerNeuronCount x 1
};

That is, you can bind different data to the same slot. Here, I’m using lists of floats, but it could be completely different data types. I applied this in the delta.comp shader, which calculates both output and hidden deltas. Depending on the step, I updated the descriptor set at slot 2 with the appropriate data.
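On the API side, switching what slot 2 points to is just a descriptor write before the corresponding dispatch is recorded - roughly like this (a sketch; expectedOutputBuffer, neighbourDeltaBuffer and isOutputLayer are placeholder names):

// Rebind slot 2 to either the expected-output buffer or the neighbour-delta buffer,
// depending on which delta (output or hidden) the next dispatch computes.
VkDescriptorBufferInfo bufferInfo{};
bufferInfo.buffer = isOutputLayer ? expectedOutputBuffer : neighbourDeltaBuffer;
bufferInfo.offset = 0;
bufferInfo.range = VK_WHOLE_SIZE;

VkWriteDescriptorSet write{};
write.sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
write.dstSet = descriptorSet;
write.dstBinding = 2;
write.descriptorCount = 1;
write.descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
write.pBufferInfo = &bufferInfo;

vkUpdateDescriptorSets(device, 1, &write, 0, nullptr);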

To calculate the output delta, I need the target expanded into an array. Recall that a batch target is just a single number, but for the calculations, I need a list equal in size to the output layer. For example, if the target value is 5 and the output layer has 10 neurons, the expanded list looks like this:

[0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 1.0f, 0.0f, 0.0f, 0.0f, 0.0f]

That is, an array of size 10 with the element at index 5 set to 1. As in previous implementations, I expanded this on the CPU and uploaded it to a host-visible buffer.
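The expansion itself is a one-liner on the CPU - something along these lines (a sketch; the result is then copied into the mapped host-visible buffer):

#include <cstdint>
#include <vector>

// Expand a single label (e.g. 5) into a one-hot vector of outputNeuronCount floats.
std::vector<float> expandTarget(uint32_t label, uint32_t outputNeuronCount)
{
    std::vector<float> expanded(outputNeuronCount, 0.0f);
    expanded[label] = 1.0f;
    return expanded;
}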

The forward() and backward() steps updated the corresponding descriptor sets, added the necessary memory barriers, and recorded the dispatch calls into the command buffer. Then I submitted everything, waited for the GPU to finish processing the batch, and proceeded to the next batch - repeating this 60’000 times per epoch.

For reference, the best Eigen implementation using AVX2 instructions takes on average 1’291 ms per epoch on my machine. After compiling the Vulkan code and verifying correctness, I ran it - and it took… ~5’500 ms per epoch. Wait, what? The GPU is slower? I reran it several times, confirmed that the NVIDIA GPU was selected, and still got 5’500 ms. I went to bed very sad and unsatisfied.

A note on how I check correctness

For Vulkan API usage, I always test with validation layers. They are a lifesaver and, in my opinion, a strict requirement when developing a Vulkan application. Yes, they are slow - especially noticeable when making thousands of calls inside a 60’000-iteration loop. Sometimes they are so slow that I had to reduce the number of iterations. But having an application run without a single warning is a goal for me.

To verify the correctness of the algorithm, I compare the results of each step with my previous implementations - naive and Eigen. I use a random number generator with identical seeds to ensure reproducibility. As the algorithm runs, small differences accumulate due to floating-point non-associativity, and after 60’000 iterations the numbers diverge significantly. However, the first iterations are sufficient to check whether the results match, since each iteration applies the same calculations repeatedly. Final tests using the test dataset confirm that the results are identical.

A note on how I debug GPU code

For this program, I didn’t need complex debugging. Usually, for more complicated shaders, I use RenderDoc - it allows capturing Vulkan calls via the RenderDoc API and supports debugging compute shaders, including selecting specific workgroups and threads for step-by-step inspection.

However, since my shaders are relatively simple, I relied on the good old printf, together with the GL_EXT_debug_printf extension. It works out of the box when you launch the vkconfig tool - you just need to enable the corresponding checkbox.

Anger

Waiting for the queue to become idle is definitely a bad idea, since it stalls the CPU. Could this be bad enough to completely ruin my performance? After all, I’m calling it 60’000 times per epoch, so there’s obviously some overhead.

I decided to implement a technique we commonly use in graphics: multiple frames in flight. This allows the CPU to continue working while the GPU is busy. I assume you’re familiar with the concept, so I won’t go into details.

My approach was straightforward: instead of using a single instance of the descriptor sets and host-visible buffers, I created N instances, where N is the number of frames in flight. In graphics, this is usually 2 or 3, but for compute I could use any number - even 50. I also created N fences to ensure synchronization when necessary.
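A minimal sketch of the resulting loop, assuming the fences, descriptor sets and host-visible buffers were created in N copies up front (recordAndSubmit() is a hypothetical helper standing in for the real recording code):

// N "frames in flight" for compute: before reusing resource slot i,
// wait until the submission that last used that slot has finished.
const uint32_t framesInFlight = 10;
uint32_t frameIndex = 0;

for (uint32_t batch = 0; batch < batchCount; ++batch)
{
    // Block only if the GPU is still working on this slot's previous submission.
    vkWaitForFences(device, 1, &fences[frameIndex], VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &fences[frameIndex]);

    // Update this slot's host-visible buffers and descriptor sets,
    // record the commands and submit, signalling this slot's fence.
    recordAndSubmit(frameIndex, fences[frameIndex]);

    frameIndex = (frameIndex + 1) % framesInFlight;
}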

After making all the adjustments and setting N to 10, I ran the code… 5’000 ms. Slightly faster than before, but still bad. Hmm, maybe 10 frames isn’t enough. How about 50? Still 5’000 ms. I’m cursed.

Depression

Ok, I decided to optimize aggressively. Why do I need to synchronize in the first place? If you looked at the pseudo-code, you might have noticed these lines:

upload input batch to the GPU
upload target to the GPU

Why upload data every batch? CPU-GPU communication is slow. Instead, I can upload all 60’000 input batches and their corresponding targets just once. It’s not that much data - the full input is approximately 180 MB, and the target data is only 2.3 MB. Even older devices can handle this easily.

But wait, there’s this line in the pseudo-code that’s called once per epoch:

shuffle batches

What’s that for? Well, it’s good practice to shuffle the training data between epochs: it keeps the network from adapting to the order of the samples and generally helps stochastic gradient descent converge, which is why the textbooks recommend it.

Of course, shuffling the input alone doesn’t make sense, since each input entry has a corresponding target. How can we shuffle both while keeping them aligned? I solve this with a batch indices list. At the start of training, it contains the integers 0 through 59’999: [0, 1, 2, 3, ...].

When reading the input data, I use the current index from this list to sample both input and target. After an epoch, I shuffle the indices list. For example, it might become [5365, 897, 56147, ...], so the first batch in the next epoch samples the 5’365th input and target, and so on.
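On the host side the indices list is just a vector of integers; building and shuffling it looks roughly like this (a sketch with hypothetical helper names; the fixed seed keeps runs reproducible):

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Build the identity permutation [0, 1, 2, ...] once at the start of training.
std::vector<uint32_t> makeBatchIndices(uint32_t count)
{
    std::vector<uint32_t> indices(count);
    std::iota(indices.begin(), indices.end(), 0u);
    return indices;
}

// Reshuffle the permutation after every epoch.
void shuffleBatchIndices(std::vector<uint32_t>& indices, std::mt19937& rng)
{
    std::shuffle(indices.begin(), indices.end(), rng);
}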

To make this work efficiently on the GPU, I modified the code as follows:

upload all input batches to the GPU
upload all targets to the GPU

for every epoch
    upload shuffled batch indices to the GPU
    reset batch index
    allocate command buffer

    for every batch
        updateBatchIndex()
        forward()
        backward()
        submit()
        
    free command buffer
    shuffle batches

Here, I removed the CPU-GPU communication from the long loop: the input data is now uploaded only once, and the shuffled indices are uploaded once per epoch. There’s probably a way to shuffle them entirely on the GPU, but I haven’t explored that.

Additionally, I added a new updateBatchIndex() function. Since almost everything now runs on the GPU, I needed a way to provide the correct batch index for each iteration. The complete compute shader for this function is as follows:

#version 450

layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

layout (set = 0, binding = 0) buffer BatchIndex {
    uint currBatchIndex[]; // 0 - counter, 1 - current index
};

layout(set = 0, binding = 1) readonly buffer BatchIndices
{
    uint batchIndices[];
};

void main() {
    uint idx = currBatchIndex[0];
    currBatchIndex[1] = batchIndices[idx];

    currBatchIndex[0] += 1;
}

batchIndices is our shuffled list of indices, e.g., [5365, 897, 56147, ...]. currBatchIndex is a two-element array, where the first element is a counter and the second is the current batch index. Here’s how it works:

  • At the start of each epoch, I upload batchIndices to the GPU and reset currBatchIndex to [0, 0].

  • For every batch, I call vkCmdDispatch(commandBuffer, 1, 1, 1), i.e., a single-thread compute shader. I know this wastes some threads in a subgroup, but the overhead is negligible compared to CPU-GPU transfers.

  • Each call to this shader reads the corresponding batch index. In our example, the first call has counter = 0, which points to batchIndices[0] = 5365.

  • Each call also increments the counter. After the first call, the counter becomes 1, so the next call reads batchIndices[1] = 897, and so on.

This is the fourth shader I mentioned earlier.

I also adjusted all the other shaders to use this new batch index and updated the necessary memory barriers.

The attentive reader may have already noticed something: why am I recording commands to the command buffer 60’000 times per epoch if nothing changes between iterations? There’s absolutely no need to do that.

Instead, I can record a small command buffer once and replay it 60’000 times. To allow a command buffer to be resubmitted while it is still in use, it must be recorded with the VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT flag. Other than that, no special steps are required - it’s perfectly legal in Vulkan.
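In code, this amounts to passing the flag when recording begins (a sketch):

// Record the command buffer once; SIMULTANEOUS_USE allows it to be resubmitted
// while a previous submission of the same buffer is still pending execution.
VkCommandBufferBeginInfo beginInfo{};
beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
beginInfo.flags = VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT;

vkBeginCommandBuffer(commandBuffer, &beginInfo);
// ... record the batch index update, forward and backward dispatches with their barriers ...
vkEndCommandBuffer(commandBuffer);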

With this optimization, the pseudo-code simplifies to:

upload all input batches to the GPU
upload all targets to the GPU
record command buffer

for every epoch
    upload shuffled batch indices to the GPU
    reset batch index

    for every batch
        submit()

    shuffle batches

free command buffer

I was already anticipating victory. I compiled and ran the code… 5’000 ms.


Bargaining

Previously, I mentioned that my shaders are trivial. That’s true - they are relatively small, and I’m not doing anything particularly complex. But maybe I’m wrong? I decided to measure shader performance.

Measuring performance in Vulkan isn’t straightforward. Wrapping a vkCmdDispatch() call with a CPU timer doesn’t make sense, because this call merely records a command in the command buffer - it doesn’t execute it immediately on the GPU. There are vendor-specific profiling tools, but this time I chose to use timestamp queries.

With timestamp queries, you insert a start query, record your commands, insert a stop query, and later read the results. The difference between the two timestamps gives you the GPU execution time. In pseudo-code, it looks like this:

vkCmdWriteTimestamp()
vkCmdDispatch()
vkCmdWriteTimestamp()

It looks simple and should work, right? Right? As it turned out - no. Timestamp queries have limited resolution, and if the start and stop queries are too close, the GPU can write incorrect values. It took me a while to figure this out. Previously, a shader reported 0.05 ms, which doesn’t seem like much - but multiplied by 60’000, that’s 3’000 ms for a single shader. And remember, I have multiple layers running multiple shaders.

The solution is straightforward: instead of measuring a single dispatch, I measured 10’000 dispatches of the same shader:

vkCmdWriteTimestamp()

for 0 to 10'000
    vkCmdDispatch()

vkCmdWriteTimestamp()
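For completeness, here is roughly what the query-pool side of this looks like (a sketch; timestampPeriod comes from VkPhysicalDeviceLimits, and error handling is omitted):

// Create a query pool with two timestamp slots.
VkQueryPoolCreateInfo poolInfo{};
poolInfo.sType = VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO;
poolInfo.queryType = VK_QUERY_TYPE_TIMESTAMP;
poolInfo.queryCount = 2;

VkQueryPool queryPool = VK_NULL_HANDLE;
vkCreateQueryPool(device, &poolInfo, nullptr, &queryPool);

// While recording: reset the pool, then bracket the dispatches with two timestamps.
vkCmdResetQueryPool(commandBuffer, queryPool, 0, 2);
vkCmdWriteTimestamp(commandBuffer, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, queryPool, 0);
// ... the repeated vkCmdDispatch() calls ...
vkCmdWriteTimestamp(commandBuffer, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, queryPool, 1);

// After the submission has finished: read both values and convert ticks to milliseconds.
uint64_t timestamps[2] = {};
vkGetQueryPoolResults(device, queryPool, 0, 2, sizeof(timestamps), timestamps,
                      sizeof(uint64_t), VK_QUERY_RESULT_64_BIT | VK_QUERY_RESULT_WAIT_BIT);

double milliseconds = double(timestamps[1] - timestamps[0]) * timestampPeriod * 1e-6;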

This approach actually worked. Using it, I measured the three main shaders, and here are the results. Recall that my network has three layers - input (784), hidden (100), and output (10). Since the input layer does not participate in calculations, there are six shader calls per batch (I did not measure the batch index update shader).

Here are the measured times on my device:

Shader         Hidden layer (100)    Output layer (10)
forward.comp   0.000762982 ms        0.000166298 ms
delta.comp     0.000169882 ms        0.000167731 ms
update.comp    0.001256346 ms        0.000171213 ms

For 60’000 calls, this corresponds to approximately 161.649 ms total (one hundred sixty one milliseconds).

So the problem isn’t the shaders - as expected, the GPU handles computation very quickly. Looking at pure computation, it actually outperforms the Eigen version by an order of magnitude. Sure, there’s room for improvement: fine-tuning the shaders themselves, adjusting the number of workgroups and threads for my machine, and so on. But even if I doubled the shader performance, it wouldn’t make the overall code faster. Clearly, there’s something else causing the slowdown…

Acceptance

To understand what’s really happening, let’s revisit the diagram I shared earlier. I’ll drop it here again for reference:


Let’s examine the forward pass in our three-layer network. To calculate a layer’s activations for a batch, we need that layer’s weights and biases as well as the activations from the previous layer. This means the output layer must wait until the hidden layer has fully completed its computations.

Moreover, the weights and biases are updated during the backpropagation phase. As a result, the forward pass for the next batch cannot begin until the previous batch is fully processed. The other phases follow a similar pattern.

In other words, the computational pipeline is fully serial - no phase can run in parallel with another. This also implies that if some layers are small (for example, the output layer with only 10 neurons), the GPU will be mostly idle during that phase.

But the most crucial aspect is memory. GPUs have a multi-layer memory hierarchy: global “big” memory, L2 cache, and multiple L1 caches. Unlike CPUs, these memories are not coherent - if one thread writes a value, there is no guarantee that another thread (on a different compute unit) will see the updated value. It is the programmer’s responsibility to ensure that writes are visible and reads see the latest data.

In Vulkan, this is handled with memory barriers. For example, when the hidden layer writes its activations and the output layer reads them, a barrier must be placed between these steps. Similarly, to guarantee that the latest weights and biases are read, a barrier must be added before using them.
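A single such barrier between two compute dispatches looks something like this (a minimal sketch using a global VkMemoryBarrier; buffer-specific barriers are also possible):

// Make the hidden layer's writes (its activations) available to and visible for
// the output layer's reads before the next dispatch runs.
VkMemoryBarrier barrier{};
barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;

vkCmdPipelineBarrier(commandBuffer,
                     VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, // producer: hidden-layer dispatch
                     VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, // consumer: output-layer dispatch
                     0,
                     1, &barrier,
                     0, nullptr,
                     0, nullptr);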

It’s well known that barriers are slow - the official Vulkan specification even mentions this and provides recommendations. But I didn’t expect them to be this slow. In fact, if I remove all barriers in my application, the epoch time drops to roughly ~3’200 ms. Of course, the results are completely incorrect, but this clearly demonstrates the performance impact that barriers have.

Can we do something about that? Actually, not really. We can optimize barrier usage - try to combine multiple barriers into a single call or find steps that can be overlapped - but this won’t change the overall picture.

Then why does everyone use GPUs for training? As we’ve seen, raw compute power is much higher on GPUs than on CPUs. My tests were on a laptop, where the CPU is comparable to a desktop, while the GPU is smaller, more energy-efficient, and slower than a desktop variant. The key factor, however, is the network size. With this tiny neural network, there’s not much work to do, and the GPU spends most of its time idle.

Previously, I tested an “extreme case” - a neural network with four layers: 784, 1000, 1000, 10. The best result so far with the AVX2 Eigen version was 88’930 ms per epoch. I’m happy to report that the Vulkan version outperforms it at 21’112 ms per epoch - roughly four times faster. With a larger workload, the GPU stays busy almost all the time. Sure, there are still memory transfer bottlenecks, but their relative impact decreases as the data size increases. And the larger the network, the greater the performance gap between CPU and GPU.

Conclusion

Today we learned that the CPU can actually be faster for small neural networks. However, in the era of big data, the GPU reigns supreme. With Vulkan, we can train and run inference on neural networks efficiently across a wide range of platforms - from small devices like a Raspberry Pi to servers with AMD GPUs. Using C++ and Vulkan makes it easier to develop truly cross-platform code.

The project’s source code is here.

I am using a static generator (Hugo) to build this site, so there is no comment section directly here. As a personal experiment, I published a short post on LinkedIn pointing to this article. If you have a question, you can ask it there. If you want to follow for updates, you can also follow me there.
