I don’t remember when AI became a big thing. It didn’t happen suddenly, but it did happen fast. Three or four years ago, my friend - a big AI fan but not a programmer - told me he didn’t need to learn programming languages because ChatGPT could write programs. I asked him to create a simple ping-pong game with it; the language didn’t matter. He couldn’t - the tool didn’t understand what he wanted, produced incorrect code, and it was impossible to tell it to fix just that specific part. I told my friend that AI was a cool toy, nothing more. Oh boy, how wrong I was. Now, in 2025, my friend launches multiple agents that do tons of work for him.

So I thought: now it’s time to learn how this thing works. Sure, I had some idea of how it’s done, and of course I’m a ChatGPT (and other network) user. But I want to understand exactly how the numbers work together to produce an answer. The first step in my journey - as usual in programming - starts with writing a “Hello, World!” program. For a neural network, that’s handwritten digit recognition, which I’ll implement in C++. I will compare different technologies and measure their performance - pure C++, SIMD, CUDA, and, if I don’t burn out, Python. Ultimately, though, my goal is to run it with Vulkan. Why Vulkan? Well, because:
- It’s VERY low-level, which gives precise GPU control. I can select any type of memory I want and fine-tune synchronization (though in this app I’ll use just the basics).
- It runs on a huge number of devices - NVIDIA, AMD, Intel, Qualcomm, etc. - and operating systems. Even the Raspberry Pi supports it out of the box.
- It’s certified for critical safety systems (though that’s not the focus of this app).
NOTE: And yeah, I will not use AI to generate code for me. Sure, I’ll ask how to do certain things, but all the code will be written the old-school way - by hand and with love.
Another MNIST dataset recognition tutorial?
No, this will not be a tutorial on how a network works. I will not explain what forward or backward propagation is, why we need biases, or what an activation function does. There are tons - no, TOOOONS - of educational materials on YouTube. As usually happens with math explanations, different people build different abstractions in their heads, and if examples with apples work for somebody, for others it’s a complete miss. I see it with my children - an attempt to explain some concept the way I see it simply does not work; I need to find an individual approach. That’s the reason I will not try to explain these concepts - at least not today. Instead, I’ll document the high-level process of implementing it in C++ and how to run it on a GPU.
Another goal is to understand why neural network training and GPUs almost always come together. I’ll implement multiple versions of the same program and measure their execution time.
Naive attempt
I usually start with a naive, brute-force approach - just to make the thing work. I’ll do the same this time. In my first version, I’ll use standard std::vector containers and perform the math by iterating over them directly. On top of that, I’ll take a straightforward OOP route: a Neuron class, a Layer class, and finally a NeuralNetwork class.
NOTE: This implementation is NOT meant to be a general-purpose library. It’s a one-off application tailored for a single dataset, which, by the way, is embedded directly into the executable to avoid any runtime loading overhead.
And here’s the funny part - I asked ChatGPT a LOT about how to write a network. So, a human created AI… and now AI is helping a human create AI.


And one of the first things it recommended was to use double for high precision. I’m not sure why - the input values for both the training and test datasets are in the [0.0–1.0] range, and the weights and biases stay in a similarly small range. A 32-bit float should be more than enough. Besides, GPUs really like floats. In fact, as far as I know, there’s no way to feed 64-bit data into NVIDIA’s tensor cores. Still, for the sake of science (and possible “I told you so” moments), I’ll make precision a compile-time option so I can switch between float and double as needed:
#ifdef USE_DOUBLE
using Float = double;
#else
using Float = float;
#endif
With that, the Neuron class has the following signature:
class Neuron {
public:
    Neuron() = default;
    Neuron(size_t inputCountArg) noexcept;

public:
    Float value{0.0};
    Float bias{0.0};
    std::vector<Float> weights{};
};
It’s pretty self-explanatory. The only thing worth mentioning is that a neuron stores the weights of its connections to the previous layer.
Let’s also agree on the indexing. For weights, I’ll use from_to indexing. That is, $w^{h1}_{1\_2}$ represents the weight from the second neuron (first number in the subscript) in the second hidden layer (superscript) to the third neuron (second number in the subscript) of the previous layer (which is h0 in this case).


Upon creation, a Neuron assigns random values to the inputCount weights and the bias, uniformly distributed over the [-1.0, 1.0] range. For the input layer, no memory for weights is allocated, since it does not have a previous layer and therefore does not need weights.
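Out of curiosity, here’s roughly what that constructor could look like. This is a sketch, not the verbatim repository code - the random-engine setup in particular is my assumption:

// Neuron::Neuron() - a possible implementation (sketch; requires <random>).
Neuron::Neuron(size_t inputCountArg) noexcept {
    // One shared engine, seeded once.
    static std::mt19937 gen{std::random_device{}()};
    std::uniform_real_distribution<Float> dist{-1.0, 1.0};
    bias = dist(gen);
    weights.resize(inputCountArg); // Empty for the input layer (inputCountArg == 0).
    for (auto& weight : weights) {
        weight = dist(gen);
    }
}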
The next level in the abstraction is the Layer class.
class Layer {
public:
    Layer(size_t neuronCount, size_t inputCountArg) noexcept;

public:
    [[nodiscard]] auto activate(Layer const& prevLayer,                           //
                                std::function<auto(Float)->Float> const& activationFunction //
                                ) noexcept -> bool;
    [[nodiscard]] auto update(Layer const& prevLayer,          //
                              Float learningRate,              //
                              std::vector<Float> const& delta  //
                              ) noexcept -> bool;

public:
    std::vector<Neuron> neurons{};
};
It stores neurons and has two methods for updating them. In the Layer::activate() function, I pass the activation function as a parameter, which I added in case I want to try different functions. For now, though, I’ll use the sigmoid:
$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
When created, a layer instantiates neuronCount neurons.
The NeuralNetwork class is also straightforward:
class NeuralNetwork {
public:
    NeuralNetwork(std::vector<size_t> const& layerSizes);

    [[nodiscard]] auto forward(std::vector<Float> const& inputValues, //
                               std::vector<Float>& outputValues       //
                               ) noexcept -> bool;
    [[nodiscard]] auto train(std::vector<std::vector<Float>> const& input, //
                             std::vector<uint8_t> const& target,           //
                             size_t epochCount,                            //
                             Float learningRate                            //
                             ) noexcept -> bool;

private:
    [[nodiscard]] auto backward(std::vector<Float> const& output,         //
                                std::vector<Float> const& expectedOutput, //
                                Float learningRate                        //
                                ) noexcept -> bool;
    auto print() const noexcept -> void;

private:
    static auto sigmoid(Float v) noexcept -> Float;
    static auto sigmoidDerivative(Float sigmoidResult) noexcept -> Float;

public:
    std::vector<Layer> layers{};
};
It takes the vector layerSizes, where the size of the vector represents the number of layers and each element corresponds to the layer size. That is, the call impl::NeuralNetwork nn{std::vector<size_t>{784, 100, 10}}; creates a network with 3 layers containing 784, 100, and 10 neurons. It also has the public functions NeuralNetwork::forward() - for network evaluation - and NeuralNetwork::train() - for training. The NeuralNetwork::train() function calls NeuralNetwork::backward() - the heart of the training. The activation function NeuralNetwork::sigmoid() and its derivative NeuralNetwork::sigmoidDerivative() are also defined here.
Let’s start by looking at the implementation of the simplest function, NeuralNetwork::forward(). It takes the input values - in our case, handwritten digits, where each digit is represented as a 28x28 array of floating-point numbers in the [0.0–1.0] range, with values closer to 1.0 indicating that the pixel is filled. To avoid unnecessary allocations, the output values vector outputValues is passed by reference - it will be allocated once and reused across multiple calls. The function returns true if everything went well. Since all the layers and sizes are guaranteed to be valid, the only thing that can go wrong is incorrect input. All this function does is run through every layer starting from the second (remember - the first layer is the input), calling the corresponding Layer::activate() function. After that, it reads values from the last layer:
// NeuralNetwork::forward()
for (size_t i{1}; i < layers.size(); ++i) {
    auto& currLayer{layers[i]};
    auto& prevLayer{layers[i - 1]};
    if (!currLayer.activate(prevLayer, sigmoid)) {
        return false;
    }
}
auto const& outputLayer{layers.back()};
outputValues.resize(outputLayer.neurons.size());
for (size_t i{0}; i < outputLayer.neurons.size(); ++i) {
    outputValues[i] = outputLayer.neurons[i].value;
}
Let’s look at the Layer::activate() function. All it does is run through every neuron, compute the inner product (or dot product) of its weights with the previous layer’s neuron values, and apply the activation (sigmoid) function:
// Layer::activate()
for (auto& currNeuron : neurons) {
    if (currNeuron.weights.size() != prevLayer.neurons.size()) {
        return false; // Weight count must match the previous layer
    }
    Float z{currNeuron.bias};
    for (size_t j{0}; j < prevLayer.neurons.size(); ++j) {
        z += currNeuron.weights[j] * prevLayer.neurons[j].value;
    }
    currNeuron.value = activationFunction(z);
}
In the picture below, the loop result is marked as $z_i$, and the result of activationFunction(z) is assigned to the neuron’s value. Strictly speaking, the $z_i$ part is not important for the code, but it is necessary when deriving the back-propagation formulas.


After this function, the layer neurons will have updated values ($a_i$ in the picture). As mentioned earlier, the values from the last output ($o$) layer are the result of the network evaluation. For our case, the output layer will have 10 neurons, representing the integers from 0 to 9. The neuron with the highest value is the decision of the forward pass. For example, the output {0.1, 0.2, 0.1, 0.8, 0.0, 0.1, 0.2, 0.3, 0.2, 0.5} means that the decision is the 4th neuron, which represents the number 3 (we start with 0).
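If you want the digit itself rather than the raw outputs, the decision is just an argmax over the output vector. A small helper - hypothetical, not part of the classes above - could look like this:

// Picks the predicted digit: the index of the largest output value.
// Hypothetical helper; requires <algorithm>.
auto predictedDigit(std::vector<Float> const& outputValues) noexcept -> size_t {
    auto const maxIt{std::max_element(outputValues.cbegin(), outputValues.cend())};
    return static_cast<size_t>(std::distance(outputValues.cbegin(), maxIt));
}

For the example output above, it would return 3.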
Of course, running the NeuralNetwork::forward() function on an untrained network will return the wrong result, so first we need to train it with the NeuralNetwork::train() function. Let’s look at its signature:
auto NeuralNetwork::train(std::vector<std::vector<Float>> const& input, // 0.0-1.0
                          std::vector<uint8_t> const& target,           // 0-9
                          size_t epochCount,                            //
                          Float learningRate                            //
                          ) noexcept -> bool;
It takes images as the input parameter. In our case, it’s a vector of 60’000 vectors. The target parameter is the labeled data. In other words, for each input element there’s a corresponding target digit. For the very first image, it’s 5.
NOTE: If we print the very first image, assigning X for every value greater than 0.5 and . otherwise, we’ll see:
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . X X . X X X . . . . .
. . . . . . . . . . . X X X X X X X X X X X X . . . . .
. . . . . . . . X X X X X X X X X X . . . . . . . . . .
. . . . . . . . X X X X X X X X X X . . . . . . . . . .
. . . . . . . . . X . X X X . . . X . . . . . . . . . .
. . . . . . . . . . . X X . . . . . . . . . . . . . . .
. . . . . . . . . . . X X X . . . . . . . . . . . . . .
. . . . . . . . . . . . X X . . . . . . . . . . . . . .
. . . . . . . . . . . . . X X X . . . . . . . . . . . .
. . . . . . . . . . . . . . X X X . . . . . . . . . . .
. . . . . . . . . . . . . . . X X X X . . . . . . . . .
. . . . . . . . . . . . . . . . . X X X . . . . . . . .
. . . . . . . . . . . . . . . . . X X X . . . . . . . .
. . . . . . . . . . . . . . . X X X X X . . . . . . . .
. . . . . . . . . . . . . X X X X X X X . . . . . . . .
. . . . . . . . . . . . X X X X X X . . . . . . . . . .
. . . . . . . . . . X X X X X X . . . . . . . . . . . .
. . . . . . . X X X X X X X . . . . . . . . . . . . . .
. . . . . X X X X X X X X . . . . . . . . . . . . . . .
. . . . X X X X X X X . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
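The printing in the note above is just a pair of loops; a sketch, assuming image is one 28x28 digit stored row-major in a std::vector<Float>:

// Prints one 28x28 image as X / . characters (sketch).
for (size_t row{0}; row < 28; ++row) {
    for (size_t col{0}; col < 28; ++col) {
        fmt::print("{} ", image[row * 28 + col] > 0.5 ? 'X' : '.');
    }
    fmt::println("");
}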
As you can imagine, the sizes of input and target should match.
The epochCount parameter tells how many training iterations we want to run. A low number may not be enough, while a very high number can waste time on redundant training. We’ll choose this parameter by running the program multiple times and selecting the best value.
The learningRate - well, the learning rate - determines how big a step we take along the gradient during descent. In this app, the learningRate is a static parameter, which we’ll also choose by running the program multiple times and selecting the best value.
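For reference, every parameter update we’ll see below is the standard gradient-descent step, with the learning rate $\eta$ scaling the gradient:

$$ w \leftarrow w - \eta \frac{\partial C}{\partial w} $$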
At the beginning of the function, I do some input validity checks, allocate memory, and create timers to measure performance. Then I run epochCount training sessions, each of which does the following:
// NeuralNetwork::train()
Float epochLoss{0.0_F};
for (size_t i{0}; i < indices.size(); ++i) {
    auto const idx{indices[i]};
    if (!forward(input[idx], output)) {
        fmt::println("Failed to compute forward pass.");
        return false;
    }
    auto const& outputLayer{layers.back()};
    expectedOutput.resize(outputLayer.neurons.size());
    expectedOutput.assign(expectedOutput.size(), 0.0_F);
    if (target[idx] >= expectedOutput.size()) {
        fmt::println("Target index {} out of bounds for output size {}.", target[idx], expectedOutput.size());
        return false;
    }
    expectedOutput[target[idx]] = 1.0_F;
    Float mse{0.0_F};
    for (size_t j{0}; j < output.size(); ++j) {
        Float const diff{expectedOutput[j] - output[j]};
        mse += diff * diff;
    }
    mse /= output.size(); // Mean squared error
    epochLoss += mse;
    if (!backward(output, expectedOutput, learningRate)) {
        return false; // Backward pass failed
    }
}
shuffle_indices(indices);
Here I introduced the indices vector. The idea is that for every epoch we train the network with the same data, but we don’t want this data to be in the same order every time, so we need to shuffle it. Shuffling the input directly is an expensive process and won’t work, since input and target come together: the element input[42] should correspond to target[42]. If we randomly shuffle input, how do we shuffle target the same way? To solve this, I added the indices vector. At the beginning, it is filled with consecutive integers, i.e., {0, 1, 2, 3, ...}. After each epoch, the indices are shuffled, giving us a different order of the original data every time.
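The helper itself is tiny. A sketch of what the initial fill and shuffle_indices() might look like (the exact implementation in the repository may differ):

// Requires <algorithm>, <numeric>, <random>.
// Fill once with consecutive integers {0, 1, 2, ...}:
std::vector<size_t> indices(input.size());
std::iota(indices.begin(), indices.end(), size_t{0});

// Shuffle after every epoch (sketch):
auto shuffle_indices(std::vector<size_t>& indices) -> void {
    static std::mt19937 gen{std::random_device{}()};
    std::shuffle(indices.begin(), indices.end(), gen);
}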
The first thing we do is run the forward propagation pass, collecting the output (a vector of 10 values) and calculating the loss function, which is defined as:
$$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
where $y_i$ is what we expect, and $\hat{y}_i$ is what we got.
I keep track of the epoch loss and print it at the end of each epoch so I can better decide how many epochs are needed. If the loss stops changing from epoch to epoch, we don’t need to keep calculating.
The expectedOutput is a vector of size 10, with all elements 0.0 except a single element, which is 1.0 for the target value.
Once we have the results of the forward pass and the expected results, we can run the backward propagation. For me, this was the most confusing part of the entire algorithm. I spent a lot of time reading, watching videos, and asking ChatGPT about it. All of this helped a lot. But in the end, what actually gave me the AHA moment was good old pen and paper:


As mentioned, I will not explain it today - I am not a mathematician and I’d probably get it wrong. But in the code, the main idea is to keep a list - the so-called delta - where the delta computed for one layer is used to calculate the delta for the layer before it, moving from the output toward the input. With these deltas, we can update weights and biases. For the last layer - the output layer - the delta is calculated like this:
// NeuralNetwork::backward()
std::vector<Float> deltaOutput(output.size());
for (size_t i{0}; i < output.size(); ++i) {
    Float const a{output[i]};
    Float const dCdA{2.0_F * (a - expectedOutput[i]) / output.size()};
    Float const dAdZ{sigmoidDerivative(a)};
    deltaOutput[i] = dCdA * dAdZ;
}
where dCdA is the derivative of the MSE loss function with respect to the output (i.e., the final) value. Remember that the loss function and its derivative are defined like this:
$$ \text{C} = \frac{1}{n} \sum_{i=1}^{n} (y_i - a_i)^2 $$
$$ \frac{\partial \text{C}}{\partial a_i} = \frac{2}{n}(a_i - y_i) $$
Here, $a_i$ is what we get after applying the sigmoid function, and $y_i$ is what we expect.
The sigmoid function and its derivative are:
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
$$ \frac{d\sigma}{dz} = \sigma(z) \big(1 - \sigma(z)\big) $$
Here, $z_i$ is the neuron value after the inner product but before the sigmoid. Since $\sigma(z_i)$ is $a_i$, we don’t actually keep $z_i$ in memory.
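Both translate to one-liners in the code. Note how sigmoidDerivative() receives the already-computed $\sigma(z)$ - that’s exactly why we never need to store $z_i$. A sketch matching the class signatures above (requires <cmath>):

// NeuralNetwork::sigmoid() / sigmoidDerivative() (sketch)
auto NeuralNetwork::sigmoid(Float v) noexcept -> Float {
    return Float{1} / (Float{1} + std::exp(-v));
}

auto NeuralNetwork::sigmoidDerivative(Float sigmoidResult) noexcept -> Float {
    // The argument is sigma(z) itself, so the derivative is sigma * (1 - sigma).
    return sigmoidResult * (Float{1} - sigmoidResult);
}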
Once we have this delta, we pass it to a layer to adjust its neurons’ weights and biases:
// Layer::update()
for (size_t i{0}; i < neurons.size(); ++i) {
    auto& currNeuron{neurons[i]};
    if (currNeuron.weights.size() != prevLayer.neurons.size()) {
        return false; // Weight count must match the previous layer
    }
    for (size_t j{0}; j < currNeuron.weights.size(); ++j) {
        currNeuron.weights[j] -= learningRate * delta[i] * prevLayer.neurons[j].value;
    }
    currNeuron.bias -= learningRate * delta[i];
}
And then we keep propagating the delta backward through each layer, calling the Layer::update() function to adjust the neurons’ weights and biases.
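In code, carrying the delta one layer back could look roughly like this. This is a sketch under the from_to indexing agreed above (neuron i of the current layer stores weights[j] toward neuron j of the previous layer), not the verbatim repository code:

// Sketch: delta for the previous layer, derived from the current layer's delta.
std::vector<Float> deltaPrev(prevLayer.neurons.size());
for (size_t j{0}; j < prevLayer.neurons.size(); ++j) {
    Float sum{0.0_F};
    for (size_t i{0}; i < currLayer.neurons.size(); ++i) {
        sum += currLayer.neurons[i].weights[j] * delta[i];
    }
    // prevLayer values already hold a_j = sigmoid(z_j), so we reuse them.
    deltaPrev[j] = sum * sigmoidDerivative(prevLayer.neurons[j].value);
}
// Important: deltaPrev must be computed with the *old* weights, i.e., before
// calling currLayer.update(prevLayer, learningRate, delta).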
Results
I set up the network with the following parameters:
// main()
impl::NeuralNetwork nn{std::vector<size_t>{784, 100, 10}};
size_t constexpr EPOCH_COUNT{20};
impl::Float constexpr LEARNING_RATE{1.0};
nn.train(images, labels, EPOCH_COUNT, LEARNING_RATE);
I compiled the code with the -O3 option, i.e., in Release mode.
Then I used the trained network to test its behavior. The MNIST dataset comes in two parts - the training part, consisting of 60’000 images and labels, and the test part, consisting of 10’000 images and labels. Running nn.forward() on the test data gave me ~97% accuracy, which is quite good.
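Measuring that is a straightforward loop over the test set. A sketch, assuming testImages and testLabels hold the 10’000-element test part and reusing the hypothetical predictedDigit() helper from earlier:

// Sketch: count correct predictions on the test set.
size_t correct{0};
std::vector<impl::Float> output{};
for (size_t i{0}; i < testImages.size(); ++i) {
    if (!nn.forward(testImages[i], output)) {
        break; // Evaluation failed
    }
    if (predictedDigit(output) == testLabels[i]) {
        ++correct;
    }
}
fmt::println("Accuracy: {:.2f}%", 100.0 * correct / testImages.size());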
To measure performance, I told the program to run the training 100 times. Then my wife called me for dinner, and after that I left for a walk. When I came back, it had already finished with an average training time of 73071.55 ms.
NOTE: Take these numbers with a grain of salt. I’ll report the times and the relative improvement (or regression) as measured on my machine; they may differ on other hardware.
Of course, one minute is nothing, and in real life a workload like this doesn’t need a GPU at all. But that’s not what we came here for, right? What if I want to set the number of layers to 4 (1 input, 2 hidden, 1 output), with both hidden layers having 1000 neurons each? Additionally, what if I want 100 epochs? This time I did not have enough patience - a single epoch took 134640.45 ms, i.e., more than 2 minutes, so one hundred epochs would take more than 3 hours. I don’t want to wait that long just to find out that, with these parameters, my neural network got worse. So in the next attempts I’ll work on improving the performance - stay tuned.
Conclusion
The source code for this step is here.
I am using a static generator (Hugo) to build this site, so there is no comment section directly here. As a personal experiment, I published a short post on LinkedIn pointing to this article. If you have a question, you can ask it there. If you want to follow for updates, you can also follow me there.