Last time I added a brief note about choosing `float` over `double`, where I said that `float` is the better choice: we’ll work mostly with small numbers, so precision is not an issue, and GPUs prefer floats (at least, mass-consumer GPUs). Logic also says that `float` should be faster - addition and subtraction should take the same number of cycles either way, but memory throughput should be better with `float`, since more values fit into a single cache line. I showed the performance for single-precision floating-point numbers - “When I came back, it had already finished with an average training time of 73071.55 ms.” But I did not measure the execution time when using `double`. In any case it should be slower, right? Right??
## Float vs Double
Not a problem, let’s fill that gap. I reconfigure with `cmake .. -DUSE_DOUBLE=ON`, rebuild the project, run it multiple times, averaging the result, and the execution time is… 56971.18 ms.


What. The. Hell? I was using standard C++, no fancy compiler options, and the code is quite simple. I tried cleaning the build folder, rebuilding, and rerunning - the results were the same. Clearly something odd happens in those dot products, which are the heaviest part of the code. So I made a small standalone test with two huuuge vectors and performed a dot product on them. It didn’t show anything suspicious - `float`s were faster.
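For reference, here is a minimal sketch of that kind of standalone test - not the exact code I used, just the same idea: a plain dot product over two large, contiguous vectors, timed with `std::chrono`. With contiguous data the `float` version does come out ahead.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

using Float = float; // switch to double to compare

int main() {
    const std::size_t n = 50'000'000; // adjust to taste; "huuuge" is the point
    std::vector<Float> a(n, Float(1.0001)), b(n, Float(0.9999));

    const auto start = std::chrono::steady_clock::now();
    Float z = 0;
    for (std::size_t i = 0; i < n; ++i)
        z += a[i] * b[i]; // contiguous, unit-stride access for both arrays
    const auto end = std::chrono::steady_clock::now();

    const auto ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::printf("dot = %f, %lld ms\n", double(z), static_cast<long long>(ms));
    return 0;
}
```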
But I can’t let this go without knowing what’s going on. I didn’t want to, but now I have to look at the assembly.
I’ll focus on the `Layer::activate()` function, namely line 14: `z += currNeuron.weights[j] * prevNeuron.value;`
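For context, here is roughly what the loop around that line looks like - a reconstruction based on the snippets in this post (the exact signature of `activate()` is my guess, and the `Neuron` class is repeated so the sketch compiles on its own), not a verbatim copy of the source:

```cpp
// Rough reconstruction of the relevant part of Layer::activate();
// names come from the snippets in this post, details may differ.
#include <cstddef>
#include <vector>

using Float = double;

class Neuron {
public:
    Float value{0.0};
    Float bias{0.0};
    std::vector<Float> weights{};
};

struct Layer {
    std::vector<Neuron> neurons;

    void activate(const Layer& prevLayer) {
        for (Neuron& currNeuron : neurons) {
            Float z = currNeuron.bias;
            for (std::size_t j = 0; j < prevLayer.neurons.size(); ++j) {
                const Neuron& prevNeuron = prevLayer.neurons[j];
                z += currNeuron.weights[j] * prevNeuron.value; // the line we look at
            }
            currNeuron.value = z; // the real code applies an activation function here
        }
    }
};
```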
Here’s the assembly for the `double` version:
```asm
movsd    xmm0, QWORD PTR [rdx]     # load prevLayer.neurons[j].value
movupd   xmm3, XMMWORD PTR [rax]   # load two contiguous weights
add      rax, 16                   # next 2 weights
add      rdx, 80                   # skip 2 neurons (2 × 40 bytes)
movhpd   xmm0, QWORD PTR -40[rdx]  # pack prevLayer.neurons[j+1].value into the high lane
mulpd    xmm0, xmm3                # multiply both weight/value pairs at once
addsd    xmm1, xmm0                # accumulate the low lane into z
unpckhpd xmm0, xmm0                # move the high lane down
addsd    xmm1, xmm0                # accumulate the high lane into z
```
The first thing to note is the use of `xmm` registers. My compiler (`gcc (Ubuntu 10.5.0-1ubuntu1~20.04) 10.5.0`) is smart enough to detect the pattern and vectorize the math. What’s interesting is how the data access is handled. `add rax, 16` - for the weights, it simply takes 2 consecutive doubles. That’s perfectly fine, since the weights are stored in a contiguous vector (`std::vector<Float> Neuron::weights`).
The values, on the other hand, are stored one per neuron (`Float Neuron::value`), and the neurons themselves are stored in a contiguous vector (`std::vector<Neuron> Layer::neurons`). Let’s recall how the `Neuron` class is declared:
```cpp
class Neuron {
public:
    Float value{0.0};
    Float bias{0.0};
    std::vector<Float> weights{};
};
```
On my machine (and most likely on yours too), `sizeof(std::vector)` is 24 bytes. The `double value` and `double bias` together take 16 bytes, which gives a total of 40 bytes per `Neuron`. This explains the -40 in the `movhpd xmm0, QWORD PTR -40[rdx]` line and the 80 in the `add rdx, 80` line: to process 2 weights, the code reads the values from 2 neurons sitting 40 bytes apart and packs them into a single 16-byte register, then multiplies 2 doubles at once.
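If you want to verify those numbers on your machine, a couple of `sizeof` prints next to the class declaration above will do it (the commented values are what I’d expect from a typical 64-bit build):

```cpp
#include <cstdio>
#include <vector>

// Assumes the Neuron class and the Float alias from above are in scope.
int main() {
    std::printf("sizeof(std::vector<Float>) = %zu\n", sizeof(std::vector<Float>)); // 24
    std::printf("sizeof(Neuron)             = %zu\n", sizeof(Neuron)); // 40 with double, 32 with float
    return 0;
}
```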
Now let’s look at the generated assembly for the `float` version:
```asm
movss xmm0, DWORD PTR [rdx]  # load prevLayer.neurons[j].value
mulss xmm0, DWORD PTR [rax]  # * currNeuron.weights[j]
add   rax, 4                 # next weight
add   rdx, 32                # next neuron (stride = 32 bytes)
addss xmm1, xmm0             # accumulate z
```
Though it uses the same `xmm` registers, it does not use their full width. Instead of packing 4 floats at once (a 16-byte register fits 4 floats), it loads a single weight and a single value. No vectorization here at all! Why does it do it like that? I don’t know ¯\_(ツ)_/¯.
It probably uses some heuristic and decides that packing 4 floats into a single register would be more expensive than packing 2 doubles. But I can say for sure that the so-called Array of Structures (AoS) layout - storing whole neurons in a vector - works against us here: because of the stride between two adjacent values, the access pattern is broken.
If I were to use a Structure of Arrays (SoA) instead, storing biases and values in dedicated vectors rather than in a `Neuron` class, I’m pretty sure the picture would be different: the access pattern would be straightforward, and the compiler could fully utilize SIMD instructions.
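To make that concrete, here is a hypothetical SoA layout - an illustration of the idea, not code from this series (the `LayerSoA` name and the flat, row-major `weights` array are my own choices):

```cpp
// Hypothetical SoA layout: values, biases, and weights live in flat,
// contiguous arrays, so the inner loop of the dot product walks both
// arrays with unit stride - the pattern compilers auto-vectorize well.
#include <cstddef>
#include <cstdio>
#include <vector>

using Float = float;

struct LayerSoA {
    std::vector<Float> values;   // one value per neuron
    std::vector<Float> biases;   // one bias per neuron
    std::vector<Float> weights;  // all weights, row-major: weights[n * prevSize + j]
};

// Both pointers advance by sizeof(Float) per iteration - no 32/40-byte strides.
Float dot(const Float* w, const Float* v, std::size_t n) {
    Float z = 0;
    for (std::size_t i = 0; i < n; ++i)
        z += w[i] * v[i];
    return z;
}

int main() {
    LayerSoA prev{{1, 2, 3, 4}, {0, 0, 0, 0}, {}};
    LayerSoA curr{{0}, {0.5f}, {0.1f, 0.2f, 0.3f, 0.4f}}; // one neuron, 4 weights
    curr.values[0] =
        curr.biases[0] + dot(curr.weights.data(), prev.values.data(), prev.values.size());
    std::printf("neuron 0 value = %f\n", curr.values[0]);
    return 0;
}
```

With this layout, the weights and the values of the previous layer are both read sequentially, so nothing stops the compiler from packing 4 floats per register.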
Is it possible to fix that? Yes - as mentioned, replacing the `Neuron` class with something else would work. But I will not do it. Remember, this is a first-step naive implementation. In my next attempt, I’ll use a highly optimized SIMD library for the math and even more. So stay tuned.
Sorry for the wasted traffic, but I’ll share my favorite C++ memes, which I remember every time I get frustrated - like right now:






## Conclusion
This was a brief explanation of why `double`s can sometimes be faster than `float`s. If you spot any errors, please let me know. There’s no code for this step.
I am using a static generator (Hugo) to build this site, so there is no comment section directly here. As a personal experiment, I published a short post on LinkedIn pointing to this article. If you have a question, you can ask it there. If you want to follow for updates, you can also follow me there.