Last time I added a brief note about choosing float over double, where I said that float is better - because we’ll work mostly with small numbers, so precision is not an issue, and because GPUs prefer floats (at least, mass-consumer GPUs). Logic also says that float should be faster: addition and subtraction should take the same number of cycles, but memory throughput should be better with float, since more values fit into a single cache line. I showed the performance for single-precision floating-point numbers - “When I came back, it had already finished with an average training time of 73071.55 ms.” But I never measured the execution time with double. In any case it should be slower, right? Right??
Float vs Double
Not a problem, let’s fill that gap. I reconfigure the build with cmake .. -DUSE_DOUBLE=ON, rebuild the project, run it multiple times, average the results, and the execution time is… 56971.18 ms.
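For context, USE_DOUBLE just flips the Float alias that the whole project uses. The exact wiring in my code may differ, but the idea is a one-liner like this (assuming the CMake option is forwarded as a compile definition of the same name):

```cpp
// Float is float by default; the USE_DOUBLE CMake option (forwarded as a
// compile definition) switches the whole project to double.
#ifdef USE_DOUBLE
using Float = double;
#else
using Float = float;
#endif
```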
What. The. Hell? I was using standard C++, no fancy compiler options, and the code is quite simple. I cleaned the build folder, rebuilt, and reran - the results were the same. Something is clearly going wrong in the dot products - they are the heaviest part. So I wrote a small test that performs a dot product over two huuuge vectors. It didn’t reproduce the problem - floats were faster there.
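The test was roughly this shape - a standalone sketch, not the exact benchmark I ran (size, values and timing details are made up for illustration):

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <vector>

using Float = float;  // or double, depending on the build

// Dot product over two large contiguous vectors - roughly the shape of my
// quick test. It did not reproduce the problem.
int main() {
    constexpr std::size_t n = 10'000'000;
    std::vector<Float> a(n, Float(0.5));
    std::vector<Float> b(n, Float(1.5));

    const auto start = std::chrono::steady_clock::now();
    const Float z = std::inner_product(a.begin(), a.end(), b.begin(), Float(0));
    const auto stop = std::chrono::steady_clock::now();

    std::cout << "z = " << z << ", "
              << std::chrono::duration<double, std::milli>(stop - start).count()
              << " ms\n";
}
```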
But I can’t leave without knowing what’s going on. I didn’t want to, but now I have to look at the assembly.
I’ll focus on the Layer::activate() function - namely, line 14: z += currNeuron.weights[j] * prevNeuron.value;
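For reference, the relevant part of the function looks roughly like this (a simplified sketch, not a verbatim copy of my code, so the line numbers won’t match exactly; the signature and the activation call are guesses):

```cpp
// Simplified sketch of Layer::activate(): for every neuron of the current
// layer, z is a dot product between its weights and the values of the
// previous layer's neurons, plus the bias.
void Layer::activate(const Layer& prevLayer) {
    for (auto& currNeuron : neurons) {
        Float z = currNeuron.bias;
        for (std::size_t j = 0; j < prevLayer.neurons.size(); ++j) {
            const Neuron& prevNeuron = prevLayer.neurons[j];
            z += currNeuron.weights[j] * prevNeuron.value;  // the line in question
        }
        currNeuron.value = activation(z);  // activation function - name is a guess
    }
}
```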
Here’s the assembly for the double version:
movsd xmm0, QWORD PTR [rdx] # load prevLayer.neurons[j].value
movupd xmm3, XMMWORD PTR [rax] # load two contiguous weights
add rax, 16 # next 2 weights
add rdx, 80 # skip 2 neurons (2 * 40 bytes)
movhpd xmm0, QWORD PTR -40[rdx] # pack prevLayer.neurons[j+1].value into the high lane
mulpd xmm0, xmm3 # multiply both weight/value pairs at once
addsd xmm1, xmm0 # accumulate the low lane into z
unpckhpd xmm0, xmm0 # move the high lane into the low lane
addsd xmm1, xmm0 # accumulate it as well
The first thing to note is the usage of xmm registers. My compiler (gcc (Ubuntu 10.5.0-1ubuntu1~20.04) 10.5.0) is smart enough to detect the pattern and vectorize the math. What’s interesting is how data access is handled. add rax, 16 - for weights, it just takes 2 consecutive doubles. That’s perfectly fine, since the weights are stored in a contiguous vector (std::vector<Float> Neuron::weights).
The values, on the other hand, are stored one per neuron (Float Neuron::value), and neurons themselves are stored in a contiguous vector (std::vector<Neuron> Layer::neurons). Let’s recall how the Neuron class is declared:
class Neuron {
public:
    Float value{0.0};
    Float bias{0.0};
    std::vector<Float> weights{};
};
On my machine (and most likely on yours too), the size of a std::vector equals 24 bytes. The double value and double bias together take 16 bytes, which gives a total of 40 bytes per Neuron. This explains the number in the movhpd xmm0, QWORD PTR -40[rdx] line and the 80 in the add rdx, 80 line - for 2 weights, the code reads 2 neurons and packs the data into a single 16-byte register. Then it performs multiplication on 2 doubles at once.
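If you want to double-check these numbers, a tiny program like this prints 8, 24 and 40 on a typical 64-bit Linux/libstdc++ setup (the vector size is implementation-specific, so your results may vary):

```cpp
#include <iostream>
#include <vector>

using Float = double;

class Neuron {
public:
    Float value{0.0};
    Float bias{0.0};
    std::vector<Float> weights{};
};

int main() {
    std::cout << "sizeof(Float)              = " << sizeof(Float) << '\n'               // 8
              << "sizeof(std::vector<Float>) = " << sizeof(std::vector<Float>) << '\n'  // 24
              << "sizeof(Neuron)             = " << sizeof(Neuron) << '\n';             // 8 + 8 + 24 = 40
}
```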
Now let’s look at the generated assembly for the float version:
movss xmm0, DWORD PTR [rdx] # load prevLayer.neurons[j].value
mulss xmm0, DWORD PTR [rax] # * currNeuron.weights[j]
add rax, 4 # next weight
add rdx, 32 # next neuron (stride = 32 bytes)
addss xmm1, xmm0 # accumulate z
Though it uses the same xmm registers, it does not use the full width. Instead of packing 4 floats at once (a 16-byte register can fit 4 floats), it loads a single weight and a single value. No vectorization here at all! Why does it do it like that? I don’t know ¯\_(ツ)_/¯.
Probably the compiler’s cost heuristics decide that gathering 4 strided floats into a single register would cost more than it saves, while gathering 2 strided doubles is still worth it. But I can say for sure that the so-called Array of Structures (AoS) layout - storing whole neurons in a vector - works against us here. Because of the stride between two adjacent values, the access pattern is broken.
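To make the trade-off visible, here is roughly what a 4-wide float version would have to do with the current layout - a hand-written sketch with SSE intrinsics, assuming the Neuron class from above with Float = float; this is neither my code nor what the compiler emits. The weights come from one packed load, but the four values have to be gathered one by one:

```cpp
#include <immintrin.h>  // SSE intrinsics
#include <cstddef>
#include <vector>

// Assumes the Neuron class shown above, built with Float = float.
float dot_aos_4wide(const Neuron& currNeuron, const std::vector<Neuron>& prev) {
    __m128 acc = _mm_setzero_ps();
    std::size_t j = 0;
    for (; j + 4 <= prev.size(); j += 4) {
        __m128 w = _mm_loadu_ps(&currNeuron.weights[j]);  // 4 contiguous weights, one load
        __m128 v = _mm_set_ps(prev[j + 3].value,          // 4 scalar loads, 32 bytes apart,
                              prev[j + 2].value,          // packed into one register by hand
                              prev[j + 1].value,
                              prev[j + 0].value);
        acc = _mm_add_ps(acc, _mm_mul_ps(w, v));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);                            // horizontal sum of the 4 lanes
    float z = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; j < prev.size(); ++j)                          // leftover elements
        z += currNeuron.weights[j] * prev[j].value;
    return z;
}
```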
If I were to use the Structure of Arrays (SoA), and store biases and values in dedicated vectors instead of a Neuron class, I’m pretty sure the picture would be different - in this case, the access pattern would be straightforward and the compiler could fully utilize SIMD instructions.
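For illustration, here is a hypothetical sketch of what an SoA layer could look like (not code from this project):

```cpp
#include <cstddef>
#include <vector>

using Float = float;

// Structure of Arrays: one contiguous array per field instead of a vector of
// Neuron objects. The inner loop becomes a plain dot product over two
// contiguous arrays - exactly the pattern compilers are happy to vectorize.
struct LayerSoA {
    std::vector<Float> values;
    std::vector<Float> biases;
    std::vector<std::vector<Float>> weights;  // weights[i] = incoming weights of neuron i
};

void activate(LayerSoA& curr, const LayerSoA& prev) {
    for (std::size_t i = 0; i < curr.values.size(); ++i) {
        Float z = curr.biases[i];
        const auto& w = curr.weights[i];
        for (std::size_t j = 0; j < prev.values.size(); ++j)
            z += w[j] * prev.values[j];  // both operands are contiguous now
        curr.values[i] = z;              // activation function omitted for brevity
    }
}
```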
Is it possible to fix that? Yes - as mentioned, replacing the Neuron class with an SoA layout would work. But I will not do it. Remember, this is a first-step naive implementation. In my next attempt, I’ll use a highly optimized SIMD library for the math - and more. So stay tuned.
Sorry for the wasted traffic, but I’ll share my favorite C++ memes, which I remember every time I get frustrated - like right now:
Conclusion
This was a brief explanation of why doubles can sometimes be faster than floats. If you spot any errors, please let me know. There’s no code for this step.
I am using a static generator (Hugo) to build this site, so there is no comment section directly here. As a personal experiment, I published a short post on LinkedIn pointing to this article. If you have a question, you can ask it there. If you want to follow for updates, you can also follow me there.