August 17, 2025

Hello, Neural Network 1.5! Float vs Double.

Last time I added a brief note about choosing float over double, where I said that float is the better choice - because we’ll work mostly with small numbers, so precision is not an issue, and because GPUs prefer floats (at least, mass-consumer GPUs do). Also, logic says that float should be faster - basic arithmetic (addition, multiplication) should take the same number of cycles for both types, but memory throughput should be better with float, since more data fits into a single cache line (a typical 64-byte line holds 16 floats but only 8 doubles). I showed the performance for single-precision floating-point numbers - “When I came back, it had already finished with an average training time of 73071.55 ms.” But I did not measure the execution time when using double. In any case it should be slower, right? Right??

Float vs Double

Not a problem, let’s fill that gap. I reconfigure CMake with cmake .. -DUSE_DOUBLE=ON, rebuild the project, run it multiple times, average the results, and the execution time is… 56971.18 ms.


What. The. Hell? I was using standard C++, no fancy compiler options, and the code is quite simple. I tried cleaning the build folder, rebuilding, rerunning - the results were the same. The heaviest part is the dot products, so clearly something goes wrong there. So I made a small test with two huuuge vectors and performed a dot product on them. It didn’t show anything suspicious - floats were faster there.
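
The test was essentially this kind of micro-benchmark (a sketch - the vector size and the timing boilerplate here are arbitrary, not the exact code I ran):

#include <chrono>
#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

// Sketch of a contiguous dot-product micro-benchmark. With both vectors
// packed tightly the compiler has no trouble with either type, and float
// does not lose to double here.
template <typename T>
double measureDotProductMs(std::size_t n) {
    std::mt19937 gen{42};
    std::uniform_real_distribution<T> dist{T{-1}, T{1}};

    std::vector<T> a(n);
    std::vector<T> b(n);
    for (std::size_t i{0}; i < n; ++i) {
        a[i] = dist(gen);
        b[i] = dist(gen);
    }

    auto const start{std::chrono::steady_clock::now()};

    T sum{};
    for (std::size_t i{0}; i < n; ++i) {
        sum += a[i] * b[i];
    }

    auto const end{std::chrono::steady_clock::now()};

    std::cout << "sum = " << sum << '\n';  // keep the loop from being optimized away
    return std::chrono::duration<double, std::milli>(end - start).count();
}

int main() {
    constexpr std::size_t n{10'000'000};  // arbitrary "huuuge" size
    std::cout << "float:  " << measureDotProductMs<float>(n) << " ms\n";
    std::cout << "double: " << measureDotProductMs<double>(n) << " ms\n";
}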

But I can’t let this go without knowing what’s going on. I didn’t want to, but now I have to look at the assembly.
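
If you want to look at the generated code yourself, you can ask the compiler for an Intel-syntax listing - something like this (the file name and optimization level are placeholders, not necessarily what my build used):

g++ -O2 -S -masm=intel -fverbose-asm Layer.cpp -o Layer.s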

I’ll focus on the Layer::activate() function, and specifically on the accumulation line z += currNeuron.weights[j] * prevNeuron.value;:

[[nodiscard]] auto Layer::activate(Layer const& prevLayer,  //
                                   std::function<auto(Float)->Float> const& activationFunction  //
                                   ) noexcept -> bool {
    for (auto& currNeuron : neurons) {
        Float z{currNeuron.bias};

        for (size_t j{0}; j < prevLayer.neurons.size(); ++j) {
            if (currNeuron.weights.size() < prevLayer.neurons.size()) {
                return false;  // Not enough weights for the neuron
            }

            auto const& prevNeuron{prevLayer.neurons[j]};

            z += currNeuron.weights[j] * prevNeuron.value;
        }

        currNeuron.value = activationFunction(z);
    }

    return true;
}

Here’s the assembly for the double version:

movsd    xmm0, QWORD PTR [rdx]      # load prevLayer.neurons[j].value
movupd   xmm3, XMMWORD PTR [rax]    # load two contiguous weights
add      rax,  16                   # next 2 weights
add      rdx,  80                   # advance 2 neurons (2 x 40 bytes)
movhpd   xmm0, QWORD PTR -40[rdx]   # pack prevLayer.neurons[j+1].value into the high lane
mulpd    xmm0, xmm3                 # multiply both weight/value pairs at once
addsd    xmm1, xmm0                 # accumulate the low product into z
unpckhpd xmm0, xmm0                 # move the high product into the low lane
addsd    xmm1, xmm0                 # accumulate the second product into z

The first thing to note is the use of xmm registers. My compiler (gcc (Ubuntu 10.5.0-1ubuntu1~20.04) 10.5.0) is smart enough to detect the pattern and vectorize the multiplication (the accumulation is still done with scalar adds, one lane at a time). What’s interesting is how the data access is handled. add rax, 16 - for the weights, it simply loads 2 consecutive doubles. That’s perfectly fine, since the weights are stored in a contiguous vector (std::vector<Float> Neuron::weights).

The values, on the other hand, are stored one per neuron (Float Neuron::value), and neurons themselves are stored in a contiguous vector (std::vector<Neuron> Layer::neurons). Let’s recall how the Neuron class is declared:

class Neuron {
public:
    Float value{0.0};
    Float bias{0.0};
    std::vector<Float> weights{};
};

On my machine (and most likely on yours too), the size of a std::vector is 24 bytes (three pointers on a 64-bit system). The double value and double bias together take another 16 bytes, which gives a total of 40 bytes per Neuron. This explains the -40 in the movhpd xmm0, QWORD PTR -40[rdx] line and the 80 in the add rdx, 80 line - for 2 weights, the code reads 2 neurons and packs their value fields into a single 16-byte register. Then it performs the multiplication on 2 doubles at once.
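
You can sanity-check these numbers with a couple of static_asserts (the values hold for libstdc++ on x86-64; they are implementation details, not guarantees of the standard):

#include <vector>

// Layout check for the double configuration.
class Neuron {
public:
    double value{0.0};
    double bias{0.0};
    std::vector<double> weights{};
};

static_assert(sizeof(std::vector<double>) == 24, "three pointers");
static_assert(sizeof(Neuron) == 40, "8 (value) + 8 (bias) + 24 (vector)");
// Two neurons are consumed per vectorized iteration -> add rdx, 80;
// the second neuron's value sits 40 bytes further -> movhpd ... -40[rdx].

int main() {}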

Now let’s look at the generated assembly for the float version:

movss  xmm0, DWORD PTR [rdx]   # load prevLayer.neurons[j].value
mulss  xmm0, DWORD PTR [rax]   # * currNeuron.weights[j]
add    rax,  4                 # next weight
add    rdx,  32                # next neuron (stride = 32 bytes)
addss  xmm1, xmm0              # accumulate z

Though it uses the same xmm registers, it does not use the full width. Instead of packing 4 floats at once (a 16-byte register can fit 4 floats), it loads a single weight and a single value. No vectorization here at all! Why does it do it like that? I don’t know ¯\_(ツ)_/¯.
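
One way to get at least a hint is to ask GCC itself - the -fopt-info family of flags prints the vectorizer’s notes; for example, this prints the missed opportunities together with the reasons (the file name is just a placeholder):

g++ -O2 -fopt-info-vec-missed -c Layer.cpp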

Probably it uses some heuristics and decides that gathering 4 strided floats into a single register (several separate loads plus shuffles) would be more expensive than packing 2 doubles (one load plus one movhpd). But I can say for sure that the so-called Array of Structures (AoS) layout, i.e., storing whole neurons in a vector, works against us here. Because of the stride between two adjacent value fields (32 bytes with float, 40 with double), the loads are not contiguous and the nice access pattern is broken.

If I were to use the Structure of Arrays (SoA), and store biases and values in dedicated vectors instead of a Neuron class, I’m pretty sure the picture would be different - in this case, the access pattern would be straightforward and the compiler could fully utilize SIMD instructions.
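
A rough sketch of what that could look like (an illustration only, not code from the project):

#include <cstddef>
#include <vector>

// Structure of Arrays: every field of the former Neuron class gets its own
// contiguous vector, so all the values of a layer are neighbors in memory.
struct LayerSoA {
    std::vector<float> values;
    std::vector<float> biases;
    std::vector<float> weights;  // prevSize weights per neuron, row by row
};

// The inner loop becomes a dot product over two contiguous arrays - a much
// friendlier pattern for the auto-vectorizer than the strided AoS access.
inline float weightedSum(LayerSoA const& prev, LayerSoA const& curr, std::size_t neuronIndex) {
    std::size_t const prevSize{prev.values.size()};
    float const* w{curr.weights.data() + neuronIndex * prevSize};

    float z{curr.biases[neuronIndex]};
    for (std::size_t j{0}; j < prevSize; ++j) {
        z += w[j] * prev.values[j];
    }

    return z;
}

Flattening all weights of a layer into one vector is just one possible arrangement; the important part is that the value loads become contiguous.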

Is it possible to fix that? Yes - as mentioned, replacing the Neuron class with something else would work. But I will not do it here. Remember, this is a first-step naive implementation. In my next attempt, I’ll use a highly optimized SIMD library for the math, and more. So stay tuned.

Sorry for the wasted traffic, but I’ll share my favorite C++ memes, which I remember every time I get frustrated - like right now:




Conclusion

This was a brief explanation of why doubles turned out to be faster than floats in my particular case. If you spot any errors, please let me know. There’s no code for this step.

I am using a static generator (Hugo) to build this site, so there is no comment section here. As a personal experiment, I published a short post on LinkedIn pointing to this article - if you have a question, you can ask it there. You can also follow me there for updates.
