How to use Vulkan Timestamp Queries.

At some point in my development life, I decided that I want to know how fast my GPU can crack the tasks I throw at it. And if you read this then you probably want to know it too. I expect that you know that measure a GPU time on the CPU makes no sense since CPU and GPU are two different processors and if you place std::chrono::some_clock::now() before and after a draw call it will always give you a time near to zero. Luckily Vulkan provides a tool to make time snapshots - the so-called Queries. There’re multiple different query types but today we’ll talk about Timestamp Queries.

Before I started to dig the Holy Specification I thought that timestamps on the GPU are quite the same as on the CPU (with for example, std::chrono) - all I need is a special command to write the current time then write another set of commands - the actual work whose duration needs to be measured, and finally, close it with another write-time command. Next, somehow I need to read the time values back and voila. Of course, Vulkan would not be Vulkan if it was so easy, there’re so many details to take into account. As usual, the Specification is pretty dry - it just provides all possibilities of the API without going deep. And I was surprised to find that there’s almost zero information on the web about the topic. So after a couple of days of rereading the Spec and poking people with stupid questions, I want to share my findings which hopefully will be useful to somebody.

Queries

As I already mentioned, there’re multiple query types - Occlusion, Pipeline Statistics, Performance, etc. Some of them require a specific extension while others work out of the box. The Timestamp Queries belongs to the latter - we don’t need to enable anything.

Note: though Timestamp Queries don’t require an extension we still need to check their support with VkPhysicalDeviceLimits::timestampComputeAndGraphics (see the Specification).

To use queries we need a pool (of course!). The pools' creation is pretty straightforward - VkQueryPoolCreateInfo::queryType should be VK_QUERY_TYPE_TIMESTAMP and VkQueryPoolCreateInfo::queryCount should be the equal to number of timestamps we need and the struct is passed to vkCreateQueryPool(), nothing special here. So if we want to measure how long one draw call takes we need two query counts - one for the start and one for the end.

Getting Query Result

Things go interesting when we read our queries back - the function vkGetQueryPoolResults, besides usual device, queryPool, queryCount, firstQuery, accepts a buffer pData with size dataSize, some stride and of course flags. The size of the buffer as well as stride depends on flags. Possible variants for the flags are:

VK_QUERY_RESULT_64_BIT
VK_QUERY_RESULT_WAIT_BIT
VK_QUERY_RESULT_WITH_AVAILABILITY_BIT
VK_QUERY_RESULT_PARTIAL_BIT

If we don’t specify any flags the query default data type is uint32_t. The resolution of a query is defined with VkPhysicalDeviceLimits::timestampPeriod value. This is how the Specification describes it:

timestampPeriod is the number of nanoseconds required for a timestamp query to be incremented by 1.

VK_QUERY_RESULT_64_BIT

So if we have queries difference end - start and we want to know the time in nanoseconds we need to multiply that difference by VkPhysicalDeviceLimits::timestampPeriod. On my machine the period is 1, so every timestamp writes some number of nanoseconds (I don’t know when the timer starts). The maximum for uint32_t is 4294967295. In my case even if the first query will be 0ns (which is very unlikely) than I can measure only 4294967295 nanoseconds between two timestamps. Or ~0.43 seconds. Of course, the first query will be somewhere between 0 and uint32_t::max() and if I measure heavy work it’s very likely to hit an integer overflow and the final result will be wrong. So I’d recommend to always use VK_QUERY_RESULT_64_BIT flag.

VK_QUERY_RESULT_WAIT_BIT and VK_QUERY_RESULT_WITH_AVAILABILITY_BIT

Next. You probably know that when you write a Vulkan command and submit it’s not get executed immediately. Actually, you can’t know when it starts to execute and finishes. And of course, you can’t just read queries at an arbitrary point in time because you want to be sure the values are in place. One solution would be to use VK_QUERY_RESULT_WAIT_BIT. From the flag name, it’s clear that the CPU will wait until all queries are written. There’re use cases for this - for example, it would be super easy to fire a single heavy dispatch and wait for its completion on CPU. But in loop applications like games, we can’t afford this luxury so this flag is not for us. VK_QUERY_RESULT_WITH_AVAILABILITY_BIT can be used instead. In this case, each frame you check if the value is available on the host and don’t issue new write command until you read previous so it’s quite possible that some frames may be missed. And here’re the rules for this flag:

in addition to query value a special availability value is written after the query value. The non-zero value means that the query is available.
if nothing is said explicitly the size of the availability value is uint32_t. If VK_QUERY_RESULT_64_BIT is used the size of availability value is uint64_t.

That means that for 2 queries the buffer can be:

if no flags are used the size is 2 * sizeof(uint32_t) = 8bytes
if VK_QUERY_RESULT_WITH_AVAILABILITY_BIT only is used the size is 2 * sizeof(uint32_t) + 2 * sizeof(uint32_t) = 16bytes
if VK_QUERY_RESULT_64_BIT only is used the size is 2 * sizeof(uint64_t) = 16bytes
if both VK_QUERY_RESULT_64_BIT and VK_QUERY_RESULT_WITH_AVAILABILITY_BIT are used the size is 2 * sizeof(uint64_t) + 2 * sizeof(uint64_t) = 32bytes

Another approach to avoid waiting and don’t mess with availabilities is to use multiple query pools as we do with command buffers and descriptor pools. In that case, at the frame start, we read from the current pool then reset it, then record timestamp commands. In the next frame we do the same but for another pool and then repeat.

VK_QUERY_RESULT_PARTIAL_BIT

This flag is not used with timestamp queries.

Writing a Timestamp

Another part that led me to the confusion was how to write a timestamp itself. The only function has the following signature:

void vkCmdWriteTimestamp(
    VkCommandBuffer                             commandBuffer,
    VkPipelineStageFlagBits                     pipelineStage,
    VkQueryPool                                 queryPool,
    uint32_t                                    query);

where commandBuffer is, well, a command buffer, queryPool is a pool we’re working with in the current frame, query is the index of query we want to write and pipelineStage - mysterious stage flag. The suffix bits tells us that only one flag should be used and this is what the Spec tells us:

vkCmdWriteTimestamp latches the value of the timer when all previous commands have completed executing as far as the specified pipeline stage, and writes the timestamp value to memory.

So what does it mean? That means that the timer value will not be written until all previously submitted commands reached the specified stage. Let’s look at different scenarios to better understand:

We want to measure how long does it take to run a vertex shader. We want the first timestamp to be written before the vertex shader begins so the first flag should be VK_PIPELINE_STAGE_VERTEX_SHADER_BIT. The second timestamp must be written when the vertex shader finishes. We open Pipeline Stages Order part in the Specification and see that after VK_PIPELINE_STAGE_VERTEX_SHADER_BIT follows VK_PIPELINE_STAGE_TESSELLATION_CONTROL_SHADER_BIT so we could use this flag.
```
    vkCmdDrawIndexed();
    vkCmdWriteTimestamp(VK_PIPELINE_STAGE_VERTEX_SHADER_BIT); // says: write a timestamp as soon as all previous commands (we have only one - vkCmdDrawIndexed) reach STAGE_VERTEX_SHADER
    vkCmdWriteTimestamp(VK_PIPELINE_STAGE_TESSELLATION_CONTROL_SHADER_BIT); // says: write a timestamp as soon as all previous commands reach STAGE_TESSELLATION_CONTROL_SHADER
    
```
Though this could work in theory in practice we rarely have a single draw command. The more realistic example follows.
This time we call draw commands in a loop and we want to know how long does it take to run the same vertex shader. Or do we really want to know that? Does it make sense at all? Let’s take an abstract GPU and look at what’s happening when we submit a batch of work. Let’s imagine we submit 100 draw calls. Since our GPU has a limited number of processing units it will not launch all draw calls at the same time. Let’s say it will take 10 commands from the queue, execute the vertex shader, and move them to the next pipeline stage. Now it can start another 10 commands and so on. At the end, when the last draw calls' vertex shader finishes it’s quite possible that the first draw call already passed through the fragment shader. So it turns out that the total measured time will include all pipeline stages for some draw calls. The same can be true even for a single draw call - imagine we submit a HUGE model, millions of vertices - while some triangles are processed, others already can be rasterized. Of course, this is very very rough approximation how a real hardware works but you’ve got the idea that this code makes no sense:
```
    for(0..N)
    {
        vkCmdDrawIndexed();
    }
    vkCmdWriteTimestamp(VK_PIPELINE_STAGE_VERTEX_SHADER_BIT);
    vkCmdWriteTimestamp(VK_PIPELINE_STAGE_TESSELLATION_CONTROL_SHADER_BIT);
    
```
Now we know that measure a single stage doesn’t work well with timestamps but we still want some numbers. We can measure the entire pipeline for a set of draw calls and this is how we can do it:
```
    vkCmdWriteTimestamp(VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT); // says: write a timestamp as soon as all previous commands were taken by the command processor
    for(0..N)
    {
        vkCmdDrawIndexed();
    }
    vkCmdWriteTimestamp(VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT); // says: write a timestamp as soon as all previous commands retire
    
```
Here we want to put the first timestamp before the work starts. VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT is a special kind of stage and means that the command is parsed by a command processor. VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT is a special thing as well and indicates a stage where commands “leave a GPU” after work.

The final example is closer to real-life than previous scenarios:

    // work 1
    for(0..N)
    {
        vkCmdDrawIndexed();
    }

    vkCmdWriteTimestamp(VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT); // says: write a timestamp as soon as all previous commands were taken by the command processor
    // work 2
    for(0..N)
    {
        vkCmdDrawIndexed();
    }
    vkCmdWriteTimestamp(VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT); // says: write a timestamp as soon as all previous commands retire

A small diagram showing how the work can be organized in the GPU:

                                   timestamp
                                       ↓
top       work1 work1 work1 work1 work1 work2 work2 work2 work2 work2
vertex          work1 work1 work1 work1 work1 work2 work2 work2 work2 work2
fragment              work1 work1 work1 work1 work1 work2 work2 work2 work2 work2
bottom                      work1 work1 work1 work1 work1 work2 work2 work2 work2 work2
                                                                                       ↑
                                                                                   timestamp

---------------------------------------|-----|-----------------------------------|-----|----------> time
                                       t1    t2                                  t3    t4

Here we want to measure the second set of draw commands. In this - almost ideal setup you can see that the first timestamp is written as soon as all previous commands reached the TOP_OF_PIPE stage (t1) which is a little bit earlier than the first vertex processing happens (t2) for the first command from work2. The same goes for the second timestamp - after the last command from work2 (t3), some time passes before the command retires at the BOTTOM_OF_PIPE stage (t4). Hopefully, you see that the measured time is not strictly correct - we have some time before and after work. Additionally, we measure a part of work1. Though it’s not that bad here things can be pretty frustrating. Consider this:

                       timestamp
                           ↓
top       work1 work1 work1 work2 work2 work2
vertex          wooooooork1 wooooooork1 work2 wooooooork1 work2 work2
fragment                    wooooooork1 wooooooork1 work2 work2 work2 wooooooork1 
bottom                                  wooooooork1 wooooooork1 work2 work2 work2 wooooooork1
                                                                                             ↑
                                                                                         timestamp

---------------------------|-----------|-----------------------------------------|-----------|----> time
                           t1          t2                                        t3          t4

Here work1 is somewhat heavy and by the nature of GPUs the work will not be done sequentially but the commands will be mixed. The first timestamp is written when all the previous commands have reached TOP_OF_PIPE (as before) but since the tasks from work1 are time-consuming the actual work we wanted to measure starts much later. Also can happen that the work1 finishes after the work2 (despite the fact it was queued earlier) but because of the stage flag VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT for the second timestamp we have to wait for it. The resulting time in this case is far from the real.

Conclusion

Timestamp Queries can’t give us a precise time. So should we use it? Definitely! Though measurements are not 100% correct they show at least approximate load on a GPU. In the last example, we can measure both work1 and work2 and see how numbers change when we increase or decrease the amount of work. Also if you do scientific research and really need to know the exact time you can always wait for something - either for query result or when a device becomes idle.

Note: I found it interesting that on my GPU all timestamps are divisible by 1024. And if there’re two timestamps but no work between them I get 1024 or 2048 nanoseconds. I don’t know what is it but will try to guess that the internal clock has some specific resolution. Or maybe timestamps are written with a certain frequency?

August 8, 2020

Queries

Getting Query Result

Writing a Timestamp

Conclusion