It's Steal Time!
Once an administrator finds out about steal time, he can finally blame his cloud provider for stealing CPU time from his virtual machine!
Numerous articles on the subject say that this metric reflects the percentage of CPU time that the hypervisor does not give to the virtual machine. But what does that actually mean?
It means that the cloud provider is not only stealing time from its customers, it is also telling them about it. In this article I will talk about where this metric comes from, how it is calculated, and how to see its value on the hypervisor, as well as how a cloud provider can monitor steal time and whether it is possible to steal time from other processes.
I hope this article will help dispel the fog of war and break some misconceptions about steal time.
Along the way I will mercilessly link to the kernel sources and give excerpts of the source code.
Where does the virtual machine get steal time from?
To understand where the metric comes from, let's look at the kernel source code, starting with the utilities that show CPU utilization statistics, including steal time. All of these utilities take their data from /proc/stat. Column 9 is exactly what we are looking for:
cpu 1303987570 1816332 517174528 5342230565 3340275 0 235543645 67911003 0 0
cpu0 667533551 881380 288651431 2743891571 2162694 0 694626 29052880 0 0
cpu1 636454019 934951 228523097 2598338993 1177581 0 234849019 38858122 0 0
...
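By the way, you do not need any of those utilities to get at the raw counter. Here is a minimal sketch (not taken from any real tool) that reads the aggregate steal counter straight from /proc/stat; the values are cumulative since boot, in USER_HZ ticks (usually 1/100 s):

#include <stdio.h>

/* Minimal sketch: print the aggregate steal counter from /proc/stat.
 * The counters are cumulative since boot, in USER_HZ ticks; real tools
 * sample them twice and turn the delta into a percentage. */
int main(void)
{
    unsigned long long user, nice, system, idle, iowait, irq, softirq, steal;
    FILE *f = fopen("/proc/stat", "r");

    if (!f) {
        perror("fopen /proc/stat");
        return 1;
    }
    if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
               &user, &nice, &system, &idle, &iowait, &irq, &softirq, &steal) != 8) {
        fprintf(stderr, "unexpected /proc/stat format\n");
        fclose(f);
        return 1;
    }
    printf("steal = %llu ticks\n", steal);
    fclose(f);
    return 0;
}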
In the kernel, these statistics live in the same structure as all the other CPU time counters: user, system, and so on. What we are interested in is where this particular counter is incremented. That happens in the function account_steal_time():
/*
 * Account for involuntary wait time.
 * @cputime: the CPU time spent in involuntary wait
 */
void account_steal_time(u64 cputime)
{
    u64 *cpustat = kcpustat_this_cpu->cpustat;

    cpustat[CPUTIME_STEAL] += cputime;
}
Nothing very interesting so far: this function is just a thin wrapper that increments the counter, so we need to find out where it is called from.
It should be mentioned that everything here is about steal time for the KVM hypervisor on x86-64 Linux. Other architectures and hypervisors may have their own quirks.
account_steal_time() is called from steal_account_process_time():
static __always_inline u64 steal_account_process_time(u64 maxtime)
{
#ifdef CONFIG_PARAVIRT
    if (static_key_false(&paravirt_steal_enabled)) {
        u64 steal;

        steal = paravirt_steal_clock(smp_processor_id());
        steal -= this_rq()->prev_steal_time;
        steal = min(steal, maxtime);
        account_steal_time(steal);
        this_rq()->prev_steal_time += steal;

        return steal;
    }
#endif
    return 0;
}
This is where it gets interesting. Underneath paravirt_steal_clock() there is a macro that calls whatever function the time.steal_clock pointer refers to. In our case that is the KVM function kvm_steal_clock():
static u64 kvm_steal_clock(int cpu)
{
    u64 steal;
    struct kvm_steal_time *src;
    int version;

    src = &per_cpu(steal_time, cpu);
    do {
        version = src->version;
        virt_rmb();
        steal = src->steal;
        virt_rmb();
    } while ((version & 1) || (version != src->version));

    return steal;
}
This small piece of code tells us that the steal value is simply read from memory (src->steal), from a per-CPU structure that has already been initialized; the loop around it is a seqcount-style check that retries the read while the host is in the middle of updating the structure (odd version). Okay, so for some reason this counter is kept separately from all the other CPU statistics and only gets into them by copying.
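For reference, here is what that shared per-CPU structure looks like; the layout below is taken from the x86 KVM paravirt headers (arch/x86/include/uapi/asm/kvm_para.h) and reproduced for illustration, so check your kernel version for the exact definition:

struct kvm_steal_time {
    __u64 steal;      /* accumulated steal time, in nanoseconds */
    __u32 version;    /* seqcount-style counter, odd while the host is updating */
    __u32 flags;
    __u8  preempted;  /* set by the host while this vCPU is scheduled out */
    __u8  u8_pad[3];
    __u32 pad[11];
};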
Nearby, in the same file (arch/x86/kernel/kvm.c), we find the function that registers steal time reporting:
static void kvm_register_steal_time(void)
{
    int cpu = smp_processor_id();
    struct kvm_steal_time *st = &per_cpu(steal_time, cpu);

    if (!has_steal_clock)
        return;

    wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
    pr_info("stealtime: cpu %d, msr %llx\n", cpu,
        (unsigned long long) slow_virt_to_phys(st));
}
This is where the MSR called MSR_KVM_STEAL_TIME is initialized: the guest writes into it the physical address of its per-CPU steal_time structure, with the enable bit set. The MSR is described in the kernel's KVM documentation.
MSR stands for Model Specific Register. It is not some new KVM feature: real processors have model specific registers too, and the operating system can find out which ones are available via the CPUID instruction. MSRs can be read and written with msr-tools. So at this point you can grab the Intel Software Developer's Manual and start having fun ;).
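If you want to poke at it yourself, the msr kernel module exposes MSRs as /dev/cpu/<N>/msr, where a pread() at an offset equal to the register number reads that MSR (this is what rdmsr from msr-tools does underneath). Below is a small sketch that reads MSR_KVM_STEAL_TIME from inside a guest; the register number 0x4b564d03 is taken from the KVM MSR documentation, and the value you get back should match the address printed by dmesg below, with the enable bit set:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read MSR_KVM_STEAL_TIME on CPU 0 via the msr driver.
 * Needs "modprobe msr" and root privileges. */
#define MSR_KVM_STEAL_TIME 0x4b564d03

int main(void)
{
    uint64_t val;
    int fd = open("/dev/cpu/0/msr", O_RDONLY);

    if (fd < 0) {
        perror("open /dev/cpu/0/msr");
        return 1;
    }
    if (pread(fd, &val, sizeof(val), MSR_KVM_STEAL_TIME) != sizeof(val)) {
        perror("rdmsr");
        close(fd);
        return 1;
    }
    printf("MSR_KVM_STEAL_TIME = 0x%llx\n", (unsigned long long)val);
    close(fd);
    return 0;
}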
Check in the virtual machine that the code above actually executes at boot:
~# dmesg | grep steal
[ 0.000000] kvm-stealtime: cpu 0, msr 3fc23040
Cool! So our virtual machine registers a memory area through this MSR and then simply reads the steal value that the hypervisor fills in at the other end of the KVM code. This happens on every tick, along with updating all the other CPU statistics.
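To make the guest-side flow concrete, here is a rough sketch of what the tick accounting does with the functions we have already seen. This is a simplification: the real code lives in kernel/sched/cputime.c and is more involved, and account_remaining_time() is a placeholder, not a real kernel function:

/* Rough sketch, not the literal kernel code: on every tick the accounting
 * first subtracts whatever was stolen and only accounts the remainder. */
static void tick_accounting_sketch(u64 tick_ns)
{
    /* how much of this interval was stolen, ultimately via kvm_steal_clock() */
    u64 steal = steal_account_process_time(tick_ns);

    /* the rest is charged to user/system/idle as usual (placeholder) */
    account_remaining_time(tick_ns - steal);
}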
What the hypervisor knows
Now we know where the virtual machine reads the steal time value from. We need to find the other end: somewhere in the hypervisor there must be the code that serves MSR_KVM_STEAL_TIME and fills in the structure behind it.
When the guest writes to this MSR, the request is classified in the kvm_set_msr_common() function, which calls kvm_make_request() to queue a request in vcpu->requests. After that, vcpu_enter_guest() processes the request from the queue and runs the record_steal_time() function:
static void record_steal_time(struct kvm_vcpu *vcpu)
{
    struct kvm_host_map map;
    struct kvm_steal_time *st;

    if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
        return;
    ...
    st = map.hva +
        offset_in_page(vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS);
    ...
    if (st->version & 1)
        st->version += 1;  /* first time write, random junk */

    st->version += 1;

    smp_wmb();

    st->steal += current->sched_info.run_delay -
        vcpu->arch.st.last_steal;
    vcpu->arch.st.last_steal = current->sched_info.run_delay;

    smp_wmb();

    st->version += 1;
    ...
}
We see that the hypervisor does fill in our structure. Reasonable. The interesting part is that the value it reports as steal time is taken from one of the standard per-task scheduler statistics, run_delay:
struct sched_info {
#ifdef CONFIG_SCHED_INFO
    /* Cumulative counters: */

    /* # of times we have run on this CPU: */
    unsigned long           pcount;

    /* Time spent waiting on a runqueue: */
    unsigned long long      run_delay;

    /* Timestamps: */

    /* When did we last run on a CPU? */
    unsigned long long      last_arrival;

    /* When were we last queued to run? */
    unsigned long long      last_queued;
#endif /* CONFIG_SCHED_INFO */
};
The counter is described as "Time spent waiting on a runqueue". To understand what this means, we need to know a little about the states a task passes through in the run queue. There are four of them: first the task is enqueued, then it arrives on a CPU; after it has finished its work it departs from the CPU and either goes back into the queue, if it still has something to do, or is immediately dequeued. So run_delay is accounted at the moments when the task arrives on a CPU and when it is dequeued, and in both cases the value added is now() - last_queued.
It takes place here:
/*
 * Called when a task finally hits the CPU. We can now calculate how
 * long it was waiting to run. We also note when it began so that we
 * can keep stats on how long its timeslice is.
 */
static void sched_info_arrive(struct rq *rq, struct task_struct *t)
{
    unsigned long long now = rq_clock(rq), delta = 0;

    if (t->sched_info.last_queued)
        delta = now - t->sched_info.last_queued;
    sched_info_reset_dequeued(t);
    t->sched_info.run_delay += delta;
    t->sched_info.last_arrival = now;
    t->sched_info.pcount++;

    rq_sched_info_arrive(rq, delta);
}
And here:
/*
 * We are interested in knowing how long it was from the *first* time a
 * task was queued to the time that it finally hit a CPU, we call this routine
 * from dequeue_task() to account for possible rq->clock skew across CPUs. The
 * delta taken on each CPU would annul the skew.
 */
static inline void sched_info_dequeued(struct rq *rq, struct task_struct *t)
{
    unsigned long long now = rq_clock(rq), delta = 0;

    if (sched_info_on()) {
        if (t->sched_info.last_queued)
            delta = now - t->sched_info.last_queued;
    }
    sched_info_reset_dequeued(t);
    t->sched_info.run_delay += delta;

    rq_sched_info_dequeued(rq, delta);
}
In the first case, the time from queuing to the moment the task actually starts executing is accounted. This includes both the delay after the very first enqueue and every interval the task later spends waiting between departing from the CPU (while still runnable) and arriving on a CPU again. In other words, every interval during which our task was pushed off the CPU because other tasks needed CPU time.
The second case concerns the migration of a task to another CPU. last_queued is set every time the task is enqueued and is reset when it arrives on a CPU. So if last_queued is set, the task is waiting to get onto a CPU, meaning it has work to do. A task with last_queued set can be dequeued by Linux when it migrates to another CPU, and this is where the time spent waiting in the queue of a CPU on which the task may never have run at all is accounted. Here, "CPU" means a logical core of the processor, which the system sees as a CPU, and migration is the ordinary movement of a task between cores.
As you can see, there is no magic here. It also becomes clear why high competition for cores leads to steal time in virtualization, while, say, a change of CPU frequency does not. For example, if a vCPU thread spent 50 ms of the last second waiting on the hypervisor's runqueue, the guest will see roughly 5% steal for that second.
Importantly, this metric applies not only to virtual machines but to every process in the system. You can view it with the pidstat utility, in the %wait column:
# pidstat -t 1
20:58:53 UID TGID TID %usr %system %guest %wait %CPU CPU Command
20:58:54 0 29 - 0.00 1.00 0.00 0.00 1.00 3 ksoftirqd/3
20:58:54 0 - 29 0.00 1.00 0.00 0.00 1.00 3 |__ksoftirqd/3
...
20:58:54 0 1136624 - 0.00 2.00 0.00 0.00 2.00 19 qemu-system-x86
20:58:54 0 - 1136624 0.00 2.00 0.00 0.00 2.00 19 |__qemu-system-x86
20:58:54 0 1182650 - 4.00 0.00 0.00 1.00 4.00 24 qemu-system-x86
20:58:54 0 - 1182650 0.00 5.00 0.00 1.00 5.00 24 |__qemu-system-x86
20:58:54 0 2538896 - 0.00 1.00 0.00 0.00 1.00 7 kworker/7:2
20:58:54 2001 2614918 - 1.00 0.00 0.00 0.00 1.00 21 nova-compute
20:58:54 2001 - 2614918 1.00 0.00 0.00 0.00 1.00 21 |__nova-compute
20:58:54 0 2888693 - 0.00 13.00 39.00 0.00 45.00 34 qemu-system-x86
20:58:54 0 - 2888693 1.00 3.00 0.00 0.00 4.00 34 |__qemu-system-x86
20:58:54 0 - 2888749 1.00 2.00 5.00 0.00 8.00 20 |__CPU 0/KVM
20:58:54 0 - 2888752 0.00 1.00 6.00 0.00 7.00 32 |__CPU 1/KVM
20:58:54 0 - 2888754 0.00 3.00 1.00 1.00 4.00 30 |__CPU 2/KVM
20:58:54 0 - 2888756 0.00 0.00 20.00 0.00 8.00 26 |__CPU 3/KVM
20:58:54 0 - 2888758 3.00 2.00 3.00 0.00 8.00 35 |__CPU 4/KVM
20:58:54 0 - 2888759 0.00 2.00 4.00 0.00 4.00 23 |__CPU 5/KVM
20:58:54 0 2888695 - 9.00 14.00 9.00 0.00 32.00 34 qemu-system-x86
20:58:54 0 - 2888695 1.00 3.00 0.00 0.00 4.00 34 |__qemu-system-x86
20:58:54 0 - 2888710 1.00 0.00 0.00 0.00 1.00 5 |__msgr-worker-2
20:58:54 0 - 2888748 3.00 2.00 3.00 0.00 8.00 6 |__CPU 0/KVM
20:58:54 0 - 2888750 1.00 2.00 2.00 0.00 5.00 5 |__CPU 1/KVM
20:58:54 0 - 2888751 0.00 2.00 1.00 0.00 3.00 10 |__CPU 2/KVM
20:58:54 0 - 2888753 1.00 1.00 2.00 0.00 4.00 13 |__CPU 3/KVM
20:58:54 0 - 2888755 0.00 1.00 1.00 0.00 2.00 28 |__CPU 4/KVM
20:58:54 0 - 2888757 3.00 3.00 1.00 0.00 7.00 3 |__CPU 5/KVM
20:58:54 0 3042453 - 0.00 1.00 0.00 0.00 1.00 27 kworker/u74:0
20:58:54 0 - 3042453 0.00 1.00 0.00 0.00 1.00 27 |__kworker/u74:0
20:58:54 0 3043239 - 5.00 7.00 0.00 0.00 12.00 25 pidstat
20:58:54 0 - 3043239 5.00 7.00 0.00 0.00 12.00 25 |__pidstat
20:58:54 0 3082533 - 0.00 26.00 19.00 0.00 35.00 13 qemu-system-x86
20:58:54 0 - 3082533 0.00 7.00 0.00 0.00 7.00 13 |__qemu-system-x86
20:58:54 0 - 3082539 0.00 1.00 0.00 0.00 1.00 6 |__msgr-worker-0
20:58:54 0 - 3082540 1.00 0.00 0.00 0.00 1.00 6 |__msgr-worker-1
20:58:54 0 - 3082541 0.00 1.00 0.00 0.00 1.00 27 |__msgr-worker-2
20:58:54 0 - 3082555 1.00 0.00 0.00 0.00 1.00 14 |__tp_librbd
20:58:54 0 - 3082559 1.00 2.00 16.00 0.00 19.00 10 |__CPU 0/KVM
20:58:54 0 - 3082560 1.00 1.00 3.00 0.00 5.00 9 |__CPU 1/KVM
20:58:54 0 3108848 - 4.00 0.00 0.00 0.00 4.00 21 qemu-system-x86
20:58:54 0 - 3108848 0.00 4.00 0.00 0.00 4.00 21 |__qemu-system-x86
20:58:54 0 - 3108874 1.00 1.00 0.00 0.00 2.00 28 |__CPU 0/KVM
20:58:54 0 3343378 - 0.00 5.00 0.00 0.00 5.00 6 qemu-system-x86
20:58:54 0 - 3343378 0.00 5.00 0.00 0.00 5.00 6 |__qemu-system-x86
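By the way, if your kernel is built with CONFIG_SCHED_INFO, the same run_delay counter is exported per process in /proc/<pid>/schedstat, which is essentially where pidstat gets this %wait column from. A minimal sketch that reads it for a single process:

#include <stdio.h>

/* Print run_delay for one PID from /proc/<pid>/schedstat.
 * The three fields are: time spent on the CPU (ns), time spent waiting
 * on a runqueue (ns) -- our run_delay -- and the number of timeslices. */
int main(int argc, char **argv)
{
    char path[64];
    unsigned long long on_cpu_ns, run_delay_ns, timeslices;
    FILE *f;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/schedstat", argv[1]);
    f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }
    if (fscanf(f, "%llu %llu %llu", &on_cpu_ns, &run_delay_ns, &timeslices) != 3) {
        fprintf(stderr, "unexpected schedstat format\n");
        fclose(f);
        return 1;
    }
    printf("run_delay = %llu ns\n", run_delay_ns);
    fclose(f);
    return 0;
}

Sample it twice for a vCPU thread, divide the delta by the interval, and you get roughly the steal percentage that the guest will see on that vCPU.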
Is my cloud provider hiding the steal time?
No. But he could if he wanted to. The easiest way to do it is to disable steal time reporting to the virtual machines in the hypervisor settings. This case is easy to check with the cpuid utility:
# cpuid | grep steal
steal clock supported = true
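Under the hood, that line just checks a KVM feature bit in CPUID. A minimal sketch, assuming the standard KVM CPUID layout (feature leaf 0x40000001, with bit 5 of EAX being KVM_FEATURE_STEAL_TIME):

#include <cpuid.h>
#include <stdio.h>

/* Check the KVM_FEATURE_STEAL_TIME bit in the KVM CPUID feature leaf.
 * Only meaningful when running inside a KVM guest. */
#define KVM_CPUID_FEATURES     0x40000001u
#define KVM_FEATURE_STEAL_TIME 5

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    __cpuid(KVM_CPUID_FEATURES, eax, ebx, ecx, edx);
    printf("steal clock supported = %s\n",
           (eax & (1u << KVM_FEATURE_STEAL_TIME)) ? "true" : "false");
    return 0;
}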
cpuid shows quite a lot of potentially useful information, such as which instructions and features are supported. For more details, see the processor documentation.
In theory, of course, a provider could modify the hypervisor code and hand clients any value in this field. But in practice hardly anyone does: there is no obvious profit in it, the modification may lead to unfortunate consequences, and it would also have to be maintained.
Steal time in a virtual machine can be constantly present, and a few percent is not critical. If the virtual machine has a large number of cores and uses them actively, the probability of occasionally waiting for a CPU is many times higher, because your virtual CPUs are not the only processes on the hypervisor.
There is also a funny case where a client can steal CPU time from itself, and it is not even about two of its loaded virtual cores landing on the same host core. A qemu virtual machine has additional service threads, for example for disk operations, and steal time can appear because one of these threads competes with a virtual core.
From the cloud user's point of view, the only way to get rid of steal time is to force the migration of the virtual machine to another host. The problem is that cloud providers do not expose such an API, because these processes are difficult to control and customers do not need it that often. So the simplest option is to recreate the virtual machine: most likely it will land on a different virtualization host and the steal time will be gone. Another option is to resize the machine, which also triggers a migration and effectively recreates it, though this may not work with every cloud backend.
I’m a provider, what should I do?
Provide clients with this metric, of course! And also monitor it yourself and make sure the steal time stays within reasonable bounds. Not so long ago the libvirt API gained support for steal time, which makes it possible to collect this statistic per virtual machine together with all the other data. So you can extend your favorite monitoring agent to report this metric.
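As an illustration, here is a sketch of pulling those per-vCPU delay statistics through the libvirt C API; it assumes libvirt 7.2.0 or newer (which added the per-vCPU delay parameters) and simply matches parameter names on "delay":

#include <libvirt/libvirt.h>
#include <stdio.h>
#include <string.h>

/* Sketch: print per-vCPU scheduler delay ("steal" as the host sees it)
 * for all running domains. Build with: cc vcpu_delay.c -lvirt */
int main(void)
{
    virConnectPtr conn = virConnectOpenReadOnly("qemu:///system");
    virDomainStatsRecordPtr *records = NULL;
    int nrecords, i, j;

    if (!conn)
        return 1;
    nrecords = virConnectGetAllDomainStats(conn, VIR_DOMAIN_STATS_VCPU,
                                           &records, 0);
    for (i = 0; i < nrecords; i++) {
        printf("%s\n", virDomainGetName(records[i]->dom));
        for (j = 0; j < records[i]->nparams; j++) {
            virTypedParameterPtr p = &records[i]->params[j];

            /* per-vCPU delay fields look like "vcpu.0.delay", in nanoseconds */
            if (strstr(p->field, ".delay") && p->type == VIR_TYPED_PARAM_ULLONG)
                printf("  %s = %llu ns\n", p->field,
                       (unsigned long long)p->value.ul);
        }
    }
    if (nrecords >= 0)
        virDomainStatsRecordListFree(records);
    virConnectClose(conn);
    return 0;
}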
It is convenient, to say the least. You can check every customer ticket against the steal time, and after every incident you can assess how many customers were really hurt and which ones were affected the most. After all, as we have seen, if a virtual machine did not need the hypervisor's CPU at the moment the CPU load increased, that virtual machine was not affected. No need for the CPU means no steal time.
You can also automate the balancer's work based on this metric. Classic CPU utilization monitoring may not show the real picture. For example, a couple of large clients may be actively using the CPU without competing with each other, while several small virtual machines idle in the background on the same host. Utilization looks high, say 80%, and there are more virtual cores than real ones, but there is no steal time because there is no competition for the cores. If the balancer migrated machines based on the observed steal time rather than on raw utilization, this situation would remain acceptable for most customers.
Much depends, of course, on the approach a particular provider takes. But this metric is good for a quick quality assessment, even though, as always, it does not give the whole picture.
So what do we have in the end?
The bottom line is that the steal time metric exists and it really does reflect the percentage of CPU time the virtual machine is missing. Time can only be "missed" when the machine was actually trying to execute on the CPU but had to wait because of competition for this resource. The hypervisor itself reports the value to the virtual machine, and on the hypervisor side steal time is not some special virtual machine metric but a standard statistic of any Linux process.
While competition for the CPU is always the underlying cause, steal time can appear for different reasons; it is not necessarily noisy neighboring virtual machines. Cloud clients can recreate a machine to move it to another host, and cloud providers can keep track of their clients' steal time and act on what they see.
- First steal time implementation in KVM: KVM steal time implementation.
- Steal time in libvirt 7.2.0: qemu: add per-vcpu delay stats.