Hi,
***Update 23/11/2011 – Issue identified, with thanks to Brian Carrig! Check the comments for the details ***
Just a short one to share something interesting I found while using VMware to run SQL Server in production. As soon as I get over the mad pre-Christmas rush, I hope to get posts out with more regularity.
While the following is true for all SQL Server deployments, I've found that on virtualised environments in particular, CPU pressure doesn't necessarily show up as 100% CPU usage. The situation I find myself in is that my overall CPU usage is quite low, but looking at SQL Server wait stats I can see that I am starting to experience CPU pressure.
(As an aside, Jonathan Kehayias [Blog | Twitter] and Ted Krueger [Blog | Twitter] have written an excellent, free eBook on troubleshooting SQL Server, including a good read on wait stats. I highly recommend checking it out; it will make troubleshooting your servers so much easier. You can get it here.)
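Before getting to the wait stats themselves, one quick sanity check I find useful is the runnable task count on each scheduler. This is just a rough sketch of my own (not one of the queries from the eBook); a consistently non-zero runnable_tasks_count means tasks are sitting in a queue waiting for CPU time:
-- Tasks per scheduler: runnable_tasks_count is tasks that are ready
-- to run but still queued waiting for a slice of CPU
SELECT scheduler_id,
       current_tasks_count,
       runnable_tasks_count
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE'; -- user schedulers only, excludes hidden/internal ones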
Borrowing a script from someone much, much smarter than me (Glenn Berry [Blog | Twitter] and his SQL diagnostic queries), I can see from my previous baselining that the time spent waiting to actually get on the CPU is ramping up.
SELECT signal_wait_time_ms = SUM(signal_wait_time_ms),
       '%signal (cpu) waits' = CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms) AS NUMERIC(20,2)),
       resource_wait_time_ms = SUM(wait_time_ms - signal_wait_time_ms),
       '%resource waits' = CAST(100.0 * SUM(wait_time_ms - signal_wait_time_ms) / SUM(wait_time_ms) AS NUMERIC(20,2))
FROM sys.dm_os_wait_stats;
On the physical server, I was getting signal (CPU)/resource wait percentages of 0.03%/99.97%. On the virtual server, running the same query on the same deployment, I get 4.91%/95.09%. OK, so not a big deal yet; I'd only start to worry above 10% or so. So why am I moaning now?
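(As for the baselining I mentioned, it's nothing fancy. Something along these lines, snapshotting the DMV into a table on a schedule, is enough to compare the deltas over time. The table name dbo.WaitStatsBaseline is just an illustrative one of mine, not part of Glenn's scripts.)
-- First capture creates the snapshot table (dbo.WaitStatsBaseline is an assumed name)
SELECT GETDATE() AS capture_time, *
INTO dbo.WaitStatsBaseline
FROM sys.dm_os_wait_stats;

-- Subsequent captures append, so the cumulative counters can be diffed
INSERT INTO dbo.WaitStatsBaseline
SELECT GETDATE(), * FROM sys.dm_os_wait_stats;

-- Or reset the counters first for a clean measurement window
DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR);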
Well, in my company's case, virtualisation was a financial and admin-hours saving exercise. We wanted to consolidate hardware, management and licensing to streamline our budget. My plan for this server was to consolidate two existing physical servers onto one virtual instance. But if the server is now waiting over 100 times longer for CPU time than its physical counterpart, and I add in the same amount of applications again, am I going to end up in that 10-15% worry area? Does this mean I need to keep the two instances, so that what I save on VM licensing and hardware I end up spending anyway on additional instances? Or do I take the risk and hope I can fine-tune the queries further?
So I don't have a solution as of yet. The vSphere guys have a call logged with VMware to see if they have experienced this before, as it's such a drastic leap from the old servers. It could be that the rest of the hardware is so much faster than its standalone counterpart that the "bottleneck" (I use the term loosely, as I don't yet have a real problem) has moved to the CPU, whereas it may have been on disk/memory on the old servers. Alternatively, the scheduling layer between the VM and the hypervisor may be adding the delay, which is what I would like to get investigated.
So, on to the point of this post. Virtualising SQL Server can be a tricky business. There are plenty of courses and advice on offer, but I personally feel it's still at the stage where each workload is different, and there's no hard and fast rule as to whether you should virtualise or not. For me, my CPU usage according to Task Manager is down, yet my apps are having a harder time getting onto the CPUs. It's just something to watch out for.
Also, if anyone knows why this might be happening, I'd love to hear it.
Here's another handy query for this particular aspect: it gives you the average time each task is waiting on CPU. Again, on the physical server this registered as 0, and on the virtual server it was 8 (the time is in ms, but only accurate to the nearest whole millisecond).
SELECT signal_wait_time_ms = SUM(signal_wait_time_ms),
       'Average wait time' = SUM(signal_wait_time_ms) / SUM(waiting_tasks_count)
FROM sys.dm_os_wait_stats;
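One caveat with my version above: Glenn's full diagnostic scripts filter out benign background waits before doing this kind of maths. Something like the following keeps idle system waits from skewing the average (the exclusion list here is partial and purely illustrative, and the NULLIF guards against a divide-by-zero on a freshly cleared DMV):
SELECT signal_wait_time_ms = SUM(signal_wait_time_ms),
       'Average wait time' = SUM(signal_wait_time_ms) / NULLIF(SUM(waiting_tasks_count), 0)
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                        N'BROKER_TASK_STOP', N'BROKER_EVENTHANDLER',
                        N'SQLTRACE_BUFFER_FLUSH', N'REQUEST_FOR_DEADLOCK_SEARCH',
                        N'XE_TIMER_EVENT', N'CLR_AUTO_EVENT');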
Hope this helps on your quest to virtualise! And hopefully it doesn't put you off, because apart from this, it's actually great! We had a physical host die with two DB servers on it, and thanks to vSphere High Availability, no one noticed until we checked the alerts the next morning. It really is fantastic stuff!
Hope to have a more in depth post next time, once all these migrations are over with.
Thanks,
Stephen
Hi Stephen,
Are the VM servers experiencing any "CPU Ready Time"? You can check for this using resxtop/esxtop, from within vSphere if you have access, or from within Perfmon if you have VMware Tools installed on the server.
Cheers,
Brian
Hi Brian,
I was hoping you'd see this one!
I just had a look there, and the average across the board is about 20ms ready time for the VM, averaging about 8ms per vCPU across the 4 vCPUs (which is actually the same delay I'm seeing in SQL too).
So these add up in my mind, but perhaps I have it wrong?
As I understand it, Ready Time is the amount of time the VM was waiting for the physical CPU to let it on? So the fact that they match up probably makes sense?
Is there anything I can do about this? Or should I just be happy that it's as low as it is? I know 8ms isn't a long time; however, across 100K+ physical transactions (probably a million+ SQL transactions) a day it's affecting our throughput, and this is with only half of the services active!
Thanks,
Stephen
You have it spot on. This is the time the VM spends waiting for access to the host's logical CPU cores. The hypervisor needs to schedule all of a VM's vCPUs concurrently, so if your VM has four vCPUs and only three logical cores are available, it will still experience ready time waiting for that fourth core. That's before we worry about complications from logical vs physical cores (which the hypervisor is aware of). For normal applications hyper-threading is great but, as I'm sure you know, this is not always the case for OLTP systems.
Have you tried increasing the CPU shares allocated to the VM, or possibly going one step further and reserving a certain amount of CPU for the VM?
I'd also look at the available host resources. We have two six-core Intel X5690 processors in each of our hosts, and we find that we generally start to experience ready time once we allocate 6 or more vCPUs to a VM. We are running approximately 3 vCPUs for every available physical core; most of our VMs use 1-2 vCPUs, with some of the busier ones at 4 vCPUs and just two or three VMs at 8 vCPUs. For most of the VMs, ready time will not be a noticeable issue; where it does present a problem, increasing the number of shares provided to the VM has been sufficient to reduce it.
I'd also check what other VMs are accessing resources on the same host as your database server VM (i.e. do you have a "noisy neighbour" problem?). Again, reservations or shares will help with this.
Hope this helps!
I'm talking to the guys here now about this (I don't actually manage the ESX installation, hence my constant lack of knowledge on the matter!). It seems that we're overcommitting on vCPUs vs logical cores, so that would explain the wait times for the VM; it has to request access to all 4 cores every time it does something.
I'll mention the shares to them; in the short term I think we're going to try migrating the VM to a host which isn't overcommitted. I'm still not 100% sure on the hyper-threading; I believe if one logical core is tied up, it may leave the other logical core on that physical core idle, just in case it's needed? I'd sooner have fewer cores but know where I stand with functionality, but it's not my call!
Plus it's the caching that makes these new-generation CPUs so fast; I don't like the thought of halving it!
We're on Intel X6550s: 8 cores each, 2 per physical server, with hyper-threading giving 32 "cores" per box. I'll pass this on to them and see what they think!
Thanks again,
Stephen
If you have a host that's less busy, moving the VM to that host should work; failing that, increasing the shares will reduce the ready time for your VM (but increase it for others). Most VMs are probably not performance-sensitive enough to even notice such an issue, so it makes sense to prioritise the VMs that are.
I don't work with a lot of OLTP systems (we are mainly DSS and DW), but I've heard that the current generation of HT is good enough to offer a performance boost even for very busy OLTP systems. I don't have enough direct experience to say for sure, though. The VMware hypervisor is HT-aware, so it will try to schedule on physical cores if they are free, but may use the "extra" logical cores if they are not. The idle logical core then gets access to the physical core when it becomes free (say, while fetching after a cache miss). Obviously this is going to affect your CPU wait stats!
Incidentally, overcommitting vCPUs vs physical cores is very much encouraged for normal operation. It might shock you to learn that 5-8 vCPUs for every single physical core fits best practice guidelines!
http://www.yellow-bricks.com/2011/07/21/scale-upout-and-impact-of-vram-part-2/
But I think a busy OLTP database VM should be an exception to this …