Sunday, June 6, 2010

xen time drift and ntp

Just a quick post about how I coerced some Xen machines that could not keep time into keeping time; maybe someone will comment with a better way, or this will help someone else. I believe the problem crops up when you have more guest machines than processor cores in the host, but regardless of the cause, the symptom is massive time drift on the DomU machines (the guest VMs). I had a machine that was counting to 5 whilst everyone else was counting to 4 (1 one thousand, 2 one thousand, 3 one thousand, 4 one thousand...). Over the course of an evening this could get it way out of sync, to the point of becoming more than just an annoyance.

Googling the issue, there seem to be many people with similar problems, but there didn't seem to be a solution out there other than to dig into what is wrong with the Xen machine's clock. There I found everything from "my BIOS clock is wrong" (not my problem) to advice about interrupts and how they can be flaky when virtualized. One clever guy had even figured out the correct tick ratio and set it with tickadj, getting close enough to real time that ntpd could adjust the clock correctly. This seemed like it might be fragile in my case, as I am not sure the clock is predictably wrong, just that it is wrong.
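
To put a rough number on the drift, here is a small sketch that turns two elapsed-time measurements into parts per million. The function name drift_ppm is my own invention for illustration. For comparison, ntpd on its own can only slew a clock by about 500 ppm, which is why a clock that counts to 5 while the reference counts to 4 is far outside what it can quietly correct.

```shell
#!/bin/sh
# drift_ppm REF_ELAPSED GUEST_ELAPSED   (hypothetical helper)
# Prints the guest clock's drift in parts per million (integer math),
# positive when the guest runs fast, negative when it runs slow.
drift_ppm() {
    ref=$1
    guest=$2
    echo $(( (guest - ref) * 1000000 / ref ))
}

# The guest counts to 5 while the reference counts to 4:
drift_ppm 4 5    # prints 250000
```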

After much digging into how ntp works I found a solution that works for me running CentOS 5 and Xen 3. First you need to install the ntp package and set ntpd to run on startup:
yum install ntp
chkconfig ntpd on

Next you need to take the guest machine off the host's clock. The permanent way of doing this is to add a line to your /etc/sysctl.conf file telling the machine not to use the host machine's clock (the change will take effect on restart):
echo "xen.independent_wallclock = 1" >> /etc/sysctl.conf
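
The echo above appends a new copy of the line every time it is run. If you script this step, a guarded append may be safer. This is a minimal sketch; ensure_line is a hypothetical helper name, not something that ships with CentOS or Xen.

```shell
#!/bin/sh
# ensure_line FILE LINE   (hypothetical helper)
# Append LINE to FILE only if FILE does not already contain that exact line.
ensure_line() {
    file=$1
    line=$2
    # -x matches whole lines only, -F treats the line as a fixed string.
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage (as root):
#   ensure_line /etc/sysctl.conf "xen.independent_wallclock = 1"
```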

Finally you will need to tinker with ntp so that it does not panic when the time shift is too large, and so that it ignores the jitter you may see if you have a wildly shifting clock (as was my case). The panic setting should be applied to any VM using ntp, not just wacky ones that drift: if a VM is suspended, ntp will see on resume that the time shift is more than 1000 seconds and will not adjust the clock. To make these changes, edit the default ntp.conf file, which should be located at "/etc/ntp.conf". You need to remove or comment out 2 lines:
server 127.127.1.0 # local clock
fudge 127.127.1.0 stratum 10
Next, add the following line to the very top of the file; it will not work unless it is the first directive in the config file.
tinker panic 0 dispersion 1.000
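
If you have several guests to fix, the two edits above can be scripted. The sketch below is only one way to do it and makes some assumptions: patch_ntp_conf is a made-up function name, and it assumes the local clock lines use the local clock driver address 127.127.1.0, as stock ntp.conf files do. It writes to a separate output file so you can review the result before copying it over /etc/ntp.conf.

```shell
#!/bin/sh
# patch_ntp_conf SRC DST   (hypothetical helper; DST must differ from SRC)
# Writes a copy of SRC to DST with the tinker line first and the
# local clock server/fudge lines commented out.
patch_ntp_conf() {
    src=$1
    dst=$2
    {
        # The tinker line must be the first directive in the file.
        echo "tinker panic 0 dispersion 1.000"
        # Drop any existing tinker line so re-running does not duplicate it,
        # then comment out the local clock lines (assumed to be 127.127.1.0).
        sed -e '/^tinker[[:space:]]/d' \
            -e 's/^\(server[[:space:]]\{1,\}127\.127\.1\.0.*\)$/#\1/' \
            -e 's/^\(fudge[[:space:]]\{1,\}127\.127\.1\.0.*\)$/#\1/' "$src"
    } > "$dst"
}

# Usage:
#   patch_ntp_conf /etc/ntp.conf /tmp/ntp.conf.new
```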
Ideally you should have one server set up as a master ntp server in your network that syncs off a public ntp server, with all of your other servers syncing off it. This is an ntp best practice and not really specific to this problem, but given that I am already tinkering with ntp's default values, I think it is even more important with the new setup. Below is a sample ntp.conf file based off the default with my changes in it:
# dispersion 1.000: ignore high jitter and offsets, as the local clock drifts wildly on Xen
# panic 0: set the time even if the time shift is more than 1000 seconds
tinker panic 0 dispersion 1.000
# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery

# Permit all access over the loopback interface. This could
# be tightened as well, but to do so would affect some of
# the administrative functions.
restrict -6 ::1
server ntp01

# Drift file. Put this in a directory which the daemon can write to.
# No symbolic links allowed, either, since the daemon updates the file
# by creating a temporary in the same directory and then rename()'ing
# it to the file.
driftfile /var/lib/ntp/drift

# Key file containing the keys and key identifiers used when operating
# with symmetric key cryptography.
keys /etc/ntp/keys
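
Since the tinker line has to be the first directive, it may be worth checking that after editing. Comments and blank lines above it are fine, as in the sample. This is a rough sketch; first_directive is a hypothetical helper name.

```shell
#!/bin/sh
# first_directive FILE   (hypothetical helper)
# Prints the first line of FILE that is neither blank nor a comment.
first_directive() {
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1" | head -n 1
}

# Usage:
#   first_directive /etc/ntp.conf
# For the sample file above this should print the tinker line.
```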

Finally, reboot. The reboot takes care of making all the changes active, but if you cannot reboot you can try setting the VM to not use the clock of the host machine at runtime and then restarting the ntpd service. Setting independent_wallclock this way did not work for me, so your mileage may vary.
echo 1 > /proc/sys/xen/independent_wallclock
service ntpd stop
service ntpd start
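
To see whether the runtime change actually stuck, you can read the value back. The sketch below wraps that in a small function; wallclock_mode is a made-up name, and the optional path argument exists only so the function can be tried against an ordinary file.

```shell
#!/bin/sh
# wallclock_mode [FILE]   (hypothetical helper)
# Reports whether the guest keeps its own clock or follows the host.
# FILE defaults to the Xen proc entry used in the echo above.
wallclock_mode() {
    f=${1:-/proc/sys/xen/independent_wallclock}
    if [ ! -r "$f" ]; then
        echo "unavailable"
        return 1
    fi
    case "$(cat "$f")" in
        1) echo "independent" ;;
        0) echo "host" ;;
        *) echo "unknown" ;;
    esac
}

# Usage:
#   wallclock_mode
```

Once ntpd has been running for a few minutes, ntpq -p will show whether it has picked a server to sync with.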