In one of our earlier blog posts, we announced the availability of the Qubole Hadoop Platform on Google Compute Engine. This was also featured on the Google Cloud Platform Blog.
In this post, we talk about a critical issue that we faced (and eventually managed to circumvent) a few days before the Qubole-on-GCE beta release.
The startup scripts that we use to boot the Hadoop clusters on GCE are responsible for (among other things) installing Hadoop, setting up the cluster provisioning libraries, and starting Hadoop daemons.
Once the startup script finishes, Hadoop is configured and the cluster is set up to run Hadoop jobs. When we tried to run jobs on a GCE-based Hadoop cluster, we discovered that the jobs seemed to be perpetually stuck in the “preparation phase”. Initial investigations and hacks failed: nothing seemed to work.
Then we added asleep for a few minutes (190s) just before starting the Hadoop daemons and to our surprise, the jobs started to go through!
Though overjoyed at getting jobs to finally run on the cluster, having to wait for the 190s to submit a job on a new cluster was unacceptable. So we started to dig deeper to find the root cause.
As an experiment, we brought up 2 clusters: one with the 190s sleep in the startup script and the other without it. Here are the JobTracker logs from both clusters:
Cluster with sleep:
2014-01-06 17:32:04,438 INFO -1 org.apache.hadoop.mapred.JobTracker (main): STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting JobTracker
STARTUP_MSG: host = qubole-cluster-1-master.c.qubole-gce.internal/10.240.124.143
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.20.1-dev
STARTUP_MSG: build = git://localhost.localdomain/ on branch release-branch-1 -r a3c5f51a85101fd4b8fb6831038eadb85e75c4d2; compiled by ‘root’ on Fri Jan 3 10:45:53 CET 2014
************************************************************/
2014-01-06 17:32:04,586 INFO -1 org.apache.hadoop.mapred.JobTracker (main): Starting jobtracker with owner as mapred and supergroup as a supergroup
2014-01-06 17:32:04,588 INFO -1 org.apache.hadoop.mapred.JobTracker (main): Scheduler configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT, limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)
Cluster without sleep:
2014-01-06 12:37:50,988 INFO -1 org.apache.hadoop.mapred.JobTracker (main): STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting JobTracker
STARTUP_MSG: host = qubole-cluster-2-master.c.qubole-gce.internal/10.240.245.193
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.20.1-dev
STARTUP_MSG: build = git://localhost.localdomain/ on branch release-branch-1 -r a3c5f51a85101fd4b8fb6831038eadb85e75c4d2; compiled by ‘root’ on Fri Jan 3 10:45:53 CET 2014
************************************************************/
2014-01-06 12:37:51,357 INFO -1 org.apache.hadoop.mapred.JobTracker (main): Starting jobtracker with owner as mapred and supergroup as supergroup
2014-01-06 12:37:51,359 INFO -1 org.apache.hadoop.mapred.JobTracker (main): Scheduler configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT, limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)
The timestamps in the logs differ by 5 hours, although both the clusters were brought up almost simultaneously. This is a dead giveaway that the system time isn’t right on at least one of the clusters.
We added a few logs in the FairScheduler update thread and discovered that the thread sleeps for a duration of around 18000 secs (5 hrs), which is the time offset between the two JT logs. This is also the duration for which the job ‘hangs’ in the cluster without sleep.
We were already running ntpd on all our GCE nodes, so we expected the system to be properly set – but it turns out that merely running the ntp daemon isn’t good enough…
The kernel logs (/var/log/messages) from the two clusters held important clues:
Cluster with sleep:
Jan 6 17:34:38 qubole-cluster-2-master startupscript: collisions:0 txqueuelen:1000
Jan 6 17:34:38 qubole-cluster-2-master startupscript: RX bytes:134776506 (128.5 MiB) TX bytes:417371 (407.5 KiB)
Jan 6 12:35:45 qubole-cluster-2-master ntpd[1575]: synchronized to 169.254.169.254, stratum 2
Jan 6 12:35:45 qubole-cluster-2-master ntpd[1575]: time reset -17999.901947 s
Jan 6 12:35:45 qubole-cluster-2-master ntpd[1575]: kernel time sync status change 2001
Jan 6 12:37:48 qubole-cluster-2-master startupscript: qubole-cluster-2-master
Jan 6 12:37:48 qubole-cluster-2-master startupscript: Stopping Hadoop namenode daemon (hadoop-namenode): no namenode to stop
Jan 6 12:37:48 qubole-cluster-2-master startupscript: [ OK ]#01514/01/06 12:37:48 INFO -1 namenode.NameNode: STARTUP_MSG:
Cluster without sleep:
Jan 6 17:32:03 qubole-cluster-1-master startupscript: [ OK ]#015Stopping Hadoop jobtracker daemon (hadoop-jobtracker): no jobtracker to stop
Jan 6 17:32:03 qubole-cluster-1-master startupscript: [ OK ]#015Starting Hadoop jobtracker daemon (hadoop-jobtracker): starting jobtracker, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-jobtracker-qubole-cluster-1-master.out
Jan 6 17:32:05 qubole-cluster-1-master startupscript: [ OK ]#015
Jan 6 17:32:05 qubole-cluster-1-master startupscript: Finished running startup script /var/run/google.startup.script
Jan 6 12:33:10 qubole-cluster-1-master ntpd[1567]: synchronized to 169.254.169.254, stratum 2
Jan 6 12:33:10 qubole-cluster-1-master ntpd[1567]: time reset -18000.008126 s
Jan 6 12:33:10 qubole-cluster-1-master ntpd[1567]: kernel time sync status change 2001
Jan 6 12:39:43 qubole-cluster-1-master ntpd[1567]: synchronized to 169.254.169.254, stratum 2
So in the cluster without the sleep, ntpd finished synchronizing after the Hadoop daemons restart, whereas, in the other cluster, ntpd synchronized the system time before the restart.
Now we queried the NTP servers (metadata.google.internal) and added log statements in the startup script. As can be seen in the logs below the offset suddenly decreases from -18000 secs to 0 sec which takes close to 160s (~190s, the sleep duration added as the workaround).
Jan 6 17:32:04 qubole-cluster-1-master startupscript: remote refid st t when poll reach delay offset jitter
Jan 6 17:32:04 qubole-cluster-1-master startupscript: ===========================================================
Jan 6 17:32:04 qubole-cluster-1-master startupscript: metadata.google 255.28.23.83 2 u 65 64 7 0.318 -180000 2.838
Jan 6 17:32:05 qubole-cluster-1-master startupscript: remote refid st t when poll reach delay offset jitter
Jan 6 17:32:05 qubole-cluster-1-master startupscript: ===========================================================
Jan 6 17:32:05 qubole-cluster-1-master startupscript: metadata.google 255.28.23.83 2 u 66 64 7 0.318 -180000 2.838
Jan 6 12:35:45 qubole-cluster-1-master ntpd[1532]: synchronized to 169.254.169.254, stratum 2
Jan 6 12:35:45 qubole-cluster-1-master ntpd[1532]: time reset -18000.005458 s
Jan 6 12:35:45 qubole-cluster-1-master ntpd[1532]: kernel time sync status change 2001
Jan 6 12:35:45 qubole-cluster-1-master startupscript: remote refid st t when poll reach delay offset jitter
Jan 6 12:35:45 qubole-cluster-1-master startupscript: ===========================================================
Jan 6 12:35:45 qubole-cluster-1-master startupscript: metadata.google .STEP. 16 u 7 64 0 0.000 0.000 0.000
ntpd is designed to not synchronize the system time if the offset, drift, and jitter are too high – which was happening on our clusters. It behaves this way because it can’t be sure in these cases (especially if there is high jitter) if the time provided by the server is reliable.
The typical workaround in such cases is to stop ntpd, run ntpdate manually (which fixes the time offset to a great extent), and then start ntpd again. Now when ntpd starts up, it finds the offset is tolerable and keeps the time in sync by periodically syncing with NTP time servers:
sudo service ntpd stop
sudo ntpdate -u metadata.google.internal
sudo service ntpd start
When we ran `sudo hwclock -r`, we saw that the GCE instance’s (virtual) hardware clock time was off by 5 hours – this is the reason that the system time was off by such a large amount at boot time. This could be the source of the large clock skew that caused ntpd to quit trying to synchronize the time.
Lesson learned: always ensure that “Mr Ntpd” is alive and kicking… or things can start going wrong in all sorts of ways, including Hadoop cluster startup!