The industry is rapidly moving to adopt Hadoop 2.x. Every upgrade, especially one of this scale, involves a degree of complexity. Qubole has already started offering a beta Hadoop 2 service, and our customers have started to try it out. As with any transition, this one comes with its own quirks.
One of our customers, who runs major workloads on Qubole, wanted to try out Hadoop 2. As part of this experiment, they wanted to transfer part of their data from HDFS clusters managed by Qubole (running HDFS 1) to Hadoop 2. The Hadoop community has done tremendous work simplifying the migration from Hadoop 1 clusters to Hadoop 2 clusters, but there is no support for making the two kinds of clusters interact with each other. This interoperability becomes very important during the transition, when the two systems run in parallel and split workloads.
We had to figure out a scalable way to migrate data from HDFS 1 to HDFS 2 reliably.
HFTP with DistCp
DistCp is a popular Hadoop tool for transferring data from one location to another. It uses MapReduce to parallelize the file copy operations and works with any Hadoop-compatible file system: one can copy HDFS to HDFS, HDFS to S3, and so on, as long as both endpoints have compatible file system implementations in Hadoop.
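For example, typical invocations between two compatible file systems look like the following (the addresses and bucket name are placeholders):
hadoop distcp hdfs://source_namenode_address/<src> hdfs://dest_namenode_address/<dest>
hadoop distcp hdfs://source_namenode_address/<src> s3n://bucket/<dest>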
To transfer data between HDFS 1 and HDFS 2, Hadoop provides an HTTP-based file system called HFTP, which reads data over HTTP. Note that HFTP can only perform read operations; writes are not supported.
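Because HFTP is read-only, the copy job must run on the destination (Hadoop 2) cluster, pulling data over HTTP from the Hadoop 1 NameNode. A typical invocation looks like the following (addresses are placeholders; 50070 is the usual NameNode HTTP port):
hadoop distcp hftp://hdfs1_namenode_address:50070/<src> hdfs://hdfs2_namenode_address/<dest>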
In this particular use case, data is generated on the HDFS 1 clusters and has to be pushed to the HDFS 2 clusters immediately. The other issue is that this approach requires the MapReduce job to run on the Hadoop 2 side; that cluster was already running other applications, and bringing up a dedicated MapReduce cluster for the copy would have meant extra resources and maintenance.
Solution
To solve this, we had to create an HDFS 2 compatible file system inside Hadoop 1. To see how we did it, let's look at how file systems are implemented in Hadoop. There is a base class called FileSystem, and every Hadoop-compatible file system extends this class. Once a subclass exists, a URI scheme can be mapped to it in the Hadoop configuration. For example:
<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
</property>
The above configuration points all URIs of the form hdfs://… to the DistributedFileSystem class. The solution, then, is to add the Hadoop 2 jars to the Hadoop 1 classpath and point another URI scheme (for example, hdfs2) to the corresponding Hadoop 2 file system class. This seems straightforward, but there are two problems:
- The fully qualified class names conflict between Hadoop 1 and Hadoop 2. Many classes (e.g., org.apache.hadoop.hdfs.DistributedFileSystem) have identical package paths but different implementations.
- Even if the class names were separated, the file system implementations for HDFS 1 and HDFS 2 are entirely different, and each pulls in its own set of dependencies from the respective Hadoop distribution.
We solved both problems using Maven shading. Shading let us rename the Hadoop 2 packages from org.apache.hadoop.hdfs.* to qubole.org.apache.hadoop.hdfs.* and bundle them into an uber jar, which allowed us to place the Hadoop 2 classes on the Hadoop 1 classpath without conflicts. The only remaining step was to call the Hadoop 2 DFS implementation through the Hadoop 1 FileSystem API. For that, we created a new class extending Hadoop 1's FileSystem that calls into the Hadoop 2 DistributedFileSystem internally. A small code snippet is shown below:
// Inside Hadoop 1
public class Hdfs2FileSystem extends FileSystem {

    // Shaded Hadoop 2 DistributedFileSystem (note the renamed package)
    qubole.org.apache.hadoop.hdfs.DistributedFileSystem dfs;

    // Simplified for illustration: the real overrides are the
    // FileSystem methods (open(), create(), listStatus(), ...),
    // each of which delegates to dfs in the same way.
    public void write(byte[] data) {
        dfs.write(data);
    }

    public byte[] read(int offset) {
        return dfs.read(offset);
    }

    // And so on.
}
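For concreteness, the package renaming described above can be done with the maven-shade-plugin's relocation feature. A minimal sketch (our actual build configuration is not shown here, and a real build would likely relocate the other conflicting Hadoop packages as well):
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <!-- Rename Hadoop 2 packages so they can coexist with
                 Hadoop 1 classes on the same classpath -->
            <pattern>org.apache.hadoop.hdfs</pattern>
            <shadedPattern>qubole.org.apache.hadoop.hdfs</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>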
The final thing we did was to add another configuration parameter:
<property>
  <name>fs.hdfs2.impl</name>
  <value>org.apache.hadoop.fs.Hdfs2FileSystem</value>
</property>
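Once this is in place, any Hadoop 1 client that opens an hdfs2:// URI transparently gets the shaded Hadoop 2 implementation. A minimal sketch (the class name and paths here are illustrative):
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Hdfs2Example {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hadoop consults fs.hdfs2.impl to resolve the hdfs2:// scheme,
        // so this returns an instance of Hdfs2FileSystem.
        FileSystem fs = FileSystem.get(
                URI.create("hdfs2://hdfs2_namenode_address/"), conf);
        // Reads and writes now go to the Hadoop 2 cluster.
        fs.copyFromLocalFile(new Path("/tmp/sample"), new Path("/dest/sample"));
    }
}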
Using this, we could run DistCp on the Hadoop 1 cluster itself and easily transfer files from HDFS 1 to HDFS 2 with the following command:
hadoop distcp hdfs://hdfs1_namenode_address/<src> hdfs2://hdfs2_namenode_address/<dest>
The power of Maven shading, combined with how simple it is to override the FileSystem API, gave us a reliable way for HDFS 1 and HDFS 2 to interact at scale, and it has been extremely useful in the transition from Hadoop 1 to Hadoop 2.