|
|
Appistry Storage Hadoop Edition provides plug-in compatibility with the Hadoop Distributed File System (HDFS), allowing Hadoop users to easily upgrade to Storage as a replacement for HDFS when enhanced reliability and throughput are key considerations. Because Storage has no single point of failure and no centralized bottleneck, it is more scalable and reliable than HDFS and more suitable for mission-critical deployment.
Hadoop MapReduce comes with its own distributed file system (HDFS), but supports multiple file systems – several of which include drivers with Hadoop. For HDFS, 'nameNode' gets started on a single, well-known worker – resulting in difficulties with scalability and reliability. Appistry has developed a file system adapter, written in Java, for Storage. No single point of failure exists for Storage.
Hadoop requires Java 6 on all workers in the fabric.
Where to Find:
On the Resource CD under '\demos\hadoop_demo' or Peer2Peer Downloads as the 'Storage Hadoop Edition Support Files'. |
Deploy the Service
Appistry has put together a service-FAR for Hadoop prepackaged with 'fsfs.jar'. The 'hadoop.far' copies the 'fsfs.jar' into 'hadoop-$VERSION/lib' for selection as a filesystem to use with Hadoop as an alternative to HDFS.

To deploy the FAR, open the Management Console and connect to an active fabric. Hover the mouse-pointer over 'More' and click 'Deploy Files'. Browse for the appropriate FAR and click 'Submit'.
Edit and Deploy the Configuration
Before Hadoop will be ready to run a MapReduce job, a new service-configuration must be deployed for your environment. Inside '\demos\hadoop_demo\config_far', there are three files which should be updated:
- core-site.xml (default)-- check the Hadoop Filesystem Driver Configuration page to ensure you are using the best parameters for your installation. Any parameters that deviate from Appistry's defaults should be included/changed, as appropriate:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>storage:/</value>
</property>
<property>
<name>fs.storage.impl</name>
<value>org.apache.hadoop.fs.appistry.FabricStorageFileSystem</value>
<description>Appistry Fabric Storage FS for storage: uris.</description>
</property>
<property>
<name>fs.abs.impl</name>
<value>org.apache.hadoop.fs.appistry.BlockedFabricStorageFileSystem</value>
<description>Block-aware Appistry Fabric Storage FS for storage: uris.</description>
</property>
</configuration>
- appistry-site.xml (default)-- check the Hadoop Filesystem Driver Configuration page to ensure you are using the best parameters for your installation. Any parameters that deviate from Appistry's defaults should be included/changed, as appropriate:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.appistry.storage.host</name>
<value>localhost</value>
</property>
<property>
<name>fs.appistry.storage.port</name>
<value>16088</value>
</property>
<property>
<name>fs.appistry.chunked</name>
<value>true</value>
</property>
</configuration>
- mapred-site.xml (default) – update the 'mapred.job.tracker' value with the node that will act as jobtracker for MapReduce and any other MapReduce parameters that apply to your job.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>http://192.168.5.42:9001</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>4</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>4</value>
</property>
<property>
<name>mapred.task.timeout</name>
<value>90000</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>40</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>35</value>
</property>
</configuration>
Package 'hadoop_config.xml' into a FAR using fabric_pkg, by typing:
fabric_pkg hadoop_config.xml
from the '\demos\hadoop_demo\config_far' directory.
To deploy the FAR, open the Management Console and connect to an active fabric. Hover the mouse-pointer over 'More' and click 'Deploy Files'. Browse for the appropriate FAR and click 'Submit'.
Architecture
We implemented two classes which extend FileSystem:
- org.apache.hadoop.fs.appistry.FabricStorageFileSystem - a non-blocked filesystem interface to Storage.
- org.apache.hadoop.fs.appistry.BlockedFabricStorageFileSystem - a blocked filesystem interface to Storage.
Storage does not implement a blocked interface at present. This functionality is handled by our BlockedFabricStorageFileSystem driver.
Each class implements common file system methods. For example:
- open(Path p) return InputStream
- create(Path p) return OutputStream
- delete(Path p)
- rename(Path src, Path dst)
- mkdir(Path p)

One important call – getFileBlockLocations – is used by Hadoop to direct work to the worker with the file. File system classes use FabricStorageClient, which performs the HTTP request to Storage. Each MapReduce task accesses Storage from his local worker. We wrote two drivers – the blocked one adds a "blocked" abstraction on top of storage. Blocked file system may be necessary if user's files are really large. S3 implemented a blocked driver due to their file size limit.
We implemented a blocked driver for fabric affinity tasks:
- MapReduce will run the task on the worker with a given block.
- More workers locally processing parts of a single large file.

Our Performance Tests
We performed tests using Hadoop TeraGen and TeraSort:
- MapReduce applications.
- Generates and sorts a configurable number of records.
- Code used by Hadoop to win the Terabyte Sort Benchmark (209s).
Configuration
- 5 rhel5-64 nodes running CloudIQ Platform 4.2
- 8 GB RAM, 8 cores
- Deploying the following fabric archives (FARs):
- fabric_storage.far – deployed as service with the latest storage executable.
- hadoop.far – deployed as service with standard Hadoop install and scripts to start HDFS nameNode and MapReduce jobTracker
- Hadoop_fsfs.far – service-configuration with our Hadoop filesystem driver that gets copied into Hadoop.
- hadoop_config.far – service-configuration file for Hadoop
Hadoop file system parameters are set in service-configuration:
- fs.appistry.storage.username
- fs.appistry.storage.password
- fs.appistry.storage.logger.enabled
- fs.appistry.storage.logger.group
- fs.appistry.storage.logger.port
- fs.appistry.storage.host
- fs.appistry.storage.port
- fs.appistry.storage.default.replication
- fs.appistry.temp.dir
- fs.appistry.chunked
Launching
TeraGen: bin/hadoop jar hadoop-*-examples.jar teragen 100000000 /test/teragen
TeraSort: bin/hadoop jar hadoop-*-examples.jar terasort /test/teragen /test/terasort
Notes
- 100,000,000 is the number of records
- Last parameter is the input/output folders
- Command is using the default file system for both input and output
Terasort Performance Comparison (100,000 records)
Storage (no chunking, with logging) (40 maps. 35 reducers) |
HDFS (80 maps, 35 reducers) |
2mins, 38sec 2mins, 42sec 2mins, 48sec 2mins, 53sec 2mins, 55sec 2mins, 58sec 3mins, 2sec 3mins, 5sec 3mins, 8sec 3mins, 11sec |
2mins, 54sec 2mins, 56sec 3mins, 11sec 3mins, 13sec 3mins, 45sec |
