
Appistry Storage Hadoop Edition provides plug-in compatibility with the Hadoop Distributed File System (HDFS), allowing Hadoop users to easily upgrade to Storage as a replacement for HDFS when enhanced reliability and throughput are key considerations. Because Storage has no single point of failure and no centralized bottleneck, it is more scalable and reliable than HDFS and more suitable for mission-critical deployment.

Hadoop MapReduce ships with its own distributed file system (HDFS) but supports multiple file systems, and drivers for several of them are included with Hadoop. HDFS starts its 'nameNode' on a single, well-known worker, which limits both scalability and reliability. Appistry has developed a file system adapter for Storage, written in Java. Because Storage has no single point of failure, the adapter avoids this limitation.

Hadoop requires Java 6 on all workers in the fabric.

Where to Find:
On the Resource CD under '\demos\hadoop_demo', or from Peer2Peer Downloads as 'Storage Hadoop Edition Support Files'.

Deploy the Service

Appistry provides a service FAR for Hadoop, 'hadoop.far', prepackaged with 'fsfs.jar'. When deployed, 'hadoop.far' copies 'fsfs.jar' into 'hadoop-$VERSION/lib' so that Storage can be selected as Hadoop's file system in place of HDFS.

To deploy the FAR, open the Management Console and connect to an active fabric. Hover the mouse-pointer over 'More' and click 'Deploy Files'. Browse for the appropriate FAR and click 'Submit'.

Edit and Deploy the Configuration

Before Hadoop can run a MapReduce job, a new service configuration must be deployed for your environment. Inside '\demos\hadoop_demo\config_far', there are three files that should be updated:

  • core-site.xml (default) – check the Hadoop Filesystem Driver Configuration page to confirm that you are using the best parameters for your installation. Add or change any parameters that need to deviate from Appistry's defaults:
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <configuration>
    
       <property>
          <name>fs.default.name</name>
          <value>storage:/</value>
       </property>
    
       <property>
          <name>fs.storage.impl</name>
          <value>org.apache.hadoop.fs.appistry.FabricStorageFileSystem</value>
          <description>Appistry Fabric Storage FS for storage: uris.</description>
       </property>
    
       <property>
          <name>fs.abs.impl</name>
          <value>org.apache.hadoop.fs.appistry.BlockedFabricStorageFileSystem</value>
          <description>Block-aware Appistry Fabric Storage FS for abs: uris.</description>
       </property>
    
    </configuration>
    
  • appistry-site.xml (default) – check the Hadoop Filesystem Driver Configuration page to confirm that you are using the best parameters for your installation. Add or change any parameters that need to deviate from Appistry's defaults:
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <configuration>
    
       <property>
          <name>fs.appistry.storage.host</name>
          <value>localhost</value>
       </property>
    
       <property>
          <name>fs.appistry.storage.port</name>
          <value>16088</value>
       </property>
    
       <property>
          <name>fs.appistry.chunked</name>
          <value>true</value>
       </property>
    
    </configuration>
    
  • mapred-site.xml (default) – set the 'mapred.job.tracker' value to the node that will act as jobTracker for MapReduce, and adjust any other MapReduce parameters that apply to your job.
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <!-- Put site-specific property overrides in this file. -->
    
    <configuration>
       <property>
          <name>mapred.job.tracker</name>
          <value>http://192.168.5.42:9001</value>
       </property>
    
       <!-- guideline for these max(s) seems to be something like # CORES PER BOX -->
       <property>
          <name>mapred.tasktracker.map.tasks.maximum</name>
          <value>4</value>
       </property>
    
       <property>
          <name>mapred.tasktracker.reduce.tasks.maximum</name>
          <value>4</value>
       </property>
    
       <property>
          <name>mapred.task.timeout</name>
          <value>90000</value>
       </property>
    
       <property>
          <name>mapred.map.tasks</name>
          <value>40</value>
       </property>
    
       <property>
          <name>mapred.reduce.tasks</name>
          <value>35</value>
       </property>
    
    </configuration>
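
Before packaging, it can help to confirm that each edited file is still well-formed and carries the values you expect. The following is a minimal sketch that lists the name/value pairs in a Hadoop-style configuration document. The class 'SiteConfigReader' and its methods are hypothetical names used for illustration; they are not part of Hadoop or of the Appistry driver, and only standard JDK XML classes are used.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Hadoop or the Appistry driver.
final class SiteConfigReader {

    // Parses a Hadoop-style <configuration> document into name/value pairs, in file order.
    static Map<String, String> read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Map<String, String> props = new LinkedHashMap<String, String>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = childText(property, "name");
                String value = childText(property, "value");
                if (name != null && value != null) {
                    props.put(name, value);
                }
            }
            return props;
        } catch (Exception e) {
            // Malformed XML or an I/O problem: report it as a bad argument.
            throw new IllegalArgumentException("Not a readable configuration document", e);
        }
    }

    // Returns the trimmed text of the first nested element with the given tag, or null.
    private static String childText(Element parent, String tag) {
        NodeList children = parent.getElementsByTagName(tag);
        return children.getLength() == 0 ? null : children.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Sample content taken from the default appistry-site.xml shown above.
        String xml =
              "<?xml version=\"1.0\"?>"
            + "<configuration>"
            + "<property><name>fs.appistry.storage.host</name><value>localhost</value></property>"
            + "<property><name>fs.appistry.storage.port</name><value>16088</value></property>"
            + "<property><name>fs.appistry.chunked</name><value>true</value></property>"
            + "</configuration>";
        for (Map.Entry<String, String> entry : read(xml).entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

A malformed file causes 'read' to throw an IllegalArgumentException, which makes a broken edit visible before the FAR is built.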
    

From the '\demos\hadoop_demo\config_far' directory, package 'hadoop_config.xml' into a FAR using fabric_pkg by typing:

fabric_pkg hadoop_config.xml

To deploy the FAR, open the Management Console and connect to an active fabric. Hover the mouse-pointer over 'More' and click 'Deploy Files'. Browse for the appropriate FAR and click 'Submit'.

Architecture

We implemented two classes which extend FileSystem:

  • org.apache.hadoop.fs.appistry.FabricStorageFileSystem - a non-blocked filesystem interface to Storage.
  • org.apache.hadoop.fs.appistry.BlockedFabricStorageFileSystem - a blocked filesystem interface to Storage.

Storage does not implement a blocked interface at present. This functionality is handled by our BlockedFabricStorageFileSystem driver.

Each class implements common file system methods. For example:

  • open(Path p) returns an InputStream
  • create(Path p) returns an OutputStream
  • delete(Path p)
  • rename(Path src, Path dst)
  • mkdirs(Path p)

One important call, getFileBlockLocations, is used by Hadoop to direct work to the worker that holds the file. The file system classes use FabricStorageClient, which performs the HTTP requests to Storage. Each MapReduce task accesses Storage from its local worker. Of the two drivers, the blocked one adds a "blocked" abstraction on top of Storage. A blocked file system may be necessary if the user's files are very large. The S3 driver for Hadoop is blocked for a similar reason: the S3 file size limit.

We implemented a blocked driver to support fabric affinity for tasks:

  • MapReduce runs each task on a worker that holds the block the task needs.
  • More workers can locally process parts of a single large file.
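
The idea behind a blocked view can be sketched as follows. This is an illustration under stated assumptions, not the Appistry driver: the class 'BlockedViewSketch', its 'Block' type, the round-robin placement of blocks on workers, and the worker names are all hypothetical. In Hadoop itself, the corresponding question is answered by getFileBlockLocations(FileStatus file, long start, long len), which returns the block locations that overlap a byte range.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical names, not the Appistry driver or the Hadoop API.
final class BlockedViewSketch {

    // One fixed-size slice of a file plus the worker assumed to hold it.
    static final class Block {
        final long offset;
        final long length;
        final String host;

        Block(long offset, long length, String host) {
            this.offset = offset;
            this.length = length;
            this.host = host;
        }
    }

    // Splits a file into blocks of blockSize bytes; the last block may be shorter.
    // Round-robin placement is an assumption made only for this sketch.
    static List<Block> split(long fileLength, long blockSize, String[] workers) {
        if (blockSize <= 0 || workers.length == 0) {
            throw new IllegalArgumentException("blockSize and workers must be non-empty");
        }
        List<Block> blocks = new ArrayList<Block>();
        int index = 0;
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            long length = Math.min(blockSize, fileLength - offset);
            blocks.add(new Block(offset, length, workers[index % workers.length]));
            index++;
        }
        return blocks;
    }

    // Returns the blocks that overlap the byte range [start, start + len).
    static List<Block> locate(List<Block> blocks, long start, long len) {
        List<Block> hits = new ArrayList<Block>();
        long end = start + len;
        for (Block b : blocks) {
            if (b.offset < end && b.offset + b.length > start) {
                hits.add(b);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] workers = { "worker-a", "worker-b" };
        List<Block> blocks = split(1000L, 400L, workers);
        // A scheduler can send the task for each overlapping block to the listed host.
        for (Block b : locate(blocks, 350L, 100L)) {
            System.out.println("offset=" + b.offset + " length=" + b.length + " host=" + b.host);
        }
    }
}
```

With per-block hosts available, a single large file yields several candidate workers instead of one, which is what lets more workers process parts of the file locally.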

Our Performance Tests

We performed tests using Hadoop TeraGen and TeraSort:

  • Both are MapReduce applications.
  • TeraGen generates, and TeraSort sorts, a configurable number of records.
  • This is the code used by Hadoop to win the Terabyte Sort Benchmark (209s).

Configuration

  • 5 RHEL 5 (64-bit) nodes running CloudIQ Platform 4.2
  • 8 GB RAM, 8 cores
  • Deploying the following fabric archives (FARs):
    • fabric_storage.far – deployed as service with the latest storage executable.
    • hadoop.far – deployed as service with standard Hadoop install and scripts to start HDFS nameNode and MapReduce jobTracker
      • Hadoop_fsfs.far – service-configuration with our Hadoop filesystem driver that gets copied into Hadoop.
      • hadoop_config.far – service-configuration file for Hadoop

        Hadoop file system parameters are set in service-configuration:

        • fs.appistry.storage.username
        • fs.appistry.storage.password
        • fs.appistry.storage.logger.enabled
        • fs.appistry.storage.logger.group
        • fs.appistry.storage.logger.port
        • fs.appistry.storage.host
        • fs.appistry.storage.port
        • fs.appistry.storage.default.replication
        • fs.appistry.temp.dir
        • fs.appistry.chunked

Launching

TeraGen: bin/hadoop jar hadoop-*-examples.jar teragen 100000000 /test/teragen
TeraSort: bin/hadoop jar hadoop-*-examples.jar terasort /test/teragen /test/terasort

Notes

  • 100,000,000 is the number of records to generate
  • The remaining parameters are folders: TeraGen takes an output folder, and TeraSort takes an input folder followed by an output folder
  • Both commands use the default file system for input and output
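
To put the record count in terms of data volume, the arithmetic can be sketched as below. TeraGen writes 100-byte records; the class name 'TeraGenVolume' and its methods are hypothetical, and the map count of 40 is taken from the 'mapred.map.tasks' value in the sample mapred-site.xml.

```java
// Hypothetical helper for illustration; not part of Hadoop.
final class TeraGenVolume {

    // Each TeraGen record is 100 bytes long.
    static final long BYTES_PER_RECORD = 100L;

    // Total size of the generated data set in bytes.
    static long totalBytes(long records) {
        return records * BYTES_PER_RECORD;
    }

    // Records handled by each map task, rounded up so no record is left over.
    static long recordsPerMap(long records, int maps) {
        if (maps <= 0) {
            throw new IllegalArgumentException("maps must be positive");
        }
        return (records + maps - 1) / maps;
    }

    public static void main(String[] args) {
        long records = 100000000L;
        System.out.println("total bytes: " + totalBytes(records));
        System.out.println("records per map (40 maps): " + recordsPerMap(records, 40));
    }
}
```

For the command shown above, this works out to 10,000,000,000 bytes in total and 2,500,000 records per map task.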

TeraSort Performance Comparison (100,000 records)

Storage (no chunking, with logging; 40 maps, 35 reducers):

  • 2mins, 38sec
  • 2mins, 42sec
  • 2mins, 48sec
  • 2mins, 53sec
  • 2mins, 55sec
  • 2mins, 58sec
  • 3mins, 2sec
  • 3mins, 5sec
  • 3mins, 8sec
  • 3mins, 11sec

HDFS (80 maps, 35 reducers):

  • 2mins, 54sec
  • 2mins, 56sec
  • 3mins, 11sec
  • 3mins, 13sec
  • 3mins, 45sec
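
To compare runs numerically, the "Nmins, Nsec" values above can be converted to seconds. The following is a small sketch; the class 'RunTimes' and its methods are hypothetical names, and no particular grouping of the listed runs is assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Hadoop.
final class RunTimes {

    // Matches run times written like "2mins, 38sec".
    private static final Pattern FORMAT = Pattern.compile("(\\d+)mins,\\s*(\\d+)sec");

    // Converts one run time to whole seconds.
    static int toSeconds(String text) {
        Matcher m = FORMAT.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized run time: " + text);
        }
        return Integer.parseInt(m.group(1)) * 60 + Integer.parseInt(m.group(2));
    }

    // Arithmetic mean of the given run times, in seconds.
    static double meanSeconds(String... runs) {
        if (runs.length == 0) {
            throw new IllegalArgumentException("At least one run time is required");
        }
        long total = 0;
        for (String run : runs) {
            total += toSeconds(run);
        }
        return (double) total / runs.length;
    }

    public static void main(String[] args) {
        System.out.println(toSeconds("2mins, 38sec"));
        System.out.println(meanSeconds("2mins, 38sec", "2mins, 42sec"));
    }
}
```

Pass the run times for one configuration to 'meanSeconds' to get a single figure for that configuration.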
