IndexedHBase - MyHBase Users' Guide

Overview

MyHBase is a software package for building a dynamic HBase deployment inside an HPC job. It is an extension to the MyHadoop package, as illustrated in Figure 1. MyHBase works in two modes, batch and interactive, corresponding to a batch HPC job and an interactive HPC job. In batch mode, MyHBase provides a script that completes all the steps in sequence: node allocation, HBase deployment, user task execution, output collection, and deployment clean-up. In interactive mode, MyHBase assumes the user already has nodes allocated to their job, and provides two separate scripts for starting a dynamic HBase deployment and tearing it down.

Figure 1. MyHBase.

Configurations

(1) Prerequisites:

Download and decompress Hadoop, HBase, and MyHBase into separate directories on a distributed file system that is mounted on all nodes of the HPC cluster.

(2) Job-specific configurations:

In batch mode, change the job-specific parameter values at the top of pbs-myhbase-batch.sh to configure your HPC job, including the number of nodes to request, the reservation time, etc. MyHBase currently supports only the PBS queue scheduling system; support for other systems will be added later.
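As a rough illustration of what such job-specific parameters look like under PBS, the directives below request a number of nodes and a wall-clock limit. The job name, the node count, the ppn value, and the time limit are all assumptions chosen for this example; the parameters actually present at the top of pbs-myhbase-batch.sh may differ.

```shell
#!/bin/bash
# Hypothetical PBS header; adjust every value to your cluster and job.
# Job name shown by the scheduler:
#PBS -N myhbase-job
# Number of nodes to request, and cores per node:
#PBS -l nodes=8:ppn=8
# Reservation time as hh:mm:ss:
#PBS -l walltime=02:00:00
```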

(3) MyHBase related configurations:

Go to the directory for MyHBase, and change the following environment variables in the corresponding scripts:

MY_HBASE_HOME : the directory into which MyHBase was decompressed. (In bin/setenv.sh, pbs-myhbase-batch.sh, myHBaseInteractiveStartSystem.sh, myHBaseInteractiveStopAll.sh.)

HADOOP_HOME : the directory into which Hadoop was decompressed. (In bin/setenv.sh.)

HADOOP_DATA_DIR : a directory existing on the local file system of every cluster node; this will be used to store data for HDFS. (In bin/setenv.sh.)

HADOOP_LOG_DIR : the directory used for the log files written by the Hadoop name node, job tracker, data nodes, and task trackers. Set this to a shared directory on the distributed file system so that you can check the logs of all those processes without logging in to every node separately. (In bin/setenv.sh.)

MY_NODES_LIST : path to a file that will store the list of nodes allocated to your HPC job. This file must be in a directory on the shared distributed file system. A good option is $MY_HBASE_HOME/nodes.txt. (In bin/setenv.sh.)

HBASE_HOME : the directory into which HBase was decompressed. (In bin/setenv.sh.)

ZOO_KEEPER_DATADIR : a directory existing on the local file system of every cluster node. This will be used to store ZooKeeper data. (In bin/setenv.sh.)

JAVA_HOME : Java installation directory on the shared distributed file system. (In bin/setenv.sh.)
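The variables above could be set along the following lines in bin/setenv.sh. This is only a sketch: every path is a placeholder invented for the example, and the use of export and of a logs directory under $MY_HBASE_HOME are assumptions rather than what the shipped script necessarily does.

```shell
# Illustrative values only; replace every path with your own.

# Installations on the shared distributed file system.
export MY_HBASE_HOME=/shared/fs/username/myhbase
export HADOOP_HOME=/shared/fs/username/hadoop
export HBASE_HOME=/shared/fs/username/hbase
export JAVA_HOME=/shared/fs/username/jdk

# Node-local directories; they must exist on every cluster node.
export HADOOP_DATA_DIR=/tmp/username/hadoop-data
export ZOO_KEEPER_DATADIR=/tmp/username/zookeeper-data

# Shared locations, so logs and the node list are visible from any node.
export HADOOP_LOG_DIR="$MY_HBASE_HOME/logs"
export MY_NODES_LIST="$MY_HBASE_HOME/nodes.txt"
```

Remember that MY_HBASE_HOME also has to be set to the same value in pbs-myhbase-batch.sh, myHBaseInteractiveStartSystem.sh, and myHBaseInteractiveStopAll.sh.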

(4) Hadoop and HBase related configurations:

The Hadoop and HBase configuration template files can be found under $MY_HBASE_HOME/etc. Customize your Hadoop and HBase configurations by setting properties in the corresponding .xml files, e.g., "dfs.replication" in hdfs-site.xml, "hbase.rootdir" in hbase-site.xml, etc.
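To double-check a property after editing a template, a small helper such as the one below can print its value. It is a hypothetical convenience function, not part of MyHBase, and it assumes the usual Hadoop layout in which the value element appears on the same line as the name element or on the line right after it.

```shell
# Hypothetical helper: print the value of one property in a Hadoop-style XML file.
# $1 = path to the .xml file, $2 = property name (matched literally).
get_property() {
    grep -F -A1 "<name>$2</name>" "$1" \
        | sed -n 's:.*<value>\(.*\)</value>.*:\1:p' \
        | head -n 1
}
```

For example, get_property "$MY_HBASE_HOME/etc/hdfs-site.xml" dfs.replication prints the configured replication factor, or nothing if the property is not set in that file.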

(5) Special issues to pay attention to:

By default, MyHBase uses the first node in $MY_NODES_LIST as the Hadoop head node, the second node as the HBase master node, and the next three nodes as the ZooKeeper quorum. It also treats $HADOOP_LOG_DIR as a directory on the shared distributed file system. If you need to change any of these defaults, read through bin/pbs-configure.sh and bin/pbs-cleanup.sh and make the necessary modifications.
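The default role layout described above can be sketched as follows. This is not the code in bin/pbs-configure.sh; the function and variable names are made up for the illustration, and it assumes the nodes list holds one distinct hostname per line. If your list repeats a hostname once per core, remove the duplicates first.

```shell
# Hypothetical illustration of the default role assignment.
# $1 = path to the nodes list (one distinct hostname per line).
assign_roles() {
    local nodes_file="$1"
    # Line 1: Hadoop head node.
    HADOOP_HEAD=$(sed -n '1p' "$nodes_file")
    # Line 2: HBase master node.
    HBASE_MASTER=$(sed -n '2p' "$nodes_file")
    # Lines 3 to 5: ZooKeeper quorum, joined into a comma-separated list.
    ZK_QUORUM=$(sed -n '3,5p' "$nodes_file" | paste -sd, -)
}
```

For a list containing n1 through n6, calling assign_roles on that file sets HADOOP_HEAD to n1, HBASE_MASTER to n2, and ZK_QUORUM to n3,n4,n5.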

Running your HBase applications

In batch mode, add the code for running your HBase applications and collecting results under the comment line "#### Add codes for running your HBase applications here. ####" in pbs-myhbase-batch.sh, and then submit the job with "qsub pbs-myhbase-batch.sh".
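A minimal sketch of what could go under that comment line is shown below. It assumes the application is a file of HBase shell commands and that results should be copied somewhere on the shared file system before clean-up. The function name, the HBASE_CMD override (useful for a dry run), and the output file names are all hypothetical.

```shell
# Hypothetical helper to define under the marker comment in pbs-myhbase-batch.sh.
# $1 = file of HBase shell commands, $2 = result directory on the shared file system.
run_hbase_app() {
    local app_script="$1"
    local result_dir="$2"
    mkdir -p "$result_dir" || return 1
    # Run the script through the HBase shell and keep its output in the
    # result directory, outside the temporary deployment.
    "${HBASE_CMD:-$HBASE_HOME/bin/hbase}" shell "$app_script" \
        > "$result_dir/app.out" 2> "$result_dir/app.err"
}
```

It would then be called right after its definition, for instance as run_hbase_app "$MY_HBASE_HOME/my-app.rb" "$MY_HBASE_HOME/results", where both paths are placeholders.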

In interactive mode, write your own scripts for running your HBase applications and collecting results. Make sure to source $MY_HBASE_HOME/bin/setenv.sh and set up the necessary environment variables in your scripts. After nodes have been allocated to your interactive HPC job, log in to the head node of your job allocation and go to $MY_HBASE_HOME. Then run myHBaseInteractiveStartSystem.sh, your application scripts, and myHBaseInteractiveStopAll.sh, in that order.
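The interactive sequence can be wrapped in a small driver so that the deployment is stopped even when the application fails. The sketch below is one possible arrangement, not a script shipped with MyHBase. The function name is hypothetical, MY_HBASE_HOME is assumed to be set already, and the application script should be given as an absolute path because the function changes directory.

```shell
# Hypothetical driver for interactive mode; run it on the head node.
# $1 = absolute path to your own application script.
# The body runs in a subshell so the caller's working directory is untouched.
run_interactive_session() (
    app_script="$1"
    cd "$MY_HBASE_HOME" || exit 1
    # Load the MyHBase environment variables.
    . ./bin/setenv.sh
    # Start the dynamic HBase deployment; give up if it fails.
    ./myHBaseInteractiveStartSystem.sh || exit 1
    # Run the application and remember its exit status.
    "$app_script"
    status=$?
    # Always stop the deployment, even if the application failed.
    ./myHBaseInteractiveStopAll.sh
    exit "$status"
)
```

The function returns the exit status of the application script, so a calling script can still detect a failed run after the deployment has been stopped.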