
IndexedHBase - Special Instructions for Truthy on Moe

(1) Please first refer to the general documentation on running the Truthy commands. Some example commands for Moe are also available.

(2) To create accounts for new users on Moe, please contact Koji Tanaka (kj.tanaka@gmail.com) or Allan Streib (astreib@indiana.edu). Log in to Moe by ssh to moe.soic.indiana.edu using your IU network ID and password. The three head nodes have the hostnames ln01, ln02, and hn, and the compute nodes are named cn01 to cn10. You can run the Truthy commands on any node, but only the compute nodes have large local disks, so if your command has a large output, run it on a compute node and store the output on its local disks. The home directory /home/{username} is an NFS directory shared among all the nodes. The /data/sd* and /public directories are local directories on each compute node and are accessible to all user accounts. The /public directory is on a 120GB SSD and is only intended to store configuration, temporary, and log files. The /data/sd* directories are used for HDFS and other storage purposes.
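As a rough illustration of the advice above, the following sketch decides where to write output based on the node name. The function names is_compute_node and pick_output_dir are made up for this example, and the local directory is passed in by the caller because the exact directory to use under /data/sd* is not specified here.

```shell
# Succeed if the given hostname (default: this host) is one of cn01..cn10.
is_compute_node() {
    case "${1:-$(hostname -s)}" in
        cn0[1-9]|cn10) return 0 ;;
        *) return 1 ;;
    esac
}

# Print where large output should go: the caller-supplied local directory
# on a compute node, otherwise the NFS home directory.
# Usage: pick_output_dir LOCAL_DIR [HOSTNAME]
pick_output_dir() {
    if is_compute_node "${2:-$(hostname -s)}"; then
        printf '%s\n' "$1"
    else
        printf '%s\n' "$HOME"
    fi
}
```

For example, outdir=$(pick_output_dir /some/local/dir) picks the local directory only when the command is run on a compute node.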

(3) See the detailed document about the configuration of the Moe cluster for more information. Note that start-dfs.sh and start-yarn.sh must be run on ln01, and start-hbase.sh must be run on ln02.
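To avoid starting a service on the wrong head node, a small guard like the one below could be used. This is only a sketch: required_host and run_on_required_host are hypothetical helper names, and the script-to-host mapping is taken directly from the rule above.

```shell
# Print the head node on which a start script must be run.
required_host() {
    case "$1" in
        start-dfs.sh|start-yarn.sh) echo ln01 ;;
        start-hbase.sh)             echo ln02 ;;
        *) return 1 ;;
    esac
}

# Run a start script only if we are on the node it requires.
# Usage: run_on_required_host SCRIPT [HOSTNAME]
run_on_required_host() (
    want=$(required_host "$1") || { echo "unknown script: $1" >&2; exit 2; }
    here=${2:-$(hostname -s)}
    if [ "$here" != "$want" ]; then
        echo "$1 must be run on $want (this host is $here)" >&2
        exit 1
    fi
    "$1"
)
```

Running run_on_required_host start-hbase.sh on ln01 would print an error and return a non-zero status instead of starting HBase.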

(4) Set the following environment variables in your .bashrc:
export JAVA_HOME=/home/gao4/software/jdk1.7.0_40
export ANT_HOME=/home/gao4/software/apache-ant-1.9.4
export HADOOP_HOME=/home/gao4/software/hadoop-2.5.1
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HBASE_HOME=/home/gao4/software/hbase-0.94.23
export HADOOP_PID_DIR=/public/hadoop-2.5.1-pids
export HADOOP_MAPRED_PID_DIR=$HADOOP_PID_DIR
export HADOOP_SECURE_DN_PID_DIR=$HADOOP_PID_DIR
export YARN_PID_DIR=$HADOOP_PID_DIR
export HADOOP_LOG_DIR=/public/hadoop-2.5.1-logs
export YARN_LOG_DIR=$HADOOP_LOG_DIR
export HBASE_PID_DIR=$HADOOP_PID_DIR
export HADOOP_CLASSPATH=`$HBASE_HOME/bin/hbase classpath`
export HDFS_URI=hdfs://ln01:44749
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export PATH=$HBASE_HOME/bin:$ANT_HOME/bin:$PATH
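After editing .bashrc and opening a new shell, a quick check along the following lines can confirm that the main variables are set. The function name check_truthy_env and the choice of which variables to check are assumptions made for this sketch.

```shell
# Report any of the main variables from .bashrc that are unset or empty.
# Returns non-zero if at least one is missing.
check_truthy_env() (
    status=0
    for name in JAVA_HOME ANT_HOME HADOOP_HOME HBASE_HOME HADOOP_CONF_DIR HDFS_URI; do
        eval "value=\${$name:-}"   # indirect lookup of the variable named in $name
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            status=1
        fi
    done
    exit "$status"
)
```

Calling check_truthy_env prints one "missing: NAME" line on stderr for each variable that is not set.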

(5) When downloading the IndexedHBase source or binary package, put it in a shared directory, such as a subdirectory of your home directory. The load-from-hdfs and build-index commands now require the custom-index.xml file to be in a shared directory.

(6) load-from-hdfs and build-index can put a heavy load on the cluster. Try to avoid running these commands for more than two months of data at the same time. I recommend creating a new custom-index.xml file for each month before loading data. An example file is at conf/custom-index.xml in the unzipped IndexedHBase package.
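One possible way to prepare a per-month file is to copy the example from the package into a shared directory and then edit the copy. In the sketch below, the function name make_month_index, the custom-index-{month}.xml naming, and the month format are all assumptions; how the loading commands are pointed at the file is not covered here.

```shell
# Copy the example custom-index.xml to a per-month file in a shared directory.
# Usage: make_month_index PACKAGE_DIR MONTH SHARED_DIR
# Prints the path of the per-month file; never overwrites an existing one.
make_month_index() (
    src="$1/conf/custom-index.xml"
    dst="$3/custom-index-$2.xml"
    [ -f "$src" ] || { echo "template not found: $src" >&2; exit 1; }
    if [ -e "$dst" ]; then
        echo "$dst"          # already prepared; keep any manual edits
        exit 0
    fi
    cp "$src" "$dst" && echo "$dst"
)
```

For example, make_month_index ~/IndexedHBase 2014-10 ~/truthy-conf would create ~/truthy-conf/custom-index-2014-10.xml, where both directory names are placeholders.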

(7) Checking the progress of the dynamic loading process is cumbersome under the new version of Hadoop. Currently the simplest way is to monitor the stderr of the loading job for a high-level progress report.
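One simple way to keep an eye on stderr is to append it to a log file and follow that file from another terminal. The wrapper below is a generic sketch; run_logged is a made-up name, and the actual loading command line is not shown here, so it is passed in as arguments.

```shell
# Run a command with its stderr appended to a log file.
# stdout and the command's exit status are left untouched.
# Usage: run_logged LOGFILE COMMAND [ARGS...]
run_logged() (
    log=$1
    shift
    "$@" 2>>"$log"
)
```

While the job runs, tail -f on the log file in a second terminal shows the progress messages as they arrive. Keeping the log file in the home directory makes it readable from any node.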

(8) Queries may be slow on the first run right after data is loaded, but will be faster once the HBase servers are warmed up.

(9) To check the contents of any HDFS directory, use a command such as
hdfs dfs -ls $HDFS_URI/truthy/loading/
During preload, the .json.gz files are stored under hdfs://ln01:44749/truthy/loading/.
The HDFS working directories for queries are under hdfs://ln01:44749/truthy/queries/. Subdirectories for queries are named in the form "{month}/{queryType}_{date}-{millisecondsOfDay}".
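When looking through the query directories, it can help to split a subdirectory name back into its parts. The sketch below does this with plain shell parameter expansion. It assumes the date part contains no underscore and the milliseconds part contains no hyphen; parse_query_dir and the query type in the usage example are hypothetical.

```shell
# Split "{queryType}_{date}-{millisecondsOfDay}" into its three fields.
# Accepts a bare name or a full path without a trailing slash.
# Prints: QUERYTYPE DATE MILLISECONDS
parse_query_dir() (
    name=${1##*/}        # drop everything up to the last "/"
    ms=${name##*-}       # text after the last "-"
    rest=${name%-*}      # text before the last "-"
    date=${rest##*_}     # text after the last "_"
    qtype=${rest%_*}     # text before the last "_"
    printf '%s %s %s\n' "$qtype" "$date" "$ms"
)
```

For instance, parse_query_dir "$HDFS_URI/truthy/queries/2014-10/exampleQuery_2014-10-05-12345" prints "exampleQuery 2014-10-05 12345".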

(10) To list the current tables in HBase, start the HBase shell:
hbase shell
Once in the shell, use "list" to list all current tables. Use "status" to monitor the health of HBase. Use "exit" to quit the HBase shell.
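The same commands can also be sent to the shell on standard input, which is convenient in scripts. The helper below is a small sketch of that approach; hbase_run is a made-up name, and it assumes the hbase command is on the PATH as set up in item (4).

```shell
# Send each argument to the HBase shell as one command line,
# followed by "exit" so the shell terminates afterwards.
# Usage: hbase_run COMMAND [COMMAND...]
hbase_run() {
    printf '%s\n' "$@" exit | hbase shell
}
```

For example, hbase_run list status lists the tables and then prints the cluster status in a single shell session.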

(11) Maintenance and periodic cleanup:
Clean up hdfs://ln01:44749/truthy/loading/ after load-from-hdfs is complete, or from time to time.
Clean up hdfs://ln01:44749/truthy/queries/ from time to time.
Clean up /public/hadoop-2.5.1-logs and /public/hbase-0.94.23-logs from time to time.
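For the local log directories, it is safer to list old files before removing anything. The sketch below only prints candidates; list_old_logs is a hypothetical name and the 30-day default is an assumed retention period, not a site policy.

```shell
# List regular files under a directory that were last modified
# more than DAYS days ago (default: 30). Nothing is deleted.
# Usage: list_old_logs DIR [DAYS]
list_old_logs() {
    find "$1" -type f -mtime +"${2:-30}" -print
}

# The HDFS directories can be cleaned with the standard "hdfs dfs -rm -r"
# command once the listed contents are no longer needed, for example on a
# finished query subdirectory under $HDFS_URI/truthy/queries/.
```

For example, list_old_logs /public/hadoop-2.5.1-logs shows the files that are candidates for removal on the current node. Since /public is local to each node, the check has to be repeated on each one.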

 

Indiana University Bloomington