
IndexedHBase - Special Instructions for Truthy on Madrid

(1) General documentation for running the Truthy commands can be found here.

(2) Log in to madrid by ssh-ing to madrid.dsc.soic.indiana.edu with your IU network ID and password. This is the login node; the actual working nodes are m1 - m8, so ssh to any of these 8 nodes to do your work. The home directory /N/u/{username} is an NFS directory shared among all the nodes. The /scratch and /scratch_ssd directories are local directories on each node, accessible to all user accounts.
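A typical login sequence looks like the following sketch (the username jdoe and the choice of node m3 are placeholders):

ssh jdoe@madrid.dsc.soic.indiana.edu    # log in to the login node with your IU network ID
ssh m3                                  # hop to one of the working nodes m1 - m8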

(3) Set the following environment variables in your .bashrc:
export JAVA_HOME=/scratch/public/software/jdk1.7.0_45
export ANT_HOME=/scratch/public/software/apache-ant-1.9.2
export HADOOP_HOME=/scratch/public/software/hadoop-2.2.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HBASE_HOME=/scratch/public/software/hbase-0.96.0-hadoop2
export HADOOP_CLASSPATH=`$HBASE_HOME/bin/hbase classpath`
export HDFS_URI=hdfs://m1:44749
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
export PATH=$HBASE_HOME/bin:$ANT_HOME/bin:$PATH
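
After editing .bashrc, a quick sanity check along these lines confirms that the intended Java, Hadoop, and HBase builds are on your PATH (the versions in the comments simply mirror the installations above):

source ~/.bashrc
java -version       # should report JDK 1.7.0_45
hadoop version      # should report Hadoop 2.2.0
hbase version       # should report HBase 0.96.0-hadoop2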

(4) When downloading the IndexedHBase source package, put it in a shared directory, such as a subdirectory under your home directory. The load-from-hdfs and build-index commands currently require the custom-index.xml file to be under a shared directory. (I am currently working on removing this requirement.)
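For instance, assuming the package is distributed as a gzipped tarball (the directory and archive names below are placeholders, not the actual release name):

mkdir -p ~/indexedhbase
cd ~/indexedhbase
tar xzf IndexedHBase.tar.gz    # unpack into the NFS-shared home directory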

(5) load-from-hdfs and build-index can put a heavy load on the cluster. Try to avoid loading or indexing more than two months of data at the same time. I recommend creating a new custom-index.xml file for each month before loading its data. An example file is at conf/custom-index.xml in the unzipped IndexedHBase package. (This is somewhat clumsy for now -- I will work on turning the index configuration file into a template so that you don't need to create a new file for each month.)
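One simple per-month workflow, assuming the conf/ layout above (the month 2013-06 and the ~/indexedhbase path are illustrative only; each month's copy is kept in its own shared directory in case the commands expect the file to keep the name custom-index.xml):

mkdir -p ~/indexedhbase/conf-2013-06
cp conf/custom-index.xml ~/indexedhbase/conf-2013-06/custom-index.xml
# edit the copy so the table and index names refer to 2013-06, then use it for
# the load-from-hdfs and build-index runs for that month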

(6) Checking the dynamic loading progress is somewhat cumbersome under the new version of Hadoop. Currently the simplest way is to monitor the stderr of the loading job for a high-level progress report. I will try to find an easier way to monitor each task's progress and keep you updated.
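For example, redirecting stderr to a log file and tailing it is usually enough (the arguments to load-from-hdfs are omitted because they depend on your data and index configuration; the log file name is arbitrary):

load-from-hdfs <arguments> 2> loading-2013-06.log &
tail -f loading-2013-06.log    # watch the high-level progress report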

(7) Queries may be somewhat slow on the first "dry run" right after data is loaded, but will get faster once the HBase servers are warmed up.

(8) To check the contents of any HDFS directory, use something like
hadoop dfs -ls $HDFS_URI/truthy/loading/
The underlying directory for storing the .json.gz files during preload is hdfs://m1:44749/truthy/loading/.
The HDFS working directories for queries are under hdfs://m1:44749/truthy/queries/. Query subdirectories are organized in the form "{month}/{queryType}_{date}-{millisecondsOfDay}".
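For instance, to see the query working directories for one month (the month 2013-06 is illustrative, and the subdirectory name in the comment is a hypothetical example of the naming pattern, not a real query type):

hadoop dfs -ls $HDFS_URI/truthy/queries/2013-06/
# entries look like get-tweets-with-meme_2013-06-20-53029311, i.e.
# {queryType}_{date}-{millisecondsOfDay}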

(9) To list the current tables in HBase, start the HBase shell:
hbase shell
Inside the shell, use "list" to list all current tables, "status" to check the health of HBase, and "exit" to quit the shell.
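A short session might look like this (the prompt is what the HBase 0.96 shell prints; only the commands described above are used):

hbase shell
hbase(main):001:0> list
hbase(main):002:0> status
hbase(main):003:0> exit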
