
IndexedHBase - Instructions for Truthy Applications

Prerequisites

(1) This document assumes that Hadoop and HBase have been deployed, Ant is installed, and that you are working on the head node of the cluster.

(2) Export the following environment variables in your .bashrc on all nodes:

export JAVA_HOME={your java installation directory}
export ANT_HOME={your ant installation directory}
export HADOOP_HOME={your hadoop installation directory}
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HBASE_HOME={your hbase installation directory}
export HADOOP_CLASSPATH=`$HBASE_HOME/bin/hbase classpath`
export HDFS_URI=hdfs://{your hdfs name node hostname}:{your hdfs name node port number}
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
export PATH=$HBASE_HOME/bin:$ANT_HOME/bin:$PATH
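
A quick way to sanity-check the environment before proceeding is to print the tool versions and the HDFS URI after the .bashrc changes have been sourced; this is only a suggested check, not a required step:

source ~/.bashrc
java -version
ant -version
hadoop version
hbase version
echo $HDFS_URI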

(3) Download the IndexedHBase binary package for Truthy and get it ready for use:

wget http://salsaproj.indiana.edu/IndexedHBase/IndexedHBase-CoreTruthy-0.2.zip
unzip IndexedHBase-CoreTruthy-0.2.zip
cd IndexedHBase-CoreTruthy-0.2/
cp lib/IndexedHBase-CoreTruthy-0.2.jar $HADOOP_HOME/share/hadoop/common/lib/

Data loading

(1) Copy .json.gz files to HDFS

Data is loaded in units of one month. Before loading, organize the .json.gz files into one directory per month on the local file system; e.g., 2014-08-01.json.gz through 2014-08-31.json.gz should be stored under a directory named "2014-08". Then run the preload command to copy a month's directory to HDFS. The second parameter, {dirname on hdfs}, should be just a directory name, not a full path; the command converts it to an HDFS path automatically.

cd bin/
chmod +x truthy-cmd.sh
./truthy-cmd.sh preload {local dir} {dirname on hdfs}
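
For example, assuming the August 2014 files were placed under a hypothetical local directory /home/user/truthy-data/2014-08, that month could be copied to HDFS as follows:

# local path is a hypothetical example; "2014-08" is the directory name created on HDFS
./truthy-cmd.sh preload /home/user/truthy-data/2014-08 2014-08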

(2) Create tables

Due to the special layout of the user table, loading it may suffer from the "hot spot" problem. To avoid this, we suggest creating the user table with pre-defined regions that hold evenly distributed amounts of data. To do this, first run the sort-uid command to create a file containing the sorted user IDs from the .json.gz file for the last day of the month to be loaded. Then pass this sorted uid file to the create-tables command. Note: the sort-uid command takes about one hour to finish because it needs to scan the .json.gz file, but the balanced regions are worth the wait. For {number of regions}, use a number proportional to the number of region servers, e.g. 5 for 'small' months and 10 for 'big' months.

./truthy-cmd.sh sort-uid {path to .json.gz file for last day of month} {path to result .gz file for sorted uids}
./truthy-cmd.sh create-tables {month} {number of regions} {.gz sorted uid file path}
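
As an illustration, the two steps for the month 2014-08 might look like the following; the file paths are hypothetical, and 10 regions is just one reasonable choice for a 'big' month:

# paths below are hypothetical examples
./truthy-cmd.sh sort-uid /home/user/truthy-data/2014-08/2014-08-31.json.gz /home/user/truthy-data/sorted-uids-2014-08.gz
./truthy-cmd.sh create-tables 2014-08 10 /home/user/truthy-data/sorted-uids-2014-08.gz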

(3) Load data

Create a new custom-index-{month}.xml file by modifying IndexedHBase-CoreTruthy-0.2/conf/custom-index.xml to reference the corresponding month.

./truthy-cmd.sh load-from-hdfs {dirname on hdfs} {index or noindex} {month} [{index config path} when "index" is set instead of "noindex"]
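
For example, to load the 2014-08 directory preloaded earlier with indexing enabled, using a hypothetical index config location on a shared file system:

# /shared/conf/custom-index-2014-08.xml is a hypothetical full path on a shared file system
./truthy-cmd.sh load-from-hdfs 2014-08 index 2014-08 /shared/conf/custom-index-2014-08.xml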

(4) Build new index based on existing data table

Modify IndexedHBase-CoreTruthy-0.2/conf/custom-index.xml to add the new index table information.

./truthy-cmd.sh build-index {index config path} {source data table name} {index table name}

Note: {index config path} must be a full path on a shared file system, starting with '/'. This is because the loading mapper running on every node will need to access it.
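
As an illustration, a new index table could be built from an existing data table as follows; the table names and config path are hypothetical placeholders, since the actual names depend on how the tables for the month were created:

# table names and config path are hypothetical placeholders
./truthy-cmd.sh build-index /shared/conf/custom-index-2014-08.xml tweetTable-2014-08 textIndexTable-2014-08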

Query execution

./truthy-cmd.sh get-tweets-with-meme {memes} {start time} {end time} {tweet-id or tweet-content} {output directory} [{number of tweets per file}]

./truthy-cmd.sh get-tweets-with-text {keywords} {start time} {end time} {tweet-id or tweet-content} {output directory} [{number of tweets per file}]

./truthy-cmd.sh get-tweets-with-userid {userid} {start time} {end time} {tweet-id or tweet-content} {output directory} [{number of tweets per file}]

./truthy-cmd.sh get-retweets {retweeted tweet id} {start time} {end time} {tweet-id or tweet-content} {output directory} [{number of tweets per file}]

./truthy-cmd.sh timestamp-count {memes} {start time} {end time} {output directory}

./truthy-cmd.sh user-post-count {memes} {start time} {end time} {output directory} [{number of tweets per file}]

./truthy-cmd.sh meme-post-count {memes} {start time} {end time} {output directory}

./truthy-cmd.sh meme-cooccur-count {meme} {start time} {end time} {output directory} [{number of tweets per file}]

./truthy-cmd.sh get-retweet-edges {memes} {start time} {end time} {in or out} {output directory} [{number of tweets per file}]

./truthy-cmd.sh get-mention-edges {memes} {start time} {end time} {in or out} {output directory} [{number of tweets per file}]

Note: {memes} must be given in the form "#usa,#islam,#occpy*" (the double quotes are required). {start time} and {end time} must be given in the form 2012-09-03T22:34:22 (the time part 'T22:34:22' is optional). The optional {number of tweets per file} parameter specifies how many tweet IDs each mapper will process in the MapReduce query evaluation phase; the default value is 30000. Based on an estimate of how "popular" a meme is, this can be adjusted to control the number of mappers used. The number of reducers is automatically set to 0.5 * the number of mappers.
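
Putting these rules together, a query for two memes over August 2014 that outputs only tweet IDs and raises the batch size to 50000 tweet IDs per mapper might look like the following; the output directory is an illustrative placeholder:

# output directory and batch size are illustrative placeholders
./truthy-cmd.sh get-tweets-with-meme "#usa,#islam" 2014-08-01 2014-08-31T23:59:59 tweet-id /home/user/query-output/usa-islam 50000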

Helpers and extensions

(1) HBase table operations

./truthy-cmd.sh delete-tables {month}

./truthy-cmd.sh get-system-status (get a brief summary of the HBase system status)

./truthy-cmd.sh read-table {table name} {number of rows to read} [{starting row key}]
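
For example, to peek at the first 10 rows of a table while discarding HBase's stderr logging (see the tip at the end of this document), with a hypothetical table name:

# table name is a hypothetical placeholder
./truthy-cmd.sh read-table tweetTable-2014-08 10 2> /dev/null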

(2) Get tweets by using the MapReduce index table operator

./truthy-cmd.sh get-tweets-with-mrop {index constraint} {start time} {end time} {tweet-id or tweet-content for output} {output directory} [{number of tweets per ID file}]

Note: {index constraint} must be either "text:CONSTRAINT_BODY" or "meme:CONSTRAINT_BODY". CONSTRAINT_BODY must be given in one of the following formats:
A. [lower,upper] for range constraint;
B. <regular expression> for regular expression constraint;
C. ~prefix*~ or ~prefix?~ for prefix constraint.
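
For example, a prefix constraint on the meme index (format C above) that matches memes starting with "#occupy" could be issued as follows; the output directory is an illustrative placeholder:

# the ~#occupy*~ prefix constraint matches memes beginning with "#occupy"
./truthy-cmd.sh get-tweets-with-mrop "meme:~#occupy*~" 2014-08-01 2014-08-31 tweet-id /home/user/query-output/occupy-mrop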

(3) Process tweet IDs from get-tweets-with-*

./truthy-cmd.sh get-tweets-and-analyze {get-(re)tweets* command} {query constraint} {start time} {end time} {map class name} {reduce class name} {compress or nocompress for output} {output directory} [{additional arguments}]

To process existing tweet IDs contained in an {hdfs input directory} with a MapReduce job:
./truthy-cmd.sh process-tweet-ids {hdfs input directory} {hdfs output directory} {map class name} {reduce class name} {number of reducers} {compress or nocompress for output} [{additional arguments}]

Note: {hdfs input directory} must be an HDFS directory containing tweet ID files; {additional arguments} will be passed through to your own mapper and reducer classes. To learn how to write new mappers and reducers, check the existing query implementations under the iu.pti.hbaseapp.truthy.mrqueries package.
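
For example, tweet ID files produced by an earlier get-tweets-with-* query could be re-processed with a custom MapReduce job as follows; the HDFS paths, class names, and reducer count are hypothetical placeholders:

# HDFS paths and class names are hypothetical placeholders
./truthy-cmd.sh process-tweet-ids /truthy/output/usa-islam-ids /truthy/output/usa-islam-counts iu.pti.hbaseapp.truthy.mrqueries.MyAnalysisMapper iu.pti.hbaseapp.truthy.mrqueries.MyAnalysisReducer 4 nocompress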

Tip: when running create-tables, delete-tables, read-table, and get-system-status, you can append " 2> /dev/null" to the command to suppress the verbose HBase logging sent to stderr. When executing long-running commands such as preload, load-from-hdfs, and build-index, it is better to use nohup to avoid problems caused by console logout. For example:

nohup ./bin/truthy-cmd.sh load-from-hdfs 2010-09 index 2010-09 /N/u/gao4/custom-index.xml 1> truthyLoad201009.out 2> truthyLoad201009.err &

Special instructions for using IndexedHBase on the Madrid cluster are available on a separate page.

Special instructions for using IndexedHBase on the Moe cluster are available on a separate page.

 
