
IndexedHBase - Instructions for Text Indexing and Search with ClueWeb09

Overview

This document assumes that a dynamic HBase deployment has been set up in an HPC environment according to the MyHBase Users' Guide, and that IndexedHBase-CoreCw09-0.2.jar has been copied to $HADOOP_HOME/lib/. If you have a standalone HBase deployment instead, adjust the arguments of the following commands to match your configuration.

Commands

Log in to the Hadoop head node and execute the following commands to process the ClueWeb09 data set.

(1) Create tables

cd $HADOOP_HOME

./bin/hadoop --config $HADOOP_CONF_DIR jar lib/IndexedHBase-CoreCw09-0.2.jar iu.pti.hbaseapp.clueweb09.TableCreatorClueWeb09 yes yes yes yes yes [path to a sample file for region creation for the text index table] [initial number of regions for the index table]

Here the "sample file for region creation for the text index table" is a file containing an ordered list of indexed terms, created by indexing a small sample of the whole data set in a sample run. In our experience, a good choice for "initial number of regions for the index table" is [number of region servers] * 2.
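The rule of thumb above can be computed from a host list. The following is a minimal sketch, not part of IndexedHBase: the helper name initial_regions is hypothetical, and it assumes the region server hosts are listed one per line in a file such as $HBASE_CONF_DIR/regionservers, which may not match your deployment.

```shell
# Hypothetical helper: suggest an initial region count of
# (number of region servers) * 2.
# Assumes the given file lists one region server host per line;
# blank lines and lines starting with '#' are ignored.
initial_regions() {
    servers_file="$1"
    # grep -c -v counts the lines that match none of the patterns.
    count=$(grep -c -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$servers_file" || true)
    echo $((count * 2))
}

# Example (the path is an assumption about your HBase configuration):
# initial_regions "$HBASE_CONF_DIR/regionservers"
```

The printed number can then be passed as the last argument of TableCreatorClueWeb09.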

(2) Load data to the data table

Create input files for the MapReduce data loader:

./bin/hadoop --config $HADOOP_CONF_DIR jar lib/IndexedHBase-CoreCw09-0.2.jar iu.pti.hbaseapp.clueweb09.Cw09Helpers create-mr-input [DFS directory containing .warc.gz files] [directory to save the input files for the MapReduce loader] [number of .warc.gz files to process by each mapper]
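One way to choose the last argument is to start from a target number of mappers and round up. The sketch below only does that arithmetic; the helper name files_per_mapper and the numbers in the example are illustrative, and it assumes each generated input file is handled by one mapper, which this document does not confirm.

```shell
# Hypothetical helper: ceiling division of the total number of
# .warc.gz files by a target number of mappers.
files_per_mapper() {
    total_files="$1"
    target_mappers="$2"
    # Integer ceiling: (a + b - 1) / b
    echo $(( (total_files + target_mappers - 1) / target_mappers ))
}

# Illustrative numbers only: 1000 files spread over 64 mappers.
files_per_mapper 1000 64   # prints 16
```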

Upload input files for the MapReduce data loader to HDFS:

./bin/hadoop --config $HADOOP_CONF_DIR dfs -mkdir Data/allMrInput

./bin/hadoop --config $HADOOP_CONF_DIR dfs -copyFromLocal [directory containing the input files for the MapReduce loader] Data/allMrInput/

Run the MapReduce data loader:

./bin/hadoop --config $HADOOP_CONF_DIR jar lib/IndexedHBase-CoreCw09-0.2.jar iu.pti.hbaseapp.clueweb09.DataLoaderClueWeb09 Data/allMrInput text

Read some rows from the data table:

./bin/hadoop --config $HADOOP_CONF_DIR jar lib/IndexedHBase-CoreCw09-0.2.jar iu.pti.hbaseapp.HBaseTableReader clueWeb09DataTable d int string string string 1
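Most commands in this document share the same prefix, so a small shell function can reduce typing mistakes. This is a convenience sketch rather than part of IndexedHBase: the function name cw09 and the DRY_RUN and CW09_JAR variables are hypothetical, and it assumes the current directory is $HADOOP_HOME.

```shell
# Hypothetical wrapper around the command prefix repeated in this document.
# Run it from $HADOOP_HOME. With DRY_RUN=1 it only prints the command
# instead of executing it.
CW09_JAR="lib/IndexedHBase-CoreCw09-0.2.jar"

cw09() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo ./bin/hadoop --config "$HADOOP_CONF_DIR" jar "$CW09_JAR" "$@"
    else
        ./bin/hadoop --config "$HADOOP_CONF_DIR" jar "$CW09_JAR" "$@"
    fi
}

# Example: preview the table reader command without running it.
# DRY_RUN=1 cw09 iu.pti.hbaseapp.HBaseTableReader clueWeb09DataTable d int string string string 1
```

Previewing with DRY_RUN=1 first makes it easier to check the argument order before starting a long MapReduce job.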

(3) Build the inverted index for the ClueWeb09 texts

Run the FreqIndexFileBuilder2Cw09 MapReduce application to create HFiles for the text index table:

./bin/hadoop --config $HADOOP_CONF_DIR jar lib/IndexedHBase-CoreCw09-0.2.jar iu.pti.hbaseapp.clueweb09.FreqIndexFileBuilder2Cw09 text [HDFS directory for output HFiles] 0 1 gzip

Log in to the HBase master node and run HBase's "completeBulkLoad" command:

cd $HBASE_HOME

./bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles [HDFS directory for output HFiles] clueWeb09IndexTable

Log back in to the Hadoop head node and read some rows from the text index table:

cd $HADOOP_HOME

./bin/hadoop --config $HADOOP_CONF_DIR jar lib/IndexedHBase-CoreCw09-0.2.jar iu.pti.hbaseapp.HBaseTableReader clueWeb09IndexTable f string string int int 5

(4) Test the searching strategies

Run ScanSearcherClueWeb09 to search for the frequent word 'changes':

./bin/hadoop --config $HADOOP_CONF_DIR jar lib/IndexedHBase-CoreCw09-0.2.jar iu.pti.hbaseapp.clueweb09.ScanSearcherClueWeb09 text changes true Data/scanSearchChanges

Run IndexSearcherClueWeb09 to search for the frequent word 'changes':

./bin/hadoop --config $HADOOP_CONF_DIR dfs -mkdir Data/indexSearchInputChanges

./bin/hadoop --config $HADOOP_CONF_DIR jar lib/IndexedHBase-CoreCw09-0.2.jar iu.pti.hbaseapp.clueweb09.IndexSearcherClueWeb09 text changes Data/indexSearchInputChanges Data/indexSearchChanges -1 35000

Run the sequential TestClientClueWeb09 to search for the frequent word 'changes':

./bin/hadoop --config $HADOOP_CONF_DIR jar lib/IndexedHBase-CoreCw09-0.2.jar iu.pti.hbaseapp.clueweb09.TestClientClueWeb09 single-word-search changes -1 true $HOME/changes.txt text

(5) Get the distributions of document count, total appearances, and record size

./bin/hadoop --config $HADOOP_CONF_DIR jar lib/IndexedHBase-CoreCw09-0.2.jar iu.pti.hbaseapp.FreqDistCounter clueWeb09IndexTable f Data/clueWeb09FreqDistCounter