SalsaBookshelf

IndexedHBase - Text Indexing and Search with ClueWeb09

Overview

IndexedHBase can be used to build inverted indices for text data stored in HBase to support efficient text search. Figure 1 shows an example of an inverted index. For a given set of documents, each composed of a series of terms (words), it records such information: for each term, which subset of documents contain it in their texts?

Figure 1. An example inverted index.

Here we use the ClueWeb09 dataset to demonstrate the application of IndexedHBase in text indexing and search, and measure its performance in variuos aspects, including inverted index building, real-time updating, searching, etc.

Table Schemas for Text and Index Data

Since the ClueWeb09 data set is composed of web pages crawled from the Internet, we design the table schemas in Figure 2 to store the text and index data. In CW09FreqTable, column names are document IDs, and cell values are frequencies of terms in documents. In CW09PosVecTable, cell values are position vectors of terms in documents. The structures of CW09FreqTable and CW09PosVecTable are general enough for storing inverted indexes for any other text data sets.

Figure 2. Table schemas for text and index data.

Batch and Online Index Building

The Batch Inverted Index Building is completed with a Hadoop MapReduce program provided by IndexedHBase. We use the bulk loading strategy provided by HBase to populate the index tables, so the output of the program is HFiles. The execution of the program is illustrated in Figure 3.

Figure 3. MapReduce index building program.

Figure 4 shows the index building time and scalability measured in the tests done on the Quarry HPC cluster. The whole data set contains about 50 million web pages; its size is 232GB compressed and is about 1.5TB after decompression. In case of 96 data nodes, it takes about 181 minutes to build the index table. This number is comparable to Ivory's index building program as reported here. Furthermore, we get a moderate speed up of 1.76 when the cluster size is doubled. This indicates that IndexedHBase is able to handle larger data sets by having more resources.

Figure 4. Parallel batch index building performance vs. scale. Figure 5. Docs/s by all clients in online indexing test.

Figure 5 shows the aggregate performance of multiple online indexers using the online indexing strategy of IndexedHBase. We can see that as the number of concurrent clients increase, the aggregate system throughput and number of documents processed per second increases sub-linearly. Even in the case of 32 distributed clients, it takes only 50ms for a client to insert and index one document. This proves that IndexedHBase can support dynamic real-time data updates from multiple application clients very well.

Searching Strategies

Based on the inverted indices, we design the following three searching strategies and compare their performance for searching terms with different characteristics.

Parallel scan search (PSS): launch a MapReduce program to scan the text data table in parallel and search for the given term in texts. This strategy does not use the index.

Sequential index search (SIS): first access the index table to get document IDs, then sequentially access the text data table to get document texts.

Parallel index search (PIS): first access the index table to get document IDS, then use MapReduce to get texts from the text data table for all document IDs in parallel.

Table 1 presents the performance comparison of these three strategies for searching terms with different selectivities. We can observe that sequential index search is especially efficient for searching infrequent terms, and parallel index search is almost always preferable than parallel scan search. To choose the best searching strategy in practice, multiple factors should be considered, including terms' document count distribution, random access speed of HBase, number of mappers to use in parallel scan search and parallel index search, etc.

Table 1. Performance comparison for 3 searching strategies
term document count (selectivity in %) PSS search time (s) SIS search time (s) PIS search time (s) longest / shortest
all 30,237,276 (65.31%) 2,335 25,208 904 28
copyrights 4,022,026 (9.98%) 2,365 3,579 155 23
continental 435,901 (1.08%) 2,394 961 208 12
youthful 64,409 (0.16%) 2,384 282 173 14
pairwise 6,011 (0.01%) 2,427 32 50 76
institutional 90 (< 0.01%) 2,413 3 31 804