IndexedHBase uses HBase as its underlying storage platform and gives users the flexibility to define the index structures best suited to their queries. Both the original data and the index data are stored in HBase tables, and users define customized index structures through an XML configuration file, as illustrated in Figure 1. Each index structure is implemented as one index table, and IndexedHBase automatically generates records for the index tables by processing data from the source tables according to the rules defined in the configuration file.
|Figure 1. An example customized index configuration file.|
The core of IndexedHBase is a customizable indexer library, as shown in Figure 2. The index configuration file contains multiple “index-config” elements, each of which holds the mapping between one source table and one index table. An “index-config” element defines how to generate index table records from a given row of the source table. For more complicated index structures, users can implement their own indexer and plug it in by setting the “indexer-class” element.
|Figure 2. Components of customizable indexer.|
Upon initialization, the general customizable indexer reads and analyzes the index configuration file. When index() is invoked during runtime, all related “index-config” elements are used to generate index table records for a given row from the source table, either by following the rules defined in “index-config” or by invoking a user-defined indexer.
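To make the role of index() concrete, the following is a minimal in-memory sketch, not IndexedHBase code. It assumes a simple inverted-index rule: each whitespace-separated term in a configured source column becomes a row key in the index table, and the source row key becomes the column qualifier. The names IndexSketch, IndexConfig, and IndexRecord, and the table and column names in main(), are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory sketch; none of these names come from IndexedHBase. */
class IndexSketch {

    /** Stand-in for one "index-config" entry: one source column feeds one index table. */
    static final class IndexConfig {
        final String sourceTable;
        final String sourceColumn;
        final String indexTable;

        IndexConfig(String sourceTable, String sourceColumn, String indexTable) {
            this.sourceTable = sourceTable;
            this.sourceColumn = sourceColumn;
            this.indexTable = indexTable;
        }
    }

    /** One generated index record: the indexed term is the row key. */
    static final class IndexRecord {
        final String indexTable;
        final String rowKey;     // term taken from the source column
        final String qualifier;  // row key of the source row that contains the term

        IndexRecord(String indexTable, String rowKey, String qualifier) {
            this.indexTable = indexTable;
            this.rowKey = rowKey;
            this.qualifier = qualifier;
        }

        @Override
        public String toString() {
            return indexTable + ":" + rowKey + "->" + qualifier;
        }
    }

    /** Applies every config that targets the given source table to one source row. */
    static List<IndexRecord> index(List<IndexConfig> configs, String sourceTable,
                                   String rowKey, Map<String, String> row) {
        List<IndexRecord> records = new ArrayList<>();
        for (IndexConfig config : configs) {
            if (!config.sourceTable.equals(sourceTable)) {
                continue; // this config describes a different source table
            }
            String value = row.get(config.sourceColumn);
            if (value == null) {
                continue; // the row lacks the indexed column
            }
            for (String term : value.toLowerCase().split("\\s+")) {
                if (!term.isEmpty()) {
                    records.add(new IndexRecord(config.indexTable, term, rowKey));
                }
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<IndexConfig> configs = Arrays.asList(
                new IndexConfig("tweets", "text", "textIndex"),
                new IndexConfig("tweets", "user", "userIndex"));
        Map<String, String> row = new LinkedHashMap<>();
        row.put("text", "hello hbase");
        row.put("user", "alice");
        for (IndexRecord record : index(configs, "tweets", "t1", row)) {
            System.out.println(record);
        }
    }
}
```

In this sketch a single source row yields records for two index tables, because two configs refer to the same source table. A user-defined indexer would replace the term-splitting loop with its own logic.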
Online Indexing Mechanism and Batch Indexing Mechanism
IndexedHBase provides two mechanisms for indexing data: online and batch. The online mechanism is implemented through the insert() method of the general customizable indexer, as shown in Figure 2. The client application invokes this method to insert one row into a source table. The indexer first inserts the given row into the source table, then generates the corresponding index table records by invoking index() and inserts them into the index tables. From the client application’s perspective, data are indexed “online” at the moment they are first inserted into the source table.
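The order of operations in insert() can be sketched with nested in-memory maps standing in for HBase tables. This is an illustration under stated assumptions, not the IndexedHBase implementation: the class name OnlineIndexingSketch, the fixed rule of indexing a "text" column, and the "-index" table suffix are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory sketch of online indexing; not the IndexedHBase API. */
class OnlineIndexingSketch {

    // table name -> row key -> column -> value (sorted maps mimic HBase row ordering)
    final Map<String, Map<String, Map<String, String>>> tables = new TreeMap<>();

    private void put(String table, String rowKey, String column, String value) {
        tables.computeIfAbsent(table, t -> new TreeMap<>())
              .computeIfAbsent(rowKey, r -> new TreeMap<>())
              .put(column, value);
    }

    /** Assumed rule: each term of the "text" column is indexed into "<table>-index". */
    void index(String sourceTable, String rowKey, Map<String, String> row) {
        String text = row.get("text");
        if (text == null) {
            return; // nothing to index for this row
        }
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                // index row key = term, column = source row key, empty value
                put(sourceTable + "-index", term, rowKey, "");
            }
        }
    }

    /** Step 1: write the source row. Step 2: derive and write the index records. */
    void insert(String sourceTable, String rowKey, Map<String, String> row) {
        for (Map.Entry<String, String> cell : row.entrySet()) {
            put(sourceTable, rowKey, cell.getKey(), cell.getValue());
        }
        index(sourceTable, rowKey, row);
    }

    public static void main(String[] args) {
        OnlineIndexingSketch indexer = new OnlineIndexingSketch();
        Map<String, String> first = new LinkedHashMap<>();
        first.put("text", "hello hbase");
        Map<String, String> second = new LinkedHashMap<>();
        second.put("text", "hello world");
        indexer.insert("tweets", "t1", first);
        indexer.insert("tweets", "t2", second);
        // Source rows whose text contains "hello"
        System.out.println(indexer.tables.get("tweets-index").get("hello").keySet());
    }
}
```

The sketch makes no attempt to keep the source write and the index writes atomic; how IndexedHBase handles a failure between the two steps is not described here.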
The batch indexing mechanism is designed for generating new customized index tables after all the data have been loaded into the source table. It is implemented as a “map-only” MapReduce job that uses the source table as input. The job accepts the names of a source table and an index table as parameters and starts multiple mappers to index the source table in parallel, each processing one region of the table. Each mapper creates one general customizable indexer instance, which is initialized with the given index table name so that index() generates records only for that single table. For each row of the source table, the mapper uses the indexer to generate index table records and writes them as output. All output records are handled by the table output format, which automatically inserts them into the index table.
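The shape of the batch job can be sketched without a Hadoop cluster. The example below is a simplified in-memory simulation, not the actual MapReduce job: each region is a map from source row key to a text value, a Java parallel stream plays the role of the mappers, and a final loop stands in for the table output format. The class BatchIndexingSketch, its methods, and the term-based indexing rule are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;

/** Hypothetical simulation of a map-only indexing job; not Hadoop or IndexedHBase code. */
class BatchIndexingSketch {

    /** One "mapper": scans a single region and emits (term, sourceRowKey) pairs. */
    static List<String[]> mapRegion(Map<String, String> region) {
        List<String[]> output = new ArrayList<>();
        for (Map.Entry<String, String> row : region.entrySet()) {
            for (String term : row.getValue().toLowerCase().split("\\s+")) {
                if (!term.isEmpty()) {
                    output.add(new String[] {term, row.getKey()});
                }
            }
        }
        return output;
    }

    /** Map-only job: there is no reduce phase; mapper output goes straight to the index table. */
    static Map<String, TreeSet<String>> run(List<Map<String, String>> regions) {
        // Regions are processed independently, so they can run in parallel.
        List<List<String[]>> outputs = regions.parallelStream()
                .map(BatchIndexingSketch::mapRegion)
                .collect(Collectors.toList());

        // Stand-in for the table output format: insert every record into the index table.
        Map<String, TreeSet<String>> indexTable = new TreeMap<>();
        for (List<String[]> output : outputs) {
            for (String[] record : output) {
                indexTable.computeIfAbsent(record[0], k -> new TreeSet<>()).add(record[1]);
            }
        }
        return indexTable;
    }

    public static void main(String[] args) {
        Map<String, String> region1 = new TreeMap<>();
        region1.put("t1", "hello hbase");
        Map<String, String> region2 = new TreeMap<>();
        region2.put("t2", "hello world");
        System.out.println(run(Arrays.asList(region1, region2)));
    }
}
```

Because no mapper depends on another mapper's output, the work divides cleanly along region boundaries, which is why a reduce phase is unnecessary in this design.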
System Architecture for Integrated Query and Analysis
Beyond index building and query evaluation, IndexedHBase can be integrated with parallel computing runtimes such as Hadoop MapReduce and Twister to support sophisticated analysis of query results, as illustrated in Figure 3. Moreover, individual queries and analysis tasks can be composed into end-to-end analysis workflows. See our latest paper for more details.
|Figure 3. IndexedHBase Architecture for Integrated Query and Analysis.|