
What is IndexedHBase?

As data-intensive problems evolve, many research projects require efficient analysis of a targeted subset of the data rather than the whole data set. IndexedHBase is a storage system that extends HBase with a customizable indexing framework to support fast queries and analysis of interesting data subsets. Built on a YARN-based architecture, IndexedHBase can be integrated with various parallel computing platforms, such as Hadoop MapReduce and Twister, to analyze the query results efficiently.

What can IndexedHBase do?

By building index structures that are customized for the actual applications, IndexedHBase can achieve query evaluation speeds that are significantly faster (by one to two orders of magnitude) than the existing indexing techniques provided by commercial NoSQL databases such as Riak.
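The general idea of an application-specific index can be illustrated with a small, self-contained sketch. It is not IndexedHBase or HBase code: the in-memory dictionaries stand in for a data table and an index table, and the names data_table, build_text_index, query_by_scan and query_by_index are hypothetical, chosen only for this illustration. The sketch shows how a term-to-row-key index lets a query touch only the matching rows instead of scanning every record.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for a data table (row key -> record).
# This is an illustration only, not the IndexedHBase or HBase API.
data_table = {
    "t1": {"text": "election debate tonight", "user": "alice"},
    "t2": {"text": "new phone released", "user": "bob"},
    "t3": {"text": "debate highlights and analysis", "user": "carol"},
}


def build_text_index(table):
    """Build an index 'table' mapping each term to the row keys containing it."""
    index = defaultdict(set)
    for row_key, row in table.items():
        for term in row["text"].lower().split():
            index[term].add(row_key)
    return index


def query_by_scan(table, term):
    """Baseline: inspect every record in the data table."""
    return sorted(
        key for key, row in table.items()
        if term in row["text"].lower().split()
    )


def query_by_index(index, term):
    """Indexed lookup: read only the index entry for the term."""
    return sorted(index.get(term, set()))


text_index = build_text_index(data_table)
print(query_by_index(text_index, "debate"))  # ['t1', 't3']
```

Both query functions return the same row keys; the difference is that the indexed lookup does work proportional to the number of matching rows rather than the size of the whole table, which is the kind of saving a customized index is meant to provide for subset queries.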

IndexedHBase has been successfully used in several applications, including Text Indexing (with ClueWeb09), LCIR Synonym Mining, and Social Data Analysis (with Twitter data sets). Please check out our System Design and Publications for more details.

Funding and Leadership

This project is supported in part by National Science Foundation CAREER Grant OCI-1149432 and is supervised by Judy Qiu, Assistant Professor of Computer Science in the School of Informatics and Computing at Indiana University Bloomington.

Our latest paper, "Parallel Clustering of High-Dimensional Social Media Data Streams", will appear at CCGrid 2015.

Check out our last paper "Social Media Data Analysis with IndexedHBase and Iterative MapReduce" at the 6th MTAGS Workshop.

The picture below shows the political polarization in the retweet network generated from politically related tweets in 2012. Click it for more results on our application page about Twitter data analysis:


Retweet network 2012.