SalsaBookshelf

IndexedHBase - Social Media Data Analysis

Overview

IndexedHBase is used to support Truthy, a public social media observatory that analyzes and visualizes information diffusion on Twitter. Truthy collects data through the Twitter streaming API, and the current total size of historical data collected since August 2010 is approximately 10 terabytes in compressed format. The real-time streaming data rate is in the range of 45-50 million tweets per day (approximately 20 GB compressed). Figure 1 illustrates one example tweet in JSON format.

Figure 1. An example tweet in JSON format.

Truthy poses the following requirements to our data storage and processing system based on IndexedHBase:

  • Efficient data loading solutions for the high volumn of social media data.
  • Efficient evaluation mechanisms for its unique temporal queries. Table 1 lists several example queries.
  • Efficient data analysis solutions based on the query results.

Table 1. Example Truthy queries
(Parameter "meme" means hashtags, user-mentions, and URLs contained in tweets.)
query meaning
get-tweets-with-meme(memes, time_window) Get all tweets during 'time_window' containing the given memes.
get-tweets-with-text(keywords, time_window) Get all tweets during 'time_window' containing the given memes.
timestamp-count(memes, time_window) For all tweets during 'time_window' containing the given memes, get the number of tweets posted on each day of the time window./font>
user-post-count(memes, time_window) For all tweets during 'time_window' containing the given memes, get the number of tweets posted by each user.

Table Schemas, Index Structures, and Query Evaluation Strategy

To satisfy the abovementioned requirements, we design the data tables and specially customized index tables in Figure 2. The index tables can be built using the batch and online indexing strategies provided by IndexedHBase. Furthermore, a two-phase parallel query evaluation strategy is implemented to support the Truthy queries, as illustrated in Figure 3. The first phase utilizes the index tables to find the related tweet IDs for the query, and the second phase launches a MapReduce job that processes the tweet IDs in parallel to compute the final results.

Figure 2. Data tables and index tables for Truthy.

Figure 3. Parallel query evaluation strategy.

We compare our solutions on IndexedHBase against using another widely adopted NoSQL database system, Riak. Table 2 provides a summary about the data loading performance using one month's data. Overall, IndexedHBase achieves a data loading speed that is 6 times faster than Riak, and generates a much smaller loaded data size. These advantages come from better data model normalization, customized index structures, less JSON parsing, and better data compression of IndexedHBase. Figure 4 presents the query evaluation performance for several typical queries on both platforms. An obvious trend is that Riak is faster on "small" queries involving less tweets, while IndexedHBase is significantly faster on queries involving a larger number of related tweets and results. The main reason for this difference comes from the light-weighted MapReduce framework employed by Riak, which iccurs less job initialization time but cannot handle large intermediate and result data sizes. Riak MapReduce always uses only one reducer, and intermediate data are transmitted directly from mappers to the reducer without being sorted or grouped. The reducer relies on its memory stack to store the whole list of intermediate data, and has a default timeout of only five seconds.

Table 2. Data loading performance comparison
loading time (hours) total data size (GB) original data size (GB) index data size (GB)
Riak 294.11 3258 2591 667
IndexedHBase 45.47 1167 955 212
Riak / IndexedHBase 6.47 2.79 2.71 3.15

Figure 4. Query evaluation performance: IndexedHBase vs. Riak.

Sophisticated Social Data Analysis Workflow.

Figure 5 gives an example analysis workflow consisting of both queries and post-query analysis tasks. This workflow is implemented for reproducing the experiments from a previous research project about political polarization. Task (2) is a parallel analysis program implemented with Hadoop MapReduce, and Task (5) is an iterative MapReduce Frutcherman-Reingold (MRFR) algorithm implemented using Twister.

Figure 5. An example analysis workflow in Truthy.

We extend this workflow to another dataset about the 2012 presidential election, and the left side of Figure 6 presents the per-iteration execution time of MRFR for processing the retweet network of 2012. The near-linear scalability clearly demonstrates that MRFR is especially good at handling large networks. In particular, using 256 mappers on 32 nodes, MRFR can finish one iteration 355 times faster than the sequential implementation in R. The right side of Figure 6 shows the final plot of the two largest communities of the 2012 retweet network, from which we can still observe a clearly segregated political structure.

Figure 3. Performance of MRFR and the final plot of 2012 retweet network.