IndexedHBase - LCIR Synonym Mining

LC-IR (Local Context–Information Retrieval) is an algorithm for mining synonyms from large data sets. It discovers synonyms based on analysis of words' co-appearances in documents, and computes similarity of words using the formula in Equation 1:

Equation 1. Similarity calculation in LC-IR synonym mining.
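
As a minimal sketch of the calculation, assume Equation 1 takes the commonly cited LC-IR form, similarity(w1, w2) = min(Hits("w1 w2"), Hits("w2 w1")) / (Hits(w1) × Hits(w2)). This form is an assumption here, not a quotation of the authors' definition. The function name lcir_similarity and the sample hit counts are hypothetical.

```python
def lcir_similarity(hits_w1_w2, hits_w2_w1, hits_w1, hits_w2):
    """Assumed LC-IR form:
    min(Hits("w1 w2"), Hits("w2 w1")) / (Hits(w1) * Hits(w2))."""
    if hits_w1 <= 0 or hits_w2 <= 0:
        raise ValueError("single-word hit counts must be positive")
    return min(hits_w1_w2, hits_w2_w1) / (hits_w1 * hits_w2)


# Hypothetical counts: each order of the pair has one hit,
# and the two words have 2 and 3 hits on their own.
score = lcir_similarity(1, 1, 2, 3)
print(round(score, 2))  # 0.17
```

Under this assumed form, a pair scores high only when both word orders occur and the individual words are rare, which is why unusual terms dominate the results.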

Based on Equation 1, we design an algorithm composed of the following steps for mining synonyms from the ClueWeb09 Category B data set:

(1) Word pair frequency counting step. Scan the text data table with a MapReduce program, pick pairs with Hits("w1 w2") > 0 and Hits("w2 w1") > 0, and generate a “pair count” table recording how often these pairs occur in the documents.

(2) Word counting step. Scan the index table and generate a "word count" table to speed up queries for Hits(w1). The table omits terms that have only one hit, and uses a Bloom filter to detect quickly that a term is not in the table.

(3) Synonym scoring step. Scan the “pair count” table with a MapReduce program, and calculate the similarity of each word pair. Single-word hit counts are looked up in the "word count" table and then cached for repeated access.

(4) Synonym filtering step. Keep the word pairs whose similarity value is above a threshold, and output them as results. This step is actually carried out on-the-fly by the MapReduce program in step (3).
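
The four steps above can be sketched in memory as follows. This is an illustration under stated assumptions, not the MapReduce and HBase implementation: plain dictionaries stand in for the tables and the Bloom filter, a hit is taken to mean one document containing the word or the adjacent pair, and the score assumes the form min(Hits("w1 w2"), Hits("w2 w1")) / (Hits(w1) × Hits(w2)). All function names and the sample documents are hypothetical.

```python
from collections import Counter
from functools import lru_cache


def count_pairs(docs):
    """Step (1): count documents containing each adjacent ordered pair,
    keeping only pairs that occur in both orders."""
    pair_hits = Counter()
    for text in docs:
        words = text.lower().split()
        pair_hits.update(set(zip(words, words[1:])))
    return {
        (a, b): (n, pair_hits[(b, a)])
        for (a, b), n in pair_hits.items()
        if a < b and pair_hits[(b, a)] > 0  # a < b keeps one entry per pair
    }


def count_words(docs, min_hits=2):
    """Step (2): document frequency per word, dropping single-hit words."""
    word_hits = Counter()
    for text in docs:
        word_hits.update(set(text.lower().split()))
    return {w: n for w, n in word_hits.items() if n >= min_hits}


def mine_synonyms(docs, threshold):
    """Steps (3) and (4): score each pair and filter on the fly."""
    pairs = count_pairs(docs)
    word_table = count_words(docs)

    @lru_cache(maxsize=None)
    def hits(word):
        # Stand-in for a "word count" table lookup, cached for repeated access.
        return word_table.get(word, 0)

    results = {}
    for (w1, w2), (n12, n21) in pairs.items():
        h1, h2 = hits(w1), hits(w2)
        if h1 == 0 or h2 == 0:
            continue  # the word was dropped from the "word count" table
        score = min(n12, n21) / (h1 * h2)  # assumed form of Equation 1
        if score > threshold:
            results[(w1, w2)] = score
    return results


docs = ["red crimson paint", "crimson red paint", "blue paint"]
print(mine_synonyms(docs, threshold=0.1))  # {('crimson', 'red'): 0.25}
```

Only "crimson" and "red" appear next to each other in both orders, so they are the only pair that survives step (1). Filtering inside the scoring loop mirrors the on-the-fly filtering described in step (4).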

In a configuration with 48 data nodes, step (1) finished in 4 hours and 42 minutes, and step (3) finished in 1 hour and 42 minutes. Table 1 lists some unusual synonyms mined.

Table 1. Example synonyms mined
Synonyms                      Synonym score  Meaning
ablepharie, ablephary         0.17           German and English words for the same eye disease
AbsoftProFortran, PGIFortran  0.11           two Fortran compilers
abzuyian, bzypian             0.5            two dialects of the Abkhazian language
acamposate, acomposate        0.14           two drugs for curing alcoholism
accessLinkId, idAccessLink    0.13           variable names meaning the same thing