org.apache.mahout.vectorizer.collocations.llr
Class CollocReducer

java.lang.Object
  extended by org.apache.hadoop.mapreduce.Reducer<GramKey,Gram,Gram,Gram>
      extended by org.apache.mahout.vectorizer.collocations.llr.CollocReducer

public class CollocReducer
extends org.apache.hadoop.mapreduce.Reducer<GramKey,Gram,Gram,Gram>

Reducer for Pass 1 of the collocation identification job. Generates counts for ngrams and subgrams.


Nested Class Summary
static class CollocReducer.Skipped
           
 
Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.Reducer
org.apache.hadoop.mapreduce.Reducer.Context
 
Field Summary
static int DEFAULT_MIN_SUPPORT
           
static String MIN_SUPPORT
           
 
Constructor Summary
CollocReducer()
           
 
Method Summary
protected  void processSubgram(Iterator<Gram> values, org.apache.hadoop.mapreduce.Reducer.Context context)
          Sum frequencies for the subgram and its ngrams, and deliver (ngram, subgram) pairs to the collector.
protected  void processUnigram(Iterator<Gram> values, org.apache.hadoop.mapreduce.Reducer.Context context)
          Sum frequencies for unigrams and deliver them to the collector.
protected  void reduce(GramKey key, Iterable<Gram> values, org.apache.hadoop.mapreduce.Reducer.Context context)
          Collocation finder, pass 1 reduce phase: sums the gram frequencies received from the mapper and outputs them for the LLR calculation.
protected  void setup(org.apache.hadoop.mapreduce.Reducer.Context context)
           
 
Methods inherited from class org.apache.hadoop.mapreduce.Reducer
cleanup, run
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MIN_SUPPORT

public static final String MIN_SUPPORT
See Also:
Constant Field Values

DEFAULT_MIN_SUPPORT

public static final int DEFAULT_MIN_SUPPORT
See Also:
Constant Field Values
Constructor Detail

CollocReducer

public CollocReducer()
Method Detail

reduce

protected void reduce(GramKey key,
                      Iterable<Gram> values,
                      org.apache.hadoop.mapreduce.Reducer.Context context)
               throws IOException,
                      InterruptedException
Collocation finder, pass 1 reduce phase.

Given input from the mapper,

 k:head_subgram,ngram,  v:ngram:partial freq
 k:head_subgram         v:head_subgram:partial freq
 k:tail_subgram,ngram,  v:ngram:partial freq
 k:tail_subgram         v:tail_subgram:partial freq
 k:unigram              v:unigram:partial freq
 
sum the gram frequencies and output them for the LLR calculation.

The output is:

 k:ngram:ngramfreq      v:head_subgram:head_subgramfreq
 k:ngram:ngramfreq      v:tail_subgram:tail_subgramfreq
 k:unigram:unigramfreq  v:unigram:unigramfreq
 
Each ngram's frequency is effectively counted twice, once for the head and once for the tail; the frequency should be the same in both cases. Could this be fixed to count only for the head and move the count into the value?
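
The data flow described above can be illustrated with a small in-memory model. This is only a sketch, not the Mahout implementation: the class Pass1ReduceSketch, the Partial type (a stand-in for Gram) and the method reduceSubgramGroup are hypothetical names, and the sketch buffers ngram counts in a map instead of relying on the reducer's sort order. It shows how one subgram group (head or tail) turns into "ngram:ngramfreq -> subgram:subgramfreq" pairs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory model of the pass 1 reduce data flow; not the Mahout implementation. */
class Pass1ReduceSketch {

    /** Hypothetical stand-in for a Gram value: gram text plus a partial frequency. */
    static final class Partial {
        final String gram;
        final int freq;

        Partial(String gram, int freq) {
            this.gram = gram;
            this.freq = freq;
        }
    }

    /**
     * Reduces all values that share one subgram (head or tail).
     * Values whose text equals the subgram count toward the subgram itself;
     * every other value is a partial count for an ngram containing that subgram.
     * Returns lines of the form "ngram:ngramfreq -> subgram:subgramfreq".
     */
    static List<String> reduceSubgramGroup(String subgram, List<Partial> values) {
        int subgramFreq = 0;
        // LinkedHashMap keeps ngrams in first-seen order so the output is deterministic.
        Map<String, Integer> ngramFreq = new LinkedHashMap<>();
        for (Partial p : values) {
            if (p.gram.equals(subgram)) {
                subgramFreq += p.freq;
            } else {
                ngramFreq.merge(p.gram, p.freq, Integer::sum);
            }
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : ngramFreq.entrySet()) {
            out.add(e.getKey() + ":" + e.getValue() + " -> " + subgram + ":" + subgramFreq);
        }
        return out;
    }
}
```

Running the same method over the tail group of an ngram sums that ngram's partial frequencies a second time, which is the double counting noted above.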

Overrides:
reduce in class org.apache.hadoop.mapreduce.Reducer<GramKey,Gram,Gram,Gram>
Throws:
IOException
InterruptedException

setup

protected void setup(org.apache.hadoop.mapreduce.Reducer.Context context)
              throws IOException,
                     InterruptedException
Overrides:
setup in class org.apache.hadoop.mapreduce.Reducer<GramKey,Gram,Gram,Gram>
Throws:
IOException
InterruptedException

processUnigram

protected void processUnigram(Iterator<Gram> values,
                              org.apache.hadoop.mapreduce.Reducer.Context context)
                       throws IOException,
                              InterruptedException
Sum frequencies for unigrams and deliver them to the collector.
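
A minimal sketch of the summing step, using plain integers in place of Gram values. The class UnigramSumSketch and the method sumUnigram are hypothetical. The minSupport parameter reflects an assumption that is not stated on this page: that the MIN_SUPPORT setting acts as a minimum total frequency below which a gram is skipped rather than emitted.

```java
import java.util.Iterator;
import java.util.OptionalInt;

/** Sketch of summing partial unigram frequencies; not the Mahout implementation. */
class UnigramSumSketch {

    /**
     * Sums the partial frequencies for one unigram.
     * Assumption: a total below minSupport is dropped, modeled here as an empty result.
     */
    static OptionalInt sumUnigram(Iterator<Integer> partialFreqs, int minSupport) {
        int total = 0;
        while (partialFreqs.hasNext()) {
            total += partialFreqs.next();
        }
        return total < minSupport ? OptionalInt.empty() : OptionalInt.of(total);
    }
}
```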

Throws:
IOException
InterruptedException

processSubgram

protected void processSubgram(Iterator<Gram> values,
                              org.apache.hadoop.mapreduce.Reducer.Context context)
                       throws IOException,
                              InterruptedException
Sum frequencies for the subgram and its ngrams, and deliver (ngram, subgram) pairs to the collector.

Sort order guarantees that the subgram/subgram pairs will be seen first, followed by the subgram/ngram1 pairs, subgram/ngram2 pairs, ..., subgram/ngramN pairs, so frequencies for ngrams can be calculated here as well.

We end up calculating frequencies for ngrams once for each subgram (head, tail) here, which is some extra work.
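
The sort-order guarantee is what allows the sums to be computed in a single pass with constant state. The following sketch illustrates the idea on an in-memory iterator. It is not the Mahout implementation: SortedSubgramSketch, Partial (a stand-in for Gram) and the output string format are hypothetical, and the input is assumed to arrive already sorted as described above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Single-pass sketch that relies on sorted input; not the Mahout implementation. */
class SortedSubgramSketch {

    /** Hypothetical stand-in for a Gram value: gram text plus a partial frequency. */
    static final class Partial {
        final String gram;
        final int freq;

        Partial(String gram, int freq) {
            this.gram = gram;
            this.freq = freq;
        }
    }

    /**
     * Assumes subgram records come first, then the records for each ngram in
     * contiguous runs. Emits one "ngram:freq -> subgram:freq" line per ngram.
     */
    static List<String> processSubgram(String subgram, Iterator<Partial> values) {
        List<String> out = new ArrayList<>();
        int subgramFreq = 0;
        String currentNgram = null;
        int ngramFreq = 0;
        while (values.hasNext()) {
            Partial p = values.next();
            if (p.gram.equals(subgram)) {
                // Subgram records sort first, so this total is complete before any ngram arrives.
                subgramFreq += p.freq;
            } else if (p.gram.equals(currentNgram)) {
                // Same ngram as the previous record: keep accumulating.
                ngramFreq += p.freq;
            } else {
                // A new ngram starts: flush the previous one, if any.
                if (currentNgram != null) {
                    out.add(pair(currentNgram, ngramFreq, subgram, subgramFreq));
                }
                currentNgram = p.gram;
                ngramFreq = p.freq;
            }
        }
        if (currentNgram != null) {
            out.add(pair(currentNgram, ngramFreq, subgram, subgramFreq));
        }
        return out;
    }

    private static String pair(String ngram, int ngramFreq, String subgram, int subgramFreq) {
        return ngram + ":" + ngramFreq + " -> " + subgram + ":" + subgramFreq;
    }
}
```

Only the running subgram total and the current ngram's total are held in memory, at the cost of repeating the ngram sums for the head group and again for the tail group.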

Throws:
InterruptedException
IOException


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.