org.apache.mahout.vectorizer.collocations.llr
Class CollocReducer
java.lang.Object
org.apache.hadoop.mapreduce.Reducer<GramKey,Gram,Gram,Gram>
org.apache.mahout.vectorizer.collocations.llr.CollocReducer
public class CollocReducer
- extends org.apache.hadoop.mapreduce.Reducer<GramKey,Gram,Gram,Gram>
Reducer for Pass 1 of the collocation identification job. Generates counts for ngrams and subgrams.
Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.Reducer
org.apache.hadoop.mapreduce.Reducer.Context
Method Summary
protected void processSubgram(Iterator<Gram> values, org.apache.hadoop.mapreduce.Reducer.Context context)
    Sums frequencies for subgrams and ngrams, and delivers (ngram, subgram) pairs to the collector.
protected void processUnigram(Iterator<Gram> values, org.apache.hadoop.mapreduce.Reducer.Context context)
    Sums frequencies for unigrams and delivers them to the collector.
protected void reduce(GramKey key, Iterable<Gram> values, org.apache.hadoop.mapreduce.Reducer.Context context)
    Collocation finder, pass 1 reduce phase: given input from the mapper, sums gram frequencies and outputs them for the LLR calculation.
protected void setup(org.apache.hadoop.mapreduce.Reducer.Context context)
Methods inherited from class org.apache.hadoop.mapreduce.Reducer
cleanup, run
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
MIN_SUPPORT
public static final String MIN_SUPPORT
- See Also:
- Constant Field Values
DEFAULT_MIN_SUPPORT
public static final int DEFAULT_MIN_SUPPORT
- See Also:
- Constant Field Values
CollocReducer
public CollocReducer()
reduce
protected void reduce(GramKey key,
Iterable<Gram> values,
org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException,
InterruptedException
- Collocation finder, pass 1 reduce phase.
Given input from the mapper in the form:
k:head_subgram,ngram, v:ngram:partial freq
k:head_subgram v:head_subgram:partial freq
k:tail_subgram,ngram, v:ngram:partial freq
k:tail_subgram v:tail_subgram:partial freq
k:unigram v:unigram:partial freq
sum the gram frequencies and output them for the LLR calculation.
The output is:
k:ngram:ngramfreq v:head_subgram:head_subgramfreq
k:ngram:ngramfreq v:tail_subgram:tail_subgramfreq
k:unigram:unigramfreq v:unigram:unigramfreq
Each ngram's frequency is effectively counted twice, once for the head and once for the tail.
The frequency should be the same for the head and the tail. Open question: fix this to count
only for the head and move the count into the value?
- Overrides:
reduce
in class org.apache.hadoop.mapreduce.Reducer<GramKey,Gram,Gram,Gram>
- Throws:
IOException
InterruptedException
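The record shapes described for reduce can be illustrated with a small, self-contained sketch. It does not use Mahout's GramKey or Gram types or the Hadoop Context; the class ReducePhaseSketch, the Partial holder, and the reduceGroup method are hypothetical names introduced only for this example, and grouping values in a map is an assumption made for readability rather than a description of the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the pass 1 reduce phase (not Mahout code).
final class ReducePhaseSketch {

    /** One mapper value: the gram text and a partial frequency. */
    static final class Partial {
        final String gram;
        final int freq;

        Partial(String gram, int freq) {
            this.gram = gram;
            this.freq = freq;
        }
    }

    /**
     * Sums the partial frequencies seen for one subgram key and formats the
     * output records as "k:ngram:ngramfreq v:subgram:subgramfreq".
     */
    static List<String> reduceGroup(String subgram, List<Partial> values) {
        // Total frequency per gram, in first-seen order.
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Partial p : values) {
            totals.merge(p.gram, p.freq, Integer::sum);
        }
        int subgramFreq = totals.getOrDefault(subgram, 0);

        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            if (!e.getKey().equals(subgram)) {
                out.add("k:" + e.getKey() + ":" + e.getValue()
                        + " v:" + subgram + ":" + subgramFreq);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Values grouped under the head subgram "new".
        List<Partial> head = Arrays.asList(
                new Partial("new", 2), new Partial("new", 1),
                new Partial("new york", 2), new Partial("new york", 1));
        // Values grouped under the tail subgram "york".
        List<Partial> tail = Arrays.asList(
                new Partial("york", 3), new Partial("york", 1),
                new Partial("new york", 2), new Partial("new york", 1));

        System.out.println(reduceGroup("new", head));   // [k:new york:3 v:new:3]
        System.out.println(reduceGroup("york", tail));  // [k:new york:3 v:york:4]
    }
}
```

In this sketch the frequency of "new york" is summed once in the head group and once in the tail group, and both sums agree, which is the duplicated work the note above refers to.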
setup
protected void setup(org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException,
InterruptedException
- Overrides:
setup
in class org.apache.hadoop.mapreduce.Reducer<GramKey,Gram,Gram,Gram>
- Throws:
IOException
InterruptedException
processUnigram
protected void processUnigram(Iterator<Gram> values,
org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException,
InterruptedException
- Sums frequencies for unigrams and delivers them to the collector.
- Throws:
IOException
InterruptedException
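A minimal sketch of the summing step for unigrams follows. SimpleGram is a hypothetical stand-in for Mahout's Gram, and sumUnigram is an illustrative name. The sketch also assumes that a minimum-support threshold (compare MIN_SUPPORT and DEFAULT_MIN_SUPPORT above) is applied at this point; this page does not say where the threshold is applied, so treat that part as an assumption.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Optional;

// Hypothetical, simplified model of unigram summing (not Mahout code).
final class UnigramSumSketch {

    /** Simplified stand-in for Mahout's Gram: text plus a frequency. */
    static final class SimpleGram {
        final String text;
        final int frequency;

        SimpleGram(String text, int frequency) {
            this.text = text;
            this.frequency = frequency;
        }
    }

    /**
     * Sums the partial frequencies of a single unigram. Returns an empty
     * Optional when there are no values or when the total falls below
     * minSupport (assumed filtering step, see the lead-in).
     */
    static Optional<SimpleGram> sumUnigram(Iterator<SimpleGram> values, int minSupport) {
        String text = null;
        int total = 0;
        while (values.hasNext()) {
            SimpleGram g = values.next();
            if (text == null) {
                text = g.text;
            }
            total += g.frequency;
        }
        if (text == null || total < minSupport) {
            return Optional.empty();
        }
        // Per the reduce description, the same gram would be emitted as both
        // key and value: k:unigram:unigramfreq v:unigram:unigramfreq.
        return Optional.of(new SimpleGram(text, total));
    }

    public static void main(String[] args) {
        Iterator<SimpleGram> partials = Arrays.asList(
                new SimpleGram("york", 3), new SimpleGram("york", 1)).iterator();
        Optional<SimpleGram> result = sumUnigram(partials, 2);
        // Prints "york:4".
        System.out.println(result.map(g -> g.text + ":" + g.frequency).orElse("skipped"));
    }
}
```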
processSubgram
protected void processSubgram(Iterator<Gram> values,
org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException,
InterruptedException
- Sums frequencies for subgrams and ngrams, and delivers (ngram, subgram) pairs to the collector.
Sort order guarantees that the subgram/subgram pairs are seen first, followed by the
subgram/ngram1 pairs, subgram/ngram2 pairs ... subgram/ngramN pairs, so frequencies for
ngrams can be calculated here as well.
We end up calculating ngram frequencies once for each subgram (head and tail) here, which is
some extra work.
- Throws:
InterruptedException
IOException
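The sort-order guarantee described for processSubgram allows a single streaming pass with no per-ngram map. The sketch below shows one way this could look. SimpleGram, its isNgram flag, and processSorted are hypothetical names; the real class works with Gram values and a Hadoop Context, neither of which is modelled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified model of single-pass subgram processing (not Mahout code).
final class SubgramStreamSketch {

    /** Simplified stand-in for Mahout's Gram. */
    static final class SimpleGram {
        final String text;
        final int frequency;
        final boolean isNgram; // false for the subgram itself

        SimpleGram(String text, int frequency, boolean isNgram) {
            this.text = text;
            this.frequency = frequency;
            this.isNgram = isNgram;
        }
    }

    /**
     * Consumes values sorted as: all subgram partials first, then the
     * partials for ngram1, ngram2 ... ngramN in contiguous runs. Emits one
     * "k:ngram:ngramfreq v:subgram:subgramfreq" record per ngram.
     */
    static List<String> processSorted(Iterator<SimpleGram> values) {
        List<String> out = new ArrayList<>();
        String subgram = null;
        int subgramFreq = 0;
        String currentNgram = null;
        int ngramFreq = 0;

        while (values.hasNext()) {
            SimpleGram g = values.next();
            if (!g.isNgram) {
                // Subgram partials arrive first, so this total is complete
                // before the first ngram is seen.
                subgram = g.text;
                subgramFreq += g.frequency;
            } else if (g.text.equals(currentNgram)) {
                ngramFreq += g.frequency;
            } else {
                // A new ngram run starts: flush the previous one.
                if (currentNgram != null) {
                    out.add(format(currentNgram, ngramFreq, subgram, subgramFreq));
                }
                currentNgram = g.text;
                ngramFreq = g.frequency;
            }
        }
        if (currentNgram != null) {
            out.add(format(currentNgram, ngramFreq, subgram, subgramFreq));
        }
        return out;
    }

    private static String format(String ngram, int ngramFreq, String subgram, int subgramFreq) {
        return "k:" + ngram + ":" + ngramFreq + " v:" + subgram + ":" + subgramFreq;
    }

    public static void main(String[] args) {
        Iterator<SimpleGram> sorted = Arrays.asList(
                new SimpleGram("new", 2, false), new SimpleGram("new", 1, false),
                new SimpleGram("new jersey", 1, true),
                new SimpleGram("new york", 2, true), new SimpleGram("new york", 1, true)).iterator();
        // Prints [k:new jersey:1 v:new:3, k:new york:3 v:new:3]
        System.out.println(processSorted(sorted));
    }
}
```

Because each ngram appears under both its head and its tail subgram, a pass like this runs twice per ngram, which matches the "extra work" noted in the description.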
Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.