org.apache.mahout.clustering.lda.cvb
Class CachingCVB0Mapper
java.lang.Object
org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.IntWritable,VectorWritable,org.apache.hadoop.io.IntWritable,VectorWritable>
org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper
- Direct Known Subclasses:
- CVB0DocInferenceMapper
public class CachingCVB0Mapper
- extends org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.IntWritable,VectorWritable,org.apache.hadoop.io.IntWritable,VectorWritable>
Runs ensemble learning by loading the ModelTrainer with two TopicModel instances:
one from the previous iteration, the other empty. Inference runs against the first;
learning updates accumulate in the second and are emitted only at cleanup().
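The read-model/write-model pattern described above can be sketched in plain Java. This is a minimal illustration and not the Mahout implementation: the class TwoModelSketch, its dense double[][] models, and the simplified per-term weighting are all assumptions made for the example, and no Hadoop or Mahout types are used.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the two-model pattern; not the Mahout implementation. */
final class TwoModelSketch {
    // Topic-by-term weights from the previous iteration; only read during map().
    private final double[][] readModel;
    // Starts empty; receives every update made during this pass.
    private final double[][] writeModel;

    TwoModelSketch(double[][] previousIteration) {
        this.readModel = previousIteration;
        this.writeModel = new double[previousIteration.length][previousIteration[0].length];
    }

    /** Infer against the read model and accumulate expected counts in the write model. */
    void map(double[] docTermCounts) {
        for (int term = 0; term < docTermCounts.length; term++) {
            double norm = 0.0;
            for (double[] topicRow : readModel) {
                norm += topicRow[term];
            }
            if (docTermCounts[term] == 0.0 || norm == 0.0) {
                continue; // nothing to attribute for this term
            }
            for (int topic = 0; topic < readModel.length; topic++) {
                // Simplified responsibility: the share of this term's weight held by the topic.
                double share = readModel[topic][term] / norm;
                writeModel[topic][term] += docTermCounts[term] * share;
            }
        }
    }

    /** Emit the accumulated updates once, after all documents have been mapped. */
    List<double[]> cleanup() {
        List<double[]> emitted = new ArrayList<>();
        for (double[] topicRow : writeModel) {
            emitted.add(topicRow.clone());
        }
        return emitted;
    }
}
```

Because map() never writes to the read model, every document in the pass is scored against the same previous-iteration weights, which is the behavior the description attributes to this Mapper.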
In terms of obvious performance improvements still available, the memory footprint of this
Mapper could be halved if we accumulated model updates onto the model we're using
for inference. That might also speed up convergence, as we'd be able to take advantage of
learning during an iteration, not just after each one is done. We most likely don't
need to accumulate double values in the model either; floats would probably be
sufficient. Together, these two changes could yield another factor of 4 in memory efficiency.
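The factor of 4 follows from simple arithmetic: two models of 8-byte doubles versus one model of 4-byte floats. The sketch below works it through for a dense topic-by-term matrix. The class name ModelMemoryEstimate and the sizes (100 topics, 50,000 terms) are assumptions chosen for illustration, and the estimate ignores JVM object overhead and any sparse storage the real model may use.

```java
/** Back-of-envelope estimate only; ignores JVM object overhead and sparse storage. */
final class ModelMemoryEstimate {
    /** Bytes needed for numModels dense topic-by-term matrices. */
    static long denseBytes(int numModels, int numTopics, int numTerms, int bytesPerValue) {
        return (long) numModels * numTopics * numTerms * bytesPerValue;
    }

    public static void main(String[] args) {
        int numTopics = 100;    // assumed for illustration
        int numTerms = 50_000;  // assumed for illustration
        // Current layout: a read model and a write model, both holding doubles.
        long current = denseBytes(2, numTopics, numTerms, Double.BYTES);
        // Proposed layout: a single model holding floats.
        long proposed = denseBytes(1, numTopics, numTerms, Float.BYTES);
        System.out.println(current + " bytes now, " + proposed
                + " bytes proposed, ratio " + (current / proposed));
    }
}
```

With these assumed sizes the two layouts come to 80,000,000 and 20,000,000 bytes respectively.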
In terms of CPU, we re-learn the p(topic|doc) distribution from scratch on every iteration.
This usually takes only 10 fixed-point iterations per doc, but that is still 10x more than
1. Avoiding it would require a map-side join of the unchanging corpus with the
continually improving p(topic|doc) matrix, plus emitting multiple outputs from the mappers
so that the reduce-side model averaging still works. Tricky, but possibly worth it.
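To make the per-document cost concrete, here is a rough sketch of a fixed-point loop that re-estimates p(topic|doc) from a uniform start. It is an assumption-laden simplification, not the update rule Mahout uses: the class DocTopicFixedPoint, the dense termGivenTopic matrix, and the absence of any smoothing terms are all choices made for the example.

```java
import java.util.Arrays;

/** Hypothetical sketch of per-document fixed-point inference; not the Mahout update rule. */
final class DocTopicFixedPoint {
    /**
     * @param docTermCounts  term counts for one document
     * @param termGivenTopic termGivenTopic[k][t] is an assumed p(term t | topic k)
     * @param maxIters       number of fixed-point sweeps (the text above mentions about 10)
     * @return an estimate of p(topic | doc), starting from a uniform distribution
     */
    static double[] infer(double[] docTermCounts, double[][] termGivenTopic, int maxIters) {
        int numTopics = termGivenTopic.length;
        double[] docTopics = new double[numTopics];
        Arrays.fill(docTopics, 1.0 / numTopics); // start from scratch every time
        for (int iter = 0; iter < maxIters; iter++) {
            double[] next = new double[numTopics];
            double total = 0.0;
            for (int t = 0; t < docTermCounts.length; t++) {
                if (docTermCounts[t] == 0.0) {
                    continue;
                }
                double norm = 0.0;
                for (int k = 0; k < numTopics; k++) {
                    norm += docTopics[k] * termGivenTopic[k][t];
                }
                if (norm == 0.0) {
                    continue; // no topic explains this term
                }
                for (int k = 0; k < numTopics; k++) {
                    // Responsibility of topic k for term t under the current estimate.
                    double resp = docTopics[k] * termGivenTopic[k][t] / norm;
                    next[k] += docTermCounts[t] * resp;
                    total += docTermCounts[t] * resp;
                }
            }
            if (total == 0.0) {
                break; // the document has no usable terms
            }
            for (int k = 0; k < numTopics; k++) {
                docTopics[k] = next[k] / total;
            }
        }
        return docTopics;
    }
}
```

Each sweep touches every non-zero term of the document once per topic, so running maxIters sweeps from a uniform start costs roughly maxIters times as much as a single sweep from a carried-over estimate would.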
ModelTrainer already takes advantage of multi-core availability (in a perhaps not-so-nice way)
by doing multithreaded learning; see that class for details.
Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.Mapper
org.apache.hadoop.mapreduce.Mapper.Context
Methods inherited from class org.apache.hadoop.mapreduce.Mapper
run
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
CachingCVB0Mapper
public CachingCVB0Mapper()
getModelTrainer
protected ModelTrainer getModelTrainer()
getMaxIters
protected int getMaxIters()
getNumTopics
protected int getNumTopics()
setup
protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException,
InterruptedException
- Overrides:
setup
in class org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.IntWritable,VectorWritable,org.apache.hadoop.io.IntWritable,VectorWritable>
- Throws:
IOException
InterruptedException
map
public void map(org.apache.hadoop.io.IntWritable docId,
VectorWritable document,
org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException,
InterruptedException
- Overrides:
map
in class org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.IntWritable,VectorWritable,org.apache.hadoop.io.IntWritable,VectorWritable>
- Throws:
IOException
InterruptedException
cleanup
protected void cleanup(org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException,
InterruptedException
- Overrides:
cleanup
in class org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.IntWritable,VectorWritable,org.apache.hadoop.io.IntWritable,VectorWritable>
- Throws:
IOException
InterruptedException
Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.