org.apache.mahout.clustering.lda.cvb
Class ModelTrainer

java.lang.Object
  extended by org.apache.mahout.clustering.lda.cvb.ModelTrainer

public class ModelTrainer
extends Object

Multithreaded LDA model trainer. It primarily operates by running a "map/reduce" operation entirely in memory on the local machine (i.e. not as a Hadoop job). The "map" step takes the read-only TopicModel and uses it to iteratively learn the p(topic|term, doc) distribution for each document; this can be done in parallel across many documents because the read model is never modified. The outputs are then "reduced" onto the write model. These updates are not parallelizable in the same way: different threads cannot add individual documents to the same entries at the same time, but updates across many topics to the same term from the same document can be applied in parallel, so they are. Because computation is asynchronous, call stop() once iteration is finished; it blocks until all outstanding work is complete. Using the same object as both the read model and the write model may not work correctly yet because of concurrency issues.
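The read-model/write-model pattern described above can be illustrated with a minimal, self-contained sketch. This is not Mahout code: the class ReadWriteTrainerSketch, its dense double[][] "models", and the method snapshotWriteModel() are hypothetical stand-ins, and the reduce step uses one coarse lock rather than the finer-grained per-term parallelism the description mentions. It only shows why the map step can run concurrently and why stop() has to block.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical, simplified stand-in for the trainer; not the Mahout implementation. */
final class ReadWriteTrainerSketch {
    private final double[][] readModel;   // [topic][term]; never modified while training
    private final double[][] writeModel;  // [topic][term]; accumulates the updates
    private final int numThreads;
    private ExecutorService pool;

    ReadWriteTrainerSketch(double[][] readModel, int numThreads) {
        this.readModel = readModel;
        this.writeModel = new double[readModel.length][readModel[0].length];
        this.numThreads = numThreads;
    }

    /** Creates the worker threads; must be called before train(). */
    void start() {
        pool = Executors.newFixedThreadPool(numThreads);
    }

    /** Queues one document (dense term counts) and returns immediately. */
    void train(double[] termCounts) {
        pool.execute(() -> {
            // "map": split each term count across topics in proportion to the
            // read-only model. No lock is needed because readModel is not written.
            double[][] update = new double[readModel.length][termCounts.length];
            for (int term = 0; term < termCounts.length; term++) {
                double norm = 0.0;
                for (double[] topic : readModel) {
                    norm += topic[term];
                }
                if (norm == 0.0) {
                    continue; // term has no mass in any topic
                }
                for (int k = 0; k < readModel.length; k++) {
                    update[k][term] = termCounts[term] * readModel[k][term] / norm;
                }
            }
            // "reduce": fold the update into the shared write model.
            // Assumption: a single coarse lock, chosen for simplicity.
            synchronized (writeModel) {
                for (int k = 0; k < update.length; k++) {
                    for (int term = 0; term < update[k].length; term++) {
                        writeModel[k][term] += update[k][term];
                    }
                }
            }
        });
    }

    /** Blocks until every queued document has been processed (or one minute passes). */
    void stop() {
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns a copy of the accumulated write model. */
    double[][] snapshotWriteModel() {
        synchronized (writeModel) {
            double[][] copy = new double[writeModel.length][];
            for (int k = 0; k < writeModel.length; k++) {
                copy[k] = writeModel[k].clone();
            }
            return copy;
        }
    }

    public static void main(String[] args) {
        ReadWriteTrainerSketch trainer =
            new ReadWriteTrainerSketch(new double[][] {{1, 1}, {1, 1}}, 4);
        trainer.start();
        for (int i = 0; i < 3; i++) {
            trainer.train(new double[] {2, 4});
        }
        trainer.stop(); // without this, the write model may still be incomplete
        System.out.println(trainer.snapshotWriteModel()[0][0]);
    }
}
```

In this sketch, reading the write model before stop() returns could observe a partially accumulated result, which is the reason the class description asks callers to invoke stop() when iteration is done.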


Constructor Summary
ModelTrainer(TopicModel model, int numTrainThreads, int numTopics, int numTerms)
          WARNING: this constructor may not lead to good behavior.
ModelTrainer(TopicModel initialReadModel, TopicModel initialWriteModel, int numTrainThreads, int numTopics, int numTerms)
           
 
Method Summary
 void batchTrain(Map<Vector,Vector> batch, boolean update, int numDocTopicsIters)
           
 double calculatePerplexity(VectorIterable matrix, VectorIterable docTopicCounts)
           
 double calculatePerplexity(VectorIterable matrix, VectorIterable docTopicCounts, double testFraction)
           
 double calculatePerplexity(Vector document, Vector docTopicCounts, int numDocTopicIters)
           
 TopicModel getReadModel()
           
 void persist(org.apache.hadoop.fs.Path outputPath)
           
 void start()
           
 void stop()
           
 void train(VectorIterable matrix, VectorIterable docTopicCounts)
           
 void train(VectorIterable matrix, VectorIterable docTopicCounts, int numDocTopicIters)
           
 void train(Vector document, Vector docTopicCounts, boolean update, int numDocTopicIters)
           
 void trainSync(Vector document, Vector docTopicCounts, boolean update, int numDocTopicIters)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ModelTrainer

public ModelTrainer(TopicModel initialReadModel,
                    TopicModel initialWriteModel,
                    int numTrainThreads,
                    int numTopics,
                    int numTerms)

ModelTrainer

public ModelTrainer(TopicModel model,
                    int numTrainThreads,
                    int numTopics,
                    int numTerms)
WARNING: this constructor may not lead to good behavior. It has not been verified that the model-updating process does not conflict with model reading, so using a single model for both may or may not work.

Parameters:
model - to be used for both reading (inference) and accumulating (learning)
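numTrainThreads - number of threads used for training
numTopics - number of topics in the model
numTerms - number of terms in the vocabulary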

Method Detail

getReadModel

public TopicModel getReadModel()

start

public void start()

train

public void train(VectorIterable matrix,
                  VectorIterable docTopicCounts)

calculatePerplexity

public double calculatePerplexity(VectorIterable matrix,
                                  VectorIterable docTopicCounts)

calculatePerplexity

public double calculatePerplexity(VectorIterable matrix,
                                  VectorIterable docTopicCounts,
                                  double testFraction)

train

public void train(VectorIterable matrix,
                  VectorIterable docTopicCounts,
                  int numDocTopicIters)

batchTrain

public void batchTrain(Map<Vector,Vector> batch,
                       boolean update,
                       int numDocTopicsIters)

train

public void train(Vector document,
                  Vector docTopicCounts,
                  boolean update,
                  int numDocTopicIters)

trainSync

public void trainSync(Vector document,
                      Vector docTopicCounts,
                      boolean update,
                      int numDocTopicIters)

calculatePerplexity

public double calculatePerplexity(Vector document,
                                  Vector docTopicCounts,
                                  int numDocTopicIters)

stop

public void stop()

persist

public void persist(org.apache.hadoop.fs.Path outputPath)
             throws IOException
Throws:
IOException


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.