org.apache.mahout.classifier.sgd
Class CrossFoldLearner

java.lang.Object
  extended by org.apache.mahout.classifier.AbstractVectorClassifier
      extended by org.apache.mahout.classifier.sgd.CrossFoldLearner
All Implemented Interfaces:
Closeable, org.apache.hadoop.io.Writable, OnlineLearner

public class CrossFoldLearner
extends AbstractVectorClassifier
implements OnlineLearner, org.apache.hadoop.io.Writable

Does cross-fold validation of log-likelihood and AUC on several online logistic regression models. Each record is passed to all but one of the models for training and to the remaining model for evaluation. In order to maintain proper segregation between the different folds across training data iterations, data should either be passed to this learner in the same order each time the training data is traversed or a tracking key such as the file offset of the training record should be passed with each training example.
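
The routing described above can be sketched in plain Java with no Mahout dependency. The class name, the heldOutFold helper, and the modulo rule for choosing the evaluation fold are assumptions made for illustration, not a statement of how this class is implemented. Simple counters stand in for the underlying models.

```java
final class FoldRoutingSketch {
    // Hypothetical helper: picks the single fold that evaluates a record.
    static int heldOutFold(long trackingKey, int folds) {
        return (int) Math.floorMod(trackingKey, (long) folds);
    }

    // Routes records with keys 0..records-1 and returns, for each fold,
    // {number of records trained on, number of records evaluated on}.
    static int[][] route(int records, int folds) {
        int[][] counts = new int[folds][2];
        for (long key = 0; key < records; key++) {
            int heldOut = heldOutFold(key, folds);
            for (int k = 0; k < folds; k++) {
                if (k == heldOut) {
                    counts[k][1]++; // this fold only scores the record
                } else {
                    counts[k][0]++; // every other fold trains on it
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[][] counts = route(10, 5);
        for (int k = 0; k < counts.length; k++) {
            System.out.println("fold " + k + ": trained on " + counts[k][0]
                + ", evaluated on " + counts[k][1]);
        }
    }
}
```

With 10 records and 5 folds, each stand-in model trains on 8 records and is evaluated on the 2 it never saw.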


Field Summary
 
Fields inherited from class org.apache.mahout.classifier.AbstractVectorClassifier
MIN_LOG_LIKELIHOOD
 
Constructor Summary
CrossFoldLearner()
           
CrossFoldLearner(int folds, int numCategories, int numFeatures, PriorFunction prior)
           
 
Method Summary
 void addModel(OnlineLogisticRegression model)
           
 CrossFoldLearner alpha(double alpha)
           
 double auc()
           
 Vector classify(Vector instance)
          Compute and return a vector containing n-1 scores, where n is equal to numCategories(), given an input vector instance.
 Vector classifyNoLink(Vector instance)
          Compute and return a vector of scores before applying the inverse link function.
 double classifyScalar(Vector instance)
          Classifies a vector in the special case of a binary classifier where AbstractVectorClassifier.classify(Vector) would return a vector with only one element.
 void close()
          Prepares the classifier for classification and deallocates any temporary data structures.
 CrossFoldLearner copy()
           
 CrossFoldLearner decayExponent(double x)
           
 OnlineAuc getAucEvaluator()
           
 double getLogLikelihood()
           
 List<OnlineLogisticRegression> getModels()
           
 int getNumFeatures()
           
 double[] getParameters()
           
 PriorFunction getPrior()
           
 int getRecord()
           
 CrossFoldLearner lambda(double v)
           
 CrossFoldLearner learningRate(double x)
           
 double logLikelihood()
           
 int numCategories()
          Returns the number of categories that a target variable can be assigned to.
 double percentCorrect()
           
 void readFields(DataInput in)
           
 void resetLineCounter()
           
 void setAucEvaluator(OnlineAuc auc)
           
 void setLogLikelihood(double logLikelihood)
           
 void setNumFeatures(int numFeatures)
           
 void setParameters(double[] parameters)
           
 void setPrior(PriorFunction prior)
           
 void setRecord(int record)
           
 void setWindowSize(int windowSize)
           
 CrossFoldLearner stepOffset(int x)
           
 void train(int actual, Vector instance)
          Updates the model using a particular target variable value and a feature vector.
 void train(long trackingKey, int actual, Vector instance)
          Updates the model using a particular target variable value and a feature vector.
 void train(long trackingKey, String groupKey, int actual, Vector instance)
          Updates the model using a particular target variable value and a feature vector.
 boolean validModel()
           
 void write(DataOutput out)
           
 
Methods inherited from class org.apache.mahout.classifier.AbstractVectorClassifier
classify, classifyFull, classifyFull, classifyFull, classifyScalar, logLikelihood
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CrossFoldLearner

public CrossFoldLearner()

CrossFoldLearner

public CrossFoldLearner(int folds,
                        int numCategories,
                        int numFeatures,
                        PriorFunction prior)
Method Detail

lambda

public CrossFoldLearner lambda(double v)

learningRate

public CrossFoldLearner learningRate(double x)

stepOffset

public CrossFoldLearner stepOffset(int x)

decayExponent

public CrossFoldLearner decayExponent(double x)

alpha

public CrossFoldLearner alpha(double alpha)

train

public void train(int actual,
                  Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.

There may be an assumption that, if multiple passes through the training data are necessary, the training examples will be presented in the same order on each pass. This is because the order of training examples may be used to assign records to different data splits for evaluation by cross-validation. Without this order invariance, a record might be assigned to a training split on one pass and to a test split on another, and error estimates could be seriously affected.

If re-ordering is necessary, use the alternative API, which allows a tracking key to be passed with each training example.

Specified by:
train in interface OnlineLearner
Parameters:
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.

train

public void train(long trackingKey,
                  int actual,
                  Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.

There may be an assumption that, if multiple passes through the training data are necessary, the tracking key for a record will be the same on each pass, that there will be a relatively large number of distinct tracking keys, and that the low-order bits of the tracking keys will not correlate with any of the input variables. This tracking key is used to assign training examples to different test/training splits.

Examples of useful tracking keys include ID numbers for the training records, derived from a database ID for the base table from which the record is derived, or the offset of the original data record in a data file.

Specified by:
train in interface OnlineLearner
Parameters:
trackingKey - The tracking key for this training example.
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.
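
To illustrate why a tracking key tolerates re-ordering while an implicit record counter does not, here is a minimal sketch in plain Java. The foldByKey rule (key modulo fold count), the class name, and the file offsets are hypothetical and chosen only for the demonstration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TrackingKeySketch {
    // Hypothetical rule: the fold is derived from the key, so it follows the record.
    static int foldByKey(long trackingKey, int folds) {
        return (int) Math.floorMod(trackingKey, (long) folds);
    }

    // One pass over the data, assigning each record's fold from its tracking key.
    static Map<Long, Integer> passByKey(List<Long> keys, int folds) {
        Map<Long, Integer> assignment = new HashMap<>();
        for (long key : keys) {
            assignment.put(key, foldByKey(key, folds));
        }
        return assignment;
    }

    // One pass over the data, assigning each record's fold from its arrival position.
    static Map<Long, Integer> passByPosition(List<Long> keys, int folds) {
        Map<Long, Integer> assignment = new HashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            assignment.put(keys.get(i), i % folds);
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<Long> firstPass = new ArrayList<>();
        for (long i = 0; i < 20; i++) {
            firstPass.add(i * 128L); // hypothetical file offsets used as keys
        }
        List<Long> secondPass = new ArrayList<>(firstPass);
        Collections.reverse(secondPass); // same records, different order

        boolean keyStable = passByKey(firstPass, 5).equals(passByKey(secondPass, 5));
        boolean positionStable =
            passByPosition(firstPass, 5).equals(passByPosition(secondPass, 5));
        System.out.println("key-based stable: " + keyStable);           // true
        System.out.println("position-based stable: " + positionStable); // false
    }
}
```

Under this particular rule, keys that are all multiples of the fold count would land in a single fold, which is one reason to prefer many distinct keys whose low-order bits are unrelated to the data.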

train

public void train(long trackingKey,
                  String groupKey,
                  int actual,
                  Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.

There may be an assumption that, if multiple passes through the training data are necessary, the tracking key for a record will be the same on each pass, that there will be a relatively large number of distinct tracking keys, and that the low-order bits of the tracking keys will not correlate with any of the input variables. This tracking key is used to assign training examples to different test/training splits.

Examples of useful tracking keys include ID numbers for the training records, derived from a database ID for the base table from which the record is derived, or the offset of the original data record in a data file.

Specified by:
train in interface OnlineLearner
Parameters:
trackingKey - The tracking key for this training example.
groupKey - An optional value that allows examples to be grouped in the computation of the update to the model.
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.

close

public void close()
Description copied from interface: OnlineLearner
Prepares the classifier for classification and deallocates any temporary data structures. An online classifier should be able to accept more training after being closed, but closing the classifier may make classification more efficient.

Specified by:
close in interface Closeable
Specified by:
close in interface OnlineLearner

resetLineCounter

public void resetLineCounter()

validModel

public boolean validModel()

classify

public Vector classify(Vector instance)
Description copied from class: AbstractVectorClassifier
Compute and return a vector containing n-1 scores, where n is equal to numCategories(), given an input vector instance. Higher scores indicate that the input vector is more likely to belong to that category. The categories are denoted by the integers 0 through n-1 (inclusive), and the scores in the returned vector correspond to categories 1 through n-1 (leaving out category 0). It is assumed that the score for category 0 is one minus the sum of the scores in the returned vector.

Specified by:
classify in class AbstractVectorClassifier
Parameters:
instance - A feature vector to be classified.
Returns:
A vector of probabilities in 1 of n-1 encoding.
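
As a rough illustration of the 1 of n-1 encoding, the sketch below recovers the implied score for category 0 from the n-1 returned scores. It uses plain double arrays rather than Mahout's Vector type, and the class and method names are hypothetical.

```java
import java.util.Arrays;

final class FullScoresSketch {
    // Expands n-1 scores (categories 1..n-1) into n scores by
    // prepending the implied score for category 0: one minus their sum.
    static double[] expand(double[] scores) {
        double[] full = new double[scores.length + 1];
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            full[i + 1] = scores[i];
            sum += scores[i];
        }
        full[0] = 1.0 - sum;
        return full;
    }

    // Index of the largest score, i.e. the most likely category.
    static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Three categories, so the classifier output holds two scores.
        double[] full = expand(new double[] {0.25, 0.5});
        System.out.println(Arrays.toString(full)); // [0.25, 0.25, 0.5]
        System.out.println(argMax(full));          // 2
    }
}
```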

classifyNoLink

public Vector classifyNoLink(Vector instance)
Description copied from class: AbstractVectorClassifier
Compute and return a vector of scores before applying the inverse link function. For logistic regression and other generalized linear models, this is just the linear part of the classification.

The implementation of this method provided by AbstractVectorClassifier throws an UnsupportedOperationException. Your subclass must explicitly override this method to support this operation.

Overrides:
classifyNoLink in class AbstractVectorClassifier
Parameters:
instance - A feature vector to be classified.
Returns:
A vector of scores. If transformed by the link function, these will become probabilities.
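
For intuition about how the linear scores relate to probabilities, the following sketch applies the standard inverse logit link to plain double values. It assumes the usual multinomial logistic form with category 0 as the reference category, which is consistent with the n-1 encoding described for classify but is not taken from this class's source. It also ignores overflow for very large scores.

```java
final class InverseLinkSketch {
    // Binary case: logistic function of a single linear score.
    static double logistic(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Multinomial case with category 0 as the reference:
    // p_i = exp(x_i) / (1 + sum_j exp(x_j)) for categories 1..n-1.
    static double[] inverseLink(double[] linear) {
        double[] p = new double[linear.length];
        double denominator = 1.0; // the 1 accounts for the reference category
        for (int i = 0; i < linear.length; i++) {
            p[i] = Math.exp(linear[i]);
            denominator += p[i];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= denominator;
        }
        return p;
    }

    public static void main(String[] args) {
        System.out.println(logistic(0.0)); // 0.5
        double[] p = inverseLink(new double[] {0.0, 0.0});
        // All three categories are equally likely when every linear score is zero.
        System.out.println(p[0] + " " + p[1]);
    }
}
```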

classifyScalar

public double classifyScalar(Vector instance)
Description copied from class: AbstractVectorClassifier
Classifies a vector in the special case of a binary classifier where AbstractVectorClassifier.classify(Vector) would return a vector with only one element. As such, using this method can avoid the allocation of a vector.

Specified by:
classifyScalar in class AbstractVectorClassifier
Parameters:
instance - The feature vector to be classified.
Returns:
The score for category 1.
See Also:
AbstractVectorClassifier.classify(Vector)

numCategories

public int numCategories()
Description copied from class: AbstractVectorClassifier
Returns the number of categories that a target variable can be assigned to. A vector classifier will encode its output as an integer from 0 to numCategories()-1 (inclusive).

Specified by:
numCategories in class AbstractVectorClassifier
Returns:
The number of categories.

auc

public double auc()

logLikelihood

public double logLikelihood()

percentCorrect

public double percentCorrect()

copy

public CrossFoldLearner copy()

getRecord

public int getRecord()

setRecord

public void setRecord(int record)

getAucEvaluator

public OnlineAuc getAucEvaluator()

setAucEvaluator

public void setAucEvaluator(OnlineAuc auc)

getLogLikelihood

public double getLogLikelihood()

setLogLikelihood

public void setLogLikelihood(double logLikelihood)

getModels

public List<OnlineLogisticRegression> getModels()

addModel

public void addModel(OnlineLogisticRegression model)

getParameters

public double[] getParameters()

setParameters

public void setParameters(double[] parameters)

getNumFeatures

public int getNumFeatures()

setNumFeatures

public void setNumFeatures(int numFeatures)

setWindowSize

public void setWindowSize(int windowSize)

getPrior

public PriorFunction getPrior()

setPrior

public void setPrior(PriorFunction prior)

write

public void write(DataOutput out)
           throws IOException
Specified by:
write in interface org.apache.hadoop.io.Writable
Throws:
IOException

readFields

public void readFields(DataInput in)
                throws IOException
Specified by:
readFields in interface org.apache.hadoop.io.Writable
Throws:
IOException


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.