org.apache.mahout.vectorizer.tfidf
Class TFIDFConverter
java.lang.Object
org.apache.mahout.vectorizer.tfidf.TFIDFConverter
public final class TFIDFConverter
extends Object
This class converts a set of input vectors with term frequencies to TfIdf vectors. The SequenceFile input
should have a WritableComparable key containing the document id and a VectorWritable value containing the
term frequency vector. This conversion class uses multiple map/reduce passes to convert the vectors to
TfIdf format.
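The expected call sequence is calculateDF first, then processTfIdf. Below is a minimal end-to-end sketch; the HDFS paths and tuning values are hypothetical, and Pair is assumed to be org.apache.mahout.common.Pair.

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.common.Pair;
    import org.apache.mahout.vectorizer.tfidf.TFIDFConverter;

    public class TfIdfPipeline {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path tfVectors = new Path("tf-vectors");     // SequenceFile of term-frequency vectors
        Path dfCounts  = new Path("df-counts");      // where document frequencies are written
        Path tfidfOut  = new Path("tfidf-vectors");  // where tf-idf vectors are written

        // Step 1: count document frequencies (must run before processTfIdf).
        Pair<Long[], List<Path>> datasetFeatures =
            TFIDFConverter.calculateDF(tfVectors, dfCounts, conf, 100 /* chunk size, MB */);

        // Step 2: re-weight the term-frequency vectors into tf-idf vectors.
        TFIDFConverter.processTfIdf(tfVectors, tfidfOut, conf, datasetFeatures,
            1,     // minDf
            99,    // maxDF (percentage)
            2.0f,  // normPower: L2-normalize output vectors
            false, // logNormalize
            true,  // sequentialAccessOutput
            false, // namedVector
            1);    // numReducers
      }
    }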
Method Summary

static Pair<Long[],List<org.apache.hadoop.fs.Path>>
calculateDF(org.apache.hadoop.fs.Path input,
            org.apache.hadoop.fs.Path output,
            org.apache.hadoop.conf.Configuration baseConf,
            int chunkSizeInMegabytes)
    Calculates the document frequencies of all terms from the input set of vectors in
    SequenceFile format.

static void
processTfIdf(org.apache.hadoop.fs.Path input,
             org.apache.hadoop.fs.Path output,
             org.apache.hadoop.conf.Configuration baseConf,
             Pair<Long[],List<org.apache.hadoop.fs.Path>> datasetFeatures,
             int minDf,
             long maxDF,
             float normPower,
             boolean logNormalize,
             boolean sequentialAccessOutput,
             boolean namedVector,
             int numReducers)
    Create Term Frequency-Inverse Document Frequency (Tf-Idf) vectors from the input set of
    vectors in SequenceFile format.

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail

VECTOR_COUNT
public static final String VECTOR_COUNT

FEATURE_COUNT
public static final String FEATURE_COUNT

MIN_DF
public static final String MIN_DF

MAX_DF
public static final String MAX_DF

WORDCOUNT_OUTPUT_FOLDER
public static final String WORDCOUNT_OUTPUT_FOLDER
Method Detail

processTfIdf
public static void processTfIdf(org.apache.hadoop.fs.Path input,
org.apache.hadoop.fs.Path output,
org.apache.hadoop.conf.Configuration baseConf,
Pair<Long[],List<org.apache.hadoop.fs.Path>> datasetFeatures,
int minDf,
long maxDF,
float normPower,
boolean logNormalize,
boolean sequentialAccessOutput,
boolean namedVector,
int numReducers)
throws IOException,
InterruptedException,
ClassNotFoundException
- Create Term Frequency-Inverse Document Frequency (Tf-Idf) vectors from the input set of vectors in
  SequenceFile format. This job uses a fixed limit on the maximum memory used by the feature chunk
  per node, thereby splitting the process across multiple map/reduce passes.
  Before using this method, calculateDF should be called.
- Parameters:
  input - input directory of the vectors in SequenceFile format
  output - output directory where the RandomAccessSparseVectors of the documents are generated
  datasetFeatures - document frequency information calculated by calculateDF
  minDf - the minimum document frequency. Default: 1
  maxDF - the maximum percentage of vectors in which a term may appear (its document frequency).
          Can be used to remove very high frequency features. Expressed as an integer between 0
          and 100. Default: 99
  numReducers - the number of reducers to spawn. This also affects the possible parallelism, since
          each reducer will typically produce a single output file containing tf-idf vectors for a
          subset of the documents in the corpus.
- Throws:
IOException
InterruptedException
ClassNotFoundException
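An illustrative invocation, with hypothetical values chosen to show how the pruning and output options interact; tfVectors, tfidfOut, conf, and datasetFeatures are assumed to come from a prior calculateDF run, as in the class overview.

    // Keep terms seen in at least 2 documents but in no more than 80% of them.
    TFIDFConverter.processTfIdf(tfVectors, tfidfOut, conf, datasetFeatures,
        2,     // minDf: prune terms appearing in fewer than 2 documents
        80,    // maxDF: prune terms appearing in more than 80% of documents
        2.0f,  // normPower: L2-normalize each tf-idf vector
        false, // logNormalize: no log scaling on top of the norm
        true,  // sequentialAccessOutput: emit sequential-access sparse vectors
        true,  // namedVector: keep the document key attached to each vector
        4);    // numReducers: four output shards of tf-idf vectors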
calculateDF
public static Pair<Long[],List<org.apache.hadoop.fs.Path>> calculateDF(org.apache.hadoop.fs.Path input,
org.apache.hadoop.fs.Path output,
org.apache.hadoop.conf.Configuration baseConf,
int chunkSizeInMegabytes)
throws IOException,
InterruptedException,
ClassNotFoundException
- Calculates the document frequencies of all terms from the input set of vectors in
  SequenceFile format. This job uses a fixed limit on the maximum memory used by the feature chunk
  per node, thereby splitting the process across multiple map/reduce passes.
- Parameters:
  input - input directory of the vectors in SequenceFile format
  output - output directory where the document frequencies will be stored
  chunkSizeInMegabytes - the size in MB of the feature => id chunk to be kept in memory at each
          node during the map/reduce stage. It is recommended that you calculate this based on the
          number of cores and the free memory available per node. Say you have 2 cores and around
          1 GB of memory to spare; we recommend a split size of around 400-500 MB so that two
          simultaneous reducers can create partial vectors without thrashing the system due to
          increased swapping.
- Throws:
IOException
InterruptedException
ClassNotFoundException
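A sketch of consuming the return value. The assumption here is that the Long[] carries corpus-level counts (the feature count and the vector/document count, in that order) and that the Path list points at the frequency chunk files written under the output directory.

    Configuration conf = new Configuration();
    Pair<Long[], List<Path>> df = TFIDFConverter.calculateDF(
        new Path("tf-vectors"),  // hypothetical input of term-frequency vectors
        new Path("df-counts"),   // hypothetical output directory
        conf,
        100);                    // keep at most ~100 MB of the feature chunk in memory per node

    Long[] counts = df.getFirst();
    System.out.println("features: " + counts[0] + ", vectors: " + counts[1]); // assumed ordering
    for (Path chunk : df.getSecond()) {
      System.out.println("frequency chunk: " + chunk);
    }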