org.apache.mahout.vectorizer
Class DictionaryVectorizer
java.lang.Object
org.apache.hadoop.conf.Configured
org.apache.mahout.common.AbstractJob
org.apache.mahout.vectorizer.DictionaryVectorizer
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool, Vectorizer
public final class DictionaryVectorizer
- extends AbstractJob
- implements Vectorizer
This class converts a set of input documents in the sequence file format to vectors. The SequenceFile input should have a Text key containing the unique document identifier and a StringTuple value containing the tokenized document. You may use DocumentProcessor to tokenize the document. This is a dictionary-based Vectorizer.
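A minimal usage sketch of the pipeline this class expects: raw documents are tokenized with DocumentProcessor and the resulting StringTuple sequence files are handed to createTermFrequencyVectors. The paths, the choice of StandardAnalyzer, and the tuning values below are assumptions for illustration, not values prescribed by this class.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.mahout.vectorizer.DictionaryVectorizer;
import org.apache.mahout.vectorizer.DocumentProcessor;

public class TfVectorizationSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path docs = new Path("sequence-files");       // assumed: <Text docId, Text content>
    Path tokenized = new Path("tokenized-docs");  // will hold <Text docId, StringTuple tokens>
    Path output = new Path("vectors");            // assumed output directory

    // Step 1: tokenize the raw documents into StringTuple values.
    DocumentProcessor.tokenizeDocuments(docs, StandardAnalyzer.class, tokenized, conf);

    // Step 2: build the dictionary and emit term-frequency vectors.
    DictionaryVectorizer.createTermFrequencyVectors(
        tokenized, output,
        DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER,
        conf,
        DictionaryVectorizer.DEFAULT_MIN_SUPPORT, // minSupport
        1,      // maxNGramSize: unigrams only
        0.0f,   // minLLRValue: unused when only unigrams are generated
        -1.0f,  // normPower: assumed "no normalization" sentinel
        false,  // logNormalize
        1,      // numReducers
        100,    // chunkSizeInMegabytes
        false,  // sequentialAccess
        false); // namedVectors
  }
}
```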
Method Summary

static void   createTermFrequencyVectors(org.apache.hadoop.fs.Path input,
                  org.apache.hadoop.fs.Path output,
                  String tfVectorsFolderName,
                  org.apache.hadoop.conf.Configuration baseConf,
                  int minSupport,
                  int maxNGramSize,
                  float minLLRValue,
                  float normPower,
                  boolean logNormalize,
                  int numReducers,
                  int chunkSizeInMegabytes,
                  boolean sequentialAccess,
                  boolean namedVectors)
              Create Term Frequency (Tf) Vectors from the input set of documents in SequenceFile format.

void          createVectors(org.apache.hadoop.fs.Path input,
                  org.apache.hadoop.fs.Path output,
                  VectorizerConfig config)

static void   main(String[] args)

int           run(String[] args)
Methods inherited from class org.apache.mahout.common.AbstractJob
addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, buildOption, getAnalyzerClassFromOption, getCLIOption, getConf, getDimensions, getFloat, getFloat, getGroup, getInputFile, getInputPath, getInt, getInt, getOption, getOption, getOption, getOptions, getOutputFile, getOutputPath, getOutputPath, getTempPath, getTempPath, hasOption, keyFor, maybePut, parseArguments, parseArguments, parseDirectories, prepareJob, prepareJob, prepareJob, prepareJob, setConf, setS3SafeCombinedInputPath, shouldRunNextPhase
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
DOCUMENT_VECTOR_OUTPUT_FOLDER
public static final String DOCUMENT_VECTOR_OUTPUT_FOLDER
- See Also:
- Constant Field Values
MIN_SUPPORT
public static final String MIN_SUPPORT
- See Also:
- Constant Field Values
MAX_NGRAMS
public static final String MAX_NGRAMS
- See Also:
- Constant Field Values
DEFAULT_MIN_SUPPORT
public static final int DEFAULT_MIN_SUPPORT
- See Also:
- Constant Field Values
createVectors
public void createVectors(org.apache.hadoop.fs.Path input,
org.apache.hadoop.fs.Path output,
VectorizerConfig config)
throws IOException,
ClassNotFoundException,
InterruptedException
- Specified by:
  createVectors in interface Vectorizer
- Throws:
IOException
ClassNotFoundException
InterruptedException
createTermFrequencyVectors
public static void createTermFrequencyVectors(org.apache.hadoop.fs.Path input,
org.apache.hadoop.fs.Path output,
String tfVectorsFolderName,
org.apache.hadoop.conf.Configuration baseConf,
int minSupport,
int maxNGramSize,
float minLLRValue,
float normPower,
boolean logNormalize,
int numReducers,
int chunkSizeInMegabytes,
boolean sequentialAccess,
boolean namedVectors)
throws IOException,
InterruptedException,
ClassNotFoundException
- Create Term Frequency (Tf) Vectors from the input set of documents in SequenceFile format. This bounds the maximum memory used by the feature chunk on each node by splitting the process across multiple map/reduce passes.
- Parameters:
  input - input directory of the documents in SequenceFile format
  output - output directory where the RandomAccessSparseVectors of the documents are generated
  tfVectorsFolderName - the name of the folder in which the final output vectors will be stored
  baseConf - job configuration
  normPower - L_p norm to be computed
  logNormalize - whether to use log normalization
  minSupport - the minimum frequency a feature must have in the entire corpus to be included in the sparse vector
  maxNGramSize - 1 = unigram, 2 = unigram and bigram, 3 = unigram, bigram and trigram
  minLLRValue - minimum log-likelihood ratio value used to prune n-grams
  chunkSizeInMegabytes - the size in MB of the feature => id chunk kept in memory on each node during the Map/Reduce stage. It is recommended you calculate this based on the number of cores and the free memory available per node. For example, with 2 cores and around 1 GB of spare memory, a chunk size of around 400-500 MB lets two simultaneous reducers create partial vectors without thrashing the system through increased swapping.
- Throws:
IOException
InterruptedException
ClassNotFoundException
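As an illustration of the tuning parameters described above, the following sketch requests unigrams plus bigrams, prunes bigrams by log-likelihood ratio, and L2-normalizes the resulting vectors. The paths and the specific values chosen are assumptions for illustration only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.vectorizer.DictionaryVectorizer;

public class BigramTfVectorsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path tokenized = new Path("tokenized-docs");    // assumed: <Text docId, StringTuple tokens>
    Path output = new Path("bigram-tf-vectors");    // assumed output directory

    DictionaryVectorizer.createTermFrequencyVectors(
        tokenized, output,
        DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER,
        conf,
        2,      // minSupport: ignore features seen fewer than 2 times in the corpus
        2,      // maxNGramSize: unigrams and bigrams
        50.0f,  // minLLRValue: prune bigrams whose log-likelihood ratio is below 50
        2.0f,   // normPower: apply the L_2 norm
        false,  // logNormalize
        2,      // numReducers
        200,    // chunkSizeInMegabytes: dictionary chunk held in memory per node
        true,   // sequentialAccess: emit sequential-access sparse vectors
        true);  // namedVectors: keep document ids as vector names
  }
}
```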
run
public int run(String[] args)
throws Exception
- Specified by:
  run in interface org.apache.hadoop.util.Tool
- Throws:
Exception
main
public static void main(String[] args)
throws Exception
- Throws:
Exception
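Because the class implements org.apache.hadoop.util.Tool, it can also be driven as a command-line job through the public static main entry point shown above. A hedged sketch follows: the --input/--output flags are the standard options AbstractJob can register via addInputOption/addOutputOption, and both the flags and the paths are assumptions rather than options documented on this page; consult the job's help output for the actual option set.

```java
import org.apache.mahout.vectorizer.DictionaryVectorizer;

public class DictionaryVectorizerCliSketch {
  public static void main(String[] args) throws Exception {
    // main(String[]) presumably hands the arguments to run(String[]) via the
    // usual Tool machinery. Flags and paths below are assumptions.
    DictionaryVectorizer.main(new String[] {
        "--input", "tokenized-docs",   // assumed path to <Text, StringTuple> sequence files
        "--output", "tf-vectors"       // assumed output directory
    });
  }
}
```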
Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.