org.apache.mahout.vectorizer
Class DictionaryVectorizer

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.mahout.common.AbstractJob
          extended by org.apache.mahout.vectorizer.DictionaryVectorizer
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool, Vectorizer

public final class DictionaryVectorizer
extends AbstractJob
implements Vectorizer

This class converts a set of input documents in SequenceFile format to vectors. The SequenceFile input should have a Text key containing the unique document identifier and a StringTuple value containing the tokenized document. You may use DocumentProcessor to tokenize the document. This is a dictionary-based Vectorizer.
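For illustration only, the following sketch (not part of this class) writes the kind of SequenceFile this vectorizer expects: a Text document id mapped to a StringTuple of pre-tokenized terms. The path "tokenized-documents", the document id, and the tokens are hypothetical; in practice DocumentProcessor would produce this file for you.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.StringTuple;

public class TokenizedDocsWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path tokenizedDocs = new Path("tokenized-documents"); // hypothetical path

    // One record per document: unique id (Text) -> tokenized document (StringTuple)
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, tokenizedDocs, Text.class, StringTuple.class);
    try {
      StringTuple doc1 = new StringTuple();
      for (String token : new String[] {"mahout", "builds", "sparse", "vectors"}) {
        doc1.add(token);
      }
      writer.append(new Text("doc-1"), doc1);
    } finally {
      writer.close();
    }
  }
}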


Field Summary
static int DEFAULT_MIN_SUPPORT
           
static String DOCUMENT_VECTOR_OUTPUT_FOLDER
           
static String MAX_NGRAMS
           
static String MIN_SUPPORT
           
 
Fields inherited from class org.apache.mahout.common.AbstractJob
argMap, inputFile, inputPath, outputFile, outputPath, tempPath
 
Method Summary
static void createTermFrequencyVectors(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, String tfVectorsFolderName, org.apache.hadoop.conf.Configuration baseConf, int minSupport, int maxNGramSize, float minLLRValue, float normPower, boolean logNormalize, int numReducers, int chunkSizeInMegabytes, boolean sequentialAccess, boolean namedVectors)
          Create Term Frequency (Tf) Vectors from the input set of documents in SequenceFile format.
 void createVectors(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, VectorizerConfig config)
           
static void main(String[] args)
           
 int run(String[] args)
           
 
Methods inherited from class org.apache.mahout.common.AbstractJob
addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, buildOption, getAnalyzerClassFromOption, getCLIOption, getConf, getDimensions, getFloat, getFloat, getGroup, getInputFile, getInputPath, getInt, getInt, getOption, getOption, getOption, getOptions, getOutputFile, getOutputPath, getOutputPath, getTempPath, getTempPath, hasOption, keyFor, maybePut, parseArguments, parseArguments, parseDirectories, prepareJob, prepareJob, prepareJob, prepareJob, setConf, setS3SafeCombinedInputPath, shouldRunNextPhase
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DOCUMENT_VECTOR_OUTPUT_FOLDER

public static final String DOCUMENT_VECTOR_OUTPUT_FOLDER
See Also:
Constant Field Values

MIN_SUPPORT

public static final String MIN_SUPPORT
See Also:
Constant Field Values

MAX_NGRAMS

public static final String MAX_NGRAMS
See Also:
Constant Field Values

DEFAULT_MIN_SUPPORT

public static final int DEFAULT_MIN_SUPPORT
See Also:
Constant Field Values
Method Detail

createVectors

public void createVectors(org.apache.hadoop.fs.Path input,
                          org.apache.hadoop.fs.Path output,
                          VectorizerConfig config)
                   throws IOException,
                          ClassNotFoundException,
                          InterruptedException
Specified by:
createVectors in interface Vectorizer
Throws:
IOException
ClassNotFoundException
InterruptedException

createTermFrequencyVectors

public static void createTermFrequencyVectors(org.apache.hadoop.fs.Path input,
                                              org.apache.hadoop.fs.Path output,
                                              String tfVectorsFolderName,
                                              org.apache.hadoop.conf.Configuration baseConf,
                                              int minSupport,
                                              int maxNGramSize,
                                              float minLLRValue,
                                              float normPower,
                                              boolean logNormalize,
                                              int numReducers,
                                              int chunkSizeInMegabytes,
                                              boolean sequentialAccess,
                                              boolean namedVectors)
                                       throws IOException,
                                              InterruptedException,
                                              ClassNotFoundException
Create Term Frequency (Tf) Vectors from the input set of documents in SequenceFile format. This caps the maximum memory used by the feature (dictionary) chunk on each node, splitting the process across multiple Map/Reduce passes when the dictionary does not fit in a single chunk.

Parameters:
input - input directory of the documents in SequenceFile format
output - output directory where the RandomAccessSparseVectors of the documents are generated
tfVectorsFolderName - the name of the folder in which the final output vectors will be stored
baseConf - job configuration
minSupport - the minimum frequency a feature must have in the entire corpus to be included in the sparse vectors
maxNGramSize - 1 = unigrams, 2 = unigrams and bigrams, 3 = unigrams, bigrams and trigrams
minLLRValue - minimum value of the log-likelihood ratio used to prune ngrams
normPower - the L_p norm to be computed
logNormalize - whether to use log normalization
numReducers - the number of reducers to use
chunkSizeInMegabytes - the size in MB of the feature => id chunk to be kept in memory on each node during the Map/Reduce stage. It is recommended that you calculate this based on the number of cores and the free memory available per node. Say you have 2 cores and around 1GB of memory to spare: a chunk size of around 400-500MB lets two simultaneous reducers create partial vectors without thrashing the system through increased swapping.
sequentialAccess - whether the output vectors should be written as SequentialAccessSparseVectors
namedVectors - whether the output vectors should be wrapped as NamedVectors
Throws:
IOException
InterruptedException
ClassNotFoundException
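
As a usage illustration (not from the original Javadoc), the sketch below invokes createTermFrequencyVectors with the signature documented above. The paths and the values chosen for maxNGramSize, numReducers, chunkSizeInMegabytes, and the -1 normPower (taken here to mean "no normalization") are illustrative assumptions, not recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.vectorizer.DictionaryVectorizer;

public class TfVectorDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path tokenizedDocs = new Path("tokenized-documents"); // SequenceFile<Text, StringTuple>
    Path outputDir = new Path("vectors");                 // hypothetical output directory

    DictionaryVectorizer.createTermFrequencyVectors(
        tokenizedDocs,                                      // input
        outputDir,                                          // output
        DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER, // tfVectorsFolderName
        conf,                                               // baseConf
        DictionaryVectorizer.DEFAULT_MIN_SUPPORT,           // minSupport
        1,                                                  // maxNGramSize: unigrams only
        0.0f,                                               // minLLRValue (not used for unigrams)
        -1.0f,                                              // normPower: assumed "no normalization"
        false,                                              // logNormalize
        1,                                                  // numReducers
        100,                                                // chunkSizeInMegabytes
        false,                                              // sequentialAccess
        false);                                             // namedVectors
  }
}

The resulting term-frequency vectors are written under the given output directory in the folder named by tfVectorsFolderName.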

run

public int run(String[] args)
        throws Exception
Specified by:
run in interface org.apache.hadoop.util.Tool
Throws:
Exception

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.