org.apache.mahout.vectorizer
Class DocumentProcessor
java.lang.Object
org.apache.mahout.vectorizer.DocumentProcessor
public final class DocumentProcessor
- extends Object
This class converts a set of input documents in SequenceFile format into StringTuples. The SequenceFile
input should have a Text key containing the unique document identifier and a Text value containing the
whole document. The document should be stored in UTF-8 encoding, which Hadoop recognizes. The class uses
the given Analyzer to process each document into Tokens.
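The shape of the conversion described above can be illustrated with a small, library-free sketch. It is only a conceptual stand-in and not Mahout's implementation: the class name TokenizeSketch and both helper methods are hypothetical, a plain Map replaces the SequenceFile, a List of strings replaces the StringTuple, and a simple lower-case-and-split step replaces the Lucene Analyzer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class TokenizeSketch {

    // Hypothetical stand-in for an Analyzer: lower-cases the text and
    // splits it on runs of characters that are neither letters nor digits.
    static List<String> analyze(String document) {
        List<String> tokens = new ArrayList<>();
        for (String piece : document.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Maps each (document id, whole document) pair to (document id, token list),
    // mirroring the Text key / Text value input and the per-document token output.
    static Map<String, List<String>> tokenizeAll(Map<String, String> documents) {
        Map<String, List<String>> tokenized = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : documents.entrySet()) {
            tokenized.put(entry.getKey(), analyze(entry.getValue()));
        }
        return tokenized;
    }

    public static void main(String[] args) {
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("doc-1", "Mahout tokenizes documents.");
        documents.put("doc-2", "Each value holds the whole document");
        System.out.println(tokenizeAll(documents));
    }
}
```

The document identifier passes through unchanged as the key; only the value changes, from the raw text to its ordered list of tokens.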
Method Summary
static void
tokenizeDocuments(org.apache.hadoop.fs.Path input,
Class<? extends org.apache.lucene.analysis.Analyzer> analyzerClass,
org.apache.hadoop.fs.Path output,
org.apache.hadoop.conf.Configuration baseConf)
Converts the input documents into token arrays stored as StringTuples. The input documents must be
in SequenceFile format.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
TOKENIZED_DOCUMENT_OUTPUT_FOLDER
public static final String TOKENIZED_DOCUMENT_OUTPUT_FOLDER
- See Also:
- Constant Field Values
ANALYZER_CLASS
public static final String ANALYZER_CLASS
- See Also:
- Constant Field Values
tokenizeDocuments
public static void tokenizeDocuments(org.apache.hadoop.fs.Path input,
Class<? extends org.apache.lucene.analysis.Analyzer> analyzerClass,
org.apache.hadoop.fs.Path output,
org.apache.hadoop.conf.Configuration baseConf)
throws IOException,
InterruptedException,
ClassNotFoundException
- Converts the input documents into token arrays stored as StringTuples. The input documents must be
in SequenceFile format.
- Parameters:
input - input directory of the documents in SequenceFile format
output - output directory where the StringTuple token array of each document is to be created
analyzerClass - the Lucene Analyzer for tokenizing the UTF-8 text
- Throws:
IOException
InterruptedException
ClassNotFoundException
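A call to tokenizeDocuments might look like the following minimal sketch. It assumes that Hadoop, Lucene, and Mahout are on the classpath and that the chosen analyzer class (here Lucene's StandardAnalyzer) can be instantiated by Mahout. The class name TokenizeDocumentsExample, the helper tokenizedPath, the paths, and the sample document are hypothetical, and using TOKENIZED_DOCUMENT_OUTPUT_FOLDER as a subfolder name under the output base is an assumption rather than something this page states.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.mahout.vectorizer.DocumentProcessor;

public class TokenizeDocumentsExample {

    // Assumption: the constant is used as a subfolder name below a base output directory.
    static Path tokenizedPath(Path outputBase) {
        return new Path(outputBase, DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
    }

    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical locations; adjust to your own file system layout.
        Path inputDir = new Path("example/input");
        Path inputFile = new Path(inputDir, "docs.seq");
        Path outputBase = new Path("example/output");

        // Write one document: Text key = unique document id, Text value = whole document.
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, inputFile, Text.class, Text.class);
        try {
            writer.append(new Text("doc-1"), new Text("The quick brown fox jumps over the lazy dog."));
        } finally {
            writer.close();
        }

        // Tokenize every document under inputDir; the output directory
        // should not already exist when the Hadoop job starts.
        DocumentProcessor.tokenizeDocuments(
                inputDir, StandardAnalyzer.class, tokenizedPath(outputBase), conf);
    }
}
```

The three checked exceptions in the method signature are simply declared on main here; a real application would more likely catch and report them.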
Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.