org.apache.mahout.classifier.sgd
Class CsvRecordFactory

java.lang.Object
  extended by org.apache.mahout.classifier.sgd.CsvRecordFactory
All Implemented Interfaces:
RecordFactory

public class CsvRecordFactory
extends Object
implements RecordFactory

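Converts CSV data lines to vectors. Use of this class proceeds in a few steps: construct the factory with the name of the target variable and a map of predictor types; optionally declare the target categories with defineTargetCategories, or cap their number with maxTargetValue; pass the header line to firstLine so that the column positions are known; then call processLine on each data line to fill in a feature vector.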


Constructor Summary
CsvRecordFactory(String targetName, Map<String,String> typeMap)
          Construct a parser for CSV lines that encodes the parsed data in vector form.
CsvRecordFactory(String targetName, String idName, Map<String,String> typeMap)
           
 
Method Summary
 void defineTargetCategories(List<String> values)
          Defines the values, and thus the encoding, of the target variable.
 void firstLine(String line)
          Processes the first line of a file (which should contain the variable names).
 String getIdName()
           
 String getIdString(CharSequence line)
          Extracts the id column value from the CSV record.
 Iterable<String> getPredictors()
          Returns a list of the names of the predictor variables.
 List<String> getTargetCategories()
           
 String getTargetLabel(int code)
          Returns the raw target label that corresponds to a code.
 String getTargetString(CharSequence line)
          Extract the raw target string from a line read from a CSV file.
 Map<String,Set<Integer>> getTraceDictionary()
           
 CsvRecordFactory includeBiasTerm(boolean useBias)
           
 CsvRecordFactory maxTargetValue(int max)
          Defines the number of target variable categories, but allows this parser to pick encodings for them as they appear.
 int processLine(CharSequence line, Vector featureVector, boolean returnTarget)
          Decodes a single line of CSV data and records the target (if returnTarget is true) and predictor variables in a record.
 int processLine(String line, Vector featureVector)
          Decodes a single line of CSV data and records the target and predictor variables in a record.
 void setIdName(String idName)
           
 boolean usesFirstLineAsSchema()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CsvRecordFactory

public CsvRecordFactory(String targetName,
                        Map<String,String> typeMap)
Construct a parser for CSV lines that encodes the parsed data in vector form.

Parameters:
targetName - The name of the target variable.
typeMap - A map describing the types of the predictor variables.

CsvRecordFactory

public CsvRecordFactory(String targetName,
                        String idName,
                        Map<String,String> typeMap)

Method Detail

defineTargetCategories

public void defineTargetCategories(List<String> values)
Defines the values, and thus the encoding, of the target variable. Note that any value of the target variable that is not present in this list will be given the encoding of the last member of the list.

Specified by:
defineTargetCategories in interface RecordFactory
Parameters:
values - The values the target variable can have.
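
The fallback rule described above can be illustrated with a small stand-alone sketch. It is not Mahout's implementation: the class name FixedTargetEncodingSketch and its encode method are hypothetical, and the sketch assumes that codes are assigned in list order starting at 0.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a fixed target dictionary; not part of Mahout.
final class FixedTargetEncodingSketch {
    private final Map<String, Integer> codes = new LinkedHashMap<>();
    private final int fallbackCode;

    FixedTargetEncodingSketch(List<String> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("at least one target value is required");
        }
        // Assumption: each distinct value gets the next free code, in list order.
        for (String value : values) {
            codes.putIfAbsent(value, codes.size());
        }
        // Values missing from the list share the code of the last list member.
        fallbackCode = codes.get(values.get(values.size() - 1));
    }

    int encode(String value) {
        Integer code = codes.get(value);
        return code != null ? code : fallbackCode;
    }

    public static void main(String[] args) {
        FixedTargetEncodingSketch enc =
            new FixedTargetEncodingSketch(Arrays.asList("spam", "ham", "other"));
        System.out.println(enc.encode("ham"));     // a listed value keeps its own code
        System.out.println(enc.encode("unknown")); // an unlisted value gets the last member's code
    }
}
```

A practical consequence is that placing a catch-all category such as "other" last in the list gives unseen target values a sensible home.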

maxTargetValue

public CsvRecordFactory maxTargetValue(int max)
Defines the number of target variable categories, but allows this parser to pick encodings for them as they appear.

Specified by:
maxTargetValue in interface RecordFactory
Parameters:
max - The number of categories that will be expected. Once this many have been seen, all others will get the encoding max-1.
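
As a rough illustration of encodings being picked as values appear, the following stand-alone sketch assigns codes in order of first appearance and then falls back to max-1. It is not Mahout's code: CappedTargetEncodingSketch and encode are hypothetical names, and the assumption that codes start at 0 is mine.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a growing target dictionary; not part of Mahout.
final class CappedTargetEncodingSketch {
    private final Map<String, Integer> codes = new HashMap<>();
    private final int max;

    CappedTargetEncodingSketch(int max) {
        if (max < 1) {
            throw new IllegalArgumentException("max must be at least 1");
        }
        this.max = max;
    }

    int encode(String value) {
        Integer known = codes.get(value);
        if (known != null) {
            return known;              // already seen: reuse its code
        }
        if (codes.size() < max) {
            int code = codes.size();   // assumption: codes are handed out from 0 upward
            codes.put(value, code);
            return code;
        }
        return max - 1;                // dictionary is full: share the last code
    }

    public static void main(String[] args) {
        CappedTargetEncodingSketch enc = new CappedTargetEncodingSketch(3);
        for (String v : new String[] {"a", "b", "c", "d", "a"}) {
            System.out.println(v + " -> " + enc.encode(v));
        }
    }
}
```

Because the codes depend on the order in which values are first seen, two passes over differently ordered data can produce different encodings; declaring the categories explicitly avoids that.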

usesFirstLineAsSchema

public boolean usesFirstLineAsSchema()
Specified by:
usesFirstLineAsSchema in interface RecordFactory

firstLine

public void firstLine(String line)
Processes the first line of a file (which should contain the variable names). The target and predictor column numbers are set from the names on this line.

Specified by:
firstLine in interface RecordFactory
Parameters:
line - Header line for the file.
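
To make the idea of setting column numbers from the header concrete, here is a minimal sketch that maps each variable name to its position. It is only an illustration under simplifying assumptions: the header is taken to be plain comma-separated text without quoting, and HeaderSchemaSketch and columnIndex are hypothetical names, not Mahout API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper showing how a header line can yield column numbers.
final class HeaderSchemaSketch {
    static Map<String, Integer> columnIndex(String headerLine) {
        Map<String, Integer> index = new LinkedHashMap<>();
        // Assumption: no quoted fields, so a plain split on commas is enough.
        String[] names = headerLine.split(",", -1);
        for (int i = 0; i < names.length; i++) {
            index.put(names[i].trim(), i);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Integer> index = columnIndex("id,color,price,label");
        // With the target named "label", later lines are read at this position.
        System.out.println("target column: " + index.get("label"));
    }
}
```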

processLine

public int processLine(String line,
                       Vector featureVector)
Decodes a single line of CSV data and records the target and predictor variables in a record. As a side effect, features are added into the featureVector. Returns the value of the target variable.

Specified by:
processLine in interface RecordFactory
Parameters:
line - The raw data.
featureVector - Where to fill in the features. Should be zeroed before calling processLine.
Returns:
The value of the target variable.
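
The advice to zero the vector follows from features being added into it rather than assigned. The sketch below shows the effect with a plain double[] standing in for Mahout's Vector; the hashing scheme and the names AdditiveFillSketch, addFeatures and sum are invented for illustration and do not describe how Mahout encodes features.

```java
import java.util.Arrays;

// Hypothetical demonstration of additive filling; not Mahout's encoder.
final class AdditiveFillSketch {
    // Adds 1.0 at a hashed slot for each name=value pair and never clears the array.
    static void addFeatures(String[] names, String[] values, double[] featureVector) {
        for (int i = 0; i < names.length; i++) {
            String feature = names[i] + "=" + values[i];
            int slot = Math.floorMod(feature.hashCode(), featureVector.length);
            featureVector[slot] += 1.0;
        }
    }

    static double sum(double[] vector) {
        double total = 0.0;
        for (double x : vector) {
            total += x;
        }
        return total;
    }

    public static void main(String[] args) {
        String[] names = {"color", "size"};
        double[] vector = new double[16];

        addFeatures(names, new String[] {"red", "large"}, vector);
        addFeatures(names, new String[] {"blue", "small"}, vector);
        // Without zeroing, the second record is mixed with the first.
        System.out.println("total weight without zeroing: " + sum(vector));

        Arrays.fill(vector, 0.0); // reset before reusing the vector
        addFeatures(names, new String[] {"blue", "small"}, vector);
        System.out.println("total weight after zeroing: " + sum(vector));
    }
}
```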

processLine

public int processLine(CharSequence line,
                       Vector featureVector,
                       boolean returnTarget)
Decodes a single line of CSV data and records the target (if returnTarget is true) and predictor variables in a record. As a side effect, features are added into the featureVector. Returns the value of the target variable, or -1 if returnTarget is false. When classifying production data that has no target value, call this method with returnTarget = false.

Parameters:
line - The raw data.
featureVector - Where to fill in the features. Should be zeroed before calling processLine.
returnTarget - Whether to process and return the target value. If false, -1 is returned.
Returns:
The value of the target variable, or -1 if returnTarget is false.

getTargetString

public String getTargetString(CharSequence line)
Extract the raw target string from a line read from a CSV file.

Parameters:
line - the line of content read from CSV file
Returns:
the raw target value in the corresponding column of CSV line
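
Conceptually, this extraction amounts to reading one field of the line at a known column position. The following sketch shows that idea only; it assumes unquoted, comma-separated fields, and RawColumnSketch and rawValue are hypothetical names rather than Mahout API.

```java
// Hypothetical helper that reads one raw field from a CSV line.
final class RawColumnSketch {
    static String rawValue(CharSequence line, int column) {
        // Assumption: no quoted fields; the limit of -1 keeps trailing empty fields.
        String[] fields = line.toString().split(",", -1);
        if (column < 0 || column >= fields.length) {
            throw new IllegalArgumentException("no column " + column + " in line: " + line);
        }
        return fields[column];
    }

    public static void main(String[] args) {
        String line = "17,red,3.5,spam";
        System.out.println("id:     " + rawValue(line, 0));
        System.out.println("target: " + rawValue(line, 3));
    }
}
```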

getTargetLabel

public String getTargetLabel(int code)
Returns the raw target label that corresponds to a code.

Parameters:
code - the integer code encoded during training process
Returns:
the raw target label

getIdString

public String getIdString(CharSequence line)
Extracts the id column value from the CSV record.

Parameters:
line - the line of content read from CSV file
Returns:
the id value of the CSV record

getPredictors

public Iterable<String> getPredictors()
Returns a list of the names of the predictor variables.

Specified by:
getPredictors in interface RecordFactory
Returns:
A list of variable names.

getTraceDictionary

public Map<String,Set<Integer>> getTraceDictionary()
Specified by:
getTraceDictionary in interface RecordFactory

includeBiasTerm

public CsvRecordFactory includeBiasTerm(boolean useBias)
Specified by:
includeBiasTerm in interface RecordFactory

getTargetCategories

public List<String> getTargetCategories()
Specified by:
getTargetCategories in interface RecordFactory

getIdName

public String getIdName()

setIdName

public void setIdName(String idName)


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.