Package org.apache.mahout.fpm.pfpgrowth

MapReduce (parallel) implementation of the FP-Growth algorithm for frequent itemset mining


Class Summary
AggregatorMapper - Outputs the pattern for each item in the pattern, so that the reducer can group them and select the top K frequent patterns.
AggregatorReducer - Groups all frequent patterns containing an item and outputs the top K patterns containing that particular item.
CountDescendingPairComparator<A extends Comparable<? super A>,B extends Comparable<? super B>> - Defines an ordering on Pairs whose second element is a count.
FPGrowthDriver
MultiTransactionTreeIterator - Iterates over multiple transaction trees to produce a single iterator of transactions.
ParallelCountingMapper - Maps all items in a particular transaction, in the same way as the Hadoop WordCount example.
ParallelCountingReducer - Sums up the item counts and outputs each item with its count; can also be used as a local Combiner.
ParallelFPGrowthCombiner - Takes each group of dependent transactions and compacts it into a TransactionTree structure.
ParallelFPGrowthMapper - Maps each transaction to all unique item groups in the transaction.
ParallelFPGrowthReducer - Takes each group of transactions, runs vanilla FPGrowth on it, and outputs the top K frequent patterns for each group.
PFPGrowth - Parallel FP-Growth driver class.
TransactionTree - A compact representation of transactions, modeled along the lines of an FP-Tree. This saves space and speeds up the Map/Reduce phases of the PFPGrowth algorithm by reducing the amount of data passed from the mapper to the reducer, where FPGrowth mining is done.
 

Package org.apache.mahout.fpm.pfpgrowth Description

MapReduce (parallel) implementation of the FP-Growth algorithm for frequent itemset mining

We have a Top K Parallel FPGrowth implementation. Given a huge transaction list, we find all unique features (field values) and eliminate those features whose frequency in the whole dataset is less than minSupport. Using the N features that remain, we find the top K closed patterns for each of them, generating N*K patterns. The FPGrowth algorithm is a generic implementation, so any object type can be used to denote a feature; the current implementation requires you to use a String as the object type. You may implement a version for any other object type by creating Iterators, Convertors and a TopKPatternWritable for that particular object. For more information, please refer to the package org.apache.mahout.fpm.pfpgrowth.convertors.string.

// Mine the top K frequent patterns for each feature from a file of string
// transactions; input, encoding, pattern, minSupport, maxHeapSize and writer
// are supplied by the caller.
FPGrowth<String> fp = new FPGrowth<String>();
Set<String> features = new HashSet<String>();
fp.generateTopKStringFrequentPatterns(
    new StringRecordIterator(
        new FileLineIterable(new File(input), encoding, false), pattern),
    fp.generateFList(
        new StringRecordIterator(
            new FileLineIterable(new File(input), encoding, false), pattern),
        minSupport),
    minSupport,
    maxHeapSize,
    features,
    new StringOutputConvertor(
        new SequenceFileOutputCollector<Text,TopKStringPatterns>(writer)));
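
The collector in the example writes Text/TopKStringPatterns pairs to a Hadoop SequenceFile through the supplied writer. The following is a minimal sketch of reading the mined patterns back with Hadoop's SequenceFile.Reader; the output path used here is a hypothetical example, and the exact file name depends on how the writer was created.

// Sketch: iterate over the mined (feature, top K patterns) pairs.
// "patterns/part-r-00000" is a hypothetical path, not produced by the code above.
Configuration conf = new Configuration();
Path path = new Path("patterns/part-r-00000");
SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
Text feature = new Text();
TopKStringPatterns patterns = new TopKStringPatterns();
while (reader.next(feature, patterns)) {
  // Each value holds the top K (pattern, support) pairs for this feature.
  System.out.println(feature + "\t" + patterns);
}
reader.close();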

The command-line launcher for string transaction data, org.apache.mahout.fpm.pfpgrowth.FPGrowthJob, has other features, including specifying the regex pattern used for splitting a line of transaction text into its constituent features.
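
As a rough sketch, the launcher can also be invoked programmatically through Hadoop's ToolRunner. The option names below (-i, -o, -s, -k, -regex, -method) follow common Mahout command-line conventions but are assumptions here, as is the use of the FPGrowthDriver class from the summary above; check the launcher's help output for the exact flags supported by your version.

// Sketch only: flag names and the driver class are assumptions, not verified API.
String[] args = {
    "-i", "transactions.dat",  // one transaction per line
    "-o", "patterns",          // output directory
    "-s", "2",                 // minSupport
    "-k", "50",                // top K patterns to mine per feature
    "-regex", "[ ]",           // pattern for splitting a transaction line into features
    "-method", "mapreduce"     // or "sequential" for the in-memory FPGrowth
};
ToolRunner.run(new Configuration(), new FPGrowthDriver(), args);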

The numGroups parameter in FPGrowthJob specifies the number of groups into which the transactions are decomposed. The numTreeCacheEntries parameter specifies the number of generated conditional FP-Trees to keep in memory so that they need not be regenerated. Increasing this number increases memory consumption but may improve speed up to a certain point; this depends entirely on the dataset in question. A value of 5-10 is recommended when mining up to the top 100 patterns for each feature.
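
To illustrate what the decomposition into numGroups groups means (this is a plain-Java sketch of the idea, not the actual Mahout mapper code), assume each transaction has already been sorted by descending global item frequency. For every group that occurs in the transaction, the mapper emits the longest prefix ending at the right-most item of that group, so the reducer for that group receives everything it needs to mine patterns for its own items. The modulo group assignment below is an illustrative assumption.

// Illustrative sketch: shard one frequency-sorted transaction into
// group-dependent prefixes, one per group occurring in the transaction.
int numGroups = 3;                      // corresponds to the numGroups parameter
int[] transaction = {0, 2, 3, 5, 7};    // item ranks, most frequent first
Map<Integer, int[]> shards = new HashMap<Integer, int[]>();
for (int j = transaction.length - 1; j >= 0; j--) {
  int gid = transaction[j] % numGroups; // group assignment (assumption: modulo)
  if (!shards.containsKey(gid)) {
    // longest prefix ending at the right-most item belonging to this group
    shards.put(gid, Arrays.copyOfRange(transaction, 0, j + 1));
  }
}
for (Map.Entry<Integer, int[]> e : shards.entrySet()) {
  // each (group, prefix) pair would be sent to the reducer for that group
  System.out.println("group " + e.getKey() + " -> " + Arrays.toString(e.getValue()));
}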



Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.