itemset-mining
Probabilistic Itemset Mining
IIM: Interesting Itemset Miner ![Build Status](https://travis-ci.org/mast-group/itemset-mining.svg?branch=master)
IIM is a novel algorithm that mines the itemsets that are most interesting under a probabilistic model of transactions. Our model is able to efficiently infer interesting itemsets directly from the transaction database.
This is an implementation of the itemset miner from our paper:
A Bayesian Network Model for Interesting Itemsets
J. Fowkes and C. Sutton. PKDD 2016.
Installation
Installing in Eclipse
Simply import the project into Eclipse as a Maven project using the File -> Import... menu option (note that this requires m2eclipse).
It's also possible to export a runnable jar from Eclipse using the File -> Export... menu option.
Compiling a Runnable Jar
To compile a standalone runnable jar, simply run
mvn package
in the top-level directory (note that this requires Maven). This will create the standalone runnable jar itemset-mining-1.0.jar in the itemset-mining/target subdirectory. The main class is itemsetmining.main.ItemsetMining (see below).
Running IIM
IIM uses a Bayesian Network Model to determine which itemsets are the most interesting in a given dataset.
Mining Interesting Itemsets
Main class itemsetmining.main.ItemsetMining mines itemsets from a specified transaction database file. It has the following command line options:
- -f database file to mine (in FIMI format)
- -i max. no. iterations
- -s max. no. structure steps
- -r max. runtime (min)
- -l log level (INFO/FINE/FINER/FINEST)
- -v print to console instead of log file
See the individual file javadocs in itemsetmining.main.ItemsetMining for information on the Java interface. In Eclipse you can set command line arguments for the IIM interface using the Run Configurations... menu option.
Example Usage
The following is a complete example using the command line interface on a runnable jar. We can mine the provided example dataset example.dat as follows:
$ java -cp itemset-mining/target/itemset-mining-1.0.jar itemsetmining.main.ItemsetMining \
    -i 100 \
    -f example.dat \
    -v
which will output to the console. Omitting the -v flag will redirect output to a log-file in /tmp/.
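To start the same run from Java code rather than from a shell, one option is to assemble the documented command as an argument list and hand it to `ProcessBuilder`. The sketch below uses only the flags listed above; the class name `IimCommand` and its `build` method are hypothetical and not part of IIM.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of IIM): builds the command line shown above. */
final class IimCommand {

    /** Returns the argument list for one IIM run using the documented -i, -f and -v options. */
    static List<String> build(String jarPath, String databaseFile, int maxIterations, boolean verbose) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(jarPath);
        cmd.add("itemsetmining.main.ItemsetMining");
        cmd.add("-i");
        cmd.add(String.valueOf(maxIterations));
        cmd.add("-f");
        cmd.add(databaseFile);
        if (verbose) {
            cmd.add("-v"); // print to console instead of a log file
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("itemset-mining/target/itemset-mining-1.0.jar", "example.dat", 100, true);
        System.out.println(String.join(" ", cmd));
        // To actually launch the miner, one could run:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Passing the arguments as a list avoids shell quoting issues when the database path contains spaces.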
Input/Output Formats
Input Format
IIM takes as input a transaction database file in FIMI format. The FIMI format is very simple: each line of the input file represents a transaction and each transaction is a space-separated list of items, represented by positive integers. The FIMI format requires the transaction items to be listed in increasing order and does not allow duplicate items (however, IIM is not sensitive to item order and ignores duplicate items). For example, a few lines (transactions) from example.dat are:
6 10 22 31 32 41 52
2 12 14 26 50
3 18 25 31 34 38 63
17 28 30 37
16 19 45 46 49 51 52 54 56 65
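As a rough illustration of the format, the following Java sketch reads one FIMI line into a sorted, duplicate-free set of items. The class and method names (`FimiLine`, `parse`) are hypothetical and not part of IIM; the sketch only shows what a well-formed transaction looks like once normalised.

```java
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper (not part of IIM): parses one FIMI transaction line. */
final class FimiLine {

    /** Returns the items on the line in increasing order, with duplicates dropped. */
    static SortedSet<Integer> parse(String line) {
        SortedSet<Integer> items = new TreeSet<>();
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank line: no items
            }
            int item = Integer.parseInt(token);
            if (item <= 0) {
                throw new IllegalArgumentException("Items must be positive integers: " + item);
            }
            items.add(item);
        }
        return items;
    }

    public static void main(String[] args) {
        // A transaction taken from the example lines above.
        System.out.println(parse("17 28 30 37")); // prints [17, 28, 30, 37]
    }
}
```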
Note that any other item formats (e.g. words for text corpora) need to be manually mapped to (and from) positive integers by means of a dictionary.
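One way to build such a dictionary is sketched below in Java. The `ItemDictionary` class is a hypothetical helper, not part of IIM: it assigns each distinct word the next unused positive integer, encodes a transaction as a FIMI line, and keeps the reverse mapping so that mined itemsets can be translated back.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.stream.Collectors;

/** Hypothetical helper (not part of IIM): maps arbitrary items to positive integers and back. */
final class ItemDictionary {
    private final Map<String, Integer> toId = new HashMap<>();
    private final List<String> toItem = new ArrayList<>(); // index i holds the item with ID i + 1

    /** Returns the ID for an item, assigning the next unused positive integer on first sight. */
    int idOf(String item) {
        Integer id = toId.get(item);
        if (id == null) {
            toItem.add(item);
            id = toItem.size();
            toId.put(item, id);
        }
        return id;
    }

    /** Returns the original item for an ID previously assigned by idOf. */
    String itemOf(int id) {
        return toItem.get(id - 1);
    }

    /** Encodes one transaction as a FIMI line: increasing IDs, no duplicates, space-separated. */
    String encode(List<String> transaction) {
        TreeSet<Integer> ids = new TreeSet<>();
        for (String item : transaction) {
            ids.add(idOf(item));
        }
        return ids.stream().map(String::valueOf).collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        ItemDictionary dict = new ItemDictionary();
        System.out.println(dict.encode(Arrays.asList("milk", "bread", "milk"))); // prints 1 2
        System.out.println(dict.encode(Arrays.asList("eggs", "bread")));         // prints 2 3
        System.out.println(dict.itemOf(3));                                      // prints eggs
    }
}
```

The same dictionary instance has to be kept (or serialised) between encoding the input and decoding the output, since the IDs depend on the order in which items were first seen.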
Output Format
IIM outputs a list of interesting itemsets, one itemset per line, ordered first by their interestingness (given in the 'int' column) followed by their probability (given in the 'prob' column). For example, the first few lines of output for the usage example above are:
============= INTERESTING ITEMSETS =============
{18} prob: 0.34830 int: 1.00000
{14} prob: 0.13740 int: 1.00000
{5} prob: 0.11740 int: 1.00000
{16} prob: 0.09110 int: 1.00000
{6, 7, 22, 36, 65, 67} prob: 0.08440 int: 1.00000
{17, 28, 30, 37} prob: 0.07830 int: 1.00000
{1, 2, 8, 11, 12, 13, 20, 63, 64} prob: 0.07670 int: 1.00000
{59, 60, 62} prob: 0.06980 int: 1.00000
{43, 46, 55} prob: 0.06890 int: 1.00000
{53} prob: 0.06870 int: 1.00000
See the accompanying paper for details of how to interpret 'interestingness' and 'probability' under IIM's probabilistic model.
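For post-processing, the listing can be read back with a small parser. The Java sketch below assumes each itemset line has the layout shown above (`{items} prob: P int: I`, with arbitrary whitespace between fields and plain decimal numbers); the `MinedItemset` class is hypothetical and not part of IIM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of IIM): parses one line of the itemset listing shown above. */
final class MinedItemset {
    // Assumed layout: "{a, b, c} prob: P int: I" with plain decimal values for P and I.
    private static final Pattern LINE =
            Pattern.compile("^\\{([^}]*)\\}\\s+prob:\\s*([0-9.]+)\\s+int:\\s*([0-9.]+)\\s*$");

    final List<Integer> items;
    final double probability;
    final double interestingness;

    private MinedItemset(List<Integer> items, double probability, double interestingness) {
        this.items = items;
        this.probability = probability;
        this.interestingness = interestingness;
    }

    /** Returns the parsed itemset, or null if the line is not an itemset line (e.g. the header). */
    static MinedItemset parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        List<Integer> items = new ArrayList<>();
        for (String token : m.group(1).split(",")) {
            String t = token.trim();
            if (!t.isEmpty()) {
                items.add(Integer.parseInt(t));
            }
        }
        return new MinedItemset(items, Double.parseDouble(m.group(2)), Double.parseDouble(m.group(3)));
    }

    public static void main(String[] args) {
        MinedItemset s = parse("{17, 28, 30, 37} prob: 0.07830 int: 1.00000");
        System.out.println(s.items + " " + s.probability + " " + s.interestingness);
    }
}
```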
Spark Implementation
IIM also has a (beta) parallel implementation using Spark in Standalone Mode with an HDFS filesystem (see e.g. relevant parts of this tutorial).
Configuring Spark Options
Basic IIM configuration for Spark and HDFS must be set in itemset-miner/src/main/resources/spark.properties (see the example config provided):
- SparkHome Spark home directory
- SparkMaster URL of spark master server
- MachinesInCluster No. machines in the cluster
- HDFSMaster URL of HDFS master server
- HDFSConfFile Location of Hadoop core-site.xml
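Before launching a Spark run it can be useful to check that all of the keys listed above are present. The Java sketch below assumes, based only on the file extension, that spark.properties uses the standard `java.util.Properties` syntax; the `SparkConfigCheck` class is hypothetical, and the values in `main` are placeholders rather than settings taken from the project.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical check (not part of IIM): reports which documented keys are missing from a config. */
final class SparkConfigCheck {
    // Key names as listed above.
    static final String[] REQUIRED_KEYS = {
        "SparkHome", "SparkMaster", "MachinesInCluster", "HDFSMaster", "HDFSConfFile"
    };

    /** Parses config text in java.util.Properties syntax (an assumption about the file format). */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the required keys that are absent or have an empty value. */
    static List<String> missingKeys(Properties props) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only; substitute those of your own cluster.
        String example = "SparkHome=/path/to/spark\n"
                + "SparkMaster=spark://master-host:7077\n"
                + "MachinesInCluster=4\n";
        System.out.println("Missing keys: " + missingKeys(parse(example)));
    }
}
```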
Mining Itemsets using Spark
Main class itemsetmining.main.SparkItemsetMining mines itemsets using a Standalone Spark Server. It has the following additional command line options:
- -c no. Spark cores to use
- -j location of IIM standalone jar (default is itemset-mining/target/itemset-mining-1.0.jar)
See the individual file javadocs in itemsetmining.main.SparkItemsetMining for information on the Java interface.
Example Usage
A complete Spark example using the command line interface is as follows:
$ java -cp itemset-mining/target/itemset-mining-1.0.jar itemsetmining.main.SparkItemsetMining \
    -c 16 \
    -i 100 \
    -f example.dat \
    -v
which will output to the console. Omitting the -v flag will redirect output to a log-file in /tmp/.
Bugs
Please report any bugs using GitHub's issue tracker.
License
This algorithm is released under the GNU GPLv3 license.