Hi Hooman,

It sounds like you are working on the "frequently bought together" type of problem. Here are some thoughts:

 - A GA sounds like a slow way to do what is basically correlation analysis and clustering. If you want to do it to learn about GAs, fine, but you might want to use those other methods to check your results.

 - I'm guessing the 4200*120 matrix is a really sparse matrix (mostly zeros). What were you planning for a genotype? A vector of 4200 1s and 0s? I don't think that is going to help, because then you have a population of transactions, not a population of itemsets.

 - You might want to change your representation and input data to reflect what you are really interested in: itemsets. Break down the input into pairs of items (sku1, sku2) that appear in the same transaction. This is also a much more compact representation of the sparse matrix. Perhaps a genotype could be a variable-length list of SKUs, which is what I think you mean by itemset. Since you would then have a population of itemsets, you are closer to your goal. Of course, there will be several popular itemsets, such as (beer, diapers) or (paper, glue, scissors), so you will need a niching algorithm or something like that.
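
To make the pair representation concrete, here is a rough sketch in plain Python. It is not ECJ code, just a preprocessing idea. The function names, the item names, and the tiny 3x4 matrix are all made up for illustration, and using support (the fraction of transactions containing an itemset) as a fitness measure is my assumption about what you would want to score:

```python
from collections import Counter
from itertools import combinations


def transactions_from_matrix(rows, item_names):
    """Turn each 0/1 row into the set of items present (sparse form)."""
    return [
        {name for name, flag in zip(item_names, row) if flag}
        for row in rows
    ]


def pair_counts(transactions):
    """Count how often each unordered pair of items shares a transaction."""
    counts = Counter()
    for items in transactions:
        # Sorting gives every pair one canonical (a, b) ordering.
        counts.update(combinations(sorted(items), 2))
    return counts


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    wanted = set(itemset)
    hits = sum(1 for items in transactions if wanted <= items)
    return hits / len(transactions)


# Hypothetical toy data standing in for the real 120 x 4200 matrix.
item_names = ["beer", "diapers", "paper", "glue"]
rows = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]

transactions = transactions_from_matrix(rows, item_names)
counts = pair_counts(transactions)
```

The pair counts alone already give you a baseline to check GA results against, and something like support() could serve as the fitness of a variable-length genotype.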

Cheers,
David vun Kannon

> Date: Sat, 29 Jun 2013 07:13:47 -0400
> From: [log in to unmask]
> Subject: using GA on ECJ to find frequent itemsets
> To: [log in to unmask]
>
> Hi,
>
> I am going to use the ECJ framework to implement a genetic algorithm and extract
> frequent itemsets out of over 4000 items and 120 transactions.
>
> The data is as presented in the matrix below:
>
> Items Item1 Item2 Item3 ... Item4200
>
> Trans1 1 0 1 1
> Trans2 1 0 0 1
> .
> .
> .
> Trans120 0 1 1 0
>
>
> There is a two-dimensional CSV file which is going to be loaded into the ECJ framework.
> The issue is that "ECJ doesn’t have direct support for loading arrays[1]"
>
> Any idea or clue would be appreciated.
>
> [1] http://cs.gmu.edu/~eclab/projects/ecj/docs/manual/manual.pdf, page 19