An appropriate method for this type of problem (and one that I’ve used myself with considerable success) is k-modes clustering. This is a variant of the k-means clustering algorithm designed for categorical data, which makes it especially useful for binary response data such as yours. I’ve used it on data with as many as 500 variables and hundreds of thousands of observations; it handles large datasets well. The implementation I used is in R, the statistics environment. (R is free, but I would suggest finding someone who knows R to help you if you are unfamiliar with it yourself.)
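To make the idea concrete, here is a minimal sketch of k-modes in plain Python. It is not the R implementation mentioned above; the function names (`hamming`, `column_modes`, `k_modes`) and the naive "first k distinct rows" initialisation are assumptions chosen for illustration. Rows are compared by counting mismatched positions, and each cluster centre is the per-column mode of its members.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions where two equal-length rows differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most common value in each column (ties go to the value seen first)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(rows, k, max_iter=20):
    """Cluster categorical rows; returns (labels, modes)."""
    # Assumed initialisation: the first k distinct rows become the starting modes.
    modes = []
    for row in rows:
        if tuple(row) not in modes:
            modes.append(tuple(row))
        if len(modes) == k:
            break
    if len(modes) < k:
        raise ValueError("fewer than k distinct rows")

    labels = None
    for _ in range(max_iter):
        # Assignment step: each row joins the nearest mode by Hamming distance.
        new_labels = [
            min(range(k), key=lambda j: hamming(row, modes[j])) for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each mode becomes the column-wise mode of its members.
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:
                modes[j] = column_modes(members)
    return labels, modes


if __name__ == "__main__":
    toy = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 1, 0), (0, 0, 0, 1)]
    print(k_modes(toy, k=2))
```

With real data a random or density-based initialisation and several restarts would be more sensible than taking the first k rows.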
I am concerned that, with only 120 observations, you may have problems with any method. This is a very small number for this type of unsupervised learning. I wish you luck and please let me know how you do.
It sounds like you are working on the "frequently bought together" type of problem. Here are some thoughts:
- GA sounds like a slow way to do what is basically correlation analysis and clustering. If you want to do it to learn GA, that's fine, but you might want to use these other methods to check your results.
- I'm guessing the 4200*120 matrix is a really sparse matrix (mostly zeros). What were you planning for a genotype? A vector of 4200 1s and 0s? I don't think that is going to help, because then you have a population of transactions, not a population of itemsets.
- You might want to change your representation and input data to reflect what you really are interested in: itemsets. Break down the input into pairs of items (sku1, sku2) that appear in the same transaction. This is also a much more compact representation of the sparse matrix. Perhaps a genotype could be a variable-length list of SKUs, which is what I think you mean by itemset. Since you now have a population of itemsets, you are closer to your goal. Of course, there will be several popular itemsets (beer, diapers), (paper, glue, scissors)... so you will need a niching algorithm or something like that.
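As a rough illustration of the pair representation, here is a small Python sketch (not ECJ code). The helper names `transactions_from_matrix`, `pair_counts` and `support` are made up for this example, and the toy item names are placeholders. It turns the 0/1 matrix into sets of items, counts how often each pair shares a transaction, and computes the support of an arbitrary itemset, which is one plausible fitness measure for a variable-length genotype.

```python
from collections import Counter
from itertools import combinations


def transactions_from_matrix(matrix, item_names):
    """Turn each 0/1 row into the set of item names it contains."""
    return [
        {name for name, flag in zip(item_names, row) if flag} for row in matrix
    ]


def pair_counts(transactions):
    """Count how many transactions contain each unordered pair of items."""
    counts = Counter()
    for items in transactions:
        # Sorting gives every pair one canonical (a, b) ordering.
        counts.update(combinations(sorted(items), 2))
    return counts


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    wanted = set(itemset)
    return sum(wanted <= t for t in transactions) / len(transactions)


if __name__ == "__main__":
    names = ["beer", "diapers", "glue", "paper"]  # placeholder items
    matrix = [
        [1, 1, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 1],
        [1, 0, 0, 0],
    ]
    tx = transactions_from_matrix(matrix, names)
    print(pair_counts(tx).most_common(3))
    print(support(["beer", "diapers"], tx))
```

The pair counts double as a cheap sanity check on whatever itemsets the GA eventually reports.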
David vun Kannon
> Date: Sat, 29 Jun 2013 07:13:47 -0400
> Subject: using GA on ECJ to find frequent itemsets
> I am gonna use the ECJ framework to implement a genetic algorithm and extract
> frequent itemsets out of over 4000 items and 120 transactions.
> The data is as presented in the matrix below:
> Items     Item1  Item2  Item3  ...  Item4200
> Trans1    1      0      1      ...  1
> Trans2    1      0      0      ...  1
> ...
> Trans120  0      1      1      ...  0
> There is a two-dimensional CSV file which is going to be loaded into the ECJ framework.
> The issue is that "ECJ doesn’t have direct support for loading arrays".
> Any idea or clue would be appreciated.
>  http://cs.gmu.edu/~eclab/projects/ecj/docs/manual/manual.pdf, page 19