answered question

answers (1)

rakeem
0
Votes
BEST ANSWER  decided by votes   |  rakeem  |  November 05, 2009 10:50 AM
(a) Take a sample that fits in main memory. Run the simple approach on this data, but with a threshold lowered so that we are unlikely to miss any truly frequent itemsets (e.g., if sample is 1% of the data, use s=125 as the support threshold).
(b) Add to the candidates of the sample the negative border: those sets of items S such that S is not identified as frequent in the sample, but every immediate subset of S is. For example, if ABCD is not frequent in the sample, but all of ABC;ABD;ACD, and BCD are frequent in the sample, then ABCD is in the negative border.
(c) Make a pass over the data, counting all the candidate itemsets and the negative border. If no member of the negative border is frequent in the full data, then the frequent itemsets are exactly those candidates that are above threshold.
(d) Unfortunately, if there is a member of the negative border that turns out to be requent, then we don't know whether some of its supersets are also frequent, so the whole process needs to be repeated (or we accept what we have and don't worry about a few false negatives).

Voted as best: kareul, lidyax
Comment
140

ask any question

Top of Page
Buy Mahalo Dollars
WITH CREDIT CARD OR PAYPAL

Please log in to use this function.