Association Rules Mining Using Python Generators: The Apriori Algorithm


The Apriori Algorithm is commonly used to identify frequent item sets in datasets, particularly in the context of market basket analysis. It applies a “bottom-up” approach, starting with individual items that meet a minimum occurrence threshold. Once it identifies frequent individual items, it combines them to form larger item sets and verifies if these sets also satisfy the threshold. The algorithm stops when no additional items can be combined to create new sets that meet the threshold.

Let’s break down the Apriori algorithm with an example, assuming a minimum occurrence threshold of 3:

Example

Orders:

{apple, egg, milk}
{carrot, milk}
{apple, egg, carrot}
{apple, egg}
{apple, carrot}

Iteration 1: Count Single Items

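The first step is to count how many orders contain each item:

{apple}: 4
{egg}: 3
{carrot}: 3
{milk}: 2

Since {milk} appears in only 2 orders, it does not meet the threshold of 3 and we eliminate it. {apple}, {egg}, and {carrot} are kept.

Iteration 2: Create Item Sets of Size 2

Now, we build pairs from the remaining items and count them:

{apple, egg}: 3
{apple, carrot}: 2
{egg, carrot}: 1

At this point, only {apple, egg} remains as a frequent item set. The algorithm stops because any set of three items would require all of its pairs to be frequent, and only one pair is.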

If we had more orders and items, we could continue iterating, building sets with more than two items. However, for this example, identifying pairs suffices.
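The two iterations above can be sketched with Python generators. This is a minimal illustration rather than a tuned implementation: the names `ORDERS`, `count_candidates`, and `apriori` are chosen here for the example, and the threshold is an absolute order count, as in the walkthrough.

```python
from itertools import combinations

# The five orders from the example.
ORDERS = [
    {"apple", "egg", "milk"},
    {"carrot", "milk"},
    {"apple", "egg", "carrot"},
    {"apple", "egg"},
    {"apple", "carrot"},
]


def count_candidates(orders, candidates):
    """Lazily yield (candidate, count) for each candidate item set."""
    for candidate in candidates:
        yield candidate, sum(1 for order in orders if candidate <= order)


def apriori(orders, min_count):
    """Yield (item_set, count) for every frequent item set, smallest sets first."""
    # Level 1 candidates: every distinct item as a one-element frozenset.
    candidates = {frozenset([item]) for order in orders for item in order}
    size = 1
    while candidates:
        frequent = {
            item_set: count
            for item_set, count in count_candidates(orders, candidates)
            if count >= min_count
        }
        yield from frequent.items()
        # Build the next level only from items that survived this level.
        size += 1
        items = sorted(set().union(*frequent))
        candidates = {
            frozenset(combo)
            for combo in combinations(items, size)
            # Prune: every subset one item smaller must itself be frequent.
            if all(frozenset(sub) in frequent for sub in combinations(combo, size - 1))
        }


for item_set, count in apriori(ORDERS, min_count=3):
    print(sorted(item_set), count)
```

Because `apriori` is a generator, frequent sets are produced level by level, and a caller can stop consuming as soon as it has the set sizes it needs.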

Association Rules Mining

Once the Apriori algorithm has identified frequent item sets, we can start mining association rules. Since we are focusing on item sets of size 2, the association rules will take the form {A} -> {B}, meaning that customers who buy item A are also likely to buy item B. This technique is commonly used in recommender systems to suggest products based on customers' previous purchases.
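As a small sketch of how rules of the form {A} -> {B} can be enumerated, the generator below yields both directions for each frequent pair. The function name `candidate_rules` is hypothetical and used only for this illustration.

```python
from itertools import permutations


def candidate_rules(frequent_pairs):
    """Yield (antecedent, consequent) for both directions of each pair."""
    for pair in frequent_pairs:
        # sorted() only makes the output order deterministic.
        yield from permutations(sorted(pair), 2)


for antecedent, consequent in candidate_rules([{"apple", "egg"}]):
    print(f"{{{antecedent}}} -> {{{consequent}}}")
```

Each frequent pair produces two candidate rules, which is why the direction-sensitive metrics defined next have to be computed separately for each direction.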

Key Metrics for Evaluating Association Rules

Three key metrics for evaluating association rules are support, confidence, and lift. Let’s define each using our example.

1. Support:

This metric indicates the percentage of total orders that contain a particular item set.
In our example, the set {apple, egg} appears in 3 out of 5 orders, giving a support of 3 / 5 = 60%.
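A minimal sketch of the support calculation on the example orders follows. The names `ORDERS` and `support` are chosen for this illustration.

```python
# The five orders from the example.
ORDERS = [
    {"apple", "egg", "milk"},
    {"carrot", "milk"},
    {"apple", "egg", "carrot"},
    {"apple", "egg"},
    {"apple", "carrot"},
]


def support(orders, item_set):
    """Fraction of orders that contain every item in item_set."""
    item_set = set(item_set)
    return sum(1 for order in orders if item_set <= order) / len(orders)


print(support(ORDERS, {"apple", "egg"}))  # 0.6
```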

2. Confidence:

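This metric measures how often item B is purchased given that item A was purchased, calculated as:

confidence(A -> B) = support({A, B}) / support({A})

For our example:

Confidence of apple -> egg: 60% / 80% = 75%
Confidence of egg -> apple: 60% / 60% = 100%

Here, 100% of orders containing egg also contain apple, while only 75% of orders containing apple also contain egg. Confidence can help determine the strength of an association, but it does not account for how popular the consequent item B is on its own.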
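The asymmetry of confidence is easy to check in code. The sketch below is illustrative, with `ORDERS` and `confidence` as names chosen here; it computes the share of orders containing the antecedent that also contain the consequent.

```python
# The five orders from the example.
ORDERS = [
    {"apple", "egg", "milk"},
    {"carrot", "milk"},
    {"apple", "egg", "carrot"},
    {"apple", "egg"},
    {"apple", "carrot"},
]


def confidence(orders, antecedent, consequent):
    """Share of orders containing `antecedent` that also contain `consequent`."""
    with_antecedent = [order for order in orders if antecedent in order]
    if not with_antecedent:
        return 0.0  # Assumption: a rule with no supporting orders gets 0.
    hits = sum(1 for order in with_antecedent if consequent in order)
    return hits / len(with_antecedent)


print(confidence(ORDERS, "apple", "egg"))  # 0.75
print(confidence(ORDERS, "egg", "apple"))  # 1.0
```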

3. Lift:

Lift measures the strength of a relationship between two items by comparing the observed co-occurrence with the expected co-occurrence if they were independent. It is directionless (i.e., lift from A to B equals lift from B to A) and is calculated as follows:

lift(A, B) = support({A, B}) / (support({A}) * support({B}))

For {apple, egg}, we calculate lift as:

lift(apple, egg) = 0.6 / (0.8 * 0.6) = 1.25

The interpretation of lift values:

Lift = 1: No relationship (items co-occur as expected by chance).
Lift > 1: Positive relationship (items co-occur more often than by chance).
Lift < 1: Negative relationship (items co-occur less often than by chance).

In our example, a lift of 1.25 suggests that apple and egg appear together 1.25 times as often as they would if they occurred independently, indicating a positive association.
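The lift value can be reproduced with a short sketch. The names `ORDERS` and `lift` are chosen for this illustration, and the function assumes both items occur in at least one order. It works from raw counts, which is algebraically the same as dividing the joint support by the product of the individual supports.

```python
# The five orders from the example.
ORDERS = [
    {"apple", "egg", "milk"},
    {"carrot", "milk"},
    {"apple", "egg", "carrot"},
    {"apple", "egg"},
    {"apple", "carrot"},
]


def lift(orders, item_a, item_b):
    """Observed co-occurrence divided by the co-occurrence expected under independence."""
    total = len(orders)
    count_a = sum(1 for order in orders if item_a in order)
    count_b = sum(1 for order in orders if item_b in order)
    count_ab = sum(1 for order in orders if item_a in order and item_b in order)
    # Same as support(A, B) / (support(A) * support(B)), written with counts.
    # Assumes count_a and count_b are both non-zero.
    return (count_ab * total) / (count_a * count_b)


print(lift(ORDERS, "apple", "egg"))  # 1.25
```

Swapping the two arguments gives the same result, which is the directionless property described above.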

In summary, Apriori and association rule mining help uncover valuable insights from transaction data. These metrics — support, confidence, and lift — offer meaningful ways to evaluate item relationships and can be used to drive recommendations and understand purchasing patterns.