Association Rules Mining Using Python Generators: The Apriori Algorithm
The Apriori Algorithm is commonly used to identify frequent item sets in datasets, particularly in the context of market basket analysis. It applies a “bottom-up” approach, starting with individual items that meet a minimum occurrence threshold. Once it identifies frequent individual items, it combines them to form larger item sets and verifies if these sets also satisfy the threshold. The algorithm stops when no additional items can be combined to create new sets that meet the threshold.
Let’s break down the Apriori algorithm with an example, assuming a minimum occurrence threshold of 3:
Example
Orders:
- {apple, egg, milk}
- {carrot, milk}
- {apple, egg, carrot}
- {apple, egg}
- {apple, carrot}
Iteration 1: Count Single Items
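The first step is to count the occurrence of each item:
- apple: 4
- egg: 3
- carrot: 3
- milk: 2
Since {milk} appears in only 2 orders, it does not meet the threshold of 3 and is eliminated. {apple}, {egg}, and {carrot} remain.
Iteration 2: Create Item Sets of Size 2
Now, we build pairs from the remaining items and count how many orders contain each pair:
- {apple, egg}: 3
- {apple, carrot}: 2
- {egg, carrot}: 1
At this point, only {apple, egg} remains as a frequent item set. The algorithm stops because a single frequent pair cannot be combined into any set of three items.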
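The two iterations can be reproduced with a short script. This is a minimal sketch that assumes "meeting the threshold" means appearing in at least 3 orders. The names `orders`, `MIN_COUNT`, `frequent_items`, and `frequent_pairs` are illustrative and do not come from any library.

```python
from collections import Counter
from itertools import combinations

# The five example orders from the text.
orders = [
    {"apple", "egg", "milk"},
    {"carrot", "milk"},
    {"apple", "egg", "carrot"},
    {"apple", "egg"},
    {"apple", "carrot"},
]
MIN_COUNT = 3  # assumption: an item set is frequent if it is in at least 3 orders

# Iteration 1: count single items and keep the ones that reach the threshold.
item_counts = Counter(item for order in orders for item in order)
frequent_items = {item for item, n in item_counts.items() if n >= MIN_COUNT}

# Iteration 2: count only pairs made of frequent items.
# Sorting gives each pair one canonical (alphabetical) tuple form.
pair_counts = Counter(
    pair
    for order in orders
    for pair in combinations(sorted(order & frequent_items), 2)
)
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= MIN_COUNT}

print(frequent_pairs)  # {('apple', 'egg'): 3}
```

Under that assumption, the only pair that reaches the threshold is (apple, egg), with a count of 3.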
If we had more orders and items, we could continue iterating, building sets with more than two items. However, for this example, identifying pairs suffices.
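To go beyond pairs, the same count-and-filter step can be repeated level by level. The following is one possible sketch, not a tuned implementation: `apriori` and `count_candidates` are hypothetical helper names, the orders are assumed to fit in memory, and the threshold is read as "at least `min_count` orders". The function is written as a generator, so frequent sets are yielded as each level completes.

```python
from itertools import combinations


def count_candidates(orders, candidates):
    """Count how many orders contain each candidate item set."""
    counts = dict.fromkeys(candidates, 0)
    for order in orders:
        for candidate in candidates:
            if candidate <= order:  # subset test
                counts[candidate] += 1
    return counts


def apriori(orders, min_count):
    """Yield (itemset, count) for every frequent item set, smallest sets first."""
    orders = [frozenset(order) for order in orders]
    candidates = {frozenset([item]) for order in orders for item in order}
    size = 1
    while candidates:
        counts = count_candidates(orders, candidates)
        frequent = {s for s, n in counts.items() if n >= min_count}
        for itemset in frequent:
            yield itemset, counts[itemset]
        size += 1
        # Join step: unions of two frequent sets that contain exactly `size` items.
        joined = {a | b for a in frequent for b in frequent if len(a | b) == size}
        # Prune step: keep a candidate only if all of its smaller subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(sub) in frequent for sub in combinations(c, size - 1))
        }


orders = [
    {"apple", "egg", "milk"},
    {"carrot", "milk"},
    {"apple", "egg", "carrot"},
    {"apple", "egg"},
    {"apple", "carrot"},
]
for itemset, count in apriori(orders, min_count=3):
    print(sorted(itemset), count)
```

The prune step relies on the fact that a set can only be frequent if every one of its subsets is frequent, which is what lets the search discard most larger candidates without counting them.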
Association Rules Mining
Once the Apriori algorithm has identified frequent item sets, we can start mining association rules. Since we are focusing on item sets of size 2, the association rules will take the form {A} -> {B}, where the purchase of item A implies a likelihood of item B being bought as well. This technique is commonly used in recommender systems to suggest products based on customers' previous purchases.
Key Metrics for Evaluating Association Rules
Three key metrics for evaluating association rules are support, confidence, and lift. Let’s define each using our example.
1. Support:
- This metric indicates the percentage of total orders that contain a particular item set.
- In our example, the set {apple, egg} appears in 3 out of 5 orders, giving a support of 3/5 = 60%.
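2. Confidence:
- This metric measures how often item B is purchased given that item A was purchased, calculated as:
confidence({A} -> {B}) = support({A, B}) / support({A})
For our example:
- Confidence of {apple} -> {egg}: (3/5) / (4/5) = 75%
- Confidence of {egg} -> {apple}: (3/5) / (3/5) = 100%
Here, 100% of orders containing egg also contain apple, while only 75% of orders containing apple also contain egg. Confidence can help determine the strength of an association, but it does not account for how popular item B is on its own.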
3. Lift:
Lift measures the strength of a relationship between two items by comparing the observed co-occurrence with the expected co-occurrence if they were independent. It is directionless (i.e., lift from A to B equals lift from B to A) and is calculated as follows:
lift({A}, {B}) = support({A, B}) / (support({A}) * support({B}))
For {apple, egg}, we calculate lift as:
lift({apple}, {egg}) = (3/5) / ((4/5) * (3/5)) = 1.25
The interpretation of lift values:
- Lift = 1: No relationship (items co-occur as expected by chance).
- Lift > 1: Positive relationship (items co-occur more often than by chance).
- Lift < 1: Negative relationship (items co-occur less often than by chance).
In our example, a lift of 1.25 suggests that apple and egg appear together 1.25 times as often as they would if they occurred independently, indicating a positive association.
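The three metrics can be checked against the example orders with a few lines of Python. This is a sketch under stated assumptions: `support`, `confidence`, and `lift` are illustrative function names, not a library API, and `Fraction` is used only so that the ratios stay exact instead of picking up floating-point rounding.

```python
from fractions import Fraction

orders = [
    {"apple", "egg", "milk"},
    {"carrot", "milk"},
    {"apple", "egg", "carrot"},
    {"apple", "egg"},
    {"apple", "carrot"},
]


def support(orders, itemset):
    """Fraction of orders that contain every item in `itemset`."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= order for order in orders), len(orders))


def confidence(orders, antecedent, consequent):
    """support(A and B together) / support(A) for the rule A -> B."""
    both = set(antecedent) | set(consequent)
    return support(orders, both) / support(orders, antecedent)


def lift(orders, a, b):
    """Observed co-occurrence divided by the co-occurrence expected under independence."""
    both = set(a) | set(b)
    return support(orders, both) / (support(orders, a) * support(orders, b))


print(float(support(orders, {"apple", "egg"})))       # 0.6
print(float(confidence(orders, {"apple"}, {"egg"})))  # 0.75
print(float(confidence(orders, {"egg"}, {"apple"})))  # 1.0
print(float(lift(orders, {"apple"}, {"egg"})))        # 1.25
```

Swapping the two arguments changes confidence (0.75 versus 1.0) but leaves lift at 1.25, which matches the point that lift is directionless.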
In summary, Apriori and association rule mining help uncover valuable insights from transaction data. These metrics — support, confidence, and lift — offer meaningful ways to evaluate item relationships and can be used to drive recommendations and understand purchasing patterns.