EVALUATING THE PERFORMANCE OF ASSOCIATION RULES IN APRIORI AND FP-GROWTH ALGORITHMS: MARKET BASKET ANALYSIS TO DISCOVER RULES OF ITEM COMBINATIONS

This study applies data mining techniques, specifically association rules mining with the Apriori and FP-GROWTH algorithms, to market basket analysis at PT. XYZ, a pharmaceutical company in Indonesia. The quantitative methodology uses a dataset of 100,498 transactions derived from 432,356 rows of data covering July to December 2022 in the JABODETABEK area. The results show that FP-GROWTH has the fastest execution time, at 84,655 seconds, while Apriori has the lowest memory usage, at 482.32 MiB with an increment of 0.21 MiB. Both algorithms produce the same number of rules and the same values of support, confidence, lift, Bi-Support, Bi-Confidence, and Bi-Lift. In conclusion, Apriori is recommended for sales datasets when memory usage and ease of implementation are important, whereas FP-GROWTH is the better choice when execution speed on large amounts of data is the priority. Ultimately, the choice of algorithm depends on the specific analysis objectives, itemset size, data scale, and computational capabilities. The results of association rules mining provide evidence of product popularity, purchasing patterns, and opportunities for strategic marketing and inventory management. These findings can help PT. XYZ improve business efficiency, understand customer behavior, and increase profitability.


INTRODUCTION
The need to deeply understand customers to predict their desires has always been a major ambition for companies worldwide, especially those in the health sector (Sumarwan, 2014). This has become increasingly important in recent years due to the Covid-19 pandemic, resulting in increased competition and technological advances, which are now making this ambition more achievable (Diandra & Syahputra, 2021).
Consumer behavior is the activity of consumers in deciding to buy, use, and consume goods and services, including the factors that lead to their decision whether to buy and use a product (Sudirman et al., 2020). Every customer has different needs and tendencies and behaves differently in fulfilling them (Daliyah, 2020). Although customers behave differently in meeting their needs, they still have some things in common, one of which is the aim of maximizing their satisfaction in consuming the required product or service.
In recent years, transaction data has been commonly used as an object of research and analysis (Kurniawan et al., 2018). This research focuses on a pharmaceutical company. The high competition in the pharmaceutical business in Indonesia has led pharmaceutical entrepreneurs to look for the right marketing strategy to increase sales (Sayyid, 2020). One of them, PT. XYZ, is a well-known pharmaceutical company in Indonesia that sells only traditional medicines.
One method for analyzing products based on patterns of consumer spending habits is association rules (Alamsyah et al., 2021). Association rules (AR) mining is the process of finding patterns, correlations, associations, or causal structures that occur frequently in data held in various types of databases, such as relational data, transactional data, and other forms of data storage (Dhanalakshmi & Sankari, 2014). The association rules method originated in marketing and is increasingly used in other fields, such as bioinformatics, nuclear science, pharmacoepidemiology, and geophysics (Alfiqra & Alfizi, 2018). One application of the association rules method is market basket analysis (MBA), which is often used to analyze consumer buying patterns. For this reason, the method is often called association rules-market basket analysis (ARMBA) (Kurniawan et al., 2018). The main objective of market basket analysis is to identify relationships in a set of products, items, or categories (Qoniah & Priandika, 2020). In marketing strategy, its main objective is to increase sales by understanding customer buying patterns (Umar et al., 2022). By analyzing transaction data and finding associations between the products purchased, businesses can identify products that are likely to be sold together and then strategically place those products near one another in a physical store or on a website. In this way, businesses can increase sales by tempting customers to buy more products while shopping. In addition, market basket

Figure 1. CRISP-DM Diagram
CRISP-DM is not the only standard in data mining, but it is currently the most popular (Muhammad, 2019). According to a data science-pm poll conducted in August-September 2020, CRISP-DM is used 2 to 3 times more than the top 4 widely used standards.

Figure 9. Poll Results
Stages of the CRISP-DM method:
1. Business Understanding Phase: The initial stage in the CRISP-DM methodology is centered on comprehending the business goals or research objectives to be achieved. In this phase, we aim to gain a thorough understanding of the business aspects that underlie this project.
2. Data Understanding Phase: This phase is dedicated to gaining a deeper understanding of the data that will be the focus of the research. We endeavor to draw insights and information from the dataset that will be used in this project.
3. Data Preparation Phase: This phase involves a series of meticulous data processing steps carried out before further analysis. We make efforts to ensure that the data is properly prepared for use.
4. Modeling Phase: In this phase, we apply the chosen analytical methods to achieve our research objectives. We employ this approach to uncover relationships and patterns within the data.
5. Evaluation Phase: The primary focus of this phase is to assess the outcomes of the modeling and of the market basket analysis conducted using association rules. We evaluate the extent to which these results align with the predetermined business or research objectives.

In applying the CRISP-DM methodology, the data mining process with the association rules technique is carried out in the free version of Google's cloud environment, Google Colab. Colab allows users to build, run, and share Python code online and provides free access to computing resources such as CPU, GPU, and TPU.
The association rules process is a technique in data mining used to discover relationships or associations between items in a dataset. Three main parameters control the quality of the generated association rules: min support, min confidence, and min lift.
1. Min Support (Support Threshold): Min support is the minimum threshold value for the frequency of occurrence of an association in the dataset. If an association does not meet the specified min support value, it is considered insignificant and is not included in the final results. The min support value is used to eliminate associations that occur infrequently.
2. Min Confidence (Confidence Threshold): Min confidence is the minimum threshold value for the probability that an association truly occurs. The confidence value measures the extent to which we can trust that an association will occur based on the available data. Associations with confidence values below the min confidence threshold are disregarded.
3. Min Lift (Lift Threshold): Min lift is the minimum threshold value used to determine whether an association is a significant relationship or just a coincidence. Lift measures how much the probability of an association occurring differs from the probability of both items occurring independently. Associations with lift values below the min lift threshold are considered insignificant.
By using these three parameters, the association rules process can generate more relevant and meaningful association rules from the dataset, helping data analysts identify important patterns and potentially providing valuable insights for decision-making.
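To make the role of the three thresholds concrete, the following minimal sketch filters a handful of rules against min support, min confidence, and min lift. The rule records, item names, threshold values, and the `filter_rules` helper are all hypothetical and are not taken from the study's implementation.

```python
# Hypothetical rule records; item names and metric values are made up for illustration.
rules = [
    {"antecedent": ("A",), "consequent": ("B",), "support": 0.18, "confidence": 0.64, "lift": 2.1},
    {"antecedent": ("A",), "consequent": ("C",), "support": 0.02, "confidence": 0.70, "lift": 3.0},
    {"antecedent": ("B",), "consequent": ("C",), "support": 0.10, "confidence": 0.30, "lift": 1.4},
    {"antecedent": ("C",), "consequent": ("A",), "support": 0.12, "confidence": 0.55, "lift": 0.9},
]


def filter_rules(rules, min_support, min_confidence, min_lift):
    """Keep only the rules that meet all three thresholds."""
    return [
        r for r in rules
        if r["support"] >= min_support
        and r["confidence"] >= min_confidence
        and r["lift"] >= min_lift
    ]


# Assumed threshold values, chosen only so that each rule fails a different test.
kept = filter_rules(rules, min_support=0.05, min_confidence=0.5, min_lift=1.0)
for r in kept:
    print(r["antecedent"], "->", r["consequent"])
```

In this toy input, the second rule is dropped by min support, the third by min confidence, and the fourth by min lift, so only the first rule survives.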

Evaluation Metrics of Association Rules
In association analysis, there are several important metrics. In the formulas below, P(X) denotes the fraction of transactions that contain X.
1. Support: This measures the extent to which an itemset or association rule appears in the data. The higher the support, the more frequently the itemset or rule occurs. The formula is as follows:
$$\mathrm{Support}(A \rightarrow B) = P(A \cap B)$$
2. Confidence: Confidence measures how reliable or probable an association rule is. It looks at how often item B appears together with item A in transactions. The formula is as follows:
$$\mathrm{Confidence}(A \rightarrow B) = \frac{P(A \cap B)}{P(A)}$$
3. Lift: Lift measures whether an association rule is better than random chance. Values above 1 indicate a useful relationship, while values below 1 suggest a less significant one. The formula is as follows:
$$\mathrm{Lift}(A \rightarrow B) = \frac{P(A \cap B)}{P(A)\,P(B)}$$
4. Bi-Support: If both rule A → B and rule ¬A → ¬B are strong, then the rule A → B is very strong. Thus, we should look for strong evidence that these rules are interesting. This is the support condition (Bi-support) of the bi-directional measure framework.
5. Bi-confidence: Confidence typically signifies that when certain itemsets occur, they may lead to the occurrence of other itemsets. However, the Confidence metric focuses on the probability of B occurring when A occurs and does not adequately account for the relationship between A and B when A does not occur. This limitation renders many mined association rules invalid. Confidence alone therefore does not provide a complete depiction and does not fully capture the degree of correlation between itemsets. To address this shortcoming, the concept of "Bi-confidence" has been proposed. The Bi-Confidence formula is as follows:
6. Bi-lift: Related research shows that the Lift method helps produce good evaluation results. However, Lift puts A and B in equivalent positions, so that rule A → B is treated as equivalent to B → A. If we accept rule A → B, we must also accept rule B → A, which is sometimes not true. For this problem, a Bi-lift measurement method has been proposed. The Bi-lift formula is as follows:
All of these metrics are used to identify significant relationships in the data and support business decision-making.
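As an illustration of how the first three metrics are computed in practice, the sketch below evaluates a single rule A → B on a toy set of transactions using the standard definitions of support, confidence, and lift. It also reports the support and confidence of the negated rule ¬A → ¬B, which are the quantities the bi-directional measures draw on; the Bi- formulas themselves are not reproduced here. The transactions, item names, and the `rule_metrics` helper are hypothetical, and reading ¬A as "the transaction does not contain all items of A" is an assumption.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(transactions, a, b):
    """Standard metrics for A -> B, plus support/confidence of not-A -> not-B."""
    a, b = set(a), set(b)
    n = len(transactions)
    supp_ab = support(transactions, a | b)
    supp_a = support(transactions, a)
    supp_b = support(transactions, b)
    confidence = supp_ab / supp_a if supp_a else 0.0
    lift = confidence / supp_b if supp_b else 0.0
    # Assumption: "not A" means the transaction lacks at least one item of A.
    not_a = [t for t in transactions if not a <= t]
    neither = [t for t in not_a if not b <= t]
    return {
        "support": supp_ab,
        "confidence": confidence,
        "lift": lift,
        "neg_support": len(neither) / n,
        "neg_confidence": len(neither) / len(not_a) if not_a else 0.0,
    }


# Toy transactions with made-up item names.
transactions = [
    {"oil_15", "oil_30"},
    {"oil_15", "oil_30", "balm"},
    {"oil_15"},
    {"balm"},
    {"oil_30", "balm"},
]
m = rule_metrics(transactions, {"oil_15"}, {"oil_30"})
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Here the rule holds in 2 of 5 transactions (support 0.4), A occurs in 3 of them (confidence 2/3), and the lift is slightly above 1 because B occurs in 3 of 5 transactions overall.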

RESULTS AND DISCUSSION
The dataset used in this study consists of 100,497 transactions with 126 items recorded over 155 days. It is a sales transaction dataset from PT. XYZ, a pharmaceutical company in Indonesia. The discussion of the research results is divided into two parts: the evaluation of the association rules algorithms, and the resulting association rules, in the form of combination rules between items, which are used in line with the objectives of this study.
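Because sales data is usually stored one row per item sold, the rows have to be grouped into transactions before any rules can be mined. The following minimal sketch shows one way to do this grouping. The `(invoice_id, item)` row layout, the identifiers, and the `rows_to_transactions` helper are assumptions for illustration, not the actual schema of the PT. XYZ dataset.

```python
from collections import defaultdict

# Hypothetical row-level sales records: (invoice_id, item_name).
rows = [
    ("INV-001", "item_a"),
    ("INV-001", "item_b"),
    ("INV-002", "item_a"),
    ("INV-003", "item_b"),
    ("INV-003", "item_c"),
    ("INV-003", "item_b"),  # the same item listed twice on one invoice
]


def rows_to_transactions(rows):
    """Group rows by invoice; each transaction becomes a set of distinct items."""
    baskets = defaultdict(set)
    for invoice_id, item in rows:
        baskets[invoice_id].add(item)
    return dict(baskets)


transactions = rows_to_transactions(rows)
items = sorted({item for basket in transactions.values() for item in basket})
print(len(rows), "rows ->", len(transactions), "transactions,", len(items), "distinct items")
```

Using a set per invoice means repeated lines for the same item collapse into a single occurrence, which is what itemset-based measures such as support expect.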

Algorithm Evaluation Results
In this research, the sales data of PT. XYZ for July 2022 to December 2022 are analyzed using two association rules mining algorithms, Apriori and FP-GROWTH. The following is a table of the results of the processes carried out by the researchers, using the Python programming language on Google Colab, on the dataset described in Table 8 (Dataset information). The FP-GROWTH algorithm takes a different approach from Apriori by building a compact FP-Tree structure from the dataset. This avoids explicitly generating candidate itemsets.
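Comparing the two algorithms on execution time and memory requires instrumenting each run. The sketch below shows one way to do so with Python's standard library only. It is an assumption about tooling, not the authors' setup: `tracemalloc` tracks memory allocated by Python objects, which differs from the process-level peak memory that a profiler would report. The `measure` helper and the `count_pairs` stand-in workload are hypothetical.

```python
import time
import tracemalloc
from collections import Counter
from itertools import combinations


def measure(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds, peak_mib)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()  # (current, peak)
    tracemalloc.stop()
    return result, elapsed, peak_bytes / (1024 ** 2)


def count_pairs(transactions):
    """Stand-in workload: count how often each pair of items co-occurs."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(t), 2))
    return counts


transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}]
result, seconds, peak_mib = measure(count_pairs, transactions)
print(f"time: {seconds:.6f} s, peak traced memory: {peak_mib:.4f} MiB")
```

Wrapping each mining call in the same helper keeps the comparison fair, since both algorithms are then timed and traced in the same way.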

Practical
The Apriori algorithm is easy to understand and implement. The concept is simple and intuitive. It generates all association rules that meet the specified support and confidence limits.
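The simplicity of Apriori can be seen in a minimal, unoptimized sketch of its level-wise search for frequent itemsets: candidates of size k are built from frequent itemsets of size k-1, pruned, and then counted against the data. This is an illustrative version written for readability, not the implementation used in the study; the toy transactions and the threshold are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # One pass over the data per level to count each candidate.
        out = {}
        for c in candidates:
            s = sum(c <= t for t in transactions) / n
            if s >= min_support:
                out[c] = s
        return out

    level = frequent({frozenset([i]) for t in transactions for i in t})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result


# Toy data; item names and the 0.5 threshold are made up for illustration.
data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
itemsets = apriori(data, min_support=0.5)
for s, supp in sorted(itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(s), supp)
```

Each level requires another scan of the transactions, which is the repeated database scanning and candidate generation that FP-GROWTH is designed to avoid.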
The FP-GROWTH algorithm requires a deeper understanding and a more complicated implementation than Apriori. It requires FP-Tree data structures and complex tree traversal processes.
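To give a sense of the FP-Tree data structure, the sketch below builds the tree in two scans: one to count items and one to insert each transaction with its frequent items sorted by descending frequency, so that common prefixes share nodes. Only the construction step is shown; the recursive mining of conditional trees is omitted. The `FPNode` class, the helper names, and the toy data are hypothetical.

```python
from collections import Counter


class FPNode:
    """One node of the FP-Tree: an item, its count, and its child nodes."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # First scan: count items and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root = FPNode(None)
    # Second scan: insert items in descending frequency order (ties by name).
    for t in transactions:
        ordered = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-frequent[i], i),
        )
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent


def count_nodes(node):
    """Number of nodes below the given node."""
    return sum(1 + count_nodes(c) for c in node.children.values())


# Toy data; item names and min_count are made up for illustration.
data = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"b"}]
root, frequent = build_fp_tree(data, min_count=2)
print("nodes in tree:", count_nodes(root))
print("item occurrences in data:", sum(len(t) for t in data))
```

The four transactions contain eight item occurrences but need only four tree nodes, because every transaction shares the prefix that starts with the most frequent item.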
After applying the Apriori and FP-GROWTH algorithms, the Overall Variability of Association Rules (OVAR) is calculated. OVAR is a statistical measure used in data mining and association rule analysis to quantify the overall variation or dispersion of support values among discovered association patterns. It helps assess how much the support values of different association patterns deviate from their average support value, Mi, and thus gauges the level of variation or heterogeneity within the dataset with respect to association patterns. The OVAR formula is as follows:

$$\mathrm{OVAR} = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{K}\left(X_{ij} - M_i\right)^2$$
Where:
N = the number of association patterns discovered by the algorithm
K = the number of itemsets present in the dataset
Xij = the support of association pattern i and itemset j
Mi = the average support of all association patterns
The following is a table of results from the OVAR calculation based on the OVAR formula. The following are some explanations of the evaluation results of the two algorithms:
1. The amount of data processed by both algorithms is 100,497 transactions, with 126 items in the dataset.
2. When these association rules are implemented with a different set of measures, namely Bi-support, Bi-confidence, and Bi-lift, Apriori is, according to all criteria, the better algorithm in terms of time and memory usage.
3. FP-GROWTH has a faster overall execution time than Apriori. FP-GROWTH takes 84,655 seconds, almost half of the 168,488 seconds that Apriori takes. Interestingly, when the execution time covers only the algorithm itself, without the initial load and pre-processing steps, FP-GROWTH has a very short execution time of 2,645 seconds, while Apriori requires 133,576 seconds.
4. Apriori has almost five times lower memory usage than FP-GROWTH. The peak memory usage for Apriori is 482.32 MiB, while for FP-GROWTH it is 2393.77 MiB.
5. Both methods generate the same number of rules for both criteria.
6. In terms of effectiveness, FP-GROWTH is faster in execution because it reduces the number of database scans and avoids candidate generation.
7. Regarding practicality, Apriori has the advantage of being easy to understand and implement, while FP-GROWTH is more complicated but more efficient on large datasets.
In addition, OVAR (Overall Variability of Association Rules) calculations are carried out with the Mi value calculated from the support value. The OVAR calculation results show a value of 0.000598837 for Apriori and 0.000598321 for FP-GROWTH.
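The OVAR formula can be sketched in a few lines of Python. The sketch assumes that X is given as N rows of K support values and reads Mi as the mean of row i, which is only one possible interpretation of the definition. The input values and the `ovar` helper are hypothetical and do not reproduce the figures reported in this study.

```python
def ovar(x):
    """OVAR = (1/N) * sum_i sum_j (X_ij - M_i)^2 for an N x K list of supports."""
    n = len(x)
    total = 0.0
    for row in x:
        m_i = sum(row) / len(row)  # assumption: M_i is the mean of row i
        total += sum((value - m_i) ** 2 for value in row)
    return total / n


# Hypothetical support values: two patterns (rows) and three itemsets (columns).
x = [
    [0.10, 0.20, 0.30],
    [0.40, 0.40, 0.40],
]
print(round(ovar(x), 6))
```

The second row contributes nothing because its values do not vary, so the result reflects only the spread in the first row.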
OVAR based on Bi-Support, meanwhile, shows a value of 0.020236498 for Apriori and 0.020233159 for FP-GROWTH.

The results of the rules for each Apriori and FP-GROWTH algorithm
To analyze these results, we can look at the metrics used by the algorithms: support, confidence, and lift, together with Bi-Support, Bi-Confidence, and Bi-Lift.
1. Support measures how often an itemset appears in the dataset.
2. Confidence measures how often the resulting itemsets appear together.
3. Lift measures the dependency between the resulting itemsets.
4. Bi-Support measures how often itemset A and itemset B occur together with other itemsets in the dataset. Bi-Support considers the relationship between rule A → B and rule ¬A → ¬B and calculates the minimum value of their support.
5. Bi-Confidence measures how often rule A → B and rule ¬A → ¬B occur together with other rules in the dataset. Bi-Confidence considers the relationship between rule A → B and rule ¬A → ¬B and calculates the minimum value of the confidence of both.
6. Bi-Lift measures the strength of the relationship between rule A → B and rule ¬A → ¬B by considering the relationship between the itemsets involved. Bi-Lift compares the dependency between the A → B rule and the ¬A → ¬B rule by considering the other itemsets in the dataset.
Using the above metrics, we can understand how often the itemsets and association rules appear in the dataset, how strong the relationship between the itemsets and the rules is, and how much influence there is between the A → B rules and the ¬A → ¬B rules in the dataset. In analyzing these results, we can compare the values of support, confidence, lift, Bi-support, Bi-confidence, and Bi-lift to better understand the relationship between the itemsets and the association rules in the dataset. For each of the Apriori and FP-GROWTH algorithms, the results display the top 15 rules by each of Support, Confidence, Lift, Bi-Support, Bi-Confidence, and Bi-Lift.

Results of Apriori Algorithm Rules
The following is a table of results from implementing the Apriori algorithm, with the rules ranked from highest to lowest by Support value. It is followed by a table of results from applying the Apriori algorithm with the rules ranked from highest to lowest by Lift value.

Results of FP-GROWTH Algorithm Rules
The following is a table of results from implementing the FP-GROWTH algorithm, with the rules ranked from highest to lowest by Support value. It is followed by a table of results from implementing the FP-GROWTH algorithm with the rules ranked from highest to lowest by Confidence value.

Analysis of the results of the rules
From the results of the existing rules, several things can be analyzed:

Table 16. Analysis of the results of the rules
1. Analysis: Some combinations often appear in transactions and have a high level of trust.
   Results: Almost half of the top 10 rules by confidence are also in the top 10 rules by support. For example, the rule "CAP EUCALYPTUS OIL X 15 → CAP EUCALYPTUS OIL X 30" has a support value of 18.22% and a confidence of 64.38%.
2. Analysis: There is a correlation between the level of trust (confidence) and the increase (lift) of the product.
   Results: Half of the top 10 rules by confidence also appear in the top 10 rules by lift. For example, the rule "CAP EUCALYPTUS OIL X 15, CAP EUCALYPTUS OIL X 60 → CAP EUCALYPTUS OIL X 30" has a confidence value of 82.22% and a lift of 2.68.
3. Analysis: The relationship between products in terms of lift is also supported by a high confidence level.
   Results: Five of the top 10 rules by lift are also in the top 10 rules by confidence.
4. Analysis: Although these rules have a significant lift, the frequency of occurrence of the combinations of itemsets A and B in the dataset is relatively low.
   Results: Two of the top 10 rules by support are in the top 10 rules by lift.
5. Analysis: The concept of bi-support combines the two rules in two directions to consider the frequency of occurrence of combinations of itemsets A and B.
   Results: There is a difference between the support and bi-support values, where the bi-support value tends to be higher than the support value.
6. Analysis: Conceptually, confidence only considers the appearance of itemset B when itemset A occurs, while bi-confidence involves the correlation between itemsets A and B.
   Results: There is a difference between the confidence and bi-confidence values. The bi-confidence value tends to be lower than the confidence value.
7. Analysis: The concept of Bi-lift is to consider the relationship between itemsets A and B regardless of whether itemset A occurs or not.
   Results: There is a difference between the lift and bi-lift values. Bi-lift values tend to be higher than lift values.

Dedy Dwiputra, Agung Mulyo Widodo, Habibullah Akbar, Gerry Firmansyah
Evaluating the Performance of Association Rules in Apriori and FP-Growth Algorithms: Market Basket Analysis to Discover Rules of Item Combinations