Today is the deadline for Coupon Purchase Prediction competition in Kaggle! That was my first Kaggle competition that I’ve actively participated in and I learned a lot along the way. I though it is a good time to share some of what I’ve learned as an overview note with the people out there who are curious about an interesting problem in general, and specially for those who are interested in the field of data-science in particular. The main purposes of writing this note is, first, showing how does a typical, real, data-science problem look like, and second, how you may/may-not approach it. If you are a super expert in this field this note is not written for you. However, I’ll published the technical details of this work in the future and that’s what the experts may want to take a look at. Although, there have been developed better models by other Kagglers which may be available elsewhere. This is not the best model you can find for this problem. All parts of this approach perfectly works together though! Anyways, I hope you find this note helpful to understand: the problem/data that Ponpare is dealing with, the strategies that I’ve gone through to solve this problem, and the functions and tools that I’ve developed during the past two months of work on this project. It was fun overall!
1. Understanding The Problem/Data (What Do We Want/Have?)
What is the problem here and why is it important? Let’s start by understanding the motivation and nature of this problem. Have you ever thought about how famous websites such as Netflix, Amazon or Groupon recommend you a short list of their products and services such as movies, goods, and coupons every time you visit? This is exactly the subject of one of the recent machine learning competitions in famous Kaggle platform: Recommender Systems. Recommender Systems are widely used in different busynesses and are extremely important to improve the quality of online services. As a costumer you want to save money and time. On the other hand, the companies want to fulfill this costumers’ desire to be more efficient, more competitive, and less costly and as a result gains more money. Recommender systems do all these for you and the companies!
The options in each website are too many and anytime you intend to make a purchase you fear wasting your time and money on a product or service that we may not enjoy or fully understand. Therefore it is helpful to have a enriched recommendation package in front. Moreover, I’m sure you have noticed that the items in the recommendation package is not random and it also changes over time for each user based on your activities in the website. But have you really thought how a machine learning platform can come up with a package of recommendation items for you? The problem and available data are the following.
What do we want? Ponpare, Japan’s leading joint coupon site, wants to earn more money! Specifically, Ponpare wants you to design a model to recommend each registered user of this website a package of 10 most-likely-to-get-purchased coupons out of hundreds of coupons available this week.
What do we have? Ponpare have released a year of data from 07/01/2011 to 06/30/2012 (except the information related to the very last week i.e. 06/24/2012-06/30/2012 which is kept out for the model evaluation purposes) about the users, visits, purchases, and products of its coupon website. Namely, we have the following data files:
- USERS [22,873 unique users]: Some information about all the registered users including their state, sex, age, registration and withdraw dates, and user ID hashtag. (user_list.csv)
- COUPONS [TRAINING DATASET] [19,413 unique coupons]
- coupon_list_train.csv: some information about some of the coupons that has been offered in the past year (during 07/01/2011-06/24/2012) including their general category, store location, store state, discount rate, original/discounted price of the item, validity period, availability period, coupon ID hashtag, etc. This is part of the training dataset and does not include any coupon in the test dataset i.e. any coupon offered anytime during 06/24/2012-06/30/2012.
- coupon_area_train.csv: the coupon listing area for the training set coupons.
- COUPONS [TEST DATASET] [310 unique coupons]
- coupon_list_test.csv: including the same set of information as for coupon_list_train.csv, except the coupons in this data frame are all have been offered during the very last week i.e. 06/24/2012-06/30/2012. All the recommended coupons for the last week must be chosen from this set.
- coupon_area_test.csv: the coupon listing area for the test set coupons.
- VISITS [2,833,180 unique visits]: Information about the visits (or simply the records of the all clicks on each coupon by any registered user in the website during 07/01/2011-06/24/2012) including: the date of visit, the user ID of the visitor, the coupon ID of the visited coupon, etc. (coupon_visit_train.csv)
- PURCHASES [168,996 unique purchases]: Information about the purchases in the website during 07/01/2011-06/24/2012 including the location, number of purchased items, date of purchase, the user ID of the purchaser, the coupon ID of the purchased coupon, etc. (coupon_detail_train.csv)
We are also given the attitude/latitude of all the Japanese prefectures. In order to digest data and have a better understanding of it, I have done some visualizations of data which is very informative and gives useful insights in cleaning data and strategies that should be taken in the next steps. This section has been removed from this review report for simplicity but it is available with full details in the technical report.
2. Cleaning Data (What We Can Clean Up?)
Sometimes it is very important and useful to clean the data before even starting to work on them. But, how do you know that cleaning data is actually helpful before analyzing them? Well, sometimes just visualizing/understanding the data is all you need. Previous section gives you some helpful insight in this regard, for example. But this is not always the case. Sometimes you figure out later during the next steps that some cleaning on data would be useful. Therefore, you come back and clean the data and redo the analysis. Here in this specific problem there are following issues that one may consider in order to clean-up/enrich data. Considering the following issues makes it easier to do the analysis and also improves the quality of the final outcome.
- Replacing all NA values with “1”: This may seem a dangerous and harsh action to do at the first glance, but looking back at the position of NAs and their neighbor values shows that it’s actually not that bad! One should notice that all NAs are in flag format variables such as “USABLE_DATE_SUN” and “USABLE_DATE_HOLIDAY” and most of the majority of them are 1. So, replacing all NAs with “1” shouldn’t hurt much! Moreover, in the next steps you will see that the importance weight of the features in which the Nas are all located at (USABLE_* features) in the final result is very small.
- Converting the class of date variables to “date”: It turned out that the class of those variables which are supposed to be in the format of “date” is not “date” but “factor”! One has to convert these date variables to “date” class such that R can understand and work on them. The examples include DISPFROM and I_DATE variables.
- Translation from Japanese to English: This is not a necessary step but a helpful modification to non-javanese speakers because it helps to have a meaningful picture of the features when it comes to comparison and examples. Many of text data are in Japanese. I used a separate R script to translate data to English. This script is available in my technical report.
- Adding extra features to USERS data frame: One of the most important questions that we would like to answer is about the history of the activities of each registered user in the website. It is crucial to know that what range of category, price, and discount rate have been visited in the past by each user. On the other hand, because of the huge size of VISITS data file (~3 million unique visits) it’s computationally expensive to apply an algorithm to search for the history of each user more than once! So we do it once and save the results as an additional feature to our USERS data file. For examples, I would like to have an answer to the following questions : 1) what category is she/he interested in? 2) What range of price/discount-rate is she/he looking for? 3) Where she/he is interested to buy? etc. I specially built and implemented functions to calculate the following helpful, extra features about the users: 1) The top-3 most favorite categories that each user has visited in the past, 2) the most recent visited category that each user has checked out in the website. I did the search for all of the USERS in the VISITS file and added the results as new features to USERS data file.
3. Strategies (What We Can Do?)
The strategy that you take to attack a data-science problem is the core of your approach. There are many strategies that you can follow for a specific type of problems but it’s always important to choose wisely. Regarding to this problem, I’ve gone through two approaches as described below. The first approach is Cosine Similarity which is pretty quick&dirty but relatively efficient and the second approach is based on Decision Trees which is much more accurate and flexible but also much more extensive computationally. I briefly describe the basics of each approach in the following paragraphs. If you would like to see the details and the actual codes, you can check it out in the technical report of mine on this project.
- COSINE SIMILARITY: This strategy is all based on comparing the similarity of each coupon in the test dataset with the coupons that have been purchased by the user earlier in the training dataset in an semi-automatic manner by matrix analysis. In another words, I came up with a model in which for each user all the 310 coupons in the test dataset get scored based on their similarities with the earlier purchases of the same user in the training dataset. At the end of the day the top-10 most scored coupons for each user get chosen to be in the recommendation list for that specific user. Notice that we don’t give different features of the coupons the same weight, different features have different level of importance in this model. We figured that the coupon’s category, display period, and validity period have the most significant weights among all the other coupon features. For example, based on our analysis on these data, it is more important for a user to purchase from the category of his/her interest rather than the item’s price. So it get’s more weight in the scoring process.
- Pros: matrix analysis; relatively fast and efficient; computationally inexpensive; very general [we rank all 310 coupons in the test dataset for each user]
- Cones: uncertainties on the values of the free parameters; producing many NA values (~20%); neglecting the history of VISITS (and USERS’ features: AGE, for instance) [only the information from COUPONS and PURCHASES have been used]; difficult to understand/modify piece by piece
- Lines of Improvement: Filling out NAs via another strategy such as Decision Trees [explained in the next subsection]; applying either random-forest, gradient-decent, and LASSO methods to set the weights of importance more accurately; implementing the effect of VISITS and USERS’ features
- DECISION TREES: Unlike the previous strategy, in which we score all 310 coupons in the test dataset and rank them based on their scores, here, based on the history of purchases, we filter out some of the less-interesting coupons and start from a smaller portion of coupons, which based on the history of purchases/visits, are most likely to get purchased by the users by a huge probability difference. For example, (again based on the history of purchases) I considered the 50%-off and above-90%-off coupons as my initial bag of coupons. This action filters out ~66% of test coupons. The remaining portion is our initial bag of coupons. Starting from this initial bag, I considered the elements of this bag which are in the same category of user’s most recent visited category. If this intersection has less than 10 elements, I add those coupons which are in the most favorite category of the users, and so on until I a sweet bag of 10 coupons to recommend is made. This strategy is very easy to modify and implement but computationally expensive.
- Pros: easy to follow/understand/modify; history of VISITS are implemented; no NAs get produced; flexible to add new features [more trees are always possible]
- Cones: limited in scope [starting with a filtered portion of coupons]; computationally expensive; various features cannot be applied simultaneously [nature of decision trees]
- Lines of Improvement: Implement different weights for different features depending on their level of importance
4. Functions (What We Can Build?)
Dealing with huge datasets is a normal part of almost every data-science problem. Creating your own functions are extremely useful in this regard. In this specific problem I have developed various functions and routines to make things faster and easier to run and modify. Some of the functions that I have developed during working on this project are briefly described below.
- FUNCTION: “top_most_recent_visited_genre_last_week”: takes the USER_ID_hash and returns a list of all recent visited genres for this user in last week. This function has a crucial rule in Decision Tree strategy to identify the last visited category of each user [Why only last week? Because it’s good enough! And searching over the whole one year period is very expensive computationally, even for doing once. But notice that everywhere in the data in average ~50% of all last visits have happened during the previous week.]
- FUNCTION: “scores”: takes two vectors, v1 (vector of purchased coupons) and v2 (vector of recommended coupons), and returns the performance efficiency score (if all purchased coupons are predicted score = 1 otherwise < 1). This function is used for scoring in Cross Validation function (see FUNCTION:”CV”).
- FUNCTION: “total_scores”: takes two data frames i.e. old_submission and new_submission as its inputs and computes the accuracy of prediction (old_submission) versus real purchases (new_submission) for one random week. This function has a main role in Cross Validation function (see FUNCTION:”CV”).
- FUNCTION: “top3_visited_genre”: takes USER_ID_hash and returns the top-3 most visited genres for that specific user in vector format. This function is used in Decision Tree strategy to identify the 3 most visited categories of coupons for each user in VISITS data file.
- FUNCTION: “top3_recent_visited_genre”: takes USER_ID_hash and returs the top-3 most recent visited genres for that user. This function is used in Decision Tree strategy to identify the 3 most visited categories of coupons for each user in VISITS data file.
- FUNCTION: “instate_coupons”: takes USER_ID_hash and returs the available coupons from test dataset in the same state as for user. This function is used in Decision Tree strategy to identify the coupons in same state of user in test dataset.
- FUNCTION: “recent_ingenre_coupons”: takes USER_ID_hash and returs the available coupons from test dataset in the same genre as the most *recent* visited genre of the same user. This function is used in Decision Tree strategy to identify the coupons in the test dataset in the same category of user’s last visit.
- FUNCTION: “ingenre_coupons_I”: takes USER_ID_hash and returs the available coupons from test dataset in the same genre as the most visited genre of the same user. This function is used in Decision Tree strategy to identify the coupons in the test dataset in the same category of user’s most favorite category.
- FUNCTION: “ingenre_coupons_II”: takes USER_ID_hash and returs the available coupons from test dataset in the same genre as the second most visited genre of the same user. This function is used in Decision Tree strategy to identify the coupons in the test dataset in the same category of user’s second most favorite category.
- FUNCTION: “RECOM”: takes a USER_ID_hash and a test data frame and returns a list of 10 coupon recommendation from test data-set. This function has the main role in Decision Tree strategy to produce the recommendation list. This function is also used in (1) Cosine Similarity strategy to fill out the NAs and (2) in Cross Validation in order to test the quality of Decision Tree strategy with considering random weeks as new test datasets.
- FUNCTION: “MAINFUNC1”: takes a training dataset and a test dataset and returns a “submission” file which includes two columns consist of USER_ID_hash and 10-Recommended Coupons in from of each USER_ID_hash. This function includes the heart of Cosine Similarity strategy. This function is also used in Cross Validation with different, random chunks of data as new training and test datasets.
- FUNCTION: “CV”: takes a random_date and returns a total_score. The input random_date is considered as the starting date of a new test dataset in the old training dataset. The total_score gets calculated by old_submission and new_submission as inputs in which new_submission includes the predicted recommendation lists produced by a specific strategy (e.g. Cosine Similarity) and old_submission includes the true purchases during the new test dataset. CV function is designed to use in Cross Validation.
- FUNCTION: “Cross_Val”: takes an integer, i.e. “r”, and does a r-fold cross validation and returns the total_score for each iteration and averages ova all values. This function is designed to use in Cross Validation.
5. Cross Validation (How We Can Measure Our Performance?)
Self evaluation or technically speaking, Cross Validation, is a very useful tool to measure the quality of your main strategy in a self-evaluating way even without knowing the real answer for whatever you are predicting. Here is how Cross Validation works in a very simple language. After designing a predictive model in a data-science problem, you can evaluate your analysis by applying that on your training data. You randomly select a small portion of your training data and take it out as your new test dataset. Then you apply your model to the remaining portion of training data and come up with a prediction for your new test dataset which, in fact, you have the true answer for this part. Therefore, you can compare your prediction with truth. If you randomly choose your new test dataset from the training dataset multiple times, you can ensure that whatever the outcome is, it is a reliable yardstick to measure the quality of your method for the real test dataset which you don’t know the real answers for that.
I have built a Cross Validation platform for this Kaggle competition which does a cross validation for arbitrary times for your chosen strategy (as long as it returns a submission file in a standard format i.e. two columns. First: USER_ID_hash, and second: list of 10 recommended coupons) and returns the average quality score for your strategy. Check out the technical report for the details.
Acknowledgment: I acknowledge D.Sabri and S.Mirshamsi for their invaluable help and helpful discussion during this project. A part of the code on Cosine Similarity strategy is based on a script written by S. Mandal published on Kaggle website.