Problem Definition

Customer engagement is key to the success of any ecommerce service: the more time a user spends on the website, the greater the potential to generate revenue. Several engagement strategies have been implemented successfully, and one of the most popular is product recommendation. In this post we will work through a simple yet effective product recommendation algorithm that recommends products based on their current popularity/activity level. The metric we will use to measure activity is cart loads. There are many more complex recommendation algorithms, which I will cover in the future, but this is a great way to get started.

Data Simulation

For the purpose of this experiment we will simulate data that closely resembles reality. Below are the simulation strategies and assumptions.

Products : We will generate 100 product IDs, assuming the ecommerce website sells only these 100 products

Observation Window : We will consider activity data for the past 31 days (let's say 01-Mar to 31-Mar)

Average daily cart loads for each product : Let's assume this follows a beta distribution with parameters 2 and 5, scaled by 100 (see the quick illustration after this list)

Daily cart loads for each product : Follow a Poisson distribution, with each product's mean obtained from the step above

Weekly seasonality : Let's assume a 10% spike on weekends. This is reasonable because customers have more time to shop over the weekend

Random normal noise : It is everywhere and we cannot escape it!
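
As a quick illustration of the beta assumption: the mean of a Beta(2, 5) distribution is 2/(2 + 5) ≈ 0.286, so after the 100x scaling used in the simulation code below we expect roughly 29 cart loads per product per day on average.

# Expected average daily cart loads implied by the Beta(2, 5) assumption
# (scaled by 100, as in the simulation code below)
100 * 2/(2 + 5)
## [1] 28.57143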

# Simulate 100 products
products <- 1:100

# Simulate dates
library(lubridate)
dates <- seq(as.Date("2017-03-01"), as.Date("2017-03-31"), by = "day")

set.seed(1)
# Simulate average cart loads per day for each product : follows a beta distribution
daily_avg <- data.frame(product_id = products, 
                        avg = 100*rbeta(length(products), 2, 5))

# Build the full product x date grid
cart_txns <- expand.grid(product_id = products, date = dates)

# Merge the daily averages onto the grid
cart_txns <- merge(cart_txns, daily_avg, by = "product_id")

# Simulate daily cart loads for each product : follows a Poisson distribution,
# with each product's average as the mean (rpois is vectorized over the mean)
cart_txns$carts <- rpois(nrow(cart_txns), cart_txns$avg)

# Simulate Weekend spike

# Carts spike by the proportion below on weekends (wday 1 = Sunday, 7 = Saturday)
spike <- 0.1
cart_txns$carts <- ifelse(wday(cart_txns$date) %in% c(1, 7), 
                          round(cart_txns$carts * (1 + spike), 0), cart_txns$carts)

# Add random normal noise (this can occasionally push a near-zero count
# slightly negative, which is harmless for this illustration)
cart_txns$carts <- cart_txns$carts + 
        round(rnorm(length(cart_txns$carts), 0, 1), 0)

The data is now ready, so we can dive right in and explore it. We will print the top few rows for reference.

head(cart_txns)
##   product_id       date      avg carts
## 1          1 2017-03-01 17.54713    23
## 2          1 2017-03-08 17.54713    14
## 3          1 2017-03-18 17.54713    23
## 4          1 2017-03-23 17.54713    12
## 5          1 2017-03-16 17.54713    15
## 6          1 2017-03-04 17.54713    21

Understanding the trend

Let's randomly select a product and look at its cart trend over the observation period.

p <- sample(products, 1)

library(dplyr)
library(ggplot2)

# my_theme() is a custom ggplot2 theme defined in this post's setup;
# theme_minimal() is a drop-in substitute if you are following along
cart_txns[cart_txns$product_id == p, ] %>% 
        ggplot(aes(x = date, y = carts)) + 
        geom_line(size = 1) + 
        labs(title = sprintf("Cart trend of product %s", p)) + 
        my_theme() + geom_smooth()

Smoothing the trend

The trend we observed above usually includes noise (here we added it deliberately as part of the simulation). Smoothing removes the noise so that the long-term trend of the line can be observed more clearly. We will write a small custom function for smoothing this signal.

The smoothing algorithm is adapted from the SciPy Cookbook. Before applying the filter, the function pads the signal with reflected copies of itself, each of length equal to the window size, on either side, so that the output signal is the same length as the original.

# hamming.window() is provided by the e1071 package
library(e1071)

sig_smooth <- function(x, window_size = 7, window_type = "hamming.window") {
        
        # Add a reflected copy of the signal on either end
        x <- c(rev(x[1:window_size]), x, rev(x)[1:window_size])
        
        # Create the filter weights
        fil <- do.call(window_type, list(window_size))
        
        # Apply the normalized smoothing filter
        x <- stats::filter(x, fil/sum(fil))
        
        # Remove the padded ends so the output matches the input length
        x[(window_size + 1):(length(x) - window_size)]
}
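
As a quick sanity check (illustrative, with a made-up input signal), the smoothed output should be the same length as the input:

# Toy signal: a sine wave plus noise (illustrative values only)
toy <- sin(seq(0, 2*pi, length.out = 31)) + rnorm(31, 0, 0.2)
length(sig_smooth(toy)) == length(toy)
## [1] TRUE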

Now that the smoothing function is ready, let's see how the smoothed version compares to the original.

# Sort by product and date so each series is in chronological order
cart_txns <- cart_txns %>% arrange(product_id, date)

library(data.table)
cart_txns <- as.data.table(cart_txns)

cart_txns[, smooth := as.numeric(sig_smooth(carts)), by = product_id]
##       product_id       date      avg carts   smooth
##    1:          1 2017-03-01 17.54713    23 19.38253
##    2:          1 2017-03-02 17.54713    16 17.88855
##    3:          1 2017-03-03 17.54713    12 17.22590
##    4:          1 2017-03-04 17.54713    21 18.02410
##    5:          1 2017-03-05 17.54713    19 18.98494
##   ---                                              
## 3096:        100 2017-03-27 20.13960    27 20.57831
## 3097:        100 2017-03-28 20.13960    27 18.70181
## 3098:        100 2017-03-29 20.13960    19 17.07530
## 3099:        100 2017-03-30 20.13960    14 16.52108
## 3100:        100 2017-03-31 20.13960    25 16.65663
cart_txns[cart_txns$product_id == p, ] %>% 
        ggplot(aes(x = date)) + 
        geom_line(aes(y = carts, col = "carts"), size = 1) + 
        labs(title = sprintf("Cart and smooth trend of product %s", p)) + 
        my_theme() + 
        geom_line(aes(y = smooth, col = "smooth"), size = 1)

The smoothing function did a pretty good job of flattening out the daily noise. Finally, we will normalize the carts and smooth variables so that we can compare the trends of products with vastly different means. For this purpose we will use a median-centered variant of min-max normalization: subtract the median from every element and divide the result by the range (the difference between the maximum and the minimum).

invisible(cart_txns[, ":=" (
        carts_norm = as.numeric(
                (carts - median(carts))/(max(carts) - min(carts))
        ), 
        smooth_norm = as.numeric(
                (smooth - median(smooth))/(max(smooth) - min(smooth)))
        ), 
        by = product_id])

cart_txns[cart_txns$product_id == p, ] %>% 
        ggplot(aes(x = date)) + 
        geom_line(aes(y = carts_norm, col = "carts_norm"), size = 1) + 
        labs(title = sprintf("Normalized Cart and smooth trend of product %s", p)) + 
        my_theme() + 
        geom_line(aes(y = smooth_norm, col = "smooth_norm"), size = 1)

In the graph above, observe the change in the y-axis scale. We have centered the values at 0 and scaled them to lie between -1 and 1.
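
As a quick check (illustrative; output not shown), no value can deviate from the median by more than the full range, so both normalized columns must fall inside [-1, 1]:

# Both normalized columns should lie within [-1, 1]
range(cart_txns$carts_norm)
range(cart_txns$smooth_norm)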

Let's move on to the next step.

Trend Identification

Identifying the trend is very simple. Mathematically speaking, trend is nothing but the slope of ascent or descent, so we take the difference between the current value and the previous value: trend = smooth_norm - smooth_norm_prev. The code chunk below does exactly that and plots the result against the smoothed, normalized carts.

invisible(
        cart_txns[, 
                  smooth_norm_prev := shift(smooth_norm, type = "lag"), 
                  by = product_id]
)

# Then compute the trend by taking the difference
invisible(
        cart_txns[, 
                  trend := as.numeric(smooth_norm - smooth_norm_prev), 
                  by = product_id]
)

# Plotting the Trend

cart_txns[cart_txns$product_id == p, ] %>% 
        ggplot(aes(x = date)) + 
        geom_line(aes(y = carts_norm, col = "carts_norm"), size = 1) + 
        labs(title = sprintf("Normalized Cart, smooth and trend of product %s", p)) + 
        my_theme() + 
        geom_line(aes(y = smooth_norm, col = "smooth_norm"), size = 1) + 
        geom_line(aes(y = trend, col = "trend"), size = 1)

Instead of looking only at the trend value on the current day, we can also take a weighted average of the trend over the past n days, which factors in the trend of the trend and gives a more robust representation. For simplicity we will look back n = 1 day and use weights c(1, 0.5), i.e. trend_robust = trend + 0.5 * previous day's trend.

invisible(
        cart_txns[, 
                  trend_robust := as.numeric(trend + 0.5*shift(trend, type = "lag")), 
                  by = product_id]
)

# Plotting trend_robust

cart_txns[cart_txns$product_id == p, ] %>% 
        ggplot(aes(x = date)) + 
        #geom_line(aes(y = carts_norm, col = "carts_norm"), size = 1) + 
        labs(title = sprintf("Smoothed and robust trend of product %s", p)) + 
        my_theme() + 
        geom_line(aes(y = smooth_norm, col = "smooth_norm"), size = 1) + 
        #geom_line(aes(y = trend, col = "trend"), size = 1) + 
        geom_line(aes(y = trend_robust, col = "trend_robust"), size = 1)

# Now we will select the trend_robust for every product on the day of interest, 31-Mar in this case
latest_trend <- cart_txns[date == "2017-03-31", 
                          c("product_id", "trend_robust")][order(trend_robust, decreasing = TRUE)]

Now let's print the top 6 trending products.

print("Top Recommended Products")
## [1] "Top Recommended Products"
result <- apply(head(latest_trend), 
                MARGIN = 1, 
                function(x) cat("Product ", 
                                x[1], 
                                " (Trend Score ", x[2], ")", 
                                sep = "", 
                                fill = TRUE))
## Product 53 (Trend Score 0.3213265)
## Product 7 (Trend Score 0.286206)
## Product 32 (Trend Score 0.2800515)
## Product 57 (Trend Score 0.222346)
## Product 62 (Trend Score 0.2099849)
## Product 5 (Trend Score 0.2077364)

That's it! These are the top 6 recommended products that are currently trending. They can be surfaced on the website in near real time if we shrink the unit of measurement from days to hours or even minutes.
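
As a minimal sketch of that idea (the cart_events table and its timestamp column are hypothetical, not part of the simulation above), raw cart events could be bucketed into hourly counts with data.table and lubridate, after which the same smoothing and trend steps apply with the window size expressed in hours:

# Hypothetical raw event log: one row per cart load, with a timestamp
cart_events <- data.table(
        product_id = sample(products, 1000, replace = TRUE),
        timestamp  = as.POSIXct("2017-03-31", tz = "UTC") + runif(1000, 0, 86400)
)

# Bucket events into hourly counts; sig_smooth() and the trend steps
# above then apply unchanged, one value per hour instead of per day
hourly_carts <- cart_events[, .(carts = .N),
                            by = .(product_id,
                                   hour = floor_date(timestamp, "hour"))]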

02-Mar-2017

Acknowledgements: 1. A Simple Trending Products Recommendation Engine in Python