The desired outcome is guided by the following comment, provided by the person who uploaded this data to Kaggle:

…we are trying to determine which products we should continue to sell, and which products to remove from our inventory.

The poster discusses their motivations for uploading the data set:

We suspect that data science applied to the set…can help us generate a value (i.e., probability score) for each product, that can be used as the main determinant evaluating the inventory. Each row in the file represents one product.

Paraphrasing, the goal is to answer the following question: what is the value of having an item, identified by its SKU, in inventory? This ultimately relies on three components: the probability of a sale, the profit from a sale, and the cost of warehousing.

The ideal formula would be something like: \[value = P(sale) * profit\_of\_sale - cost\_of\_warehousing.\] The decision of which items to no longer stock can then be made by continuing to warehouse only those with a positive value or, alternatively, by dropping all items in a bottom portion of the stock, ranked by value.
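As a toy illustration of this formula, here is the arithmetic with entirely made-up figures, since the true profit and warehousing cost are unknown:

```r
# Toy illustration of the value formula; all figures are hypothetical,
# since the true profit and warehousing cost are not in the data.
p_sale <- 0.15               # probability of a sale over the period
profit_of_sale <- 20.00      # assumed profit per sale
cost_of_warehousing <- 1.50  # assumed per-item warehousing cost

value <- p_sale * profit_of_sale - cost_of_warehousing
value # 1.5: positive, so this hypothetical item would stay in stock
```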

The cost and profit components are a mystery. We don’t have access to the cost of warehousing, so we assume that it is equal for every item, and ignore it in the value calculations. We don’t have access to the profit involved in a sale, but we can substitute one of the three price variables for it. This assumes a similar profit margin for each item, but it is the best we can do with so many unknowns.

Because the historical data is taken over a six-month period, we are forced to use this interval in our calculations. That is, the \(P(sale)\) in the above formula will be calculated as the probability of a sale over a six-month period.
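If we did want a rough annual figure for comparison with the poster’s “10% sell each year” claim, a six-month probability could be annualised under the strong (and, given the seasonality concerns noted below, probably unrealistic) assumption that the two halves of the year behave independently and identically. A hypothetical sketch:

```r
# Hypothetical annualisation of a six-month sale probability,
# assuming independent, identically-behaved half-years.
p_six_months <- 0.10
p_year <- 1 - (1 - p_six_months)^2
p_year # 0.19
```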

We could model the number of sales as a discrete variable or a continuous variable, requiring a classification or regression model, respectively. Both options should be explored, with the decision based on which most accurately supports the valuation described above.


  • What currently determines the stock levels of an item?
  • What is the strength_factor of an item?
  • What can we say about the marketing_type and its impact on sales?
  • What is the relevance to sales of an item being recently released (release_year)?
  • What is the relevance to sales of an item being flagged as a new release (new_release_flag)?


  • Many variables are undefined, and it is not clear how we should interpret them:
    • strength_factor
    • release_number and, consequently, new_release_flag
    • The three pricing variables: price_reg, low_user_price and low_net_price
  • Six months of history is very limited, especially given that the company admits that “many of the products only have a single sale in the course of a year”. Moreover, this brief period ignores any seasonal trends. Do these items sell more over the Christmas period and, if so, do the past six months include December? Is the company selling winter jackets or swimwear?
  • There is a lack of information about the actual products themselves, likely due to that data being considered commercially sensitive. Even a rough categorisation of product type would be helpful.
  • We don’t know how the item_count variable, presumably a stock level, is defined. There’s almost certainly some sort of business process behind this decision, even if an informal one. It’s possible that previous sales prompt greater stock numbers, and so we run the risk of modelling sales figures on a proxy of sales figures.

Initial impressions

In the setup stage, we’re going to load some essential packages, along with the data, which is saved as sales.csv. We’re also going to rename the variables to match the tidyverse style, with lower-case letters and words separated by underscores. Finally, we convert the flag columns to factors.

library(tidyverse) # for data manipulation and plotting
library(kableExtra) # for modifying table width
library(e1071) # for svm models

sales <- "sales.csv" %>% 
    read.csv %>% # for some reason, readr's read_csv throws many errors
    as_tibble %>% 
    rename(
        order = Order, 
        file_type = File_Type, 
        SKU_number = SKU_number,
        sold_flag = SoldFlag, 
        sold_count = SoldCount, 
        marketing_type = MarketingType, 
        release_number = ReleaseNumber, 
        new_release_flag = New_Release_Flag, 
        strength_factor = StrengthFactor, 
        price_reg = PriceReg, 
        release_year = ReleaseYear, 
        item_count = ItemCount, 
        low_user_price = LowUserPrice, 
        low_net_price = LowNetPrice
    ) %>% 
    mutate_at(vars(sold_flag, new_release_flag), as.factor)

sales %>% 
    count(file_type) %>% 
    knitr::kable(format = "html") %>% 
    kable_styling(full_width = F)
file_type n
Active 122921
Historical 75996

I’m a big fan of the visdat package for looking at data for the first time. It gives variable types along with a visualisation of missing data.

sales %>% sample_n(10000) %>% arrange(order) %>% visdat::vis_dat()

The missing data is entirely within the sold_flag and sold_count variables. We know the data is split into an historical observational period as well as active inventory, so it makes sense that we would only have sales data for historical observations. We confirm this with the naniar package:

sales %>% 
    select(file_type, sold_count, sold_flag) %>%
    group_by(file_type) %>% 
    naniar::miss_var_summary() %>% 
    knitr::kable(format = "html") %>% 
    kable_styling(full_width = F)
file_type variable n_miss pct_miss n_miss_cumsum
Historical sold_count 0 0 0
Historical sold_flag 0 0 0
Active sold_count 122921 100 122921
Active sold_flag 122921 100 245842

Given that we can only model on historical data, we now split the sales data into two data sets: historical and active. We remove the sold_flag and sold_count variables from the active tibble, since they serve no purpose there.

historical <- sales  %>% 
    filter(file_type == "Historical")
active <- sales %>% 
    select(-one_of("sold_flag", "sold_count")) %>%
    filter(file_type == "Active")

This has the benefit of making the data unique along SKU in each data set. For example, there are 75996 distinct SKUs in the historical data set, which has 75996 rows.
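A quick way to check this uniqueness, sketched here on a toy tibble with the same column name standing in for the real data:

```r
library(dplyr)

# Toy stand-in for the historical tibble
toy <- tibble(SKU_number = c(101, 102, 103), sold_count = c(0, 1, 0))

# TRUE when each SKU appears exactly once
n_distinct(toy$SKU_number) == nrow(toy)
```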

The poster of the data mentioned that sales are rare:

It is important to note that we have MANY products in our inventory, and very few of them tend to sell (only about 10% sell each year) and many of the products only have a single sale in the course of a year.

Let’s verify this claim with a histogram:

historical %>% 
    filter(sold_count <= 5) %>% 
    mutate(sold_count = sold_count %>% as.factor) %>% # makes graph look better
    ggplot(aes(x = sold_count)) + 
        geom_bar(na.rm = TRUE, fill = "#00BFC4") 

As expected, most of the items did not sell in the observation period, with only 17% of items experiencing a sale. This data imbalance will cause issues in modelling, as we will later see.

Data quality

Many issues with data quality could be resolved or mitigated with a data dictionary. We can make some assumptions, such as that item_count refers to inventory levels. Other variables, like release_number and the related new_release_flag, are a mystery to us.

Four items have a sold_count greater than their corresponding item_count. There are multiple possible explanations for this:

  • the item_count records the inventory at the end of the recording period,
  • the stock was replenished throughout the recording period,
  • this could be bad data, or
  • there is a relationship between the item_count variable and the release_number variable.

To complicate things even further, 83 products have an item_count of 0.

historical %>% 
    filter(sold_count > item_count) %>% 
    select(SKU_number, sold_count, item_count, release_number) %>% 
    knitr::kable(format = "html") %>% 
    kable_styling(full_width = F)
SKU_number sold_count item_count release_number
2277016 2 0 5
613864 69 44 0
1475488 12 8 2
809190 12 11 5

Moreover, the strength_factor of each item appears to vary between the active stock and the historical data sets. While we would expect the price variables to be time-dependent, it seems strange that the strength_factor—presumably an immutable property of the item—is not fixed.

To find these mismatches, we use the sqldf package to join the historical and active tibbles, selecting rows that match on SKU_number but differ in strength_factor. While not as fast as dplyr’s join verbs, sqldf allows us to join on predicates rather than just columns.

strength_factor_mismatches <- sqldf::sqldf("
    SELECT       ACTIVE.SKU_number
                ,HISTORICAL.strength_factor AS Historicalstrength_factor
                ,ACTIVE.strength_factor AS Activestrength_factor
    FROM        historical AS HISTORICAL
    INNER JOIN  active AS ACTIVE
        ON      HISTORICAL.SKU_number = ACTIVE.SKU_number
        AND     HISTORICAL.strength_factor <> ACTIVE.strength_factor
")
strength_factor_mismatches %>% nrow
## [1] 65556
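For what it’s worth, the same check can be sketched with dplyr alone: an equi-join on SKU_number followed by a filter on the mismatching column. Shown here on toy tibbles standing in for the real data:

```r
library(dplyr)

# Toy stand-ins for the historical and active tibbles
historical_toy <- tibble(SKU_number = 1:3, strength_factor = c(10, 20, 30))
active_toy     <- tibble(SKU_number = 1:3, strength_factor = c(10, 25, 30))

mismatches <- historical_toy %>%
    inner_join(active_toy, by = "SKU_number",
               suffix = c("_historical", "_active")) %>%
    filter(strength_factor_historical != strength_factor_active)

nrow(mismatches) # 1 (SKU 2, whose strength_factor changed from 20 to 25)
```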

The release_year variable is particularly strange. According to Kaggle, this data was uploaded in December of 2016, yet there are active products in inventory with a release_year of 2017, and others with a release_year of 2018 (counted with active %>% filter(release_year == 2017) %>% nrow and active %>% filter(release_year == 2018) %>% nrow, respectively).

Variable exploration

We will now investigate each of the variables and their impact on the target variables. We will focus predominantly on sold_flag for now, to simplify our analysis.

Numerical variables

We’re going to consider release_year as a numerical variable. We could treat it as a factor, but that would make the modelling steps much more difficult, since there are 85 different years.

While we won’t explore this here, it might be easier to interpret the results of any modelling if we were to transform the variable to an age, tracking how many years have passed since each product was released. This is a simple affine transformation of the variable, so it shouldn’t have any impact on model results, but it makes for a more practical interpretation.
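A sketch of that transformation, using 2016 (the year the data was uploaded) as an assumed reference point:

```r
# Convert release_year to an age in years; 2016 is an assumed reference year
release_year <- c(1995, 2008, 2016)
age <- 2016 - release_year
age # 21 8 0
```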

numeric_cols <- c(
    "release_year", "price_reg", "low_net_price", "low_user_price", 
    "item_count", "strength_factor", "release_number"
)

The ggpairs function from the GGally package replicates the plotmatrix function that once existed in ggplot2. It displays three pieces of information: correlation coefficients in the top-right triangle, scatterplots in the bottom-left triangle, and density plots along the diagonal.

historical %>% select(all_of(numeric_cols)) %>% GGally::ggpairs()

There is almost no correlation between any of the numerical columns in this data. Multicollinearity can impact the interpretability of some models such as logistic regressions, since it is difficult to separate the effects of correlated variables.

Curiously, none of the three price metrics correlate either. We will explore these in more detail later.
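For contrast, here is what strong and weak pairwise correlations look like numerically, on simulated toy columns (not the sales data):

```r
# Simulated example: y is built mostly from x, z is independent noise
set.seed(42)
toy <- data.frame(x = rnorm(200))
toy$y <- 0.9 * toy$x + 0.1 * rnorm(200) # strongly correlated with x
toy$z <- rnorm(200)                     # uncorrelated with both

round(cor(toy), 2) # x and y correlate strongly; z correlates with neither
```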

We may wish to see if products released in certain years, perhaps more recent years, are more likely to sell. However, we need to be cautious about simply plotting sales against year, since we need to take into account how many items were released in a particular year as well. Below we plot the number of sales by release_year alongside the number of items available from each release_year:

Release year

x_labels <- seq(
    from = min(historical$release_year, na.rm = TRUE), # label every 5th year
    to = max(historical$release_year, na.rm = TRUE), 
    by = 5
)
historical %>% 
    ggplot(aes(x = release_year, y = sold_count)) + 
        geom_bar(stat = "identity", fill = "#00BFC4") +
        scale_x_continuous(breaks = x_labels, labels = x_labels)
historical %>% 
    ggplot(aes(x = release_year)) + 
        geom_bar(fill = "#00BFC4") +
        scale_x_continuous(breaks = x_labels, labels = x_labels)

To account for this, we instead consider the percentage of items sold each year. We’ll also limit the graph to show only data with a release_year in or after 1990, since there is little sales data before then.

historical %>%
    filter(release_year >= 1990) %>% 
    group_by(release_year) %>% 
    summarise(percent_sold = sum(sold_count) / sum(item_count)) %>%
    filter(!is.nan(percent_sold)) %>% # Remove division by zero errors
    ggplot(aes(x = release_year, y = percent_sold)) + 
        geom_bar(stat = "identity", fill = "#00BFC4") +
        scale_x_continuous(breaks = x_labels, labels = x_labels)