- Outcome
- Initial impressions
- Variable exploration
- Modelling options
- A note on time investment
- Categorisation with a random forest using stratified sampling
- Categorisation with a random forest using SMOTE
- Categorisation with logistic regression using SMOTE
- Linear regression using SMOTE
- Regression with support vector machines using SMOTE
- Multinomial classification with random forests using SMOTE

- Results

The desired outcome is guided by the following comment, provided by the person who uploaded this data to Kaggle:

> …we are trying to determine which products we should continue to sell, and which products to remove from our inventory.

The poster discusses their motivations for uploading the data set:

> We suspect that data science applied to the set…can help us generate a value (i.e., probability score) for each product, that can be used as the main determinant evaluating the inventory. Each row in the file represents one product.

Paraphrasing, the goal is to answer the following question: what is the value of an item, identified by an SKU, being in inventory? This ultimately relies on three components:

- the probability that an item will sell in the near-future,
- the value of that sale, and
- the cost of keeping an item in stock.

The ideal formula would be something like: \[value = P(sale) * profit\_of\_sale - cost\_of\_warehousing.\] The decision of which items to no longer stock can then be made by continuing to warehouse only those with a positive value or, alternatively, by dropping all items in a bottom portion of the stock, ranked by value.

The cost and value components are a mystery. We don’t have access to the cost of warehousing, so we assume that it is equal for every item, and ignore it in the value calculations. We don’t have access to the *profit* involved in a sale, but we can substitute it with one of the three price variables. This assumes a similar profit margin for each item, but it is the best we can do with so many unknowns.

Because the historical data is taken over a six-month period, we are forced to use this interval in our calculations. That is, the \(P(sale)\) in the above formula will be calculated as the probability of a sale over a six-month period.
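Once a model produces sale probabilities, the scoring could be sketched like this. This is purely hypothetical: `predictions` and its `p_sale` column are assumed model outputs, `price_reg` stands in for profit per sale, and the warehousing cost is assumed equal across items and so omitted.

```
# Hypothetical ranking of items by expected six-month value.
# predictions is an assumed tibble of model output with a p_sale column;
# price_reg serves as a proxy for the profit of a sale.
predictions %>%
  mutate(value = p_sale * price_reg) %>%
  arrange(desc(value))
```

Items at the bottom of this ranking would be the candidates for removal from inventory.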

We could model the number of sales as a discrete variable or a continuous variable, requiring a categorisation or regression model, respectively. Both options should be explored, with the decision based on what will most accurately define the solution above.

- What currently determines the stock levels of an item?
- What is the `strength_factor` of an item?
- What can we say about the `marketing_type` and its impact on sales?
- What is the relevance to sales of an item being recently released?
- What is the relevance to sales of an item being a new release?

- Many variables are undefined, and it is not clear how we should interpret them:
  - `strength_factor`
  - `release_number` and, consequently, `new_release_flag`
  - the three pricing variables: `price_reg`, `low_user_price` and `low_net_price`

- Six months’ history is very limited, especially given that the company admits that “many of the products only have a single sale in the course of a year”. Moreover, this brief period ignores any seasonal trends. Do these items sell more over the Christmas period and, if so, do the past six months include December? Is the company selling winter jackets or swimwear?
- There is a lack of information about the actual products themselves, likely due to that data being considered commercially sensitive. Even a rough categorisation of product type would be helpful.
- We don’t know how the `item_count` variable, presumably a stock level, is defined. There’s almost certainly some sort of business process behind this decision, even if an informal one. It’s possible that previous sales prompt greater stock numbers, and so we run the risk of modelling sales figures on a proxy of sales figures.

In the setup stage, we’re going to load some essential packages, along with the data, which is saved as `sales.csv`. We’re also going to rename the variables to match the tidyverse style, with lower-case letters and words separated by underscores. Finally, we convert the flag columns to factors.

```
library(tidyverse)
library(purrr)
library(kableExtra) # for modifying table width
library(randomForest)
library(e1071) # for svm models
sales <- "sales.csv" %>%
read.csv %>% # for some reason, readr's read_csv throws many errors
as_tibble %>%
rename(
order = Order,
file_type = File_Type,
SKU_number = SKU_number,
sold_flag = SoldFlag,
sold_count = SoldCount,
marketing_type = MarketingType,
release_number = ReleaseNumber,
new_release_flag = New_Release_Flag,
strength_factor = StrengthFactor,
price_reg = PriceReg,
release_year = ReleaseYear,
item_count = ItemCount,
low_user_price = LowUserPrice,
low_net_price = LowNetPrice
) %>%
mutate_at(
vars(contains("flag")),
as.factor
)
sales %>%
count(file_type) %>%
knitr::kable(format = "html") %>%
kable_styling(full_width = F)
```

| file_type | n |
|---|---|
| Active | 122921 |
| Historical | 75996 |

I’m a big fan of the `visdat` package for looking at data for the first time. It gives variable types along with a visualisation of missing data.

`sales %>% sample_n(10000) %>% arrange(order) %>% visdat::vis_dat()`

The missing data is entirely within the `sold_flag` and `sold_count` variables. We know the data is split into an historical observational period as well as active inventory, so it makes sense that we would only have sales data for historical observations. We confirm this with the `naniar` package:

```
sales %>%
select(file_type, sold_count, sold_flag) %>%
group_by(file_type) %>%
naniar::miss_var_summary() %>%
knitr::kable(format = "html") %>%
kable_styling(full_width = F)
```

| file_type | variable | n_miss | pct_miss | n_miss_cumsum |
|---|---|---|---|---|
| Historical | sold_count | 0 | 0 | 0 |
| Historical | sold_flag | 0 | 0 | 0 |
| Active | sold_count | 122921 | 100 | 122921 |
| Active | sold_flag | 122921 | 100 | 245842 |

Given that we can only model on historical data, we now split the sales data into two data sets: `historical` and `active`. We remove the `sold_flag` and `sold_count` variables from the `active` tibble, since they serve no purpose there.

```
historical <- sales %>%
filter(file_type == "Historical")
active <- sales %>%
select(-one_of("sold_flag", "sold_count")) %>%
filter(file_type == "Active")
```

This has the benefit of making the data unique by SKU in each data set. For example, there are 75996 distinct SKUs in the `historical` data set, which has 75996 rows.
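We can verify this claim directly by comparing the row count with the number of distinct SKUs (a quick sketch using the `historical` tibble defined above):

```
# Each row should correspond to exactly one SKU
historical %>%
  summarise(
    rows = n(),
    distinct_skus = n_distinct(SKU_number)
  )
```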

The poster of the data mentioned that sales are rare:

> It is important to note that we have MANY products in our inventory, and very few of them tend to sell (only about 10% sell each year) and many of the products only have a single sale in the course of a year.

Let’s verify this claim with a histogram:

```
historical %>%
filter(sold_count <= 5) %>%
mutate(sold_count = sold_count %>% as.factor) %>% # makes graph look better
ggplot(aes(x = sold_count)) +
geom_bar(na.rm = TRUE, fill = "#00BFC4")
```

As expected, most of the items did not sell in the observation period, with only 17% of items experiencing a sale. This data imbalance will cause issues in modelling, as we will later see.
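The exact proportion of sellers can be computed from `sold_flag` (a sketch using the `historical` tibble above; recall that `sold_flag` was converted to a factor during setup):

```
# Percentage of historical items with at least one sale
historical %>%
  summarise(pct_sold = 100 * mean(sold_flag == "1"))
```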

Many issues with data quality could be resolved or mitigated with a data dictionary. We can make some assumptions, such as that `item_count` refers to inventory levels. Other variables, like `release_number` and the related `new_release_flag`, are a mystery to us.

Four items have a `sold_count` greater than the corresponding `item_count`. There are multiple possible explanations for this:

- the `item_count` records the inventory at the end of the recording period,
- the stock was replenished throughout the recording period,
- this could be bad data, or
- there is a relationship between the `item_count` variable and the `release_number` variable.

To complicate things even further, 83 products have an `item_count` of 0.

```
historical %>%
filter(sold_count > item_count) %>%
select(SKU_number, sold_count, item_count, release_number) %>%
knitr::kable(format = "html") %>%
kable_styling(full_width = F)
```

| SKU_number | sold_count | item_count | release_number |
|---|---|---|---|
| 2277016 | 2 | 0 | 5 |
| 613864 | 69 | 44 | 0 |
| 1475488 | 12 | 8 | 2 |
| 809190 | 12 | 11 | 5 |

Moreover, the `strength_factor` of each item appears to vary between the `active` stock and the `historical` data sets. While we would expect the price variables to be time-dependent, it seems strange that the `strength_factor`—presumably an immutable property of the item—is not fixed.

To determine this we use the `sqldf` package to join the `historical` and `active` tibbles, finding those that match in `SKU_number` but not in `strength_factor`. While not as fast as `dplyr`’s join verbs, it allows us to join on predicates rather than just columns.

```
strength_factor_mismatches <- sqldf::sqldf(
"
SELECT ACTIVE.SKU_number
,HISTORICAL.strength_factor AS Historicalstrength_factor
,ACTIVE.strength_factor AS Activestrength_factor
FROM historical AS HISTORICAL
INNER JOIN active AS ACTIVE
ON HISTORICAL.SKU_number = ACTIVE.SKU_number
AND HISTORICAL.strength_factor <> ACTIVE.strength_factor
"
)
strength_factor_mismatches %>% nrow
```

`## [1] 65556`
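For comparison, the same mismatches can be found with `dplyr` by joining on `SKU_number` alone and filtering afterwards (a sketch assuming the `historical` and `active` tibbles above):

```
# Join on SKU_number only, then keep rows where strength_factor disagrees
strength_factor_mismatches_dplyr <- historical %>%
  inner_join(active, by = "SKU_number", suffix = c("_historical", "_active")) %>%
  filter(strength_factor_historical != strength_factor_active) %>%
  select(SKU_number, strength_factor_historical, strength_factor_active)

nrow(strength_factor_mismatches_dplyr) # should agree with the sqldf result
```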

The `release_year` variable is particularly strange. According to Kaggle, this data was uploaded in December of 2016, yet there are `active %>% filter(release_year == 2017) %>% nrow` products in inventory with a `release_year` of 2017, and `active %>% filter(release_year == 2018) %>% nrow` with a release year of 2018.

We will now investigate each of the variables and their impact on the target variables. We will focus predominantly on the `sold_flag` for now to simplify our analysis.

We’re going to consider `release_year` as a numerical variable. We could consider it as a factor, but that would make the modelling steps much more difficult, since there are 85 different years.

While we won’t explore this here, it might be easier to interpret the results of any modelling if we were to transform the variable to an age, so that we track how many years since each product was released. This changes the centre of the variable (a translation transformation), so shouldn’t have any impact on model results, but makes for a more practical interpretation.
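While we leave it unexplored, the age transformation could be sketched as follows (using 2016, the year of the Kaggle upload, as an assumed reference point):

```
# Translate release_year into an age in years, relative to 2016
historical %>%
  mutate(age = 2016 - release_year) %>%
  select(release_year, age) %>%
  head()
```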

```
numeric_cols <- c(
"release_year", "price_reg", "low_net_price", "low_user_price",
"item_count", "strength_factor", "release_number"
)
```

The `ggpairs` function from the `GGally` package replicates the `plotmatrix` function that once existed in `ggplot2`. It displays three pieces of information: correlation coefficients in the top-right triangle, scatterplots in the bottom-left triangle, and histograms along the diagonal.

`historical %>% select(all_of(numeric_cols)) %>% GGally::ggpairs()`

There is almost no correlation between any of the numerical columns in this data. Multicollinearity can impact the interpretability of some models such as logistic regressions, since it is difficult to separate the effects of correlated variables.

Curiously, none of the three price metrics correlate either. We will explore these in more detail later.
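The pairwise correlations can also be inspected numerically (a sketch using the `numeric_cols` vector defined above):

```
# Correlation matrix of the numeric columns, rounded for readability
historical %>%
  select(all_of(numeric_cols)) %>%
  cor() %>%
  round(2)
```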

We may wish to see if products released in certain years, perhaps later years, are more likely to sell. However, we need to be cautious about simply plotting sales against year, since we also need to take into account how many items were released in a particular year. Below we plot the number of sales by `release_year` alongside the number of items available from each `release_year`:

```
x_labels <- seq(
min(historical$release_year),
max(historical$release_year),
by = 5
)
historical %>%
ggplot(aes(x = release_year, y = sold_count)) +
geom_bar(stat = "identity", fill = "#00BFC4") +
scale_x_continuous(breaks = x_labels, labels = x_labels)
historical %>%
ggplot(aes(x = release_year)) +
geom_bar(fill = "#00BFC4") +
scale_x_continuous(breaks = x_labels, labels = x_labels)
```

To account for this, we instead consider the percentage of items sold each year. We’ll also limit the graph to show only data with a `release_year` in or after 1990, since there is little sales data before this year.

```
historical %>%
filter(release_year >= 1990) %>%
group_by(release_year) %>%
summarise(percent_sold = sum(sold_count) / sum(item_count)) %>%
filter(!is.nan(percent_sold)) %>% # Remove division by zero errors
ggplot(aes(x = release_year, y = percent_sold)) +
geom_bar(stat = "identity", fill = "#00BFC4") +
scale_x_continuous(breaks = x_labels, labels = x_labels)
```