Marketing Multi-Channel Attribution model with R (part 2: practical issues)

This is the second post about the Marketing Multi-channel Attribution Model with Markov chains (here is the first one). Even though the concept of first-order Markov chains is pretty simple, you can face other issues and challenges when implementing the approach in practice. In this article, we will review some of them. I have tried to organize the article so that you can use it as a framework or as a starting point for creating your own.

The main steps that we will review are the following:

  • splitting paths depending on purchase counts
  • replacing some channels/touchpoints
  • the single-channel/touchpoint case
  • consecutive duplicated channels in the path and higher-order Markov chains
  • paths that haven’t led to a conversion
  • customer journey duration
  • attributing revenue and comparing costs

As usual, we start by simulating a data sample for experiments that includes customer IDs, a date stamp of contact with a marketing channel, the marketing channel, and a conversion mark (0/1).

library(tidyverse)
library(purrrlyr)
library(reshape2)
library(ChannelAttribution)
library(markovchain)
library(visNetwork)
library(expm)
library(stringr)


##### simulating the "real" data #####
set.seed(454)
df_raw <- data.frame(customer_id = paste0('id', sample(c(1:20000), replace = TRUE)),
 date = as.Date(rbeta(80000, 0.7, 10) * 100, origin = "2016-01-01"),
 channel = paste0('channel_', sample(c(0:7), 80000, replace = TRUE, prob = c(0.2, 0.12, 0.03, 0.07, 0.15, 0.25, 0.1, 0.08)))
 ) %>%
 group_by(customer_id) %>%
 mutate(conversion = sample(c(0, 1), n(), prob = c(0.975, 0.025), replace = TRUE)) %>%
 ungroup() %>%
 dmap_at(c(1, 3), as.character) %>%
 arrange(customer_id, date)

df_raw <- df_raw %>%
 mutate(channel = ifelse(channel == 'channel_2', NA, channel))

In addition, I’ve replaced channel_2 with NA values to simulate an unknown channel. The first rows of the simulated sample can be inspected as follows:
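head(df_raw) # peek at the simulated sample: customer_id, date, channel, conversion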

1. Splitting paths depending on purchase counts

It makes sense to attribute the paths to the first purchase and to the n-th purchase separately.
We assume that each subsequent purchase moves the customer deeper into her lifecycle with the company. We can expect a huge difference between the first customer journey and, for instance, the tenth. Therefore, marketing channels work differently for customers at different phases of their lifecycles. I recommend splitting paths by purchase and computing the model for first-time buyers and n-time buyers separately. For rational splitting, the concept of Life-Cycle Grids can be very helpful.
For instance, suppose the customer’s path over her lifetime with the company looks like this:
C1 -> C4 -> C2 -> C3 -> conversion (first purchase) -> C2 -> C3 -> conversion (second purchase) -> C3 -> conversion (third purchase) -> C5.
We can split it like this:
a) C1 -> C4 -> C2 -> C3 -> conversion (first purchase),
b) C2 -> C3 -> conversion (second purchase),
c) C3 -> conversion (third purchase),
d) C5.
After this, we compute attribution models for paths a), b), and c) separately. Path d) can be used for the generic probabilistic model (see point #5).
We can add a serial number to each path by using, for example, the lagged cumulative sum of the binary conversion marks with the following simple code:
##### splitting paths #####
df_paths <- df_raw %>%
 group_by(customer_id) %>%
 mutate(path_no = ifelse(is.na(lag(cumsum(conversion))), 0, lag(cumsum(conversion))) + 1) %>%
 ungroup()
You can see now that, for example, customer id18055 has 4 paths:
1) channel_1 -> conversion #1
2) channel_4 -> channel_0 -> channel_6 -> conversion #2
3) channel_4 -> conversion #3
4) channel_6 -> NA -> channel_5
Using the Life-Cycle Grids concept (or another one), we can split the data set and compute attribution separately for different customer segments. For simplicity, we will compute attribution for first-time purchasers only, with the following code:
df_paths_1 <- df_paths %>%
 filter(path_no == 1) %>%
 select(-path_no)

2. Replacing/removing touchpoints

It makes sense to replace or remove some channels when:
1) the marketing channel is unknown (NA value) due to a variety of reasons,
2) there is a specific channel in the path that we don’t want to attribute, such as the Direct channel.
There are two main methods for handling these cases: remove the NA/Direct channels, or replace them with the previous channel in the path. We can also combine both methods: remove NAs and replace the Direct channel.
Note: with first-order Markov chains, replacing produces the same outcomes as removing, because consecutive duplicated touchpoints don’t affect the results (see point #4).
The following are examples of typical transformations with the combined approach:
1) the initial path “C1 -> C2 -> Direct -> C3 -> conversion” is transformed into “C1 -> C2 -> C2 -> C3 -> conversion”
2) the initial path “Direct -> C3 -> C1 -> C2 -> conversion” is transformed into “C3 -> C1 -> C2 -> conversion”
3) the initial path “C1 -> C2 -> NA -> C3 -> conversion” is transformed into “C1 -> C2 -> C3 -> conversion”
4) the initial path “C3 -> NA -> Direct -> C2 -> conversion” is transformed into “C3 -> C3 -> C2 -> conversion”.
Let’s assume that we want to replace channel_6 (Direct) with the previous non-channel_6 touchpoint and skip unknown (NA) touchpoints. Note the following assumptions:
1) we will remove “NA -> conversion” paths. My point is that it doesn’t make sense to attribute a conversion to an unknown channel, so we won’t use this information;
2) we will remove Direct (channel_6) when it is first in the path, because we don’t have a previous channel to replace it with. Accordingly, for a “Direct -> conversion” path we will remove the conversion as well; the idea is the same as in the NA case.
On the other hand, you can adapt the code to your own point of view. We will use the following code:
##### replace some channels #####
df_path_1_clean <- df_paths_1 %>%
 # removing NAs
 filter(!is.na(channel)) %>%
 
 # adding order of channels in the path
 group_by(customer_id) %>%
 mutate(ord = c(1:n()),
 is_non_direct = ifelse(channel == 'channel_6', 0, 1),
 is_non_direct_cum = cumsum(is_non_direct)) %>%
 
 # removing Direct (channel_6) when it is the first in the path
 filter(is_non_direct_cum != 0) %>%
 
 # replacing Direct (channel_6) with the previous touch point
 mutate(channel = ifelse(channel == 'channel_6', channel[which(channel != 'channel_6')][is_non_direct_cum], channel)) %>%
 
 ungroup() %>%
 select(-ord, -is_non_direct, -is_non_direct_cum)

3. One- and multi-channel paths issue

It makes sense to split single-channel and multi-channel paths.
As I mentioned in the previous article, when using the Removal Effect, you should calculate the weighted importance of each channel/touchpoint, because the sum of the Removal Effects doesn’t equal 1.
If a path consists of a single channel, the Removal Effect and importance of that channel for that exact path is 1. However, weighting against other, multi-channel paths will decrease the importance of single-channel occurrences. That means a channel that occurs in single-channel paths will usually be underestimated if attributed together with multi-channel paths.
There is also pretty straightforward logic behind splitting: for single-channel paths, we know for sure which channel brought the conversion, and we don’t need to distribute its value across other channels.
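Here is a toy illustration of this dilution effect with made-up Removal Effects (the numbers are hypothetical, not taken from our data): channel A occurs only in single-channel paths, so its Removal Effect is 1.

# made-up Removal Effects: A comes from single-channel paths (RE = 1),
# B and C come from multi-channel paths
re <- c(A = 1.0, B = 0.6, C = 0.4)
re / sum(re) # weighted importance: A's share shrinks from 1 to 0.5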
We can do this with a simple algorithm:
  1. Split the data into paths that contain a single unique channel and paths with more than one
  2. Calculate total conversions for the single-channel paths and compute the Markov model for the multi-channel paths
  3. Summarize the results for each channel.
Let’s check the difference between when we don’t split the data and when we do split it with the following code:
##### one- and multi-channel paths #####
df_path_1_clean <- df_path_1_clean %>%
 group_by(customer_id) %>%
 mutate(uniq_channel_tag = ifelse(length(unique(channel)) == 1, TRUE, FALSE)) %>%
 ungroup()

df_path_1_clean_uniq <- df_path_1_clean %>%
 filter(uniq_channel_tag == TRUE) %>%
 select(-uniq_channel_tag)

df_path_1_clean_multi <- df_path_1_clean %>%
 filter(uniq_channel_tag == FALSE) %>%
 select(-uniq_channel_tag)

### experiment ###
# attribution model for all paths
df_all_paths <- df_path_1_clean %>%
 group_by(customer_id) %>%
 summarise(path = paste(channel, collapse = ' > '),
 conversion = sum(conversion)) %>%
 ungroup() %>%
 filter(conversion == 1)

mod_attrib <- markov_model(df_all_paths,
 var_path = 'path',
 var_conv = 'conversion',
 out_more = TRUE)
mod_attrib$removal_effects
mod_attrib$result
d_all <- data.frame(mod_attrib$result)

# attribution model for split multi- and unique-channel paths
df_multi_paths <- df_path_1_clean_multi %>%
 group_by(customer_id) %>%
 summarise(path = paste(channel, collapse = ' > '),
 conversion = sum(conversion)) %>%
 ungroup() %>%
 filter(conversion == 1)

mod_attrib_alt <- markov_model(df_multi_paths,
 var_path = 'path',
 var_conv = 'conversion',
 out_more = TRUE)
mod_attrib_alt$removal_effects
mod_attrib_alt$result

# adding unique paths
df_uniq_paths <- df_path_1_clean_uniq %>%
 filter(conversion == 1) %>%
 group_by(channel) %>%
 summarise(conversions = sum(conversion)) %>%
 ungroup()

d_multi <- data.frame(mod_attrib_alt$result)

d_split <- full_join(d_multi, df_uniq_paths, by = c('channel_name' = 'channel')) %>%
 mutate(result = total_conversions + conversions)

sum(d_all$total_conversions)
sum(d_split$result)
As you can see, the total sums are equal, but the attribution differs for the reason I mentioned earlier.

4. Higher-order Markov chains and consecutive duplicated channels in the path

For first-order Markov chains, it doesn’t matter whether we skip consecutive duplicates or not.
The ChannelAttribution package allows us to change the order of the Markov chains via the “order” parameter. This means we can compute transition probabilities based on the previous two, three, or more channels, for example as sketched below.
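For instance, a second-order model can be computed on the same multi-channel paths with a quick sketch like this (mod_attrib_alt above used the default order = 1):

# sketch: a second-order Markov model on the same data
mod_attrib_2 <- markov_model(df_multi_paths,
 var_path = 'path',
 var_conv = 'conversion',
 order = 2, # transition probabilities based on pairs of channels
 out_more = TRUE)
mod_attrib_2$result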
When using first-order Markov chains, a subsequence of the same channel in a path (duplicates) can be reduced to one occurrence (for example, C2 in the path C1 → C2 → C2 → C2 → C3 can be reduced to C1 → C2 → C3). Mathematically, it doesn’t matter how many times C2 loops on itself in the transition matrix; the chain will end up in state C3 eventually. Therefore, we obtain different transition matrices but the same Removal Effect for channels, with or without consecutive duplicates.
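A minimal sketch of this point with toy transition matrices (not our data): a self-loop on C2 changes the transition matrix but not where the chain is finally absorbed.

library(expm)

# C1 -> C2 -> C2 -> ... -> C3: C2 loops on itself with probability 0.5
P_loop <- matrix(c(0, 1, 0,
 0, 0.5, 0.5,
 0, 0, 1),
 nrow = 3, byrow = TRUE,
 dimnames = list(c('C1', 'C2', 'C3'), c('C1', 'C2', 'C3')))

# the same path with the duplicates collapsed: C1 -> C2 -> C3
P_flat <- matrix(c(0, 1, 0,
 0, 0, 1,
 0, 0, 1),
 nrow = 3, byrow = TRUE,
 dimnames = list(c('C1', 'C2', 'C3'), c('C1', 'C2', 'C3')))

c(1, 0, 0) %*% (P_loop %^% 100000) # absorbed in C3 with probability 1
c(1, 0, 0) %*% (P_flat %^% 100000) # the same result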
However, for higher-order Markov chains, skipping duplicates will affect the results. This is because, for example, a second-order Markov graph computes probabilities based on two consecutive channels as one segment of the path. Therefore, removing consecutive duplicated channels can have a huge effect for higher-order Markov chains.
In order to check the effect of skipping duplicates in the first-order Markov chain, we will use my script for “manual” calculation, because the package skips duplicates automatically. The following is the code for computing the attribution “manually”:
##### Higher order of Markov chains and consequent duplicated channels in the path #####

# computing transition matrix - 'manual' way
df_multi_paths_m <- df_multi_paths %>%
 mutate(path = paste0('(start) > ', path, ' > (conversion)'))
m <- max(str_count(df_multi_paths_m$path, '>')) + 1 # maximum path length

df_multi_paths_cols <- colsplit(string = df_multi_paths_m$path, pattern = ' > ', names = c(1:m))
colnames(df_multi_paths_cols) <- paste0('ord_', c(1:m))
df_multi_paths_cols[df_multi_paths_cols == ''] <- NA

df_res <- vector('list', ncol(df_multi_paths_cols) - 1)

for (i in c(1:(ncol(df_multi_paths_cols) - 1))) {
 
 df_cache <- df_multi_paths_cols %>%
 select(num_range("ord_", c(i, i+1))) %>%
 na.omit() %>%
 group_by_(.dots = c(paste0("ord_", c(i, i+1)))) %>%
 summarise(n = n()) %>%
 ungroup()
 
 colnames(df_cache)[c(1, 2)] <- c('channel_from', 'channel_to')
 df_res[[i]] <- df_cache
}

df_res <- do.call('rbind', df_res)

df_res_tot <- df_res %>%
 group_by(channel_from, channel_to) %>%
 summarise(n = sum(n)) %>%
 ungroup() %>%
 group_by(channel_from) %>%
 mutate(tot_n = sum(n),
 perc = n / tot_n) %>%
 ungroup()

df_dummy <- data.frame(channel_from = c('(start)', '(conversion)', '(null)'),
 channel_to = c('(start)', '(conversion)', '(null)'),
 n = c(0, 0, 0),
 tot_n = c(0, 0, 0),
 perc = c(0, 1, 1))

df_res_tot <- rbind(df_res_tot, df_dummy)

# comparing transition matrices
trans_matrix_prob_m <- dcast(df_res_tot, channel_from ~ channel_to, value.var = 'perc', fun.aggregate = sum)
trans_matrix_prob <- data.frame(mod_attrib_alt$transition_matrix)
trans_matrix_prob <- dcast(trans_matrix_prob, channel_from ~ channel_to, value.var = 'transition_probability')

# computing attribution - 'manual' way
channels_list <- df_path_1_clean_multi %>%
 filter(conversion == 1) %>%
 distinct(channel)
channels_list <- c(channels_list$channel)

df_res_ini <- df_res_tot %>% select(channel_from, channel_to)
df_attrib <- vector('list', length(channels_list))

for (i in c(1:length(channels_list))) {
 
 channel <- channels_list[i]
 
 df_res1 <- df_res %>%
 mutate(channel_from = ifelse(channel_from == channel, NA, channel_from),
 channel_to = ifelse(channel_to == channel, '(null)', channel_to)) %>%
 na.omit()
 
 df_res_tot1 <- df_res1 %>%
 group_by(channel_from, channel_to) %>%
 summarise(n = sum(n)) %>%
 ungroup() %>%
 
 group_by(channel_from) %>%
 mutate(tot_n = sum(n),
 perc = n / tot_n) %>%
 ungroup()
 
 df_res_tot1 <- rbind(df_res_tot1, df_dummy) # adding (start), (conversion) and (null) states
 
 df_res_tot1 <- left_join(df_res_ini, df_res_tot1, by = c('channel_from', 'channel_to'))
 df_res_tot1[is.na(df_res_tot1)] <- 0
 
 df_trans1 <- dcast(df_res_tot1, channel_from ~ channel_to, value.var = 'perc', fun.aggregate = sum)
 
 trans_matrix_1 <- df_trans1
 rownames(trans_matrix_1) <- trans_matrix_1$channel_from
 trans_matrix_1 <- as.matrix(trans_matrix_1[, -1])
 
 inist_n1 <- dcast(df_res_tot1, channel_from ~ channel_to, value.var = 'n', fun.aggregate = sum)
 rownames(inist_n1) <- inist_n1$channel_from
 inist_n1 <- as.matrix(inist_n1[, -1])
 inist_n1[is.na(inist_n1)] <- 0
 inist_n1 <- inist_n1['(start)', ]
 
 res_num1 <- inist_n1 %*% (trans_matrix_1 %^% 100000)
 
 df_cache <- data.frame(channel_name = channel,
 conversions = as.numeric(res_num1[1, 1]))
 
 df_attrib[[i]] <- df_cache
}

df_attrib <- do.call('rbind', df_attrib)

# computing removal effect and results
tot_conv <- sum(df_multi_paths_m$conversion)

df_attrib <- df_attrib %>%
 mutate(tot_conversions = sum(df_multi_paths_m$conversion),
 impact = (tot_conversions - conversions) / tot_conversions,
 tot_impact = sum(impact),
 weighted_impact = impact / tot_impact,
 attrib_model_conversions = round(tot_conversions * weighted_impact)
 ) %>%
 select(channel_name, attrib_model_conversions)
As you can see, even though we obtained different transition matrices (trans_matrix_prob_m vs. trans_matrix_prob), the Removal Effects and attribution results are the same (df_attrib vs. mod_attrib_alt$result) both for the package (which skipped consecutive duplicated channels) and for the “manual” calculations (which kept the duplicates).
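A quick way to see the results side by side (a sketch; the package may return channel_name as a factor, hence the as.character() conversions before joining):

# comparing 'manual' attribution with the package's output
df_attrib %>%
 mutate(channel_name = as.character(channel_name)) %>%
 full_join(mod_attrib_alt$result %>%
 mutate(channel_name = as.character(channel_name)),
 by = 'channel_name')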

5. Dealing with paths that haven’t led to a conversion

If you can track both the user journeys that ended in a conversion and those that didn’t, doing so will help you obtain extra value from advanced methods.
We only need paths with conversions for attributing marketing channels. However, if you collect both kinds of customer journeys, you gain some advanced opportunities: a generic probabilistic model with both positive (conversion) and negative (null) outcomes. Such a model allows you to look at the complete “picture” of the business, not just at the valuable transitions.
Additionally, the complete probabilistic model can be used as a predictive model:
  1. At the least, you can model:
  • customers’ flow through marketing channels,
  • a projection of how many customers will touch an exact marketing channel after N touches, or how many conversions you can obtain if you attract traffic (N visits) from this or that channel,
  • which channels have a higher probability of user churn, and thus whether or not we should change the acquisition through them.
  2. As an advanced approach, you can use transition probabilities together with other features (visit recency, frequency, durations, demographics, etc.) to predict conversions with machine learning algorithms.
These are rather complex questions that cover such topics as channel capacity (when a linear increase in acquisition costs doesn’t lead to a linear increase in conversions), marketing budget allocation, a possible reduction in customers’ lifetime value from attracting less loyal customers or one-time buyers, etc. These aspects are not the topic of this article. I just want you to pay attention to the fact that the real world is more complex than the Markov chains model. Nonetheless, used with the right amount of attention and an understanding of the business domain, the approach can bring good results.
Ok, let’s compute the complete probabilistic model for the first paths we have with the following code:
##### Generic Probabilistic Model #####
df_all_paths_compl <- df_path_1_clean %>%
 group_by(customer_id) %>%
 summarise(path = paste(channel, collapse = ' > '),
 conversion = sum(conversion)) %>%
 ungroup() %>%
 mutate(null_conversion = ifelse(conversion == 1, 0, 1))

mod_attrib_complete <- markov_model(
 df_all_paths_compl,
 var_path = 'path',
 var_conv = 'conversion',
 var_null = 'null_conversion',
 out_more = TRUE
)

trans_matrix_prob <- mod_attrib_complete$transition_matrix %>%
 dmap_at(c(1, 2), as.character)

##### viz #####
edges <-
 data.frame(
 from = trans_matrix_prob$channel_from,
 to = trans_matrix_prob$channel_to,
 label = round(trans_matrix_prob$transition_probability, 2),
 font.size = trans_matrix_prob$transition_probability * 100,
 width = trans_matrix_prob$transition_probability * 15,
 shadow = TRUE,
 arrows = "to",
 color = list(color = "#95cbee", highlight = "red")
 )

nodes <-
 data_frame(id = c(
 c(trans_matrix_prob$channel_from),
 c(trans_matrix_prob$channel_to)
 )) %>%
 distinct(id) %>%
 arrange(id) %>%
 mutate(
 label = id,
 color = ifelse(
 label %in% c('(start)', '(conversion)'),
 '#4ab04a',
 ifelse(label == '(null)', '#ce472e', '#ffd73e')
 ),
 shadow = TRUE,
 shape = "box"
 )

visNetwork(nodes,
 edges,
 height = "2000px",
 width = "100%",
 main = "Generic Probabilistic model's Transition Matrix") %>%
 visIgraphLayout(randomSeed = 123) %>%
 visNodes(size = 5) %>%
 visOptions(highlightNearest = TRUE)
This time I used the really cool R package visNetwork for the Markov graph visualization. It looks a bit messy with nine nodes, but it is interactive: you can move nodes, highlight them as well as the edges, and do other transformations thanks to the flexibility this package provides.
The following is an example of how you can model customers’ transitions through marketing channels and project the number of conversions. Let’s assume that we are going to attract 1,000 visits from channel_5 and we want to model how many conversions we will obtain, or see which channels customers will contact within several steps. The script manipulates the matrix of transition probabilities associated with the Markov chain.
Once we have the transition matrix computed, we can project which state (channel) a customer contact will be in after, for example, 5 steps (the 5th power of the matrix), or how many conversions we can obtain (we use the 100,000th power to make sure that all customers have passed through the transition matrix).
##### modeling states and conversions #####
# transition matrix preprocessing
trans_matrix_complete <- mod_attrib_complete$transition_matrix
trans_matrix_complete <- rbind(trans_matrix_complete,
 df_dummy %>%
 mutate(transition_probability = perc) %>%
 select(channel_from, channel_to, transition_probability))
trans_matrix_complete$channel_to <- factor(trans_matrix_complete$channel_to, levels = c(levels(trans_matrix_complete$channel_from)))
trans_matrix_complete <- dcast(trans_matrix_complete, channel_from ~ channel_to, value.var = 'transition_probability')
trans_matrix_complete[is.na(trans_matrix_complete)] <- 0
rownames(trans_matrix_complete) <- trans_matrix_complete$channel_from
trans_matrix_complete <- as.matrix(trans_matrix_complete[, -1])

# creating empty matrix for modeling
model_mtrx <- matrix(data = 0,
 nrow = nrow(trans_matrix_complete), ncol = 1,
 dimnames = list(c(rownames(trans_matrix_complete)), '(start)'))
# adding modeling number of visits
model_mtrx['channel_5', ] <- 1000

c(model_mtrx) %*% (trans_matrix_complete %^% 5) # after 5 steps
c(model_mtrx) %*% (trans_matrix_complete %^% 100000) # after 100000 steps

6. Customer journey duration

Usually, we compute the model regularly, for the company’s standard reporting period.
When speaking about the attribution model, it is quite simple to deal with durations. Once we have a conversion, we extract the retrospective contacts with marketing channels. We can limit the retrospective period to, for example, 30 or 90 days, based on our knowledge of the typical customer lifecycle duration, and omit contacts that happened earlier (or even compute the model without limitations). For defining the retrospective period, we can look at the distribution of durations from the first touch to the conversion and choose a period that covers, e.g., 95% of occurrences.
In R we can obtain these analytics with the following code:
##### Customer journey duration #####
# computing time lapses from the first contact to conversion/last contact
df_multi_paths_tl <- df_path_1_clean_multi %>%
 group_by(customer_id) %>%
 summarise(path = paste(channel, collapse = ' > '),
 first_touch_date = min(date),
 last_touch_date = max(date),
 tot_time_lapse = round(as.numeric(last_touch_date - first_touch_date)),
 conversion = sum(conversion)) %>%
 ungroup()

# distribution plot
ggplot(df_multi_paths_tl %>% filter(conversion == 1), aes(x = tot_time_lapse)) +
 theme_minimal() +
 geom_histogram(fill = '#4e79a7', binwidth = 1)

# cumulative distribution plot
ggplot(df_multi_paths_tl %>% filter(conversion == 1), aes(x = tot_time_lapse)) +
 theme_minimal() +
 stat_ecdf(geom = 'step', color = '#4e79a7', size = 2, alpha = 0.7) +
 geom_hline(yintercept = 0.95, color = '#e15759', size = 1.5) +
 geom_vline(xintercept = 23, color = '#e15759', size = 1.5, linetype = 2)
You can see that a 23-day period covers 95% of the paths.
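Instead of reading the threshold off the plot, we can compute the quantile directly. A minimal sketch using df_multi_paths_tl from above:

# 95% quantile of the time lapse from the first touch to conversion
df_multi_paths_tl %>%
 filter(conversion == 1) %>%
 summarise(time_lapse_95 = quantile(tot_time_lapse, probs = 0.95))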
It is much more complex to deal with duration for the generic probabilistic model. We can use the same approach for paths that ended in a conversion as of the reporting date but, in addition, we need to manage the paths that did not. We can’t be sure which of them should be accepted as fruitless and which have a higher chance of bringing us a conversion in the next reporting period.
Let’s visualize this issue for better understanding with the following code, assuming a reporting date of January 10, 2016:
### for generic probabilistic model ###
df_multi_paths_tl_1 <- melt(df_multi_paths_tl[c(1:50), ] %>% select(customer_id, first_touch_date, last_touch_date, conversion),
 id.vars = c('customer_id', 'conversion'),
 value.name = 'touch_date') %>%
 arrange(customer_id)
rep_date <- as.Date('2016-01-10', format = '%Y-%m-%d')

ggplot(df_multi_paths_tl_1, aes(x = as.factor(customer_id), y = touch_date, color = factor(conversion), group = customer_id)) +
 theme_minimal() +
 coord_flip() +
 geom_point(size = 2) +
 geom_line(size = 0.5, color = 'darkgrey') +
 geom_hline(yintercept = as.numeric(rep_date), color = '#e15759', size = 2) +
 geom_rect(xmin = -Inf, xmax = Inf, ymin = as.numeric(rep_date), ymax = Inf, alpha = 0.01, color = 'white', fill = 'white') +
 theme(legend.position = 'bottom',
 panel.border = element_blank(),
 panel.grid.major = element_blank(),
 panel.grid.minor = element_blank(),
 axis.ticks.x = element_blank(),
 axis.ticks.y = element_blank()) +
 guides(colour = guide_legend(override.aes = list(size = 5)))
You can see that id10072 finished its path with a conversion, so we can add its retrospective touchpoints into the model. On the other hand, the paths of id10010, id1001, and id10005 are fruitless as of the reporting date, but: customer id10010 will purchase on January 19, 2016; customer id1001 will contact a marketing channel on January 15, 2016 but won’t purchase; and customer id10005 won’t have any new contacts with marketing channels, so its path really is fruitless.
Therefore, our task is to compute the generic model while trying to identify which paths are complete as of the reporting date, whether they ended in a conversion or not. For example, we should use the paths of customers id10072 and id10005 for computing the model, because we don’t expect new contacts or first purchases from them anymore.
We need to develop some criteria for identifying whether an exact path has a higher chance of ending in a conversion or not. Again, we can analyze the statistics that characterize successful paths and assume that if a path violates the common statistics of success, then it is fruitless with high probability.
There are at least two values I can suggest starting with:
1) the time lapse from the first contact,
2) the time lapse between the conversion date and the previous contact.
These values can be used as a combination of rules. From the above analysis, we know that 95% of purchases were made within 23 days, and we can compute the time lapses between the conversion date and the previous contact with the following code:
df_multi_paths_tl_2 <- df_path_1_clean_multi %>%
 group_by(customer_id) %>%
 mutate(prev_touch_date = lag(date)) %>%
 ungroup() %>%
 filter(conversion == 1) %>%
 mutate(prev_time_lapse = round(as.numeric(date - prev_touch_date)))
 
# distribution
ggplot(df_multi_paths_tl_2, aes(x = prev_time_lapse)) +
 theme_minimal() +
 geom_histogram(fill = '#4e79a7', binwidth = 1)

# cumulative distribution
ggplot(df_multi_paths_tl_2, aes(x = prev_time_lapse)) +
 theme_minimal() +
 stat_ecdf(geom = 'step', color = '#4e79a7', size = 2, alpha = 0.7) +
 geom_hline(yintercept = 0.95, color = '#e15759', size = 1.5) +
 geom_vline(xintercept = 12, color = '#e15759', size = 1.5, linetype = 2)
We can see that 95% of customers made a purchase within 12 days of the previous contact. Therefore, we can assume that if a customer’s first contact with a marketing channel was more than 23 days ago and/or the customer hasn’t contacted any marketing channel in the last 12 days, then the path is fruitless. Hint: for more accurate assumptions, these rules can be computed separately for each marketing channel.
We can extract the data for the generic probabilistic model, keeping paths that either ended in a conversion or satisfy both rules, with the following code:
# extracting data for generic model
df_multi_paths_tl_3 <- df_path_1_clean_multi %>%
 group_by(customer_id) %>%
 mutate(prev_time_lapse = round(as.numeric(date - lag(date)))) %>%
 summarise(path = paste(channel, collapse = ' > '),
 tot_time_lapse = round(as.numeric(max(date) - min(date))),
 prev_touch_tl = prev_time_lapse[max(which(date == max(date)))], # take the last touch (dates can tie)
 conversion = sum(conversion)) %>%
 ungroup() %>%
 mutate(is_fruitless = ifelse(conversion == 0 & tot_time_lapse > 23 & prev_touch_tl > 12, TRUE, FALSE)) %>% # the 23-day and 12-day rules from above
 filter(conversion == 1 | is_fruitless == TRUE)

7. Attributing revenue and comparing costs

Once we have the marketing channels attributed, we can compute their effectiveness. The first way is to compare the cost per action, or CPA (cost per conversion in our case), across channels. For this, we divide the total cost spent on the exact channel by the modeled number of conversions and compare the results.
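For instance, here is a minimal sketch with made-up cost figures (df_costs and its numbers are hypothetical; d_split comes from point #3):

# hypothetical spend per channel (made-up numbers for illustration)
df_costs <- data.frame(channel_name = c('channel_0', 'channel_1', 'channel_3',
 'channel_4', 'channel_5', 'channel_7'),
 cost = c(5000, 3000, 1000, 4000, 6000, 2000),
 stringsAsFactors = FALSE)

d_split %>%
 mutate(channel_name = as.character(channel_name)) %>%
 left_join(df_costs, by = 'channel_name') %>%
 mutate(CPA = round(cost / result, 2)) %>%
 arrange(CPA)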
Additionally, the ChannelAttribution package allows us to distribute revenue across channels. For this, we add the “var_value” parameter, pointing to a column of revenues, to the markov_model() function, as sketched below. This makes it possible to compare the channels’ gross margins.
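A sketch of the call, assuming a hypothetical df_paths_rev data frame with a revenue column aggregated per path:

# attributing revenue (df_paths_rev and its 'revenue' column are assumed)
mod_attrib_rev <- markov_model(df_paths_rev,
 var_path = 'path',
 var_conv = 'conversion',
 var_value = 'revenue',
 out_more = TRUE)
mod_attrib_rev$result # per-channel conversions and attributed revenue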

Conclusions

As you can see, the Markov chains approach provides an effective way to attribute marketing channels. There are quite a lot of other uses of the approach:
  • the Markov model can be effectively used for LTV prediction,
  • we can include additional types of interactions in the model (for example, banner impressions, TV ads, calls to a call center, and so on) and evaluate their influence,
  • we can develop a “likely to buy” predictive model using ML algorithms,
and so on.
I’m going to share another method for attributing marketing channels that is a mix of probabilistic and heuristic models based on the sales funnel. It is very interesting, don’t miss it!