## Towards optimal personalization: synthesisizing machine learning and operations research

Last post I talked about how data scientists probably ought to spend some time talking about optimization (but not too much time - I need topics for my blog posts!). While I provided a basic optimization example in that post, that may have not been so interesting, and there definitely wasn't any machine learning involved.

Right now, I think that the most exciting industrial applications of optimization are those that synthesize machine learning and optimization in order to obtain optimal personalization at scale.

Here, I'll talk about a more concrete use case of this synthesis that you might see at a company.

## All the ML and nowhere to go¶

Let's say you start working at a Software-As-A-Service (SAAS) company, and you end up in a meeting with the Marketing team. Everybody's talking about churn. Marketing has been trying all sorts of things - they've sent coupons, they've called customers, they've sent emails, and everything else in order to decrease churn. Some things work, some things are expensive, and there are lots of questions. Nobody knows SQL, so you offer to look at the data.

It turns out that it seems like there might be some clear differences in customers who eventually churn and customers who do not. You offer to build an algorithm to predict customer churn broken out by intervention medium (e.g. email, phone call, no intervention, etc...).

You get the greenlight to hack away. Of course, this takes much longer than you or Marketing expects (because pretty much all machine learning does), but in the end you're left with multiple classification models that are well-tuned with a bunch of features.

*You're in a great place. You actually built machine learning models that work.*

But what now?

You can go the common route. You write a long script that will run the churn model every so often and populate a database with the results. You tell Marketing and everybody else that this information is now available, and you hope that they will use it.

And they might.

Or, those numbers will sit there.

Or, Marketing will randomly target the top X% of people most likely to churn with their expensive intervention (say, phone call) and email the rest.

None of this is optimal.

## Optimization to the rescue¶

When there's lots of decisions to make and there's a clear goal, then optimization is a great friend to have. The goal here is to prevent churn. We will have some constraints (mainly money). Let's make up some data and walk through how to solve this in python.

## Defining (making up) the problem¶

We'll assume that we have 4 different types of churn prevention messages at 4 different prices:

Media | Price |
---|---|

0.25 | |

Push notification | 0.30 |

Text message | 0.85 |

Phone Call | 5.00 |

Also, Marketing has a monthly budget of $2,000 to spend on messaging 2,000 customers. Let's optimize! We'll start by making up some fake outputs of your supposed machine learning model.

*Author's note: There are probably more dimensions to this problem that you would care about in reality. For example, a customer lifetime value (LTV) model would help in order to decide which customers are more valuable post churn prevention.*

```
import numpy as np
import pandas as pd
np.random.seed(2016)
```

```
n_customers = 2000
customer_ids = np.random.randint(1000000, 2000000, size=n_customers)
# Define retention success probabilities.
email_prob = np.random.random(size=n_customers) * 0.2
push_prob = np.random.random(size=n_customers) * 0.3
text_prob = np.random.random(size=n_customers) * 0.4
phone_prob = np.random.random(size=n_customers) * 0.9
prob_df = pd.DataFrame({'email': email_prob,
'push': push_prob,
'text': text_prob,
'phone': phone_prob},
index=customer_ids)
# assure column order
message_types = ['email', 'push', 'text', 'phone']
prob_df = prob_df[message_types]
prob_df.head()
```

## Setting up the math: Objective + Constraints¶

As with all optimization problems, you should start with what you want to optimize (this is your objective function, cost function, loss function, or whatever else you want to call it). Here, we want to minimize churn, or, conversely, maximize retention. Let us define a matrix $M$ consisting of element variables $m_{ij} \in \{0, 1\}$ which indicate whether or not customer $i$ receives a message from intervention medium $j$.

We can then use our retention probabilities as coeffecients to $M$ in order to create our objective function to maximize. We define $p_{ij}$ as the probability that customer $i$ is retained by intervention medium $j$. Our objective function is thus

$$R = \sum\limits_{i,j}p_{ij}m_{ij}$$Let's reflect for a second on how cool this is. The *outputs* of a fancy machine learning model are being fed in as *inputs* to this new optimization problem.

Okay enough reflection - we have our function to maximize. The next question is, say it with me class, "What are the constraints?"

We only have two, and they're relatively simple:

Each customer can only get one message. $$\forall i \sum\limits_{j}m_{ij} = 1$$

There's a limited budget of 2K. We'll take $c_{j}$ to be the price of sending a message of type $j$. $$\sum\limits_{ij}m_{ij} c_{j} \leq 2,000$$

## Math $\Rightarrow$ Code¶

For ease of translation, I will use the same variable names in the code as I used in the above math. Note: some people might get angry if you do this in production code.

We start with defining index lists and the matrices for the constants in the problem.

```
# Indices
customer_index = range(prob_df.shape[0]) # i
message_index = range(len(message_types)) # j
tier_index = range(4) # k
```

```
# Matrices of constants
# Retention probabilities
p = prob_df.as_matrix()
p.round(decimals=4)
# Pricing
c = [0.25, 0.30, 0.85, 5.00]
budget = 2000
```

With the constants defined, we can now build the variables to our optimization problem. Just like how we wrote out the math, we'll define our variables and objective function before adding the constraints. Similarly to my last post, I'll use the Python library pulp.

```
from pulp import *
# Create variables
prob = LpProblem('Churn Prevention Problem', LpMaximize)
m = {}
a = {}
t = {}
for i in customer_index:
for j in message_index:
m[i, j] = LpVariable('m_{}_{}'.format(i, j),
lowBound=0,
upBound=1,
cat='Binary')
# Add the objective functions
prob += (lpSum([m[i, j] * p[i, j]
for i in customer_index
for j in message_index]),
'Total retention probability')
```

```
# Simple constraint
for i in customer_index:
prob += (lpSum([m[i, j] for j in message_index]) == 1,
'One message for cust {}'.format(i))
```

```
# Max budget constraint
prob += (lpSum([m[i, j] * c[j]
for i in customer_index
for j in message_index]) <= budget,
'Budget')
```

Our problem is now fully defined, and we'll use GLPK in order to solve it. I've also added a GLPK option to log the output so that we can view it in the Jupyter notebook.

```
prob.solve(GLPK_CMD(msg=3, options=['--log', 'glpk.log']))
```

```
with open('glpk.log', 'r') as f:
for line in f:
print(line.rstrip('\n'))
```

```
print("Status: {}".format(LpStatus[prob.status]))
```

Voilá! We now have a fully personalized strategy for churn prevention and can now explore the solution to our hearts' content.

```
for j, m_type in enumerate(message_types):
print('Total {} messages: {}'.format(
m_type, np.sum([m[i, j].varValue for i in customer_index])))
```

```
total_cost = np.sum([m[i, j].varValue * c[j]
for i in customer_index
for j in message_index])
print('Total cost: ${:.2f}'.format(np.round(total_cost, 2)))
```

```
obj = np.sum([m[i, j].varValue * p[i, j]
for i in customer_index
for j in message_index])
print('Probability sum (i.e. the objective function): {}'.format(obj))
```

## Tricks of the optimization trade¶

That was fun, but now I'd like to complicate things (and make them more realistic) by assuming that there are pricing plans for each of these message types that change with number of messages sent:

Media | <500 | 500-999 | 1,000-1,499 | >1,500 |
---|---|---|---|---|

0.25 | 0.20 | 0.15 | 0.10 | |

Push notification | 0.30 | 0.25 | 0.225 | 0.20 |

Text message | 0.85 | 0.65 | 0.50 | 0.30 |

Phone Call | 5.00 | 5.00 | 5.00 | 5.00 |

You employ your own employees for the phones, so this cost does not benefit from economies of scale.

The budget constraint will now have to be rewritten, and it will be much more difficult. Seriously, it's pretty tricky.

With the tiered pricing, we introduce step functions to the cost. What follows is the best method I could think up to solve this problem, but I would love to know if anybody has better ideas.

To start, we'll consider $<$500 messages to be our *base* tier. We will then call each subsequent tier Tier 1, Tier 2, and Tier 3 (500-1,000, 1,000-1,499, and $>$1,500 messages, respectively). We must create two new types of variables.

The first will be indicator variables which indicate whether or not we have "activated" a particular tier for a particular message type. Mathematically, let us define $a_{jk} \in \{0, 1\}$ to indicate whether or not we have sold *at least* the minimum amount of messages of type $j$ in Tier $k$.

The second type of variable will be integer variables $t_{jk} \in \mathbb{Z}_{\geq 0}$ which tell us how many messages of type $j$ that we have sent above the minimum amount of messages for tier $k$. The way that we will calculate the price for sending $N$ messages of type $j$ will be to calculate the base cost and then subtract "discounts" from the total cost as we move up in tiers.

Okay, to summarize, we need to define both $a_{jk}$, our tier activation indicators, and $t_{jk}$, our tier message counters. Let's let $l_{k}$ be the minimum (lower) amount of messages needed to be sent in order to move up in a Tier. With the following two constraints, the behavior of $a$ will be fully defined

$$\forall j, k \geq 1 \quad \sum\limits_{i}m_{ij} \leq l_{k} - 1 + X a_{jk}$$$$\forall j,k \geq 1 \quad \sum\limits_{i}m_{ij} \geq l_{k} - X(1 - a_{jk})$$where $X$ is a sufficiently large number (like the maximum of $\sum\limits_{i}m_{ij}$). This technique of using this large number is called the Big M method from the common nomenclature of using an $M$ instead of my choice of $X$ (I already used an $M$ earlier!). The two constraints above are quite confusing. I know this much is true because it took me a long time to make sure that I had them right. I would suggest walking through an example to make sure that you understand what's going on. You can pick a single $k$, like the first Tier where $l_{1} = 500$. Now, imagine if 499 messages of some type $j$ had been sent such that $\sum\limits_{i}m_{ij} = 499$. What do the constraints say that $a_{j1}$ must be? Now, perform the same test for 500 and 501 messages and assure yourself that the constraints never disagree with each other and that $a_{j1}$ flips from 0 to 1 when it is supposed to.

With $a$ defined, it is one more set of mental gymnastics in order to define $t$:

$$\forall j,k \geq 1 \quad t_{jk} \leq Xa_{jk}$$$$\forall j,k \geq 1 \quad t_{jk} + l_{jk} \geq \sum\limits_{i}m_{ij} - X (1 - a_{jk})$$$$\forall j,k \geq 1 \quad t_{jk} + l_{jk} \leq \sum\limits_{i}m_{ij} + X (1 - a_{jk})$$The first constraint ensures that if $a$ is 0 then $t$ must be 0, as well. The other two constraints guarantee that when $a$ is 1, then $t_{jk} = \sum\limits_{i}m_{ij} - l_{jk}$ which is exactly the number of messages greater than the minimum threshold for Tier $k$.

With all of these god forsaken constraints defined, we can finally write our budget constraint. We will take $c_{jk}$ to be the cost for sending a single message from Tier $k$ where $k=0$ is the base cost. For $k>0$, we will take $c_{jk}$ to be the extra discount obtained by reaching that tier (it is positive, and we subtract it from the base cost).

$$\sum\limits_{j}\Big(\sum\limits_{i}m_{ij}c_{j0} - \sum\limits_{k \geq 1}t_{jk}c_{jk}\Big) \leq 2,000$$Phew!

## Math $\Rightarrow$ Code Redux¶

Finally, we get back to the code. We can reuse some of the old variables, but we'll need to rewrite our cost matrix and add some tier information.

```
# New Indices
tier_index = range(4) # k
# New matrices of constants
# Tier pricing
c = np.array([[0.25, 0.20, 0.15, 0.10],
[0.30, 0.25, 0.225, 0.20],
[0.85, 0.65, 0.50, 0.30],
[5.00, 5.00, 5.00, 5.00]])
# Recall that we must transform c
# to get the marginal discount.
# There's probably a fancy numpy way to do this
# but ain't nobody got time for that.
# Note we must slice in reverse if we
# want to do this in place.
for j in range(c.shape[1] - 1, 0, -1):
c[:, j] -= c[:, j-1]
c = np.abs(c)
# Tier thresholds
l = [0, 500, 1000, 1500]
```

We solve the problem similarly to before but need our two new variables along with their associated constraints.

```
# Create variables
prob = LpProblem('Tiered Churn Prevention Problem', LpMaximize)
m = {}
a = {}
t = {}
for i in customer_index:
for j in message_index:
m[i, j] = LpVariable('m_{}_{}'.format(i, j),
lowBound=0,
upBound=1,
cat='Binary')
for j in message_index:
for k in tier_index[1:]:
a[j, k] = LpVariable('a_{}_{}'.format(j, k),
lowBound=0,
upBound=1,
cat='Binary')
t[j, k] = LpVariable('t_{}_{}'.format(j, k),
lowBound=0,
upBound=len(customer_index) - l[k],
cat='Integer')
# Add the objective functions
prob += (lpSum([m[i, j] * p[i, j]
for i in customer_index
for j in message_index]),
'Total retention probability')
```

```
# Simple constraint
for i in customer_index:
prob += (lpSum([m[i, j] for j in message_index]) == 1,
'One message for cust {}'.format(i))
```

```
# Hard pricing tier constraints
X = len(customer_index) # Max value of sum_{i}(m_{ij})
for j in message_index:
m_sum = lpSum([m[i, j] for i in customer_index])
for k in tier_index[1:]:
prob += (m_sum <= l[k] - 1 + X*a[j, k],
'hard constraint 1 {}_{}'.format(j, k))
prob += (m_sum >= l[k] - X*(1 - a[j, k]),
'hard constraint 2 {}_{}'.format(j, k))
prob += (t[j, k] <= X * a[j, k],
'hard constraint 3 {}_{}'.format(j, k))
prob += (t[j, k] + l[k] >= m_sum - X*(1 - a[j, k]),
'hard constraint 4 {}_{}'.format(j, k))
prob += (t[j, k] + l[k] <= m_sum + X*(1 - a[j, k]),
'hard constraint 5 {}_{}'.format(j, k))
```

```
# Max budget constraint
prob += (lpSum([lpSum([m[i, j] * c[j, 0] for i in customer_index])
- lpSum([t[j, k] * c[j, k] for k in tier_index[1:]])
for j in message_index]) <= budget,
'Budget')
```

Even with our relatively small problem of 2000 customers, GLPK has a bit of difficulty in converging when using the default optimization parameters. I'll add a slightly smaller `mipgap`

(difference between upper bound and current feasible solution) in order to lower the tolerance for this example. Note that this is not necessarily GLPK's fault. There are other constraints that I could have written that do help the solver e.g.

```
glpk_options = ['--log', 'glpk.log', '--mipgap', '0.0001']
prob.solve(GLPK_CMD(msg=3, options=glpk_options))
```

```
with open('glpk.log', 'r') as f:
for line in f:
print(line.rstrip('\n'))
```

```
print("Status: {}".format(LpStatus[prob.status]))
```

Took a little longer, huh?

```
for j, m_type in enumerate(message_types):
print('Total {} messages: {}'.format(
m_type, np.sum([m[i, j].varValue for i in customer_index])))
```

```
total_cost = np.sum([np.sum([m[i, j].varValue * c[j, 0] for i in customer_index])
- np.sum([t[j, k].varValue * c[j, k] for k in tier_index[1:]])
for j in message_index])
print('Total cost: ${:.2f}'.format(np.round(total_cost, 2)))
```

```
obj = np.sum([m[i, j].varValue * p[i, j]
for i in customer_index
for j in message_index])
print('Probability sum (i.e. the objective function): {}'.format(obj))
```

Note that we get to a higher objective value with these pricing tier discounts.