Data Piques

Aug 06, 2018

Quick and Dirty Serverless Integer Programming

We all know that Python has risen above its humble beginnings such that it now powers billion dollar companies. Let's not forget Python's roots, though! It's still an excellent language for running quick and dirty scripts that automate some task. While this works fine for automating my own tasks because I know how to navigate the command line, it's a bit much to ask a layperson to somehow install python and dependencies, open Terminal on a Mac (god help you if they have a Windows computer), type a random string of characters, and hit enter. Ideally, you would give the layperson a button, they hit it, and they get their result.

I recently deployed a solution which allowed me to do just this. Even better - it was all free! In this post, I'll talk about how I used Google Sheets as my input form, datasheets to convert Google Sheets to pandas, Zappa for deploying a serverless Flask app, and PuLP for solving a quick integer programming problem to make a simple and free ad hoc optimization service.

Note: all the code for this service is located on my github


Every project should start as a problem, and mine was no different. My wife competes in fantasy movie league. This is like fantasy football for movie geeks. The rules are simple:

You are a fantasy movie theater owner. You must decide which movies to play on your 8 screens. Each movie costs a different amount to screen, and the goal is to generate the most box office revenue over the weekend given your available budget. Talking with her, I realized that, if one can do a good job predicting box office revenue for the weekend (the hard part of the problem), then deciding how many screens to play each movie becomes a simple integer programming allocation problem.


Now that we have the problem, what are the requirements?

  1. A method for inputting a bunch of data:
    • Movie name
    • Expected revenue
    • Cost to screen
  2. Ability to run the allocation problem from a browser.
  3. A view of the solution

What's the easiest input form that data scientists hate?


What's worse than Excel?

Google Sheets


Thankfully, Squarespace created datasheets. This is a nice library that makes interactions between pandas and Google Sheets impressively painless. The library is worth it for the detailed OAuth page alone (I once spent 2 weeks struggling with Google OAuth pain and really wish this page had existed at that time). What's particularly nice about the OAuth page is that it walks through setting up a service account which does not require the end-user to go through the typical OAuth dance of browser redirects to and from the Google login page. This is nce because these redirects can get messed up when moving from local development to production systems in the cloud (or at least they always get messed up when I try to do it!).

Anywho, the first step was to setup my Google Sheets credentials and download the client_secrets.json and service_key.json files. With these handy, I can now access my Google Sheets spreadsheet using datasheets. The spreadsheet is named FML, and the inputs tab looks like


We can pull this into a pandas DataFrame by setting some datasheets environment variables to point to our credentials and then creating a Client

In [1]:
import os
import datasheets

os.environ['DATASHEETS_SECRETS_PATH'] = 'client_secrets.json'
os.environ['DATASHEETS_SERVICE_PATH'] = 'service_key.json'

client = datasheets.Client(service=True)

If that goes well, we can now grab our workbook (aka the Google Sheets file) and download the tab of data

In [2]:
workbook = client.fetch_workbook('FML')
input_tab = workbook.fetch_tab('inputs')
input_data = input_tab.fetch_data()

movie revenue cost
0 Hotel Transylvania 13600000.0 157.0
1 Ant Man 9100000.0 116.0
2 Skyscraper 5300000.0 61.0
3 Incredibles 2 7900000.0 89.0
4 Jurassic World 6700000.0 76.0
5 Purge 2400000.0 28.0
6 Sorry to Bother 1800000.0 18.0
7 MI: Fallout 63600000.0 756.0
8 Mamma Mia 19800000.0 227.0
9 Equalizer 18300000.0 201.0
10 Unfriended 1600000.0 18.0
11 Blindspotting 3000000.0 41.0
12 Teen Titans 13400000.0 149.0
13 Three Idential Strangers 1100000.0 16.0
14 Eighth Grade 946000.0 26.0

Allocating Movies

I've written previously about integer programming in Python using the PuLP package, so I will avoid the introductions to integer programming and pulp. For this post, I will just quickly summarize the optimization problem, as it's reasonably simple!

We only have a single decision variable in our problem. In the code, I call this movie_counts. In math, we can call it $S_{m}$ which corresponds to how many screens we will play movie $m$ on for the weekend. This is an integer decision variable with a lower bound of 0 and an upper bound of 8 (the number of screens we have available in our fantasy movie theater). It is an integer variable because we cannot screen a movie on 2.5 screens.

With our decision variable in hand, we must now define an objective function. We simply want to maximize expected revenue. Let's define a quantity $r_{m}$ which is the amount of money that we expect movie $m$ to bring in (this is the revenue column in the above DataFrame). Our objective function is then simply

$$\sum_{m} r_{m} * S_{m}$$

Lastly, we need some constraints. We only have two, but, before I introduce them, I need to introduce one slight wrinkle in fantasy movie league. You get charged $2 million for every screen that you leave empty. We can incorporate this into our optimization problem by assuming that there is an extra movie called "Empty Screen" and that the expected revenue for that movie is negative $2 million. Our two constraints can now be defined:

  1. Every screen must be assigned a movie $$ \sum_{m} S_{m} = 8 $$
  2. We have a limited budget of $1000. Let's say movie $m$ costs $c_{m}$ to screen. Our budget constraint is thus $$ \sum_{m} c_{m} * S_{m} \leq 1000 $$

And that's it: one type of decision variable, a simple objective function, and two constraints. If you're interested, I wrap all of the above steps into an Optimizer class in the fml code.

With the optimization problem complete, I can pack up the solution as a DataFrame and use datasheets to write the data back to the outputs tab of the spreadsheet

In [3]:
solution = ...
outputs_tab = workbook.fetch_tab('outputs')

Painless Serverless

The final step was to create a tiny Flask app with a button to launch the optimization. I made the simplest barebones site that I could, and then it was time to deploy.


Zappa is a ridiculously cool Python library that lets you run any Python application as an AWS Lambda function and make it all discoverable via API Gateway. What this means is that you can make a Python website and run it in the cloud without an actual server running the code (as long as your website runs quickly, and uses few resources). You only pay for each time the website runs, but the first million times per month are free. If my wife happens to run this more than 1 million times, then I'll happily pay money.

I was blown away by how easy Zappa was to use. Honestly, the hardest part was figuring out how to install python 3.6 on my linux computer because you have to use virtualenv instead of conda (though there's a PR to fix that).

I'm just going to copy the documentation on how to get Zappa working because this is literally all that I had to do:

pip install zappa

zappa init

zappa deploy

After all of your code gets zipped up and sent to the cloud, Zappa tells you what cryptic URL at which you can now find your app. You can use custom domains and a gazillion other options, but this is quick and dirty serverless integer programming, remember?

With the website deployed, my wife can now input data into the spreadsheet, hit the Calculate button on the website, and then watch the spreadsheet for the optimal movie screens with nary a command line in sight.