
Creating a serverless ML application to beat the bookmakers

February 20, 2023
by
Matej Frnka

Project by Matej Frnka & Daniel Workinn

This blog post describes what it took to create and run a serverless ML application that predicts the results of upcoming football games and tries to make money by betting on the predicted winners. The focus is on the overall architecture rather than on the data science of beating the bookmakers.

You can find all the code behind this serverless ML app in this public GitHub repository. Don’t forget to give it a star ⭐

Our solution using serverless ML

The overall MLOps solution is based on 4 pipelines:

  • A backfill feature pipeline, to collect enough historical data to train ML models.
  • A model training pipeline, to fit the best model to historical data.
  • A (production) feature pipeline, to serve data to our deployed model.
  • A model inference pipeline, to serve predictions to the end-user using features served by the production feature pipeline.

To build this serverless ML application, we used the following managed services for MLOps:

  • We used Modal Labs to scrape historical football games and their odds with 300 scrapers at once
  • Then, we again used Modal Labs to schedule Python jobs that scrape new data and retrain the model every x hours and x days
  • We needed to store our features and models. For that, we went for a feature store called Hopsworks
  • Finally, we used Streamlit to display the outcomes

As you can see in the diagram below, we kept everything self-contained in its own application so that each individual part can be changed and run separately.

Arrows represent the flow of data

All the services mentioned above have very generous free tiers, so I can only recommend them for personal “messing around with things” type projects. Modal gives you 30 dollars to spend every month, Hopsworks gives you 25 GB of storage, and Streamlit lets you host a free website for as long as you like.

1. Backfill feature pipeline

This initial step gets everything running. It requires a little bit of manual input, but it only needs to run once.

First, as any good object-oriented project should, we wrote a class that scrapes a given league and saves the results as a dataframe to a parquet file. We won’t go into the implementation details in this blog.


# Scrape one league and save the results locally as a parquet file
instance = ScrapeInstance(country, league)
instance.run_scrape_local(url, save_path)

Then, we used this class to run an instance for every league and country in Modal. We used a Modal persistent volume, mounted to every instance, to store all the output files.


import logging

import modal

# LEAGUES and ScrapeInstance come from our own scraping module
VOLUME_MOUNT_PATH = "/my_vol"

stub = modal.Stub(
    "scraping",
    image=(
        modal.Image.debian_slim().pip_install(
            ["psycopg2-binary", "pandas", "numpy", "beautifulsoup4", "backoff", "requests", "lxml", "pyarrow"])
    )
)


def run(country: str, league: str, save_path: str):
    url = LEAGUES[(country, league)] + "results/"
    i = ScrapeInstance(country, league)
    i.run_scrape_local(url, save_path)
    if i.scraped_everything:
        logging.info("SUCCESS")


@stub.function(timeout=24 * 60 * 60,
               shared_volumes={VOLUME_MOUNT_PATH: modal.SharedVolume.from_name("scrape_data")})
def fn(key):
    country, league = key
    run(country, league, VOLUME_MOUNT_PATH)


@stub.local_entrypoint
def main():
    to_scrape = [...]  # all the leagues we wanted to scrape
    with stub.run():
        for result in fn.map(to_scrape):
            print(f"Completed: {result}")

We then monitored everything in the Modal dashboard.

Even with the parallelization, this step took almost a day to complete. It could have been sped up further by also parallelizing the scraping within individual leagues, since some leagues were scraped almost instantly while others took a lot longer.
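As an illustration, such a split could look something like the sketch below (a rough sketch only: scraping a limited page range is an assumption, not something our ScrapeInstance currently supports):


# Hypothetical sketch: fan out over (country, league, page range) chunks instead
# of whole leagues; the page-range parameters do not exist in the real scraper.
CHUNKS_PER_LEAGUE = 5   # illustrative values
PAGES_PER_CHUNK = 20

to_scrape = [
    (country, league, i * PAGES_PER_CHUNK, (i + 1) * PAGES_PER_CHUNK)
    for (country, league) in LEAGUES
    for i in range(CHUNKS_PER_LEAGUE)
]
# fn would then unpack (country, league, first_page, last_page) and scrape only
# that slice of the league's result pages before writing to the shared volume.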

After everything finished running, we processed the data and uploaded it all to Hopsworks. To do so, all we needed was a couple of lines to store our dataframe persistently and access it from anywhere:

import hopsworks

# Log in to Hopsworks - the Hopsworks API key needs to be set as an environment variable
project = hopsworks.login()

# Get the feature store
fs = project.get_feature_store()

# Create the feature_group
fg_football = fs.get_or_create_feature_group(
    name="fg_football",
    version=FG_VERSION, 
    primary_key=KEY_COLUMNS,
    event_time=['date'],
    description="Football data")


# … load and preprocess data
# create data_df pandas dataframe containing the data we want to insert

fg_football.insert(data_df)

Hopsworks automatically uses the pandas dataframe dtypes to set the column types. You can also specify the data types directly. Unfortunately, only numpy dtypes are supported at the time of writing. This limits you a little if you want to store null values, because numpy doesn’t support nulls for some data types, such as booleans. Luckily, support for pandas dtypes in Hopsworks is coming in the near future.
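In the meantime, a simple workaround is to cast nullable columns to a numpy-friendly dtype before inserting (a minimal sketch below, the column name is made up), for example a boolean column with missing values cast to float64, where NaN stands in for null:


import pandas as pd

# Hypothetical nullable boolean column, e.g. whether the home team won
data_df = pd.DataFrame({"home_win": [True, False, None]})

# Cast to float64 (True -> 1.0, False -> 0.0, None -> NaN) so that Hopsworks
# gets a numpy dtype it understands and nulls survive as NaN
data_df["home_win"] = data_df["home_win"].astype("float64")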

2. Model training pipeline

To train a model, we download all data from Hopsworks:


data_df = fg_football.read()  # read all data from the feature group

We then train the model using TensorFlow and upload it to the Hopsworks model registry. This is done by first saving the model locally and then uploading the local folder like so:


# We generate metrics based on backtests for the previous year
metrics = {"roi": ..., "created_timestamp": ...}
# Get the model registry api client
mr = project.get_model_registry()
# Create an entry in the model registry that includes the model's name, desc, metrics
model_football = mr.python.create_model(
    name="model_football",
    description="Football close odds predictions",
    metrics=metrics
)
# Upload the locally saved model folder to the registry
model_football.save(local_model_path)
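For completeness, the local save that precedes model_football.save is just the standard Keras call (a sketch: the real trained model and its architecture are not shown in this post, so a tiny placeholder model stands in for it):


import tensorflow as tf

# Placeholder standing in for our real trained model (not shown in this post)
model = tf.keras.Sequential([tf.keras.layers.Dense(3, activation="softmax")])
model.build(input_shape=(None, 10))  # 10 input features, purely illustrative

local_model_path = "model_football"
model.save(local_model_path)  # writes the SavedModel folder that gets uploaded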

Notice that we also upload metrics. We later use them to download either the best-performing model or the newest one. Usually, you would just want the best-performing one, but since our test data changes over time, we can’t rely 100% on the performance metric, so we use the newest model instead: a two-year-old model might have performed better on test data from three years ago, but that doesn’t mean it performs better on today’s data.

3. Production feature pipeline

Continuous scraping wasn’t too different from the initial scrape: we changed our scraper to start from the newest matches and stop once it reaches a date we already have.
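In pseudocode-ish form, the stopping condition looks something like this (a sketch; iter_matches_newest_first and latest_scraped_date are hypothetical names, not the actual scraper internals):


from datetime import date

def scrape_new_matches(country: str, league: str, latest_scraped_date: date):
    """Walk result pages from newest to oldest and stop at already-stored data."""
    new_rows = []
    for match in iter_matches_newest_first(country, league):  # hypothetical helper
        if match["date"] <= latest_scraped_date:
            break  # everything older is already in the feature store
        new_rows.append(match)
    return new_rows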

We also made our predictions for the upcoming games. To do so, we simply downloaded the model from Hopsworks and used it to make predictions:


from pathlib import Path

mr = project.get_model_registry()
# Get the latest model, i.e. the one with the highest created_timestamp metric
hopsworks_model = mr.get_best_model(name="model_football", metric="created_timestamp", direction="max")
# Download the latest model
# It is stored in a temp system directory
model_dir = Path(hopsworks_model.download())
# Load the model and make predictions ...
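Loading the model and generating the predictions is then only a couple of lines (a sketch, assuming the model was saved in the Keras SavedModel format and that features_df is a preprocessed dataframe of upcoming games; both are assumptions, not shown above):


import tensorflow as tf

# Load the downloaded SavedModel and predict on the preprocessed upcoming games
model = tf.keras.models.load_model(model_dir)
predictions = model.predict(features_df.to_numpy())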

We saved our newly scraped data directly to Hopsworks: the predictions went to a new feature group and the new results to the "fg_football" feature group, using exactly the same code as when the feature group was created. Super simple!
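Concretely, the predictions end up in their own feature group (a sketch: fg_predictions, its version, predictions_df and new_results_df are illustrative names rather than the exact ones we used):


# Feature group holding our model's predictions for upcoming games
fg_predictions = fs.get_or_create_feature_group(
    name="fg_predictions",
    version=1,
    primary_key=KEY_COLUMNS,
    event_time=['date'],
    description="Predicted outcomes for upcoming games")
fg_predictions.insert(predictions_df)

# Newly scraped results go into the existing feature group
fg_football.insert(new_results_df)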

To run the app periodically, we used Modal’s scheduled runs by updating the function decorator:


@stub.function(timeout=1200, schedule=modal.Period(days=1), secret=modal.Secret.from_name("HOPSWORKS_API_KEY"))
def fn(key):
    country, league = key
    run(country, league, VOLUME_MOUNT_PATH)

We then deployed it to Modal using modal deploy scrape.py.

4. Inference pipeline with Streamlit UI

The final step is to show our predictions to the user. All that was needed was to download the data from the predictions feature group, the same way as shown before, and display it in a Streamlit table.
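In Streamlit terms that is only a handful of lines (a sketch; the feature group name and version are illustrative):


import hopsworks
import streamlit as st

project = hopsworks.login()
fs = project.get_feature_store()

# Read the predictions written by the production feature pipeline
predictions_df = fs.get_feature_group("fg_predictions", version=1).read()

st.title("Football predictions")
st.dataframe(predictions_df)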
