ML Architecture from zero to one

If you followed all the best practices on the internet with regards to coding and technical architecture, it would take a start-up multiple years and a huge budget to get to an MVP, while you only have limited time, money and people. I created this list from experience: what to focus on at each stage of building a new company, and which best practices you can safely ignore at that stage. Advice like this is always opinionated, and you should pick and choose what actually applies to you. For example, if you are working on a medical device, you should invest in security earlier. Or if your customers need to be able to train their own models, you need to invest in training pipelines and serving architecture sooner.

To create a mature product, let’s say you need about one thousand development days. With 5 full-time developers, that’s about a year of work. If you are a start-up, you don’t have that time (and maybe not even those developers), so you need to ruthlessly prioritize.

So, here is a summary of some thoughts and advice.

Update October 2021: For a similar checklist solely focused on security, I’d recommend Goldfiglabs’ SaaS CTO Security Checklist.

General advice#

  • Learn, but then subsequently ignore “best practices” until you encounter the problem they solve. You do not need CI/CD, auto-scaling, Kubeflow training pipelines, prediction monitoring, automated retraining and multi-armed bandit A/B testing when you are working on your POC. Only solve the problems you actually have. Don’t solve problems other people had and subsequently turned into a best practice, unless you have the same problem.
  • Development speed matters. Do everything you can to optimize for development speed and make your own internal workflow as streamlined and fast as possible in the beginning. Most scalability, advanced security and code quality issues are perfectly fixable later. Make daily improvements to your internal workflow by automating things. I’d rather spend 2 hours automating something than do the same 2-minute task manually 10 times. That math doesn’t seem to work out at first glance, but it does in the long run: automation compounds over time. That leads to exponential differences and fat moats.
  • Write decoupled and easily extendable code. If all goes well, you will be working in this code base for the next 5 years, so try to structure it well from the beginning. You will - or should - never rewrite from scratch.
  • Most decisions can easily be reverted and should be taken quickly. Except one type: adopting a new architecture/library/framework/database. When you choose to adopt Neo4j instead of Postgres, it’s a choice that can take months of development work to properly integrate and fine-tune. If that turns out to be the wrong choice, you are looking at a huge refactor, frustration and wasted time. Take extra care and time to think and test extensively before adopting new libraries, frameworks or technology. Don’t rush these decisions, because they are very expensive to fix later. Make sure you check decisions like these with your technical advisors if you are in doubt.
  • Don’t waste time. If you feel like you are losing time on something: fix it.

With those general principles out of the way, here’s the typical roadmap I have applied when starting, growing and leading the ML teams for Chatlayer, Faktion and Metamaze.

POC / Seed stage#

In this stage, the most important aspect is proving your technology can work. The only thing that matters is being able to give convincing demonstrations to potential early-adopter clients and investors.

In this stage, you should mostly do whatever gives you the highest development speed. That means developing and testing everything locally as much as possible. The focus is on building good habits and laying a long-term foundation.

DO

  • Version and track all your code in git.
  • Log your experiments: code used, architecture, data version, training (hyper) parameters and results.
    • automatically, in an experiment tracking framework like Weights & Biases (super easy, with a generous free tier) or MLflow (requires quite some setup and configuration, but is open source) - see the sketch after this list
    • manually, in Confluence/Notion/GitHub/…, which requires discipline
  • Write and run functional tests for the most crucial parts of your code base. You want your demos to go smoothly, and never have last-minute fixes break something else unexpectedly.
    • No tests at all is bad and will hinder you in the future and cause frustration when things inevitably break.
    • Aiming for >90% test coverage isn’t recommended either, since it will slow you down too much: your code base is in flux and changes dramatically from day to day.
  • Add git pre-commit hooks for formatting, linting and type checking. Type annotate all your code (trust me on this one: you’ll thank me later). Optionally add more static code analysis like deepsource.io or sonarqube. Get into the habit of writing good code from the start.
  • If the data size allows, run your training locally on your laptop / desktop with a GPU.
    • If you can avoid having to set up cloud architecture, it will save you a lot of time.
    • If you can’t, try using something simple like Azure Notebooks, Google Colab or AWS SageMaker. Or - if you have funky dependencies - rent a VM with GPUs, download your data, check out your code and train. Keep it simple.
  • Work in a decent IDE like PyCharm or Visual Studio Code.
  • Run your predictions in a local Docker container. Manually build that Docker container. This is fast enough for quick experiments while protecting you from dependency hell.
  • If you have a back-end/front-end to show, it’s okay to run that locally for now. You can fake a real domain name by adding e.g. 127.0.0.1 app.myunicorn.com to your /etc/hosts file.
  • Get used to working with issue tracking in JIRA/GitHub.
  • Only write some in-code documentation for tricky functions.
  • Starting from boilerplate repos like https://github.com/hagopj13/node-express-boilerplate or https://github.com/pydanny/cookiecutter-django can help with setup you’d otherwise lose time on later.
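
To make the experiment-tracking habit concrete, here is a minimal sketch of what logging a training run to Weights & Biases could look like. The project name, config values and metrics are placeholders, not a prescription; MLflow offers equivalent logging calls if you prefer the open-source route.

```python
import random
import wandb

# Hypothetical project name and hyperparameters - adapt to your own setup.
run = wandb.init(
    project="my-unicorn-poc",
    config={
        "architecture": "resnet18",
        "dataset_version": "2021-10-01",
        "learning_rate": 1e-3,
        "batch_size": 32,
        "epochs": 10,
    },
)

for epoch in range(run.config.epochs):
    # Replace these dummy numbers with your real training/validation metrics.
    train_loss = random.random()
    val_accuracy = random.random()
    wandb.log({"epoch": epoch, "train_loss": train_loss, "val_accuracy": val_accuracy})

run.finish()
```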

MVP / Series A stage#

At this stage, there should be a running, bug-free version of your app working at least 95% of the time: you have some early adopters, your sales people regularly give demonstrations, … That means you need to deploy your app in the cloud somewhere and keep it reasonably stable. Your customers will forgive you some downtime, but not as much as you think, so better keep it stable. You don’t need rolling deploys yet and can typically get away with just announcing a deploy and testing window for large releases. You are building the foundation of your code base for the next ten years or so, so you need to invest time in adopting good coding habits.

DO

  • Create a production environment running on Kubernetes.
  • Add basic security like whitelisting APIs and simple password protection for the front-end. Turn on production flags, disable introspection, turn on database IP whitelisting, …
  • Add automatic back-ups for your most important client data.
  • Create user facing documentation.
  • If your platform needs scaling at this stage, work on being able to manually scale the most important components. Running multiple replicas can often do wonders, and the cloud cost of just overdimensioning your cluster will likely be way less than the cost of investing time in fine-tuning auto-scaling.
  • Adopt the git flow methodology or a variant to track releases and versions.
  • Track and improve your test coverage.
  • Run your predictions in a Docker container that you push manually to the registry and then deploy manually on a bare VM, a simple container orchestration service or a simple Kubernetes cluster. You can deploy new versions to Kubernetes manually with Helm.
  • Add CI/CD to your back-end, front-end and prediction architecture: automatically build and test new builds for code pushes for front-end and back-end. Invest in making deploys easy and reversible, but not necessarily automatic.
  • Manually build and deploy models by just updating the reference to your trained model in the predict code (see the first sketch after this list). Store your trained model in git-lfs or on blob storage (but NOT just in git).
  • Add Sentry or a similar automated error reporting tool so you get alerts when things go wrong (see the second snippet after this list).
  • Need logs/monitoring? Just using kubectl logs and kubectl top should be fine for now. Check out fubectl and Lens too for increased efficiency.
  • Manually back up less important data from time to time.
  • When new developers start, work on documentation for how to run, debug and test the code. Add high-level architecture overview diagrams. As much as possible, update the documentation first instead of explaining things directly to people.
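
As an illustration of pinning the deployed model to an explicit reference, the predict code can simply download a versioned artifact from blob storage at start-up; bumping one version value is then all a deploy needs. Below is a minimal sketch using S3 via boto3; the bucket, key layout and environment variable names are made up for the example.

```python
import os
import boto3

# Hypothetical bucket/key scheme - MODEL_VERSION is the single value you bump
# when you want to roll out a newly trained model.
MODEL_BUCKET = os.environ.get("MODEL_BUCKET", "myunicorn-models")
MODEL_VERSION = os.environ.get("MODEL_VERSION", "2021-10-01")
LOCAL_PATH = "/tmp/model.pt"


def fetch_model() -> str:
    """Download the pinned model artifact from blob storage to local disk."""
    s3 = boto3.client("s3")
    s3.download_file(MODEL_BUCKET, f"models/{MODEL_VERSION}/model.pt", LOCAL_PATH)
    return LOCAL_PATH


if __name__ == "__main__":
    print(f"Model downloaded to {fetch_model()}")
```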
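
Hooking in Sentry is only a few lines. A minimal sketch; the DSN below is a placeholder you would replace with the one from your own Sentry project settings.

```python
import sentry_sdk

# Placeholder DSN - use the one from your own Sentry project.
sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",
    environment="production",
    traces_sample_rate=0.1,  # sample a fraction of transactions for performance tracing
)

# Unhandled exceptions are now reported automatically; handled ones can be
# reported explicitly as well:
try:
    1 / 0
except ZeroDivisionError as exc:
    sentry_sdk.capture_exception(exc)
```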

Product-market fit / Series B stage#

Congrats on making it to this stage! You will now have real, paying customers that depend on the platform, so you need to become way more mature. You might have an SLA to guarantee uptime.

DO

  • Add external and automatic health monitoring for your platform.
  • Improve logging, monitoring and tracing for all technical components. Create real-time dashboards to monitor the load and health of your systems.
  • Add and test backup and restore policies.
  • Add, test and finetune the ability to automatically scale to increased load. Solve bottlenecks and plan for the future.
  • Invest in decent security by fine-tuning RBAC / SSO for your internal systems.
  • Add dev and staging environments so that you can test all components working together easily, test architecture changes, …
  • Add bug tracking and reporting capabilities for your customers.
  • Add rolling deploys with blue/green deployments. While releasing new changes, the availability of the platform should not be impacted.
  • Score a 12 on the Joel Test’s 12 steps to better code.
  • Add monitoring for prediction quality and data drift (see the first sketch after this list).
  • Adopt a good serving framework that meets your needs.
  • Adopt a training pipeline framework like Kubeflow. Kubeflow is hard to get used to, hard to set up correctly and not trivial to secure, but once it’s there, it is a powerful collaboration and productivity tool for ML engineers.
  • Add decent and automatic accuracy testing. Create specific and separate test sets for data that had bad predictions in the past, e.g. bad scans or low-light videos. Add these to your evaluation pipeline and automatically report the metrics per test set, so that you know improvements to the model won’t negatively impact certain subsets (see the second sketch after this list).
  • Use static code analysis to find hidden bugs and monitor code quality.
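
For the prediction-quality and data-drift monitoring, a lightweight starting point is to compare the distribution of an incoming feature against a reference window from training time. Here is a minimal sketch using a two-sample Kolmogorov-Smirnov test from scipy; the threshold and the synthetic data are purely illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

# Illustrative threshold - tune to your own tolerance for false alarms.
DRIFT_P_VALUE = 0.01


def detect_drift(reference: np.ndarray, live: np.ndarray) -> bool:
    """Flag drift when the live feature distribution differs significantly
    from the reference (training-time) distribution."""
    statistic, p_value = ks_2samp(reference, live)
    return p_value < DRIFT_P_VALUE


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    reference = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training-time feature values
    live = rng.normal(loc=0.4, scale=1.0, size=1_000)        # shifted production values
    print("Drift detected:", detect_drift(reference, live))
```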
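
And for the per-test-set accuracy reporting, the idea is simply to keep the hard cases in named subsets and report a metric for each, so a regression on, say, bad scans shows up immediately. A sketch with made-up test set names and a placeholder model:

```python
from typing import Callable, Dict, List, Tuple

from sklearn.metrics import accuracy_score

# Made-up test sets: each name maps to (inputs, expected labels).
TEST_SETS: Dict[str, Tuple[List[float], List[int]]] = {
    "standard": ([0.1, 0.9, 0.8, 0.2], [0, 1, 1, 0]),
    "bad_scans": ([0.55, 0.45, 0.6], [1, 0, 1]),
    "low_light_videos": ([0.52, 0.48], [1, 0]),
}


def evaluate(predict: Callable[[List[float]], List[int]]) -> Dict[str, float]:
    """Report accuracy separately per named test set."""
    report = {}
    for name, (inputs, labels) in TEST_SETS.items():
        predictions = predict(inputs)
        report[name] = accuracy_score(labels, predictions)
    return report


if __name__ == "__main__":
    def dummy_predict(xs: List[float]) -> List[int]:
        # Placeholder model: threshold at 0.5.
        return [int(x > 0.5) for x in xs]

    for name, acc in evaluate(dummy_predict).items():
        print(f"{name}: {acc:.2%}")
```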

Series C and beyond#

By now, you have an established and polished product. You have talented people working for you, scaling the company, building long-term product roadmaps and maturing your code base.

If you have made it this far, you managed to have a successful strategy, decent development process and a product that provides value. You - and your team - know where your architectural issues are. It will never be perfect. But it will work.

Read more#