Early this year I interviewed for Data Science roles with a number of organisations. This post collects my thoughts on the direction of the discipline, which became clearer through this process. I’ve already created a Twitter thread on the topic, so you could just read that if you’re in a hurry. This post expands on content from the original thread.
Interviewing for Data Science positions early this year has made clear to me the way the field appears to be specialising. A thread 🧵— Ed (@datasci_ed) February 10, 2022
Be prepared for production
One thing is pretty clear: if you want to do Machine Learning in an organisation (aside from R&D), you need to be able to put those models into production. The days of throwing your model over the fence are numbered for several reasons.
- Tooling. The tooling to put models into production keeps improving, so you need to know less about Data (Platform) Engineering and more about ML. For example, think of how accessible Kubeflow Pipelines, Airflow, TFX etc. are for someone who knows Python.
- Expectations. The expectations for production-grade AI systems are increasing. It’s not enough to just have a model running; you also need to monitor it, evaluate it, explain the outputs and test different scenarios. All this requires knowledge of Machine Learning.
- Efficiency. Given the advances in tooling and the expectations on an ML system, why would we chuck a model over the fence to an engineer who isn’t likely to know as much about ML? We avoid these hand-offs by having Data Scientists put their models into production.
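As a rough illustration of the monitoring expectation above, here is a minimal sketch of one common drift heuristic, the Population Stability Index (PSI), which compares the distribution of model scores at training time with the scores seen in production. The choice of PSI, the function name, the bin count and the example data are all assumptions made for illustration; they are not from the original thread, and a real system would more likely use a monitoring library.

```python
import math
from typing import List, Sequence


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a reference sample and a production sample.

    Bins are equal-width and derived from the reference sample only.
    Values outside that range are clipped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    # Fall back to a width of 1.0 if every reference value is identical.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each proportion at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    p = proportions(expected)
    q = proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical scores: a reference sample and a clearly shifted one.
training_scores = [i / 100 for i in range(100)]
production_scores = [0.5 + i / 200 for i in range(100)]

print(population_stability_index(training_scores, training_scores))
print(population_stability_index(training_scores, production_scores))
```

A PSI of zero means the two binned distributions match, and larger values mean more drift. Any alerting threshold is a judgment call for the team that owns the model.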
Having Data Scientists put their own models into production is also good from an ownership perspective. We know from DevOps the issues with developers siloed away from operations. In recent years we’ve seen the growth of MLOps to describe methods to build robust ML systems, taking inspiration from DevOps [1, 2].
The world beyond Machine Learning
Despite all this, there is a world beyond ML in terms of what Data Scientists can do for an organisation [3]. Parallel to the growth of MLOps, we’re seeing more sophisticated statistical techniques used in industry for experimentation and causal inference [4, 5]. There aren’t many people with the capacity to be an expert in ML, MLOps and all this other stuff. This therefore represents another form of specialisation through which Data Scientists/Analysts can have an impact.
But what of Data Engineers?
Some teams might be used to handing their models off to an engineer to put into prod. If the Data Scientists on those teams evolve into ML Engineers and do it for themselves, what do the other engineers do?
First off, ML is only as good as the data coming in, and only has an impact if the data coming out can get to people. All this requires expertise in Data Engineering, even if Data Scientists build their own production pipelines. Organisations will still require Data Engineers working on ELT/ETL [6] to the left of Data Science and Analytics.
If engineers are freed from putting ML models into production, they could work on ingesting and transforming new data sources to support analytics as well as ML. They could also work on getting data (including from models) out to other bits of the org. It’s no good your model outputs sitting in the data warehouse if someone downstream wants them on Kafka. Expecting ML Engineers to do this work probably pushes at the boundaries of what one person can know. Equally, ML Engineers might be able to self-serve model endpoints using various tools, but possibly not build something entirely bespoke.
The expectation on Data Scientists to build their own prod pipelines also relies on decent tooling. While there are various off-the-shelf options, there will be cases where these need supplementing with in-house tooling. Here we have another key role for Data (Platform) Engineers or, perhaps, ML Platform Engineers as some organisations have.
Another possibility for this freed-up resource is the growing idea of Analytics Engineering [7, 8, 9, 10]. To really have a data-driven organisation, you need to maintain robust curated data and present it effectively. Whatever we mean by self-serve [11], this is surely a key part of it. Relatedly, we have the concept of the metric layer [12, 13, 14, 15] as an example of something an Analytics Engineering team could maintain.
I’ll leave it to the many excellent posts in the footnotes to explain what Analytics Engineering involves in more detail. What is worth saying is that this feels like a natural activity for engineers who have worked in Data Science teams or supported putting models in production. Data Science, like Analytics, sits downstream of your standard ELT/ETL activities but still requires plenty of engineering.
Dashboards, like ML models, are far easier to maintain if you’ve already got a set of key metrics (or features) pre-calculated for everyone to use. This way, analysts have a single source of truth and can be confident everyone is working from the same metrics. People will feel a lot more confident self-serving insight if they know that pre-calculated, robustly engineered data has been curated for them.
On top of all this, quality analytics requires well-maintained platforms. Again, dedicating people to this task (Analytics Platform Engineering, as Deliveroo calls it) creates the foundation for analysts to focus on driving the organisation forward with flexible, robust and timely analytics.
A lumper or a splitter?
I’ve thrown around a lot of terms in this post: Data Scientist, Machine Learning Engineer, Analytics Engineer, Data Engineer, Data Platform Engineer and more. Here I’m reminded of one of my lecturers, who remarked that in psychiatry everyone is either a lumper or a splitter. You either prefer broad categories or want to subdivide into smaller, more specific groups. We could say the same about Data organisations. You might prefer to keep your job titles pretty general (Data Scientist, Data Engineer), but have various activities that people with those titles do. Some Data Scientists might do machine learning engineering, others might do experimentation. One Data Engineer could be working on maintaining a Looker instance while another builds ELT pipelines. Alternatively, you might prefer to have job titles that are more specific.
You could also keep your job titles generic but have more specifically named teams; an Analytics Engineering Team populated with Data Engineers. All this is really a matter of organisational preference and culture, provided individuals understand what their role is and what different teams are trying to achieve. Below is a nice diagram of the different data roles out there and how they overlap.
Data folks, thoughts on this title overlap illustration? pic.twitter.com/xe41a4JZJz— Elena Dyachkova (@ElenaRusAthletx) January 19, 2022
Paths for the aspiring Data Scientist
I wanted to close this post, as I did my Twitter thread, with some thoughts for aspiring Data Scientists. In my thread I identified three routes you could go down.
- ML Engineer. If you love building ML products, look into ML Engineering roles (mostly still called Data Scientist). Expect to focus on ML and MLOps, as well as the software and data engineering skills necessary to build solid products.
- (Type A) Data Scientist. If you’re more interested in statistics & experimentation, there are places where this is all Data Scientists do. You might occasionally use ML, but more likely you’ll be building statistical models, designing experiments, and building domain knowledge.
- The Generalist. Maybe you don’t know which you want to be. That’s fine too, just make sure to find a role that will let you get involved in a range of things. This will probably be easier in a small team where there’s less pressure to specialise.
Another key question you should ask yourself is, do I have to work on Data Science? Data Science and Machine Learning are one piece of how organisations use Data to inform their decisions. For every Data Scientist there are many more roles in engineering and analytics. I would recommend people reflect on what appeals about Data Science and whether they could get this from any other data role. Ultimately, the more things you’re interested in, the more opportunities come up.
Casual Inference: Causal inference for data science with Sean Taylor. This podcast episode is a great discussion of using causal inference in industry. The podcast is great in general if you want to learn about Causal Inference in a casual way.
The Emergence and Evolution of Analytics Engineering at Deliveroo. There’s a lot of great info here. I really like the idea of starting off by embedding Analytics Engineers in Data Science teams to leverage established relationships with stakeholders.