Trends in Data Science

Some thoughts on where the discipline is heading, written after interviewing for a number of Data Science roles in 2022. Topics include ML Engineering, Analytics Engineering and the drive towards specialisation.

February 21, 2022 · 8 min · Ed

Why we standardise using training statistics when doing Machine Learning

This post explains why new data passed to a Machine Learning model should be preprocessed using statistics calculated from the training data, rather than statistics recalculated on the new data.

February 4, 2022 · 6 min · Ed
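
As a rough illustration of the idea in the post above, here is a minimal sketch using only the Python standard library. The helper names (fit_scaler, standardise) and the toy numbers are made up for this example and are not taken from the post. It standardises new data twice: once with the training mean and standard deviation, and once with statistics recomputed on the new data, which hides the fact that the new values sit well above the training range.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Return the (mean, population standard deviation) of the training values."""
    return mean(train_values), pstdev(train_values)


def standardise(values, mu, sigma):
    """Centre and scale values using the supplied statistics."""
    return [(v - mu) / sigma for v in values]


# Toy data (hypothetical): the new observations are larger than anything seen in training.
train = [10.0, 12.0, 14.0, 16.0, 18.0]
new = [20.0, 22.0]

# Statistics are learned once, from the training data only.
mu, sigma = fit_scaler(train)
train_scaled = standardise(train, mu, sigma)

# New data is transformed with the training statistics,
# so it stays on the scale the model was fitted on.
new_scaled = standardise(new, mu, sigma)

# Recomputing the statistics on the new data re-centres it around zero
# and discards the information that these values are unusually large.
new_mu, new_sigma = fit_scaler(new)
new_scaled_wrong = standardise(new, new_mu, new_sigma)

print("scaled with training statistics:", new_scaled)
print("scaled with its own statistics: ", new_scaled_wrong)
```

With the training statistics, both new values come out more than two standard deviations above the training mean. With their own statistics they are mapped to values centred on zero, which a fitted model would read as ordinary inputs.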

Generating fake personal data in Python with faker

This post gives a quick example of using the faker package in Python to generate fake customer data.

December 11, 2021 · 3 min · Ed

Introducing group sequential designs for early stopping of A/B tests

...

May 1, 2020 · 13 min · Ed

Simulating A/B tests with data.table

...

February 28, 2020 · 10 min · Ed

Adjusting for covariates and baseline differences in A/B testing

...

July 12, 2019 · 11 min · Ed

Approximate Nearest Neighbours in R and Spark

...

April 14, 2019 · 14 min · Ed

Feature selection by cross-validation with sparklyr

...

December 12, 2018 · 8 min · Ed

Cross-validation with sparklyr 2: Electric Boogaloo

...

November 23, 2018 · 15 min · Ed

SparkR vs sparklyr for interacting with Spark from R

...

December 5, 2017 · 14 min · Ed