Some thoughts on the direction of Data Science as a discipline after interviewing for a number of Data Science roles in 2022. Topics include ML Engineering, Analytics Engineering and the drive to specialisation.
Why we standardise using training statistics when doing Machine Learning
This post explains why new data passed to a Machine Learning model should be preprocessed using statistics calculated from the training data, rather than statistics recalculated from the new data itself.
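As a rough illustration of the idea, here is a minimal sketch using only the Python standard library. The toy numbers and the `standardise` helper are assumptions made up for this example and are not taken from the post itself.

```python
from statistics import mean, stdev

# Hypothetical toy data: one numeric feature, split into training
# observations and new observations arriving at prediction time.
train = [10.0, 12.0, 14.0, 16.0, 18.0]
new = [20.0, 22.0]

# The centring and scaling statistics are estimated once, from training data only.
train_mean = mean(train)
train_sd = stdev(train)  # sample standard deviation


def standardise(values, centre, scale):
    """Centre and scale values using the supplied statistics."""
    return [(v - centre) / scale for v in values]


# Both training and new data are transformed with the *training* statistics,
# so the new values stay on the scale the model was fitted on.
train_scaled = standardise(train, train_mean, train_sd)
new_scaled = standardise(new, train_mean, train_sd)

# For contrast: rescaling the new data with its own statistics hides the fact
# that these values are larger than anything seen in training.
new_scaled_wrong = standardise(new, mean(new), stdev(new))

print(new_scaled)
print(new_scaled_wrong)
```

With the training statistics, the new observations come out above every scaled training value, which is the information a fitted model needs. With their own statistics they are simply centred on zero, and that shift is lost.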
Generating fake personal data in Python with faker
This post gives a quick example of using the faker package in Python to generate fake customer data.
Introducing group sequential designs for early stopping of A/B tests
...
Simulating A/B tests with data.table
...
Adjusting for covariates and baseline differences in A/B testing
...
Approximate Nearest Neighbours in R and Spark
...
Feature selection by cross-validation with sparklyr
...
Cross-validation with sparklyr 2: Electric Boogaloo
...
SparkR vs sparklyr for interacting with Spark from R
...