Introducing group sequential designs for early stopping of A/B tests

Typically the sample size of a study is fixed, ideally based on a power analysis. We collect all our data, then analyse it once with the appropriate statistical model. This is an ideal that people often fall short of in practice. It’s common for scientists and analysts alike to check in on an experiment as the data come in (Albers, 2019; Miller, 2010). There’s a lot of potential for error here if experiments are stopped or changed based on ad hoc peeks at the data. [Read More]
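As a flavour of what gsDesign offers, here is a minimal sketch of a group sequential design with planned interim looks. The specific arguments (number of looks, error rates, spending function) are illustrative choices, not taken from the post:

```r
library(gsDesign)

# A group sequential design with 3 planned analyses (2 interim + 1 final),
# using a Lan-DeMets O'Brien-Fleming-like spending function, so that
# stopping at an early look requires very strong evidence
design <- gsDesign(
  k = 3,          # number of analyses
  test.type = 2,  # two-sided symmetric test
  alpha = 0.025,  # one-sided type I error
  beta = 0.2,     # i.e. 80% power
  sfu = sfLDOF    # spending function for the upper boundary
)

# Z-value boundaries at each look; crossing one permits early stopping
design$upper$bound
```

Because the boundaries spend the type I error across looks, peeking at each interim analysis no longer inflates the overall false positive rate the way ad hoc peeking does.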

Simulating A/B tests with data.table

A/B tests involve comparing two different groups, typically with a randomised trial. Traditionally this might be a clinical trial where a new drug is given to a treatment group, who are compared with a control group who did not receive the drug. Other early A/B tests were done in agriculture, where fields would be split into sections that were treated differently. [Read More]
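A minimal sketch of the kind of simulation the post builds up: the group sizes and conversion rates below are made up for illustration.

```r
library(data.table)
set.seed(63)

# Simulate one A/B test: n users per group, with the control group
# converting at 10% and the treatment group at 12%
n <- 1000
dt <- data.table(
  group     = rep(c("A", "B"), each = n),
  converted = c(rbinom(n, 1, 0.10), rbinom(n, 1, 0.12))
)

# Observed conversion rate by group, using data.table's by= syntax
dt[, .(rate = mean(converted)), by = group]
```

Wrapping this in a function and repeating it many times is what makes data.table attractive here: grouped aggregation over a large stack of simulated experiments stays fast.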

Adjusting for covariates and baseline differences in A/B testing

Background A/B testing is a term typically used in commercial settings to refer to the design and analysis of controlled experiments. Introductory posts on A/B testing tend to focus on simple between-subjects randomised controlled trials (RCTs). In such experiments, participants (e.g. users, customers) are randomised into one of two groups. One group gets some treatment (e.g. a voucher code) and the other receives nothing (‘business-as-usual’), or some lesser treatment without the feature the trial is aiming to investigate. [Read More]
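To illustrate the idea of adjusting for a baseline covariate, here is a small sketch with simulated data; the variable names and effect sizes are hypothetical, not from the post.

```r
set.seed(1)

# Hypothetical trial: a baseline measure, random assignment, and an
# outcome that depends on both the baseline and a small treatment effect
n <- 500
baseline <- rnorm(n)
group    <- rbinom(n, 1, 0.5)
outcome  <- 0.5 * baseline + 0.2 * group + rnorm(n)

# Unadjusted vs covariate-adjusted estimates of the treatment effect;
# adjusting for baseline soaks up outcome variance, typically giving
# a narrower confidence interval for the group effect
unadjusted <- lm(outcome ~ group)
adjusted   <- lm(outcome ~ group + baseline)

confint(unadjusted)["group", ]
confint(adjusted)["group", ]
```

Randomisation means the adjustment is not needed for unbiasedness, but it buys precision, which is the point the post develops.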

Approximate Nearest Neighbours in R and Spark

Background K-Nearest Neighbours is a commonly used algorithm, but is difficult to compute exactly for big data. Spark implements a couple of methods for finding approximate nearest neighbours using Locality-Sensitive Hashing: Bucketed Random Projection for Euclidean distance and MinHash for Jaccard distance. The work to add these methods was done in collaboration with Uber, which you can read about here. Whereas traditional KNN algorithms find the exact nearest neighbours, these approximate methods will only find the nearest neighbours with high probability. [Read More]
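A rough sketch of what this looks like from sparklyr, assuming the LSH feature transformers in recent sparklyr versions; the dataset, columns, and parameter values are illustrative:

```r
library(sparklyr)
sc <- spark_connect(master = "local")

# Copy a small numeric dataset to Spark and assemble a features column
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
assembled <- ft_vector_assembler(
  mtcars_tbl,
  input_cols = c("mpg", "wt", "hp"),
  output_col = "features"
)

# Fit a Bucketed Random Projection LSH model (Euclidean distance)
lsh <- ft_bucketed_random_projection_lsh(
  sc,
  input_col = "features", output_col = "hashes",
  bucket_length = 2
)
lsh_model <- ml_fit(lsh, assembled)

# Approximate nearest neighbours of a (hypothetical) query point
ml_approx_nearest_neighbors(
  lsh_model, assembled,
  key = c(21, 2.5, 110),
  num_nearest_neighbors = 3
)
```

The trade-off is governed by parameters like `bucket_length` and the number of hash tables: coarser hashing is faster but raises the chance of missing a true neighbour.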

Feature selection by cross-validation with sparklyr

Overview In this post we’ll run through how to do feature selection by cross-validation in sparklyr. You can see previous posts for some background on cross-validation and sparklyr. Our aim will be to loop over a set of features, refitting a model with each feature excluded in turn. We can then compare the performance of these reduced models to a model containing all the features. This way, we’ll quantify the effect of removing a particular feature on performance. [Read More]
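The leave-one-feature-out idea can be sketched as below. The dataset, feature set, and use of a simple in-sample AUC are illustrative simplifications (the post's approach cross-validates each fit):

```r
library(sparklyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

features <- c("wt", "hp", "disp", "drat")

# Fit a logistic regression for a given feature set and return its AUC
fit_auc <- function(feats) {
  f <- as.formula(paste("am ~", paste(feats, collapse = " + ")))
  model <- ml_logistic_regression(mtcars_tbl, f)
  ml_binary_classification_evaluator(
    ml_predict(model, mtcars_tbl),
    label_col = "am"
  )
}

# Full model, then one reduced model per excluded feature
full_auc <- fit_auc(features)
drop_auc <- sapply(features, function(f) fit_auc(setdiff(features, f)))

# Features whose removal costs the most AUC matter most
round(full_auc - drop_auc, 3)
```

Swapping the in-sample evaluation for a k-fold loop gives the cross-validated version the post works through.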

Cross-validation with sparklyr 2: Electric Boogaloo

Overview I’ve previously written about doing cross-validation with sparklyr. This post will serve as an update, given the changes that have been made to sparklyr. The first half of my previous post may be worth reading, but the section on cross-validation is wrong, in that the function provided no longer works. If you want an overview of sparklyr, and how it compares to SparkR, see this post. Bear in mind, however, that post was written in December 2017, and both packages have added functionality since then. [Read More]

SparkR vs sparklyr for interacting with Spark from R

This post grew out of some notes I was making on the differences between SparkR and sparklyr, two packages that provide an R interface to Spark. I’m currently working on a project where I’ll be interacting with data in Spark, so I wanted to get a sense of the options for using R. Those unfamiliar with sparklyr might benefit from reading the first half of this previous post, where I cover the idea of having R objects for connections to Spark DataFrames. [Read More]
Code  Spark  R  sparklyr 

Machine learning and k-fold cross validation with sparklyr

Update, 2019. I have now written an updated post on cross-validation with sparklyr, as well as a follow-up on using cross-validation for feature selection. Those posts are better starting points, as the code here no longer works following changes to sparklyr. In this post I’m going to run through a brief example of using sparklyr in R. This package provides a way to connect to Spark from within R, while using the dplyr functions we all know and love. [Read More]

Writing your thesis with bookdown

This post details some tips and tricks for writing a thesis/dissertation using the bookdown R package by Yihui Xie. The idea of this post is to supplement the fantastic book that Xie has written about bookdown, which can be found here. I will assume that readers know a bit about R Markdown; a decent knowledge of R Markdown is going to be essential to using bookdown. The first thing to highlight is that I’m not a pandoc or LaTeX expert. [Read More]

Intro to R slides

For the Perception Action and Cognition Lab Open Science Week 2017 (University of Leeds), I gave two talks introducing R. You can see the slides below. The code for the slides can be found over at GitHub. An introduction to R In this introduction to R I focused on tools from the tidyverse, as well as trying to provide some motivation for learning R. The audience was academics and postgraduates in a psychology department. [Read More]
R  Talks