Bayesian AB Testing: Leveraging Past Experiments for Better Insights
Causal Data Science
Understanding Priors in Randomized Experiments
Randomized experiments, commonly known as AB tests, are the industry benchmark for estimating causal relationships. By randomly assigning a treatment—such as a new product or feature—to a subset of participants, we can, on average, attribute differences in outcomes (like revenue or user engagement) directly to that treatment. Established companies, such as Booking.com, routinely conduct thousands of concurrent AB tests, while emerging firms like Duolingo credit much of their progress to a robust culture of experimentation.
With experiments run at this frequency, a natural question arises: can we use data from previous tests to inform the analysis of a new one? This article addresses that question through the lens of the Bayesian approach to AB testing. The Bayesian framework is particularly well suited for the task, since it allows us to combine existing knowledge (the prior) with new data. However, the method is sensitive to assumptions about functional form: seemingly minor choices, such as how heavy the tails of the prior distribution are, can lead to significantly different estimates.
Search and Infinite Scrolling
For the subsequent analysis, we will employ a simplified example inspired by Azevedo et al. (2019): a search engine aiming to enhance its ad revenue without compromising search quality. Our company, which has a strong experimentation ethos, continuously tests ideas to optimize our landing page. One promising concept we've devised is infinite scrolling—allowing users to scroll indefinitely for more results, rather than navigating through discrete pages.
To assess the efficacy of infinite scrolling, we conducted an AB test by randomly assigning users to either a treatment group, which experienced the new scrolling feature, or a control group. Utilizing the data-generating process dgp_infinite_scroll() from src.dgp, I created a parent class that manages randomization and data creation, with specific subclasses for particular scenarios. Additionally, I incorporated plotting functions and libraries from src.utils, and utilized Deepnote, a collaborative Jupyter-like platform, to manage both code and data.
We gathered data on 10,000 website visitors, tracking the monthly ad_revenue they generated, their treatment indicator infinite_scroll, and their average monthly past_revenue.
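The author's dgp_infinite_scroll() lives in the companion repository and is not reproduced here. As a rough, hypothetical stand-in, the data could be simulated along these lines (the coefficients and distributions below are illustrative assumptions, not the original DGP):

```python
import numpy as np
import pandas as pd

def dgp_infinite_scroll(n=10_000, treatment_effect=0.15, seed=1):
    """Illustrative stand-in for src.dgp.dgp_infinite_scroll()."""
    rng = np.random.default_rng(seed)
    past_revenue = rng.gamma(shape=2.0, scale=1.0, size=n)   # skewed, positive revenue
    infinite_scroll = rng.binomial(1, 0.5, size=n)           # randomized treatment
    ad_revenue = (
        0.5
        + 0.8 * past_revenue                                 # past revenue predicts current
        + treatment_effect * infinite_scroll                 # true effect of the feature
        + rng.normal(0, 1, size=n)                           # idiosyncratic noise
    )
    return pd.DataFrame({
        "ad_revenue": ad_revenue,
        "infinite_scroll": infinite_scroll,
        "past_revenue": past_revenue,
    })

df = dgp_infinite_scroll()
```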
The random assignment of treatment groups renders the difference-in-means estimator unbiased. This means we expect the treatment and control groups to be comparable on average, thus allowing us to attribute the observed differences in outcomes directly to the treatment effect. We can estimate this effect using linear regression, interpreting the coefficient for infinite_scroll as the treatment effect.
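A minimal sketch of that regression, assuming the simulated df above and the statsmodels formula API:

```python
import statsmodels.formula.api as smf

# The coefficient on infinite_scroll is the difference-in-means
# estimate of the average treatment effect
ols_short = smf.ols("ad_revenue ~ infinite_scroll", data=df).fit()
print(ols_short.summary().tables[1])
```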
It appears that implementing infinite_scroll was indeed beneficial: it increased average monthly revenue by $0.1524, an effect that is statistically significant at the 1% level.
To enhance the precision of our estimator, we can control for past_revenue in our regression model. While we do not anticipate a significant change in the estimated coefficient, we expect improved precision (for further insights on control variables, refer to my other articles on CUPED and DAGs).
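The extended regression simply adds past_revenue to the formula; something like:

```python
# Controlling for past_revenue: the point estimate should barely move,
# but the standard error of the treatment coefficient should shrink
ols_long = smf.ols("ad_revenue ~ infinite_scroll + past_revenue", data=df).fit()
print(ols_long.summary().tables[1])
```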
Indeed, past_revenue is a strong predictor of current ad_revenue, resulting in a one-third increase in the precision of the estimated coefficient for infinite_scroll.
Thus far, the process has been standard. However, consider that this is not the only experiment we've conducted to optimize our browser (and consequently, ad revenue). Infinite scrolling is just one among numerous ideas we've tested previously. Is there a way to effectively incorporate this additional information?
Bayesian Statistics
One of the primary benefits of Bayesian statistics over the frequentist methodology is its capacity to seamlessly integrate supplementary information into a model. This principle is rooted in Bayes' Theorem, which enables us to reformulate the inference problem: moving from the model's probability given the data to the data's probability given the model, a significantly simpler task.
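In symbols, Bayes' theorem reads:

$$
\Pr(\text{model} \mid \text{data}) \;=\; \frac{\Pr(\text{data} \mid \text{model}) \cdot \Pr(\text{model})}{\Pr(\text{data})}
$$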
The right-hand side of Bayes' Theorem can be divided into two components: the prior and the likelihood. The likelihood represents the information about the model derived from the data, while the prior embodies any additional information concerning the model.
Let's relate Bayes' theorem to our specific context. We need to identify the data, model, and object of interest:
- The data consists of our outcome variable ad_revenue (denoted as y), the treatment infinite_scroll (denoted as D), and other variables like past_revenue and a constant (collectively represented as X).
- The model refers to the distribution of ad_revenue given past_revenue and the infinite_scroll feature (expressed as y|D,X).
- Our object of interest is the posterior probability Pr(model | data), particularly the relationship between ad_revenue and infinite_scroll.
How can we utilize prior information in the context of AB testing, potentially incorporating additional covariates?
Bayesian Regression
To facilitate a direct comparison with the frequentist approach, we'll utilize a linear model:

$$
y_i = \alpha X_i + \beta D_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)
$$

This parametric model includes two sets of parameters: the linear coefficients α and β, as well as the variance of the residuals σ². A more Bayesian representation of the model is:

$$
y_i \mid D_i, X_i;\ \alpha, \beta, \sigma^2 \ \sim\ N(\alpha X_i + \beta D_i,\ \sigma^2)
$$
Here, the semi-colon separates the data from the model parameters. Unlike the frequentist method, Bayesian regressions do not depend on the central limit theorem to approximate the conditional distribution of y; instead, we directly assume it to be normal.
Our focus is on inferring the model parameters α, β, and σ². A key distinction between the frequentist and Bayesian methodologies is that the former treats model parameters as fixed and unknown, while the latter views them as random variables.
This perspective has significant implications: it allows the straightforward integration of prior information about the model parameters through prior distributions. As the term suggests, priors encapsulate information available before examining the data. This raises a critical question in Bayesian statistics: How do you select a prior?
Priors
When selecting a prior, a mathematically appealing constraint is to choose a prior distribution such that the posterior remains within the same family. These are termed conjugate priors. For instance, prior to examining the data, one might assume the treatment effect is normally distributed and wish for it to remain so after incorporating the data's insights.
In Bayesian linear regression, the conjugate priors are a normal distribution for the linear coefficients α and β, and an inverse-gamma distribution for the variance σ². To begin, let's use a standard normal prior for the coefficients and an inverse-gamma prior for the variance.
We will utilize the probabilistic programming library PyMC for inference. Initially, we must define the model: specifying the prior distributions for various parameters and the data's likelihood.
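A sketch of what the model specification might look like in PyMC (the variable names and the inverse-gamma hyperparameters here are my own choices, not necessarily the article's):

```python
import pymc as pm

with pm.Model() as model:
    # Priors: standard normal for the linear coefficients,
    # inverse gamma for the residual variance
    alpha = pm.Normal("alpha", mu=0, sigma=1, shape=2)   # intercept + past_revenue
    beta = pm.Normal("beta", mu=0, sigma=1)              # treatment effect
    sigma2 = pm.InverseGamma("sigma2", alpha=3, beta=1)  # residual variance

    # Likelihood: ad_revenue is normal given treatment and covariates
    mu = alpha[0] + alpha[1] * df["past_revenue"] + beta * df["infinite_scroll"]
    pm.Normal("y", mu=mu, sigma=pm.math.sqrt(sigma2), observed=df["ad_revenue"])
```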
PyMC offers an excellent function called model_to_graphviz, which visualizes the model as a graph.
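With the model defined as above, the call is a one-liner (it requires the graphviz library to be installed):

```python
# Render the model's structure as a directed graph
pm.model_to_graphviz(model)
```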
From this graphical representation, we can observe the various model components, their distributions, and their interactions.
We are now prepared to compute the posterior of the model. The process involves sampling realizations of the model parameters, calculating the likelihood of the data given those values, and deriving the corresponding posterior.
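Assuming the model context from the sketch above, sampling might look like:

```python
with model:
    # Draw posterior samples with PyMC's default NUTS sampler
    idata = pm.sample(draws=2000, tune=1000, random_seed=1)
```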
Historically, the requirement for sampling in Bayesian inference has posed a significant challenge, making it considerably slower than the frequentist approach. However, this concern is diminishing with advancements in computational capabilities.
We are now set to examine the results. Using the summary() function, we can generate a model summary akin to those produced by the statsmodels package commonly used for linear regression.
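PyMC stores the draws in an ArviZ InferenceData object, so the summary comes from ArviZ:

```python
import arviz as az

# Posterior means, standard deviations, and credible intervals
az.summary(idata, hdi_prob=0.95)
```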
The estimated parameters closely align with those obtained through the frequentist method, with the estimated effect of infinite_scroll at 0.157.
While sampling has the drawback of being slower, it provides a substantial advantage: it is highly transparent. We can easily visualize the posterior distribution. For instance, let's plot the distribution for the treatment effect β. The plot_posterior function allows us to display the posterior distribution, complete with a black bar representing the Bayesian equivalent of a 95% confidence interval.
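A sketch using the names from the model above; in recent PyMC versions the plotting is delegated to ArviZ, so the same plot is available as az.plot_posterior:

```python
# Posterior of the treatment effect, with a 95% HDI bar
az.plot_posterior(idata, var_names=["beta"], hdi_prob=0.95)
```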
As anticipated, since we selected conjugate priors, the posterior distribution appears Gaussian.
Thus far, we've chosen our prior without much guidance. However, suppose we had access to previous experiments. How can we integrate that specific information?
Past Experiments
Let’s consider that the infinite scroll was merely one among numerous other ideas we explored previously. For each idea, we possess data from the corresponding experiment, along with the estimated coefficients.
We have generated 1000 estimates from earlier experiments. How can we utilize this additional information?
#### Normal Prior
One approach could be to calibrate our prior to reflect the distribution of past data. By maintaining the normality assumption, we can use the estimated mean and standard deviation from prior experiments.
On average, these earlier ideas had a negligible effect on ad_revenue, with an average effect of 0.0009. Nevertheless, there was considerable variability across experiments, resulting in a standard deviation of 0.029.
Let's reformulate the model, employing the mean and standard deviation of past estimates for the prior distribution of β.
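Only the prior on the treatment effect changes; a sketch, reusing the structure from before and plugging in the moments reported above:

```python
with pm.Model() as model_informed:
    # Prior on the treatment effect calibrated on past experiments
    beta = pm.Normal("beta", mu=0.0009, sigma=0.029)

    # Everything else is unchanged
    alpha = pm.Normal("alpha", mu=0, sigma=1, shape=2)
    sigma2 = pm.InverseGamma("sigma2", alpha=3, beta=1)
    mu = alpha[0] + alpha[1] * df["past_revenue"] + beta * df["infinite_scroll"]
    pm.Normal("y", mu=mu, sigma=pm.math.sqrt(sigma2), observed=df["ad_revenue"])
```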
Let's sample from the model and visualize the posterior distribution of the treatment effect parameter β.
The estimated coefficient is notably lower: 0.11 compared to the previous estimate of 0.16. Why is this the case?
The previous coefficient of 0.16 is exceedingly unlikely given our prior. We can compute the probability of achieving the same or a more extreme value based on the prior.
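With scipy, the one-sided tail probability under the calibrated prior is:

```python
from scipy import stats

# Probability of a value at least as extreme as the frequentist
# estimate (0.157) under the prior N(0.0009, 0.029)
print(1 - stats.norm(loc=0.0009, scale=0.029).cdf(0.157))  # ~4e-08
```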
The likelihood of this value is practically zero. Consequently, the estimated coefficient shifts closer to the prior mean of 0.0009.
#### Student-t Prior
Thus far, we have assumed a normal distribution for all linear coefficients. Is this assumption valid? Let's visually assess this (see here for alternative methods of comparing distributions), starting with the intercept coefficient α.
The distribution appears quite normal. Now, let's examine the treatment effect parameter β.
The distribution exhibits a notably heavy-tailed nature! While it resembles a normal distribution at the center, the tails are significantly "fatter," with a few extreme values present. Excluding measurement errors, this scenario is common in the industry, where most ideas yield minimal or null effects, while a select few result in substantial breakthroughs.
One suitable model for this distribution is the Student-t distribution. Specifically, we will use a Student-t prior with location 0.0009, scale 0.003, and 1.3 degrees of freedom, to align with the empirical moments of past estimates.
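In PyMC, this just means swapping the prior on the treatment effect for a StudentT distribution:

```python
with pm.Model() as model_student:
    # Fat-tailed prior on the treatment effect, matched to past estimates
    beta = pm.StudentT("beta", nu=1.3, mu=0.0009, sigma=0.003)

    # The rest of the model is as before
    alpha = pm.Normal("alpha", mu=0, sigma=1, shape=2)
    sigma2 = pm.InverseGamma("sigma2", alpha=3, beta=1)
    mu = alpha[0] + alpha[1] * df["past_revenue"] + beta * df["infinite_scroll"]
    pm.Normal("y", mu=mu, sigma=pm.math.sqrt(sigma2), observed=df["ad_revenue"])
```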
Let's sample from the model and visualize the posterior distribution of the treatment effect parameter β.
The estimated coefficient, 0.11, is once again similar to the one obtained with the moment-matched normal prior. However, this estimate is more precise: the 95% credible interval has shrunk from [0.077, 0.16] to [0.065, 0.15].
What accounts for this change?
#### Shrinking
The explanation lies in the shapes of the various prior distributions we have employed:
- Standard normal, N(0,1)
- Normal with matched moments, N(0, 0.03)
- Student-t with matched moments and 1.3 degrees of freedom, t(0, 0.003)
Let’s plot all of them together.
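A sketch with scipy densities (interpreting the second argument of each distribution as its scale):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(-0.15, 0.15, 1_000)
plt.plot(x, stats.norm(0, 1).pdf(x), label="N(0, 1)")
plt.plot(x, stats.norm(0, 0.03).pdf(x), label="N(0, 0.03)")
plt.plot(x, stats.t(df=1.3, loc=0, scale=0.003).pdf(x), label="t(0, 0.003), ν = 1.3")
plt.xlabel("Treatment effect")
plt.ylabel("Prior density")
plt.legend()
plt.show()
```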
As illustrated, all the distributions are centered on zero, but their shapes differ markedly. The standard normal distribution is essentially flat over the interval [-0.15, 0.15], assigning roughly equal prior plausibility to every value in that range. The last two distributions, by contrast, share the same mean and variance but have very different shapes.
How does this impact our estimation? We can visualize the implied posterior for various estimates under each prior distribution.
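One way to build such a plot is to compute, for a grid of hypothetical experimental estimates, the posterior mean implied by each prior, assuming a Gaussian likelihood with a fixed standard error (the se=0.03 below is an illustrative value, not taken from the article):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def posterior_mean(estimate, prior_pdf, se=0.03,
                   grid=np.linspace(-0.5, 0.5, 4_001)):
    # Discretized posterior over the true effect: likelihood x prior
    posterior = stats.norm(grid, se).pdf(estimate) * prior_pdf(grid)
    return np.sum(grid * posterior) / np.sum(posterior)

priors = {
    "N(0, 1)": stats.norm(0, 1).pdf,
    "N(0, 0.03)": stats.norm(0, 0.03).pdf,
    "t(0, 0.003)": stats.t(df=1.3, loc=0, scale=0.003).pdf,
}
estimates = np.linspace(-0.15, 0.15, 61)
for label, pdf in priors.items():
    shrunk = [posterior_mean(e, pdf) for e in estimates]
    plt.plot(estimates, shrunk, label=label)
plt.plot(estimates, estimates, "k--", label="no shrinkage")
plt.xlabel("Experimental estimate")
plt.ylabel("Implied posterior mean")
plt.legend()
plt.show()
```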
As shown, the different priors influence experimental estimates in diverse ways. The standard normal prior has a negligible effect on estimates within the [-0.15, 0.15] range. The moment-matched normal prior shrinks each estimate to roughly two-thirds of its value (in our experiment, from 0.157 to 0.11), while the Student-t prior has a non-linear effect: it shrinks smaller estimates towards zero while leaving larger estimates largely untouched.
Conclusion
In this article, we explored how to enhance the analysis of AB tests by incorporating information from past experiments. We introduced the Bayesian approach to AB testing and highlighted the significance of selecting an appropriate prior distribution. Given identical mean and variance, a prior distribution with "fat tails" (high kurtosis) results in more pronounced shrinkage of smaller effects while allowing larger effects to remain largely unchanged.
The intuition here is that a prior distribution with "fat tails" suggests that breakthrough ideas are infrequent but not impossible. This understanding carries practical implications post-experiment, as discussed, and pre-experiment as well. According to Azevedo et al. (2020), if you believe the effects of your ideas are more "normal," it’s optimal to conduct fewer but larger experiments to detect smaller effects. Conversely, if you think your ideas follow a "breakthrough or nothing" model—where effects are fat-tailed—it is more strategic to run smaller but numerous experiments, as large sample sizes are unnecessary to identify substantial effects.
References
- E. Azevedo, A. Deng, J. L. Montiel Olea, E. G. Weyl, "Empirical Bayes Estimation of Treatment Effects with Many A/B Tests: An Overview" (2019), AEA Papers and Proceedings.
- E. Azevedo, A. Deng, J. L. Montiel Olea, J. Rao, E. G. Weyl, "A/B Testing with Fat Tails" (2020), Journal of Political Economy.
- A. Deng, "Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments" (2015), WWW '15 Companion.
Code
You can find the original Jupyter Notebook here:
Blog-Posts/bayes_ab.ipynb at main · matteocourthoud/Blog-Posts
Thank you for reading!
I really appreciate it! If you enjoyed the article and want to see more, consider **following me**. I publish weekly on topics related to causal inference and data analysis, aiming to keep my posts straightforward yet precise, always including code, examples, and simulations.
Please note: I write to learn, so mistakes are common, although I strive for accuracy. If you notice any errors, please let me know. Suggestions for new topics are also welcome!