Propensity Score Matching in Python

In eCommerce and Digital Analytics it is very common for product and marketing teams to implement changes without first going through a proper controlled-release process. In this post we will elaborate on the details of Propensity Score Matching with a Python example and show how we can still obtain an estimate of the “treatment effect”.

This article is a high level overview of a Jupyter Notebook. You can always jump straight into it!

In ideal circumstances changes would be AB tested, their impact would be validated as part of a randomised controlled test and the business would be fully informed about the outcomes. However, this is not the case as often as we would like. Chances are we have all experienced the frustration of having to measure changes after they have occurred while stakeholders ask “for some numbers!”. In those cases it is impossible to draw results as accurate as those we would get with proper planning in place. We have to rely on pre/post analysis, exclude noise through segmentation and hope we have not missed anything significant!

An alternative approach is to utilise matching methods and the propensity score.

For the remainder of this article, please consider the following:

  • The Titanic dataset from Kaggle is used as a demonstration. It is an easy and well-known dataset.
  • Treatment status (T) refers to whether the passenger has a cabin (T = 1) or does not have a cabin (T = 0)
  • The treatment (T) is the variable whose impact on the survival outcome (Y) of the passenger we want to assess
  • The rest of the variables (X) will be used to estimate the propensity score
  • The propensity score is the probability that a passenger has a cabin given a set of variables X -> P(T=1|X)
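As a concrete sketch, this setup can be expressed in pandas. The sample DataFrame below is hand-made for illustration (the actual notebook reads the Kaggle CSV), but the column names match the Titanic dataset:

```python
import pandas as pd

# A minimal sketch of the setup, using a small hand-made sample
# with the same column names as the Kaggle Titanic CSV.
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1],
    "Pclass":   [1, 3, 2, 3, 1],
    "Sex":      ["female", "male", "female", "male", "male"],
    "Age":      [38.0, 22.0, 27.0, 35.0, 54.0],
    "Fare":     [71.28, 7.25, 13.0, 8.05, 51.86],
    "Cabin":    ["C85", None, None, None, "E46"],
})

# Treatment T: 1 if the passenger has a cabin, 0 otherwise
df["T"] = df["Cabin"].notna().astype(int)

# Outcome Y and covariates X used to estimate the propensity score
Y = df["Survived"]
X = pd.get_dummies(df[["Pclass", "Sex", "Age", "Fare"]], drop_first=True)

print(df["T"].tolist())  # [1, 0, 0, 0, 1]
```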

Matching Methods

A brief and well-written definition from Wikipedia can introduce us to the topic of matching.

Matching is a statistical technique which is used to evaluate the effect of a treatment by comparing the treated and the non-treated units in an observational study or quasi-experiment (i.e. when the treatment is not randomly assigned). The goal of matching is, for every treated unit, to find one (or more) non-treated unit(s) with similar observable characteristics against whom the effect of the treatment can be assessed.

The above describes exactly the kind of cases mentioned earlier; changes on site (treatments) need to be analysed but were not randomly assigned as part of a randomised test. Examples could be:

  • Specific user experiences aimed at a particular segment of users (i.e. logged-in pages, promotional pages, marketing pages)
  • Pop-ups and various messaging functionalities
  • Promo and marketing campaigns

Through matching, we intend to control for the independent variables of our data and “simulate” an appropriate control group. Subsequently we can use this control group to measure the impact of the treatment versus the counterfactuals. A counterfactual is a matched element (i.e. passenger) that did not receive the treatment but is the “closest similar” case to our treated passenger.

The key point is defining what “closest similar” means. In the example below we have 4 treated (T=1) and 4 untreated units (T=0). We have a variable X that provides the measurements. Element #2 has X=2 and is matched with element #6, which also has X=2. Element #6 is the counterfactual of #2 since they are both similar in terms of X but #6 comes from the untreated group.
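This toy example can be written as a few lines of Python. Only the match between #2 and #6 (both X=2) comes from the text; the remaining X values are made up for illustration:

```python
# A toy illustration of "closest similar": match each treated unit
# to the untreated unit with the nearest value of X.
# Only the #2 / #6 pair (both X=2) comes from the text; other values
# are illustrative.
treated   = {1: 5, 2: 2, 3: 8, 4: 4}   # id -> X, T=1
untreated = {5: 6, 6: 2, 7: 9, 8: 3}   # id -> X, T=0

matches = {
    t_id: min(untreated, key=lambda u_id: abs(untreated[u_id] - x))
    for t_id, x in treated.items()
}
print(matches[2])  # 6 -- element #6 is the counterfactual of #2
```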


Propensity Score

The propensity score is the estimated probability that an observation receives the treatment. We use the existing independent variables (i.e. demographics, fare paid) to estimate it. Usually this is done with logistic regression, from which we can obtain the probability that T equals 1 given the set of variables. In other words: P(T=1|X).

It is important that the treatment variable T depends as much as possible on the observable characteristics we already have. For example, the targeting criteria of a marketing campaign should be part of the logistic regression model. If a key piece of information that affects selection into treatment is missing, the results will likely be inaccurate. This is where domain knowledge plays a key role.

Once the propensity scores are calculated we end up with the respective probabilities in (0, 1). The propensity score is then used as the single variable to perform the matching (similar to the role X played in the example above).
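A minimal sketch of this step with scikit-learn, using synthetic covariates in place of the Titanic features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# A minimal sketch: estimate P(T=1 | X) with logistic regression.
# X here is a small synthetic covariate matrix; in the Titanic case
# it would be the encoded passenger features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Treatment assignment depends on X, so the score is learnable
T = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, T)
propensity = model.predict_proba(X)[:, 1]  # P(T=1 | X), one score per unit

# Scores lie strictly in (0, 1)
print(propensity.min() > 0, propensity.max() < 1)
```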

Propensity Score Matching

The central research paper on the topic is “The central role of the propensity score in observational studies for causal effects” by Paul R. Rosenbaum & Donald B. Rubin (1983). Highly recommended reading.

Once the propensity scores are calculated, we can proceed to the matching operation. There are a few strategies to approach this, but the simplest is 1-to-1 matching with replacement.

  • Each treated element (T=1) is matched with one untreated element (T=0)
  • Untreated elements can be matched with more than one treated element
  • Nearest neighbours with a maximum radius (a caliper) are used to identify the best match
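The strategy above can be sketched with scikit-learn's NearestNeighbors. The propensity scores below are illustrative, and the 0.2-standard-deviations caliper is a common rule of thumb rather than the notebook's exact setting:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# A sketch of 1-to-1 nearest-neighbour matching with replacement on the
# logit of the propensity score, with a maximum radius (caliper).
# The scores below are illustrative.
ps = np.array([0.8, 0.6, 0.3, 0.75, 0.62, 0.31, 0.9])
T  = np.array([1,   1,   1,   0,    0,    0,    0])

logit = np.log(ps / (1 - ps))
treated_idx   = np.where(T == 1)[0]
untreated_idx = np.where(T == 0)[0]

caliper = 0.2 * logit.std()  # rule-of-thumb maximum radius
nn = NearestNeighbors(n_neighbors=1).fit(logit[untreated_idx].reshape(-1, 1))
dist, ind = nn.kneighbors(logit[treated_idx].reshape(-1, 1))

# Keep matches only inside the caliper; untreated units may be reused
pairs = [
    (int(t), int(untreated_idx[i[0]]))
    for t, d, i in zip(treated_idx, dist, ind)
    if d[0] <= caliper
]
print(pairs)
```

Note that the first treated unit finds no counterfactual inside the caliper and is dropped, which is exactly how a caliper trades sample size for match quality.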

In the Titanic dataset example, the logit of the propensity score was used to perform the matching. When we compare the density of this variable between treated and untreated elements, we observe that the two groups overlap completely after matching.

[Figure: Logit of propensity score – before and after matching]

The above provides only a high-level view of the matching, as the propensity score contains compressed information. Furthermore, because the matching takes place on this very variable, it is expected that the post-matching densities are identical.

If we examine an individual variable (i.e. Age), a similar pattern is observed. The densities are not identical, but the plots look well balanced.

[Figure: Age density comparison]

If we add more dimensions to the plot, we can easily observe that matched elements do not always have the same values. They might differ substantially in some dimensions, yet their propensity scores are very similar. This is because the propensity score is estimated over a range of dimensions, each contributing differently.

[Figure: Illustration of matched elements]

This chart demonstrates the result of the matching. Let’s unpack the information it contains:

  • Triangles depict treated samples and circles untreated samples
  • A triangle is always matched with one (or more) circles
  • Matched elements have similar propensity scores (i.e. same colour)
  • Matched elements might differ greatly in some of their dimensions (i.e. Age) but their propensity scores are always very close. This also demonstrates that propensity scores lead to loss of information (due to the compression of multiple dimensions into a single number).

Evaluation of results

Finally we can compare the standardised mean differences between the treated (T=1) and untreated (T=0) groups before and after matching. In an ideal scenario the differences drop substantially (i.e. to less than 0.1) for most dimensions. In this particular example matching did pretty well! Orange below is after matching.

[Figure: Standardised mean differences]
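The standardised mean difference behind this comparison can be computed as follows; the two “Age” samples here are synthetic stand-ins for the real data:

```python
import numpy as np

# Standardised mean difference (SMD) for one covariate:
# |mean_treated - mean_untreated| / pooled standard deviation.
def smd(x_treated, x_untreated):
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_untreated.var(ddof=1)) / 2)
    return abs(x_treated.mean() - x_untreated.mean()) / pooled_sd

rng = np.random.default_rng(1)
age_treated   = rng.normal(40, 10, size=100)  # illustrative "Age" samples
age_untreated = rng.normal(30, 10, size=100)

print(round(smd(age_treated, age_untreated), 2))  # well above the 0.1 threshold
```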

Lastly, we want to estimate the Average Treatment Effect on the Treated (ATT):

ATT = E[Y1 − Y0 | T = 1]

The translation of the metric above is:

  • What is the expected impact of the treatment on the outcome variable, compared to the counterfactual outcome?
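With the matched pairs in hand, the ATT estimate is simply the average treated-minus-counterfactual outcome difference. The outcomes and pairs below are illustrative:

```python
import numpy as np

# ATT estimate after matching: the average difference between each
# treated unit's outcome and its matched counterfactual's outcome.
y = np.array([1, 1, 0, 0, 1, 0])   # outcomes (illustrative)
pairs = [(0, 3), (1, 4), (2, 5)]   # (treated index, matched untreated index)

att = np.mean([y[t] - y[c] for t, c in pairs])
print(att)  # mean of [1, 0, 0] = 1/3 ≈ 0.33
```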


Propensity Score Matching can be used to infer the directional impact of a treatment and, to a lesser degree, the magnitude of that impact. The latter is because we discard part of the information (i.e. not all passengers are utilised in the assessment). In any case, the process is very helpful in cases where full AB testing is very costly or simply impossible.

It’s important to note that there are conditions that need to be satisfied for the process to be reliable:

  • Common support – there is overlap in the propensity scores of treated and untreated elements
  • Unconfoundedness assumption – all the variables affecting both the treatment T and the outcome Y are observed

The above means that the process is not always applicable, nor are its results always accurate. Additionally, there is research suggesting that, due to its information-compressing nature, logistic regression is not the optimal method.

For more details about the Titanic dataset, please have a look at the Python notebook.

This is still a work in progress, but I look forward to suggestions/recommendations on how it is most appropriate to approach this or similar cases.