# Causal Analysis Introduction - Examples in Python and PyMC

By Paul English, Sun 27 November 2016, in category Statistics

Correlation is not causation is a common mantra. It's easy to find examples of spurious correlation in the world and in many data sets. It's wise not to assume causal effect in many cases, but there are times when we want to infer causal relationships in datasets.

One model that attempts to look at this is the Rubin model of causal effect. We'll explore this idea, and recreate some analysis. This model is typically useful for observational data. It's not the only paradigm for estimating causality, another approach revolves around ideas of structural equation modeling, and might be more general and useful.

## The Core Problem in Measuring Causal Effect

Borrowing from the wikipedia article on the Rubin causal model, image we have a few people partaking in a study where we want to see if some treatment helps or hinders a subject's outcome. We first observe what happens to each subject without the treatment, then we grab a time machine and restart the experiment having administered treatment to all the subjects. From this we would be able to directly observe how the treatment effected each subject.

$Y(1)$ $Y(0)$ $Y(1) - Y(0)$
Joe $-5$ $5$ $-10$
Mary $-10$ $-5$ $-5$
Sally $0$ $10$ $-10$
Bob $-20$ $-5$ $-15$
Mean $-10$

In this table we use the notation $Y(1)$ as outcome having been treated (treatment), and $Y(0)$ as the outcome if there was no treatment (control). With full observations we could then simply look at the difference in outcome $Y(1) - Y(0)$, then average them.

### Causal Effect since we don't have a time machine

In general we're typically not going to be able to fully observe the outcomes for each subject. Typically we'll have a dataset with partial observations,

$Y(1)$ $Y(0)$ $Y(1) - Y(0)$
Joe $?$ $5$ $?$
Mary $-10$ $?$ $?$
Sally $?$ $10$ $?$
Bob $-20$ $?$ $?$
Mean $?$

We could try averaging each treatment and control group, then estimating the differences.

$Y(1)$ $Y(0)$ $Y(1) - Y(0)$
Joe $?$ $5$ $?$
Mary $-10$ $?$ $?$
Sally $?$ $10$ $?$
Bob $-20$ $?$ $?$
Mean $-15$ $7.5$ $?$

Average treatment effect: $\bar{Y}(1) - \bar{Y}(0) = -22.5$

This might be a useful approach, but in this example our treatment effect is quite a bit different from what we had with full observations earlier. It is also relatively easy to come up with examples where missing observations can greatly skew or change the estimated average treatment. Increasing samples in general will lead to better estimates, but there's more that can be done to improve simple causal estimates.

## Estimating Causal Effect

In general we cannot directly observe causal effect. We have to look at methods that estimate this, similar to how we averaged the treatment & control group above.

### Substitutes to Potential Outcome

A common and sometimes accidentally applied method is substituting the potential outcome.

Examples:

• Maybe you can repeat treatment, e.g. drinking tea before bed.
• Dividing up a piece of plastic and exposing it to a corrosive chemical.
• The effect of a diet over time by measuring weight.

These tend to carry strong assumptions that may be implicit in the choice of substitution. They can be regularly found in research, and though the assumptions can be left out or be part of a bad study, are generally ok in many situations.

Drinking tea before bed one night probably doesn't have a huge effect on successive nights, unless you're a habitual caffeine drinker, in which case you might have less than useful results.

### The Randomized Controlled Trial

The typical gold standard when it comes to measuring causal effects is the randomized controlled trial. The goal is to attempt comparing apples with apples by controlling for various factors. We use randomization to try to avoid introducing bias.

This model is not always perfect and has a few issues that can come up, that might make it not possible to conduct an experiment in this manner:

• It could be cost prohibitive.
• Participants could self-select into the treatment group, e.g. company benefits programs.
• You may not be involved in the study design, and only receive data post-hoc.
• It might be unethical to control treatment.

The goal of statistical adjustment is to approximate what a randomized experiment would achieve. The idea is to use statistical ideas to find similar units to compare. This is what we'll explore further with our causal model, and with pairwise matching, though there are many things that could fall into this category.

## The Rubin Causal Model

$Y(1)$ $Y(0)$ $Y(1) - Y(0)$
Joe $?$ $5$ $?$
Mary $-10$ $?$ $?$
Sally $?$ $10$ $?$
Bob $-20$ $?$ $?$
Mean $?$ $?$ $?$

What if we can estimate the unknown values which we can't observe in practice? We could look at building a linear regression on whether or not a subject had treatment, we'll use the indicator $D_i$ to represent treatment for some subject $i$.

$$\hat{Y}_i = \alpha + \tau D_i + \epsilon_i$$

From this, we can make use of the typical interpretation of linear model coefficients and say that $\tau$ is a value the represents how the outcome differes whether the subject received treatment or not.

We formulate the least squares estimator, $$(\hat{\tau}, \hat{\alpha}) = \arg\min_{\tau, \alpha} \sum_{i=1}^N (Y - \alpha - \tau D_i)^2$$

and solve for $\hat{\tau}$,

$$\hat{\tau} = \frac{\sum_{i=1}^N (D_i - \bar{D}) (\hat{Y} - \bar{Y})}{\sum_{i=1}^N (D_i - \bar{D})^2}$$

which can be shown to be equal to, $\hat{\tau} = \bar{Y}(1) - \bar{Y}(0)$. This estimator $\hat{tau}$ is going to be similar to the difference in average outcomes of treatment status. This is the core idea for estimating causal effect with this framework.

We can continue further by adding in features that help improve the model. One might wish to control for various covariates $X_i$, or might believe that the treatment has interaction effects with those covariates. These lead to the corresponding regression models:

• Controlling for covariates: $\hat{Y}_i = \alpha + \tau D_i + \beta X_i + \epsilon_i$
• Interaction with the treatment: $\hat{Y}_i = \alpha + \tau D_i + \beta X_i + \gamma D_i (X_i - \bar{X}) + \epsilon_i$

In both of these models we're still only concerned with interpreting the coefficient $\tau$.

### Assumptions in This Model

What are we assuming here?

We won't look in depth at these assumptions here, but it's important to know that the formulations of these models assume that the treatment is Strongly Ignorable. This means that we want the following two traits,

• Unconfoundedness: $D$ is independent of $(Y(0), Y(1))$ conditional on $X=x$, e.g. the treatment of one group does not affect the other group.
• Overlap: $c < \mathbb{P}(D=1 | X=x) < 1 - c$, for $c > 0$.

These assumptions are likely where you'll find the most criticism of this type of model, and warrants actual study of the source material which I've referenced below.

## Matching

I mentioned that we were going to look at statistical adjustment, in particular matching. Matching is used when you have imbalance data sets. Usually a large amount of control samples and a small amount of treatment samples. It's an attempt to remove the imbalance in these groups and approximate what you might have had given a randomized or controlled experiment.

We'll look at two methods: propensity matching, and using Mahalanobis distance between samples. Propensity matching is definitely the most common method, though there are criticisms against using it blindly.

### Propensity Score Matching

A propensity score is an approximate model of how likely a subject is to have been in the treatment group, $P(D_i = 1 | X_i)$. It's solved easily with an intermediate regression, typically something like

$$D_i = \text{logistic}(\alpha + \beta X_i + \epsilon_i)$$

Once built, we can then estimate propensities for both of our treatment and control groups. Hopefully we'll find that there are some subjects in the treatment group with low propensity, but the majority have high propensity (they did actually end up in the treatment group), and the opposite for the control group.

We would then find paired (or many) matches in the control group for each sample in the treatment group picking matches that minimize the differences in propensity score. Since we're dealing with estimates, it's ok if this isn't an optimal match, and we can usually use a simple algorithm like a greedy random match.

### Distance Matching

Using a measure like Mahalanobis distance is also an effective way to match groups, and has the bonus of not introducing an intermediate model.

We would look at picking a match that minimizes distance between covariates,

$$m(i) = \arg \min_{j:W_j \ne W_i} || X_j - X_i ||$$

Where we define $||X_j - X_i||$ as a distance between the covariate vectors $X_j$ and $X_i$ as follows:

$$||X_j - X_i|| := (X_j - X_i)^\prime W (X_j - X_i).$$

where $W$ is defined as

$$W = \text{diag} { \hat{\sigma}^{-2}_1 , \dots, \hat{\sigma}^{-2}_K }.$$

This is a special case of Mahalanobis distance that's called the Normalized Euclidean distance.

## Examples

Let's start looking at this with a few different examples on the same data set. This dataset can be found here, and is one that was used in particular with a propensity matched study. The data is a population survey including some basic demographics about each subject, and information about their salary. Our treatment group is filled with subjects who partook in an occupational training program. We expect to see that having gone through training has a positive end effect on salary.

# NSW Data Files (Dehejia-Wahha Sample)
# http://users.nber.org/~rdehejia/data/nswdata2.html

cps = pd.read_csv('data/lalonde/cps_controls.txt', delimiter=' ', names=[
'treat', 'age', 'educ', 'black', 'hisp', 'married', 'nodegr',
're74', 're75', 're78'
])

original_treatment = pd.read_csv('data/lalonde/nswre74_treated.txt', delimiter=' ', names=[
'treat', 'age', 'educ', 'black', 'hisp', 'married', 'nodegr',
're74', 're75', 're78'
])

unmatched_raw = pd.concat([
cps, original_treatment
], axis=0)

unmatched_raw['u74'] = (unmatched_raw.re74 == 0).astype(np.int32)
unmatched_raw['u75'] = (unmatched_raw.re75 == 0).astype(np.int32)

unmatched_raw[['re74_div', 're75_div', 're78_div']] = unmatched_raw[['re74', 're75', 're78']] / 1000


[code gutter="false"] (16177, 15) [/code]

There are a lot of samples, and a lot of imbalance between the control and treatment group. We have only 185 treatment observations, and 15,992 control observations. It's likely that if we built a causal model on these all of these samples, that we'd see the large number of control samples wash out the value of the treatment indicator. Thus this is a great dataset to use with matching.

If we look at the compared density plots of the controls and the treatments we see the following:

sns.kdeplot(unmatched_raw[unmatched_raw.treat == 0].re78, label='Control')
sns.kdeplot(unmatched_raw[unmatched_raw.treat == 1].re78, label='Treatment');


Unfortunately this hides how imbalanced these two sets are. If we look instead at a histogram of our data (which doesn't normalize like the above plot) we will see the actual imbalance.

plt.hist(unmatched_raw[unmatched_raw.treat == 0].re78, label='Control', bins=100);
plt.hist(unmatched_raw[unmatched_raw.treat == 1].re78, label='Treatment', bins=100);
plt.legend();


### Using the causalinference library

The first thing we're going to do is look at a well made library implemented to specifically deal with this type of causal model, the causalinference package in python. It's very straightforward to run several models quickly.

from causalinference import CausalModel

causal_model = CausalModel(
X=unmatched_raw[['age', 'educ', 'black', 'hisp', 'married', 'nodegr', 're74_div', 're75_div', 'u74', 'u75']].values,
D=unmatched_raw.treat.values,
Y=unmatched_raw.re78_div.values
)
causal_model.est_propensity()
causal_model.trim_s()
causal_model.stratify_s()
causal_model.est_via_ols()
causal_model.est_via_weighting()

print(causal_model.summary_stats)
print(causal_model.estimates)
print(causal_model.propensity)
print(causal_model.strata)

Summary Statistics

Controls (N_c=373)         Treated (N_t=149)
Variable         Mean         S.d.         Mean         S.d.     Raw-diff
--------------------------------------------------------------------------------
Y        4.814        6.084        6.038        8.006        1.224

Controls (N_c=373)         Treated (N_t=149)
Variable         Mean         S.d.         Mean         S.d.     Nor-diff
--------------------------------------------------------------------------------
X0       26.416       11.046       25.765        7.355       -0.069
X1       10.595        2.698       10.221        2.066       -0.156
X2        0.882        0.323        0.973        0.162        0.357
X3        0.118        0.323        0.027        0.162       -0.357
X4        0.214        0.411        0.134        0.342       -0.212
X5        0.611        0.488        0.745        0.437        0.289
X6        2.349        4.572        1.428        3.731       -0.221
X7        1.275        2.167        0.859        1.609       -0.218
X8        0.547        0.498        0.772        0.421        0.487
X9        0.491        0.501        0.671        0.471        0.371

Treatment Effect Estimates: Matching

Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
ATE      1.576      1.359      1.160      0.246     -1.087      4.239
ATC      1.357      1.632      0.832      0.406     -1.842      4.557
ATT      2.123      1.448      1.466      0.143     -0.716      4.962

Treatment Effect Estimates: Weighting

Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
ATE      1.750      0.817      2.143      0.032      0.149      3.351

Treatment Effect Estimates: OLS

Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
ATE      1.960      8.445      0.232      0.816    -14.593     18.513
ATC      1.947     10.284      0.189      0.850    -18.210     22.104
ATT      1.993     14.603      0.136      0.891    -26.629     30.615

Estimated Parameters of Propensity Score

Coef.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
Intercept     -6.341      0.812     -7.804      0.000     -7.933     -4.748
X0     -0.018      0.011     -1.683      0.092     -0.039      0.003
X1      0.019      0.049      0.401      0.688     -0.076      0.115
X2      4.286      0.263     16.319      0.000      3.771      4.801
X3      1.836      0.394      4.663      0.000      1.064      2.607
X4     -0.994      0.241     -4.120      0.000     -1.467     -0.521
X5      0.903      0.275      3.282      0.001      0.364      1.442
X6      0.063      0.028      2.197      0.028      0.007      0.118
X7     -0.177      0.036     -4.923      0.000     -0.248     -0.107
X8      1.572      0.264      5.956      0.000      1.055      2.089
X9      0.235      0.239      0.984      0.325     -0.233      0.702

Stratification Summary

Propensity Score         Sample Size     Ave. Propensity   Outcome
Stratum      Min.      Max.  Controls   Treated  Controls   Treated  Raw-diff
--------------------------------------------------------------------------------
1     0.077     0.193       229        33     0.128     0.127    -0.472
2     0.193     0.284        52        14     0.235     0.238     2.194
3     0.286     0.408        31        35     0.350     0.363     5.653
4     0.408     0.638        61        67     0.562     0.576     1.391


This library include faculties to perform matching for us. Methodology details can be found in their documentation. We run three models, and see positive treatment effects in each one, though only one is "significant". These are seen in the output above as the "ATE" output lines, where ATE stands for average treatment effect.

While I like this library and recommend it, I've had numerical issues using it on larger and more complex data sets (it's implementations try to invert some matrices which is typically a recipe for singular matrix errors). Additionally I like to use Bayesian frameworks where I can, since I tend to prefer the probabilistic interpretation of results over p-values and hypothesis tests.

### Using a simple Bayesian linear regression (PyMC)

Now I'll look at implementing my own matching routines, and using a Bayesian linear regression.

For $y \in \mathbb{R}$, and $\mathbf{x} \in \mathbb{R}^M$ with $N$ data points we have $(\mathbf{X}, \mathbf{y}) = {(\mathbf{x}_n, y_n)}$. With the following distributions,

\begin{aligned} p(\mathbf{w}) &= \mathcal{N}(\mathbf{w} | 0, \sigma_w^2 \mathbf{I}) \ p(b) &= \mathcal{N}(b | 0, \sigma_b^2) \ p(\mathbf{y} | \mathbf{w}, b, \mathbf{X}) &= \prod_{n=1}^N \mathcal{N} (y_n | \mathbf{x}_n^{\intercal} \mathbf{w} + b, \sigma_y^2) \end{aligned}

This is already implemented in PyMC with the glm functionality. Thus, all we will need to do is pass in the formula for our model.

### Propensity Matching

For propensity matching we first need some estimate of propensity score. In the interest of computational time, we'll use a standard frequentist logistic model to act as our propensity function. We do this since it's fast, and we don't plan to interpret this intermediate model in this particular analysis. We want to estimate $p(\text{treat} | X)$, we do this with the following using the popular statsmodel package.

propensity_formula = "treat ~ age + educ + black + hisp + married + nodegr + u74 + u75 + re75_div + re74_div + I(re74_div*re75_div)"
propensity_model = smf.glm(formula=propensity_formula, data=unmatched_raw, family=sm.families.Binomial()).fit()
propensity_model.summary()

Dep. Variable: No. Observations: treat 16177 GLM 16165 Binomial 11 logit 1.0 IRLS -473.41 Sat, 26 Nov 2016 946.82 14:57:12 8.10e+03 14
coef std err z P>|z| [95.0% Conf. Int.] -6.6364 0.825 -8.045 0.000 -8.253 -5.020 -0.0191 0.011 -1.788 0.074 -0.040 0.002 0.0195 0.048 0.404 0.686 -0.075 0.114 4.2899 0.263 16.329 0.000 3.775 4.805 1.8322 0.394 4.655 0.000 1.061 2.604 -0.9954 0.240 -4.147 0.000 -1.466 -0.525 0.9100 0.274 3.318 0.001 0.372 1.448 1.7318 0.280 6.178 0.000 1.182 2.281 0.3934 0.253 1.555 0.120 -0.102 0.889 -0.0871 0.056 -1.559 0.119 -0.197 0.022 0.0985 0.033 2.968 0.003 0.033 0.164 -0.0072 0.004 -1.800 0.072 -0.015 0.001

From this fitted model, we can then make propensity score estimates, and come up with some matches.

_, X_propensity = dmatrices(propensity_formula, unmatched_raw, return_type='dataframe')
propensities = propensity_model.predict(exog=X_propensity)

split = cps.shape[0]

treatment = np.expand_dims(propensities[split:], 1)
control = np.expand_dims(propensities[:split], 1)

# Calculating the distance in propensity of each treatment sample to each control sample
distances = cdist(treatment, control)

# Computing an optimal match between the two groups
treatment_idx, control_idx = linear_sum_assignment(distances)

treatment_data = unmatched_raw[split:].iloc[treatment_idx]
control_data = unmatched_raw[:split].iloc[control_idx]


Here, we compute an optimal match, though this is not always very efficient (or needed). As mentioned earlier typically we can compute a random greedy match with or without replacement and achieve similar results. This will run much faster since it's just a simple loop. Since our treatment group is small enough I wanted to try the optimal matching method.

Now if we compare our density plots we see much more similarity in our groups,

sns.kdeplot(treatment_data.re78, label='Treatment')
sns.kdeplot(control_data.re78, label='Control')


And if we look at the comparative histograms we see that the control group doesn't overwhelm our treatment group like it did before,

plt.hist(treatment_data.re78, label='Treatment', bins=100)
plt.hist(control_data.re78, label='Control', bins=100)
plt.legend();


Now that we've got a treatment & matched control group we can put these together and build a model to estimate treatment effect. I do include the covariates in this model ($\hat{Y}_i = \alpha + \tau D_i + \beta X_i + \epsilon_i$, the second regression noted above), but it would just require updating the formula to look at just a regression on the treatment indicator, or to add in the interaction terms with the treatment indicator.

data_propensity_matched = pd.concat([
treatment_data,
control_data
], axis=0)

causal_formula = "re78_div ~ treat + age + educ + black + hisp + married + nodegr + u74 + u75 + re74_div + re75_div"

with pm.Model() as model:
pm.glm.glm(causal_formula, data_propensity_matched)

plt.plot(-np.log10(-propensity_v_params.elbo_vals))

print('Sampling')
propensity_samples = pm.sample(
2000,
step=pm.NUTS(scaling=propensity_v_params.means),
start=propensity_v_params.means,
progressbar=True,
njobs=3
)

# Iteration 0 [0%]: ELBO = -45564.78
# Iteration 8000 [10%]: Average ELBO = -179941.26
# Iteration 16000 [20%]: Average ELBO = -1669.65
# Iteration 24000 [30%]: Average ELBO = -1434.98
# Iteration 32000 [40%]: Average ELBO = -1425.37
# Iteration 40000 [50%]: Average ELBO = -1424.45
# Iteration 48000 [60%]: Average ELBO = -1424.24
# Iteration 56000 [70%]: Average ELBO = -1423.82
# Iteration 64000 [80%]: Average ELBO = -1423.97
# Iteration 72000 [90%]: Average ELBO = -1423.87
# Finished [100%]: Average ELBO = -1423.64
# Sampling
# 100%|██████████| 2000/2000 [00:21<00:00, 94.80it/s]


pm.traceplot(propensity_samples[1000:], varnames=['treat']);


We only output the trace plot for the treatment variable, since it's what we're interested in currently. If you show all the sampled parameters they do look like they've converged sufficiently, one could check this more thoroughly with an autocorrelation on these samples. If we look at what the parameters means are,

df_propensity_samples = pm.trace_to_dataframe(propensity_samples[1000:])
df_propensity_samples.mean(0)

# age         -0.046268
# nodegr      -0.380227
# educ         0.385656
# u74          1.459713
# Intercept    2.051145
# u75         -0.343626
# hisp         0.218121
# married     -0.473846
# re75_div     0.323274
# re74_div     0.233762
# treat        1.463496
# sd           6.977029
# black       -1.679199
# dtype: float64


The average for treat, is 1.46. That is similar to the above ATE values we saw with the causalinference package. What is the sampled probability that this effect is above zero?

(df_propensity_samples['treat'] &gt; 0).mean()

# 0.97499999999999998


### Mahalanobis Matching

Now let's use the Mahalanobis idea for matching. This is going to be roughly the same once we've gotten a newly matched data set.

treatment = X_propensity[split:]
control = X_propensity[:split]

VI = np.diag(X_propensity.var())
distances = cdist(treatment, control, 'mahalanobis', VI=VI)

treatment_idx, control_idx = linear_sum_assignment(distances)

treatment_data = unmatched_raw[split:].iloc[treatment_idx]
control_data = unmatched_raw[:split].iloc[control_idx]


Here, we specify the diagonal matrix VI so that we have the normalized Euclidean distance case. The other lines are similar to the propensity matching. Once again we can look at the distribution plot estimation, and we see that it's very similar to the propensity matched grouping.

sns.kdeplot(treatment_data.re78, label='Treatment')
sns.kdeplot(control_data.re78, label='Control');


Likewise, our histograms are similar.

plt.hist(treatment_data.re78, label='Treatment', bins=100)
plt.hist(control_data.re78, label='Control', bins=100)
plt.legend();


Using this matched group we run the same model.

data_mahalanobis_matched = pd.concat([
treatment_data,
control_data
], axis=0)

with pm.Model() as model:
pm.glm.glm(causal_formula, data_propensity_matched)

plt.plot(-np.log10(-mahalanobis_v_params.elbo_vals))

print('Sampling')
mahalanobis_samples = pm.sample(
2000,
step=pm.NUTS(scaling=mahalanobis_v_params.means),
start=mahalanobis_v_params.means,
progressbar=True,
njobs=4
)

# Iteration 0 [0%]: ELBO = -16156.39
# Iteration 8000 [10%]: Average ELBO = -216377.4
# Iteration 16000 [20%]: Average ELBO = -1637.49
# Iteration 24000 [30%]: Average ELBO = -1434.52
# Iteration 32000 [40%]: Average ELBO = -1425.37
# Iteration 40000 [50%]: Average ELBO = -1424.39
# Iteration 48000 [60%]: Average ELBO = -1424.08
# Iteration 56000 [70%]: Average ELBO = -1424.0
# Iteration 64000 [80%]: Average ELBO = -1423.79
# Iteration 72000 [90%]: Average ELBO = -1423.99
# Finished [100%]: Average ELBO = -1423.85
# Sampling
# 100%|██████████| 2000/2000 [00:21<00:00, 93.26it/s]


pm.traceplot(mahalanobis_samples[1000:], varnames=['treate']);


df_mahalanobis_samples = pm.trace_to_dataframe(mahalanobis_samples[1000:])
df_mahalanobis_samples.mean(0)

# age         -0.045524
# nodegr      -0.404798
# educ         0.382626
# u74          1.406951
# Intercept    2.089522
# u75         -0.377069
# hisp         0.218565
# married     -0.501467
# re75_div     0.326605
# re74_div     0.229498
# treat        1.490877
# sd           6.975974
# black       -1.652019
# dtype: float64

(df_mahalanobis_samples['treat'] &gt; 0).mean()

# 0.97799999999999998


Here we get very similar results to the propensity model, 1.49 for a treatment effect and a high probability of 0.977 that this coefficient is above zero, given our samples.

### Bonus - Estimating Causal Effect with Counterfactual Estimators

Now, let's take a step back and see what treatment effect might be if we build two estimators,

\begin{aligned} \hat{Y}_1(1) &= \alpha_1 + \beta_1 X_i(1) + \epsilon_i \ \hat{Y}_0(0) &= \alpha_0 + \beta_0 X_i(0) + \epsilon_i \end{aligned}

where the coefficients of $\hat{Y}_1$ are learned exclusively on samples of $X$ where we've observed $Y(1)$, and those of $\hat{Y}_2$ are likewise are learned from samples where $Y(0)$.

treatment_propensity = data_propensity_matched[data_propensity_matched.treat == 1]
control_propensity = data_propensity_matched[data_propensity_matched.treat == 0]

formula = 're78_div ~ age + educ + black + hisp + married + nodegr + u74 + u75 + re74_div + re75_div'

treatment_model = smf.glm(formula=formula, data=treatment_propensity).fit()
control_model = smf.glm(formula=formula, data=control_propensity).fit()


Now, recalling the fundamental problem of causal inference, we want to estimate what the counterfactual outcomes would be, e.g. fill in the gaps.

We can do this if we swap our data and estimate the effect from the two groups using the corresponding models.

\begin{aligned} \hat{Y}_1(0) = \alpha_1 + \beta_1 X_i(0) + \epsilon_i \ \hat{Y}_0(1) = \alpha_0 + \beta_0 X_i(1) + \epsilon_i \end{aligned}

Y_treatment, X_treatment = dmatrices(causal_formula, treatment_propensity, return_type='dataframe')
Y_control, X_control = dmatrices(causal_formula, control_propensity, return_type='dataframe')

Y_0 = np.expand_dims(treatment_model.predict(exog=X_control), 1)
Y_1 = np.expand_dims(control_model.predict(exog=X_treatment), 1)


Now we can define $\hat{\tau}_i = Y_i - \hat{Y}_1(0)$ for our control group, and $\hat{\tau}_i = \hat{Y}_0(1) - Y_i$ for our treatment group.

We can then calculate average treatment effect as $\hat{\tau} = \frac{1}{N} \sum_{i=1}^N \hat{\tau}_i$.

np.concatenate([
(Y_treatment - Y_1),
(Y_0 - Y_control)
]).mean()

# 1.5483643168987649


Once again we get a very similar treatment effect. The rationale here is that our earlier regression models are largely derived from this idea.

## Results

We've seen a few different implementations of the Rubin model of causal effect, and looked at matching with propensity scores, and Mahalanobis distance. With all of these models we've come up with similar positive treatment effects. Note, that if we had just used the OLS or Matching treatment effects from the causalinference library we might not have claimed any significance. However, given that each of these models comes up with similar treatment effects, we can be a bit more confident that there is in fact some positive treatment effect. In fact, with our Bayesian models we're relatively sure that there's a positive effect $P(treat > 0) \approx 0.97$.

The difference between the propensity matching and Mahalanobis distance matching was very small. In other models this might not be the case and there might be good reasons to use one over the other. In general propensity matching is a very popular approach for matching, but one might want to be careful about understanding the strong ignorability assumption that goes along.

Lastly, the Rubin model of causal effect is relatively simple, and more advanced approaches have been studied with structural equation modeling. The work by Judea Pearl is very good and his book, Causality: Models, Reasoning and Inference, is a good reference on the topic.

The full code in this post is available in my Github repo at log0ymxm/causal-inference-notes.

As always I welcome any comments or feedback on this post. Please let me know if I've made any mistakes or omissions here.