By Paul English, Sun 27 November 2016, in category Statistics

"Correlation is not causation" is a common mantra. It's easy to find examples of spurious correlation in the world and in many data sets. It's wise not to assume a causal effect in most cases, but there are times when we want to infer causal relationships from data.

One model that attempts to address this is the Rubin causal model. We'll explore this idea and recreate some analysis. This model is typically useful for observational data. It's not the only paradigm for estimating causality; another approach, based on structural equation modeling, can be more general and useful.

Borrowing from the Wikipedia article on the Rubin causal model, imagine we have a few people partaking in a study where we want to see if some treatment helps or hinders a subject's outcome. We first observe what happens to each subject without the treatment, then we grab a time machine and restart the experiment having administered the treatment to all the subjects. From this we would be able to directly observe how the treatment affected each subject.

| | $Y(1)$ | $Y(0)$ | $Y(1) - Y(0)$ |
|---|---|---|---|
| Joe | $-5$ | $5$ | $-10$ |
| Mary | $-10$ | $-5$ | $-5$ |
| Sally | $0$ | $10$ | $-10$ |
| Bob | $-20$ | $-5$ | $-15$ |
| **Mean** | | | $-10$ |

In this table we use the notation $Y(1)$ for the outcome under treatment, and $Y(0)$ for the outcome under control. With full observations we could simply take the difference in outcomes $Y(1) - Y(0)$ for each subject, then average them.
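As a quick sanity check, this average effect is easy to reproduce (a minimal NumPy sketch; the numbers are taken from the table above):

```python
import numpy as np

# Fully observed outcomes from the table above
y1 = np.array([-5, -10, 0, -20])  # Y(1): outcomes under treatment
y0 = np.array([5, -5, 10, -5])    # Y(0): outcomes under control
effects = y1 - y0                 # individual causal effects
print(effects.mean())             # -10.0
```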

In practice we won't be able to fully observe the outcomes for each subject. Typically we'll have a dataset with partial observations,

| | $Y(1)$ | $Y(0)$ | $Y(1) - Y(0)$ |
|---|---|---|---|
| Joe | $?$ | $5$ | $?$ |
| Mary | $-10$ | $?$ | $?$ |
| Sally | $?$ | $10$ | $?$ |
| Bob | $-20$ | $?$ | $?$ |
| **Mean** | | | $?$ |

We could try averaging each treatment and control group, then estimating the differences.

| | $Y(1)$ | $Y(0)$ | $Y(1) - Y(0)$ |
|---|---|---|---|
| Joe | $?$ | $5$ | $?$ |
| Mary | $-10$ | $?$ | $?$ |
| Sally | $?$ | $10$ | $?$ |
| Bob | $-20$ | $?$ | $?$ |
| **Mean** | $-15$ | $7.5$ | $?$ |

**Average treatment effect**: $\bar{Y}(1) - \bar{Y}(0) = -22.5$

This might be a useful approach, but in this example our estimated treatment effect is quite different from what we had with full observations. It's also easy to construct examples where missing observations greatly skew the estimated average treatment effect. Increasing the sample size will generally lead to better estimates, but there's more that can be done to improve simple causal estimates.
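The naive estimate above is just a difference of group means over the observed entries; a minimal NumPy sketch of the same calculation:

```python
import numpy as np

# Partial observations from the table; NaN marks the unobserved potential outcome
y1 = np.array([np.nan, -10, np.nan, -20])  # Y(1) observed only for Mary and Bob
y0 = np.array([5, np.nan, 10, np.nan])     # Y(0) observed only for Joe and Sally
ate_naive = np.nanmean(y1) - np.nanmean(y0)
print(ate_naive)  # -22.5
```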

In general we cannot directly observe causal effect. We have to look at methods that estimate this, similar to how we averaged the treatment & control group above.

A common and sometimes accidentally applied method is substituting the potential outcome.

Examples:

- Maybe you can repeat treatment, e.g. drinking tea before bed.
- Dividing up a piece of plastic and exposing it to a corrosive chemical.
- The effect of a diet over time by measuring weight.

These substitutions tend to carry strong assumptions that may be implicit in the choice of substitution. They appear regularly in research, and though the assumptions are sometimes left unstated or poorly justified, they are acceptable in many situations.

Drinking tea before bed one night probably doesn't have a huge effect on successive nights, unless you're a habitual caffeine drinker, in which case you might have less than useful results.

The typical gold standard when it comes to measuring causal effects is the randomized controlled trial. The goal is to attempt comparing apples with apples by controlling for various factors. We use randomization to try to avoid introducing bias.

This approach is not always perfect, and a few issues can come up that make it impossible to conduct an experiment in this manner:

- It could be cost prohibitive.
- Participants could self-select into the treatment group, e.g. company benefits programs.
- You may not be involved in the study design, and only receive data post-hoc.
- It might be unethical to control treatment.

The goal of statistical adjustment is to approximate what a randomized experiment would achieve. The idea is to use statistical ideas to find similar units to compare. This is what we'll explore further with our causal model, and with pairwise matching, though there are many things that could fall into this category.

Switching gears for a moment, let's revisit our earlier example:

| | $Y(1)$ | $Y(0)$ | $Y(1) - Y(0)$ |
|---|---|---|---|
| Joe | $?$ | $5$ | $?$ |
| Mary | $-10$ | $?$ | $?$ |
| Sally | $?$ | $10$ | $?$ |
| Bob | $-20$ | $?$ | $?$ |
| **Mean** | $?$ | $?$ | $?$ |

What if we can estimate the unknown values which we can't observe in practice? We could look at building a linear regression on whether or not a subject had treatment, we'll use the indicator $D_i$ to represent treatment for some subject $i$.

$$ \hat{Y}_i = \alpha + \tau D_i + \epsilon_i $$

From this, we can make use of the typical interpretation of linear model coefficients and say that $\tau$ represents how the outcome differs depending on whether the subject received treatment.

We formulate the least squares estimator, $$ (\hat{\tau}, \hat{\alpha}) = \arg\min_{\tau, \alpha} \sum_{i=1}^N (Y_i - \alpha - \tau D_i)^2 $$

and solve for $\hat{\tau}$,

$$ \hat{\tau} = \frac{\sum_{i=1}^N (D_i - \bar{D}) (Y_i - \bar{Y})}{\sum_{i=1}^N (D_i - \bar{D})^2} $$

which can be shown to equal $\hat{\tau} = \bar{Y}(1) - \bar{Y}(0)$. That is, the least squares estimator recovers exactly the difference in average outcomes by treatment status. This is the core idea for estimating causal effect in this framework.
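We can verify this identity numerically on simulated data (a small sketch; the data and the true effect of 2.5 are made up for illustration):

```python
import numpy as np

# Simulated data with a true treatment effect of 2.5
rng = np.random.default_rng(0)
D = rng.integers(0, 2, size=200)            # treatment indicator
Y = 1.0 + 2.5 * D + rng.normal(0, 1, 200)   # outcomes

# Least squares slope of Y on D...
tau_hat = ((D - D.mean()) * (Y - Y.mean())).sum() / ((D - D.mean()) ** 2).sum()

# ...equals the difference in average outcomes by treatment status
diff_means = Y[D == 1].mean() - Y[D == 0].mean()
assert np.isclose(tau_hat, diff_means)
```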

We can continue further by adding in features that help improve the model. One might wish to control for various covariates $X_i$, or might believe that the treatment has interaction effects with those covariates. These lead to the corresponding regression models:

- Controlling for covariates: $\hat{Y}_i = \alpha + \tau D_i + \beta X_i + \epsilon_i$
- Interaction with the treatment: $\hat{Y}_i = \alpha + \tau D_i + \beta X_i + \gamma D_i (X_i - \bar{X}) + \epsilon_i$

In both of these models we're still only concerned with interpreting the coefficient $\tau$.

What are we assuming here?

We won't look in depth at these assumptions here, but it's important to know that these models assume the treatment is **strongly ignorable**. This requires the following two conditions:

**Unconfoundedness**: $D$ is independent of $(Y(0), Y(1))$ conditional on $X=x$, i.e. after conditioning on the covariates, treatment assignment carries no information about the potential outcomes.

**Overlap**: $c < \mathbb{P}(D=1 | X=x) < 1 - c$, for some $c > 0$, i.e. every unit has a non-negligible probability of receiving either treatment status.

These assumptions are likely where you'll find the most criticism of this type of model, and warrants actual study of the source material which I've referenced below.
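One practical check of the overlap condition is to trim units whose estimated propensity falls outside $(c, 1-c)$; a minimal sketch (the cutoff and the scores are illustrative, and `overlap_mask` is a hypothetical helper, not from any library):

```python
import numpy as np

def overlap_mask(propensities, c=0.1):
    """Flag units whose estimated propensity lies in (c, 1 - c)."""
    return (propensities > c) & (propensities < 1 - c)

# Illustrative scores: the extremes violate overlap and would be trimmed
p = np.array([0.02, 0.15, 0.5, 0.85, 0.98])
mask = overlap_mask(p)
print(mask)  # [False  True  True  True False]
```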

I mentioned that we were going to look at statistical adjustment, in particular matching. Matching is used when you have imbalanced data sets, usually a large number of control samples and a small number of treatment samples. It attempts to remove the imbalance between these groups and approximate what you might have had from a randomized or controlled experiment.

We'll look at two methods: propensity matching, and using Mahalanobis distance between samples. Propensity matching is definitely the most common method, though there are criticisms against using it blindly.

A propensity score is an estimate of how likely a subject was to end up in the treatment group, $P(D_i = 1 | X_i)$. It's easily obtained with an intermediate regression, typically something like

$$ P(D_i = 1 \mid X_i) = \text{logistic}(\alpha + \beta X_i) $$

Once built, we can then estimate propensities for both of our treatment and control groups. Hopefully we'll find that there are some subjects in the treatment group with low propensity, but the majority have high propensity (they did actually end up in the treatment group), and the opposite for the control group.

We would then find one (or several) matches in the control group for each sample in the treatment group, picking matches that minimize the difference in propensity score. Since we're dealing with estimates, it's OK if this isn't an optimal match, and we can usually use a simple algorithm like a greedy random match.
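A greedy match without replacement can be sketched in a few lines (the propensity scores here are synthetic, and `greedy_propensity_match` is a hypothetical helper, not from any library):

```python
import numpy as np

def greedy_propensity_match(p_treat, p_control):
    """For each treated unit, greedily pick the closest unmatched control
    by propensity score (matching without replacement)."""
    available = list(range(len(p_control)))
    matches = []
    for i, p in enumerate(p_treat):
        j = min(available, key=lambda k: abs(p_control[k] - p))
        matches.append((i, j))
        available.remove(j)
    return matches

rng = np.random.default_rng(1)
p_treat = rng.uniform(0.4, 0.9, size=5)     # hypothetical treated propensities
p_control = rng.uniform(0.1, 0.9, size=50)  # hypothetical control propensities
pairs = greedy_propensity_match(p_treat, p_control)
assert len({j for _, j in pairs}) == len(p_treat)  # a distinct control per treated unit
```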

Using a measure like Mahalanobis distance is also an effective way to match groups, and has the bonus of not introducing an intermediate model.

We would look at picking a match that minimizes distance between covariates,

$$ m(i) = \arg \min_{j : D_j \ne D_i} || X_j - X_i || $$

Where we define $||X_j - X_i||$ as a distance between the covariate vectors $X_j$ and $X_i$ as follows:

$$ ||X_j - X_i|| := (X_j - X_i)^\prime W (X_j - X_i). $$

where $W$ is defined as

$$ W = \text{diag} \{ \hat{\sigma}^{-2}_1 , \dots, \hat{\sigma}^{-2}_K \}. $$

This is a special case of Mahalanobis distance that's called the **Normalized Euclidean distance**.
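As a sanity check, the weighted distance above agrees with SciPy's built-in standardized Euclidean distance (`seuclidean`) once we square it (a small sketch on random data):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
variances = X.var(axis=0)
W = np.diag(1.0 / variances)  # diag(sigma_k^{-2}) from the definition above

a, b = X[0], X[1]
d_manual = (a - b) @ W @ (a - b)  # the (squared) distance defined above

# scipy's 'seuclidean' is the unsquared variance-weighted distance
d_scipy = cdist(X[:1], X[1:2], 'seuclidean', V=variances)[0, 0]
assert np.isclose(d_manual, d_scipy ** 2)
```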

Let's start looking at this with a few different examples on the same data set. The data set (linked in the code below) is a population survey that was used in a well-known propensity-matched study. It includes some basic demographics about each subject, along with information about their salary. Our treatment group consists of subjects who took part in an occupational training program. We expect to see that having gone through training has a positive effect on salary.

After downloading the files, we load them in Python, and add a few useful features.

```
import numpy as np
import pandas as pd

# NSW Data Files (Dehejia-Wahba Sample)
# http://users.nber.org/~rdehejia/data/nswdata2.html
columns = [
    'treat', 'age', 'educ', 'black', 'hisp', 'married', 'nodegr',
    're74', 're75', 're78'
]
cps = pd.read_csv('data/lalonde/cps_controls.txt', delimiter=' ', names=columns)
original_treatment = pd.read_csv('data/lalonde/nswre74_treated.txt', delimiter=' ', names=columns)

unmatched_raw = pd.concat([cps, original_treatment], axis=0)
unmatched_raw['u74'] = (unmatched_raw.re74 == 0).astype(np.int32)
unmatched_raw['u75'] = (unmatched_raw.re75 == 0).astype(np.int32)
unmatched_raw[['re74_div', 're75_div', 're78_div']] = unmatched_raw[['re74', 're75', 're78']] / 1000
```

```
unmatched_raw.shape
# (16177, 15)
```

There are a lot of samples, and a lot of imbalance between the control and treatment groups. We have only 185 treatment observations against 15,992 control observations. If we built a causal model on all of these samples, the large number of control samples would likely wash out the value of the treatment indicator. This makes it a great dataset for matching.

If we look at the compared density plots of the controls and the treatments we see the following:

```
import matplotlib.pyplot as plt
import seaborn as sns

sns.kdeplot(unmatched_raw[unmatched_raw.treat == 0].re78, label='Control')
sns.kdeplot(unmatched_raw[unmatched_raw.treat == 1].re78, label='Treatment');
```

Unfortunately this hides how imbalanced these two sets are. If we look instead at a histogram of our data (which doesn't normalize like the above plot) we will see the actual imbalance.

```
plt.hist(unmatched_raw[unmatched_raw.treat == 0].re78, label='Control', bins=100);
plt.hist(unmatched_raw[unmatched_raw.treat == 1].re78, label='Treatment', bins=100);
plt.legend();
```

The first thing we're going to do is look at the `causalinference` package in Python, a well-made library implemented specifically for this type of causal model. It's very straightforward to run several models quickly.

```
from causalinference import CausalModel

causal_model = CausalModel(
    X=unmatched_raw[['age', 'educ', 'black', 'hisp', 'married', 'nodegr',
                     're74_div', 're75_div', 'u74', 'u75']].values,
    D=unmatched_raw.treat.values,
    Y=unmatched_raw.re78_div.values
)
causal_model.est_propensity()
causal_model.trim_s()
causal_model.stratify_s()
causal_model.est_via_ols()
causal_model.est_via_matching(bias_adj=True)
causal_model.est_via_weighting()
print(causal_model.summary_stats)
print(causal_model.estimates)
print(causal_model.propensity)
print(causal_model.strata)
```

```
Summary Statistics
Controls (N_c=373) Treated (N_t=149)
Variable Mean S.d. Mean S.d. Raw-diff
--------------------------------------------------------------------------------
Y 4.814 6.084 6.038 8.006 1.224
Controls (N_c=373) Treated (N_t=149)
Variable Mean S.d. Mean S.d. Nor-diff
--------------------------------------------------------------------------------
X0 26.416 11.046 25.765 7.355 -0.069
X1 10.595 2.698 10.221 2.066 -0.156
X2 0.882 0.323 0.973 0.162 0.357
X3 0.118 0.323 0.027 0.162 -0.357
X4 0.214 0.411 0.134 0.342 -0.212
X5 0.611 0.488 0.745 0.437 0.289
X6 2.349 4.572 1.428 3.731 -0.221
X7 1.275 2.167 0.859 1.609 -0.218
X8 0.547 0.498 0.772 0.421 0.487
X9 0.491 0.501 0.671 0.471 0.371
Treatment Effect Estimates: Matching
Est. S.e. z P>|z| [95% Conf. int.]
--------------------------------------------------------------------------------
ATE 1.576 1.359 1.160 0.246 -1.087 4.239
ATC 1.357 1.632 0.832 0.406 -1.842 4.557
ATT 2.123 1.448 1.466 0.143 -0.716 4.962
Treatment Effect Estimates: Weighting
Est. S.e. z P>|z| [95% Conf. int.]
--------------------------------------------------------------------------------
ATE 1.750 0.817 2.143 0.032 0.149 3.351
Treatment Effect Estimates: OLS
Est. S.e. z P>|z| [95% Conf. int.]
--------------------------------------------------------------------------------
ATE 1.960 8.445 0.232 0.816 -14.593 18.513
ATC 1.947 10.284 0.189 0.850 -18.210 22.104
ATT 1.993 14.603 0.136 0.891 -26.629 30.615
Estimated Parameters of Propensity Score
Coef. S.e. z P>|z| [95% Conf. int.]
--------------------------------------------------------------------------------
Intercept -6.341 0.812 -7.804 0.000 -7.933 -4.748
X0 -0.018 0.011 -1.683 0.092 -0.039 0.003
X1 0.019 0.049 0.401 0.688 -0.076 0.115
X2 4.286 0.263 16.319 0.000 3.771 4.801
X3 1.836 0.394 4.663 0.000 1.064 2.607
X4 -0.994 0.241 -4.120 0.000 -1.467 -0.521
X5 0.903 0.275 3.282 0.001 0.364 1.442
X6 0.063 0.028 2.197 0.028 0.007 0.118
X7 -0.177 0.036 -4.923 0.000 -0.248 -0.107
X8 1.572 0.264 5.956 0.000 1.055 2.089
X9 0.235 0.239 0.984 0.325 -0.233 0.702
Stratification Summary
Propensity Score Sample Size Ave. Propensity Outcome
Stratum Min. Max. Controls Treated Controls Treated Raw-diff
--------------------------------------------------------------------------------
1 0.077 0.193 229 33 0.128 0.127 -0.472
2 0.193 0.284 52 14 0.235 0.238 2.194
3 0.286 0.408 31 35 0.350 0.363 5.653
4 0.408 0.638 61 67 0.562 0.576 1.391
```

This library includes facilities to perform matching for us. Methodology details can be found in its documentation. We run three models, and see positive treatment effects in each one, though only one is "significant". These are the "ATE" lines in the output above, where ATE stands for average treatment effect.

While I like this library and recommend it, I've had numerical issues using it on larger and more complex data sets (its implementation tries to invert some matrices, which is typically a recipe for singular-matrix errors). Additionally, I like to use Bayesian frameworks where I can, since I tend to prefer the probabilistic interpretation of results over p-values and hypothesis tests.

Now I'll look at implementing my own matching routines, and using a Bayesian linear regression.

For $y \in \mathbb{R}$ and $\mathbf{x} \in \mathbb{R}^M$, with $N$ data points we have $(\mathbf{X}, \mathbf{y}) = \{(\mathbf{x}_n, y_n)\}_{n=1}^N$, and the following distributions,

$$ \begin{aligned} p(\mathbf{w}) &= \mathcal{N}(\mathbf{w} \mid 0, \sigma_w^2 \mathbf{I}) \\ p(b) &= \mathcal{N}(b \mid 0, \sigma_b^2) \\ p(\mathbf{y} \mid \mathbf{w}, b, \mathbf{X}) &= \prod_{n=1}^N \mathcal{N} (y_n \mid \mathbf{x}_n^{\intercal} \mathbf{w} + b, \sigma_y^2) \end{aligned} $$

This is already implemented in PyMC3 with the `glm` functionality. Thus, all we need to do is pass in the formula for our model.

For propensity matching we first need some estimate of the propensity score. In the interest of computational time, we'll use a standard frequentist logistic model as our propensity function. We do this since it's fast, and we don't plan to interpret this intermediate model in this particular analysis. We want to estimate $p(\text{treat} \mid X)$, which we do with the popular `statsmodels` package.

```
import statsmodels.api as sm
import statsmodels.formula.api as smf

propensity_formula = ("treat ~ age + educ + black + hisp + married + nodegr"
                      " + u74 + u75 + re75_div + re74_div + I(re74_div*re75_div)")
propensity_model = smf.glm(formula=propensity_formula, data=unmatched_raw,
                           family=sm.families.Binomial()).fit()
propensity_model.summary()
```

| | | | |
|---|---|---|---|
| Dep. Variable: | treat | No. Observations: | 16177 |
| Model: | GLM | Df Residuals: | 16165 |
| Model Family: | Binomial | Df Model: | 11 |
| Link Function: | logit | Scale: | 1.0 |
| Method: | IRLS | Log-Likelihood: | -473.41 |
| Date: | Sat, 26 Nov 2016 | Deviance: | 946.82 |
| Time: | 14:57:12 | Pearson chi2: | 8.10e+03 |
| No. Iterations: | 14 | | |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -6.6364 | 0.825 | -8.045 | 0.000 | -8.253 -5.020 |
| age | -0.0191 | 0.011 | -1.788 | 0.074 | -0.040 0.002 |
| educ | 0.0195 | 0.048 | 0.404 | 0.686 | -0.075 0.114 |
| black | 4.2899 | 0.263 | 16.329 | 0.000 | 3.775 4.805 |
| hisp | 1.8322 | 0.394 | 4.655 | 0.000 | 1.061 2.604 |
| married | -0.9954 | 0.240 | -4.147 | 0.000 | -1.466 -0.525 |
| nodegr | 0.9100 | 0.274 | 3.318 | 0.001 | 0.372 1.448 |
| u74 | 1.7318 | 0.280 | 6.178 | 0.000 | 1.182 2.281 |
| u75 | 0.3934 | 0.253 | 1.555 | 0.120 | -0.102 0.889 |
| re75_div | -0.0871 | 0.056 | -1.559 | 0.119 | -0.197 0.022 |
| re74_div | 0.0985 | 0.033 | 2.968 | 0.003 | 0.033 0.164 |
| I(re74_div * re75_div) | -0.0072 | 0.004 | -1.800 | 0.072 | -0.015 0.001 |

From this fitted model, we can then make propensity score estimates, and come up with some matches.

```
from patsy import dmatrices
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

_, X_propensity = dmatrices(propensity_formula, unmatched_raw, return_type='dataframe')
propensities = propensity_model.predict(exog=X_propensity)
split = cps.shape[0]
treatment = np.expand_dims(propensities[split:], 1)
control = np.expand_dims(propensities[:split], 1)
# Calculate the distance in propensity from each treatment sample to each control sample
distances = cdist(treatment, control)
# Compute an optimal match between the two groups
treatment_idx, control_idx = linear_sum_assignment(distances)
treatment_data = unmatched_raw[split:].iloc[treatment_idx]
control_data = unmatched_raw[:split].iloc[control_idx]
```

Here, we compute an optimal match, though this is not always very efficient (or needed). As mentioned earlier, we can typically compute a random greedy match with or without replacement and achieve similar results; that runs much faster since it's just a simple loop. Since our treatment group is small enough, I wanted to try the optimal matching method.

Now if we compare our density plots we see much more similarity in our groups,

```
sns.kdeplot(treatment_data.re78, label='Treatment')
sns.kdeplot(control_data.re78, label='Control')
```

And if we look at the comparative histograms we see that the control group doesn't overwhelm our treatment group like it did before,

```
plt.hist(treatment_data.re78, label='Treatment', bins=100)
plt.hist(control_data.re78, label='Control', bins=100)
plt.legend();
```

Now that we've got a treatment group and a matched control group, we can put these together and build a model to estimate the treatment effect. I include the covariates in this model ($\hat{Y}_i = \alpha + \tau D_i + \beta X_i + \epsilon_i$, the covariate-adjusted regression noted above); switching to a regression on the treatment indicator alone, or adding the interaction terms, would just require updating the formula.

```
import pymc3 as pm

data_propensity_matched = pd.concat([
    treatment_data,
    control_data
], axis=0)
causal_formula = "re78_div ~ treat + age + educ + black + hisp + married + nodegr + u74 + u75 + re74_div + re75_div"

with pm.Model() as model:
    pm.glm.glm(causal_formula, data_propensity_matched)
    print('ADVI')
    propensity_v_params = pm.variational.advi(n=80000)
    plt.plot(-np.log10(-propensity_v_params.elbo_vals))
    print('Sampling')
    propensity_samples = pm.sample(
        2000,
        step=pm.NUTS(scaling=propensity_v_params.means),
        start=propensity_v_params.means,
        progressbar=True,
        njobs=3
    )
# ADVI
# Iteration 0 [0%]: ELBO = -45564.78
# Iteration 8000 [10%]: Average ELBO = -179941.26
# Iteration 16000 [20%]: Average ELBO = -1669.65
# Iteration 24000 [30%]: Average ELBO = -1434.98
# Iteration 32000 [40%]: Average ELBO = -1425.37
# Iteration 40000 [50%]: Average ELBO = -1424.45
# Iteration 48000 [60%]: Average ELBO = -1424.24
# Iteration 56000 [70%]: Average ELBO = -1423.82
# Iteration 64000 [80%]: Average ELBO = -1423.97
# Iteration 72000 [90%]: Average ELBO = -1423.87
# Finished [100%]: Average ELBO = -1423.64
# Sampling
# 100%|██████████| 2000/2000 [00:21<00:00, 94.80it/s]
```

```
pm.traceplot(propensity_samples[1000:], varnames=['treat']);
```

We only output the trace plot for the treatment variable, since it's what we're interested in currently. If you show all the sampled parameters, they look like they've converged sufficiently; one could check this more thoroughly with an autocorrelation plot of the samples. Looking at the parameter means,

```
df_propensity_samples = pm.trace_to_dataframe(propensity_samples[1000:])
df_propensity_samples.mean(0)
# age -0.046268
# nodegr -0.380227
# educ 0.385656
# u74 1.459713
# Intercept 2.051145
# u75 -0.343626
# hisp 0.218121
# married -0.473846
# re75_div 0.323274
# re74_div 0.233762
# treat 1.463496
# sd 6.977029
# black -1.679199
# dtype: float64
```

The average for `treat` is 1.46, which is similar to the ATE values we saw with the `causalinference` package. What is the sampled probability that this effect is above zero?

```
(df_propensity_samples['treat'] > 0).mean()
# 0.97499999999999998
```

Now let's use the Mahalanobis idea for matching. This is going to be roughly the same once we've gotten a newly matched data set.

```
treatment = X_propensity[split:]
control = X_propensity[:split]
VI = np.diag(X_propensity.var())
distances = cdist(treatment, control, 'mahalanobis', VI=VI)
treatment_idx, control_idx = linear_sum_assignment(distances)
treatment_data = unmatched_raw[split:].iloc[treatment_idx]
control_data = unmatched_raw[:split].iloc[control_idx]
```

Here, we specify the diagonal matrix `VI` so that we get the normalized Euclidean distance case. The other lines are the same as in the propensity matching. Once again we can look at the density plots, and we see that they're very similar to the propensity-matched grouping.

```
sns.kdeplot(treatment_data.re78, label='Treatment')
sns.kdeplot(control_data.re78, label='Control');
```

Likewise, our histograms are similar.

```
plt.hist(treatment_data.re78, label='Treatment', bins=100)
plt.hist(control_data.re78, label='Control', bins=100)
plt.legend();
```

Using this matched group we run the same model.

```
data_mahalanobis_matched = pd.concat([
    treatment_data,
    control_data
], axis=0)

with pm.Model() as model:
    pm.glm.glm(causal_formula, data_mahalanobis_matched)
    print('ADVI')
    mahalanobis_v_params = pm.variational.advi(n=80000)
    plt.plot(-np.log10(-mahalanobis_v_params.elbo_vals))
    print('Sampling')
    mahalanobis_samples = pm.sample(
        2000,
        step=pm.NUTS(scaling=mahalanobis_v_params.means),
        start=mahalanobis_v_params.means,
        progressbar=True,
        njobs=4
    )
# ADVI
# Iteration 0 [0%]: ELBO = -16156.39
# Iteration 8000 [10%]: Average ELBO = -216377.4
# Iteration 16000 [20%]: Average ELBO = -1637.49
# Iteration 24000 [30%]: Average ELBO = -1434.52
# Iteration 32000 [40%]: Average ELBO = -1425.37
# Iteration 40000 [50%]: Average ELBO = -1424.39
# Iteration 48000 [60%]: Average ELBO = -1424.08
# Iteration 56000 [70%]: Average ELBO = -1424.0
# Iteration 64000 [80%]: Average ELBO = -1423.79
# Iteration 72000 [90%]: Average ELBO = -1423.99
# Finished [100%]: Average ELBO = -1423.85
# Sampling
# 100%|██████████| 2000/2000 [00:21<00:00, 93.26it/s]
```

```
pm.traceplot(mahalanobis_samples[1000:], varnames=['treat']);
```

```
df_mahalanobis_samples = pm.trace_to_dataframe(mahalanobis_samples[1000:])
df_mahalanobis_samples.mean(0)
# age -0.045524
# nodegr -0.404798
# educ 0.382626
# u74 1.406951
# Intercept 2.089522
# u75 -0.377069
# hisp 0.218565
# married -0.501467
# re75_div 0.326605
# re74_div 0.229498
# treat 1.490877
# sd 6.975974
# black -1.652019
# dtype: float64
```

```
(df_mahalanobis_samples['treat'] > 0).mean()
# 0.97799999999999998
```

Here we get very similar results to the propensity model: 1.49 for the treatment effect, and a high probability of 0.978 that this coefficient is above zero, given our samples.

Now, let's take a step back and see what the treatment effect might be if we build two separate estimators,

$$ \begin{aligned} \hat{Y}_1(1) &= \alpha_1 + \beta_1 X_i(1) + \epsilon_i \\ \hat{Y}_0(0) &= \alpha_0 + \beta_0 X_i(0) + \epsilon_i \end{aligned} $$

where the coefficients of $\hat{Y}_1$ are learned exclusively on samples where we've observed $Y(1)$, and those of $\hat{Y}_0$ likewise on samples where we've observed $Y(0)$.

```
treatment_propensity = data_propensity_matched[data_propensity_matched.treat == 1]
control_propensity = data_propensity_matched[data_propensity_matched.treat == 0]
formula = 're78_div ~ age + educ + black + hisp + married + nodegr + u74 + u75 + re74_div + re75_div'
treatment_model = smf.glm(formula=formula, data=treatment_propensity).fit()
control_model = smf.glm(formula=formula, data=control_propensity).fit()
```

Now, recalling the fundamental problem of causal inference, we want to estimate what the counterfactual outcomes would be, e.g. fill in the gaps.

We can do this if we swap our data and estimate the effect from the two groups using the corresponding models.

$$ \begin{aligned} \hat{Y}_1(0) &= \alpha_1 + \beta_1 X_i(0) + \epsilon_i \\ \hat{Y}_0(1) &= \alpha_0 + \beta_0 X_i(1) + \epsilon_i \end{aligned} $$

```
# Use the covariate-only formula the two models were fit with
Y_treatment, X_treatment = dmatrices(formula, treatment_propensity, return_type='dataframe')
Y_control, X_control = dmatrices(formula, control_propensity, return_type='dataframe')
Y_0 = np.expand_dims(treatment_model.predict(exog=X_control), 1)   # imputed Y(1) for control units
Y_1 = np.expand_dims(control_model.predict(exog=X_treatment), 1)   # imputed Y(0) for treated units
```

Now we can define $\hat{\tau}_i = \hat{Y}_1(0) - Y_i$ for our control group, and $\hat{\tau}_i = Y_i - \hat{Y}_0(1)$ for our treatment group.

We can then calculate average treatment effect as $\hat{\tau} = \frac{1}{N} \sum_{i=1}^N \hat{\tau}_i$.

```
np.concatenate([
(Y_treatment - Y_1),
(Y_0 - Y_control)
]).mean()
# 1.5483643168987649
```

Once again we get a very similar treatment effect. The rationale here is that our earlier regression models are largely derived from this idea.
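We can sanity-check this two-model idea on synthetic data with a known constant effect (a sketch using plain least squares; all names and numbers here are illustrative, not from the dataset above):

```python
import numpy as np

# Synthetic data with a known constant treatment effect of 2.0
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
D = rng.integers(0, 2, size=500)
Y = X @ np.array([1.0, -0.5, 0.25]) + 2.0 * D + rng.normal(0, 0.1, 500)

def fit_ols(X, y):
    # Least squares with an intercept column
    Xb = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef

def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

b1 = fit_ols(X[D == 1], Y[D == 1])  # fit on treated units only
b0 = fit_ols(X[D == 0], Y[D == 0])  # fit on control units only

# Impute each unit's missing counterfactual, then average the differences
tau_i = np.where(D == 1, Y - predict(b0, X), predict(b1, X) - Y)
print(round(tau_i.mean(), 2))  # close to 2.0
```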

We've seen a few different implementations of the Rubin model of causal effect, and looked at matching with both propensity scores and Mahalanobis distance. All of these models came up with similar positive treatment effects. Note that if we had just used the OLS or Matching treatment effects from the `causalinference` library, we might not have claimed any significance. However, given that each of these models comes up with similar treatment effects, we can be a bit more confident that there is in fact some positive treatment effect. In fact, with our Bayesian models we're relatively sure that there's a positive effect: $P(\text{treat} > 0) \approx 0.97$.

The difference between the propensity matching and Mahalanobis distance matching was very small. In other models this might not be the case, and there may be good reasons to use one over the other. In general, propensity matching is a very popular approach, but one should be careful to understand the **strong ignorability** assumption that goes along with it.

Lastly, the Rubin model of causal effect is relatively simple, and more advanced approaches have been studied with structural equation modeling. The work by Judea Pearl is very good and his book, Causality: Models, Reasoning and Inference, is a good reference on the topic.

The full code in this post is available in my Github repo at log0ymxm/causal-inference-notes.

As always I welcome any comments or feedback on this post. Please let me know if I've made any mistakes or omissions here.

- Lalonde, R. J. (1986). Evaluating the Econometric Evaluations of Training Programs with Experimental Data. The American Economic Review, 76(4), 604–620.
- Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
- Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences an Introduction. Cambridge University Press. Retrieved from http://www.cambridge.org/US/academic/subjects/statistics-probability/statistical-theory-and-methods/causal-inference-statistics-social-and-biomedical-sciences-introduction