Causal Analysis Introduction - Examples in Python and PyMC

By Paul English, Sun 27 November 2016, in category Statistics

python, statistics

"Correlation is not causation" is a common mantra. It's easy to find examples of spurious correlation in the world and in many data sets, so it's usually wise not to assume a causal effect. Still, there are times when we want to infer causal relationships from data.

One framework that attempts to address this is the Rubin causal model. We'll explore the idea and recreate some analysis; it is typically used with observational data. It's not the only paradigm for estimating causality: another approach revolves around structural equation modeling, and is arguably more general.

The Core Problem in Measuring Causal Effect

Borrowing from the Wikipedia article on the Rubin causal model, imagine we have a few people taking part in a study where we want to see if some treatment helps or hinders a subject's outcome. We first observe what happens to each subject without the treatment, then we grab a time machine and restart the experiment having administered the treatment to all the subjects. From this we would be able to directly observe how the treatment affected each subject.

         $Y(1)$   $Y(0)$   $Y(1) - Y(0)$
Joe        $-5$      $5$           $-10$
Mary      $-10$     $-5$            $-5$
Sally       $0$     $10$           $-10$
Bob       $-20$     $-5$           $-15$
Mean                               $-10$

In this table we use the notation $Y(1)$ for the outcome when a subject is treated (treatment), and $Y(0)$ for the outcome with no treatment (control). With full observations we could simply take the difference in outcomes $Y(1) - Y(0)$ for each subject and average them.
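To make the arithmetic concrete, here's a small sketch reproducing the table above with pandas (the values are simply transcribed from it):

import pandas as pd

# Hypothetical fully observed outcomes, transcribed from the table above.
full = pd.DataFrame({
    'Y1': [-5, -10, 0, -20],
    'Y0': [5, -5, 10, -5],
}, index=['Joe', 'Mary', 'Sally', 'Bob'])

individual_effects = full.Y1 - full.Y0  # per-subject causal effect Y(1) - Y(0)
print(individual_effects.mean())        # -10.0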

Causal Effect since we don't have a time machine

In practice we're not going to be able to fully observe both outcomes for each subject. Typically we'll have a dataset with partial observations:

         $Y(1)$   $Y(0)$   $Y(1) - Y(0)$
Joe         $?$      $5$             $?$
Mary      $-10$      $?$             $?$
Sally       $?$     $10$             $?$
Bob       $-20$      $?$             $?$
Mean                                 $?$

We could try averaging each treatment and control group, then estimating the differences.

         $Y(1)$   $Y(0)$   $Y(1) - Y(0)$
Joe         $?$      $5$             $?$
Mary      $-10$      $?$             $?$
Sally       $?$     $10$             $?$
Bob       $-20$      $?$             $?$
Mean      $-15$    $7.5$             $?$

Average treatment effect: $\bar{Y}(1) - \bar{Y}(0) = -22.5$
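As a quick sketch of that computation (values again transcribed from the table, with np.nan marking the counterfactuals we never see):

import numpy as np

Y1 = np.array([np.nan, -10, np.nan, -20])  # Joe, Mary, Sally, Bob
Y0 = np.array([5, np.nan, 10, np.nan])

naive_ate = np.nanmean(Y1) - np.nanmean(Y0)
print(naive_ate)  # -22.5, versus the true -10.0 from the fully observed table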

This might be a useful approach, but in this example our treatment effect is quite different from what we had with full observations earlier. It is also relatively easy to come up with examples where missing observations greatly skew or change the estimated average treatment effect. Increasing the number of samples will generally lead to better estimates, but there's more that can be done to improve on this simple approach.

Estimating Causal Effect

In general we cannot directly observe causal effect. We have to look at methods that estimate this, similar to how we averaged the treatment & control group above.

Substitutes to Potential Outcome

A common, and sometimes accidentally applied, method is to substitute something observable for the missing potential outcome.

Examples include substituting an outcome measured on the same subject at a different time, or the outcome of a different but similar subject.

These substitutes carry strong assumptions that may be implicit in the choice of substitution. They show up regularly in research, and though the assumptions are sometimes left unstated or are part of a poorly designed study, they are generally acceptable in many situations.

For example, drinking tea before bed one night probably doesn't have a big effect on successive nights, so a subject's other nights can make a reasonable substitute; unless you're a habitual caffeine drinker, in which case the results may be less than useful.

The Randomized Controlled Trial

The gold standard for measuring causal effects is the randomized controlled trial. The goal is to compare apples with apples by controlling for various factors, and we use randomization to avoid introducing bias.

This model is not always perfect, and issues like cost, ethics, and practicality can make it impossible to conduct an experiment in this manner.

Statistical Adjustment

The goal of statistical adjustment is to approximate what a randomized experiment would achieve, using statistical methods to find similar units to compare. This is what we'll explore further with our causal model and with pairwise matching, though many other techniques fall into this category.

The Rubin Causal Model

Switching gears for a moment, let's revisit our earlier example:

         $Y(1)$   $Y(0)$   $Y(1) - Y(0)$
Joe         $?$      $5$             $?$
Mary      $-10$      $?$             $?$
Sally       $?$     $10$             $?$
Bob       $-20$      $?$             $?$
Mean        $?$      $?$             $?$

What if we could estimate the unknown values that we can't observe in practice? One option is to build a linear regression on whether or not a subject received treatment; we'll use the indicator $D_i$ to represent treatment status for subject $i$.

$$ \hat{Y}_i = \alpha + \tau D_i + \epsilon_i $$

From this, we can use the typical interpretation of linear model coefficients and say that $\tau$ represents how the outcome differs depending on whether the subject received treatment or not.

We formulate the least squares estimator, $$ (\hat{\tau}, \hat{\alpha}) = \arg\min_{\tau, \alpha} \sum_{i=1}^N (Y_i - \alpha - \tau D_i)^2 $$

and solve for $\hat{\tau}$,

$$ \hat{\tau} = \frac{\sum_{i=1}^N (D_i - \bar{D}) (Y_i - \bar{Y})}{\sum_{i=1}^N (D_i - \bar{D})^2} $$

which can be shown to equal $\hat{\tau} = \bar{Y}(1) - \bar{Y}(0)$, the difference in average outcomes between the two treatment statuses. This is the core idea for estimating causal effect in this framework.
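As a quick numerical check of that equivalence, here's a small simulation (synthetic data, not the study data used later) showing that the OLS coefficient on a binary treatment indicator matches the difference in group means:

import numpy as np
import statsmodels.api as sm

rng = np.random.RandomState(0)
D = rng.binomial(1, 0.5, size=500)  # random treatment assignment
Y = 2.0 * D + rng.normal(size=500)  # outcome with a true effect of 2

tau_hat = sm.OLS(Y, sm.add_constant(D)).fit().params[1]
diff_in_means = Y[D == 1].mean() - Y[D == 0].mean()
print(tau_hat, diff_in_means)  # identical up to floating point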

We can continue further by adding features that help improve the model. One might wish to control for various covariates $X_i$, or might believe that the treatment has interaction effects with those covariates. These lead to the corresponding regression models,

$$ \hat{Y}_i = \alpha + \tau D_i + \beta X_i + \epsilon_i $$

$$ \hat{Y}_i = \alpha + \tau D_i + \beta X_i + \gamma (D_i \cdot X_i) + \epsilon_i $$

In both of these models we're still only concerned with interpreting the coefficient $\tau$.

Assumptions in This Model

What are we assuming here?

We won't look at these assumptions in depth here, but it's important to know that the formulations of these models assume the treatment is strongly ignorable. This means we want the following two properties: treatment assignment is independent of the potential outcomes given the covariates, $(Y_i(0), Y_i(1)) \perp D_i \mid X_i$, and every subject has some chance of either treatment status, $0 < P(D_i = 1 \mid X_i) < 1$.

These assumptions are likely where you'll find the most criticism of this type of model, and they warrant actual study of the source material referenced below.

Matching

I mentioned that we were going to look at statistical adjustment, in particular matching. Matching is used when you have imbalanced data sets, usually a large number of control samples and a small number of treatment samples. It's an attempt to remove the imbalance between the groups and approximate what you might have had from a randomized or controlled experiment.

We'll look at two methods: propensity matching, and using Mahalanobis distance between samples. Propensity matching is definitely the most common method, though there are criticisms against using it blindly.

Propensity Score Matching

A propensity score is an estimate of how likely a subject is to have been in the treatment group, $P(D_i = 1 | X_i)$. It's easily obtained with an intermediate regression, typically something like

$$ D_i = \text{logistic}(\alpha + \beta X_i + \epsilon_i) $$

Once built, we can then estimate propensities for both of our treatment and control groups. Hopefully we'll find that there are some subjects in the treatment group with low propensity, but the majority have high propensity (they did actually end up in the treatment group), and the opposite for the control group.

We then find one (or several) matches in the control group for each sample in the treatment group, picking matches that minimize the difference in propensity score. Since we're dealing with estimates, it's fine if the match isn't optimal, and we can usually use a simple algorithm like a greedy random match.

Distance Matching

Using a measure like Mahalanobis distance is also an effective way to match groups, and has the bonus of not introducing an intermediate model.

We would look at picking a match that minimizes distance between covariates,

$$ m(i) = \arg \min_{j : D_j \ne D_i} || X_j - X_i || $$

where we define $||X_j - X_i||$ as a distance between the covariate vectors $X_j$ and $X_i$:

$$ ||X_j - X_i|| := (X_j - X_i)^\prime W (X_j - X_i). $$

where $W$ is defined as

$$ W = \text{diag}\{ \hat{\sigma}^{-2}_1, \dots, \hat{\sigma}^{-2}_K \}. $$

This is a special case of Mahalanobis distance that's called the Normalized Euclidean distance.
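For concreteness, here's a small sketch of this distance on made-up covariate vectors (the numbers are hypothetical), together with SciPy's built-in standardized Euclidean distance, which is the square root of the same quadratic form:

import numpy as np
from scipy.spatial.distance import seuclidean

x_i = np.array([25.0, 12.0, 1.0])         # hypothetical covariates for subject i
x_j = np.array([30.0, 10.0, 0.0])         # hypothetical covariates for subject j
variances = np.array([110.0, 6.5, 0.25])  # hypothetical sample variances per covariate

d = x_j - x_i
quad_form = d @ np.diag(1.0 / variances) @ d  # (X_j - X_i)' W (X_j - X_i)

# SciPy's seuclidean is the square root of the same quantity.
print(np.sqrt(quad_form), seuclidean(x_j, x_i, variances))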

Examples

Let's start looking at this with a few different examples on the same data set. The dataset (linked in the code below) is one that was used in a propensity matched study. The data is a population survey including basic demographics about each subject and information about their earnings. Our treatment group consists of subjects who took part in an occupational training program, and we expect to see that having gone through the training has a positive effect on earnings.

After downloading the files, we load them in Python, and add a few useful features.

# NSW Data Files (Dehejia-Wahba Sample)
# http://users.nber.org/~rdehejia/data/nswdata2.html

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pymc3 as pm  # the 2016-era PyMC3 API is used throughout this post
from patsy import dmatrices
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

cps = pd.read_csv('data/lalonde/cps_controls.txt', delimiter=' ', names=[
    'treat', 'age', 'educ', 'black', 'hisp', 'married', 'nodegr',
    're74', 're75', 're78'
])

original_treatment = pd.read_csv('data/lalonde/nswre74_treated.txt', delimiter=' ', names=[
    'treat', 'age', 'educ', 'black', 'hisp', 'married', 'nodegr',
    're74', 're75', 're78'
])

unmatched_raw = pd.concat([
    cps, original_treatment
], axis=0)

# Indicators for zero earnings in 1974/1975, and earnings rescaled to thousands.
unmatched_raw['u74'] = (unmatched_raw.re74 == 0).astype(np.int32)
unmatched_raw['u75'] = (unmatched_raw.re75 == 0).astype(np.int32)

unmatched_raw[['re74_div', 're75_div', 're78_div']] = unmatched_raw[['re74', 're75', 're78']] / 1000

unmatched_raw.shape
# (16177, 15)

There are a lot of samples, and a lot of imbalance between the control and treatment groups: only 185 treatment observations against 15,992 control observations. It's likely that if we built a causal model on all of these samples, the large number of control samples would wash out the value of the treatment indicator. That makes this a great dataset for matching.
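As a quick check, the imbalance can be read straight off the treatment indicator:

unmatched_raw.treat.value_counts()

# 0    15992
# 1      185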

If we look at the compared density plots of the controls and the treatments we see the following:

sns.kdeplot(unmatched_raw[unmatched_raw.treat == 0].re78, label='Control')
sns.kdeplot(unmatched_raw[unmatched_raw.treat == 1].re78, label='Treatment');

Control and treatment KDE

Unfortunately this hides how imbalanced these two sets are. If we look instead at a histogram of our data (which doesn't normalize like the above plot) we will see the actual imbalance.

plt.hist(unmatched_raw[unmatched_raw.treat == 0].re78, label='Control', bins=100);
plt.hist(unmatched_raw[unmatched_raw.treat == 1].re78, label='Treatment', bins=100);
plt.legend();

Control and treatment histogram

Using the causalinference library

The first thing we'll do is look at a well-made library built specifically for this type of causal model, the causalinference package in Python. It makes it very straightforward to run several models quickly.

from causalinference import CausalModel

causal_model = CausalModel(
    X=unmatched_raw[['age', 'educ', 'black', 'hisp', 'married', 'nodegr', 're74_div', 're75_div', 'u74', 'u75']].values,
    D=unmatched_raw.treat.values,
    Y=unmatched_raw.re78_div.values
)
causal_model.est_propensity()
causal_model.trim_s()
causal_model.stratify_s()
causal_model.est_via_ols()
causal_model.est_via_matching(bias_adj=True)
causal_model.est_via_weighting()

print(causal_model.summary_stats)
print(causal_model.estimates)
print(causal_model.propensity)
print(causal_model.strata)
Summary Statistics

                      Controls (N_c=373)         Treated (N_t=149)             
      Variable         Mean         S.d.         Mean         S.d.     Raw-diff
--------------------------------------------------------------------------------
              Y        4.814        6.084        6.038        8.006        1.224

                      Controls (N_c=373)         Treated (N_t=149)             
      Variable         Mean         S.d.         Mean         S.d.     Nor-diff
--------------------------------------------------------------------------------
            X0       26.416       11.046       25.765        7.355       -0.069
            X1       10.595        2.698       10.221        2.066       -0.156
            X2        0.882        0.323        0.973        0.162        0.357
            X3        0.118        0.323        0.027        0.162       -0.357
            X4        0.214        0.411        0.134        0.342       -0.212
            X5        0.611        0.488        0.745        0.437        0.289
            X6        2.349        4.572        1.428        3.731       -0.221
            X7        1.275        2.167        0.859        1.609       -0.218
            X8        0.547        0.498        0.772        0.421        0.487
            X9        0.491        0.501        0.671        0.471        0.371


Treatment Effect Estimates: Matching

                    Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
          ATE      1.576      1.359      1.160      0.246     -1.087      4.239
          ATC      1.357      1.632      0.832      0.406     -1.842      4.557
          ATT      2.123      1.448      1.466      0.143     -0.716      4.962

Treatment Effect Estimates: Weighting

                    Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
          ATE      1.750      0.817      2.143      0.032      0.149      3.351

Treatment Effect Estimates: OLS

                    Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
          ATE      1.960      8.445      0.232      0.816    -14.593     18.513
          ATC      1.947     10.284      0.189      0.850    -18.210     22.104
          ATT      1.993     14.603      0.136      0.891    -26.629     30.615


Estimated Parameters of Propensity Score

                    Coef.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
    Intercept     -6.341      0.812     -7.804      0.000     -7.933     -4.748
            X0     -0.018      0.011     -1.683      0.092     -0.039      0.003
            X1      0.019      0.049      0.401      0.688     -0.076      0.115
            X2      4.286      0.263     16.319      0.000      3.771      4.801
            X3      1.836      0.394      4.663      0.000      1.064      2.607
            X4     -0.994      0.241     -4.120      0.000     -1.467     -0.521
            X5      0.903      0.275      3.282      0.001      0.364      1.442
            X6      0.063      0.028      2.197      0.028      0.007      0.118
            X7     -0.177      0.036     -4.923      0.000     -0.248     -0.107
            X8      1.572      0.264      5.956      0.000      1.055      2.089
            X9      0.235      0.239      0.984      0.325     -0.233      0.702


Stratification Summary

              Propensity Score         Sample Size     Ave. Propensity   Outcome
  Stratum      Min.      Max.  Controls   Treated  Controls   Treated  Raw-diff
--------------------------------------------------------------------------------
        1     0.077     0.193       229        33     0.128     0.127    -0.472
        2     0.193     0.284        52        14     0.235     0.238     2.194
        3     0.286     0.408        31        35     0.350     0.363     5.653
        4     0.408     0.638        61        67     0.562     0.576     1.391

This library includes facilities to perform the matching for us; methodology details can be found in its documentation. We run three models and see positive treatment effects in each one, though only one is "significant". These appear in the output above as the "ATE" lines, where ATE stands for average treatment effect (ATT and ATC are the same quantity averaged over only the treated or only the control subjects, respectively).

While I like this library and recommend it, I've had numerical issues using it on larger and more complex data sets (its implementation tries to invert some matrices, which is a recipe for singular matrix errors). Additionally, I like to use Bayesian frameworks where I can, since I tend to prefer the probabilistic interpretation of results over p-values and hypothesis tests.

Using a simple Bayesian linear regression (PyMC)

Now I'll look at implementing my own matching routines, and using a Bayesian linear regression.

For $y \in \mathbb{R}$ and $\mathbf{x} \in \mathbb{R}^M$, with $N$ data points $(\mathbf{X}, \mathbf{y}) = \{(\mathbf{x}_n, y_n)\}_{n=1}^N$, we use the following distributions:

$$ \begin{aligned} p(\mathbf{w}) &= \mathcal{N}(\mathbf{w} \mid 0, \sigma_w^2 \mathbf{I}) \\ p(b) &= \mathcal{N}(b \mid 0, \sigma_b^2) \\ p(\mathbf{y} \mid \mathbf{w}, b, \mathbf{X}) &= \prod_{n=1}^N \mathcal{N} (y_n \mid \mathbf{x}_n^{\intercal} \mathbf{w} + b, \sigma_y^2) \end{aligned} $$

This is already implemented in PyMC with the glm functionality. Thus, all we will need to do is pass in the formula for our model.

Propensity Matching

For propensity matching we first need some estimate of the propensity score. In the interest of computational time, we'll use a standard frequentist logistic model as our propensity function; it's fast, and we don't plan to interpret this intermediate model in this particular analysis. We want to estimate $p(\text{treat} \mid X)$, which we do with the popular statsmodels package:

propensity_formula = "treat ~ age + educ + black + hisp + married + nodegr + u74 + u75 + re75_div + re74_div + I(re74_div*re75_div)"
propensity_model = smf.glm(formula=propensity_formula, data=unmatched_raw, family=sm.families.Binomial()).fit()
propensity_model.summary()
Generalized Linear Model Regression Results
--------------------------------------------------------------------------------
Dep. Variable:      treat       No. Observations:    16177
Model:              GLM         Df Residuals:        16165
Model Family:       Binomial    Df Model:            11
Link Function:      logit       Scale:               1.0
Method:             IRLS        Log-Likelihood:      -473.41
Date:               Sat, 26 Nov 2016    Deviance:    946.82
Time:               14:57:12    Pearson chi2:        8.10e+03
No. Iterations:     14
--------------------------------------------------------------------------------
                            coef    std err         z     P>|z|   [95.0% Conf. Int.]
Intercept                -6.6364      0.825    -8.045     0.000     -8.253    -5.020
age                      -0.0191      0.011    -1.788     0.074     -0.040     0.002
educ                      0.0195      0.048     0.404     0.686     -0.075     0.114
black                     4.2899      0.263    16.329     0.000      3.775     4.805
hisp                      1.8322      0.394     4.655     0.000      1.061     2.604
married                  -0.9954      0.240    -4.147     0.000     -1.466    -0.525
nodegr                    0.9100      0.274     3.318     0.001      0.372     1.448
u74                       1.7318      0.280     6.178     0.000      1.182     2.281
u75                       0.3934      0.253     1.555     0.120     -0.102     0.889
re75_div                 -0.0871      0.056    -1.559     0.119     -0.197     0.022
re74_div                  0.0985      0.033     2.968     0.003      0.033     0.164
I(re74_div * re75_div)   -0.0072      0.004    -1.800     0.072     -0.015     0.001

From this fitted model, we can then make propensity score estimates, and come up with some matches.

_, X_propensity = dmatrices(propensity_formula, unmatched_raw, return_type='dataframe')
propensities = propensity_model.predict(exog=X_propensity)

split = cps.shape[0]

treatment = np.expand_dims(propensities[split:], 1)
control = np.expand_dims(propensities[:split], 1)

# Calculating the distance in propensity of each treatment sample to each control sample
distances = cdist(treatment, control)

# Computing an optimal match between the two groups
treatment_idx, control_idx = linear_sum_assignment(distances)

treatment_data = unmatched_raw[split:].iloc[treatment_idx]
control_data = unmatched_raw[:split].iloc[control_idx]

Here we compute an optimal match, though this is not always efficient (or needed). As mentioned earlier, we can typically compute a random greedy match, with or without replacement, and achieve similar results; it runs much faster since it's just a simple loop. Since our treatment group is small enough, I wanted to try the optimal matching method.
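For reference, a greedy match without replacement is only a few lines; here's a minimal sketch (not the routine used in this post) that operates on the same distances matrix computed above:

def greedy_match(distances, seed=0):
    # For each treatment row, visited in random order, take the nearest
    # control column that hasn't already been used.
    rng = np.random.RandomState(seed)
    available = np.ones(distances.shape[1], dtype=bool)
    treatment_idx, control_idx = [], []
    for i in rng.permutation(distances.shape[0]):
        masked = np.where(available, distances[i], np.inf)
        j = int(np.argmin(masked))
        available[j] = False
        treatment_idx.append(i)
        control_idx.append(j)
    return np.array(treatment_idx), np.array(control_idx)

# e.g. treatment_idx, control_idx = greedy_match(distances)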

Now if we compare our density plots we see much more similarity in our groups,

sns.kdeplot(treatment_data.re78, label='Treatment')
sns.kdeplot(control_data.re78, label='Control')

Treatment and control density

And if we look at the comparative histograms we see that the control group doesn't overwhelm our treatment group like it did before,

plt.hist(treatment_data.re78, label='Treatment', bins=100)
plt.hist(control_data.re78, label='Control', bins=100)
plt.legend();

Treatment and control density propensity

Now that we have a treatment group and a matched control group, we can put them together and build a model to estimate the treatment effect. I include the covariates in this model ($\hat{Y}_i = \alpha + \tau D_i + \beta X_i + \epsilon_i$, the second regression noted above), but it would only require updating the formula to regress on the treatment indicator alone, or to add interaction terms with the treatment indicator.

data_propensity_matched = pd.concat([
    treatment_data,
    control_data
], axis=0)

causal_formula = "re78_div ~ treat + age + educ + black + hisp + married + nodegr + u74 + u75 + re74_div + re75_div"

with pm.Model() as model:
    pm.glm.glm(causal_formula, data_propensity_matched)

    print('ADVI')
    propensity_v_params = pm.variational.advi(n=80000)
    plt.plot(-np.log10(-propensity_v_params.elbo_vals))

    print('Sampling')
    propensity_samples = pm.sample(
        2000,
        step=pm.NUTS(scaling=propensity_v_params.means),
        start=propensity_v_params.means,
        progressbar=True,
        njobs=3
    )

# ADVI
# Iteration 0 [0%]: ELBO = -45564.78
# Iteration 8000 [10%]: Average ELBO = -179941.26
# Iteration 16000 [20%]: Average ELBO = -1669.65
# Iteration 24000 [30%]: Average ELBO = -1434.98
# Iteration 32000 [40%]: Average ELBO = -1425.37
# Iteration 40000 [50%]: Average ELBO = -1424.45
# Iteration 48000 [60%]: Average ELBO = -1424.24
# Iteration 56000 [70%]: Average ELBO = -1423.82
# Iteration 64000 [80%]: Average ELBO = -1423.97
# Iteration 72000 [90%]: Average ELBO = -1423.87
# Finished [100%]: Average ELBO = -1423.64
# Sampling
# 100%|██████████| 2000/2000 [00:21<00:00, 94.80it/s]

Causal propensity ADVI

pm.traceplot(propensity_samples[1000:], varnames=['treat']);

Propensity traceplot

We only output the trace plot for the treatment variable, since it's the coefficient we're interested in. If you plot all of the sampled parameters, they also look like they've converged sufficiently; one could check this more thoroughly by looking at the autocorrelation of the samples.
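One way to do that is PyMC3's built-in autocorrelation plot; a quick sketch of the check (not run as part of the analysis here):

pm.autocorrplot(propensity_samples[1000:], varnames=['treat']);

Looking at the parameter means: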

df_propensity_samples = pm.trace_to_dataframe(propensity_samples[1000:])
df_propensity_samples.mean(0)

# age         -0.046268
# nodegr      -0.380227
# educ         0.385656
# u74          1.459713
# Intercept    2.051145
# u75         -0.343626
# hisp         0.218121
# married     -0.473846
# re75_div     0.323274
# re74_div     0.233762
# treat        1.463496
# sd           6.977029
# black       -1.679199
# dtype: float64

The posterior mean for treat is 1.46, which is similar to the ATE values we saw with the causalinference package above. What is the sampled probability that this effect is above zero?

(df_propensity_samples['treat'] > 0).mean()

# 0.97499999999999998
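The same samples can also give a credible interval for the effect directly; for example, a 95% interval would be:

np.percentile(df_propensity_samples['treat'], [2.5, 97.5])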

Mahalanobis Matching

Now let's use the Mahalanobis idea for matching. This is going to be roughly the same once we've gotten a newly matched data set.

treatment = X_propensity[split:]
control = X_propensity[:split]

# cdist expects VI to be the *inverse* covariance matrix, so we put inverse
# variances on the diagonal to get the normalized Euclidean case defined above.
# (Guard against zero-variance columns, e.g. the Intercept added by dmatrices.)
variances = X_propensity.var().replace(0, 1)
VI = np.diag(1.0 / variances)
distances = cdist(treatment, control, 'mahalanobis', VI=VI)

treatment_idx, control_idx = linear_sum_assignment(distances)

treatment_data = unmatched_raw[split:].iloc[treatment_idx]
control_data = unmatched_raw[:split].iloc[control_idx]

Here we build the diagonal VI matrix from the inverse variances, which gives the normalized Euclidean distance case described earlier (cdist expects the inverse of the covariance matrix). The other lines are the same as in the propensity matching. Once again we can look at the density plot, and we see that it's very similar to the propensity matched grouping.

sns.kdeplot(treatment_data.re78, label='Treatment')
sns.kdeplot(control_data.re78, label='Control');

Treatment control density Mahalanobis

Likewise, our histograms are similar.

plt.hist(treatment_data.re78, label='Treatment', bins=100)
plt.hist(control_data.re78, label='Control', bins=100)
plt.legend();

Treatment control Mahalanobis histogram

Using this matched group we run the same model.

data_mahalanobis_matched = pd.concat([
    treatment_data,
    control_data
], axis=0)

with pm.Model() as model:
    pm.glm.glm(causal_formula, data_mahalanobis_matched)

    print('ADVI')
    mahalanobis_v_params = pm.variational.advi(n=80000)
    plt.plot(-np.log10(-mahalanobis_v_params.elbo_vals))

    print('Sampling')
    mahalanobis_samples = pm.sample(
        2000,
        step=pm.NUTS(scaling=mahalanobis_v_params.means),
        start=mahalanobis_v_params.means,
        progressbar=True,
        njobs=4
    )

# ADVI
# Iteration 0 [0%]: ELBO = -16156.39
# Iteration 8000 [10%]: Average ELBO = -216377.4
# Iteration 16000 [20%]: Average ELBO = -1637.49
# Iteration 24000 [30%]: Average ELBO = -1434.52
# Iteration 32000 [40%]: Average ELBO = -1425.37
# Iteration 40000 [50%]: Average ELBO = -1424.39
# Iteration 48000 [60%]: Average ELBO = -1424.08
# Iteration 56000 [70%]: Average ELBO = -1424.0
# Iteration 64000 [80%]: Average ELBO = -1423.79
# Iteration 72000 [90%]: Average ELBO = -1423.99
# Finished [100%]: Average ELBO = -1423.85
# Sampling
# 100%|██████████| 2000/2000 [00:21<00:00, 93.26it/s]

Mahalanobis ADVI

pm.traceplot(mahalanobis_samples[1000:], varnames=['treat']);

Mahalanobis traceplot

df_mahalanobis_samples = pm.trace_to_dataframe(mahalanobis_samples[1000:])
df_mahalanobis_samples.mean(0)

# age         -0.045524
# nodegr      -0.404798
# educ         0.382626
# u74          1.406951
# Intercept    2.089522
# u75         -0.377069
# hisp         0.218565
# married     -0.501467
# re75_div     0.326605
# re74_div     0.229498
# treat        1.490877
# sd           6.975974
# black       -1.652019
# dtype: float64

(df_mahalanobis_samples['treat'] > 0).mean()

# 0.97799999999999998

Here we get very similar results to the propensity model, 1.49 for a treatment effect and a high probability of 0.977 that this coefficient is above zero, given our samples.

Bonus - Estimating Causal Effect with Counterfactual Estimators

Now let's take a step back and see what the treatment effect might be if we build two separate estimators,

$$ \begin{aligned} \hat{Y}_1(1) &= \alpha_1 + \beta_1 X_i(1) + \epsilon_i \\ \hat{Y}_0(0) &= \alpha_0 + \beta_0 X_i(0) + \epsilon_i \end{aligned} $$

where the coefficients of $\hat{Y}_1$ are learned exclusively from samples of $X$ where we've observed $Y(1)$, and those of $\hat{Y}_0$ are likewise learned from samples where we've observed $Y(0)$.

treatment_propensity = data_propensity_matched[data_propensity_matched.treat == 1]
control_propensity = data_propensity_matched[data_propensity_matched.treat == 0]

formula = 're78_div ~ age + educ + black + hisp + married + nodegr + u74 + u75 + re74_div + re75_div'

treatment_model = smf.glm(formula=formula, data=treatment_propensity).fit()
control_model = smf.glm(formula=formula, data=control_propensity).fit()

Now, recalling the fundamental problem of causal inference, we want to estimate what the counterfactual outcomes would have been, i.e. fill in the gaps.

We can do this if we swap our data and estimate the effect from the two groups using the corresponding models.

$$ \begin{aligned} \hat{Y}_1(0) &= \alpha_1 + \beta_1 X_i(0) + \epsilon_i \\ \hat{Y}_0(1) &= \alpha_0 + \beta_0 X_i(1) + \epsilon_i \end{aligned} $$

Y_treatment, X_treatment = dmatrices(causal_formula, treatment_propensity, return_type='dataframe')
Y_control, X_control = dmatrices(causal_formula, control_propensity, return_type='dataframe')

# Counterfactual predictions: the model fit on treated subjects applied to the
# control group's covariates (their estimated outcome had they been treated),
# and the model fit on controls applied to the treatment group's covariates.
Y_0 = np.expand_dims(treatment_model.predict(exog=X_control), 1)
Y_1 = np.expand_dims(control_model.predict(exog=X_treatment), 1)

Now we can define $\hat{\tau}_i = \hat{Y}_1(0) - Y_i$ for our control group, and $\hat{\tau}_i = Y_i - \hat{Y}_0(1)$ for our treatment group.

We can then calculate average treatment effect as $\hat{\tau} = \frac{1}{N} \sum_{i=1}^N \hat{\tau}_i$.

np.concatenate([
 (Y_treatment - Y_1),
 (Y_0 - Y_control)
]).mean()

# 1.5483643168987649

Once again we get a very similar treatment effect, which makes sense: the regression models we used earlier are largely derived from this same idea.

Results

We've seen a few different implementations of the Rubin model of causal effect, and looked at matching with propensity scores and with Mahalanobis distance. All of these models produce similar positive treatment effects. Note that if we had only used the OLS or matching treatment effects from the causalinference library we might not have claimed any significance. However, given that each of these models comes up with a similar treatment effect, we can be a bit more confident that there is in fact some positive effect. With our Bayesian models we're relatively sure of it: $P(\text{treat} > 0) \approx 0.97$.

The difference between propensity matching and Mahalanobis distance matching was very small here. On other data sets this might not be the case, and there may be good reasons to prefer one over the other. Propensity matching is the more popular approach, but one should be careful to understand the strong ignorability assumption that goes along with it.

Lastly, the Rubin model of causal effect is relatively simple, and more advanced approaches have been studied with structural equation modeling. The work by Judea Pearl is very good and his book, Causality: Models, Reasoning and Inference, is a good reference on the topic.

The full code in this post is available in my Github repo at log0ymxm/causal-inference-notes.

As always I welcome any comments or feedback on this post. Please let me know if I've made any mistakes or omissions here.

References