Wednesday, September 07, 2011

The Ohio Miracle: How Statistics Can Be Used to Create an Effect that Isn't There

Yesterday, I commented on a new report out of the Ohio Department of Health which concludes that the statewide smoking ban implemented in May 2007 caused a significant decline in heart attacks during the period 2007-2009, based on an analysis of hospital discharge data for the diagnosis of myocardial infarction (heart attack). However, I explained that the actual data do not demonstrate any effect of the smoking ban on heart attacks: the baseline annual rate of decline in heart attacks in Ohio was 4.7%, and the average post-implementation annual rate of decline in heart attacks was 3.6%.

How, then, did the report manage to find "a significant change in age‐adjusted rates of AMI discharges within one month after the enactment of the Smoke‐Free Workplace Act"?

The Rest of the Story

The key to understanding this story is to appreciate the difference between an a priori hypothesis and a post-analysis hypothesis. An a priori hypothesis is a logic-based theory that is developed prior to the analysis. It guides the analytic method and rests on a conceptual model of how the outcome variable is expected to behave over time and in relation to the intervention being tested. A post-analysis hypothesis is one that develops after the data analysis has been initiated.

The Ohio report puts forward a clear a priori hypothesis: that the rate of decline in heart attacks accelerated after implementation of the smoking ban. Thus, a linear trend in heart attacks is assumed, which changes after the smoking ban. This is most directly modeled as two lines: one before the ban and one after the ban, with the breakpoint (the change in slope) at the point of implementation.
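In symbols, the two-line model described above can be written as a single segmented regression. The notation is illustrative shorthand and is not taken from the report:

```latex
y_t = \beta_0 + \beta_1 t + \beta_2 (t - t_0)_+ + \varepsilon_t,
\qquad (t - t_0)_+ = \max(0,\; t - t_0)
```

Here y_t is the heart attack rate at time t, t_0 is the date the ban took effect, beta_1 is the pre-ban slope, and beta_1 + beta_2 is the post-ban slope. Written this way, the a priori hypothesis amounts to beta_2 < 0: the decline gets steeper after the ban.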

Nowhere does the report hypothesize, based on a conceptual model, that the pattern of heart attacks would follow a polynomial curve, with a huge decline in heart attacks that levels off completely for a period of time and then resumes with another huge decline in heart attacks. There is absolutely no a priori reason to believe that this is the pattern that heart attack trends in Ohio would follow. None of the prior studies of smoking bans and heart attacks has used such a model or detected such a pattern.

Nevertheless, this is the structure that this study imposes on the data.

Now, if one were to simply model the changes in heart attacks in Ohio as two lines, one before and one after the intervention, one would find that the pre-intervention line declines more steeply than the post-intervention line. In other words, heart attacks were declining slightly more rapidly prior to the smoking ban than after the smoking ban, at least during the study period of 2005-2009, the years used in the Ohio analysis. There is no way to avoid this conclusion if one uses a linear model.
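A minimal sketch of that two-line comparison follows. The series is hypothetical and is not the Ohio data: it starts from an arbitrary index of 100, and its slopes are chosen only to echo the 4.7 and 3.6 figures cited earlier. The function names are likewise made up for illustration.

```python
import numpy as np

def fit_line(t, y):
    """Ordinary least-squares line; returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(t, dtype=float), t])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1]

def pre_post_slopes(t, y, t_ban):
    """Fit separate lines before and after t_ban; return both slopes."""
    pre = t < t_ban
    _, slope_pre = fit_line(t[pre], y[pre])
    _, slope_post = fit_line(t[~pre], y[~pre])
    return slope_pre, slope_post

# Hypothetical monthly series (NOT the Ohio data); time in years since Jan 2005.
t = np.arange(60) / 12.0
t_ban = 28 / 12.0  # May 2007 is 28 months after January 2005
y = np.where(
    t < t_ban,
    100.0 - 4.7 * t,                          # assumed pre-ban trend
    100.0 - 4.7 * t_ban - 3.6 * (t - t_ban),  # assumed post-ban trend
)

slope_pre, slope_post = pre_post_slopes(t, y, t_ban)
print(f"pre-ban slope: {slope_pre:.2f}, post-ban slope: {slope_post:.2f}")
```

This version fits the two lines separately and does not force them to meet at the ban date. A more negative slope means a faster decline, so a pre-ban slope below the post-ban slope corresponds to the situation described above.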

Now consider what happens if you model these same data by forcing a polynomial curve. Because of the shape of the curve, it is going to markedly exaggerate the slope of the declining heart attack trend before the inflection point because there is only one year's worth of data. Thus, the actual data points prior to the intervention are going to be too high relative to the fitted curve, and to correct for that, the dummy variable "finds" that heart attack rates were higher in the pre-intervention period than after the intervention.

One can see this in Figure 1. There is no reason to believe, for example, that the decline in heart attacks from January 1, 2005 to January 1, 2006 was extremely sharp for the first half of the year and decelerated rapidly in the latter half of the year. There is no reason why a linear trend could not be hypothesized to have occurred, especially over such a short period of time when there is seasonal variation in rates. In order to correct for this, the cubic model needs to "artificially" increase the pre-intervention heart attack rate estimates.

This is best seen in Figure 2, where one can see that the observed heart attack rates from January 1, 2005 to June 1, 2005 are completely inconsistent with the model used. In fact, for some reason, the figure does not show the predicted heart attack values for January 1, 2005 to June 1, 2005.

By ignoring the actual sharp decline in heart attacks that occurred from January 1, 2005 to June 1, 2005, the model is able to make it appear that there wasn't much change in the heart attack rate just prior to the smoking ban. Obviously, this is nonsense. In fact, if one believes this modeling of the data, then there was a drastic decline in heart attacks in the early part of 2005 which leveled off in association with the smoking ban implementation.
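One way to probe this kind of sensitivity is to fit the same series twice, once with a linear trend plus a post-ban step dummy and once with a cubic trend plus the same dummy, and compare the estimated step. The sketch below assumes that simple specification; it is not the report's actual model, and the series is hypothetical, with a sharp early drop and no break at the ban date. Whether the dummy moves, and in which direction, depends on the data, so no result is claimed here.

```python
import numpy as np

def step_estimate(t, y, t_ban, degree):
    """Least-squares fit of a polynomial time trend of the given degree
    plus a post-ban step dummy; returns the estimated step coefficient."""
    trend = np.vander(t, degree + 1, increasing=True)  # columns 1, t, ..., t**degree
    dummy = (t >= t_ban).astype(float)                  # 0 before the ban, 1 after
    X = np.column_stack([trend, dummy])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[-1]

# Hypothetical series (NOT the Ohio data): a sharp drop over the first six
# months, then a steady linear decline with no break at the ban date.
t = np.arange(60) / 12.0          # years since January 2005
t_ban = 28 / 12.0                 # May 2007
y = 100.0 - 4.0 * t - 6.0 * np.minimum(t, 0.5)

for degree in (1, 3):
    print(f"degree {degree}: estimated step = {step_estimate(t, y, t_ban, degree):.3f}")
```

If the estimated step changes materially when only the assumed trend shape changes, the "effect" is a property of the model choice rather than of the data. As a sanity check, a series that is exactly a straight line yields a step of essentially zero under both specifications.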

The precarious nature of the analysis doesn't stop the report from going on to estimate the actual number of heart attacks averted and to calculate the dollar savings from those averted heart attacks. This should certainly give the reader pause as to the true intentions and purpose of the report. Is it to find out the truth, or to provide post-hoc justification for the smoking ban?

The rest of the story is that the data tell one story and the report tells quite another. When a linear model doesn't provide the answer one wants, it is just too easy to use more complicated models, for which there is no conceptual basis. If you try enough manipulations, you are always going to be able to show the effect that you want. However, it is the truth that we should be after, not "favorable" evidence.
