Monday, 9 April 2012

Simpson's Paradox





Simpson's Paradox is one of those subtly baffling situations in which statistics leads you along a perfectly logical path, where everything is simple and obvious, until suddenly you're left with your head spinning in confusion. It illustrates how the ideas of "on average" and "overall" can be surprisingly misleading, so beware of taking them at face value.


Edward Simpson, a WWII codebreaker at Bletchley Park, described the effect named after him back in 1951.


To see how it works, let's look at a 1986 study* into different methods for treating kidney stones: keyhole surgery versus open surgery. All the figures below are taken from this study; they are real results from actual patients.

Treatment of kidney stones using keyhole and open surgery

Treatment         Success  Failure  Total  Success%
Keyhole Surgery       289       61    350       83%
Open Surgery          273       77    350       78%

The table shows the results for 700 patients, 350 receiving each treatment: how many operations succeeded and failed under each treatment, and finally the percentage success rate per treatment.


So based on 350 patients undergoing each treatment, this paper found that the overall chance of a successful operation was 83% with keyhole surgery versus 78% using open surgery. So keyhole surgery was, on average, more successful than open surgery. That's simple enough. But not all kidney stones are the same, so let's pull out just the results for large stones, defined in the paper as 2cm or more in diameter:

Treatment of large kidney stones

Treatment         Success  Failure  Total  Success%
Keyhole Surgery        55       25     80       69%
Open Surgery          192       71    263       73%

Ahh, interesting: this shows a higher success rate for open surgery. So in this study it seems that despite keyhole surgery being more successful overall, when dealing with large kidney stones open surgery is better: it succeeded 73% of the time, compared with 69% for keyhole surgery. This must mean that for small stones keyhole surgery is far better...it has to be, otherwise how could it come out ahead of open surgery on average? Let's check:

Treatment of small kidney stones

Treatment         Success  Failure  Total  Success%
Keyhole Surgery       234       36    270       87%
Open Surgery           81        6     87       93%


So open surgery has a higher chance of success, by 93% to 87%. Er, what? How can open surgery be more effective at treating small stones as well? One treatment is better in all situations, but worse overall?
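
If you'd like to check the arithmetic yourself, here is a minimal Python sketch that recomputes every percentage from the raw counts in the three tables. The layout of the data and the names `counts`, `success_rate` and `overall_rate` are my own choices for illustration; they don't come from the paper.

```python
# (successes, patients) copied from the small-stone and large-stone tables.
counts = {
    "keyhole": {"small": (234, 270), "large": (55, 80)},
    "open": {"small": (81, 87), "large": (192, 263)},
}


def success_rate(successes, patients):
    """Return the success rate as a percentage."""
    return 100 * successes / patients


def overall_rate(treatment):
    """Pool small and large stones for one treatment."""
    successes = sum(s for s, _ in counts[treatment].values())
    patients = sum(n for _, n in counts[treatment].values())
    return success_rate(successes, patients)


# Success rate within each stone size.
for size in ("small", "large"):
    for treatment in ("keyhole", "open"):
        rate = success_rate(*counts[treatment][size])
        print(f"{size:5} {treatment:8} {rate:.0f}%")

# Success rate with both stone sizes lumped together.
for treatment in ("keyhole", "open"):
    print(f"all   {treatment:8} {overall_rate(treatment):.0f}%")
```

Running it reproduces the percentages in the tables: open surgery is ahead within each stone size, and keyhole surgery is ahead once the two sizes are pooled.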


What's going on?


So the study found that open surgery is better than keyhole surgery for large stones and it's also better for small stones, but overall it's worse! How can that be? You may be thinking that I've omitted to mention a "medium" category, but there isn't one. You can check the tables to confirm that the total number of patients (700) is equal to all the small cases (357) plus all the large cases (343): nothing is missing. 

So what's going on? How can open surgery be better for both large and small stones, but come off worse overall? Simpson's Paradox! But is it just a statistical trick or is there something real going on? Remember we're dealing with real research into real patients with real (and probably very painful) kidney stones. The point of the research is to find out which treatment gives patients the best chance of success, so what is a doctor supposed to recommend based on this? A coin flip?


The answer becomes clear if we look closer at the two tables breaking down small and large cases. Here they are again.



Treatment of large kidney stones

Treatment         Success  Failure  Total  Success%
Keyhole Surgery        55       25     80       69%
Open Surgery          192       71    263       73%




Treatment of small kidney stones

Treatment         Success  Failure  Total  Success%
Keyhole Surgery       234       36    270       87%
Open Surgery           81        6     87       93%


First, notice that regardless of which treatment is used, the chance of success when treating a small stone (87% or 93%) is always much higher than the chance of success when treating a large stone (69% or 73%). Small stones seem to be inherently easier to treat.


Second, notice that keyhole surgery was mostly used to treat small stones whereas open surgery was mostly used to treat large stones. The overall table shows 350 treatments using keyhole surgery, but 270 of them (more than 3/4) were treating the inherently less risky small stones. In contrast, 263 of the 350 open surgery treatments (about 3/4) were on patients with a large kidney stone.
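
These two observations combine through simple arithmetic: each treatment's overall success rate is a weighted average of its small-stone rate and its large-stone rate, weighted by how much of its caseload fell into each group. Here is a rough sketch of that calculation, with variable names of my own choosing:

```python
# (successes, patients) per stone size, from the tables above.
counts = {
    "keyhole": {"small": (234, 270), "large": (55, 80)},
    "open": {"small": (81, 87), "large": (192, 263)},
}

overall = {}
for treatment, by_size in counts.items():
    patients = sum(n for _, n in by_size.values())
    weighted = 0.0
    for size, (successes, n) in by_size.items():
        share = n / patients    # fraction of this treatment's caseload
        rate = successes / n    # success rate within this stone size
        weighted += share * rate
        print(f"{treatment:8} {size:5} share={share:.0%} rate={rate:.0%}")
    overall[treatment] = weighted
    print(f"{treatment:8} overall={weighted:.0%}")
```

Keyhole surgery's average leans heavily on its high small-stone rate, while open surgery's average leans heavily on its lower large-stone rate.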


So in this study, keyhole surgery was mainly used to treat the lower risk cases, while open surgery was mainly used in the higher risk cases. This flatters the performance of keyhole surgery when you put all the results together because you're not comparing like with like. The true picture emerges when you separate (or stratify) the easier and harder cases. The stratified results produce a fairer, like-for-like comparison, revealing that in this study open surgery outperformed keyhole surgery for both large and small stones.
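
One way to put a number on the like-for-like comparison is to ask what each treatment's overall success rate would have been if both had faced the same mix of stones. The sketch below uses the study's combined case mix (357 small, 343 large) as that common mix. This reweighting (statisticians call it direct standardisation) is my own illustration rather than an analysis from the paper, and it assumes that stone size is the only important difference between the two groups of patients.

```python
# (successes, patients) per stone size, from the tables above.
counts = {
    "keyhole": {"small": (234, 270), "large": (55, 80)},
    "open": {"small": (81, 87), "large": (192, 263)},
}

# Common case mix: all 700 patients in the study, split by stone size.
case_mix = {
    size: sum(counts[treatment][size][1] for treatment in counts)
    for size in ("small", "large")
}
all_patients = sum(case_mix.values())

# Apply each treatment's own success rates to the shared case mix.
standardised = {}
for treatment, by_size in counts.items():
    standardised[treatment] = sum(
        (case_mix[size] / all_patients) * (successes / n)
        for size, (successes, n) in by_size.items()
    )
    print(f"{treatment:8} standardised rate: {standardised[treatment]:.1%}")
```

With these counts, open surgery comes out ahead once both treatments face the same mix: roughly 83% against 78% for keyhole surgery.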


Mr Brilliant and Mr Average



Here's another way to picture it. Imagine a hospital with two surgeons, Mr Brilliant and Mr Average. To give patients the best chance of success you would aim to give all the difficult cases to Mr Brilliant, since he has the better chance of pulling off a successful treatment. Mr Average can then concentrate on simple, run-of-the-mill cases where the chance of success is always quite high. In that situation, it's quite possible that Mr Brilliant could have a lower overall success rate than Mr Average, despite being the better surgeon. Such a crude comparison flatters Mr Average because Mr Brilliant is taking all the difficult cases. To make the comparison fair you need to compare the surgeons' performance on similar cases.
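
To make this concrete, here is a small sketch with invented numbers. The caseloads and success counts below are entirely hypothetical, chosen only so that Mr Brilliant is better on both kinds of case yet looks worse overall.

```python
# Hypothetical caseloads: (successes, operations). These numbers are made up.
surgeons = {
    "Mr Brilliant": {"easy": (10, 10), "difficult": (63, 90)},
    "Mr Average": {"easy": (81, 90), "difficult": (5, 10)},
}

by_kind = {}   # success rate per surgeon per kind of case
overall = {}   # success rate per surgeon with all cases pooled

for name, cases in surgeons.items():
    by_kind[name] = {kind: s / n for kind, (s, n) in cases.items()}
    successes = sum(s for s, _ in cases.values())
    operations = sum(n for _, n in cases.values())
    overall[name] = successes / operations
    for kind, rate in by_kind[name].items():
        print(f"{name:12} {kind:9} {rate:.0%}")
    print(f"{name:12} overall   {overall[name]:.0%}")
```

In this made-up hospital Mr Brilliant wins on easy cases (100% against 90%) and on difficult ones (70% against 50%), yet his overall rate is 73% against Mr Average's 86%.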


In an ideal world, statistically speaking, you would allocate patients between the two surgeons at random so they'd both tackle the same mix of easy and difficult cases. That would make it very easy to see who's best, but it doesn't generally happen this way in real life because hospitals tend to prioritise the successful outcome for the patient over the easy life for the statistician. Hey-ho.
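
Here is a rough simulation of that ideal world. The success probabilities are hypothetical, as is the assumption that half of all cases are difficult; the only point is that when a coin flip decides who operates, the crude overall comparison can be trusted again.

```python
import random

# Hypothetical chance of success for each surgeon on each kind of case.
p_success = {
    "Mr Brilliant": {"easy": 0.95, "difficult": 0.70},
    "Mr Average": {"easy": 0.90, "difficult": 0.50},
}

rng = random.Random(2012)  # fixed seed so the run is repeatable
tally = {name: [0, 0] for name in p_success}  # [successes, operations]

for _ in range(10_000):
    kind = rng.choice(["easy", "difficult"])  # assumed 50/50 mix of cases
    surgeon = rng.choice(list(p_success))     # allocation by coin flip
    tally[surgeon][1] += 1
    if rng.random() < p_success[surgeon][kind]:
        tally[surgeon][0] += 1

observed = {name: s / n for name, (s, n) in tally.items()}
for name, rate in observed.items():
    print(f"{name}: {rate:.0%} success over {tally[name][1]} operations")
```

Because both surgeons now see much the same mix of cases, the better surgeon should also show the better overall rate.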


Final Thought



I've focussed on one study into treatments for kidney stones dating back to 1986. In practice this one study is rather old and its findings outdated. My aim here was to look at the interesting statistical features of its findings rather than to recommend to you a particular treatment for kidney stones - I'm not a doctor. My guess is that treatments and patient outcomes have moved on a lot in the 26 years since.

* Charig, C.R., Webb, D.R., Payne, S.R. and Wickham, J.E.A. (1986) Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy. British Medical Journal, 292, 879-882.

Success was defined as the stones being eliminated or reduced to <2mm. Success % rounded to 2 significant figures. The paper also looked at a 3rd treatment: using sound waves to break up kidney stones.

1 comment:

  1. This Bad Science column by the excellent Ben Goldacre shows a similar situation:

    http://www.guardian.co.uk/commentisfree/2011/aug/05/bad-science-adjusting-figures
