When, as recently happened to Alkermes, the US regulator calls your clinical trial a “p value fishing expedition”, there can be little doubt that we are in the middle of a statistical data analysis crisis. But what to do about it?
An idea that has been building up steam over the past year is to raise the bar for statistical significance from the current p value of 0.05 to 0.005, a suggestion that got another airing in JAMA this week. But this looks like nothing more than a way of replacing one arbitrary measure with another, and would do nothing to stop biotechs from routinely characterising failed trials as positive.
Perversely, it might actually achieve the opposite, pushing companies’ biostats departments into overdrive to torture datasets with the sole aim of hitting a purported level of statistical significance that clears a more stringent threshold than had been in force before.
The issue stems from a desire to show that a particular clinical result is not a fluke, and the convention is that a p value of 0.05 or less shows this to be the case with an acceptable level of certainty. But this convention is arbitrary, and ignores other equally important aspects of a clinical result.
One immediate problem with tightening the 0.05 threshold is that it would only deepen the industry's and investors' fixation on p values, while ignoring such considerations as effect size, clinical relevance and reproducibility, and – perhaps most importantly – the often murky question of how the p value was actually generated.
A p value is a measure of the likelihood that, were chance alone at work, results at least as extreme as those actually generated would be seen. A value of 0.05 equates to a 5% likelihood, and raising the bar to 0.005 would clearly cut this probability to 0.5%.
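That arithmetic can be checked directly: simulate a large number of trials of a drug that genuinely does nothing, and the fraction crossing each threshold falls out. A quick sketch in Python (the seed and trial count here are purely illustrative):

```python
import random
from statistics import NormalDist

random.seed(42)
norm = NormalDist()

def null_trial_p():
    """Two-sided p value for one trial of a drug that truly does nothing:
    under the null hypothesis the test statistic is standard normal."""
    z = random.gauss(0, 1)
    return 2 * (1 - norm.cdf(abs(z)))

p_values = [null_trial_p() for _ in range(100_000)]
frac_05  = sum(p < 0.05  for p in p_values) / len(p_values)
frac_005 = sum(p < 0.005 for p in p_values) / len(p_values)
print(f"crossed 0.05:  {frac_05:.3f}")   # roughly 5% of null trials
print(f"crossed 0.005: {frac_005:.4f}")  # roughly 0.5%
```

About one null trial in 20 clears the conventional bar by luck alone, and about one in 200 clears the proposed stricter one.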
So far so good. But it is also key to appreciate that for this to apply a hypothesis must have been specified prospectively. To test a dataset retrospectively to find the analysis that yields a particular p value is absurd, yet this is what biotechs routinely do with failed studies, knowing that some investors will assume that the measure with p<0.05 next to it shows clinical efficacy.
Another phenomenon, known as p value hacking, involves making post hoc changes to a study in an attempt to push the p value generated across the magical threshold. This is common, with biostatisticians apparently routinely asked to remove certain patients to manufacture statistical significance, for instance.
A slightly more advanced hack is to eyeball open-label data to guess what type of analysis would show an effect, before formally applying a statistical test. True, changing design and endpoints before the formal test is run might not technically breach a trial's integrity, but in reality this is plain cheating.
This leads neatly on to another underappreciated aspect of data mining, namely multiplicity.
There is only a set amount of statistical power in each trial, and every extra interrogation of a dataset raises the odds of a chance finding, so a correction for multiplicity has to be made when reporting data generated through multiple analyses. Yet such corrections frequently go unreported.
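The inflation is easy to quantify: test a drug that does nothing against, say, ten endpoints and the chance that at least one crosses p=0.05 by luck alone is 1 − 0.95^10, roughly 40%. A hypothetical simulation, with a Bonferroni-style correction (dividing the threshold by the number of looks) restoring the intended error rate:

```python
import random
from statistics import NormalDist

random.seed(0)
norm = NormalDist()

def null_p():
    """p value of one endpoint in a trial where the drug does nothing."""
    z = random.gauss(0, 1)
    return 2 * (1 - norm.cdf(abs(z)))

n_endpoints, n_trials = 10, 20_000
uncorrected = corrected = 0
for _ in range(n_trials):
    ps = [null_p() for _ in range(n_endpoints)]
    if min(ps) < 0.05:                 # report the best raw p value found
        uncorrected += 1
    if min(ps) < 0.05 / n_endpoints:   # Bonferroni: threshold / number of looks
        corrected += 1

print(f"false positive rate, uncorrected: {uncorrected / n_trials:.2f}")  # ~0.40
print(f"false positive rate, Bonferroni:  {corrected / n_trials:.3f}")    # ~0.05
```

Bonferroni is the crudest such correction; the point is simply that some adjustment must be made and, crucially, disclosed.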
Alternatively, companies use the glib response that “this was all accounted for in the statistical design” when asked whether, in stating the statistical significance of a purported secondary endpoint, an adjustment for multiplicity has been made.
Interrogate data enough and an apparent correlation will be found. One list yields genuine examples as ludicrous as the divorce rate in Maine being correlated with per-capita consumption of margarine, yet the fact that correlation does not equal causation is frequently lost on investors.
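The margarine-and-divorce flavour of coincidence is trivial to manufacture: dredge through enough random series and one of them will correlate strongly with any target. A toy sketch, in which every series is pure noise (the counts and seed are illustrative):

```python
import random
import math

random.seed(7)

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A "target" series: think ten annual readings of some rate (pure noise here)
target = [random.gauss(0, 1) for _ in range(10)]

# Dredge through 1,000 equally random, unrelated series; keep the best match
best = max(
    abs(pearson(target, [random.gauss(0, 1) for _ in range(10)]))
    for _ in range(1000)
)
print(f"best |correlation| dredged up: {best:.2f}")  # typically well above 0.7
```

None of these series has anything to do with any other, yet a striking "relationship" emerges on demand.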
Even prospective hypotheses that yield positive results with no skulduggery might not in fact indicate success. After all, a 0.05 threshold means that around five in every 100 studies of an utterly ineffective drug would still come up positive by chance alone, so a batch of 100 trials that each just scraped past p=0.05 is likely to contain a handful of false positives.
The proposal to raise the p value bar to 0.005 does nothing to counteract these fundamental problems; it merely assumes that the most egregious examples of cheating would be weeded out. This week’s JAMA analysis of published studies meeting endpoints with p=0.05 reckons about 70% would also show statistical significance at p=0.005.
A pertinent question might be how many of the 30% that fail to pass muster at 0.005 might actually be well-designed and genuinely positive trials. It is well known that biotechs with limited finances design studies to be as small and inexpensive as possible, and only just big enough to scrape past powering requirements.
Is it relevant?
And yet another problem with reliance on p values is that these say nothing about the size or clinical relevance of an effect observed, merely how readily it could have arisen by chance. Curiously, the JAMA authors suggest that a stricter p value threshold could “encourage a reliance on effect sizes rather than p values”, something that might only be true if trials routinely failed to hit p=0.005.
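The divorce between significance and relevance is easily illustrated: a clinically trivial effect becomes "highly significant" once a trial is big enough. A hypothetical example; the 0.5-unit benefit, 10-unit standard deviation and sample size below are invented for illustration:

```python
import math
from statistics import NormalDist

norm = NormalDist()

# Hypothetical numbers: a 0.5-unit benefit against a 10-unit standard
# deviation, an effect size few clinicians would care about
effect, sd, n_per_arm = 0.5, 10.0, 20_000

se = sd * math.sqrt(2 / n_per_arm)   # standard error of the difference in means
z = effect / se
p = 2 * (1 - norm.cdf(z))
print(f"z = {z:.1f}, p = {p:.1e}")   # a minuscule p value for a trivial effect
```

Run a trial this large and even a negligible benefit sails past p=0.005; the p value certifies only that the effect is probably not zero, not that it matters.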
A much better suggestion would be to force full disclosure of clinical trial plans so that the badly designed studies can be spotted; the same goes for up-front disclosure of statistical analysis methods. Clinicaltrials.gov entries at best mention such issues in passing.
Alternative statistical methods could also be encouraged. Bayesian analysis, for instance, is a method proposed by some respected statisticians, though the black magic behind it is even more poorly understood by generalists than classical null hypothesis testing.
None of which should excuse biotech investors from acquainting themselves with the subject matter, and tirelessly grilling executives to ascertain the truth behind the spin. Common company excuses for failure, such as the claim that measuring a different endpoint or running a larger study would have shown a statistically significant result, should be treated with disdain.
The fact is that a flawed understanding of statistics, common among some investors and probably a few biotech C-suites too, likely lies at the heart of the current malaise. The bald view that a certain p value – be it 0.05 or 0.005 – shows that the data are “good” is in itself fundamentally flawed. It is this ignorance that needs to be treated.