This is the 4th article in the ‘Why do we do systematic reviews?’ series (see references below for the previous articles [1, 2, 3]). The series explores the reasons for undertaking a systematic review, four of which have proved popular. The third most popular reason, with 23.6% of the votes at the time of writing, is ‘To quantify, quite tightly, how good the intervention is’.
I will use this article to highlight the clear, demonstrable failings of standard systematic review methods in quantifying – accurately – how good an intervention is. I will then bring it round to the case for rapid reviews.
To be clear, I find this reason the most depressing of them all – the belief that systematic reviews can be relied upon to give an accurate estimate of the effect size of an intervention. It highlights a real problem at the heart of EBM, stemming from the prominent place systematic reviews hold: they are seen as the pinnacle of the evidence pyramid. Yet their shortcomings are not much discussed, and people all too often accept the results uncritically. It is the lack of transparency that is particularly insidious. While a systematic review based on published journal articles might be the best available evidence, that doesn’t make it right, or even particularly accurate. One of my favourite quotes seems pertinent:
“Those that choose the lesser of two evils soon forget they chose evil at all”
My main concern, apart from the length of time it takes to do a traditional systematic review, is publication bias. This is the situation whereby 30–50% of all trials are never published. If a central ethos of systematic reviews is to incorporate all trials, they have failed from the outset. We know that Cochrane, one of the more rigorous systematic review producers, doesn’t have a robust system for dealing with unpublished trials. In fact I would describe it as un-systematic.
Do these missing trials have an effect? Absolutely. In 2008 Turner published an analysis of unpublished studies of antidepressants. For this he had to look through regulatory data (in this case from the FDA) to find the unpublished information. While pharmaceutical companies need to register their trials with the FDA, they are under no obligation to publish them. By carefully going through the FDA data (a non-trivial task) you can find trials that have been carried out but not published. The article reported that there were 38 studies favouring an antidepressant, of which 37 were published. Of the 36 that were negative, only 3 were published as negative. So there is pretty much an even split between positive and negative trials, yet if you relied on published studies alone, nearly all would appear positive. It’s obvious the results are likely to be distorted – which is exactly what Turner reported.
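A bit of back-of-the-envelope arithmetic, using the counts quoted above, shows how large the distortion is:

```python
# Turner's counts (quoted above): 38 positive trials (37 published),
# 36 negative trials (only 3 published as negative).
positive_total, positive_published = 38, 37
negative_total, negative_published = 36, 3

# Share of positive results among all trials vs among published trials only
all_trials_rate = positive_total / (positive_total + negative_total)
published_rate = positive_published / (positive_published + negative_published)

print(f"Positive among all trials conducted: {all_trials_rate:.1%}")  # roughly an even split
print(f"Positive among published trials:     {published_rate:.1%}")   # almost all positive
```

In other words, the published literature turns a roughly 50:50 split into a better than nine-in-ten success rate.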
Four years later Hart, using similar methods, looked at a much wider group of interventions.
“Overall, addition of unpublished FDA trial data caused 46% (19/41) of the summary estimates from the meta-analyses to show lower efficacy of the drug, 7% (3/41) to show identical efficacy, and 46% (19/41) to show greater efficacy.”
In just under 50% of cases the discrepancy was greater than 10%. But what is possibly more troubling is that the results are unpredictable. There is no way of knowing whether a meta-analysis based on published trials is likely to under-estimate or over-estimate the true effect size. Historically the assumption (acknowledged by Hart) was that it is the negative trials that go unpublished (as Turner had shown), but this was simply not demonstrated in the larger study.
Tom Jefferson (known mostly for his work on Tamiflu) used the analogy of an iceberg in relation to data used for evidence synthesis.
He points out that most systematic reviews are missing huge amounts of data, not just unpublished journal articles but also the large amount of data included in documents such as clinical study reports.
The iceberg image is particularly powerful as it exposes the nonsense that systematic reviews are based on ‘everything’. The inclusion criteria of most systematic reviews (i.e. use all published journal articles) are arbitrary and not evidence-based. They are as much about convenience as ‘truth’. The papers by Hart and Turner show the effects.
Bottom line 1: if you look at any systematic review there is only a 50% chance that the estimate of effect is within 10% of the real figure. This is compounded by the fact that you can’t be sure if the review you’re looking at is close, an overestimate or an underestimate.
But how does this link to rapid reviews? It starts with asking what, when looking at a systematic review, you can actually say about it. I’m reduced to thinking that it can give a ballpark figure of the effectiveness of an intervention – nothing more. And, if you’re happy with a ballpark figure, do it quickly! I actually find this thinking quite liberating!
Given the iceberg image, is it really problematic to rely on a good sample of published journal articles? If your position is that you need all trials, and therefore a sample is wrong, then you have to concede that the vast majority of current systematic reviews are wrong. On the other hand, if you’re happy with a sample, why is a 70% sample of the trials appreciably better than 60% (as you might get via a rapid search as part of a rapid review)? It is nonsensical.
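To illustrate (this is my own toy simulation, not data from any of the papers cited): if the published trials you find are an unbiased subset, pooling a 60% sample gives almost the same answer as pooling the full set.

```python
import random
import statistics

random.seed(1)  # illustrative run; any seed tells the same story

# Hypothetical: 100 published trials, each estimating a true effect of 0.30
# with per-trial sampling noise.
trials = [random.gauss(0.30, 0.10) for _ in range(100)]

full_estimate = statistics.mean(trials)                       # pool all published trials
sample_estimate = statistics.mean(random.sample(trials, 60))  # pool a random 60% sample

print(f"All published trials: {full_estimate:.2f}")
print(f"Random 60% sample:    {sample_estimate:.2f}")
```

Neither estimate, of course, does anything about the submerged part of the iceberg – the unpublished trials are missing from both.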
The argument outlined above can be reapplied to the iceberg theme/image.
But is there any evidence of what happens when you take a sample of published trials (versus a systematic review that attempts to get all published trials)? I’m aware of two articles [8, 9] that use different sampling techniques, and both report almost identical outcomes to full-blown systematic reviews. My own work on rapid reviews, although not peer-reviewed, reaches similar conclusions.
The notion of having to find all published articles is, to me, nonsense; it is wasteful and therefore unethical. This is compounded by the (possibly wilful) suppression of this information. People look to systematic reviews for accurate information; that the shortcomings are not advertised is negligent. Systematic reviews waste an awful lot of effort, time and resource. Methodologists fiddle around – at great effort and cost – trying to remove biases that invariably have a much smaller effect than missing trials. As stated before, one of the few reasons I can see is an economic one: a high methodological entry point acts as a barrier to competition.
Bottom line 2: do reviews rapidly, and explicitly mention the methodological shortcomings. If more accurate information is really required then you invariably need more resource, and need to go below the surface to the depths of the iceberg.
- Why do we do systematic reviews?
- Why do we do systematic reviews? Part 2
- Why do we do systematic reviews? Part 3
- Searching for unpublished data for Cochrane reviews: cross sectional study. Schroll JB et al. BMJ 2013;346:f2231.
- Selective publication of antidepressant trials and its influence on apparent efficacy. Turner EH et al. N Engl J Med 2008;358(3):252-60.
- Effect of reporting bias on meta-analyses of drug trials: reanalysis of meta-analyses. Hart B et al. BMJ 2012;344:d7202.
- McMaster Premium LiteratUre Service (PLUS) performed well for identifying new studies for updated Cochrane reviews. Hemens BJ et al. J Clin Epidemiol 2012;65(1):62-72.e1.
- A pragmatic strategy for the review of clinical evidence. Sagliocca L et al. J Eval Clin Pract 2013. doi: 10.1111/jep.1202
- Economics and EBM. Liberating the Literature. October 2014.
14 thoughts on “Why do we do systematic reviews? Part 4”
Is the non-trivial search through the FDA’s database of unpublished studies something that could be crowd-sourced? Is the data available to the public? If so, imagine something like StackOverflow where different statistical evaluations of a given FDA study could be vetted by others, with the best voted to the top! Got an extra couple hours? Review yet another FDA study, type it up on DrugStudyOverflow, and win a badge! StackOverflow offers their technology for free, I believe, for others who want to create sites about other topics. Then the site’s metadata could be used to quickly gather a set of studies — for rapid, systematic studies.
I doubt this is a new idea … but I’m wondering if there isn’t some way to apply similar crowd-sourcing voting to something like a moving average assessment of a medical factoid.
For example, something that I find confusing is the state of thinking about the impact of saturated fats on heart disease. It was simply known that they were the predominant cause, but that seems to have changed.
Imagine asking a board of experts to come up with a base level of threat from saturated fat, somehow summarized as a “number”. This sounds dumb, but for discussion’s sake, let’s say the board concludes that “Consuming more than 600 calories/day from red meat, butter, and other sources of saturated fat increases the likelihood of a fatal heart attack in men over age 65 by 85%.” That becomes the number.
Then a new study comes out. It *perturbs* the number by some amount, based on crowd-sourced evaluations of the study’s quality, and the number from the study — with more impact when there are more high-quality evaluations, with the quality assessment also crowd-sourced (StackOverflow again). If it’s a great study with lots of evaluators, and it shows *no effect* from saturated fat on heart disease among 65+ males — then the number drops to 84%.
The board of experts has much greater mass than the study, so the study only moves the number a little. The next study moves it more: 83.5%. Maybe the third study moves it in the opposite direction of the first two, back toward the baseline: 84.5%. And so on. Where the board of experts has low consensus, the initial mass is lower and the perturbations of subsequent studies have larger effects.
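What I’m describing is essentially an inertia-weighted running average; a minimal sketch (all names and numbers are purely illustrative):

```python
def nudge(consensus, mass, study_estimate, study_weight):
    """Move the consensus number toward a new study's estimate.

    study_weight stands in for crowd-sourced quality: more high-quality
    evaluations -> larger weight -> bigger perturbation.
    """
    new_mass = mass + study_weight
    new_consensus = (consensus * mass + study_estimate * study_weight) / new_mass
    return new_consensus, new_mass

# The board of experts sets the starting number with a lot of "mass"
number, mass = 85.0, 50.0

# A well-evaluated null study (estimate 0) only moves the number a little...
number, mass = nudge(number, mass, 0.0, 1.0)
# ...and a second, more heavily weighted study moves it a bit more.
number, mass = nudge(number, mass, 0.0, 1.5)
print(round(number, 1))
```

Low consensus on the board just means a smaller starting mass, so subsequent studies perturb the number more.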
The web site shows the change in the number over time, and the data point representing the impact of each study is a link to the web page where that study is evaluated on the site. Whoops, *there’s* an outlier that dropped the number significantly, let’s go read about that one! Maybe the size of the dot indicates the number of evaluations, and its color reflects the degree of consensus, e.g. low saturation means low consensus.
Some potential advantages:
It encourages lots of evaluations, in a public forum. You know better than me, the problems with the conventional peer review system. The forum is a place for experts to show off their chops, persuade others, score points and win badges (that stuff works, witness StackOverflow’s success).
It offers a way for studies to get visibility and have an impact immediately upon publication, instead of however many years later when a team undertakes the systematic review. It promotes useful conversations: The study authors want to have impact, so they’re participating in their study’s evaluation forum, possibly offering new data or new analyses to respond to critics.
Preliminary studies could appear on site, e.g. before all the data is in for a long-term study. Feedback from evaluators might improve the study going forward. “Uh, that’s a pretty good alternative hypothesis, let’s collect some data about that.” The site could host studies that were not accepted for publication. Those unpublished FDA studies could be published on the site.
Over time, the site might develop familiar memes, for example common errors in research design, e.g. the Selection Maturation guy. Hopefully such errors would occur less in future studies.
The site could increase collaboration: Maybe participants who get to know each other from their comments on related studies, join together to conduct new research into an issue raised by the studies they’re discussing on the forum.
There’s math underneath: apply Newtonian physics where the scientific consensus is the heavenly body. This steadies the science. When a journalist wants the latest word on a subject, it’s very similar to last month’s latest word. Hysteria over this week’s study would be uncool.
Evaluators would acquire mass. When Prof. Schickele (at the University of Southern North Dakota at Hoople) has offered a hundred evaluations, most of which are upvoted by others to top levels, and his evaluations correlate highly with the overall assessments of the studies – then the Professor’s opinions accrue greater weight and have greater impact. (On StackOverflow, there’s a guy who is well-known to deliver the *right answer* if the subject is C# – he has many points and huge reputation and all the badges – and he’s super-credible.) When the Professor clicks the [Submit] button, the number observably changes in the direction of the Professor’s evaluation, because of his massive mass – he is rewarded for being a high-quality evaluator. With StackOverflow, mass is signified by *reputation* – though there’s no weighting of opinions like I’m suggesting.
Trolls and shills end up with embarrassingly low reputation. Their evaluations are out of sight at the bottom of the page, and have no impact on the number.
Uh, I may have gone on a little too long. Feel free to remove this comment if it’s nonsense or otherwise objectionable. My feelings won’t be hurt. It’s just an idea.
The lead author of the 2008 NEJM article, Erick Turner (@eturnermd1) sent two Tweets in relation to the above story to the rapid reviews Twitter account (@rapid_revs_info):
1) Re: our paper w/ the unpub’d antidep trials: all 23 found in FDA reviews, 0 in Cochrane Central Reg of Ctrld Trials
2) “Cookbook” 4 navigating to & through the user-unfriendly FDA website & review docs http://www.bmj.com/content/347/bmj.f5992
Thanks for this series of articles John, I’ve found them a very interesting read and on the whole I agree with your conclusions. Those of us involved in carrying out reviews and teaching EBM are well aware of the limitations of systematic reviews, and for a while now I (and I guess many others) have come to the conclusion that we probably need some new definitions around reviews.

As you point out, a “true” systematic review should synthesise all of the evidence, both published and unpublished, and if the reviewers haven’t been able to obtain the unpublished evidence then maybe they shouldn’t be claiming to have written a systematic review. Maybe we need to start calling them what they are: a review and/or synthesis of published evidence – an “attempted systematic review” perhaps – which reports candidly why unpublished data weren’t included. (We perhaps need to think of something a bit sexier than that as a name for this type of review to please journal publishers though?) It is of course incumbent on all authors and publishers of reviews to make this plain, and on the whole I think authors do tend to point out the limitations of the review they’ve written (if they don’t, peer reviewers are always keen to do so in my experience).

I like the idea of “rapid reviews”, however I would say that a rapid review by its very nature suffers from a study selection bias which limits the validity of its conclusions even more than an attempted systematic review does. If you are looking for a quick conclusion on the effectiveness of a treatment, why not just look for the “best” RCT and base your clinical decision making on that – would that not have the same validity as a rapid review? Indeed, if most rapid reviews come to the same overarching conclusion that most published systematic reviews do in my opinion (i.e. that more good-quality RCTs need to be done), then decision making on the best single RCT would seem to me to be just as valid.
Thanks for that. A few points:
– Is study selection bias a real issue? If you take an unbiased sample of published trials – done rapidly or otherwise – what are the effects on the bias? It appears that the bigger trials are typically easier to find and – handily – they are broadly better. You spend lots of time finding the more esoteric studies which are typically the ones that are worse.
– As for the best RCT, I think heuristics like this are essential. But it comes down to why you’re doing the review (one size does not fit all). I think the best RCT would work if you’re looking for a good estimate of effect size. However, if you’re doing a SR to look at what’s been done previously that will have limited value.
I think study selection bias is an issue because, in terms of critical appraisal of a review (and I’m avoiding saying systematic review because, as you’ve correctly pointed out, there aren’t actually that many true systematic reviews out there), it looks just like confirmation or exclusion bias and may therefore lead to a review being dismissed before the results are considered – the perception being that the internal validity of the review is fatally flawed by these biases.
I have really enjoyed this series, and this post in particular. It explains my permanent uncomfortable state of frustration as a member of a Cochrane group dedicated to increasing the involvement of patients and the public with Cochrane systematic reviews. I can fully understand why Cochrane seems to be dragging its feet on this issue, but I am not going to give up. Meaningful consumer involvement with systematic reviews by setting review questions and selecting outcomes will quickly make the many problems besetting primary research obvious to the people who are most negatively and personally affected. Cochrane could see this as an opportunity to engage, educate and mobilize the public to demand and help design high quality primary research.
Thanks Caroline. One thing that someone pointed out is that Cochrane is a relatively minor player in the SR world, publishing a small proportion of the total. But your point is valid: involving patients is surely important, and those that ignore them are doing a real disservice. Patients should be in from the start, ensuring the outcomes are valid for them; if the RCTs are not dealing with issues/outcomes they care about, the validity of any SR will be limited.
Which is why, to survive and prosper, I think Cochrane should recognise that producing “the best” systematic reviews is not its raison d’être. By involving patients and the public in a meaningful way in what it does, and doing things very differently, perhaps great things could be achieved. Internal validity of a review is of no use when the external validity is severely compromised by poor primary research (which is also done without the input of patients).
Really appreciate this series and the thinking. So glad to hear your views on the importance of involving patients and the public for meaningful relevance to health care Caroline!