
Efficacy of antidepressants: a re-analysis and re-interpretation of the Kirsch data

Konstantinos N. Fountoulakis, Hans-Jürgen Möller
DOI: http://dx.doi.org/10.1017/S1461145710000957 405-412 First published online: 1 April 2011


Recently there has been much debate on the true usefulness of antidepressant therapy, especially after the publication of a meta-analysis by Kirsch et al. (PLoS Medicine 2008, 5, e45). The aim of the current paper was to recalculate and re-interpret the data of that study. Effect sizes and mean-score changes were calculated for each agent separately, as well as pooled effect sizes and mean changes, on the basis of the data reported by Kirsch et al. The weighted mean improvement was (depending on the method of calculation) 10.04 or 10.16 points on the Hamilton Depression Rating Scale (HAMD) in the drug groups, instead of 9.60, and thus the correct drug–placebo difference is 2.18 or 2.68 instead of 1.80. Kirsch et al. failed to report that the change in HAMD score was 3.15 or 3.47 points for venlafaxine and 3.12 or 3.22 for paroxetine, which are above the NICE threshold. Still, the figures for fluoxetine and nefazodone are low. Thus it seems that Kirsch et al.'s meta-analysis suffered from important flaws in the calculations; reporting of the results was selective and conclusions unjustified and overemphasized. Overall the results suggest that although a large percentage of the placebo response is due to expectancy, this is not true for the active drug, and effects are not additive. The drug effect is always present and is unrelated to depression severity, while this is not true for placebo.

Key words
  • Antidepressants
  • depression
  • efficacy
  • meta-analysis
  • treatment


Recently there has been much debate on the true usefulness of antidepressant therapy (Ghaemi, 2008). Usually meta-analytical studies report a statistically significant result; however, the effect size is relatively small (Bech et al. 2000) and many authors consider it to be of marginal clinical significance (Moncrieff et al. 2004). When datasets including unpublished as well as published clinical trials are used in meta-analyses, even smaller effects are reported. Recently a meta-analysis on the usefulness of antidepressants by Kirsch et al. (2008) attracted much attention from scientists and the general public. These authors obtained data from the FDA and thus tried to avoid publication bias. Their meta-analysis reported that the overall effect of new-generation antidepressant medications is below the NICE recommended criteria for clinical relevance. They also reported that efficacy reaches clinical relevance only in trials involving the most extremely depressed patients, and that this pattern is due to a decrease in the response to placebo rather than an increase in the response to medication.

The aim of the present study was to re-analyse these data, verify the numerical results and investigate whether the conclusions are supported by the data. So far only narrative critique on the basis of clinical and accumulated scientific wisdom has been made regarding that meta-analysis; however, there is no verification of its results. Furthermore, the aim of the present study was to put the results and conclusions that derive from this meta-analysis in their correct frame in terms of interpretation and clinical implications.

Material and methods

The paper by Kirsch et al. (2008) was carefully scrutinized so as to extract all major conclusions. The data reported in table 1 of that paper were used for the calculations presented in the current study, with the addition of the publication dates of trials as reported in the reference list of the same meta-analysis.

Table 1

Main findings by Kirsch et al. in contrast to the results of the current analysis

The pooled mean (Mp) was calculated by the following function: $M_p = \frac{\sum_i n_i M_i}{\sum_i n_i}$, where Mp is the pooled mean, ni is the sample size of the ith sample, and Mi is the mean of the ith sample.
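The sample-size-weighted pooled mean described above can be sketched as follows. This is a minimal illustration only: the function name `pooled_mean` and the example numbers are assumptions made for this sketch, not values taken from table 1 of Kirsch et al.

```python
def pooled_mean(means, sizes):
    """Pool arm means, weighting each mean M_i by its sample size n_i."""
    if len(means) != len(sizes) or len(sizes) == 0:
        raise ValueError("means and sizes must be non-empty and of equal length")
    total_n = sum(sizes)
    # Mp = sum(n_i * M_i) / sum(n_i)
    return sum(n * m for n, m in zip(sizes, means)) / total_n


# Hypothetical example: two arms with mean HAMD improvements of 10 and 8 points.
example_pooled = pooled_mean([10.0, 8.0], [100, 300])
print(example_pooled)  # 8.5
```

The larger arm dominates the pooled value, which is the intended behaviour of weighting by sample size.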

The effect size (Cohen's d) was calculated as the mean change divided by the standard deviation of the scale. It can be calculated either as an absolute measure of an arm's efficacy or as the difference in efficacy between arms.
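A minimal sketch of this effect-size calculation is given below. The function name `cohens_d` and all numbers are hypothetical, and taking the between-arm effect as the simple difference of the two within-arm d values is an assumption of this sketch rather than a procedure stated in the text.

```python
def cohens_d(mean_change, sd):
    """Within-arm effect size: mean change divided by the standard deviation."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return mean_change / sd


# Hypothetical arms (not data from Kirsch et al.).
d_drug = cohens_d(10.0, 8.0)     # absolute efficacy of the drug arm
d_placebo = cohens_d(8.0, 8.0)   # absolute efficacy of the placebo arm

# Assumption: between-arm efficacy expressed as the difference of the two d values.
d_difference = d_drug - d_placebo
print(d_drug, d_placebo, d_difference)  # 1.25 1.0 0.25
```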

By reversing this function, one can calculate the standard deviation on the basis of available d and mean values. The pooled standard deviation (SDp) was calculated by the following function: $SD_p = \sqrt{\frac{\sum_{i=1}^{k} (n_i - 1)\, s_i^2}{\sum_{i=1}^{k} n_i - k}}$, where SDp is the pooled standard deviation, ni is the sample size of the ith sample, si is the standard deviation of the ith sample, and k is the number of samples being combined.
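Both steps in this paragraph can be sketched in a few lines. The sketch assumes the standard pooled-variance formula, in which each sample variance is weighted by n_i − 1 and the denominator is the total sample size minus k. The function names and the example numbers are hypothetical.

```python
import math


def sd_from_d(mean_change, d):
    """Recover the standard deviation by inverting d = mean_change / SD."""
    if d == 0:
        raise ValueError("d must be non-zero to recover the standard deviation")
    return mean_change / d


def pooled_sd(sds, sizes):
    """Pooled SD over k samples: sqrt(sum((n_i - 1) * s_i^2) / (sum(n_i) - k))."""
    k = len(sds)
    if k == 0 or k != len(sizes):
        raise ValueError("sds and sizes must be non-empty and of equal length")
    numerator = sum((n - 1) * s ** 2 for n, s in zip(sizes, sds))
    denominator = sum(sizes) - k
    return math.sqrt(numerator / denominator)


# Hypothetical arm: mean change of 9.6 points with d = 1.2 implies SD = 8.
recovered_sd = sd_from_d(9.6, 1.2)

# Hypothetical pooling of two equally sized samples with SDs of 6 and 8.
combined_sd = pooled_sd([6.0, 8.0], [101, 101])
print(recovered_sd, combined_sd)
```

In the second example the pooled variance is (100·36 + 100·64)/200 = 50, so the pooled SD is the square root of 50, about 7.07.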

The pooled effect size and the pooled change in Hamilton Depression Rating Scale (HAMD) score were each calculated in two ways, by weighting by sample size and by weighting by the inverse of the variance, although some authors suggest these methods are essentially equivalent (Friedrich et al. 2008).
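The two weighting schemes can be compared with a short sketch. The text does not spell out how the variance of each estimate was obtained, so this sketch assumes the variance of an arm's mean change is s_i²/n_i, which gives inverse-variance weights of n_i/s_i². All names and numbers are hypothetical.

```python
def weighted_mean(values, weights):
    """Generic weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights) or len(values) == 0:
        raise ValueError("values and weights must be non-empty and of equal length")
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def inverse_variance_weights(sds, sizes):
    """Assumed weights: w_i = 1 / (s_i^2 / n_i) = n_i / s_i^2."""
    return [n / s ** 2 for n, s in zip(sizes, sds)]


# Hypothetical trials with equal sizes but different spread.
changes = [10.0, 12.0]
sizes = [100, 100]
sds = [5.0, 10.0]

by_sample_size = weighted_mean(changes, sizes)
by_inverse_variance = weighted_mean(changes, inverse_variance_weights(sds, sizes))
print(by_sample_size, by_inverse_variance)
```

With equal sample sizes the first scheme gives the plain average (11.0), whereas the second shifts towards the more precise trial (10.4). The two schemes coincide only when the standard deviations are similar across trials.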

All the calculations of the current study were based on table 1 of the original paper of Kirsch et al. (2008) and thus calculations reported here are easily verifiable and replicable. We present all the data that derive from the current recalculation and do not make any effort to select the most important, in an effort to produce a transparent picture.


The main results and conclusions derived from the text of the Kirsch et al. meta-analysis are shown in Table 1. The results from the recalculation of data are shown in Table 2.

Table 2

Individual by trial and pooled baseline HAMD scores, sample sizes, standard deviations and raw improvement and effect sizes d, on the basis of table 1 by Kirsch et al. (2008)

In contrast to what was reported by Kirsch et al., the recalculation of values on the basis of the data reported by those authors in their study (Table 1) revealed that the correct weighted mean improvement was 10.04 (weighting by sample size) and 10.16 (inverse variance) points on the HAMD in the drug groups, instead of 9.60. Since the placebo change was similar in both analyses (7.80 vs. 7.85/7.48), the correct drug–placebo difference is 2.18 (weighting by sample size) or 2.68 (inverse variance) instead of 1.80. This means that Kirsch et al. reported a >25% lower difference. In contrast, all the effect sizes (d) reported by Kirsch et al. are confirmed by the results of the current study. However, these authors failed to report that the change in HAMD score was 3.15 (weighting by sample size) or 3.47 (inverse variance) points for venlafaxine and 3.12 (weighting by sample size) or 3.22 (inverse variance) for paroxetine, which are above the NICE threshold of 3 points change.
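The arithmetic in this paragraph can be retraced with the rounded figures quoted in the text. This is a sketch, not a re-run of the analysis: the variable names are illustrative, and because the inputs are already rounded to two decimals, the sample-size-weighted difference comes out as 2.19 rather than the reported 2.18, presumably reflecting rounding of the underlying unrounded values.

```python
NICE_THRESHOLD = 3.0  # points of HAMD change, as stated in the text

# Pooled mean improvements quoted in the text (drug, placebo).
kirsch_difference = 9.60 - 7.80          # as reported by Kirsch et al.
difference_by_sample_size = 10.04 - 7.85
difference_by_inverse_variance = 10.16 - 7.48

# Per-drug drug-placebo differences quoted in the text
# (weighting by sample size, inverse variance).
per_drug = {
    "venlafaxine": (3.15, 3.47),
    "paroxetine": (3.12, 3.22),
}
exceeds_threshold = {
    drug: all(value > NICE_THRESHOLD for value in values)
    for drug, values in per_drug.items()
}

print(round(kirsch_difference, 2))
print(round(difference_by_sample_size, 2))
print(round(difference_by_inverse_variance, 2))
print(exceeds_threshold)
```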

The plot of effect size d vs. year of publication (Fig. 1) suggests there might be an increased drug and placebo efficacy over the passing years, which remains to be explained. However, the difference between drug and placebo remains stable over the years with regression lines being parallel. In Fig. 2 the difference between groups produces a horizontal line suggesting again that the placebo–drug difference is independent of year.

Fig. 1

Plot of effect sizes of drug (–––––) and placebo (· · · · · · ·) vs. year of publication. The lines are parallel (y=−33.9406+0.0175x and y=−34.6242+0.018x).

Fig. 2

Plot of effect sizes of drug–placebo difference vs. year of publication. The line is horizontal (y=−0.6836+0.0006x).
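To give a sense of how flat the fitted lines are, the coefficients quoted in the two figure captions can be evaluated directly. The years 1985 and 2000 are hypothetical evaluation points chosen only for illustration; the captions do not state the range of publication years.

```python
# Coefficients copied from the figure captions.
FIG1_SLOPES = (0.0175, 0.018)      # slopes of the two regression lines in Fig. 1
FIG2_INTERCEPT = -0.6836           # difference line in Fig. 2
FIG2_SLOPE = 0.0006


def fig2_difference(year):
    """Fitted drug-placebo difference in d for a given publication year."""
    return FIG2_INTERCEPT + FIG2_SLOPE * year


# Gap between the two Fig. 1 slopes (d units per year).
slope_gap = FIG1_SLOPES[1] - FIG1_SLOPES[0]

# Change in the fitted difference across a hypothetical 15-year span.
change_over_span = fig2_difference(2000) - fig2_difference(1985)
print(slope_gap, change_over_span)
```

A slope of 0.0006 per year corresponds to a shift of only about 0.009 in d over 15 years, which is consistent with describing the line as horizontal.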


Several papers in the literature criticize the conclusions of Kirsch et al. (Bech, 2010; Broich, 2009; Ghaemi, 2008; McAllister-Williams, 2008; Moller, 2008; Moller, 2009a, b; Moller & Broich, 2010; Moller & Maier, 2010) by focusing on the limitations of RCTs, on clinical issues and especially, on the problematic properties of HAMD (Bech, 2001, 2004, 2006, 2010; Bech et al. 2004) and on the fact that the effectiveness of antidepressants in clinical practice is normally optimized by sequential and combined therapy approaches (Rush, 2007).

The additivity thesis of pharmacological efficacy is central in RCT logic, being the assumption that the specific or ‘true’ magnitude of the pharmacological effect is limited to the difference between the drug and placebo responses (Waring, 2008). This is a convenient and practical way to prove a specific drug's efficacy, and does not necessarily demand an identical neurobiological mode of therapeutic action, although the theory behind this method suggests it. Kirsch's scientific work is largely on ‘response expectancy’, which has been the focus of his research for decades, especially concerning hypnosis, psychotherapy, placebo effects, etc. In a further exploration of Gelfand's early theory, response expectancy was found to be altered by previous experience, and even very small changes in the context of presentation can affect individual differences in the placebo response (entire placebo situation) (Gelfand et al. 1963), while the placebo effects were found to be significantly associated with response expectancy (Whalley et al. 2008). Essentially this model does not take into consideration the fact that placebo-arm patients receive additional treatment, usually with benzodiazepines, which strongly affect several HAMD items.

The hypothesis concerning depression is that there is a response which lies on a continuum from no intervention at all (e.g. waiting lists) to neutral placebo, then to active and augmented placebo including psychotherapy and finally to antidepressants which exert a slightly higher efficacy probably because blinding is imperfect because of side-effects (enhanced placebo) (Kirsch, 2004, 2005, 2008a, b; Kirsch & Johnson, 2008; Kirsch & Moncrieff, 2007). In order to confirm this theory, i.e. all interventions including antidepressants work through ‘response expectancy’, it is necessary to prove that interventions with similar presentation characteristics have similar efficacy and that differences in efficacy can be explained by differences in the magnitude of ‘response expectancy’.

Research so far suggests that there may not be one placebo response but several, that multiple mechanisms are involved, and that they may differ as a function of the context in which the placebo is presented. Co-medication, usually with benzodiazepines, according to clinical needs is also responsible for a portion of the placebo response. In this line of research, there were some data suggesting that a large component of the ‘active’ treatment was placebo-mediated (Moncrieff & Kirsch, 2005). The meta-analysis of antidepressant trials comparing those with a run-in period to those without one suggests that this method does not increase drug–placebo differences (Posternak et al. 2002; Trivedi & Rush, 1994); this means that any patient could be a placebo responder, and this cannot be predicted. There are data suggesting that the proportions of patients who respond are indeed on a continuum (e.g. 28% on waiting list, 44% in limited group, and 62% in augmented group) (Kaptchuk et al. 2008). However, the model predicts that all augmented placebos should have similar efficacy and thus this theory does not explain why psychotherapy is inferior to pharmacotherapy (Cuijpers et al. 2010a, b) since both are considered to have similar ‘enhanced placebo’ qualities. Kirsch seems to consider psychotherapy closer to active placebo, and antidepressants closer to augmented placebo, partially because of methodological issues concerning the ability to blind (side-effects make patients realize that they are taking drug instead of placebo, and this increases their expectation) (Kirsch, 2009a).

As a final requirement for the model to be considered correct, it is necessary to prove that antidepressants are not significantly better than placebo and that there are no differences between antidepressants. The supposed equal efficacy of all types of antidepressants (i.e. SSRIs, tricyclics, monoamine oxidase inhibitors) despite the fact that they have different modes of operation, the supposed similar efficacy of some active drugs that are not considered antidepressants (amylobarbitone, lithium, liothyronine, and adinazolam) and the high correlation between the placebo and the drug response support this thesis (Kirsch, 2000), which implies a similar mechanism underlying the response no matter what the treatment intervention. This is also partially supported by neuroimaging data (Konarski et al. 2009; Martin et al. 2001), with the reservation that correlation does not necessarily imply shared causality.

The final finding in support of Kirsch's theory was the results of his 2008 meta-analysis (Kirsch et al. 2008), which reported that the effect size and the magnitude of change in HAMD score are small (effect size d<0.50 and change in HAMD equal to 1.80 points) and thus antidepressants fall well below criteria for clinical relevance suggested by NICE; however, these criteria are not generally accepted (Moller, 2008). Similar findings were reported by Barbui and colleagues for paroxetine alone (Barbui et al. 2008). Kirsch et al. also reported that efficacy reaches clinical relevance only in trials involving the most extremely depressed patients, and that this pattern is due to a decrease in response to placebo rather than an increase in response to medication. They also found no linear relation between severity and response to medication. A general flaw in the methodology of pursuing the issue of response expectancy through the results of RCTs is that the reported ‘efficacy’ based on last observation carried forward (LOCF) analysis is in fact a hybrid of both efficacy and tolerability.

However, the finding that antidepressants act at the same magnitude irrespective of initial severity, while placebo depends on it, suggests a different mechanism underlying these two different interventions, with antidepressants being unrelated to response expectancy. Response expectancy is strongly related to severity of depression (according to cognitive theory), thus the regression lines in figs 2 and 3 of Kirsch et al. should have been parallel. This is the reason why at higher severity placebo efficacy falls closer to the levels of waiting list. These authors stress the finding that there was a negative relation between severity and the placebo response, whereas there was no difference between those with relatively low and relatively high initial depression in their response to drug. Thus, the increased benefit for extremely depressed patients seems attributable to a decrease in responsiveness to placebo, rather than an increase in responsiveness to medication. However, they do not explain how this fits their position. Our analysis revealed that over passing years there is an increase in both the active drug and placebo efficacy with regression lines being parallel (Fig. 2), possibly suggesting that the techniques developed over the years to support and keep patients in the study also increase expectancy.

Other data against this theory come from a recent meta-analysis suggesting that escitalopram is the most effective agent and also has the highest tolerability (Cipriani et al. 2009). This means that increased efficacy cannot be explained on the basis of unblinding because of side-effects.

The current study recalculated the results of the original Kirsch et al. study by using the data published by those authors. Although most of their results were verified, there are two major differences that raise important questions both concerning the methodology and the conclusion by these authors. Even more surprisingly, both the methods applied in the current recalculation (weighting by sample size and inverse variance) produced similar results and verified the flaw in the original analysis.

The first finding is the different value in HAMD change for the antidepressant group, which leads to a 2.18/2.68 overall difference in change instead of 1.80. The second is that although these authors accurately report the d values for individual antidepressants, they fail to report that both venlafaxine and paroxetine had HAMD change scores >3. Still, the respective values for fluoxetine and nefazodone are low. It is impossible to calculate individual d values without calculating the change scores first, so this failure cannot be considered a plain omission, especially given the importance of a possible ‘clinical difference’ between drugs. Furthermore, four RCTs concerning the elderly lead to lower d values, since it is known that the elderly constitute a refractory population. Further discussion on this issue is beyond the scope of the current paper.

Previous commentary suggested that at least the calculations were correct (e.g. ‘Undoubtedly the findings in this analysis are robust, as far as the studies included in the analysis are concerned’; McAllister-Williams, 2008) and that all relevant results are published in the paper. Recently it has been documented that there is a significant bias in the publication of antidepressant trials (Turner et al. 2008); however, Kirsch went further and accused the FDA of having made an explicit decision to keep this information from the public and from prescribing physicians (Kirsch, 2009a). He also suggested that because they do not incur drug risks, alternative therapies (e.g. exercise and psychotherapy) showing equal benefits to those of antidepressants may be a better treatment choice for depression (Kirsch, 2009a) and went on to author a book under the title The Emperor's New Drugs: Exploding the Antidepressant Myth (Kirsch, 2009b). This was also the picture painted in the media. However, psychotherapy seems to be significantly less effective than pharmacotherapy, since the effect size (in RCTs of lower quality than that of drugs) is close to the placebo effect size (Cuijpers et al. 2010a, b) and additionally, it appears that publication bias is more pronounced concerning non-pharmacological treatments (Cuijpers et al. 2010a). The results of psychotherapy studies cannot be directly compared to results of antidepressant trials; in spite of being based on RCTs, psychotherapy studies are not blinded, in contrast to antidepressant trials.

In conclusion, Kirsch et al.'s results suggest that although a large percentage of the placebo response is due to expectancy, this is not true for the active drug, and effects are not additive. The drug effect is always present and is unrelated to depression severity, while this is not true for placebo. If this is proved to be true in future research, then the value of RCTs as the major tool for investigating the efficacy of antidepressants is doubtful.



Statement of Interest

K.N.F. has received support concerning travel and accommodation expenses from various pharmaceutical companies in order to participate in medical congresses. He has also received honoraria for lectures from AstraZeneca, Janssen-Cilag, Eli-Lilly and a research grant from Pfizer Foundation. He is a member of the board of Wyeth for desvenlafaxine and Bristol–Myers Squibb for aripiprazole in bipolar disorder. H.J.M. has received grants or is a consultant for and on the speakers' bureaux of AstraZeneca, Bristol–Myers Squibb, Eisai, Eli-Lilly, GlaxoSmithKline, Janssen-Cilag, Lundbeck, Merck, Novartis, Organon, Pfizer, Sanofi-Aventis, Schering-Plough, Schwabe, Sepracor, Servier, and Wyeth.

