In the previous post I summarised some of the reasons I find many single-arm trials problematic, and why I think they are often used inappropriately and misinterpreted. In this and subsequent (planned!) posts, I’m going to look at some concrete examples of what I think are questionable practices in recent trials.

The first one (1) is a trial of reduced chemotherapy and radiotherapy for localised germinoma. The basic rationale is that there are several different chemotherapy regimens in clinical use for these patients. These are thought to have similar effects, and outcomes are generally good, with high survival rates. The question addressed by this trial was whether reducing the intensity of chemotherapy and radiotherapy would cause fewer harmful effects on neurocognition, without compromising survival. That sounds very much like a question that needs a randomised comparison between a reduced chemotherapy/radiotherapy regimen and standard treatment. However, a single-arm design was used. It’s worth saying at this point that this was badged as a Children’s Oncology Group study, so it has the official backing of some reputable researchers.

Aims of the study

The study aims were described in several places in the paper:

Introduction: “The study aimed to evaluate whether simplified chemotherapy followed by dose-reduced irradiation was effective for treating patients (ages 3–21 years) with localized germinoma”

Study objective section: “The primary objective … was to determine the 3-year progression-free survival (PFS) rate”

Statistical design section: “The primary objective was to evaluate if the previously reported 3-year PFS rate of 95% could be maintained with the proposed reduced-intensity treatment”

These three statements of the aims are not quite the same, but all are hard to achieve with a single-arm trial design.

The first statement is about whether the treatment is “effective.” It isn’t clear exactly what is meant by this, but the important issue is surely how the proposed treatment compares to existing treatments, which, in this particular case, are known to be associated with good survival rates. Is it better or worse, or pretty much equivalent, taking into account its effects on survival and side effects? That’s a comparative question that can’t be answered easily with a single-arm trial.

The trial actually did make a comparison – but with an assumed 3-year progression-free survival (PFS) rate of 95%, rather than with a randomised comparator. The comparison was designed to establish “non-inferiority,” which raises another whole load of issues that I’ll return to later. But the main reason the comparison is inadequate is that we cannot be sure that this rate would apply to the patients in the trial if they received standard care, and I couldn’t see any evidence in the paper that it would. It’s just an estimate of the overall success rate across the whole treated population. Maybe the trial recruited a lot of higher-risk patients, whose survival rate with standard care would have been much lower than 95%? If so, the treatment is likely to look poor compared to a 95% benchmark, whereas in reality it may be pretty much equivalent. Or of course the opposite might have occurred.
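To make that concrete, here’s a quick sketch (Python; the numbers are hypothetical, not taken from the trial) of what happens when a single-arm trial is benchmarked against a fixed 95% rate but enrols patients whose true standard-care PFS is 90%:

```python
from math import ceil
from scipy.stats import binom

# Hypothetical illustration, not the trial's actual design:
# a single-arm trial of n = 80 patients, benchmarked against a fixed 95%
# historical rate, but enrolling patients whose true 3-year PFS under
# standard care is 90%. Assume the new treatment is exactly equivalent
# to standard care in these patients.
n, true_pfs, benchmark = 80, 0.90, 0.95

# Probability that the observed PFS rate nevertheless reaches the benchmark
k_needed = ceil(n * benchmark)                 # 76 of 80 patients
p_reach = binom.sf(k_needed - 1, n, true_pfs)  # P(X >= 76)
print(f"P(observed rate >= {benchmark:.0%}) = {p_reach:.2f}")
# ~0.09: roughly 9 times out of 10 the trial 'underperforms' the benchmark,
# even though the treatment is doing exactly as well as standard care here.
```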

The second statement frames the aim as estimating “the” 3-year progression-free survival rate. The assumption here is that there is a value of 3-year progression-free survival that is characteristic of the treatment. But in reality the observed progression-free survival rate will depend on the patients who are recruited – and it might vary widely. This is why randomisation and a control arm are crucial. If the trial population is different from that seen in clinical practice, it’s still valid to estimate a treatment effect for the intervention versus the control (acknowledging that there are issues of generalisability and transportability of treatment effects – but omitting the control doesn’t help with any of those).

The third statement is different again, and frames the question as one of non-inferiority: is the 95% rate expected with standard care achieved by the new treatment? Again, without a randomised comparison this is difficult to establish. We don’t know what the outcome rate would be in the trial patients if they received standard care. It’s likely to be high, but probably wouldn’t be exactly 95%, and any patient selection could shift it further from that value. So it probably isn’t very reasonable to use a single-arm design to judge whether the new treatment is similar (or non-inferior) to the control.

Issues in the design

The aim of the trial (according to the statistical design section at least) was to establish that the reduced regimen was “non-inferior” to the assumed value of 95% progression-free survival with standard care.

The trial design assumed a non-inferiority margin of 8% (absolute), meaning that the design was intended to establish that the lower 95% confidence limit of the 3-year PFS estimate was above 87%. According to their calculation (which I haven’t checked) this gave a required sample size of 79 patients (90% power and 5% type 1 error). Essentially this is saying that they will conclude “non-inferiority” if values no lower than 87% are compatible with the data. A conclusion of non-inferiority would require 73/79 patients to be alive and progression-free at 3 years. In fact only 74 patients were recruited, with 86.5% alive and progression-free at 3 years (though the estimated PFS from the survival model was 94.5%; the difference arises, I think, because the raw dichotomised result counted losses to follow-up and withdrawals as treatment failures). So the trial did not in fact meet the statistical criterion for non-inferiority, though that may be partly due to the way losses and withdrawals were counted.
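As a rough check on those numbers (a sketch only: I’ve assumed a one-sided 95% normal-approximation lower limit, whereas the trial may well have used an exact or survival-model-based interval), the reported 73/79 threshold is consistent with this reading:

```python
from math import sqrt
from scipy.stats import norm

def lower_limit(successes: int, n: int, alpha: float = 0.05) -> float:
    """One-sided (1 - alpha) normal-approximation (Wald) lower confidence
    limit for a binomial proportion. The trial's actual interval may differ."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - norm.ppf(1 - alpha) * se

# Planned design: 73/79 progression-free is the smallest count that clears 87%
print(f"73/79: lower limit = {lower_limit(73, 79):.3f}")  # ~0.875, above 0.87
print(f"72/79: lower limit = {lower_limit(72, 79):.3f}")  # ~0.859, below 0.87

# What was actually observed: 86.5% of the 74 recruited (dichotomised result)
print(f"64/74: lower limit = {lower_limit(64, 74):.3f}")  # ~0.80, well below 0.87
```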

The big issue here is whether a single-arm non-inferiority design makes sense. Comparing with a single point estimate and (potentially) declaring the new treatment non-inferior to standard care is really cheating. A standard sample size calculation using their parameters gives 128 per group (256 in total) – which is a lot more than 79. So why would anyone do a randomised non-inferiority trial if it’s possible to establish non-inferiority with only 31% of the sample size?
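The two-arm figure is easy to reproduce with the standard normal-approximation formula (a sketch; I haven’t attempted to reproduce the paper’s single-arm figure of 79, which presumably comes from a different calculation):

```python
from math import ceil
from scipy.stats import norm

p = 0.95       # assumed 3-year PFS in both arms under non-inferiority
margin = 0.08  # absolute non-inferiority margin
alpha, power = 0.05, 0.90  # one-sided type 1 error, power

z = norm.ppf(1 - alpha) + norm.ppf(power)

# Per-group n for a two-arm non-inferiority comparison of two proportions
n_per_group = ceil(z**2 * 2 * p * (1 - p) / margin**2)
print(n_per_group, 2 * n_per_group)   # 128 per group, 256 in total
print(f"{79 / (2 * n_per_group):.0%}")  # the single-arm n is ~31% of that
```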

We might be able to convincingly demonstrate that the progression-free survival proportion is non-inferior to a point value of 95% in these patients, but that isn’t the same as showing that it’s non-inferior to standard care. We just don’t know whether the trial patients would have achieved that rate if they had received standard care. Even if the underlying true rate is 95%, there will be random variation, and the observed rate might be somewhat different from 95%. An actual comparison with standard care is needed to investigate this question – it just cannot be adequately addressed by a single-arm trial.
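To put a number on that last point (a sketch, taking the reported 73/79 criterion at face value): even if the true PFS rate in the trial patients really were exactly 95%, there would still be a meaningful chance of failing the criterion through random variation alone.

```python
from scipy.stats import binom

# If the true 3-year PFS in the trial population is exactly 95%,
# how often would a 79-patient trial see fewer than 73 successes,
# and so fail the non-inferiority criterion as reported?
p_fail = binom.cdf(72, 79, 0.95)
print(f"P(fewer than 73/79 successes | true rate 95%) = {p_fail:.2f}")
# ~0.1: about a 1-in-10 chance of failing by chance alone, which is
# simply the flip side of the design's 90% power.
```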

Defence of the single-arm design?

I imagine that some might say my criticisms are too harsh: it is a Phase II trial, with the limited aim of establishing whether the intervention has a reasonable chance of being beneficial. I think there are two responses to this. First, that isn’t what the study’s aims said (see above); and second, is it even possible to establish this with a single-arm trial, especially in a situation where standard care is already very good?

A more reasonable aim for a Phase II trial might be to establish that the reduced treatment is not really bad: if the observed PFS rate was substantially lower than the expected “very high” rate (say, 50% or lower), it might be reasonable (i.e. correct most of the time) to reject the new therapy and conclude that it shouldn’t go forward into a comparative trial. I’m not convinced by that argument though, because patient selection could have a big effect on the observed event rate, and we might run a substantial risk of rejecting a treatment that actually is non-inferior, as the sketch below illustrates. Running a single-arm trial seems like a lot of effort to get a very weak answer; it surely would not be much more effort to do it properly and randomise?
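Here’s a sketch of that risk using the trial’s own decision rule (the 73/79 threshold as I’ve read it), for a hypothetical population whose true PFS on the new regimen is 92%, i.e. genuinely within the 8% margin:

```python
from scipy.stats import binom

# Suppose patient selection means the trial population's true 3-year PFS
# on the new regimen is 92%: worse than the 95% benchmark, but still well
# inside the 8% margin, i.e. the treatment is genuinely non-inferior by
# the trial's own definition.
p_fail = binom.cdf(72, 79, 0.92)  # P(fewer than 73/79 successes)
print(f"P(trial fails the criterion | true rate 92%) = {p_fail:.2f}")
# ~0.5: roughly a coin toss of rejecting a treatment that is in fact
# non-inferior, purely because of who happened to be enrolled.
```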

Conclusion

It seems to me that single-arm trials are often conducted inappropriately and without sufficient thought about what question they are addressing. This is probably at least partly a consequence of their legitimate and sensible use in Phase II cancer trials, which has led people to regard them, uncritically, as a viable alternative to randomised trials. They really aren’t. There is often a lack of clarity about the aim of a study and the appropriate design to achieve it, with the result that single-arm trials are frequently used where they shouldn’t be. I suspect that many are conducted because they are logistically and administratively easier, and an easier sell to patients. The problems arise when these considerations override sound scientific design. There’s no point in taking the easy option if the cost of doing so is producing junk.

References

  1. Bartels U, et al. Phase II trial of response-based radiation therapy for patients with localized germinoma: a Children’s Oncology Group study. Neuro-Oncology 2022;24(6):974–983.
