Statistical aspects of trial design
Case series and case reports: A case report is a descriptive study of a single individual, whereas a case series is a study of a small group.
Analytical studies: These studies are generally, although not always, used to test one or more specific hypotheses, typically whether an exposure is a risk factor for a disease or whether an intervention is effective in preventing or curing a disease or any other condition of interest. Cross-sectional: A cross-sectional study is also known as a prevalence study. Case-control: In studies using this design, patients who already have a certain condition (cases) are compared with individuals who do not (controls).
Cohort (longitudinal) studies: A cohort study begins with a group of subjects who share some putative causative factor or exposure and follows them forward over time. Correlational studies: These studies, sometimes called ecologic studies, explore the statistical connection between disease and estimated exposures in population groups rather than in individuals. Nonrandomized controlled: This is an experimental study in which people are allocated to different interventions using methods that are not random.
Randomized controlled: Randomized controlled trials (RCTs) are considered the "gold standard" in medical research since they offer the best answers about the effectiveness of different therapies or interventions. Parallel: In parallel studies, the treatment and control interventions are allocated to different individuals. Crossover: In these studies, each patient serves as his or her own control.
A key assumption of the crossover design is that the effects of the intervention during the first period do not carry over into the second period. Factorial: Studies that randomize participants across two or more factors simultaneously are called factorial designs [Figure 6]. Cluster: This is a type of randomized controlled trial wherein groups of participants, as opposed to individual participants, are randomized.
Quasi-randomized: In these studies, participant allocation is done using schemes such as date of birth (odd or even), hospital record number, date of invitation to participate in the study (odd or even), or simply alternating participants between the study groups.
In many situations, more than one efficacy endpoint is used to address the primary objective.
This creates a multiplicity issue, since multiple tests will be conducted. Decisions regarding how the statistical error rates will be controlled across these tests should therefore be made during trial design.
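As a minimal illustration of one common adjustment (the text does not specify which method would be used; the method and p-values below are hypothetical choices for illustration only):

```r
# Hypothetical p-values from three co-primary efficacy endpoints
p_values <- c(0.012, 0.030, 0.047)

# Bonferroni adjustment: each p-value is multiplied by the number of tests (capped at 1),
# which controls the familywise Type I error rate at the nominal level
p.adjust(p_values, method = "bonferroni")

# Equivalently, each endpoint could be tested at alpha divided by the number of tests
alpha <- 0.05
alpha / length(p_values)
```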
Endpoints can be classified as being objective or subjective. Objective endpoints are those that can be measured without prejudice or favor. Death is an objective endpoint in trials of stroke. Subjective endpoints are more susceptible to individual interpretation.
For example, neuropathy trials employ pain as a subjective endpoint. Other examples of subjective endpoints include depression, anxiety, or sleep quality. Objective endpoints are generally preferred to subjective endpoints since they are less subject to bias. An intervention can have effects on several important endpoints.
Composite endpoints combine a number of endpoints into a single measure. One advantage of composite endpoints is that they may result in a more complete characterization of intervention effects when there is interest in a variety of outcomes.
Composite endpoints may also result in higher power, and thus smaller sample sizes, in event-driven trials since more events will be observed, assuming that the effect size is unchanged. Composite endpoints may also reduce the bias due to competing risks and informative censoring: one event can censor other events, and if data were analyzed on only a single component, then informative censoring could occur. Composite endpoints may also help avoid the multiplicity issue of evaluating many endpoints individually.
Composite endpoints have several limitations. Firstly, significance of the composite does not necessarily imply significance of the components, nor does significance of the components necessarily imply significance of the composite.
For example, one intervention could be better on one component but worse on another, and thus result in a non-significant composite. Another concern with composite endpoints is that the interpretation can be challenging, particularly when the relative importance of the components differs and the intervention effects on the components also differ.
For example, how do we interpret a study in which the overall event rate in one arm is lower but the types of events occurring in that arm are more serious? Higher event rates and larger effects for less important components could lead to a misinterpretation of intervention impact. It is also possible that intervention effects for different components can go in different directions. Power can be reduced if there is little effect on some of the components (i.e., components with little or no effect dilute the overall intervention effect).
When designing trials with composite endpoints, it is advisable to consider including the more severe events among the components. It is also advisable to collect data on, and evaluate, each of the components as secondary analyses.
This means that study participants should continue to be followed for other components after experiencing a component event. When utilizing a composite endpoint, there are several considerations, including: (i) whether the components are of similar importance, (ii) whether the components occur with similar frequency, and (iii) whether the treatment effect is similar across the components.
In the treatment of some diseases, it may take a very long time to observe the definitive clinical endpoint. A surrogate endpoint is a measure that is predictive of the clinical event but takes a shorter time to observe. The definitive endpoint often measures clinical benefit, whereas the surrogate endpoint tracks the progress or extent of disease.
Surrogate endpoints could also be used when the clinical endpoint is too expensive or difficult to measure, or not ethical to measure.
Surrogate markers must be validated. Ideally, evaluation of the surrogate endpoint would result in the same conclusions as if the definitive endpoint had been used. The criteria for a surrogate marker are: (1) the marker is predictive of the clinical event, and (2) the intervention effect on the clinical outcome manifests itself entirely through its effect on the marker. It is important to note that significant correlation does not necessarily imply that a marker will be an acceptable surrogate.
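The two criteria above correspond to the operational (Prentice-style) conditions for surrogacy; a compact formalization (an illustrative addition, not taken from the text) for intervention Z, surrogate marker S, and clinical endpoint T is:

```latex
% (1) The marker is predictive of the clinical endpoint:
\[
  f(T \mid S) \neq f(T)
\]
% (2) The intervention effect on the clinical endpoint acts entirely through the marker:
\[
  f(T \mid S, Z) = f(T \mid S)
\]
```

Criterion (2) is the demanding one in practice: the marker must fully capture the intervention effect, which is why correlation alone is not sufficient.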
Missing data is one of the biggest threats to the integrity of a clinical trial. Missing data can create biased estimates of treatment effects. Thus, it is important when designing a trial to consider methods that can prevent missing data. Researchers can help prevent missing data by designing simple clinical trials.
Similarly, it is important to consider adherence to the protocol and to the assigned intervention. Envision a trial comparing two treatments in which the trial participants in both groups do not adhere to the assigned intervention. When evaluating the trial endpoints, the two interventions will then appear to have similar effects regardless of any differences in the biological effects of the two interventions.
Note, however, that if trial participants in neither intervention arm adhere to therapy, this may indicate that the two interventions do not differ with respect to the strategy of applying the intervention (i.e., with respect to how the interventions would perform when applied in practice).
Researchers need to be careful about influencing participant adherence, since the goal of the trial may be to evaluate the strategy of how the interventions will work in practice, which may not include incentives to motivate patients similar to those used in the trial. Sample size is an important element of trial design because too large a sample size is wasteful of resources, while too small a sample size could result in inconclusive results.
Calculation of the sample size requires a clearly defined objective. The analyses to address the objective must then be envisioned via a hypothesis to be tested or a quantity to be estimated. The sample size is then based on the planned analyses. A typical conceptual strategy based on hypothesis testing is as follows. Formulate the null and alternative hypotheses. Select the Type I error rate: the Type I error is the probability of incorrectly rejecting the null hypothesis when the null hypothesis is true.
In the example above, a Type I error implies incorrectly concluding that the intervention is effective, since the alternative hypothesis is that the response rate in the intervention arm is greater than in the placebo arm. The acceptable Type I error rate depends on the consequences of such an error: when evaluating a new intervention, an investigator may consider using a smaller Type I error rate if a false-positive conclusion would be particularly costly or harmful, or alternatively a larger Type I error rate if the consequences are less serious. Select the Type II error rate: the Type II error is the probability of incorrectly failing to reject the null hypothesis when the null hypothesis should be rejected. The implication of a Type II error in the example above is that an effective intervention is not identified as effective. Type II error and power are not generally regulated, and thus investigators can select the Type II error rate that is acceptable for their setting.
For example, when evaluating a new intervention for a serious disease that has no effective treatment, the investigator may opt for a lower Type II error (i.e., higher power) so that an effective intervention is less likely to be missed. Obtain estimates of quantities that may be needed for the calculation (e.g., the expected response rate in the control arm or the variability of the outcome). This may require searching the literature for prior data or running pilot studies.
Select the minimum sample size such that two conditions hold: (1) if the null hypothesis is true, then the probability of incorrectly rejecting it is no more than the selected Type I error rate; and (2) if the alternative hypothesis is true, then the probability of incorrectly failing to reject the null hypothesis is no more than the selected Type II error rate (or, equivalently, the probability of correctly rejecting the null hypothesis is at least the selected power).
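A worked sketch of the strategy just outlined, using base R's power.prop.test; the response rates, error rates, and power are hypothetical values chosen for illustration, not taken from the text:

```r
# Null hypothesis: response rates are equal in the intervention and placebo arms.
# Alternative: the response rates differ (intervention assumed higher).
# Assumed placebo response rate 40%, intervention response rate 60%,
# two-sided Type I error 0.05, Type II error 0.10 (power 0.90).
power.prop.test(p1 = 0.40, p2 = 0.60, sig.level = 0.05, power = 0.90)

# The returned n is the required number of participants per group; in practice it is
# often inflated for anticipated drop-out, e.g. n / (1 - 0.10) for 10% attrition.
```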
Assumptions are made when sizing the trial (e.g., about response rates or the variability of the outcome), and these assumptions may not hold in practice. Interim analyses can be used to evaluate the accuracy of these assumptions and potentially to adjust the sample size should the assumptions not hold.
Sample size calculations may also need to be adjusted for the possibility of non-adherence or participant drop-out. In general, each of the following increases the required sample size: a lower Type I error, a lower Type II error, larger variation, and the desire to detect a smaller effect size or to achieve greater precision. An alternative method for calculating the sample size is to identify a primary quantity to be estimated and then to estimate it with acceptable precision.
For example, the quantity to be estimated may be the between-group difference in mean response. A sample size is then calculated to ensure that there is a high probability that this quantity is estimated with acceptable precision, as measured by, say, the width of the confidence interval for the between-group difference in means.
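A minimal sketch of this precision-based approach, under an assumed standard deviation and a target confidence-interval half-width (both hypothetical values, not taken from the text):

```r
sigma       <- 20             # assumed common standard deviation of the response
half_width  <- 5              # desired half-width of the 95% CI for the difference in means
z           <- qnorm(0.975)   # critical value for a 95% confidence interval

# From half-width = z * sqrt(2 * sigma^2 / n), solve for n per group
n_per_group <- ceiling(2 * (z * sigma / half_width)^2)
n_per_group                   # about 123 per group under these assumptions
```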
Interim analysis should be considered during trial design, since it can affect the sample size and the planning of the trial. When trials are very large or long in duration, when the interventions have associated serious safety concerns, or when the disease being studied is very serious, interim data monitoring should be considered. Typically, a group of independent experts (i.e., a data and safety monitoring board [DSMB]) is charged with reviewing the accumulating trial data.
The DSMB meets regularly to review data from the trial in order to ensure participant safety, to assess efficacy, to ensure that the trial objectives can be met, to assess the trial design assumptions, and to assess the overall risk-benefit of the intervention.
The project team typically remains blinded to these data, where applicable. The DSMB then makes recommendations to the trial sponsor regarding whether the trial should continue as planned or whether modifications to the trial design are needed. Careful planning of interim analyses is prudent in trial design: care must be taken to avoid inflation of statistical error rates associated with multiple testing, to avoid other biases that can arise from examining data prior to trial completion, and to maintain the trial blind.
Many structural designs can be considered when planning a clinical trial. Common clinical trial designs include single-arm trials, placebo-controlled trials, crossover trials, factorial trials, noninferiority trials, and designs for validating a diagnostic device.
The choice of the structural design depends on the specific research questions of interest, characteristics of the disease and therapy, the endpoints, the availability of a control group, and on the availability of funding. Structural designs are discussed in an accompanying article in this special issue.
This manuscript summarizes and discusses fundamental issues in clinical trial design. A clear understanding of the research question is a most important first step in designing a clinical trial. Minimizing variation in trial design will help to elucidate treatment effects.
Randomization helps to eliminate bias associated with treatment selection. Stratified randomization can be used to help ensure that treatment groups are balanced with respect to potentially confounding variables. Blinding participants and trial investigators helps to prevent and reduce bias. Placebos are utilized so that blinding can be accomplished. Control groups help to discriminate between intervention effects and natural history.
The selection of a control group depends on the research question, ethical constraints, the feasibility of blinding, the availability of quality data, and the ability to recruit participants. The selection of entry criteria is guided by the desire to generalize the results, concerns for participant safety, and minimizing bias associated with confounding conditions.
Endpoints are selected to address the objectives of the trial and should be clinically relevant, interpretable, sensitive to the effects of an intervention, practical and affordable to obtain, and measured in an unbiased manner. Composite endpoints combine a number of component endpoints into a single measure.
Surrogate endpoints are measures that are predictive of a clinical event but take a shorter time to observe than the clinical endpoint of interest. Interim analyses should be considered for larger trials of long duration or trials of serious disease or trials that evaluate potentially harmful interventions. Sample size should be considered carefully so as not to be wasteful of resources and to ensure that a trial reaches conclusive results.
There are many issues to consider during the design of a clinical trial. Researchers should understand these issues when designing clinical trials.
In the planning of interim analyses, it is generally accepted that care must be taken to control the type I error rate, which can limit the ability to change the monitoring schedule to adapt to accumulating data and may lead to delays in stopping unsuccessful treatments.
Although strict error control may not always be necessary in a trial using frequentist methods [20], a more flexible approach to monitoring can be easier to implement and justify by using a Bayesian approach, which allows for stopping guidelines that are based on directly interpretable probabilities, particularly in complex multi-arm trials [21, 22, 23, 24].
Incorporating Bayesian monitoring within a multi-arm factorial trial can allow for a flexible monitoring schedule to test multiple strategies and detect inferior ones quickly. Previous designs have looked at multi-arm trials [25, 26] but have not examined the impact of factorial randomisation. Here, we present the statistical aspects of the design of a multi-arm factorial trial to be conducted in Vietnam that aims to find efficacious drug-sparing treatment strategies that will widen access to HCV treatment, with a particular focus on increasing the evidence on treatment of genotype 6.
Randomisation will be stratified by genotype 6 versus all other genotypes. Each strategy reduces DAA exposure and has other benefits, such as compatibility with directly observed therapy programmes as used for tuberculosis, but with other additional costs (Table 1). The primary outcome is binary, and all observed endpoints are either SVR12 or treatment failure. However, for ethical reasons, patients within the trial will be offered retreatment as soon as they are definitively identified as having failed first-line treatment.
Failure is defined using a higher threshold than the LLOQ because patients have been observed to achieve cure despite having low-level viraemia at EOT or shortly afterwards, and so such patients do not need retreatment to achieve cure on first-line treatment. This will be carefully reviewed by the independent data monitoring committee (DMC).
The design of the trial therefore allows for failing groups to be stopped early at any time and for subsequent patients to be randomly assigned to more successful groups. Individual performance of groups receiving shortening strategies will be monitored during recruitment by the independent DMC, which will make decisions on whether a group should be stopped. Interim analyses will not be comparative, as the aim of monitoring is not to find the best strategy but to find any strategy that meets a minimum acceptable cure rate and that may also be non-inferior to standard treatment, since different strategies may benefit different patient populations.
Analyses of cure rates will follow the Bayesian paradigm to allow the probability of the true cure rate being below different thresholds to be calculated: recruitment into a group will stop if the posterior probability that the true cure rate is below the minimum acceptable rate exceeds a pre-specified threshold. The primary monitoring is combined across genotypes; if the combined group reaches the stopping guideline, each genotype stratum will be tested separately and the DMC will have the discretion to stop only those strata reaching the stopping criteria.
Differences in stopping groups across strata are likely to occur only when there are extreme differences in the cure rates between the strata, which is not expected, and so the operating characteristics of the trial are based on stopping combined strata only. If neither stratum reaches the stopping criteria despite the combined strata doing so, it will be at the discretion of the DMC whether to stop recruitment into the stratum or group.
At interim analyses, there is greater uncertainty about the performance of the shortening strategies. Therefore, when the prior was determined, it was assumed that one strategy would fail completely such that all four groups receiving that strategy, of a total of 12 tested, would meet the stopping guideline.
As each individual outcome is assumed to be Bernoulli-distributed, and therefore the number of cures in a group is binomially distributed, a beta prior was chosen as this is the conjugate prior for the binomial distribution. The mean of the prior was fixed at 0.
The prior chosen was beta 4. The relatively low precision of the prior will allow greater influence of the data in the posterior distribution. If the stopping guideline is met, sensitivity analyses using priors informed by observed cure rates in other randomly assigned groups or strata will be performed and will be provided to the DMC to help inform their decision to stop a group.
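As an illustrative sketch of how such a guideline is evaluated at an interim look (the prior parameters, minimum acceptable cure rate, probability threshold, and interim data below are placeholders, since the exact values are not reproduced above):

```r
prior_a <- 4.5; prior_b <- 0.5   # hypothetical beta prior with mean 0.9
min_cure  <- 0.90                # hypothetical minimum acceptable cure rate
stop_prob <- 0.95                # hypothetical posterior probability required to stop

# Suppose 20 patients in a group have been analysed, of whom 15 achieved SVR12
n <- 20; cures <- 15; failures <- n - cures

# Conjugate update: the posterior is beta(prior_a + cures, prior_b + failures)
post_prob_below <- pbeta(min_cure, prior_a + cures, prior_b + failures)

# Stop recruitment into the group if the posterior probability that the true cure
# rate is below the minimum acceptable rate exceeds the stopping threshold
post_prob_below > stop_prob
```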
There will then be a pre-specified number of patients per group for the regimen comparison (non-inferiority: comparing the two WHO-recommended drug regimens against each other); for the control group and each intervention group in the strategy comparison (non-inferiority: comparing each of the three treatment-shortening strategies versus the licensed control duration); and per group for the ribavirin comparison (superiority: comparing each treatment-shortening strategy with and without ribavirin).
The total sample size is fixed; if individual groups are stopped early, any subsequent patients will be randomly assigned to open groups where possible (depending on the delay between randomisation, identification of primary endpoints, and interim analyses), so numbers in each fully recruited group may be higher.
This is appropriate in a pragmatic trial, where the goal is to maximise information gained about many different strategic approaches to treatment rather than to minimise sample size per se. The choice of the non-inferiority margin was based on clinical judgement and the size of margins used in other trials of anti-infectives with relatively low failure rates, such as community-acquired pneumonia [30].
In contrast, the different drug-sparing strategies have a variety of advantages and disadvantages, such as the need for additional visits. Superiority comparisons will be conducted for ribavirin and for any comparison that first meets non-inferiority, using a two-sided alpha. The final analysis will estimate risk differences between groups using marginal effects after logistic regression. The model will include all main randomised effects and strata and will test interactions between all randomisations (Supplementary Methods, Additional file 1).
The interaction between regimen and strategy will include all levels of strategy. Owing to the partial factorial randomisation, the interaction between ribavirin and strategy will not include the standard treatment length strategy. Comparisons of regimens and of strategies will be non-inferiority analyses, and the ribavirin comparison will be a superiority analysis.
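A rough sketch of the model structure and of computing a marginal risk difference (variable and level names are hypothetical, the partial-factorial restriction on the ribavirin-by-strategy interaction is omitted, and a frequentist glm is used here purely for illustration; the trial's primary analysis is Bayesian):

```r
# Hypothetical data frame `trial_data` with one row per patient:
#   svr12            0/1 primary outcome
#   regimen          randomised drug regimen (factor)
#   strategy         randomised treatment strategy, including standard duration (factor)
#   ribavirin        0/1 randomised ribavirin
#   genotype_stratum genotype 6 versus other (stratification factor)
fit <- glm(svr12 ~ regimen * strategy + ribavirin * strategy + genotype_stratum,
           family = binomial, data = trial_data)

# Marginal (standardised) cure rate under a given strategy: average the predicted
# probabilities over the observed covariate distribution with strategy set to that level
marginal_rate <- function(level) {
  d <- trial_data
  d$strategy <- factor(level, levels = levels(trial_data$strategy))
  mean(predict(fit, newdata = d, type = "response"))
}

# Risk difference for a (hypothetically named) shortened strategy versus standard duration
marginal_rate("shortened") - marginal_rate("standard")
```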
Analysis priors for the final analysis are listed in Table 2; these differ from the monitoring priors, as the aim of monitoring is only to identify poorly performing groups and not to compare the randomly assigned groups. The control cure rate analysis prior is beta 4. This mean was derived from previous research into the trial drug regimens [28, 29]. Sensitivity analyses will use a range of informative priors reflecting plausible belief in the clinical community.
Sceptical analysis prior distributions were chosen with means corresponding to the null hypothesis for each randomisation and enthusiastic analysis priors with means y greater than this, where y is the non-inferiority margin or absolute difference specified in the power calculations. To define the performance characteristics of the proposed stopping guideline, posterior probabilities of cure rates and the probability of stopping groups at each number of outcomes were calculated analytically using beta and binomial distributions respectively.
Timings of interim analyses were determined by applying the probabilities of stopping groups to a projected recruitment schedule. The average probability of stopping a genuinely inferior group was estimated by integrating the probability of stopping a group with respect to the monitoring prior beta 4.
Simulations of datasets with outcomes drawn from binomial distributions were used to determine the overall probability of stopping a group and the cumulative probability of stopping groups at specified analysis time points, and to estimate power for the final analysis using marginal effects after logistic regression with a model containing all randomised comparisons, as described above.
Predictive probabilities (the probability of achieving a success at the end of the trial) were calculated analytically using the beta-binomial distribution in R. All other analyses were performed using Stata. The minimum number of failures required to satisfy the stopping criteria under the main monitoring prior was calculated for each number of analysed patients.
The probability of stopping a group is then the probability of observing the required number of failures in the group. It is expected from the specification of the stopping guideline that when the true cure rate is equal to the mean of the prior, the probability of stopping a group is 0. The calculated probability is not exactly 0. Additionally, any other stopping guideline would similarly be unable to discriminate between these cure rates without a very large sample size.
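A sketch of these operating-characteristic calculations, with placeholder prior parameters and thresholds (the trial's actual values are not reproduced above); the average over the prior is approximated here by Monte Carlo rather than the exact calculation used in the trial:

```r
prior_a <- 4.5; prior_b <- 0.5   # hypothetical monitoring prior, mean 0.9
min_cure  <- 0.90                # hypothetical minimum acceptable cure rate
stop_prob <- 0.95                # hypothetical posterior probability required to stop

# Smallest number of failures among n analysed patients that meets the stopping guideline
min_failures <- function(n) {
  for (f in 0:n) {
    if (pbeta(min_cure, prior_a + (n - f), prior_b + f) > stop_prob) return(f)
  }
  NA_integer_  # guideline cannot be met with only n patients
}

# Probability of stopping a group with true cure rate p when n patients are analysed:
# the probability of observing at least the triggering number of failures
prob_stop <- function(n, p) {
  f_min <- min_failures(n)
  if (is.na(f_min)) return(0)
  pbinom(f_min - 1, n, 1 - p, lower.tail = FALSE)   # Pr(failures >= f_min)
}

prob_stop(20, 0.70)   # chance of stopping a clearly inferior group after 20 patients
prob_stop(20, 0.95)   # chance of (incorrectly) stopping a well-performing group

# Average probability of stopping, reflecting uncertainty about the true cure rate,
# approximated by drawing cure rates from the monitoring prior
set.seed(1)
mean(sapply(rbeta(1e4, prior_a, prior_b), function(pp) prob_stop(20, pp)))
```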
The probability of incorrectly not stopping a group rapidly decreases as the true cure rate decreases. The overall probability of stopping a group, and therefore of making a correct or incorrect decision to stop recruitment into a group (again analogous to the frequentist concepts of power and type I error), shows results similar to those given above (Supplementary Table 1, Additional file 1).
Therefore, it was decided to time the first analysis so that at least one genuinely inferior group would already have a reasonable probability of being stopped. An average probability is used to reflect the uncertainty about the true cure rates; for low cure rates, the probability of stopping a group can be substantially higher (see the figure showing the initial probability of stopping groups over an estimated recruitment schedule for various true cure rates).
Probabilities are calculated assuming no previous interim analysis. See Supplementary Figure 1, Additional file 1 for the cumulative probability of stopping a group.
Four analysis time points were chosen to provide multiple opportunities to detect failing groups, while allowing adequate time between analyses for the accrual of patients and outcome data and preventing an unnecessary burden on the time and resources needed for analyses and subsequent DMC meetings. The analysis time points correspond to different thresholds for the probability of stopping an inferior group. The number of patients in each group, and the probability of stopping an inferior group of each strategy type at these analyses, are listed in Table 4.
Assessment of the cumulative probability of stopping a group (Supplementary Fig. 1, Additional file 1) suggested that little would be gained from alternative analysis schedules; the exception to this is having analyses every month, but this schedule is impractical because of the resources required for an interim analysis. As this is only projected recruitment, sensitivity analyses were performed to examine the effect of faster or slower recruitment (Supplementary Table 3, Additional file 1).
There were greater differences in the timings of the last analyses, but the timing of this analysis is the most flexible and can be determined on the basis of observed rather than assumed true cure rates. Non-inferiority can exist between the standard duration group and the pooled shortening strategy groups, the shortening strategy without ribavirin groups, or the shortening strategy with ribavirin groups, meaning that the shortening strategy with ribavirin groups can have somewhat lower cure rates than the comparator while still being declared non-inferior.
These alternatives are shown in different columns of Table 5. A potential weakness in the design is that the sample size was not originally calculated using Bayesian principles, but primary analyses will be conducted using Bayesian methods to allow for the calculation of posterior probabilities exploring the difference in cure rates between the interventions.
However, for the non-inferiority comparisons, sample size estimates obtained using Bayesian methods are similar to or smaller than those obtained using frequentist methods [34], suggesting that our design is likely to be conservative. Additionally, secondary analyses will use frequentist methods for comparison. For interim analyses, the probability of correctly stopping a group, analogous to the frequentist concept of power, is determined by the true cure rate in the group and the number of analysed patients at each analysis, and not by the overall group size.
The timing of, and the number of patients at, interim analyses are determined by at least one group (usually the 4-week treatment group with PEG-IFN, since this has the shortest overall treatment duration) reaching a certain probability threshold of being stopped. This may mean delays in identifying unsuccessful groups receiving other strategies.
As the treatment length of patients in the RGT groups is unknown until after their day 7 visit, it is not possible to stagger treatment start dates so that the length between randomisation and EOT is the same for all strategies.
Staggered treatment start might also lead to dropout after randomisation but before starting treatment, leading to inefficiency and potential bias. During the trial, cure rates will be monitored in all groups. If the cure rates are not as anticipated (either higher or lower than expected, so that our derived schedule is inappropriate for these cure rates), then the timing can be adjusted with no penalty to the probability of incorrectly stopping recruitment into a randomised group, owing to the use of Bayesian monitoring [22].
The power calculations for the final analysis assume that all groups will be included and that no groups have been stopped. It is possible that power will be lower if fewer groups are included, but for most comparisons with a full sample, power is very high and is likely to remain acceptable at the final analysis even with the exclusion of some patients.
To help preserve power, if groups are stopped early, subsequent patients will be randomly assigned to open groups. The power calculations were also performed using frequentist methods, although the primary analysis will use Bayesian methods. However, as power is extremely high, the analogous Bayesian quantity (for non-inferiority comparisons, the probability that the lower bound of the credible interval lies above the non-inferiority margin) is likely to be similarly high.
Additionally, owing to the many possible combinations of strata and groups that could be stopped with different true and observed failure rates and at different times, examining the impact of stopping multiple combinations would require a large number of assumptions, probably also using a factorial simulation design, and hence would be a large piece of additional work in its own right.
This is also the case for examining the impact of stopping multiple randomly assigned groups on other aspects of the trial, such as bias. Alternative Bayesian designs, which have been used elsewhere [35], include basing the stopping guideline on a predictive probability: the probability of achieving a success at the end of the trial.
A rule based on predictive probabilities would then state that a group will be stopped at an interim analysis if there is more than a pre-specified probability that the group will not achieve success at the end of the trial.
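A sketch of how such a predictive-probability rule could be evaluated analytically with the beta-binomial distribution; the prior parameters, thresholds, group size, and the definition of "success" used below are all placeholder assumptions rather than values from the design:

```r
prior_a <- 4.5; prior_b <- 0.5   # hypothetical beta prior
min_cure   <- 0.90               # hypothetical minimum acceptable cure rate
final_prob <- 0.95               # posterior probability defining "success" at the final analysis
N_group    <- 50                 # hypothetical total number of patients per group

# Predictive probability that a group will be declared successful at the end of the
# trial, given s cures among n patients analysed so far
predictive_success <- function(s, n) {
  a <- prior_a + s; b <- prior_b + (n - s)   # current posterior
  m <- N_group - n                           # patients still to be observed
  k <- 0:m
  # beta-binomial predictive probability of each possible number of future cures
  pred <- exp(lchoose(m, k) + lbeta(a + k, b + m - k) - lbeta(a, b))
  # would the final posterior declare success for each possible future outcome?
  success <- pbeta(min_cure, a + k, b + m - k, lower.tail = FALSE) > final_prob
  sum(pred[success])
}

predictive_success(s = 15, n = 20)
# A predictive-probability stopping rule would stop the group if this value fell
# below some small threshold, i.e. if success at the end of the trial were unlikely.
```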