Appl. Statist. (2005) 54, Part 4, pp. 691–706 Multivariate tests comparing binomial probabilities, with application to safety studies for drugs University of Florida, Gainesville, USAWilliams College, Williamstown, USA
[Received October 2003. Final revision August 2004]
Summary. In magazine advertisements for new drugs, it is common to see summary tables that compare the relative frequency of several side-effects for the drug and for a placebo, based on results from placebo-controlled clinical trials. The paper summarizes ways to conduct a global test of equality of the population proportions for the drug and the vector of population propor- tions for the placebo. For multivariate normal responses, the Hotelling T 2-test is a well-known method for testing equality of a vector of means for two independent samples. The tests in the paper are analogues of this test for vectors of binary responses. The likelihood ratio tests can be computationally intensive or have poor asymptotic performance. Simple quadratic forms comparing the two vectors provide alternative tests. Much better performance results from using a score-type version with a null-estimated covariance matrix than from the sample covariance matrix that applies with an ordinary Wald test. For either type of statistic, asymptotic inference is often inadequate, so we also present alternative, exact permutation tests. Follow-up inferences are also discussed, and our methods are applied to safety data from a phase II clinical trial. Keywords: Adverse events; Binary data; χ2-test; Generalized estimating equations; Hotellingtest; Marginal homogeneity; Marginal logit model; Random effects
Introduction
Table 1 contains summary results of the type that are often found in news magazines (e.g. Time)that have full page advertisements promoting a new drug. (In recent years, advertisements of thistype have appeared for drugs such as Lamisil by Novartis Pharmaceuticals, Flonase by GlaxoSmith Kline, Clarinex by Schering, Pravachol by Bristol-Myers Squibb, Allegra by Aventis andBotox by Allergan.) Table 1 compares the relative frequency of several undesirable side-effectsfor a drug and placebo, based on results from placebo-controlled clinical trials. In the phar-maceutical industry, such side-effects are often called adverse events, and the studies makingsuch comparisons of a drug with a placebo are called safety studies.
The data in Table 1 refer to a safety study for an asthma drug, conducted by Schering-Plough
Corp. The adverse events were collected from a double-blind, randomized, phase II clinical trialin which subjects were randomized to one of three treatments: two levels of an active drug and aplacebo. Each patient was followed over a period of at least 3 months. The adverse events werereported at scheduled visits to the clinic and were non-solicited reports by the subject to theinvestigator. The primary objective of the clinical trial was to assess a subject’s lung functions as
Address for correspondence: Alan Agresti, Department of Statistics, Box 118545, University of Florida, Gaines-
ville, FL 32611-8545, USA. E-mail: [email protected]
a response to a treatment versus placebo. Subsequently, interest also focused on analysing theevidence of a difference between the occurrence of adverse events in the treated and non-treatedgroups.
For simplicity of exposition, in Table 1 we combined the results for the two dosage levels
of the drug and compared the two drug groups combined with the placebo group. (Section 8mentions straightforward generalizations for multiple groups.) Of the 211 subjects in the study,146 were in the drug group and 65 in the placebo group. Table 1 lists the adverse events in orderaccording to their overall frequency in the two groups.
In Table 1, for any given one of the 11 adverse events, a 2 × 2 table compares the counts on the
two possible outcomes for the two groups. We can then use standard inference (e.g. a χ2-test)to analyse whether the occurrence of that adverse event was signiﬁcantly different for the twogroups. However, how could we conduct a global test to analyse the evidence of a differencebetween the vector of 11 population proportions for the drug and the vector of 11 populationproportions for the placebo? This question was ﬁrst asked of one of us for similar data fromanother company a few years ago. In this paper, we survey strategies for answering the question. Literature on safety studies and relevant methods
The analysis of adverse event data in clinical trials is an important part of the development,pre- and post-market characterization and safety of pharmaceutical products. Despite that fact,comparative statistical methods for the evaluation of safety outcomes are not as well developedas those for efﬁcacy (O’Neill, 2002). O’Neill (1988) presented a general summary of statisticalprocedures for analysing safety data.
Lin et al. (2001) investigated adverse events in a placebo-controlled clinical study based on
proportional hazards and logistic regression models for repeated binary data. The adverse eventswere handled in a univariate manner, as is the case in almost all the literature on safety studies. A simple way to conduct a global test using the univariate information in Table 1 is with theBonferroni approach. If Pj is the P-value for the test for the 2 × 2 table comparing a drug witha placebo for adverse event j, an overall P-value is 11 minj.Pj/ (or 1.0 if this exceeds 1). This
approach is potentially quite conservative, both because of its use of the Bonferroni inequal-ity and because it ignores potential dependence between separate individual inferences. Theconservativeness is compounded if we use a small sample discrete method for each individualtest (e.g. Fisher’s exact test). Less conservative Bonferroni approaches have been developed,such as sequential versions (e.g. Holm (1979)). Westfall and Young (1989) proposed a permu-tation resampling of the vector responses to ﬁnd the probability (for each component in thevector) that the minimum P-value of all tests is less than the observed P-value. This gives anadjusted P-value for each component, following a suggestion by Mantel (1980). Their approachis implemented by using Monte Carlo generation of random permutations in the SAS procedureMULTTEST, which reports P-values for all individual tests (e.g. based on the marginal χ2- orFisher’s exact tests) adjusted for correlation and discreteness. This approach does not give aglobal P-value.
Pocock et al. (1987) combined score tests for each individual component to construct a global
test for multivariate binary data, extending results from O’Brien (1984). Their test is a specialcase of more general tests that were proposed by Lefkopoulou and Ryan (1993) that assume thatoutcomes are uniformly more likely for one group than for another and assume an independenceor exchangeable correlation structure among them. Zhang et al. (1997) summarized this andrelated multiple-test approaches for analysing multiple end points in clinical trials with quanti-tative response variables. For instance, Lehmacher et al. (1991) described test procedures thatallow, after rejection of the global null hypothesis at level α, a stepwise analysis of differencesin subsets of all adverse events or even single adverse events while still maintaining an overallexperimentwise error rate of α. More recently, Mehrotra and Heyse (2004) addressed multi-plicity by using a less conservative approach of controlling a false discovery rate rather than anexperimentwise error rate. In quite a different vein, Berry and Berry (2004) used a three-levelhierarchical mixed model to obtain for each adverse event a Bayesian posterior probability thatthe rate is higher for the treatment. Mehrotra and Heyse (2004) and Berry and Berry (2004)analysed a data set in which only the marginal results are known for the adverse events, so it isnot possible to conduct a multivariate analysis. 1.2. The multivariate approaches of this paperIn this paper, we shall consider test statistics that treat the data in a multivariate manner. Chuang-Stein and Mohberg (1993) proposed a related approach, with a multivariate Waldstatistic. In Table 1, each group (drug, placebo) has 211 = 2048 possible response sequences,according to the (yes, no) outcome for the response on each adverse outcome. The percentagesin Table 1 refer to the 11 one-dimensional marginal distributions of the 211 contingency table foreach group that shows the counts of the possible response sequences. We compare the marginaldistributions for the two groups, while using the information in their joint distributions, and wealso compare the joint distributions.
For multivariate normal responses, the Hotelling T 2-test is a well-known method for testing
equality of a vector of means for two independent samples. (In the two-sample context, it is alsocalled the Mahalanobis test.) We discuss analogues of this test for vectors of binary responses. Section 2 presents a likelihood ratio test comparing the marginal distributions with marginallogit modelling. The test is computationally intensive when each vector has a large number ofelements. Section 3 presents a simpler Wald test and a related score-type test. Section 4 discussestests comparing the joint distributions for the two groups. The emphasis is on permutation tests,since asymptotic tests are not justiﬁed even with relatively few side-effects. Section 5 presentsanalyses based on simpler models, such as random-effects models, that provide structure for the
association between the responses for different adverse events. Section 6 considers the adequacyof the large sample methods when the data are sparse and makes recommendations. Meth-ods of each section are illustrated for the asthma data of the phase II clinical trial. Section 7describes possible follow-up analyses, and Section 8 brieﬂy discusses extensions to multicategoryresponses and comparisons of several groups, for which the tests proposed are multivariate ver-sions of likelihood ratio and Pearson tests of independence. Using marginal models for multivariate binomial vectors
For concreteness, in formulating models we refer to Table 1, which has a binary explanatoryvariable (group) and a multivariate binary response vector. We denote the group by i = 1 forthe drug and i = 2 for the placebo and we denote the number of binary variables that consti-tute the multivariate response by c .c = 11 for Table 1). We assume an independent multinomialdistribution for the counts in each subtable of size 2c, with sample size n1 for group 1 andn2 for group 2. For a randomly selected subject assigned x = i, let (yi1, . . . , yic) denote thec responses, where yij = 1 or yij = 0 according to whether side-effect j is present or absent. Let
πi.j/=P.yij =1/. Then {.πi.j/, 1−πi.j//, j =1, . . . , c} are the c one-way marginal distributionsfor the 2c cross-classiﬁcation of responses when x = i. Simultaneous marginal homogeneity model
This section considers the null hypothesis of equality of the two vectors of binomial parameters(π1.1/, . . . , π1.c/) and (π2.1/, . . . , π2.c/), i.e., for each side-effect j,
We refer to this as the simultaneous marginal homogeneity (SMH) hypothesis for the two multi-variate distributions. This hypothesis corresponds to the marginal logit model
More generally, this and other models that we consider can incorporate explanatory variablesin addition to the group.
Model (2) is simple. However, maximum likelihood (ML) ﬁtting is computationally imprac-
tical for large c. The models apply to c marginal distributions of the 2c-table for each group,yet the product multinomial likelihood refers to the multinomial probabilities within those twotables. Note that we cannot ﬁt the model by using only the marginal information in a tablesuch as Table 1; we need the two 2c joint distributions to incorporate the correlations betweenresponses on different adverse events. See Agresti (2002), pages 464–466, for a brief review ofML methods for ﬁtting marginal logit models.
To maximize the product multinomial likelihood subject to the SMH constraint, one approach
iteratively uses Lagrange’s method of undetermined multipliers together with the Newton– Raphson method (Aitchison and Silvey, 1958; Haber, 1985). We used an algorithm based on reﬁnements of this method (Lang and Agresti, 1994; Lang, 2004), in which the matrix inverted in the Newton–Raphson step has simpler form. Let π denote the vector (with 2 × 2c elements) of the two sets of multinomial probabilities. Among the classes of models to which this algorithm applies are the linear model having the matrix form Aπ = Xβ
and generalized log-linear models of form
C log.Aπ/ = Xβ:
In this context, the matrix A applied to π forms the relevant marginal probabilities, and β is the vector of the c model parameters. For logit model (2), C applied to the log-marginal-probabilities forms the marginal logits for the models. An R function (mph.fit) for the algorithm applied to such classes of models is available from Professor J. B. Lang (Statistics Department, University of Iowa; e-mail [email protected]; details at www.stat.uiowa.edu/∼jblang). The algorithm becomes more computationally demanding as c increases, but we could use it with c = 11 for the example of this paper. 2.2. Testing simultaneous marginal homogeneityAfter ﬁtting model (2), likelihood-based methods can test the SMH hypothesis. With large sam-ples, we could use a likelihood ratio or Pearson statistic testing the goodness of ﬁt of logit model(2). Such statistics compare the ﬁt of this model with the saturated model
The SMH hypothesis (1) corresponds to H0 : β1j = β2j, j = 1, . . . , c, in this model.
The likelihood ratio statistic G2 equals −2 times the logarithm of the ratio of the maximized
likelihoods for models (2) and (4). The Pearson statistic compares the 2 × 2c observed and ﬁttedcounts for model (2), using X2 = Σ (observed − ﬁtted)2/ﬁtted. These two statistics have largesample χ2-distributions with degrees of freedom df = c, the difference in parameter dimension-ality of the two models. For these statistics, the resulting null distribution does not assume anyparticular structure for the joint distribution.
In the ﬁrst step towards a safety analysis, investigators in the phase II trial sought an overallevaluation of the safety proﬁle of the asthma drug. The goodness-of-ﬁt tests of model (2) yieldlikelihood ratio statistic G2 = 16:1 and Pearson statistic X2 = 14:2, each with df = 11. Neitherstatistic shows much evidence against the SMH null hypothesis (P = 0:14 and P = 0:22) for theasthma data. This is valuable information to determine whether proceeding to a larger trial isjustiﬁed from a safety point of view. It is also relevant for an interim analysis of large, expensivephase III trials, in which an independent data monitoring committee monitors safety and givesrecommendations based on their statistical safety analysis. In a different context, the result ofsuch a test might be part of the statistical presentation to federal drug agencies to help to justifya drug approval application.
The joint tables for the asthma data are sparse, having 211 observations in 2 × 211 = 4096
cells, so conclusions based on these tests are tentative. The reliability of asymptotics in suchcases will be addressed further in Section 6. Wald and score-type tests of simultaneous marginal homogeneity
As c increases, likelihood-based approaches become computationally more difﬁcult. Forinstance, we could not use the R function that was mentioned earlier for a data set with c > 11variables. Alternative strategies are needed that can also handle large c. The simplest approachto testing SMH is to form a test statistic using solely the marginal sample proportions and theirvariances and covariances.
In group i, let ˆπi.j/ denote the sample proportion of subjects who report side-effect j. Let
d = .d1, . . . , dc/ with dj = ˆπ1.j/ − ˆπ2.j/, j = 1, . . . , c. Appendix A gives the formula for the covariance matrix of d. Let ˆ Σ denote the sample version of this matrix. Then, a Wald statistic
for testing the null hypothesis of SMH is
Σ−1d:
This also has an asymptotic null χ2-distribution with df = c and was used by Chuang-Stein andMohberg (1993) for comparing adverse events.
In the univariate case (c = 1), the Wald statistic is not as reliable a method for comparing two
proportions as the Pearson statistic is. For instance, its nominal size tends not to be as close to the actual size. Thus, for any c we prefer an alternative statistic that uses the pooled estimate of the variance and covariance. Appendix A also shows this matrix, which applies under the null hypothesis. Denote the pooled estimate of Σ by ˆ Σ0. Let W0 = d ˆΣ−1
0 d. When c = 1, this
is the Pearson χ2-statistic, which is the score test. We recommend it over W because of thepoor performance in general of Wald inference for proportion data. We shall refer to W0 as a‘score-type’ test, since a full score test for this hypothesis requires estimating the covariancessolely under SMH, which is considerably more complex.
For the data that are summarized in Table 1, W0 = 19:9 with df = 11 (P = 0:047). The evidence
against the null hypothesis is somewhat stronger than with the likelihood-based statistics. Ofcourse, there is no guarantee that W0 performs well for large c or with small n1 and n2. Also,Appendix A shows that when n1 = n2 it uses an additional assumption about the second-ordermarginal distributions. To obtain some feed-back on the validity of the asymptotic P-value,we could construct a P-value by using the bootstrap, repeatedly taking multinomial samples ofsizes n1 and n2 from the two groups. The multinomial probabilities for the bootstrap are theﬁtted distribution for the SMH model (2). The bootstrap test P-value is the proportion of gen-erated resamples for which W0 is at least as large as the sample value. Using 100000 bootstrapresamples, the bootstrap P-value for the observed value of W0 = 19:9 was 0.045, compared with0.047 from the asymptotic χ2-distribution.
When the models are expanded to include explanatory variables, the most straightforward
way to obtain parameter estimates in marginal models is the quasi-likelihood approach basedon generalized estimating equations (GEEs; Liang and Zeger (1986)). This approach is summa-rized in Appendix B. Even without explanatory variables, the GEE approach is computationallymuch simpler than ML for tables with large c. With the binary predictor of group and an unstruc-tured working correlation matrix for the joint distribution of the variables, this corresponds toiterating the weighted least squares approach of Koch et al. (1977) (see Miller et al. (1993)). TheGEE methods are not likelihood based. Thus, tests of hypotheses such as SMH naturally useWald tests rather than likelihood ratio tests. There has been some work on constructing score-type tests for the GEE approach (e.g. Rotnitzky and Jewell (1990)) which also use empiricalcovariance estimates to adjust for a misspeciﬁed correlation structure.
For the asthma data, the GEE approach assuming an exchangeable correlation structure
among the adverse events gives a Wald statistic of 21.7, with df = 11 (P-value 0.03). Simi-lar results occurred for the Wald statistic by using other working correlation structures. Whenapplied to the linear model using the identity link function, GEEs compute the empirical covari-ance of the marginal sample proportions rather than the marginal sample logits. Then, the Waldstatistic that is obtained with this approach is the statistic W that was introduced above, whichequals 21.1. However, the empirically based standard errors for the GEE approach tend tounderestimate the true standard errors (e.g. Firth (1993)), and this is supported by a study
that we conducted that is reported below in Section 6. So, we treat the P-value of 0.03 for thisapproach with some scepticism.
We do not believe that GEEs with Wald tests are as reliable as the test using the score-type
statistic W0 or the likelihood ratio test of the previous section. This is studied further in Section6. Its advantages are versatility and readily available software. Tests of identical joint distributions
In some cases, it may be of interest to test the null hypothesis that the entire 2c joint distributionsare identical for the two groups, i.e., for all possible response sequences (a1, . . . , ac),
P.y11 = a1, . . . , y1c = ac/ = P.y21 = a1, . . . , y2c = ac/:
When the null hypothesis is supposed to represent ‘no effect’, for instance with subjects makingthe same response whether they take a drug or placebo, then this is a more complete descriptionthan SMH of no effect. Although this hypothesis of identical joint distributions (IJDs) is nar-rower than SMH, in a way it is actually more nearly analogous to the Hotelling approach fornormally distributed data. That test assumes a common covariance matrix for the two groups,and hence identical multivariate normal distributions.
The ﬁtted null joint distribution results simply from ﬁnding joint sample proportions for
the table collapsed over the group, and the ﬁtted counts are these proportions multiplied bythe respective sample sizes in the two groups. The likelihood ratio test, which has test statisticG2 = 2 Σ observed log(observed/ﬁtted), has residual df = 2c − 1. The df-value results from com-paring an alternative hypothesis with two independent sets of 2c − 1 multinomial probabilitieswith a null hypothesis with a single set. Although computationally simple, using a χ2-distribu-tion for this or the related Pearson X2-statistic is not sensible for even moderate-sized c, becauseof extreme sparseness and the very large df-value. For instance, for the asthma data on whichTable 1 is based, G2 = 118:6 and X2 = 31:9, but these have df = 2047.
Instead, we recommend conducting tests of the IJDs hypothesis using the exact permutation
distribution under this null structure of exchangeability of distributions. For the sample subjects,consider all .n1 + n2/!=n1! n2! ways of partitioning the sample into n1 subjects for group 1 andn2 subjects for group 2. For a chosen test statistic, the P-value is the proportion of these parti-tions for which the statistic is at least as large as the observed value. This P-value is calculatedunder the exchangeability assumption for the two groups in terms of their joint distribution,which is the null hypothesis that was mentioned above. With large n1 or n2, this permutationapproach can be computationally intensive even with a simple test statistic. We can then selecta random sample of the possible partitions. For instance, with 5 million random partitions anda true P-value of 0.05, the estimated P-value has a standard error of 0.0001, which is more thansufﬁcient for nearly all purposes.
Even with the modest sample sizes (n1 = 146 and n2 = 65) of the asthma drug safety study,
the permutation analysis entails the order of 1073 different partitions of the 211 subjects intotwo groups of these sizes. Thus, we took a random sample of 5 million partitions. Using thepermutation distribution, G2 = 118:6 has P-value 0.14 and X2 = 31:9 has P-value 0.29. TheseP-values provide very similar results to those for the asymptotic tests of the SMH hypothesisusing these two statistics.
Likewise, we could generate a P-value under the IJDs hypothesis for a statistic that is designed
to detect a particular characteristic for which the two distributions differ. An example is thescore-type statistic of the previous section for comparing the marginal proportions. Under thepermutation distribution, W0 = 19:9 has P-value equal to 0.041. Tests imbedded in a model for the joint distributions
The main questions of interest for the asthma data refer to the marginal probabilities for the11 adverse events, for the drug and placebo. The actual form of that joint distribution may beregarded as a nuisance, or at best of secondary interest. Thus, the analyses that are considered inSections 2 and 3 dealt directly with the marginal distributions and made no attempt to describethe joint distribution of the responses. Alternatively, we can compare the marginal distributionsor the joint distributions of the responses while assuming a model for the joint distribution. Itis easiest to do this by considering a model for which the SMH hypothesis of Sections 2 and 3is equivalent to the IJDs hypothesis of Section 4.
This section shows ways to compare the margins while modelling the joint distribution. It
also mentions ways potentially to increase the power by considering simpler structure for themarginal inhomogeneity. Using random effects to model the dependence
The best-known way to induce an association between the c responses is by using randomeffects. Let πs.i/.j/ denote the probability of side-effect j for subject s who is in group i. Alogistic–normal random intercept analogue of model (4) is
where the subject-speciﬁc random effects {us.i/} are independent from an N.0, σ/ distribution. Under this structure, SMH and IJDs correspond to the simpler model
Since this random-effects model implies a common, non-negative association between pairsof adverse events, it is inappropriate if there is reason to expect negative association betweencertain pairs of side-effects or associations that vary dramatically in strength.
Assuming this model form, we can test SMH (and IJDs) by the likelihood ratio test compar-
ing models (6) and (5). Again, it has df = c. For Table 1, the likelihood ratio statistic equals 22.1(df = 11; P-value 0.023). Marginal models with simultaneous model for joint distribution
When many adverse events are measured, it may be that certain associations are negative. Then,there are alternative ways to model the joint distribution. For instance, we could use a log-linearmodel. This does not require assuming an exchangeability structure for the joint distribution,unless we assume a quasi-symmetric form of log-linear model (which is implied by a random-effects model). The model for the two joint distributions can be speciﬁed simultaneously withone for the marginal distributions. We can ﬁt log-linear models simultaneously with compatiblemarginal models by using methods that were described in Fitzmaurice and Laird (1993) andin Lang and Agresti (1994). Lang’s R function that was mentioned above can ﬁt such models. With this approach, however, results of tests of SMH will be similar to results for tests that usea saturated structure for the joint distribution. In standard log-linear models for the joint distri-bution, the marginal and joint model parameters are orthogonal. In particular, if the marginal
Summary of methods for comparing adverse event incidence for drug and placebo groups by
1. Marginal models(a) Likelihood ratio test of SMH (e.g. using Lang’s
Likelihood ratio statistic 16.1, df = 11, P = 0:14
(b) Score-type test of SMH† (quadratic form using
differences and a null covariance matrix)
(c) GEE (Wald) test of SMH (quadratic form using
2. Joint models(a) Permutation test of IJDs†
Likelihood ratio statistic 118.6, P = 0:14
(b) Likelihood ratio test of SMH and IJDs for random
Likelihood ratio statistic 22.1, df = 11, P = 0:02
†Preferred method for sparse data.
model of SMH holds, the ML estimator of the marginal model parameters is consistent even ifthe model for the joint distribution is incorrect. Structure for the marginal inhomogeneity
Table 2 summarizes the types of analyses that we have applied to the asthma data. Except forthe permutation tests, the P-values are based on asymptotics. Since the complete 2 × 211 tablecorresponding to Table 1 is sparse, conclusions based on tests having df = 11 must be madecautiously. The asymptotics may not hold well, as we shall discuss in Section 6. More reliableand informative tests use a model-based comparison of the SMH model with a model that pro-vides some structure for the nature of the marginal inhomogeneity. Using a narrower alternativehypothesis provides the potential for increased power and also focuses attention on estimatingwhatever effects may exist.
One special case of the saturated model (4) that has SMH as a further special case is the logit
Here, I.·/ is an indicator function, and the model permits a shift difference α between the groupsfor each variable. SMH is the special case α = 0. We could use an analogous structure in random-effects model (5). Such alternatives are worthy of attention, for instance, if we expect that eachadverse event may be more likely for the drug than for the placebo.
For Table 1, model (7) has ML ﬁt statistics G2 = 12:6 and X2 = 8:9, with df = 10, and ˆα =
−0:354 has se = 0:178. It provides slight evidence of improvement over the SMH model (2),with the change in G2 equal to 3.5 (df = 1; P-value 0.06).
In the spirit of this model, we could form a simple statistic to summarize results across adverse
events that would build power for an alternative by which the probability tends to be higher forone of the groups. For instance, for each subject we could count the total number of adverseevents and compare the means for the two groups, using either asymptotic normality of thesample means or assuming a distribution such as the negative binomial distribution or using anonparametric comparison.
For Table 1, about 80% of the subjects had no more than two adverse events, and the maxi-
mum was six. The drug and placebo groups had sample means of 1.34 and 1.48, with standarddeviations of 1.33 and 1.34. The two-sample t-test has a two-sided P-value of 0.50. This isalso the P-value for the likelihood ratio test comparing negative binomial models with separatemeans and equal means. The ‘exact’ Wilcoxon test comparing the two distributions using aconditional test for the 2 × 7 table cross-classifying the group with the number of adverse events(i.e. conditional on the total number of observations for each adverse event total) had a P-valueof 0.44 (using StatXact or procedure NPAR1WAY in SAS). The large P-value here partly reﬂectsthe substantial discreteness for this conditional test.
Such approaches have the potential for building power, by focusing the effect on a single
parameter and single degree of freedom. This can be helpful; for instance, O’Neill (1998) pointedout that pre-market safety databases are often not sufﬁciently large to have much power fordetecting signiﬁcance for a particular adverse event. However, in practice, adverse events areprobably not often uniformly more likely with a drug than a placebo. In Table 1, the sampleproportion is higher for the placebo than for the drug in four of the 11 cases, including thecase with the largest difference, so it is no surprise that the P-values that are reported in thissubsection are not particularly small. Checks of asymptotic tests, and recommendations
A limitation of the ML modelling approach is potential problems due to sparseness of the data. Sparseness can occur in the 2 × 2c contingency table if it has many possible adverse events (i.e. large c), or small sample sizes or additional predictors that expand the table even further. Inparticular, large sample χ2-tests are more trustworthy when based on small df-values than largedf-values. Asymptotics for score and Wald statistics
When the asymptotics are questionable for the χ2-tests that are presented in this paper, it issensible to use the permutation distribution of the statistic of interest. However, one shouldrealize that the distribution is computed under the IJDs condition, as discussed in Section 4. When we are merely interested in testing SMH, the IJDs condition is narrower than the nullhypothesis of interest.
To check the adequacy of the large sample asymptotics, we conducted a simulation study. We
used two null joint distributions: the SMH ﬁt and the IJD ﬁt, for the sample distribution thatgenerated Table 1. We used two values of c: c = 11, and c = 5 with the ﬁrst ﬁve side-effects. Weused sample sizes n1 = n2 = 50 and n1 = n2 = 100. Since some studies use two or three times asmany subjects for the drug as for the placebo, we also considered the unbalanced case (n1 = 100and n2 = 50), as well as the actual sample sizes for Table 1 (n1 = 146 and n2 = 65).
The theoretical asymptotic distribution for W0 holds under IJDs, but not under solely SMH,
because the covariance matrix assumes second-order homogeneity as well (unless n1 = n2). Nevertheless, we found that, overall, W0 performs well for both IJDs and SMH although thedata are quite sparse for some choices of c and (n1, n2). However, results for the ordinary Waldstatistic W were poor. For instance, consider SMH with c = 11 and .n1, n2/ = .146, 65/. Thesimulated mean for W was 13.4 and the variance was 46.7 (compared with nominal χ2-values of11 and 22) and, for nominal tail proportion values of 0.10, 0.05 and 0.01, the actual proportionsin the tails were 0.23, 0.15 and 0.07. By contrast, for the score statistic W0, the simulated meanwas 10.9, the variance was 21.6 and the tail proportions were 0.097, 0.047 and 0.009. For this
Estimated probability density functions of the Wald statistic W (– – – ) and score-type statistic W0
.. . . . . . ./ under the assumption of (a) SMH and (b) IJDs (
, reference χ2-density with df D 11)
case, Fig. 1 shows the simulated density functions of W and W0 under SMH and IJDs relativeto the χ2-distribution with df = 11.
The tests that compare the c marginal distributions have df = c, unless we add further structure
such as in model (7). For such tests and estimation of corresponding parameters, the sparse-ness seems to be relevant in terms of the marginal totals of the two possible outcomes for eachadverse event, for each group. The marginal models do not have reduced sufﬁcient statistics,but on the basis of what applies to χ2-statistics in the univariate case it seems sensible to inspectthe expected frequencies for the c separate 2 × 2 marginal tables comparing the two groups on
the binary response. For the sample sizes that were used in the simulation study, for the casesin which the asymptotics performed poorest, the minimum expected frequency was less than 3and many of the 4c expected frequencies were below 5. It is unrealistic to expect a simple samplesize guideline to cover all cases well, but a tentative suggestion is to be cautious when using thistest when many marginal expected frequencies are smaller than 5.
Refer to the summary of models and tests in Table 2. Overall, we have the following recom-mendations. To test the IJD hypothesis, use the likelihood ratio or Pearson statistic based onthe ﬁtted values for that hypothesis, but use the permutation distribution (randomly sampled,if necessary) to obtain the P-value. To test the SMH hypothesis, use the score-type statisticW0. We recommend W0 over the likelihood ratio or Pearson statistic merely because we couldconduct simulations to evaluate its asymptotic performance; this is computationally difﬁcultfor the ML-based statistics for testing SMH. When some marginal expected frequencies aresmall to moderate, we can seek corroboration by checking whether similar results apply witha bootstrap for W0 under the ﬁtted SMH distribution (when it is computationally feasible toobtain that ﬁtted distribution). If results differ in a practical sense, or if many of the marginalexpected frequencies are less than about 5, it is safer to use a permutation test of IJDs instead. When c = 1, the SMH and IJDs methods are identical, and the likelihood ratio and score-typestatistics simplify to the ordinary likelihood ratio and Pearson statistics for testing independencein a 2 × 2 table. Follow-up comparisons
We presented multivariate methods to assess the evidence of a global difference for two vectorsof proportions. When the null hypotheses of SMH or IJDs are rejected, investigators are nat-urally interested in which speciﬁc adverse events or sets of adverse events actually caused thedifference. For any given adverse event, a 2 × 2 table compares the counts on the two possible
Follow-up inference for estimating differences of incidence of several adverse events
outcomes for the two groups. Table 3 shows the signed square root of the Pearson statistic,which is the z-statistic for comparing two independent proportions by using the standard errorbased on pooling the two samples. Of the 11 z-statistics, only one has absolute value largerthan 2, with one other close to 2. The Westfall and Young (1989) adjusted P-values are alsoshown.
More informatively, we could form simultaneous conﬁdence intervals for a summary mea-
sure comparing the drug with the placebo for each adverse event. Table 3 illustrates by showingBonferroni conﬁdence intervals for the difference of proportions based on inverting the scoretest (Mee, 1984). This method tends to have actual conﬁdence level nearer the nominal levelthan the usual Wald interval. Each of these intervals shown in Table 3 has nominal conﬁdencecoefﬁcient of 0.995 45, so asymptotically the nominal overall level is at least 0.95.
For such follow-up comparisons, it is possible for all to be non-signiﬁcant, for the signiﬁcant
comparisons to be in a single direction (e.g. always a higher proportion for the drug) or mixed. In the last case, what can we say about the overall safety advantages of one treatment over the other? We could weight the evidence that is provided by the different adverse events, especially if some are regarded as more important than others. Let wj denote a non-negative weight that is associated with adverse event j. For w = .w1, w2, . . . , wc/ , w d is a weighted average of the differences. The global score-type statistic W0 then generalizes to the weighted version
˜W0 =.w d/2=.w ˆΣ0w/,
with df = 1. For instance, investigators considered adverse events 1 and 4 in Table 1 to be moreimportant for the asthma drug. Assigning twice as much weight to these two adverse events,we obtain ˜
W0 = 1:32 (P-value 0.25). Such summaries also have the advantage that was men-
tioned in Section 5.2 of potentially building power from concentrating the effect on a singledegree of freedom. Here, this approach did not result in a small P-value, as the placebo hada higher proportion for the ﬁrst adverse event but the drug did for the fourth adverse event.
We could also incorporate weights in the score statistic itself, without planning to form
a weighted summary. We weight the inﬂuence of difference j using the weighted difference
˜dj = wjdj. The global score-type statistic W0 then generalizes to the weighted version
0 d. It incorporates prior belief about the seriousness of adverse events and the mag-
nitude of their differences between the drug and placebo. Or, as in Berry and Berry (2004), wecould take into account the body system, for instance by using weights for adverse events ina common body system that are inversely proportional to the number of adverse events in it. Σ0 is constructed from ˆΣ0 by simply multiplying the jth diagonal element by w2 and the
.j, k/th off-diagonal element by wjwk. The ordinary score-type statistic W0 is the special casewith identical {wj}, and this statistic likewise has an asymptotic χ2 null distribution with df = c. If the greater differences between the drug and placebo occur with adverse events consideredmore serious, this statistic may show greater signiﬁcance than the ordinary score-type statistic. Extensions
The methods of this paper extend in obvious ways to several groups. To test SMH with g groups and c variables, we can extend the score-type statistic W0 by forming a vector d of c.g − 1/ differences of proportions, comparing a given proportion for each group with the correspond- ing proportion for an arbitrary base-line group. The variances and covariances of the differences are estimated by using estimates { ˆπ.j/} and { ˆπ.j, k/} based on pooling the g samples.
The methods also extend in obvious ways to multicategory responses. For comparing g groups
simultaneously on c variables, with rj categories for variable j, the basic likelihood ratio and
score-type sorts of tests have df = .g − 1/.Σj rj − c/. For a single variable, these simplify to thelikelihood ratio and Pearson χ2-tests of homogeneity (or, equivalently, independence) in a two-way g × r contingency table. With even moderate g and c, asymptotic methods are suspect. Asensible strategy for testing is a permutation test for the various allocations of the subjects tothe g groups, computing a relevant sample statistic for each (e.g. the extended W0-statistic fortesting SMH). With covariates, the permutation test is still feasible by using a random sampleof the possible permutations, even when some covariates are continuous.
In another context, the SMH hypothesis is a special case of a hypothesis that was studied by
Agresti and Liu (1999) in considering survey data in which each subject can pick any number ofoutcomes for a multiple-category response. See also Loughin and Scherer (1998) for a bootstrapapproach for such data. For a related permutation analysis, see Berry and Mielke (2003).
As is generally true, we have seen that different tests and different test statistics for a given
hypothesis can lead to quite different P-values. For the asthma data, there was no uniformityrelative to the often sacred 0.05-level in terms of whether hypotheses should be rejected. Thispoints out the importance of giving careful thought ahead of time to which is the relevanthypothesis to test (i.e. SMH or IJDs) and which statistic we prefer to summarize the effect. Italso points out the ultimate advantage of focusing on the size of the effects rather than mere sta-tistical signiﬁcance. Conﬁdence intervals based on different methods (e.g. Wald, likelihood ratioor score) can appear relatively similar in practical terms even when P-values of correspondingtests diverge somewhat. Acknowledgements
This research was supported by grants from the National Institutes of Health and the NationalScience Foundation. The authors thank Dr Davis Gates at Schering-Plough Corp. for permis-sion to use the data, Dr Joseph Lang for the use of his R function for ﬁtting marginal modelsand two referees for helpful suggestions to improve the presentation. Appendix A: Covariance matrices for Wald and score statistics
Let d = .d1, . . . , dc/ with dj = ˆπ1.j/ − ˆπ2.j/, j = 1, . . . , c. The vector of differences d has covariance matrix with elements
var.dj/ = π1.j/{1 − π1.j/}=n1 + π2.j/{1 − π2.j/}=n2,
cov.dj, dk/ = cov{ ˆπ1.j/, ˆπ1.k/} + cov{ ˆπ2.j/, ˆπ2.k/}
= {P.y1j = 1, y1k = 1/ − P.y1j = 1/P.y1k = 1/}=n1
+ {P.y2j = 1, y2k = 1/ − P.y2j = 1/P.y2k = 1/}=n2:
Under the null hypothesis, the variance is estimated by
π.j/{1 − ˆπ.j/}π.j/ = {n1 ˆπ1.j/ + n2 ˆπ2.j/}=.n1 + n2/:
Under the additional assumption that the two groups have the same second-order marginal distributions,the covariance is estimated by
{ ˆπ.j, k/ − ˆπ.j/ ˆπ.k/}π.j, k/ denotes the sample proportion of cases that had both side-effects j and k, after the two sam-
ples have been pooled. When n1 = n2, this estimate is identical to the estimate using only pooled ﬁrst-ordermarginal distributions, and we do not need the extra assumption. Appendix B: A non-technical summary of the generalized estimating equation approach
The GEE approach is a multivariate version of quasi-likelihood, meaning that it speciﬁes only the ﬁrsttwo moments rather than a full distribution (Liang and Zeger, 1986). The model applies to the mean ofthe marginal distribution for each component yij of the multivariate response (such as model (2)). Themethod assumes a variance function corresponding to the distribution that it is natural to assume foryij marginally (such as the binomial distribution) and uses a working guess for the correlation structureamong {yi1, . . . , yic}. It does this without assuming a particular multivariate distribution. The estimatesare solutions of GEEs. These resemble likelihood equations but are not, since a complete multivariatedistribution is not speciﬁed (in the univariate case they are likelihood equations under the additionalassumptions that the distribution is the member of the exponential family that has the assumed variancefunction).
The GEE estimates of model parameters are valid even if we misspecify the covariance structure, i.e.,
when the marginal model is correct, then the GEE model parameter estimators are consistent. Standarderrors result from a ‘sandwich matrix’ adjustment that the GEE method makes using the empirical depen-dence that the data exhibit. The na¨ıve standard errors based on the working correlation assumption areupdated by using the information that the data provide about the actual dependence structure to yieldrobust standard errors that are more appropriate than those based on the guessed working correlation. In theory, choosing the working correlation wisely can pay beneﬁts of improved efﬁciency of estimation. However, Liang and Zeger (1986) noted that estimators based on treating the responses as independentin the working correlation structure can have surprisingly good efﬁciency when the actual correlation isweak to moderate.
The GEE approach is appealing because of its computational simplicity, but it has limitations. Since
it does not completely specify the joint distribution, there is no likelihood function, and likelihood-basedmethods are not available. In addition, unless the sample size is quite large, the empirically based standarderrors tend to underestimate the true standard errors (Firth, 1993). References
Agresti, A. (2002) Categorical Data Analysis. New York: Wiley. Agresti, A. and Liu, I.-M. (1999) Modeling a categorical variable allowing arbitrarily many category choices. Biometrics, 55, 936–943.
Aitchison, J. and Silvey, S. D. (1958) Maximum-likelihood estimation of parameters subject to restraints. Ann.Math. Statist., 29, 813–828.
Berry, K. J. and Mielke, Jr, P. W. (2003) Permutation analysis of data with multiple binary category choices. Psychol. Rep., 92, 91–98.
Berry, S. M. and Berry, D. A. (2004) Accounting for multiplicities in assessing drug safety: a three-level hierarchical
mixture model. Biometrics, 60, 418–426.
Chuang-Stein, C. and Mohberg, N. (1993) A uniﬁed approach to the analysis of safety data in clinical trials. In
Drug Safety Assessment in Clinical Trials (ed. G. Sogliero-Gilbert). New York: Dekker.
Firth, D. (1993) Recent developments in quasi-likelihood methods. Proc. 49th Sess. Int. Statist. Inst., 341–358. Fitzmaurice, G. M. and Laird, N. M. (1993) A likelihood-based method for analysing longitudinal binary
responses. Biometrika, 80, 141–151.
Haber, M. (1985) Maximum likelihood methods for linear and log-linear models in categorical data. Comput.Statist. Data Anal., 3, 1–10.
Holm, S. (1979) A simple sequential rejective multiple test procedure. Scand. J. Statist., 6, 65–70. Koch, G. G., Landis, J. R., Freeman, J. L., Freeman, D. H. and Lehnen, R. G. (1977) A general methodology for
the analysis of experiments with repeated measurement of categorical data. Biometrics, 33, 133–158.
Lang, J. B. (2004) Multinomial-Poisson homogeneous models for contingency tables. Ann. Statist., 32, 340–
Lang, J. B. and Agresti, A. (1994) Simultaneously modeling joint and marginal distributions of multivariate
categorical responses. J. Am. Statist. Ass., 89, 625–632.
Lefkopoulou, M. and Ryan, L. (1993) Global tests for multiple binary outcomes. Biometrics, 49, 975–988.
Lehmacher, W., Wassmer, G. and Reitmeir, P. (1991) Procedures for two-sample comparisons with multiple
endpoints controlling the experimentwise error rate. Biometrics, 47, 511–521.
Liang, K.-Y. and Zeger, S. L. (1986) Longitudinal data analysis using generalized linear models. Biometrika, 73,
Lin, T. C., Hosmane, B. S., Olson, P. J. and Padley, R. J. (2001) Analysis of adverse events in titration studies. J. Statist. Planng Inf., 96, 129–142.
Loughin, T. M. and Scherer, P. N. (1998) Testing for association in contingency tables with multiple column
responses. Biometrics, 54, 630–637.
Mantel, N. (1980) Assessing laboratory evidence for neoplastic activity. Biometrics, 36, 381–399. Mee, R. W. (1984) Conﬁdence bounds for the difference between two probabilities. Biometrics, 40, 1175–1176. Mehrotra, D. V. and Heyse, J. F. (2004) Use of the false discovery rate for evaluating clinical safety data. Statist. Meth. Med. Res., 13, 227–238.
Miller, M. E., Davis, C. S. and Landis, J. R. (1993) The analysis of longitudinal polytomous data: generalized
estimating equations and connections with weighted least squares. Biometrics, 49, 1033–1044.
O’Brien, P. C. (1984) Procedures for comparing samples with multiple endpoints. Biometrics, 40, 1079–1087. O’Neill, R. T. (1988) Assessment of safety. In Biopharmaceutical Statistics for Drug Development (ed. K. E. Peace).
O’Neill, R. T. (2002) Regulatory perspectives on data monitoring. Statist. Med., 21, 2831–2842. Pocock, S. J., Geller, N. L. and Tsiatis, A. A. (1987) The analysis of multiple endpoints in clinical trials. Biometrics, 43, 487–498.
Rotnitzky, A. and Jewell, N. P. (1990) Hypothesis testing of regression parameters in semiparametric generalized
linear models for cluster correlated data. Biometrika, 77, 485–497.
Westfall, P. H. and Young, S. S. (1989) P value adjustments for multiple tests in multivariate binomial models. J. Am. Statist. Ass., 84, 780–786.
Zhang, J., Quan, H., Ng, J. and Stepanavage, M. (1997) Some statistical methods for multiple endpoints in clinical
trials. Contr. Clin. Trials, 18, 204–221.

Case report Clin Ter 2013; 164 (3):e207-209. doi: 10.7417/CT.2013.1570 A case of sensitization to Alitretinoin T. Grieco, V. Faina, L. Dies, M. Milana, L. Alei, E. Silvestri, S. Calvieri Department of Dermatology, “Sapienza” University of Rome, Italy Abstract Case report Alitretinoin is a unique retinoid authorised for the treatment of adults with severe chronic hand eczem

Iraqi Journal of Veterinary Sciences, Vol. 22, No. 1, 2008 (5-7) Evaluation the effect of albendazole against nematodes in sheep in Mosul, Iraq E. K. Mohamed and M. I. Al-Farwachi Department of Internal and Preventive Medicine, College of Veterinary Medicine, University of Mosul, Mosul, Iraq (Received February 11, 2008; Accepted June 26, 2008) Abstract Six sheep farms in Mos