Evaluating Security Products with Clinical Trials
One of the largest challenges faced by purchasers of security products is evaluating their relative merits. While customers can get reliable information on characteristics such as runtime overhead, user interface, and support quality, the actual level of protection provided by different security products is mostly unranked or, worse yet, ranked using criteria that do not generally reflect their performance in practice. Even though researchers have been working on improving testing methodologies, given the complex interactions of users, uses, evolving threats, and different deployment environments, there are fundamental limitations on the ability of lab-based measurements to determine real world performance. To address these issues, we propose an alternative evaluation method, computer security clinical trials. In this method, security products are deployed in randomly selected subsets of targeted populations and are monitored to determine their performance in normal use. We believe that clinical trials can provide solid evidence of the efficacy of security products, much as they have in the field of medicine.

The Internet is a dangerous place for users. As the reach of the network has increased, it has brought with it not only access to vast collections of data but also fraud and compromise. According to several reports [3], users are at more risk of attack than ever before. Furthermore, attackers are increasingly sophisticated, adapting quickly to new technologies and countermeasures and nimbly morphing strategies to maximize payoffs. While the security industry has mounted a valiant effort, we face a situation where our best efforts are inadequate.

Perhaps the scariest part of this situation is that we don’t completely understand why we are failing. We have identifiable problems: unapplied patches, out-of-date malware signatures, poorly written software, complacent users... security experts can pontificate at length regarding the weaknesses of current systems. However, moving from this subjective, qualitative list to more concrete evaluations is difficult. Is patching more important than updating malware signatures? If so, how risky are delayed updates? And, more importantly, what defenses work in the field, and which ones do not? It is relatively easy to decide whether a defense could stop an attack; it is quite another to say that it will stop that attack in practice, particularly when attackers are given time to adapt and users are given the opportunity to invalidate [...]

Today nobody knows the true relative security merits of different products, techniques, or strategies. Virus scanners perform similarly in most lab tests, with the “best” solutions differing by fractions of a percent in overall results. Firewalls are compared and sold based upon features and speed, not security. Standard security evaluation standards (such as the Common Criteria) do not apply to systems as they are used. And security experts regularly give advice such as “use strong passwords” and “turn off JavaScript” that most users will never follow. If we security experts do not know what are the best security products, and we do not know how to effectively help non-experts, is it any surprise that we [...]

While lab-based evaluations are essential, we believe we must do more if we are to make significant strides in improving the security of the Internet. Specifically, we must learn what works best on deployed systems. Note that “what works” is not the same as “what could work.” For example, usability studies can identify problems that could arise in deployment, such as difficulties in firewall configuration or confusion over messages from an antivirus scanner. Ultimately, though, we don’t care about usability as determined in the laboratory; we care about actual use: Do administrators misconfigure firewalls in practice? How often does user confusion over proper virus scanner use actually lead to compromise?
To measure the use of security technologies in real-world circumstances, we have to account for how a given technology will interact with a huge variety of software, systems, users, uses, and attack profiles. The full complexity of the computational world cannot be captured in any lab setting or theoretical model: there are too many variables, and many of them change over timeframes (months or years) that cannot be practically measured in a laboratory setting using humans. As an alternative, we propose that the performance of security technologies be measured “in the field.” Specifically, we propose that security technologies be tested using the same methodology as used in medical clinical trials. In essence, we propose that we use the same measures of outcome, side effects, and user tolerance and compliance that regulatory bodies use to demonstrate that the benefit of a drug or medical device outweighs its risks. Clinical trials come in many forms depending upon the specific questions they are designed to address; what they all have in common, though, is that the test subjects live in the “real” world, not a laboratory.

Clinical trials were originally developed because medical practitioners faced challenges analogous to those faced by today’s security professionals: they knew a lot about health problems, but they didn’t know what worked to prevent or fix them. Clinical trials provided a methodology for separating “snake oil” from penicillin. As we will explain, clinical trials have a number of limitations as a testing methodology; our hope, though, is that clinical trials of security technologies will allow us to separate ineffective and dangerous technologies from those that provide significant security benefits.

The evaluation problem exists broadly in computer security, for both academic research and commercial products. The most egregious type of improperly evaluated security technology is often referred to as “snake oil” [8]. The ultimate question in computer security evaluation is, how do we differentiate effective security mechanisms from such quackery, particularly in the eyes of a lay audience?

Such differentiation is becoming more important because, almost always, even the best commercial systems cannot detect many of the most recent threats. This limitation arises because new threats emerge much more frequently than before, and meanwhile some of them aim for economic profits and use very complex technologies in order to bypass security mechanisms [6]. Even though many security companies have started using more flexible techniques such as heuristics to respond to new threats, in this arms race attackers always have an important advantage: the public availability of security products. Highly-skilled attackers can keep modifying their newly created malicious code until it can bypass all current defenses [2], forcing every security vendor to constantly update their products. Given this situation, how can a regular user know that their vendor is providing adequate protection against the latest threats? The obvious answer is that users should check published benchmarks; unfortunately, according to those tests, virtually every major product appears to be equivalent: they all “pass” or catch virtually all tested threats.

In the antimalware field, researchers and industry members are currently working on developing better testing standards [1]; this task is extremely difficult, however, because vendors and evaluators disagree regarding basic testing practices. For example, there is no consensus on how to construct a collection of malware for testing purposes. A major point of contention is whether such collections may contain new viruses, rather than just ones observed “in the wild” [5].

While there are certainly ethical issues involved with creating new computer viruses, we believe there is a more fundamental issue: if you create malware from scratch for testing purposes, how do you know you’ve created the right kinds? In other words, how will you determine whether detection performance on synthetic test cases will correlate with performance on malware observed in practice? This issue is just one part of a much larger issue: how can you take into account all of the factors (detection mechanisms, relative frequencies of different kinds of malware, user behavior, host and network environment, changing attacker strategies and goals) that affect a product’s real world performance in a set of standardized lab tests?

We believe the simple answer is that you can’t: the task is impossible. There are simply too many variables. Researchers and companies will continue to argue about proper lab testing procedures because there is no single right answer: every test incorporates assumptions about the real world, and these assumptions cannot be evaluated [...]

Is there a way beyond this impasse? Perhaps, but only if we can test security technologies “in the field,” that is, in the contexts in which they are used. Of course, such testing would involve attempting to protect real users from real threats while measuring relative performance. This approach is technically difficult, expensive, ethically challenging, and potentially very risky. We believe, however, that such testing is feasible based on experiences from the field of medicine, in the form of clinical trials.
While computers and humans are very different systems, the medical field has long faced evaluation problems analogous to those of computer security. Specifically, before the 20th century there existed many potential “defenses” (treatments that promised to ensure or repair health), but people continued to be attacked and compromised (suffer and die prematurely from disease). While modern medicine has a variety of limitations, current medical practice has treatments that can reliably prevent or cure many conditions that before were debilitating or even fatal. What is remarkable about these treatments is that, in general, we don’t understand how they work: our understanding of living systems is still primitive in many ways. Despite this lack of knowledge, however, we are now able to differentiate treatments that work from those that do not. The primary methodology for drawing such conclusions is the clinical trial [4].

The key insight behind clinical trials is that when studying systems (such as the human body) that are complicated, diverse, and tightly coupled with a dynamic environment, individual variables cannot be isolated and so cause and effect relationships cannot be inferred from individual observations: correlations can occur without causation, and observed effects can originate from unidentified causes. Clinical trials are an experimental methodology designed to identify causal relationships in [...]

In medicine, clinical trials, or randomized controlled trials (RCTs), are planned experiments that are designed to compare treatments for a given medical condition. They use results based on a limited sample of patients to make inferences about how treatments should be conducted in the general population of patients. While the majority of clinical trials are concerned with evaluating drugs, they can also be used to evaluate other interventions such as surgical procedures, radiotherapy, physical therapy, and [...]

To account for variations in genetic makeup, lifestyle, life history, and environment, clinical trials are designed [...]

Selected populations: At risk or afflicted individuals are studied, rather than the general population.

Extended duration: Experiments are performed for months or, ideally, years in order to evaluate longer [...]

Random samples: Subjects are randomly recruited from [...]

Comparable Treatments: Subjects are given one of a small selection of treatments, each of which is in- [...]

Randomly Chosen Treatments: Subjects or doctors do not choose their treatment; instead, the treatment is [...]

Control Groups: Some subjects do not receive any treatment or are given a placebo (e.g., a sugar pill).

Blinding: In a single blind study, subjects do not know which treatment they are receiving. In a double-blind study, the treating doctors do not know either.

Indicators: Often the condition studied evolves over a [...] end (e.g., wait until the subject is cured or dead), progress is measured by observing indicators that are known to correlate with the final outcome. For example, insulin and blood sugar levels of diabetes patients are monitored in diabetes-related trials. Note that it is often hard to find a reliable indicator (e.g., a cancer recurs even when all tests indicate the treatment was successful); thus, longer term studies are always required to assess the reliability [...]

Due to the constraints of particular experiments, not all clinical trials will include all of these features; the more that are used, however, the greater the statistical power of the results. In other words, each of these mechanisms helps with determining causal relationships. The fewer that are used, the more likely the study will only show [...]

While clinical trials are very powerful tools for determining cause-effect relationships, they are not able to tell why those relationships exist. Clinical trials do not themselves provide explanations or models; what they can do, however, is test the validity and completeness of models. For example, in medicine drugs that work well in lab experiments routinely fail to work in clinical trials on people. This failure happens even when the precise molecular mechanism of the drug is known. Quite simply, we cannot capture the full complexity of the human body in any current model or lab. With clinical trials, however, we can make sure that regular patients get effective treatments, even if we don’t understand how those treatments work.

Because computers are engineered systems, we are much better able to determine cause and effect in computer security than in medicine. However, while it is relatively straightforward to understand a given vulnerability and devise a patch that fixes it, as we explained in Section 2, it is not nearly so easy to determine what produces the ultimate result of more secure systems. So, here we ask, is it potentially feasible to adapt the clinical trial methodology to computer security?
The key constraint to the feasibility question is to realize that clinical trials cannot be used to address the same questions as standard security evaluation techniques. We cannot use a clinical trial to analyze malware, expose a new software vulnerability, or test a new cryptographic protocol. However, we can use clinical trials to address [...]

• What is the security benefit of running an antivirus program on a personal computer in a typical home?

• Do personal firewalls provide additional protection for technically advanced users on their home machines?

• Does user training protect organizations from social [...]

Note the key feature of these questions is that, because they involve interactions between computers and their users in specific environments, they cannot be answered in a controlled laboratory setting; nevertheless, they are precisely the kinds of questions we need to answer if we are to improve security in practice.

It takes a team of people to develop a medical clinical trial design: experts in the specific treatment must work with general clinicians, statisticians, experts in patient recruitment, ethicists, and others. Given that computer security clinical trials will also deal with human populations (along with computer populations), many of the same technical, legal, ethical, and logistical issues will need to be addressed. For these reasons, we cannot hope to present a complete trial design here; however, we can give an outline for a plausible computer security clinical trial. Here we present a sketch of a trial addressing the first question: the benefit of antivirus programs.

It is generally recommended that all personal computer users (at least, those running a version of Microsoft Windows) run an up-to-date antivirus scanner. A clinical trial designed to test their relative benefits could have the [...]

Population: Users running (at the start of the trial) Microsoft Windows Vista SP2 on a home machine connected to the Internet via a large home internet service [...]

Duration: Three years, with preliminary results reported [...]

Sample: 1000 ISP subscribers would be randomly recruited to participate in the trial. Each subscriber would be given the following incentives to participate: free technical support and automatic offsite backups for all machines enrolled in the trial and their users. In return, they would have to agree to researchers monitoring their computer usage (subject to appropriate privacy and other controls). Users would be allowed to drop out of the trial at any time.

Treatments: Three major antivirus programs would be selected for the trial and randomly assigned to different [...] antivirus programs would be allowed to be installed; otherwise, only the standard security software that comes with Windows Vista would be allowed to be used. Compliance would be verified by scanning off-site backups.

Note that all provided software would be kept automatically up to date, including updates to the latest [...] subscription model.) Other upgrades (software and hardware) and new installations would be permitted at the user’s discretion (e.g., upgrades from Windows Vista to Windows 7 and the installation of new [...]

Control: A control group would receive no antivirus program and would be prohibited from running any host-based antivirus program. To ensure that users were still protected, unobtrusive non-host based defenses (e.g., scanning disk backups, cloud-based antivirus [...] protection could not be provided with these other mechanisms, we would then have to omit a control group. This case is analogous to a medical clinical trial where it is unethical to omit treatment for [...]

Blinding: The antivirus programs would be modified to remove any obvious corporate insignia or other advertising. Color schemes would also be modified to make them as similar as possible. Otherwise, however, their interfaces would remain the same. Such uniformity would help minimize the effect of a product’s brand on user behavior, e.g., a new product [...]

In addition, if we have a control group, the control group computers would run a program that mimicked the appearance and behavior of an antivirus program. It would provide a Windows tray icon and it periodically would report that its signatures were updated. In addition, it would check and report a variety of relatively innocuous, common problems such as tracking cookies. This program would do no proper scanning and it would provide no protection [...]

Indicators: A variety of measures would be required to monitor the users and computers involved in the study. Primary measures would classify the efficacy of the tested systems based on scans of off-site backups for examples of known malware. To maximize accuracy, such scans would use a large number of commercial scanners (including those
not part of the test). Further, supplementary software would record CPU, disk, and network usage. Periodically, a small subset of machines would be inspected manually by security experts to evaluate computer health and other characteristics. Finally, technical support records would give direct measures [...]

The primary goal of such measurements would be to evaluate the “health” of the subject machines. Of course, we cannot ever be completely sure that a seemingly healthy machine is not infected. We do not need to know “ground truth” in this situation, however; we just need to measure relative performance. Thus, simplistic measures should suffice for [...]

While there are a variety of logistical, technological, and financial challenges implicit in the above description, it should be clear that it would be possible to run this trial given the right resources. While we could speculate on what results we might find from such a study, the fact is that we don’t know what would be found. Indeed, that is the key point of clinical trials: they can reveal interactions and behaviors that are not observed in laboratories nor predicted by theoretical models.

There are many potential objections to the use of the clinical trial methodology in a computer security context. Here we address some of the ones that have arisen in our discussions.

One significant objection is that computer security is fundamentally different from medicine because the adversaries we face are not microorganisms but people: intelligent, motivated people. While many have debated the merits of the biological metaphor for computer security [9], we believe that debate is not relevant to the question of computer security clinical trials because the underlying methodology is applicable in any circumstance where one is performing experiments outside of a controlled lab setting. Randomization, selected populations, controls, blinding: these are just techniques for isolating one variable of interest from a complex background [...]

Of course, it is true that clinical trials are backwards looking; thus, it is always possible that new attacks could render previously effective defenses obsolete, something that happens much less frequently in medicine. However, virtually all modern security tools also adapt to new attacks via automated update mechanisms. Thus, clinical trials of security software will, implicitly, be testing the software and the organization behind it. In practice, then, we would really be comparing humans (attackers) versus humans (defenders), as mediated by a computational battlefield.

But even if we are talking about human institutions, as with many financial products, past performance is not indicative of future results. Given that we cannot predict the future of security technologies using any current technique (including formal models), however, past performance is all we have to go on when choosing security solutions. Clinical trials are merely a formal methodology for rigorously assessing that past performance.

Even if adopted, a clinical trial methodology will not be a panacea with respect to security. While the approach should demonstrate the real world effectiveness of products, it will not explain why differences exist. For example, consider two virus scanners. Our trial would perhaps show that one product provides statistically better protection than the other, but it would not (directly) provide any explanation for their differential performance. Is it the accuracy of virus detection? The speed or ease of update? While individual users may be able to say what they liked about the product they were given, such opinions only provide clues as to the cause. As such, the results produced by the trial may be both unexpected and, [...]

Because of these limitations, clinical trials should be seen as a complement to, not a replacement of, lab testing of security technologies. We also believe better methodologies are needed for lab evaluations. Our purpose here, though, is to point out that lab testing cannot be expected to address all of the issues that arise in deploying security solutions. Clinical trials provide a rigorous way to determine to what extent solutions developed in the lab [...]

To be sure, clinical trials are an expensive and complicated way to evaluate systems. Aren’t there feasible alternatives? We have already discussed the limitations of lab experiments; however, there is an alternative. Rather than deal with the overhead of blinding, controls, screening populations, and the like, why not just observe real users with the defenses they already have?

Such experiments are known as observational trials. They are used frequently in medicine, particularly when researchers are searching for effects that show up over long periods of time (e.g., decades). Unfortunately, observational trials are very limited in their ability to establish causal relationships. Thus, virtually any interesting correlation found in an observational trial is later subject [...]

While the cost of a security clinical trial can be mitigated through appropriate automation, a clinical trial will always be at least an order of magnitude more expensive than a simple lab comparison because of labor costs, particularly for technical support, subject recruitment, and ongoing observation. For example, assume that a trial required a 10:1 ratio of subjects to study personnel. Then, to run a trial with 1000 subjects we would need 100 study employees. If they are paid $100,000 on average, this would cost $10 million per year.

We believe this estimate is a worst case scenario: effective security clinical trials should be feasible for a tenth of this cost ($1,000,000/year) or less. But even this pessimistic estimate is potentially feasible: computer security is a multi-billion dollar market, and $10 million/year is well within the funding capabilities of governments or NGOs (non-profits). Further, this cost is justified by the importance of the problem. Organizations are now being required by regulation to implement security solutions. Such implementations can be very expensive. To date, we have no way of determining whether those solutions provide concrete benefits in practice.

If clinical trials are shown to work for computer security, it is likely they will become mandated by regulation, much as they have been for medicine. Such regulations would mean that changes in security practice would first need to be experimentally evaluated (for their security benefit in practice) before being adopted. We think such a change would be to the benefit of the computer security industry. Before medical practice was regulated, there was a vigorous but relatively small trade in patent medicines, unregulated preparations that claimed to cure people’s ills. Despite being pioneers in marketing and advertising, patent medicines were widely maligned and mistrusted, largely because in general they didn’t actually work [10]. In contrast, modern medicine is an extremely large, lucrative, and well-respected enterprise. If our community can, as a group, recommend solutions for which we have scientific evidence of their efficacy, perhaps computer security will also see a transformation in [...]

In order for the field of computer security to progress, we need better ways to measure the relative benefits of different techniques and tools as they are used in practice. To this end, we have proposed applying the proven techniques used in medical clinical trials to security. Given the importance of information assurance in the modern world and the increasing regulatory requirements for operational security, we believe the cost and complexity of clinical trials are justified. While the ultimate value of security clinical trials will only be known in retrospect, we are optimistic that clinical trials will help the development and deployment of effective security technologies.

The authors wish to thank Tim Furlong for first thinking of the computer security clinical trial in a lab brainstorming meeting in the summer of 2006. AS, YL, and HI acknowledge support from Canada’s NSERC, through the Discovery Grants program and the Internetworked Systems Security Network (ISSNet), and MITACS.

[1] AMTSO. Anti-Malware Testing Standards Organization [...]

[2] DEFCON. The Race to Zero Contest. http://www.racetozero.net/, August 8–10, 2008.

[3] FOSSI, M., Ed. Symantec Global Internet Security Threat Report, Volume XIV. Symantec, 2009.

[...] DEMETS, D. L. Fundamentals of Clinical Trials, [...]

[6] LARKIN, E. Storm Worm’s virulence may change tactics. Network World (August 2, 2007).

[7] OBERHEIDE, J., COOKE, E., AND JAHANIAN, [...] work Cloud. In 17th USENIX Security Symposium [...]

[8] SCHNEIER, B. Snake oil. Crypto-Gram Newsletter (February 15, 1999). http://schneier.com.

[9] SOMAYAJI, A., LOCASTO, M., AND FEYEREISL, J. Panel: The Future of Biologically-Inspired Security: Is There Anything Left to Learn? In 2007 Workshop on New Security (2008), ACM.

[10] STYLES, J. Product Innovation in Early Modern London. Past & Present 168, 1 (2000), 124–169.
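As an illustration of two mechanical steps in the antivirus trial sketch, the following Python fragment shows one way the random assignment of the 1000 recruited subscribers to treatment arms might be done, together with the staffing-cost arithmetic from the cost discussion. It is a minimal sketch under stated assumptions, not part of the proposed design: the arm labels, the fixed seed, the equal arm sizes, and the function names `assign_arms` and `annual_staff_cost` are all hypothetical, and the control arm would be dropped if a control group turned out to be ethically infeasible.

```python
import random
from collections import Counter

# Hypothetical arm labels: three blinded antivirus products plus a placebo control.
ARMS = ["antivirus_A", "antivirus_B", "antivirus_C", "control"]


def assign_arms(subject_ids, arms=ARMS, seed=2009):
    """Randomly assign subjects to arms, keeping arm sizes as equal as possible."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the allocation can be audited
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    # Deal the shuffled subjects out round-robin, like dealing cards.
    return {sid: arms[i % len(arms)] for i, sid in enumerate(shuffled)}


def annual_staff_cost(subjects, subjects_per_employee, salary):
    """Yearly staffing cost, rounding the required head count up."""
    employees = -(-subjects // subjects_per_employee)  # ceiling division
    return employees * salary


if __name__ == "__main__":
    assignment = assign_arms(range(1000))
    print(Counter(assignment.values()))          # 250 subjects in each of the four arms
    print(annual_staff_cost(1000, 10, 100_000))  # prints 10000000
```

Shuffling once and dealing round-robin keeps the arms balanced, which a per-subject coin flip would not guarantee; a real trial would more likely use a documented allocation procedure agreed with the trial statisticians.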