Detection of Infectious Outbreaks in Hospitals through Incremental Clustering
Timothy Langford1, Christophe Giraud-Carrier1, and John Magee2
1 Department of Computer Science, University of Bristol, Bristol, UK
2 Department of Medicine, Microbiology and Public Health Laboratory, Cardiff, UK
Abstract. This paper highlights the shortcomings of current systems of nosocomial infection control and shows how techniques borrowed from statistics and Artificial Intelligence, in particular clustering, can be used effectively to enhance these systems beyond confirmation and into the more important realms of detection and prediction. A tool called HIC, examined in collaboration with the Cardiff Public Health Laboratory, is presented. Preliminary experiments with the system demonstrate promise. In particular, the system was able to uncover a previously undiscovered cross-infection incident.

1 Introduction
Nosocomial infections are estimated to affect 6-12% of hospitalised patients. These infections have significant effects on mortality, mean length of hospital stay and antibiotics usage, and cost the National Health Service in the UK many hundreds of thousands of pounds annually. Outbreaks, i.e., multiple nosocomial infections caused by the same bacterial strain (e.g., the infamous multi-resistant Staphylococcus aureus or MRSA), can result from many factors, including lapses in good practice, changes in procedure that introduce a new route of transmission, or the movement of patients between wards in the same hospital and between different hospitals.
Most detected outbreaks are limited to a single ward, a single medical/surgical team or a single treatment procedure. Detection and advice on treatment and control of infections (including nosocomial infections) are largely the responsibility of the local microbiology laboratory, run by the hospital or sometimes (in England and Wales) by the Public Health Laboratory Service. These laboratories receive specimens from the local hospitals and analyse them for bacteria. The bacteria are identified at the species level (there are about 4500 known species, of which about 500 cause human infection), and their susceptibility to a range of antibiotics (usually 3 to 12) appropriate to that species is assessed. These findings are then recorded and reported to the ward.
Outbreaks of nosocomial infection that are detected are typically first recognised by a laboratory worker, who, over a relatively short period, has handled several specimens yielding the same unusual species of bacterium, or a common species with an unusual pattern of antibiotic susceptibility. A possible cross-infection situation is recognised and records are searched for previous similar findings. This search often reveals that other closely similar or identical bacteria have been detected previously, and that all these infections are from either the same ward or same medical/surgical team. The cross-infection team then investigates the incident, and attempts to determine and rectify its cause. In a common alternative detection method, recent infections for a few species of particular interest are noted routinely on a board depicting the wards, and the board status is reviewed at each update, looking for anomalous concentrations of similar infections. Both methods are limited, labour-intensive and highly dependent on subjective interpretation.
Furthermore, modern hospitals may have more than 500 beds and laboratories may receive in excess of 100,000 specimens per annum. This means that clues to incidents are easily lost in the vast amount of data generated; no single member of the laboratory team sees all reports, and it is less likely that a single staff member will handle several specimens from an outbreak. In addition, the range of species that may be involved has increased markedly over the years. Outbreaks involving common species with “normal” susceptibility patterns are unlikely to be detected, and those outbreaks that are detected are often found late in their progress.
What is required is a systematic and ongoing search for structured anomalies in the report flow. Well-established statistical techniques as well as recent advances in Artificial Intelligence, especially in the area of Data Mining, open up new possibilities for this kind of analysis in large bodies of data. This paper describes a procedure and software system that support the automation of nosocomial infection detection through incremental clustering. The system, called HIC, has been trained and tested with data obtained from the Cardiff Public Health Laboratory (CPHL).
The paper is organised as follows. Section 2 outlines our case study with the Cardiff data. Section 3 shows how the raw data is transformed by HIC using clustering and sliding aggregation techniques to improve the design of detection and prediction mechanisms. Section 4 discusses the current implementation of outbreak detection based on thresholding the transformed data and reports on promising preliminary results obtained within the context of the Cardiff data. Finally, section 5 concludes the paper and highlights areas of future work.

2 The CPHL Data: A Characteristic Case Study
The CPHL currently holds about 7 years of historical diagnostic data, and new data is recorded daily. Each record or case corresponds to a particular sample collected on a given patient and is represented by a set of features as follows:
CASE_DATE : ORG_LOC : WARD : FIRM : SEX : AGE : HSP : LSN : <ANTIBIOGRAM>
where,
CASE_DATE: The date at which the specimen was submitted to the Laboratory
ORG_LOC: The anatomical origin of the case specimen (e.g., urine, blood, faeces)
HSP: The Hospital Patient Number, a unique identifier for each patient
LSN: The Laboratory Specimen Number, a unique specimen identifier
<ANTIBIOGRAM>: The sample’s antibiograms, including the species of organisms found and their antibiotic susceptibility patterns (see below)
The first 8 features define the context of the antibiograms. From the viewpoint of infection detection, the antibiograms can be regarded as the “core” data of each case since they provide information about bacteria found. Cases where no organism has been found constitute a good proportion of the recorded data, but these uninfected patients are not particularly relevant to detection of cross-infection.
Each antibiogram is represented as follows:
LSN: The corresponding case’s LSN (see above)
ORG_TYPE: The type of the organism (e.g., coliform, Escherichia coli, MRSA)
ORG_ASP: The antibiotic susceptibility pattern of the organism, i.e., a list of the organism’s resistance to a predefined set of 27 antibiotics, recorded as untested (no entry), sensitive (S), resistant (R) or intermediate (I)
The LSN feature serves only as a means to identify cases. The two other features carry the information that determines the type of organism. There are three main issues regarding the CPHL data that must be addressed by an automatic detection system. They are as follows.
• Sparseness. ORG_ASP has 4^27 possible values (four possible entries for each of the 27 antibiotics). Although a reasonable amount of data is collected (bacteria from around 100 infections are identified and recorded each day), the fact that there are so many possibilities within each strain of bacterium means that the data is rather sparse. Hence, any direct attempt at clustering bacterial strains based on a symbolic comparison of their ORG_TYPE and ORG_ASP will lead to a large number of small clusters. This problem is further compounded by the fact that not all of the possible antibiotics are tested on every sample. Normally, only 6 to 12 tests are performed and recorded. The choice of tests performed reflects the organism species, the site and severity of the infection. In some cases, during a known outbreak, additional antibiotics may be tested to extend the possibility of identifying an outbreak-associated organism from an extended antibiogram pattern.
• Test variation. Two organisms with the same ORG_TYPE, but with different ORG_ASP should theoretically be regarded as distinct. However, test variations can cause small differences in the ORG_ASP between repeat analyses of the same organism. Similarity between ORG_ASPs cannot require strict identity, but must account for “errors” in test reproducibility.
• Time Dependency. The sample data represents only one instance of the bacterium. The data contains no knowledge about how long the bacterium was present in the hospital environment before the data was obtained, or how long it remains in the hospital environment after it has been detected. Hence, the recorded data does not strictly provide an accurate representation of the actual frequencies of bacteria in the hospital environment. However, because each case has an associated date, the cases can be “trended” over time.
To accommodate the above characteristics and render the data more amenable to analysis, a number of transformations are applied, as detailed in the following section.

3 Data Transformation
This section deals with how the data can be utilised in a daily incremental way to allow a greater understanding of the distribution of infections in the hospital. The aim is to transform the data, using the concept of sliding aggregation, to make it more amenable to our selected data analysis methodology. Note that other methodologies may require different (or no) transformations.

3.1 Antibiograms Expansion
Antibiotic susceptibilities tend to follow rules. For example, a penicillin-sensitive Staphylococcus aureus will invariably also be sensitive to methicillin, and an ampicillin-sensitive E. coli is normally susceptible to co-amoxiclav and cephalosporin. Such rules can be used to fill in some of the missing susceptibility values based on recorded ones. In order to do so, the following meta-classes and corresponding classes of antibiotics were formed.
A PENICILLIN 0: PENICILLIN
A PENICILLIN 1: AMPICILLIN, PIPERACILLIN
A PENICILLIN 2: AUGMENTIN, (FLUCLOXACILLIN for Staph. aureus)
A PENICILLIN 3: IMIPENEM
A QUINOLONE 0: NALIDIXIC ACID
A QUINOLONE 1: CIPROFLOXACIN
A CEPHALOSPORIN 0: CEPHRADINE
A CEPHALOSPORIN 1: CEFUROXIME
A CEPHALOSPORIN 3: CEFTAZIDIME, CEFOTAXIME
AN AMINOGLYCOSIDE 0: NEOMYCIN
AN AMINOGLYCOSIDE 1: GENTAMICIN
AN AMINOGLYCOSIDE 2: TOBRAMYCIN
A GLYCOPEPTIDE 0: VANCOMYCIN
A GLYCOPEPTIDE 1: TEICOPLANIN
UNGROUPED: FUSIDIC ACID, TRIMETHOPRIM, NITROFURANTOIN, COLISTIN, CHLORAMPHENICOL, METRONIDAZOLE, RIFAMPICIN, MUPIROCIN
Within a meta-class (except for UNGROUPED), each class has an associated “level”, where the higher the level the more potent the corresponding antibiotics with respect to bacterial tolerance. For example, within the A PENICILLIN meta-class, PENICILLIN, which belongs to the class A PENICILLIN 0, is a “weaker” antibiotic than FLUCLOXACILLIN, which belongs to the class A PENICILLIN 2. With this classification, it is possible to design a simple set of rules expressing dependencies between antibiotic susceptibility and to use this to expand an ORG_ASP to explicitly reveal information that is implicit in the original patterns. For example, an original --SI--SRI. pattern may become R-SI--SRI. after expansion. A single (meta-)rule, instantiated as needed, is as follows.
IF A BACTERIUM IS RESISTANT TO AN ANTIBIOTIC IN GROUP <META-CLASS> I
THEN IT IS RESISTANT TO ALL ANTIBIOTICS IN GROUPS <META-CLASS> H FOR H ≤ I
Pattern expansion is attempted for all cases but only applied if the antibiotic in the original antibiogram is untested. If the antibiotic has been tested, the originally recorded result takes precedence even if it contradicts the expansion rule’s suggestion. By reducing sparseness, pattern expansion also facilitates clustering by improving the reliability of symbolic comparisons between patterns as discussed below.

3.2 Antibiograms Comparison
Clearly, the expansion rules discussed above limit the amount of symbolic mismatch between ORG_ASPs that should be treated as originating from the same organism. However, even then it is unrealistic to rely on exact symbolic matching. What is needed is a clustering mechanism based on some reasonable measure of similarity.
The similarity measure implemented in our system combines a rather simple similarity coefficient with a user-defined threshold as follows. In order to be comparable, two antibiograms must have the same ORG_TYPE. If they do, the entries of their ORG_ASP patterns are compared in a pairwise fashion. Comparisons involving an I result or an empty entry in either pattern are ignored. The number of exact matches (i.e., S-S or R-R) and the number of mismatches (i.e., S-R or R-S) are used to compute the following similarity coefficient.
SimCoef = matches / (matches + mismatches)
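A minimal sketch of this comparison in Python follows. It assumes that patterns are stored as equal-length strings over S, R, I and '-' (untested), that the coefficient is the fraction of matches among the comparable positions, and that a coefficient equal to the threshold counts as similar. The function names are hypothetical, and the source does not say what happens when no position is comparable, so the sketch returns None in that case.

```python
def sim_coef(asp_a, asp_b):
    """Similarity coefficient of two equal-length susceptibility patterns.

    Positions where either pattern holds 'I' or '-' are ignored.
    Returns None when no position is comparable (an assumption).
    """
    matches = mismatches = 0
    for a, b in zip(asp_a, asp_b):
        if a not in ("S", "R") or b not in ("S", "R"):
            continue  # intermediate or untested: not comparable
        if a == b:
            matches += 1
        else:
            mismatches += 1
    total = matches + mismatches
    return matches / total if total else None


def same_organism(type_a, asp_a, type_b, asp_b, threshold=0.6):
    """True if both antibiograms share an ORG_TYPE and are similar enough."""
    if type_a != type_b:
        return False
    coef = sim_coef(asp_a, asp_b)
    return coef is not None and coef >= threshold
```

With two made-up patterns of length 15, `sim_coef("SSSRRRSRS-ISR-I", "SSSRRRRSRSR-I-I")` counts 6 matches and 3 mismatches, so the coefficient is 6/9 and the pair passes a threshold of 0.6.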
For example, two ECOLI antibiograms of length 15 whose comparison gives rise to 6 matches and 3 mismatches have a similarity coefficient of 6/(6+3) = 0.67. A user-defined threshold can then be used to determine whether two organisms may be considered the same. For example, if the threshold is 0.6, the two ECOLIs above are deemed to be indistinguishable. Armed with SimCoef (and an associated threshold), it is possible to cluster antibiograms. This is accomplished in our system using a sliding aggregator as detailed in the following sections.

3.3 Sliding Aggregation
Sliding aggregation is a method that clusters data at two points in its operation in an incremental manner. Essentially, the mechanism is based on a period of time t, with a sliding window of n components (the fixed window size) in which the ordered data is placed. There must exist data for at least n time periods before the aggregator can be used. At the start, the leftmost compartment of the window contains data from the time period t_1 and the rightmost compartment of the window contains data for the time period t_n. The window then “slides along” the data, so that when data from a subsequent time period t_(n+i) (i≥1) is added to the window, the oldest element (i.e., t_i) is removed from the window and all other elements occupy a new position directly to the left of their last position.
The operation of the sliding window can be used to implement a filter across any ordered set of data, although its performance is highly dependent on t, n, the size of the set of ordered elements and the ability to efficiently order the elements. With respect to the diagnostic data, it makes sense to set the time-period t to 1 day, hence each component of the window is the case data for the day it represents and incremental daily updates to the sliding window are possible. As the elements here are naturally ordered by time, the next critical choice is the value of the fixed window size n. The application domain suggests that n can be set to any value representative of the estimated time that organisms remain present in the hospital environment after detection. Currently, this is set to 35 days (i.e., t=1, n=35). Note that the use of the sliding window helps to compensate for the fact that the data represents investigated instances and does not account for the length of time that the bacteria remain in the hospital environment after detection (see section 2).
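The window mechanics can be sketched in Python with a bounded double-ended queue. The class and method names are hypothetical, and the content of each day's component is left opaque; this is an illustration of the sliding behaviour, not the HIC implementation.

```python
from collections import deque


class SlidingAggregator:
    """Fixed-size window over daily data (t = 1 day, n components)."""

    def __init__(self, n=35):
        self.n = n
        # A deque with maxlen discards its oldest (leftmost) element
        # automatically when a new one is appended on the right.
        self.window = deque(maxlen=n)

    def add_day(self, day_data):
        """Slide the window by one day."""
        self.window.append(day_data)

    def is_full(self):
        """The aggregator is usable once n days of data are present."""
        return len(self.window) == self.n
```

Appending a fourth day to a window of size 3 drops the first day and shifts the remaining days one position to the left, which is the behaviour described above.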
By clustering the case data within each component of the window with respect to their antibiograms, one may compare equivalent infections on a daily basis (section 3.4.1). Furthermore, one can also cluster daily clustered data within a window (section 3.4.2). By taking these results for each window on the data as a whole (i.e., as the window slides or is updated with new data each day), the data can be transformed into a set of sequences for each bacterial type that is the aggregate of the data as the window slides along. The transformed data is more representative of the bacteria causing infections, more amenable to subsequent analysis, and can be readily presented for intuitive visual inspection by laboratory staff. Also, further transformations can be made readily. For example, one may consider an exponentially weighted modifier giving new data in the window a greater effect than older data.

3.4 Incremental Clustering
This section shows how antibiograms are clustered as data from the laboratory is added to the sliding aggregation system on a daily basis. For horizontal clustering, it is assumed that the aggregator window is full (i.e., at least n=35 days’ worth of data are available).

3.4.1 Vertical Clustering. Raw daily data is clustered using the above procedure and a threshold provided by the laboratory staff. Antibiograms of bacteria of the same organism type are compared to one another in turn until all have been compared. Each antibiogram has two frequency values associated with it: the “actual” frequency of the antibiogram and its “clustered” frequency. The actual frequency of an antibiogram is the number of times it occurs identically (i.e., same ORG_TYPE and same ORG_ASP) in a day. The clustered frequency of an antibiogram is based on SimCoef (see section 3.2) and its associated threshold. Every time two antibiograms are considered sufficiently similar, their respective actual frequencies are added to each other’s clustered frequencies.
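A minimal sketch, not taken from HIC itself, of how the two frequencies might be computed for one day's data. It assumes that each antibiogram is an (ORG_TYPE, ORG_ASP) pair, that the clustered frequency starts from the antibiogram's own actual frequency, and that the similarity test is supplied by the caller; the function name is hypothetical.

```python
from collections import Counter


def vertical_cluster(day_cases, similar):
    """Compute (actual, clustered) frequencies for one day's antibiograms.

    day_cases: list of (org_type, org_asp) tuples recorded that day.
    similar:   function (a, b) -> bool implementing the similarity test.
    Returns a dict {antibiogram: (actual, clustered)}.
    """
    actual = Counter(day_cases)   # identical occurrences per antibiogram
    clustered = dict(actual)      # assumed to start at the actual frequency
    distinct = list(actual)
    for i, a in enumerate(distinct):
        for b in distinct[i + 1:]:
            if similar(a, b):
                # Each one's actual frequency is added to the other's
                # clustered frequency.
                clustered[a] += actual[b]
                clustered[b] += actual[a]
    return {ab: (actual[ab], clustered[ab]) for ab in distinct}
```

Because the distinct antibiograms are kept as separate keys, the original ORG_ASP patterns remain available to later processing stages.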
As vertical clustering is applied to all bacteria in the day’s data, no information relating to the antibiograms’ actual ORG_ASPs is lost, which is important to the subsequent horizontal clustering of the incremental process.

3.4.2 Horizontal Clustering. Clustered daily data make up the contents of the day’s component in the sliding aggregator. Once the sliding aggregator has been updated, a further clustering operation is performed on the data. Here, the system tries to create the “highest clustered frequency” path through the sliding aggregator possible for each clustered antibiogram in the sequence. The output here is a sequence of both actual and clustered frequencies for each unique antibiogram within the window. This is achieved by beginning with the first cell and iteratively clustering each instance of the same organism/antibiogram type within it against all those in the next cell, then picking the first highest frequency antibiogram that satisfies the similarity measure. As progress through the window continues, new, previously unique antibiograms may be uncovered; for these new antibiograms a backward search through the aggregator window is performed before proceeding, to ensure that each unique antibiogram in the window has its complete highest clustered frequency pattern enumerated.
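One possible reading of this procedure is sketched below; it is a simplified interpretation rather than the HIC algorithm. It assumes that each window cell is a dict mapping an antibiogram to its (actual, clustered) pair, that the path is followed by comparing the most recently matched antibiogram against the next cell, and that days without a similar antibiogram contribute (0, 0). All names are hypothetical.

```python
def best_match(current, cell, similar):
    """First antibiogram in `cell` with the highest clustered frequency
    among those similar to `current`, or None if there is none."""
    best = None
    for ab, (_actual, clustered) in cell.items():
        if similar(current, ab) and (best is None or clustered > cell[best][1]):
            best = ab
    return best


def frequency_path(start_ab, start_idx, window, similar):
    """Trace one antibiogram through the window.

    window: list of per-day dicts {antibiogram: (actual, clustered)}.
    The search runs forwards and then backwards from the cell in which
    `start_ab` was found. Returns one (actual, clustered) pair per day.
    """
    path = [(0, 0)] * len(window)
    path[start_idx] = window[start_idx][start_ab]
    for step in (1, -1):  # forward pass, then backward search
        current, idx = start_ab, start_idx + step
        while 0 <= idx < len(window):
            match = best_match(current, window[idx], similar)
            if match is not None:
                path[idx] = window[idx][match]
                current = match
            idx += step
    return path
```

Ties are resolved in favour of the first candidate encountered, which relies on Python dictionaries preserving insertion order.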
At this point, the focus of the data has shifted from that of the original case data as a whole to that of the antibiogram. The system now has the opportunity to analyse the data with respect to particular organism/antibiogram type. Probability distributions over the other features in the cases can be calculated to provide a context to these infection levels. Also the summation of frequencies within the sliding aggregator can be obtained and used to construct higher-level sequences with respect to particular bacteria (filters, e.g., exponential decay, could also be applied to these).

4 Data Analysis – Detection
The data transformation methods of section 3 have been implemented in a software system, called HIC. This section now details the detection mechanism used by HIC to uncover potential outbreaks and outlines a method to perform prediction.
Once a sequence of clustered frequencies has been obtained for each organism/antibiogram type, HIC can attempt to discover anomalous bunching of similar infections. The system could carry out this stage of the process in a variety of ways, either supervised or unsupervised. Currently, HIC relies on a rather simple thresholding method.
First, the stored “total” frequency for the organism/antibiogram is obtained. The expected frequency is then calculated from the date range over which the system data was collated and the size of the sliding aggregator window. This value can then be compared against the clustered value obtained from the sliding aggregator frequency total for the organism. The user defines a threshold, and if the sliding aggregator total is greater than the threshold times the expected frequency for the given organism, then this is noted as an anomaly and presented to the user for inspection. The following is a simple example.
Threshold = 1.8 (80%)
!! Possible Anomaly Detected !!
Sequence Frequency - 13
System Frequency -
0:0:0:0:0:0:0:0:0:0:0:0:0:1:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:1:0:0:0:0:0, sum = 2
0:0:0:0:0:0:0:0:0:0:0:0:0:1:0:1:0:0:0:0:0:0:1:0:0:2:1:1:0:3:0:1:1:0:1, sum = 13
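The thresholding test can be sketched as follows. The source does not give the exact formula for the expected frequency; the sketch assumes it is the stored total frequency scaled by the ratio of the window size to the number of days over which the data was collated. The function names and the figures used in the note below (73 stored cases over 365 days) are hypothetical.

```python
def expected_frequency(total_frequency, days_collated, window_size=35):
    """Frequency expected within one window if the stored total were
    spread evenly over the collation period (an assumed formula)."""
    return total_frequency * window_size / days_collated


def is_anomalous(window_sequence, total_frequency, days_collated, threshold=1.8):
    """Flag a sequence whose window total exceeds threshold x expected."""
    window_total = sum(window_sequence)
    expected = expected_frequency(
        total_frequency, days_collated, window_size=len(window_sequence)
    )
    return window_total > threshold * expected


def parse_sequence(text):
    """Parse a colon-separated frequency sequence such as '0:1:0:2'."""
    return [int(value) for value in text.split(":")]
```

Under these assumptions, an organism with 73 stored cases over 365 days has an expected frequency of 7 per 35-day window, so a window total of 13 exceeds 1.8 x 7 = 12.6 and is flagged, whereas a window total of 2 is not.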
This is a fairly intuitive way of determining whether or not a specific organism has exhibited anomalous bunching. Yet, it has proved successful in detecting known outbreaks in the historical laboratory data. It is also suitable for unsupervised detection of anomalies in a data stream, with a default threshold that may be amenable to automatic adjustment by machine-learning methods in later versions of the system. It is also readily adapted to interactive supervised training of the system.
Details of the implementation of the HIC system are in . The data provided by the Cardiff laboratory was used to train and test HIC. The experiment, which focused on detection, proved successful in highlighting a possible outbreak that remained undetected by the laboratory.
There is an ongoing infection problem with Klebsiella at the University Hospital of Wales. Briefly, there have been sporadic clusters of colonisation with a few cases of infection from 1995 to 1999. The strains involved were mostly identified to the species Klebsiella aerogenes and showed resistance to multiple antibiotics. The data downloaded as input for development of the cross-infection detection program included one of these clusters. This was not actually called as an outbreak, because small numbers of patients were involved, and the organisms were identified as multi-resistant Klebsiella oxytoca, rather than Klebsiella aerogenes. However, in retrospect, these organisms had closely similar antibiograms and biochemical patterns, and probably represented a cluster of nosocomial colonisation/infection. This cluster was strikingly obvious in the teaching set output from the detection program. There has not been sufficient time for fuller inspection of the output outside the Klebsiella group, where research interests engendered greater familiarity with the ongoing infection status. However, this result was promising, particularly as the cluster had not been recognised as such in the laboratory at the time the data was downloaded.
In another experiment extending HIC, automation was taken further as several methods of detecting outbreaks using thresholding techniques were tested. The results have yet to be validated by the microbiologists.

5 Conclusion
This paper describes a system of automatic nosocomial infection detection and prediction, based on AI methods. The system uses clustering and sliding aggregation techniques to build sequences of data containing the actual and clustered frequencies centred on the organisms involved. This methodology is intended to transform the raw data into a form that takes into account the sparseness of the data, its instance-based nature when gathered and the fact that values contained within it are subtly interchangeable. It maintains implicit information on the evolution of bacterial strains and their changing resistance to antibiotics over time. It also makes the data more amenable to subsequent analysis by, for example, Case-Based Reasoning, where it is possible to store higher-level aggregated sequences and feature-bound probability distributions of organisms that have previously led to outbreaks and to use them for the prediction of similar future outbreaks. Preliminary experiments with data from the Cardiff Public Health Laboratory on detection demonstrate promise, as in at least one situation, the system was able to detect a possible outbreak that had remained undetected by the laboratory.
The system’s functionality could be extended as follows. Rather than using fixed, user-defined thresholds, the system must be capable of learning and adapting its thresholds for reporting anomalies on the basis of operator feedback based on “level of interest”, “confirmed outbreak”, “possible seasonal variation” and “antibiotic test variability”. Other essential feedback items may be added in the light of development work. Changes in detection parameters should be checked automatically to see if these would have affected detection of previously confirmed outbreaks, or would have revealed previously undetected anomalies. Where an outbreak has occurred, but not been detected by the system, the operator should be able to specify the specimen numbers, species and susceptibility pattern involved. The system should then test modified parameters, optimising detection of the new and previously detected outbreaks, and reviewing the database. Previous parameters must be available for re-institution if the optimisation process fails, or if the new optimum is rejected by the user. It may be necessary to operate the system with several distinct threshold levels or interest profiles, particularly if other potential uses, such as detection of factors influencing frequencies of specific infections, are pursued.

Acknowledgements
The first author was supported in part by the MRC Health Services Research Collaboration and a British Council/JISTEC scholarship to IBM Tokyo Research Laboratory.

References
1. Althoff, K., Auriol, E., Barletta, R. and Manago, M.: A Review of Industrial Case-Based
2. Bouchoux, X.: HIC Dataset Report. Unpublished Manuscript, Advanced Topics in Machine Learning, University of Bristol, Department of Computer Science (1999)
3. Bull, M., Kundt, G. and Gierl, L.: Discovering of Health Risks and Case-based Forecasting of Epidemics in a Health Surveillance System. In Proceedings of the First European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD’97). LNAI, Vol. 1263, Springer (1997), 68-77
4. Kolodner, J.: Case-Based Reasoning. Morgan Kaufmann (1993)
5. Langford, T.: A Prototype System for Hospital Infection Control. Technical Report PR4-99-09, University of Bristol, Department of Computer Science (1999)
6. Leake, D. (Ed.): Case-Based Reasoning: Experiences, Lessons, and Future Directions. AAAI