Unsupervised clustering of wildlife necropsy data for syndromic surveillance

Background The importance of wildlife disease surveillance is increasing, because wild animals are playing a growing role as sources of emerging infectious disease events in humans. Syndromic surveillance methods have been developed as a complement to traditional health data analyses, to allow the early detection of unusual health events. Early detection of these events in wildlife could help to protect the health of domestic animals or humans. This paper aims to define syndromes that could be used for the syndromic surveillance of wildlife health data. Wildlife disease monitoring in France, from 1986 onward, has allowed numerous diagnostic data to be collected from wild animals found dead. The authors wanted to identify distinct pathological profiles from these historical data by a global analysis of the registered necropsy descriptions, and discuss how these profiles can be used to define syndromes. In view of the multiplicity and heterogeneity of the available information, the authors suggest constructing syndromic classes by a multivariate statistical analysis and classification procedure grouping cases that share similar pathological characteristics.
Results A three-step procedure was applied: first, a multiple correspondence analysis was performed on necropsy data to reduce them to their principal components. Then hierarchical ascendant clustering was used to partition the data. Finally the k-means algorithm was applied to strengthen the partitioning. Nine clusters were identified: three were species- and disease-specific, three were suggestive of specific pathological conditions but not species-specific, two covered a broader pathological condition and one was miscellaneous. The clusters reflected the most distinct and most frequent disease entities on which the surveillance network focused. They could be used to define distinct syndromes characterised by specific post-mortem findings.
Conclusions The chosen statistical clustering method was found to be a useful tool to retrospectively group cases from our database into distinct and meaningful pathological entities. Syndrome definition from post-mortem findings is potentially useful for early outbreak detection because it uses the earliest available information on disease in wildlife. Furthermore, the proposed typology allows each case to be attributed to a syndrome, thus enabling the exhaustive surveillance of health events through time series analyses.


Background
The importance of monitoring wildlife health is increasingly recognised [1,2], because free-ranging wild animals are victims, reservoirs or indicators of an increasing number of disease agents shared with humans and/or domestic animals [3][4][5][6][7].
General wildlife disease surveillance is a means of maintaining vigilance against emerging wildlife-related diseases [8,9], but it produces data that are frequently biased [10]. These data are further characterised by the diversity of monitored parameters: species, pathogens, diagnoses, environmental characteristics, etc. The analysis of data from this type of surveillance is usually limited to retrospective descriptive assessments. Passively acquired wildlife accessions may however also give insight into the occurrence of disease processes, whose significance may only become apparent over time [8]. Therefore, there is a need to monitor wildlife diseases prospectively, using an approach that takes into account the great diversity of the parameters.
Syndromic surveillance "applies to surveillance using health-related data that precede diagnosis and signal a sufficient probability of a case or an outbreak to warrant further public health response" (Centers for Disease Control and Prevention, http://www.cdc.gov/ncphi/disss/nndss/syndromic.htm [11]). It has been developed in recent years in human health surveillance systems as a means of timely detection of disease outbreaks using robust pre-diagnostic data, which are registered automatically [12,13]. For efficient syndromic surveillance, it is necessary to group cases that share the same health indicators, in order to enhance the efficiency of event detection [14]. Health problems for which syndromic surveillance is used are either classified by bodily system [9,12,15,16] or focus on specific diseases, such as "influenza-like-illness" [17,18]. Syndrome definitions (groups of health indicators linked to these classifications) are based either on expert knowledge or on statistical classifications [12,13,19,20,21].
Macroscopic post-mortem findings are the primary data collected from cases of general wildlife disease surveillance. These descriptions form robust and reliable information, provided examinations are performed by experienced staff [22]. They are also the only information available for diseases of unknown aetiology [8]. Diagnoses of causes of death are generally not available soon enough to assist early detection, because they depend on laboratory analyses that are costly, time consuming or unavailable [23,24]. Descriptions of macroscopic lesions can be used for the syndromic classification of cases [15]. Syndromic groups can be monitored over space and time for trend analysis and rapid detection of unusual health events, and can enhance the usual data analysis and its usefulness [8].
A general wildlife mortality monitoring network in France [25] has been compiling health data for over 20 years, including descriptions of necropsy findings. We chose to adapt the principles and methods of syndromic surveillance to these wildlife surveillance data. Syndrome definition is the scope of this paper, and our aim was to retrospectively identify and characterise distinct pathological profiles from these data which could be used to structure the whole dataset and thus take every case into consideration. Clustering methods have been widely used in medical and biological disciplines to analyse and filter complex databases [26][27][28][29][30]. They make it possible to synthesise data complexity and define clusters without using any a priori knowledge of the biological reasons for the existence of groups [27]. Furthermore, this statistical grouping took into account health conditions that could potentially affect several bodily systems and have various causes. Such conditions are common in wildlife [10]. In addition, we did not stratify the data analysis by species, so that disease processes potentially affecting several species (e.g. intoxications) could be recognised.
Below, we describe and discuss the application of a three-step statistical analysis and classification procedure for wildlife necropsy data, and the biological significance and value of the clusters obtained for syndrome definition.

Material
Wildlife disease surveillance in France has yielded over 53,000 records since 1986, through a nationwide network called SAGIR, managed by the French Hunting and Wildlife Agency (Office national de la chasse et de la faune sauvage, ONCFS), with input from national and departmental hunting federations (Fédération nationale des chasseurs, FNC, and Fédérations départementales des chasseurs, FDC) [22,31,32]. Cadavers of free-ranging wild terrestrial mammals and birds are reported to the network by hunters and the public. The people in charge of surveillance at departmental level select carcasses according to their state of preservation and relevance and bring them to the departmental veterinary diagnostic laboratory for post-mortem examination and, in some cases, for further biological analyses. Up to now, 252 different species and 228 causes of death have been diagnosed by 97 labs and registered in the national database.
From the data collected up to 31.12.2007, 23,228 cases had a registered description of macroscopic post-mortem findings (each case represented a wild animal cadaver reported to the network and submitted for laboratory examination). For the cluster construction process, we selected a subset of 8,709 cases, analysed between 1.1.1986 and 31.12.1997, for which a complete description of post-mortem findings was available. For some of the remaining 14,519 cases, the registered post-mortem findings were incomplete because, since 1998, lesions typical of certain causes of death have no longer been registered; these descriptions were later completed by data imputation and the cases were then classified (see Discussion section).
Macroscopic lesions were described for each case, according to the affected organs (Topography) and their morphological characteristics (Morphology) indicating the changes observed in the organs. In addition, a Cause of death was registered for each case (including some for which no diagnosis was reached, labelled DNR). Pathogenic agents (bacteria, parasites, fungi, viruses, toxins), which were not necessarily related to the cause of death, were described for 75% of the cases. Species were recorded with their common name. In the original database, the terms used for Cause of death, Morphology, Topography, and Pathogenic agent were numerous and heterogeneous, so experts (see acknowledgements) and other sources of reference (College of Pathological Anatomy of Marseille, http://medidacte.timone.univ-mrs.fr/webcours/umvf/anapath/corpus.htm; Systematic Nomenclature of Medicine SNOMED CT, http://www.ihtsdo.org; Canadian Cooperative Wildlife Health Centre, http://wildlife1.usask.ca) were consulted to group them into broad categories. For the Cause of death, Pathogenic agent and Species classifications, all broad categories whose frequency was below 100 were combined into a single category named 'Other'. The terms for Topography and Morphology were pooled according to their meaning, to obtain sufficient group sizes for statistical analysis (each term had to represent more than 3% of the total number of cases). Topography (15 modalities with two expressions each, either "not affected" or "affected") and Morphology (14 modalities with two expressions each, either "yes" or "no") were used to partition the data (active variables); their distributions are described in Tables 1 and 2. Cause of death (19 modalities), Pathogenic agent (18 modalities) and Species (9 modalities) were used for cluster interpretation (illustrative variables). The distributions of these variables are described in Tables 3, 4 and 5.
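The two coding rules described above (pooling rare categories into 'Other' and coding each Topography or Morphology modality as a two-level indicator) can be illustrated with a minimal Python sketch. The function names, the example species and the example organ labels are hypothetical and are not taken from the SAGIR database; only the threshold of 100 comes from the text.

```python
from collections import Counter


def lump_rare(values, min_count=100, other_label="Other"):
    """Replace every category seen fewer than min_count times by other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


def indicator_columns(cases, modalities):
    """Code each case (a set of observed modalities) as one 0/1 value per modality.

    1 stands for "affected" / "yes", 0 for "not affected" / "no".
    """
    return [[1 if m in case else 0 for m in modalities] for case in cases]


# Hypothetical illustrative data, not real surveillance records.
species = ["hare"] * 120 + ["lynx"] * 3 + ["otter"] * 2
lumped = lump_rare(species)
table = indicator_columns([{"lung"}, {"liver", "lung"}], ["liver", "lung"])
```

In this toy example the two rare species end up in 'Other', and each case becomes a row of indicators of the kind that an MCA can take as active variables.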

Method
Topography and Morphology (active variables) were used to perform a three-step clustering in order to identify groups. First, the data from each case were pre-processed by a multiple correspondence analysis (MCA) and reduced to their principal components. Then hierarchical ascendant clustering (HAC) was performed to determine a consistent partition. Finally, the k-means algorithm was applied to consolidate this partition. The Cause of death, Pathogenic agent and Species variables were used to interpret the groups obtained (illustrative variables). Calculations were performed with R software (R Development Core Team (2009). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org). The packages, functions and references used for each process are indicated below.

Multiple correspondence analysis
MCA is a descriptive analysis of multidimensional qualitative data [33,34]. It allows the analysis of a matrix of I individuals depicted by J qualitative variables. Projections of these individuals in a J-dimensional space are used to calculate factorial axes, the first one retaining the maximum variance, and the following axes retaining the residual variance and being perpendicular to each other. MCA allows continuous quantitative coordinates to be attributed to individuals, and the most significant factorial axes to be selected, in order to reduce the number of dimensions of the initial space [35]. The variables' contributions to the axes are examined to visualise what they represent and to check for outliers. The number of axes to be retained is chosen, with respect to their meaning, so that the cumulated percentage of explained variance, calculated with the Benzécri method [36], is greater than 95%. MCA was performed with the R package "FactoMineR" [37].
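The axis-selection rule can be made concrete with a short sketch. It assumes the usual form of the Benzécri correction, in which only eigenvalues greater than 1/J are kept and each is rescaled as (J/(J-1))² (λ - 1/J)², where J is the number of active variables; the paper does not spell the formula out, so this is an illustration under that assumption, and the eigenvalues used below are invented.

```python
def benzecri_rates(eigenvalues, n_variables):
    """Share of adjusted inertia per axis under the Benzécri correction.

    Only eigenvalues above 1 / n_variables are retained (assumed standard form).
    """
    threshold = 1.0 / n_variables
    scale = (n_variables / (n_variables - 1.0)) ** 2
    adjusted = [scale * (ev - threshold) ** 2
                for ev in eigenvalues if ev > threshold]
    total = sum(adjusted)
    return [a / total for a in adjusted]


def axes_needed(rates, target=0.95):
    """Smallest number of leading axes whose cumulated rate exceeds target."""
    cumulative = 0.0
    for k, rate in enumerate(rates, start=1):
        cumulative += rate
        if cumulative > target:
            return k
    return len(rates)


# Hypothetical eigenvalues for an analysis with 4 active variables.
rates = benzecri_rates([0.5, 0.3, 0.1], n_variables=4)
n_axes = axes_needed(rates)
```

The correction concentrates the explained variance on the leading axes, which is why a handful of axes can exceed the 95% threshold even when the raw MCA percentages look modest.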

Classification
HAC allows individuals to be grouped according to their coordinates, by calculating pair-wise distances between cases and aggregating the closest ones. We used the Euclidean distance [26], and the Ward criterion was used for aggregation, because it maximises inter-cluster variance while minimising intra-cluster variance [27,38]. Intra-cluster inertia is measured by the sum of the squares of the Euclidean distances between cluster cases and the cluster centroid. The closer the cases are grouped around the cluster centroid, the lower the intra-cluster inertia. The number of clusters to consider was determined classically by inspecting the bar chart of global intra-cluster inertia as a function of the number of clusters (Figure 1). The optimal number corresponds to the bar whose height difference from the preceding bar (i.e. to its left) is large compared with its height difference from the following bars (i.e. to its right), indicating that a smaller number of clusters implies a sharp increase in intra-cluster variance. This choice was further supported by analysing the biological significance of the clusters at different levels of the clustering tree [27]. HAC was performed with the "agnes" function of the R package "cluster" [39]. As the HAC partition is not optimal due to the constraint of hierarchical grouping, the cases were then partitioned into the defined clusters by the k-means method [26], using the cluster centroids calculated by the HAC as seeds [38,40]. The k-means algorithm attributes cases to their nearest centroid. The cluster centroids are adjusted and calculations reiterated until no further significant improvement in intra-cluster inertia is achieved. K-means was performed with the "kmeans" function of the R package "stats" [41]. The quality of the clustering result is highest when clusters are compact around their centroid and well separated from each other. This clustering quality is evaluated by a criterion defined as R² = 1 - (sum of intra-cluster inertias / total inertia of the data set) [35]. The closer R² is to one, the better the clustering.
Figure 1 Bar chart of the sum of intra-cluster inertias for different numbers of clusters. The red line indicates the point of the changing slope.
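The consolidation step and the quality criterion can be sketched in a few lines of plain Python. This is an illustration only, not the R code used for the study: the seeds stand in for the centroids that would come from cutting the HAC tree, the toy points are invented, and the loop stops when assignments no longer change, which is a simplification of the stopping rule described above.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans_from_seeds(points, seeds, max_iter=10):
    """k-means started from given seed centroids (e.g. HAC cluster centroids)."""
    centroids = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: stop early
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = centroid(members)
    return labels, centroids


def r_squared(points, labels, centroids):
    """R² = 1 - (sum of intra-cluster inertias / total inertia)."""
    grand = centroid(points)
    total = sum(sq_dist(p, grand) for p in points)
    within = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return 1.0 - within / total


# Hypothetical two-dimensional coordinates and seeds.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centres = kmeans_from_seeds(points, seeds=[[0.0, 0.0], [10.0, 10.0]])
r2 = r_squared(points, labels, centres)
```

Seeding k-means with the hierarchical centroids keeps the number and rough position of the clusters chosen from the tree, while allowing individual cases to move to a closer centroid.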

Cluster interpretation
The classification assigns each individual, i.e. an MCA-derived representation of a case, to a cluster.
In order to understand the meaning of these groups, one has to know which features characterise them. Cluster interpretation was based on both kinds of variables, the active and illustrative ones. The proportions of the modalities in each cluster and in the whole dataset were compared (V-test) [38]. We used a curve of ordered absolute V-test values for each cluster, and the point of the changing slope separated the more meaningful modalities from the other ones (Figure 2). Modalities with V-test values above the slope change were retained for cluster description [33]. Positive V-test values represent positive associations, negative values represent negative associations. Visualisation by projection of the modalities of the variables onto the factorial planes was also helpful for interpretation.
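As an illustration of how such a comparison of proportions can be computed, the sketch below uses the common form of the test value for a categorical modality: the observed count of the modality in the cluster is compared with its expectation under random sampling without replacement and scaled by the hypergeometric standard deviation. This form is an assumption, since the exact computation used for the study is not given here, and the counts are invented.

```python
import math


def v_test(n_total, n_cluster, n_modality, n_both):
    """Test value of a modality in a cluster (normal approximation).

    n_total    : number of cases in the whole dataset
    n_cluster  : number of cases in the cluster
    n_modality : number of cases showing the modality in the whole dataset
    n_both     : number of cases in the cluster showing the modality
    """
    p = n_modality / n_total
    expected = n_cluster * p
    variance = n_cluster * (n_total - n_cluster) / (n_total - 1) * p * (1 - p)
    return (n_both - expected) / math.sqrt(variance)


# Hypothetical counts: 100 cases, a cluster of 20, a modality seen in 30 cases.
over_represented = v_test(100, 20, 30, 15)   # modality frequent in the cluster
as_expected = v_test(100, 20, 30, 6)         # exactly the expected count
under_represented = v_test(100, 20, 30, 0)   # modality absent from the cluster
```

A large positive value flags a modality that is over-represented in the cluster, a large negative value one that is under-represented, which matches the sign convention given above.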

Results
We used the 14 modalities of Morphology and 15 modalities of Topography from our dataset of 8,709 cases (i.e. recorded mortality cases from 1986 to 1997 with a description of post-mortem findings) to build a statistical classification (active variables). The first five axes of the MCA accounted for more than 96% of the total variance of the 29-dimensional space. Details are given in Table 6. Variables contributing to axis definition differed from one axis to the other, and no rare modalities had a determining influence.
HAC was then performed on the case-coordinates derived from the first five axes of the MCA. Analysis of intra-cluster inertia levels of the clustering tree (Figure 1), and the examination of the biological significance of clusters at different thresholds, indicated that nine was the optimal number of clusters. With a higher number of clusters, cases which were very similar from a biological point of view would have been separated, while fewer clusters would have merged cases exhibiting rather different lesional features.
Partition strengthening by the k-means method (10 iterations) was used to attribute the cases to these nine clusters. The calculated R² value was 0.62.
The modalities of variables (active and illustrative) best describing the different clusters are presented in Table 7. To analyse to what extent a cluster could be considered as a syndrome in terms of pathological findings, our interpretation was based on these descriptions, and on pathological descriptions and differential diagnoses found in the literature for the more frequent Causes of death of the clusters.
Cluster 1 comprised 12.1% of the total number of cases. It was characterised by haemorrhagic lesions, associated with evidence of anticoagulant compounds. Haemorrhage is also present for example in trauma cases or in the European brown hare syndrome (EBHS), but in Cluster 1 all other types of lesions were absent. For anticoagulant poisoning, evidence of massive bleeding is noted at necropsy and the lack of coagulation is highly suggestive of exposure [42]. According to Berny [43], large herbivores are usually less susceptible than predators, which was highlighted here by a negative association of this cluster with roe deer (Capreolus capreolus).
Animals in Cluster 2 (12.8% of cases) presented diarrhoea and lesions of the gut, sometimes with parasitism, namely coccidiosis, but some had been poisoned by toxic agents. These agents were mostly cholinesterase inhibitors, which is consistent with symptoms of diarrhoea [42].
Cluster 3 was the largest cluster (18.7% of cases). It grouped cases characterised by the absence of lesions typical of Clusters 7 and 9, and was associated with rarer causes of death, such as those grouped under "Other", respiratory infections of wood pigeons (Columba palumbus) or roe deer shooting accidents. It is therefore difficult to propose a straightforward biological explanation for this cluster.
Cluster 4 (5.9% of cases) differed from Cluster 2 in the locations and types of parasitism. Inflammatory, necrotic or parasitic lesions of the stomach, lung and heart associated with the presence of Strongylida (in 76% of cases) or Trichurida were found in this cluster, mainly observed in roe deer. Debilitating conditions, such as a heavy parasite burden, especially in the stomach, can cause mucosal abrasions that promote the action of toxigenic bacteria, leading to enterotoxaemia or septicaemia [44].
Cluster 5 (8.6% of cases) identified inflammatory bacterial diseases of thoracic organs, in particular pasteurellosis. The health conditions in this cluster affected 19.8% of the wild boar (Sus scrofa) in the analysed dataset. Typical findings included pleuro-pneumonia, purulent bronchitis, fibrinous pleurisy and pericarditis [44,45].
Cluster 6 (10.2% of cases) dealt with traumatic lesions, especially in roe deer.
Cluster 7 (9.3% of cases) was defined by an altered texture and haemorrhagic and congestive lesions of the trachea, liver and lungs and was linked to Viral hemorrhagic disease (VHD) and EBHS as causes of death, and to rabbits (Oryctolagus cuniculus) (25% of cases in this cluster) and hares (Lepus europaeus) (64%). Caliciviruses that cause EBHS and VHD are closely related and both induce similar pathological changes [46].
Cluster 8 (15.0% of cases) was characterised by hypertrophy and purulent lesions of the spleen and liver. In this cluster they appeared linked to hares and to yersiniosis and Yersinia pseudotuberculosis, and to a lesser degree to tularemia. Acute yersiniosis is characterised by an enlarged spleen and necrotic hypertrophied mesenteric lymph nodes; the chronic form causes multiple small nodular caseous lesions of the spleen, liver, and possibly kidneys, lungs and cecum [44,47,48]. Similar lesions of the spleen and liver can be found in tularemia and yersiniosis, which might explain why these two diseases were grouped together.
Cluster 9 (7.2% of cases) and Cluster 7 had rather similar characteristics. Hypertrophy of the spleen and lesions of the kidney and lymph nodes were present in Cluster 9 but not in Cluster 7. Cluster 9 was also linked to EBHS and hares, as well as to tularemia and Eucoccidiorida. Liver coccidiosis, tularemia and haemorrhagic septicaemia (due to Pasteurella sp.) are differential diagnoses to EBHS [49]. As hares and EBHS were associated with both clusters, they possibly reflect two different stages of the same disease (acute or protracted) in this host [50].

Discussion
This paper describes the use of a three-step clustering method to group cases of wild animals found dead in France over a period of twelve years (1986 to 1997) according to their post-mortem findings, for syndrome definition.
The SAGIR network continuously collects data from investigations of causes of mortality in free-ranging animals in France. However, there is some variability in the intensity of surveillance both spatially and among species, which influences the representativeness of the database. The network provides a more accurate picture of health events for game species than for non-game animals [31]. Furthermore, the network's activity is uneven from one département to another. Nevertheless, these differences have been relatively stable over time, so the quantity and quality of data appeared suitable for trend analysis and detection of unusual health events [51].
Despite the fact that laboratory staff involved in the network has been regularly trained in post-mortem examination of wildlife cadavers, differences in the precision of descriptions contributed to the complexity of the database. Nevertheless, these descriptions were assumed to be more reliable than diagnostic conclusions, because the process of arriving at a cause of death did not follow a standardised procedure.
Methods of classifying qualitative variables are dependent on the number of occurrences for each modality, and small counts make a minor contribution to the variance of the factorial axes [38]. The number of terms used for coding the variables was reduced by preliminary work, and we tried to minimise the risk of misinterpretation by relying on the skill of experts and other sources of reference. For statistical reasons some categories had to be combined further (e.g. "genital organs" alone were mentioned only 193 times, so they were combined with "urinary organs"). For some other categories, the descriptions were more or less detailed (e.g. "respiratory organs" instead of "lung" or "trachea"). We decided not to group these categories together, in order to keep as much precision as possible. These choices may have influenced the outcomes of the classification. However, results were consistent, as "respiratory organs" together with "lung" and "trachea" were determining for Cluster 7, "lung" alone was determining for Clusters 4 and 5, and "trachea" alone for Cluster 9.
Variables were split into active and illustrative ones to avoid redundancy and limit insignificant noise, produced for example by information that was not necessarily linked to the case's cause of death. Noise reduction was also the reason for retaining only the coordinates on the first five axes of the MCA. These axes were used regardless of their rank, because each represented very different biological information that retained the most differentiating characteristics of the dataset.
The statistical classification procedure used here showed its ability to handle large datasets and identify pathologically relevant characteristics. However, it should be noted that the cluster description does not address the full range of lesions found on an animal. It merely indicates features that are characteristic and allow clusters to be distinguished. As a result, the cases which were infrequent or poorly defined were gathered in a cluster (Cluster 3) that is difficult to qualify as an entity. Diseases that remained rare or those that induced only unspecific lesions, such as congestion of different organs, could not be highlighted by our approach.
Apart from the miscellaneous Cluster 3, the clusters obtained in this study were of three different types: those which were species- and disease-specific (Clusters 7, 8 and 9), those suggestive of specific conditions but not species-specific (Clusters 1, 5 and 6), and the others, covering a broad pathological condition (Clusters 2 and 4). It might be interesting to group Clusters 7 and 9 for further epidemiological analysis as they seem to present two different views of the same disease.
The characteristics of the clusters derived from our analysis are consistent with features found in previous epidemiological studies on wildlife diseases in this country [42,[52][53][54][55]. The clusters reflect the most distinct and most frequent disease entities on which the surveillance network focused. The importance of investigations into VHD and EBHS for example, which were emerging diseases in the early 1990s [50,56], was decisive in defining two clusters.
The statistical classification of cases collected by the French SAGIR network could lead to the adoption by the surveillance community of eight distinct syndromes: 1) a haemorrhagic syndrome, interesting because it allows accidental wildlife intoxications to be monitored [42] and could potentially also detect anthrax cases [16]; 2) an enteritic/diarrheic syndrome, which could reflect environmental constraints, such as changes in food supply [57] or density related parasite burdens [58,59]; 3) a multifactorial (parasites and toxigenic bacteria) syndrome, more specific to the difficult living conditions of wild ruminants [55]; 4) a respiratory syndrome, which is a disease complex that takes a regular toll on wildlife [44]; 5) a trauma-related syndrome, representing one of the foremost causes of death in our database, but less interesting from an epidemiological point of view; 6) a syndrome of acute hepatitis-like diseases, which reflects the importance of EBHS and VHD, especially during the study period, and could be useful for other emerging forms of hepatitis; 7) a syndrome of subacute or chronic diseases of the liver, kidney and spleen, caused mostly by endemic bacteria, which could be useful for the monitoring of tularemia and salmonella outbreaks, potentially threatening public health [60,61]; 8) a miscellaneous syndrome; despite being difficult to interpret, this syndrome is worth considering, because an unknown disease would probably first increase this group before being recognised as a distinct entity.
Future cases can be attributed to the defined syndromes by determining their MCA-derived representation and the cluster they belong to [40]. We used this procedure on the remaining 14,519 cases collected between 1998 and 2007. Missing information was completed statistically by multivariate imputation. MCA with the above determined eigenvalues was used to calculate the coordinates of these additional cases in the five-dimensional space. These coordinates were used to determine the cluster to which each case belonged (smallest Euclidean distance to cluster centroid). Clustering quality of the whole dataset (R² = 0.605) was not substantially different from that of the initial dataset (R² = 0.62) (unpublished work).
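The attribution of a new case can be sketched as follows. The projection uses the standard MCA transition relation for a supplementary individual (the mean of the principal coordinates of the categories the case displays, divided by the square root of the axis eigenvalue); this is an assumption about how the coordinates were obtained, and the category coordinates, eigenvalues and centroids below are invented for illustration.

```python
import math


def project_case(categories, category_coords, eigenvalues):
    """Coordinates of a supplementary case on the retained MCA axes.

    categories      : one category label per active variable for this case
    category_coords : dict mapping each label to its principal coordinates
    eigenvalues     : eigenvalue of each retained axis
    """
    n_variables = len(categories)
    coords = []
    for axis, eigenvalue in enumerate(eigenvalues):
        mean_coord = sum(category_coords[c][axis] for c in categories) / n_variables
        coords.append(mean_coord / math.sqrt(eigenvalue))
    return coords


def assign_to_cluster(coords, centroids):
    """1-based number of the cluster whose centroid is nearest (Euclidean)."""
    distances = [sum((x - c) ** 2 for x, c in zip(coords, centre))
                 for centre in centroids]
    return distances.index(min(distances)) + 1


# Hypothetical one-axis example with two active variables.
category_coords = {"lung=affected": [1.0], "liver=not affected": [-0.5]}
new_case = project_case(["lung=affected", "liver=not affected"],
                        category_coords, eigenvalues=[0.25])
cluster = assign_to_cluster(new_case, centroids=[[0.0], [0.6]])
```

Because the axes and centroids are fixed by the initial analysis, new cases can be classified one at a time without re-running the clustering, which is what makes prospective time series of syndrome counts possible.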
As new diseases with distinct pathological profiles emerge in free-ranging wild animals over time, the syndrome definition might evolve. The statistical classification could be revised in the future, and historical data could be integrated in the classification process, thus allowing the analysis of continuous time series.
For the epidemiological study of the syndromic time series, we will develop models and anomaly detection algorithms on the number of cases of each syndrome per time unit from the historical database [62].

Conclusions
The results presented above suggest that macroscopic necropsy findings are valuable for identifying distinct pathological profiles among wild animal carcasses collected by a general surveillance network. The construction of our typology was based on an unsupervised statistical approach; it allowed an impartial reduction of all the information present in our complex dataset and then a robust classification. This approach identified meaningful clusters, reflecting the most frequent disease groups in the database and their distinctive characteristics. To our knowledge this is the first time that this method has been used to construct clusters from animal necropsy data.
Cluster characteristics led to the definition of eight syndromes that could classify all the investigated cases and potentially reflect all disease events including new diseases. Moreover, some of these syndromes referred to pathological entities that go beyond species and specific diseases, and could reflect environmental stresses on wildlife. Others could be used for the surveillance of zoonoses. Cluster and subsequently syndrome definition were however dependent on the focus of the surveillance network which provided the data we used.
Syndromic classification of cases based on their pathological profile has practical value because it does not need a precise diagnosis and therefore provides a rapid, reliable and rather inexpensive means of analysing wildlife health data. This approach could improve the usefulness and cost-effectiveness of existing wildlife mortality monitoring systems.