The populations under surveillance were dairy cattle in two Spanish regions: R1 (Principado de Asturias) and R2 (Catalonia). The region R1 included a large number of farms with small herd sizes, mostly in extensive husbandry systems. This region R1 had green landscapes and was characterized by coastal oceanic climate with moderate to high precipitations and moderate temperatures. In contrast, the region R2 had fewer farms but with larger herd sizes, most of which were in intensive production. Region R2 was characterized by a complex and diverse orography and a combination of a continental, coastal Mediterranean and alpine climates. These regions were monitored between 1st of Jan-2006 and 31th of Jun-2015.
During this period in both regions, and according to the European Union regulations [47], fallen animals were removed from farms using specialist disposal services. According to the Royal Decree 728/2007 (Annex III) [48], all the cattle farmers were obliged to use national fallen stock companies to collect and dispose their fallen stock and communicate the cattle casualties to the official animal health services.
Types and sources of data
Two types of data from dairy cattle populations were used: fallen stock and demographic data. Fallen bovines data were provided by two national insurance companies: Entidad de Seguros Agrarios (ENESA) and Agrupación Española de Entidades Aseguradoras de los Seguros Agrarios Combinados S.A. (AGROSEGURO). For every visit where carcass disposal occurred, the following information was recorded: the identification code of the farm, date of pick-up, number of carcasses, and number of kilograms collected. On the other hand, demographic data yearly updated, were provided by the Sub-Directorate General for Animal Health and Hygiene and Traceability of the Ministry of Agriculture, Fisheries and Food of Spain (MAPA). This data set included: the identification code of the farm, type of production, number of dairy cattle per holding, and administrative aggregations (i.e. region, province, and county). Both sets of data, fallen bovine stock and bovine population, were merged in a final dataset using the identification code of the farm as a primary key.
According to the legal basis on the protection of personal data [49], the non-public staff and scientists involved in the analyses of these data signed a confidentiality agreement to set out the terms and conditions to limit the collection, handling, and disclosure of non-public information from farmers.
Statistical analysis
We built a routine that consisted of a sequence of analytical methods designed to:
-
(1)
describe and select longitudinal fallen stock data aggregated at different levels (i.e., region, province, and county);
-
(2)
fit, for each subpopulation, appropriate time series models;
-
(3)
forecast the number of fallen bovine stock at n- ahead period; and
-
(4)
detect and register aberrations in mortality counts over time, considering the conventional upper confidence limit of 95%.
Hierarchical time series structures
Firstly, for exploring and comparing baseline patterns from different sub-populations, fallen stock data aggregated at each of the three administrative levels were described and represented using HTS [13, 14]. HTS provide a rapid means to visualize, aggregate, and disaggregate series from different subpopulations. The HTS were designed by combining the information on the fallen dairy cattle at the most disaggregated level of the time series (in our case, counties) and the hierarchical organization. This dictates how the information, at the county level, had to be aggregated into provinces and regions. An explanatory illustration has been included in the supplemental material (S1 Fig) to explain the methodology used to build the HTS structures. The ‘base’ and ‘hts’ packages in the software R were used to support their creation [14].
ARIMA modelling including trend and seasonal terms
The patterns and forecasting of weekly fallen bovines at different administrative levels were analyzed using ARIMA models that included trend and seasonal components as covariates. It is important to remark that ARIMA models are preferable for regular time series of normally distributed observations. Therefore, once the data were disaggregated into regions, provinces, and counties through HTS, each series that showed counts greater than or equal to 10 in average and normally distributed [50] were considered for ARIMA modelling. Since we were dealing with counts, most of the time series were Poisson distributed [12]. According to [51], for those series with an average higher than 10 counts, the Poisson distribution is well approximated by the Normal distribution; thus our selected series could be properly modelled through a classical parametric approach such ARIMA. These series also coincided with those administrative levels in which the dairy populations were larger within these two regions of study.
The ARIMA model, an extension of the ARMA (Autoregressive and Moving Average) model, has been widely used in the classical time series analysis and has also been applied in many contexts related to veterinary and public health [7, 50, 52, 53]. The autoregressive part of an ARIMA model (AR) indicates that the temporal response variable (i.e., weekly fallen bovine counts) is regressed by its lags. While the moving average part (MA) indicates that the model error is regressed by its lags (i.e., at time t, the lags of a random variable Zt are essentially temporal delays such as Zt − 1, Zt − 2 …). The latter means that the error does not behave like white noise. These ARIMA models can also be used even if the series are non-stationary (i.e., when they present positive, negative, or quadratic trends, and/or annual and biannual seasonality, among other). In these cases, some initial differentiation can be applied one (d = 1) or more times (d > 1) to make the series stationary. Conceptually, when series are differentiating, the count at time t-1 is subtracting from the count at time t. The general ARIMA model is defined by the parameters p (autoregressive part), d (differentiation), and q (moving average part).
Let the random variable Xt be an ARMA (p, q) model such that:
$$ {\mathrm{X}}_{\mathrm{t}}=\upalpha +{\uprho}_1{\mathrm{X}}_{\mathrm{t}-1}+{\uprho}_2{\mathrm{X}}_{\mathrm{t}-2}+\dots +{\uprho}_{\mathrm{p}}{\mathrm{X}}_{\mathrm{t}-\mathrm{p}}+{\mathrm{Z}}_{\mathrm{t}}+{\uptheta}_1{\mathrm{Z}}_{\mathrm{t}-1}+{\uptheta}_2{\mathrm{Z}}_{\mathrm{t}-2}+\dots +{\uptheta}_{\mathrm{q}}{\mathrm{Z}}_{\mathrm{t}-\mathrm{q}} $$
(1)
where Xt is a stationary series at time t, α is the intercept of the model, ρ1, ρ2, …, ρp are the coefficients of the autoregressive term, θ1, θ2, …, θq are the coefficients of the moving average term, and Zt, Zt − 1, …, Zt − q are the error terms of the model which are normally distributed. In case of lack of stationarity, the series Xt can be differentiated (d > 0) before estimating the parameters in Eq.1 to avoid possible trend and seasonal effects, leading to an ARIMA (p, d, q) model. Additionally, trends and seasonal influences can be included in the ARIMA (p, d, q) model as covariates. In particular, we accounted for trend and seasonality effects in the ARIMA (p, d, q) model by using the following equation:
$$ {\mathrm{Y}}_{\mathrm{t}}={\upgamma}_0+{\upgamma}_1\mathrm{t}+{\upgamma}_2\sin \left(\frac{2\uppi \mathrm{t}}{52}\right)+{\upgamma}_3\cos \left(\frac{2\uppi \mathrm{t}}{52}\right)+{\upgamma}_4\sin \left(\frac{2\uppi \mathrm{t}}{26}\right)+{\upgamma}_5\cos \left(\frac{2\uppi \mathrm{t}}{26}\right)+{\mathrm{X}}_{\mathrm{t}}, $$
(2)
where Yt is the observed series at time t (which can be non-stationary), Xt is the stationary ARIMA (p, d, q) model expressed in Eq. 1., γ1is the trend coefficient, γ2 and γ3 are the coefficients of the annual seasonality aggregated at a weekly level, and γ4 and γ5 the coefficients of the biannual seasonality aggregated at a weekly level. Here the trigonometric part corresponds to the first and second-order Fourier terms commonly used in the analysis of time series [54].
These models were used to monitor the number of weekly fallen bovines for each subpopulation at every selected administrative level and identify abnormal peaks of mortality by comparing the predictions to the observations.
The number of weekly fallen bovines recorded between Jan-2006 and Jun-2015 at the regional, provincial, and county levels were divided into training and testing data sets. The data collected between Jan-2006 and Dec-2013 were used for estimating the model parameters for each population, while the data collected between Jan-2014 and Jun-2015 were used for testing the models and identifying unusual increases of mortality.
To estimate the most appropriate values of parameters p, d and q for the ARIMA models (Eq. 1), and also the linear trend and seasonal coefficients (Eq. 2) we built a programming routine. This routine explored a range of values between 0 and 5 for parameters p and q. For instance, if p = 5, the observed count at week t was regressed by the counts at weeks t-1, t-2, t-3, t-4 and t-5; while if q = 5, the error at week t was regressed by the errors at weeks t-1, t-2, t-3, t-4 and t-5. Here, the parameter d was constrained to a binary indicator, taking 0 (not differentiating) or 1 (differentiating once and removing the linear trend). Note that in those series in which the trend was not linear, but quadratic, the series was differentiated, and a coefficient of the linear trend (Eq. 2) was also included in the candidate model. The inclusion of this coefficient was because the trend was not linear, and hence it was not entirely removed by differentiating only once. In some scenarios, this problem could have been addressed by considering values of d greater than 1.
In order to select the most suitable ARIMA model for each time series among all candidates, three criteria were considered. The first one was based on the Bayesian Information Criterion (BIC) proposed by Schwarz [55]. The model with the smallest BIC was preferred over the whole set of candidates. The BIC was used because it typically provides more parsimonious models (in terms of the number of parameters) than those proposed by the Akaike Information Criterion (AIC). In addition, the BIC is also more robust than the AIC when coping with heterogeneity in samples [56]. The second and third criteria consisted of assessing the statistical significance of the model parameters given a significance level (i.e., 5%), and checking the behavior of the model residuals through the Auto-Correlation Function (ACF) and the Partial Auto-Correlation Function (PACF), respectively. The best ARIMA model was the one for which the third criterion was not only completely satisfied but also demonstrated good results for both the first and the second criteria. Full details can be found in [20, 21, 54].
Subsequently, each selected model was used to predict the expected number of carcasses that would be collected weekly between Jan-2014 and Jun-2015 in each population from every selected administrative aggregation. Those observed counts of fallen dairy cattle that exceeded the 95% confidence limits predicted by the selected ARIMA models were identified as mortality peaks. Reasonably, any relevant increase in mortality could sign an unusual event in the population, and thus require further investigation at field level to determine the specific causes. In case a presumed mortality peak would be detected, this routine provided a list with the identification codes of all involved farms. Additionally, the list also provided information on the number of carcasses recorded during the peak and over the preceding two weeks in those farms.
The programming routine was built using the base package of R [28].