Data from multiple sources were pooled for use in this study including the following: (1) the June 2003 Agricultural census of livestock premises combined with the department for Environment, Food and Rural Affairs (DEFRA) list of livestock premises; (2) the Cattle Tracing System (CTS), (3) Land Capability for Agriculture in Scotland; (4) E. coli non-O157 prevalence data; (5) E. coli O157 prevalence data, (6) E. coli O157 typing and PFGE data. Details on most of the data sources have been published previously however, each data set will be described briefly below.
(1) June 2003 Agricultural census data . Among the original 50,266 farms in Scotland recorded in the census, 22,286 farms are provided with the numbers of animals but only 13,704 farms have cattle. Our system comprises these 13,704 cattle farms. The data include the Council-Parish-Holding number (CPH), the X-Y coordinates of the farm-house, the area of the farm, and the numbers of cattle, sheep and pigs. The number of cattle on each farm is assumed constant at the number recorded in the census.
(2) Cattle Tracing System (CTS). The CTS is operated by DEFRA’s British Cattle Movement Service [20, 30]. In Scotland, during years 2002-2004 there were 252,496 movements among 11,464 of the cattle farms entered in the 2003 census database (the remainder are assumed not to have moved cattle to or received cattle from other farms). Movements outside Scotland and to/from abattoirs and markets are not considered here.
(3) Land Capability for Agriculture in Scotland (LCA). The LCA classification was developed by the Macaulay Institute to describe the agricultural potential of land based on the degree of limitation imposed by its biophysical properties. It is based primarily on climate, a number of soil properties, (for example depth and stoniness), wetness, erosion risk and slope. Also included are the overall pattern, i.e. variability, and, in one of the classes (Class 6), vegetation cover is also taken into account. The LCA is a seven class system where class 1 represents land that has the highest potential flexibility of use whereas class 7 land is of very limited agricultural use .
(4) E. coli O157 prevalence data. Between March 1998 and February 2004 two cross sectional surveys were conducted in Scotland. Data used in this study came from 447 farms sampled in both surveys (Figure 2). The first survey (Survey 1) was conducted by the Scottish Agricultural College through funding from the Scottish Executive Environment and Rural Affairs Department (SEERAD) from March 1998 to May 2000. The second survey (Survey 2) was carried out between February 2002 and February 2004, funded by the Wellcome Trust International Partnership Research Award in Veterinary Epidemiology (IPRAVE). The field sampling methodologies for both surveys have been described in previous literature [7, 12], however, a brief outline is given below. Further, farmers were asked to complete a farm management questionnaire from which much of the data about the farms were gathered. Prior to sampling, written consent was obtained from the farmers for participation in the study.
Both surveys preferentially sampled cattle groups composed only of store (i.e. weaned cattle before finishing for slaughter) or finishing cattle closest to sale or slaughter. If such groups did not exist, one or more mixed groups with store or finishing cattle closest to sale or slaughter were sampled. From each group fresh faecal pats were sampled. The number of pats tested in each group was determined from the number of cattle in the group using a prescribed sampling schedule. For Survey 1, sufficient numbers of faecal pats were tested to ensure prospectively an 80% chance of sampling at least one positive pat if there was a shedding prevalence of at least 2% within the group . Based on results from Survey 1, in Survey 2, it was assumed that, on average, 8% of the animals in positive groups would be shedding, with shedding distributed as seen in Survey 1 . For each group in Survey 2, sufficient fresh pat samples were taken to ensure prospectively a mean 90% probability of detecting shedding of E. coli O157 if at least one shedding animal was indeed present. Changes in sampling strategy between the two surveys had a negligible effect on the power to identify positive farms . Instead of randomly sampling farms within each Animal Health District (AHD), Survey 2 used a stratified sampling plan derived from the Survey 1 cohort to select farms to sample . Farms were selected randomly then the farms with closest Euclidean distances were sampled on the same or concurrent days, leading to clusters of 3 independent farms. The sampling of the same farms between the studies and the comparable methodology between the studies allow the use of these separate surveys as two cross-sectional time points for our analysis.
(5) E. coli non O157 prevalence data . Collection of bovine isolates for the detection of E. coli O26, O103, O111 and O145 done in parallel with Survey 2 (2002-2004). Fecal samples were taken from 338 farms to test for non-O157 E. coli strains as described in .
(6) E. coli O157 typing data. Within 48 hours of sampling, one gram of faeces from each pat sample was tested for the presence of E. coli O157 through immuno-magnetic separation (IMS) and culturing as described in detail elsewhere . Following IMS, one E. coli O157 isolate from each faecal sample was submitted to the Scottish E. coli O157/VTEC Reference Laboratory (SERL) for phage typing, and testing for the presence of genes encoding the virulence factors shigatoxin 1 (stx 1), shigatoxin 2 (stx 2) and intimin (eae) using multiplex PCR. PFGE analysis was conducted on E. coli O157 isolates from both surveys as described previously  as well as randomly selected isolates selected from human clinical samples from the same time frame. Briefly, isolates were digested with 50U of XbaI restriction enzyme, then subjected to PFGE using CHEF DRII apparatus (Bio-Rad laboratories, UK). Further analysis and categorization of PFGE results were conducted using Bionumerics 4.1 (Applied Maths, Belgium).
Agreement between surveys
The presence and absence of agreement between farm E. coli O157 status on farms in Survey 1 and Survey 2 was compared using McNemar test. McNemar’s test assesses the significance of the difference between two correlated proportions. The analysis was performed using StatXact version 8 (Cytel Software Corp, Cambridge, MA, USA). The null hypothesis is that the proportion of farms with the characteristic (or event) is the same for Survey 1 and Survey 2.
Risk factor analysis
Data from all of the sources listed above was compiled and used as the basis of the risk factor analysis. Risk factors for the presence of E. coli O157 on a farm were analysed using logistic regression analysis (Proc Logistic, SAS Institute Inc., Cary, NC). Logistic regression analyses were carried out on a single variate basis initially. All the potential risk factors (Additional file 1: Table S1, n = 49) were examined. All variables with a p value of <0.2 were retained for multiple variate analysis. Region and season were forced into the model as design factors. Seasons were defined as winter, comprising December, January, and February; spring, comprising March, April and May; Summer, comprising June, July and August; and Autumn, comprising September, October and September. Six regions, based on Veterinary Animal Health Districts (AHDs) were defined: 1 = Islands; 2 = Highland; 3 = North East; 4 = Central; 5 = South East; 6 = South West. A hierarchical forward selection and backward elimination approach with swapping (reassessment of previously included or excluded variables) were used. The change in the deviance of the model was monitored as an indicator of improved fit. Variables were added and removed based on significant improvement in the mean deviance after changes to the model. Two-way interactions were also tested in this manner. The inclusion of a random effects term for cluster did not improve the fit of the model significantly.
Model fit was assessed by fitting ROC curves to the final models and generating area under the curve (AUC) statistics for the models. The AUC can be considered to be a measure of the discriminatory power of the model . A theoretically perfect model would have an AUC = 1 while a model with no discriminatory power would have AUC = 0.5. Thus using this scale, the AUC statistics of the models and hence the probability of the model being able to discern between a positive and negative E. coli O157 farm can be assessed and compared. For the final model, the Hosmer-Lemeshow goodness-of-fit statistic was computed .
To check for multicollinearity between factors in the final model, correlations were examined for binary and nominal variables. In addition, the stability of the model was checked by systematic removal of variables. Diagnostics were performed and plots of residuals were examined, confirming goodness of fit of the model. Odds ratios and their associated 95% CI were estimated in the final model for factors statistically significantly associated with the presence of E. coli O157.
To compare the results of the statistical analysis in this study to those generated earlier by Zhang et al.  an overall odds ratio estimate that was developed for the earlier study was calculated. Values for the empirical estimate of the odds ratio were derived using the parameter estimates from the logistic regression model. These estimates were used in the statistical model to simulate binary response random variables for each farm (absence/presence of E. coli O157 on-farm). The predicted presences/absences were then related to the observed presences/absences, and aggregated over all farms, from which summaries an odds ratio was calculated by
where F++ is the number of farms that are positive for both model prediction and observed data, and the meanings of F−−, F+- and F-+ follow accordingly . An odds ratio greater than 1 indicates that the model is more likely than not to predict the correct infection status of a farm. The larger the odds ratio, the stronger the predictive power of the model. This process was repeated to produce a distribution of odds-ratios which could then be summarised.
Local spread and persistence
Survey 2 sampling clusters in which >1 farms were positive were investigated to determine whether any identical E. coli O157 strains could be found on farms within the vicinity of each other (average distance = 5.96km). BioNumerics 4.1 software (Applied Maths, Belgium) was used to analyse the PFGE profiles. Dendrograms were generated using unweighted pair group method with arithmetic mean (UPGMA) with settings of 1.00% optimisation and 1.3% position tolerance and provided a visual representation of the relationships among isolates. Each unique PFGE profile was allocated a profile identifying code (1-139). For the purposes of our local spread analysis, isolates had to be indistinguishable (ie 100% similar) for classification as locally spreading. Analysis determined whether sampling clusters were more likely to have the same PT / PFGE profile than random. To do this the clustered data were compared to 10,000 bootstrap samples of pairs of positive farms sampled at random.
Farms that were E. coli O157 positive in both Survey 1 and Survey 2 (n = 27, Table 1) were analysed using the same dendrogram protocol. This analysis was conducted to identify any strains that were present in both Survey 1 and Survey 2. The criteria for being defined as the same strain was 100% similarity using the above dendrogram protocol. This is a conservative estimate and likely a lower estimate as it does not allow for common strain development.
Ecological diversity of PFGE types
Diversity of PFGE types in both surveys was examined using multiple diversity measures, related to Renyi’s measures of generalized entrophy [35, 36] similar to the analysis done in Mather et al. . The exponential of Renyi’s entrophy measure gives an estimate of the effective number of species D
[38, 39], with its single parameter q determining the extent to which rare PFGE types (in this instance) contribute towards overall diversity (Equation 2). The following diversity indicies were calculated for both surveys: specie richness (SR), D0; Shannon Entrophy (SE), log(D1); Simpson diversity (SD), 1/D2; and Berger-Parker (BP), 1/D∞
While it is trivial to say that values of D0 would inform on the total number of different PFGE types present in the survey data, values of D∞ would carry out some information on the number of PFGE types that seems to dominate the profiles observed.
To measure the diversity of each sample (Survey 1 and Survey 2), true abundances of the sample were derived directly from the count data (pj = nj/Σnj), and the different diversities calculated from the abundance proportions. However, because sample diversity measures depend heavily on sample size , direct comparison was conducted by repeatedly subsampling the larger (Survey 2) sample to the size of the smaller (Survey 1) sample with replacement. The algorithm was applied 1000 times to generate 1000 different sub-dataset of similar size as Survey 1 and enabling the creation of confidence intervals around Dq and, therefore, every every measure examined. Analysis was conducted in R version 3.0.2 (R Development Core Team, 2013) .