Data analysis workflow for the detection of canine vector-borne pathogens using 16 S rRNA Next-Generation Sequencing

Background Vector-borne diseases (VBDs) impact both human and veterinary medicine and pose special public health challenges. The main bacterial vector-borne pathogens (VBPs) of importance in veterinary medicine include Anaplasma spp., Bartonella spp., Ehrlichia spp., and Spotted Fever Group Rickettsia. Taxon-targeted PCR assays are the current gold standard for VBP diagnostics but limitations on the detection of genetically diverse organisms support a novel approach for broader detection of VBPs. We present a methodology for genetic characterization of VBPs using Next-Generation Sequencing (NGS) and computational approaches. A major advantage of NGS is the ability to detect multiple organisms present in the same clinical sample in an unsupervised (i.e. non-targeted) and semi-quantitative way. The Standard Operating Procedure (SOP) presented here combines industry-standard microbiome analysis tools with our ad-hoc bioinformatic scripts to form a complete analysis pipeline accessible to veterinary scientists and freely available for download and use at https://github.com/eltonjrv/microbiome.westernu/tree/SOP. Results We tested and validated our SOP by mimicking single, double, and triple infections in genomic canine DNA using serial dilutions of plasmids containing the entire 16 S rRNA gene sequence of (A) phagocytophilum, (B) v. berkhoffii, and E. canis. NGS with broad-range 16 S rRNA primers followed by our bioinformatics SOP was capable of detecting these pathogens in biological replicates of different dilutions. These results illustrate the ability of NGS to detect and genetically characterize multi-infections with different amounts of pathogens in a single sample. Conclusions Bloodborne microbiomics & metagenomics approaches may help expand the molecular diagnostic toolbox in veterinary and human medicine. In this paper, we present both in vitro and in silico detailed protocols that can be combined into a single workflow that may provide a significant improvement in VBP diagnostics and also facilitate future applications of microbiome research in veterinary medicine. Supplementary Information The online version contains supplementary material available at 10.1186/s12917-021-02969-9.


Background
Vector-borne diseases (VBDs) impact both human and veterinary medicine and pose special public health challenges. According to the One Health concept, animal health, environmental health, and human health are all interrelated [1]. In fact, 60 % of all human infectious diseases are classified as zoonotic diseases, having an animal origin [2]. The World Health Organization (WHO) reports that vector-borne diseases (VBD) represent more than 17 % of all known infectious diseases worldwide, causing over 700,000 annual deaths with billions of people at risk of contracting a VBD in 129 countries [3]. Data from the World Organization for Animal Health (OIE) shows that 25 % of terrestrial vertebrate pathogens of concern are vector-borne [4]. Also, the Companion Vector-borne Diseases Organization (CVBD) highlights that in more than half of the continental regions in the world, companion animals are threatened by three or more endemic VBDs [5]. Pathogen-harboring arthropods (also called vectors) such as mosquitoes, sandflies, fleas, and ticks are natural conduits for microorganisms to infect their vertebrate hosts. These complex host-pathogen relationships are constantly remodeled by environmental conditions including anthropogenic influences such as global climate change, urbanization, economic globalization, and pesticide use [6,7]. Such phenomena increase the risk of infection of non-reservoir mammalian hosts (e.g. humans and domestic animals), which are also called incidental hosts. VBD research is therefore strategically important to maintain and improve public health.
Vector-borne pathogens (VBPs) encompass a wide variety of organisms distributed among different phylogenetic groups. In recent years, the advent and increasing adoption of Next-Generation Sequencing (NGS) and associated bioinformatics have dramatically increased our understanding of the complexity and importance of bacterial communities in a variety of environmental and clinical samples [8,9]. For example, the Human Microbiome Project (HMP), a research consortium that started over a decade ago, has used NGS to investigate the composition and function of microbial communities in over 40 different body sites on > 30,000 samples [10]. Typically, microbiome research has been descriptive, often comparing healthy and diseased communities, considering the whole microbiota [11,12]. As sequencing costs have continued to decline, more specific applications have become feasible such as identifying etiologic agents of disease in clinical samples for diagnostics and public health surveillance [13].
Several recent studies have characterized the microbial communities of arthropod vectors and their role in VBD transmission [14][15][16][17][18]. In our work, we have utilized 16 S rRNA NGS to characterize the microbiome and the presence of VBPs in cat fleas (Ctenocephalides felis) from Northern and Southern California [19]. Such approaches are increasingly being adopted in Veterinary Medicine, especially for the characterization of the blood microbiome of companion animals. For example, using NSG targeting the 16 S rRNA of the 18 S rRNA genes, high rates of Anaplasma platys, Babesia vogeli, Ehrlichia canis, Hepatozoon canis, and hemotropic Mycoplasma infection and co-infection were reported from a population of temple dogs in Thailand, where the NGS assay was reported to be more sensitive than conventional endpoint PCR diagnostic methods [20,21]. Furthermore, optimizations in the amplification of bacterial DNA by using host-specific blocking primers combined with optimal DNA extraction prior to NGS further improved the sensitivity and diversity of canine VBPs in one study [22]. The use of microbiomics or metagenomics strategies for the detection and characterization of VBPs is poised to expand and benefit not only veterinary researchers but also clinicians in the near future.
NGS can be performed on any sample of interest by either direct sequencing with no PCR amplification of DNA (metagenomics) or RNA (metatranscriptomics) or targeting an amplicon, such as the 16 S rRNA gene (16 S-NGS) for prokaryotes or the 18 S rRNA gene for eukaryotes, which have become gold-standard marker genes for phylogeny and large-scale microbiome comparisons of bacterial and protozoan microorganisms. Several NGS workflows for infectious diseases already exist [13,23], including recently published specialized methods targeting either 16 S rRNA or 18 S rRNA genes for canine VBPs comprising either bacteria or protozoa (apicomplexan piroplasms and/or euglenozoan kinetoplastids haemoparasites) [20,21]. A need remains for a detailed step-by-step data analysis protocol for VBPs in the NGS diagnostic landscape for use by nonbioinformaticians, especially in veterinary medicine.
The goal of the current article is to describe a Standard Operating Procedures (SOP) for microbiome analyses targeting VBPs of importance in companion animals. We tested the proposed bioinformatic workflow usingin-vitro data from selected VBPs. We focus exclusively on the 16 S-NGS approach due to its experimental affordability for diagnostics purposes and relatively straightforward computational requirements. Also, we determined the ability of NGS to detect simulated coinfections or multi-infections with VBPs. Our in-silico veterinary-focused microbiome methods adopt freely available industry-standard tools for 16 S-NGS analyses, adapting and executing pipelines from the QIIME [24] and Uparse [25] software packages. The computational SOP is available at https://github.com/eltonjrv/ microbiome.westernu/tree/SOP and combines these existing software tools with our BASH, PERL, and R scripts to form a complete analysis pipeline accessible to veterinary scientists. To our knowledge, this is the first release of a detailed SOP for microbiome analyses applied to VBD diagnostics available free of copyright to the Veterinary Medicine community.

S-NGS Computational Workflow
Next-generation sequencing significantly advances the field of clinical microbiology by generating millions of individual DNA sequence reads from a single sample that allows for a comprehensive evaluation of bacterial diversity that is not subject to typical limitations of culture-based approaches. A schematic microbiomics computational workflow, typically 16 S-NGS, is depicted in Fig. 1. Once followed by the in vitro protocol we describe in the Methods section, it can be adopted in the small animal diagnostics field, for instance.
Once sequenced, the library must pass through a bioinformatics pipeline that performs demultiplexing, quality control, error correction, operational taxonomic unit (OTU) clustering, and identification, and may include subsequent statistical hypothesis tests regarding treatment effects, etc. For the current study focused on detection and characterization of bloodborne pathogens, characterizations of the phylogenetic diversity of taxa of interest (ToI) were particularly important ( Fig. 1 and SOP GitHub page disclosed herein). Comprehensive microbiome analysis typically focuses on comparing sample sets (e.g. non-infected vs. infected animals) through statistical comparisons of beta-diversity, using distance matrices (e.g. bray_curtis, jaccard) produced from the microbiome compositional data sets. Pairwise phylogenetic distances between samples can be compared using industry-standard tools like Unifrac [26] to test for significant phylogenetic differences between or among groups. Finally, a variety of data visualization approaches are commonly used such as principal coordinate analysis (PCoA) charts, along with the regular taxa relative abundance bar plots (Fig. 2).
Currently in veterinary medicine, the limited broadrange evaluation of clinical samples may restrict the detection of new and emerging zoonotic diseases. 16 S-NGS can circumvent this limitation by broad-range amplification of the whole bacterial community in a sample and can discriminate among amplicons at a single-nucleotide resolution. Phylogenetic analysis of sequence types representing putative pathogens in clinical samples is also crucial for identifying new VBD etiologic agents. Our bioinformatics SOP also provides instructions on these types of analysis and will help non-expert users to extract valuable information on potential novel pathogenic bacterial species and/or strains that might be hidden in their clinical microbiome dataset. Once generated, so-called "zero-radius operational taxonomic units" (ZOTUs), also known as "exact sequence variants" (ESVs) -which are 100 % identical 16 S amplicon sequences identified among the tens to hundreds of millions of reads from all samples (see SOP subtopic 2.2 for technical details)can be classified by comparison against 16 S rRNA sequences from known pathogens using a phylogenetic inference approach (See SOP topic 4 for technical details). Of course, one must perform this comparison against a reliable and comprehensive 16 S database. We recommend the SILVA database [27] loaded within the ARB environment [28] for such a task.

Detecting single and multi-infections with 16 S-NGS
The efficiency of the NGS platform involving 16 S rRNA amplicon sequencing in detecting known amounts of bacteria was verified by performing quality controls using standard microbial communities (ZymoBIOMICS™ Microbial DNA Community Standard, CA). The mock community standards consist of eight prokaryotic genera in different proportions for 16 S rRNA genes in one sample. The community standards were amplified using the primer pairs for the V1-V2, V3-V4, and V4-V5 regions of the 16 S rRNA gene and sequenced with other samples in the same MiSeq run (see Methods section). We analyzed the amplicon data for the mock community with our newly developed pipeline. We were able to successfully detect all eight prokaryotic genera from the mock community standards and the relative abundance of most of the eight genera closely matched the theoretical composition expected for the V1-V2 and V3-V4 regions. Interestingly, for the V4-V5 region, only five out of the eight genera were detected (Fig. 2).
As demonstrated by Tables 1 and 2, one of the primary advantages of NGS is its ability to detect single and multiple organisms present in the same sample. We    (Table 1). Two Bartonella spp. and M. haemocanis were not detected in the 10 1 dilutions out of the 4 trials (Table 1). For both double and triple-infections, similarly, consistent detection was obtained ( Table 2). In general, 10 2 was the minimum threshold of detection for the assays tested here (Tables 1 and 2).

Discussion
The introduction of PCR-based assays in the early 90 s led to a substantial improvement in the diagnostics of canine vector-borne diseases. However, in light of the increasing number of new tick-borne pathogens detected from dogs and humans, such as Panola Mountain Ehrlichia sp. [29,30], E. muris [31,32], and several new Bartonella spp. [33][34][35], VBP diagnostics may benefit from the ability of high-throughput DNA sequencing assays to detect a broader range of microbial DNA from a given sample. Next-generation sequencing (NGS) is based on massively parallel sequencing of billions of nucleotide reads, covering the same targeted region up to 1000 times, with each read classified independently. Consequently, the identification of all sequence variants of the targeted area can be performed. NGS targeting 16 S rRNA genes was initially used to characterize environmental samples but has expanded into diagnostics in recent years [36,37]. In two studies in ticks, NGS was able to identify the presence of novel canine and human pathogens such as Neoehrlichia mikurensis as well as co-infections with Ehrlichia and Anaplasma [38,39], and Bartonella sp. and Rickettsia sp. were detected from fleas [19]. More recently, NGS was used to detect the presence of several VBPs including bacteria and protozoans, in single and co-infection, from dogs naturally infected in Thailand [20,21], illustrating the capabilities of NGS as a molecular diagnosticmethod.
While PCR-based assays remain the current goldstandard for molecular diagnostics of vector-borne pathogens, the use of NGS is rapidly expanding due to advantages such as: detection of unknown diseaseassociated pathogens in clinical specimens, characterization of co-infections, investigation of microbial population diversity in the host, and strain characterization [43]. Nonetheless, there are several barriers to the broader use of NGS in molecular diagnostics, including but not limited to: carryover microbial DNA contaminants from sample collection tools and lab reagents, sequencing error rates, elevated cost, result turnover time, analytical sensitivity, and preferential amplification of dominant microbial sequences [40,41]. As demonstrated in Fig. 2, different sets of primers targeting distinct variable regions of the 16 S rRNA gene may yield different results regarding the relative abundance of microbial communities. Also, the analytical sensitivity and limit of detection (LOD) achieved by the 16 S NGS assays performed in this study remain suboptimal when compared to the LOD reached by our conventional or real-time PCR assays for some of these pathogens [42,43]. Other research groups have reported NGS sensitivity to canine VBPs comparable or even superior to conventional endpoint PCR [20,21] The targeting of the 16 S rRNA gene has also limited capability in defining species within genera where this region is highly conserved, such as Rickettsia spp. and Brucella spp. Future advances in NGS technology such as longer Table 2 Summary of detection of the positive controls in the corresponding dilution for co-infections of six vector-borne pathogens. Numbers in the first bracket represent the number of times it was detected out of the number of times it was tested. Each sample was amplified for three variable regions of the 16S rRNA gene; V1-V2, V3-V4, and V4-V5

Double Infection
A. phagocytophilum plus E. canis

Triple infection
A. phagocytophilumplus E. canis plus B. vinsonii subsp. berkhoffii Vector-borne pathogen  read lengths are expected to address some of these limitations, making it a potentially transformative tool for the diagnosis of new or emerging pathogens in animals and humans.
In this methodology article, we disclosed both benchtop protocols and a "step-by-step" bioinformatics tutorial for performing microbiome assays and analyses, respectively, on VBD diagnostics in Veterinary Medicine. While in this article we only report the results from synthetic positive controls, we have successfully used this bioinformatic pipeline to detect the presence of Anaplasma phagocytophilum or Ehrlichia ewingii in dog blood [44].

Conclusions
The detailed 16 S-NGS protocols for microbiome analyses provided herein can efficiently and accurately characterize mock communities and may also be able to detect co-and multi-infections of VBPs. These procedures can be combined into a single workflow that should facilitate future applications of microbiome approaches to VBD diagnostics in Veterinary Medicine. Such approaches may improve current understandings of VBDs and their impact on both Veterinary and Human Medicine.

Positive controls and standard curves
Synthesis of positive controls and generation of standard curves were completed by cloning full 16S rRNA sequences of Anaplasma phagocytophilum, Ehrlichia canis, E. chaffeensis, B. henselae, Bartonella vinsonii berkhoffii (B.v.b), and Mycoplasma haemocanis into plasmids (Eurofins Genomics LLC, Louisville, KY, USA). A. phagocytophilum positive control was generated from the consensus sequence derived after careful analysis of the 16S rRNA sequence alignment associated with the following accession numbers; CP006618, CP006616, APHH01000002, NC_007797, and CP006617. Similarly, for E. canis, a consensus sequence of strains Jake (CP000107), Oklahoma (NR_118741), Florida (M73226), and Malaysia (KR920044) was synthesized. For E. chaffeensis, strain Arkansas (NC_007799) was used. For B. henselae strain Houston-1 (CP020742) was used as a positive control whereas for B. vinsonii berkhoffii Winnie (CP003124) was used. Lastly M. haemocanis Illinois (CP003199) was used as a reference to generate the positive control. Standardized ten-fold dilutions were then generated for each plasmid from 1 × 10 7 to 0.1 copies per microliter (µL). The dilutions were made with dog gDNA to simulate natural infection. The gDNA concentration from this dog was quantified by spectrophotometry (average concentration of 27.8 ± 1.1 ng/dL with average 260/280 ratio of 1.88 ± 0.05) and the absence of PCR inhibitors was demonstrated by amplification of a fragment of the glyceraldehyde-3-phosphate dehydrogenase gene, as described previously [45]. Positive samples were diluted using VBP free EDTA-whole blood, confirmed using the molecular vector-borne disease panel from Vector-Borne Disease Diagnostic Laboratory at North Carolina State University College of Veterinary Medicine which tests for the following genera: Anaplasma, Babesia, Bartonella, Ehrlichia, Hemotropic Mycoplasma, Spotted-Fever Rickettsia, and Leishmania [46][47][48][49]. A different modified forward primer AE16S_ 45F (5' AGCYTAACACATGCAAGTCGAACG3') was used to detect Anaplasma and Ehrlichia [48] in the realtime PCR assay.

NGS library preparation and sequencing
For 16 S rRNA gene regions V1-V2 (~350 bp) and V3-V4 (~460 bp), the NGS libraries were created with proprietary primers from Zymo (Zymo Quick-16 S NGS Library Prep Kit, Zymo Research, Irvine, CA). For the V4-V5 region, universal primers (~550 bp) were used as previously described [50], followed by a barcoding scheme based on Faircloth and Glenn [51] as previously used by our research group [52]. These barcoding tags have been validated with the EDITTAG algorithm as previously described [51] and shown to be less errorprone than earlier barcode designs [53]. Illumina MiSeq 2 × 300 bp v2 kit was used to sequence the library at the University of Southern California Genome Core Center.

Library preparation quality control
During library preparation, utmost importance was given to quality control. A biosafety cabinet was used for sample preparation in separate and dedicated space. As described in detail in Oney et al. (44) adequate measures were taken to prevent barcode cross-contamination. Libraries were further purified by magnetic clean-up, were quantified by automated electrophoresis (BioAnalyzer 2100, Agilent, Santa Clara, CA), and pooled according to Illumina recommendations [54].

Simulation of co-infection with synthetic plasmids
For single infections all total six different pathogens were amplified for three different 16 S rRNA variable regions; V1-V2, V3-V4, & V4-V5. Among them (A) phagocytophilum, (B) henselae, B. vinsonii subsp. berkhoffii, E. canis, and M. haemocanis were tested in quadruplicate, whereas E. chaffeensis was tested eight times. Serial dilutions 1 × 10 6 through 1 × 10 1 copies/µL for all six synthetic plasmids were created. To simulate different possibilities of co-infection, A. phagocytophilum serial dilutions of 1 × 10 6 to 1 × 10 2 were mixed in descending order of copies per reaction with E. canis which was at a fixed concentration of 1 × 10 6 for all reactions. In another effort, (A) phagocytophilum serial dilutions of 1 × 10 4 to 1 × 10 2 were mixed with (B) henselae at a fixed concentration of 1 × 10 4 . These conditions were also used to simulate triple co-infection consisting of (A) phagocytophilum, E. canis, and (B) vinsonii subsp. berkhoffii with E. canis being the standard at 1 × 10 6 copies per reaction while (A) phagocytophilum and (B) vinsonii subsp. berkhoffii were mixed in decreasing order of concentration of 1 × 10 6 to 1 × 10 2 . Similarly, for another triple infection control fixed dilutions (10 4 ) of E. canis and M. haemocanis were mixed with different dilutions (1 × 10 4 to 1 × 10 2 ) of B. vinsonii subsp. berkhoffii. All the double and triple infections were also repeated quadruplicate times or more (Tables 1 and 2).

PCR amplification, barcoding, and amplicon sequencing
Custom dual indexed adapters were used to amplify the single, double and triple co-infections targeting the V1-V2 (~349 bp), V3-V4 (~460 bp), and V4-V5 (~410 bp) regions of the 16 S rRNA gene. The positive control was the ZymoBIOMICS™ Microbial Community DNA Standard (Zymo Research, Irvine, CA). Amplification for the first PCR reaction was performed by conventional PCR using the Eppendorf Master cycler using single hot start at 94°C for 2 min, futher 30 cycles of denaturing at 94°C, annealing at 57°C for 45 s, and final extension at 72°C for 1 min, with the PCR reaction in a 25 µL final volume containing 10X Ex Taq Buffer, dNTPs, and Takara Ex Taq HS (Takara Bio Inc., Shiga, Japan), 10 pmol of each primer, 2 µL of DNA template. The second amplification of 15 cycles was performed with the same conditions as the first, but with 1 µL of template coming from the primary inoculation. With 16 S rRNA primers, the target amplicon was amplified during the first step. In the second step, i5 and i7 barcoded primers in 96well configurations were used to form a library for sequencing [51]. Negative controls consisted of moleculargrade water and gDNA from an uninfected dog used to prepare the serial dilutions of positive controls, to characterize any bacterial DNA contaminants from the PCR reagents, as well as from the gDNA used, respectively. 1.5 % agarose gel electrophoresis with 1kp Plus DNA Ladder. V4-V5 amplicons were then normalized according to the manufacturer's instructions (SequalPrep Normalization Plate kit, ThermoFisher Scientific, Waltham, MA, USA) and were then pooled into 5mL tubes and stored at -30°C. V1-V2 and V3-V4 amplicons were normalized using the ZymoBIOMICS™ 96 MagBead DNA Kit (Zymo Research, Irvine, CA, USA) as per the manufacturer's protocol. Samples were sequenced with Illumina MiSeq paired-end V3 sequencing chemistry, with three to four 96-well plates simultaneously multiplexed per run. The number of raw and processed reads is available in Supplementary Table 1.

Bioinformatics pipeline on Mock samples analysis
Our computational pipeline for 16 S-NGS microbiome analysis is available in detail as a Standard Operating Procedure (SOP) publicly available at https://github. com/eltonjrv/microbiome.westernu/tree/SOP. Besides detailed ordered steps and useful commented commands, the SOP URL above also provides accessible links for in-house supporting scripts plus a customized RDP reference database [55] for a better taxonomic classification of bacterial VBPs of interest in Veterinary Medicine (https://github.com/eltonjrv/microbiome. westernu/tree/refDB). Due to the open-source nature of most of the adopted programs, the SOP must be run on a Unix Platform (either Mac or Linux). After installing the required tools, both Unix-familiar users and novices will be able to run it without any further difficulties, since we have provided detailed explanations for each step of the pipeline.
Background abundance of microbial taxa is a major hindrance for accurate analysis of low diversity samples. The background diversity is expected to be present in every sample and our previous data and other studies have shown that the frequency of contamination is directly related to the sample DNA abundance within total DNA content [44,57,58]. Along with the existing tools to remove such contamination from the sample [57][58][59], we have provided two new options in our pipeline to deal with such background contaminations in the dataset: (1) An ad-hoc R script was written for subtracting negative control-derived ZOTU counts from target samples (see SOP step 3.2.1). The use of blank negative control(s) with no sample DNA is of utmost importance for the analysis of low diversity samples. Users shall analyze the blank sets as negative-control samples (with no externally supplied DNA) and generate ZOTUs/OTUs using the same parameters as individual samples. Afterward, using the first utility users can remove the abundance present in the negative control from the actual samples at ZOTU/OTU level. In addition to removing the background diversity, this particular option has its usefulness in dealing with cross-contamination as well.
In the latter case, if cross-contamination occurs during the sample preparation step from the actual sample to the blank negative control, during negative control removal it will not remove the ZOTU/OTU completely from the sample. The abundance of contaminant ZOTU/OTU in the blank control set theoretically cannot surpass the abundance of that particular ZOTU/ OTU in the actual sample. (2) An alternative ad hoc PERL code is provided for removing the whole ZOTUs/ OTUs content present in negative controls from the actual samples to be assessed. Through that PERL code, users can remove the whole negative control-derived ZOTUs/OTUs content from their target samples, rather than count subtraction (see SOP steps 3.2.2 and 3.2.3). After generating ZOTUs/OTUs from the blank controls like earlier, this option will match every ZOTU/OTU present in the control to the sample ZOTUs/OTUs and remove them from the latter. Though this option can be equally effective while dealing with contaminations from background negative controls to the samples, users need to be cautious for the cross-contamination that may happen during the sample preparation step from actual samples to the negative controls.