Putative regulation mechanism for the MSTN gene by a CpG island generated by the SINE marker Ins227bp

Background A single nucleotide polymorphism (SNP) in the first intron of the myostatin gene (MSTN) is associated with the aptness of elite Thoroughbreds to race over sprint, middle or long distances. This intronic marker (g.66493737 T ≻ C), a short interspersed nuclear element (SINE) of 227 bp (Ins227bp) insertion polymorphism in the MSTN promoter, and the adjacent SNP BIEC2-417495 have not been studied for their association with racing aptness of the average Thoroughbreds raced in countries with lower status of the racing industry. This study investigated these markers regarding their prevalence and association with performance in common race horses. Markers were genotyped by amplification refractory mutation system-quantitative PCR (ARMS-qPCR) or amplicon melting. Furthermore, we asked whether the Ins227bp marker might theoretically regulate the expression of myostatin by generating a novel target for DNA methylation or by changing binding sites for transcription factors. Putative sites for DNA methylation or binding of transcription factors were predicted by MethPrimer and by the software tools JASPAR, MatInspector and UniPROBE, respectively. Results Pairwise linkage disequilibrium between g.66493737 T ≻ C and Ins227bp was high (r2 = 0.93). A lower linkage was determined for g.66493737 T ≻ C and BIEC2-417495 (r2 = 0.69) as well as for BIEC2-417495 and Ins227bp (r2 = 0.76). The estimated frequencies for the presence of the Ins227bp (I) indel and the C alleles at g.66493737 T ≻ C and BIEC2-417495 were 0.46, 0.47 and 0.43, respectively. Heterozygotes represented the most abundant genotype at each locus. The best racing distance (BRD) was significantly different between the homozygotes of each SNP (p = 0.01 to 0.03). C allele homozygotes at BIEC2-417495 or g.66493737 T ≻ C, as well as Ins227bp homozygotes, earned the most money on a mean distance ranging from 1211 to 1230 m. Heterozygotes earned the most money in races over 1690 to 1709 m.
The BRD for the T/T carriers at both SNP loci and for the SINE-free genotype was 1812 to 1854 m. Other performance parameters were not significantly different between the genotypes, except for the relative success score (RSS). Over distances of ≤1300 m, the RSS was slightly but significantly better for all carriers of the C allele and of Ins227bp than for homozygous T genotypes and SINE-negative horses (p = 0.037 to 0.046). For distances of more than 1300 m the RSS was not significantly different between genotypes. In silico assessment indicated that the Ins227bp promoter insertion might have generated a CpG island and a few novel putative binding sites for transcription factors. Conclusions All three target polymorphisms (Ins227bp, g.66493737 T ≻ C, BIEC2-417495) are suitable markers to assess the ability of non-elite Thoroughbreds to race at short or longer distances. The CpG island generated by Ins227bp may cause training-induced silencing of MSTN expression. Electronic supplementary material The online version of this article (doi:10.1186/s12917-015-0428-3) contains supplementary material, which is available to authorized users.


Background
The g.66493737 T ≻ C marker located in the first intron of the MSTN gene predicts the racing ability of Thoroughbreds based on the quantitative traits best racing distance (BRD) or win-race distance [1][2][3]. C/C homozygotes appear better suited for fast, short-distance races (≤1300 m), C/T genotypes seem to compete better in middle-distance races (1301 to 1900 m), and T/T homozygotes generally perform better over longer distances (>2114 m) [1,2]. For two cohorts of elite horses, a strong association of the C and T alleles with sprinting and staying performance, respectively, was demonstrated [3,4].
The highest standard and most valuable elite Flat races are known as Group (Stakes) races, whereas Listed races are the next in status. The elite Thoroughbreds described before had won at least one Group race or a Listed race. Most previous studies have been performed with elite cohorts from countries whose Thoroughbred industries are the most highly regarded internationally. Such cohorts likely do not represent the population of Thoroughbreds raced in countries in which horse racing is regarded as being of poorer quality on an international level. It would be very interesting to see whether the association between MSTN markers and best racing distance or other performance indicators holds true in a less well regarded Thoroughbred population. Therefore, we present an observational study on the previously identified variants in equine MSTN that are thought to influence the racing ability of Thoroughbred horses. For this we studied a cohort of 56 non-elite Thoroughbreds raced in Austria and Turkey. Races run were usually handicap races or other non-Group or non-Listed races.
It is currently not understood how the g.66493737 T ≻ C polymorphism, located in the middle of a relatively large intron (1829 bp), may influence the expression of genes involved in the development of juvenile and mature equine muscles. Moreover, although some marginal increase in muscle mass has been described [6], the massive increase in muscle mass seen in other species with MSTN missense or nonsense mutations, such as in knock-out mice [7], double muscled cattle [8,9] or "bully" whippets [10], was not observed. The SINE of the MSTN promoter, Ins227bp, is in high linkage disequilibrium (r2 = 0.73 to 1) with the C allele at g.66493737 T ≻ C [2,11], but is considered less appropriate to predict racing aptness [2]. Recently, haplotype data suggested that Ins227bp is contemporary to and arose upon a haplotype containing the C allele at g.66493737 T ≻ C [11]. Moreover, it is suggested that Ins227bp, rather than the intron 1 SNP of MSTN, drives muscle fiber type characteristics and is the variant targeted by selection for short-distance racing [11].
To find a possible mechanism for this, we analysed the sequence in silico to identify putative sites for DNA methylation and putative binding sites for transcription factors created by the Ins227bp insertion.

Results and Discussion
Linkage disequilibrium and allelic distribution
Compared to the study by Hill et al. [2], our experimental cohort of average Thoroughbreds differed in pairwise linkage disequilibrium between g.66493737 T ≻ C and Ins227bp as well as between g.66493737 T ≻ C and BIEC2-417495 (r2 values of 0.73 versus 0.93 and 0.86 versus 0.69, respectively; see Additional file 1: Figure S1). The lower disequilibrium observed between g.66493737 T ≻ C and BIEC2-417495 makes it less difficult to assess the functional impact of either locus independently of the other. Table 1 displays the distribution of the Ins227bp, g.66493737 T ≻ C and BIEC2-417495 alleles in the cohort of 56 non-elite Thoroughbreds.
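Allele frequencies such as those reported for the three markers follow directly from diploid genotype counts. The following Python sketch illustrates the calculation; the function name and the genotype counts are hypothetical and are not the values of Table 1.

```python
def allele_frequency(n_hom_alt: int, n_het: int, n_hom_ref: int) -> float:
    """Frequency of the alternative allele from diploid genotype counts.

    Each homozygote carries two copies of its allele, each heterozygote one.
    """
    total_alleles = 2 * (n_hom_alt + n_het + n_hom_ref)
    if total_alleles == 0:
        raise ValueError("no genotypes supplied")
    return (2 * n_hom_alt + n_het) / total_alleles


# Hypothetical counts for a cohort of 56 horses (illustration only,
# NOT the genotype distribution reported in Table 1).
example_counts = {"C/C": 10, "C/T": 30, "T/T": 16}
freq_c = allele_frequency(
    example_counts["C/C"], example_counts["C/T"], example_counts["T/T"]
)
print(f"C allele frequency: {freq_c:.3f}")
```

The same function applies to the insertion marker if I/I, I/N and N/N counts are passed in place of C/C, C/T and T/T.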

Performance indicators
There was no statistically significant difference in victories, places and shows, starts, life earnings, best earnings in a race and average earning per start between the genotypes for each marker (Table 2). However, the BRD was significantly different between some of the genotypes (Table 2). The RSS was calculated for distances of ≤ 1300 m (short) and > 1300 m (Table 3). On the short distance, the RSS determined for the C/C and C/T genotypes at g.66493737 T ≻ C was significantly higher compared to T/T carriers (p = 0.037 and p = 0.046, respectively). I/I genotypes had a better RSS than the N/N genotypes, although the difference was only marginally significant (p = 0.052). For the BIEC2-417495 genotypes no difference in RSS was found, neither for the short distance nor for distances of more than 1300 m.
Sampling bias in this study could not be prevented since assessment of the racing ability was based on results of races run on different tracks under different circumstances and over a wide range of distances. This forced us to cluster race distances slightly differently than was done by others [1,3]. Considering the maximum speed of a Thoroughbred, a real sprint distance should not be more than 1000 m [12]. We chose 1300 m as the nearest suitable approximation of a sprint distance to obtain a sufficient number of performance data. For the same reason, others made a slightly different split at 1600 m [1]. Existing data provide evidence that the proportion of anaerobic power decreases to less than 5 % if races are 2400 m or longer [13]. Thus, the empirical classification of distances ranging between 1000 and 2400 m according to the International Federation of Horseracing Authorities (www.horseracingintfed.com) should be regarded as arbitrary. In this respect, the BRD for the C/C (and I/I) genotypes on average fell within the physiological "sprint" distance (<1400 m). Ranges of BRD of the C/T (I/N) and T/T (N/N) genotypes overlapped considerably, as was also reported by others [1]. This is plausible since, in addition to genotype, many more factors determine the racing success of a horse. Nevertheless, the pattern confirms the underlying genetic aptness for a specific distance and could be used by the trainer to strategically design a horse's racing career. Horses were identified as non-elite because they had not competed in Group or Listed races. However, there was a large variation in prize money won, and some might have become elite horses in the hands of other trainers. We tried to estimate the strength of the associations between the genotypes and racing aptness in the general horse population; however, the sample size of 56 horses was too small to allow further analyses of association between genotype and racing performance.
Sample sizes of at least 200 horses, and even more than 4500 in the case of victories, would have been needed to obtain a minimal power of 0.80. Therefore, it is not surprising that in other studies with larger cohorts BRD was often the only trait that was significantly associated with genotype [1][2][3][4]. Although our BRD was not based on winning races but was instead determined by the distance of the race in which the horse earned the most money, the association with the genotypes of g.66493737 T ≻ C in our non-elite race horse population agrees with that described for cohorts of elite and better quality horses [1][2][3][4]. The proportion of C/C homozygotes in our non-elite cohort differed from that given by Hill et al. [2] (18 % versus 29 %), but was similar to that of Tozaki et al. [4]. The proportion of T/T homozygotes in our cohort was similar to that of Hill et al. [2] but smaller than that of Tozaki et al. [4] (23 % versus 31 %), likely explained by the different origins of the populations.
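The text does not state how the required sample sizes were derived. As a rough illustration of why small effects demand large cohorts, the following Python sketch uses the common normal approximation for a two-sided comparison of two group means. The standardized effect sizes are hypothetical, and the approximation assumes normally distributed data, which may not hold for the performance traits analysed here.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per group for a two-sided two-sample comparison.

    Normal approximation: n = 2 * ((z_(1-alpha/2) + z_power) / d)^2,
    where d is the standardized difference between the group means.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Hypothetical effect sizes (illustration only, not estimates from this study).
for d in (0.8, 0.5, 0.2, 0.1):
    print(f"d = {d}: about {n_per_group(d)} horses per genotype group")
```

Because the required number grows with the inverse square of the effect size, halving the expected effect roughly quadruples the cohort needed for the same power.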
The Nearctic-Northern Dancer sire line is strongly associated with dispersion of the C/C genotype at g.66493737 T ≻ C [11]. Our cohort did not confirm this finding. The mean percentage of Nearctic blood in our g.66493737 T ≻ C C/C horses was not higher (p = 0.4) than in the C/T and T/T horses. Similar trends were found for the other two markers (data not shown).
The C allele is not unique for Thoroughbreds and Thoroughbred-derived populations. It was even found at a high frequency in Shetland ponies (0.32 to 0.50) and Fulani horses (0.33) [11,14]. In contrast, the Ins227bp marker appears to be more specific for Thoroughbreds, Quarter horses and related breeds and is distributed across other breeds only at minor frequency [11,15].
The reason for the statistical association of the MSTN polymorphism with racing aptness is still unknown because the strongest marker for this trait, BIEC2-417495 [2], is located far upstream (692 kb) of MSTN near the locus of the glutaminase (GLS) gene. Glutaminase, a mitochondrial enzyme, is assumed to play a role in energy production. So far, this gene or its alleles have not been studied in the horse (www.omin.org/entry/138280).
Nevertheless, the C allele of g.66493737 T ≻ C is regarded as a marker for muscularity [1][2][3][4]. In contrast, the tightly associated Ins227bp insertion polymorphism [2,11] was not found to affect muscle mass [16]. Thus, a possible effect of the C allele on muscle mass needs further confirmation. Although the MSTN polymorphisms may not clearly affect mature muscle mass, they might influence prenatal muscle differentiation and juvenile composition. In Quarter Horses and Thoroughbreds the C allele at g.66493737 T ≻ C as well as the Ins227bp marker appear to be associated with higher and lower proportions of type 2X and type I fibres, respectively [11,15]. Thus, Ins227bp could also indicate the potential for high speed in Thoroughbreds. Interestingly, Thoroughbreds homozygous for the C allele at g.66493737 T ≻ C showed a higher transcript expression of MSTN in the untrained condition compared to the C/T and T/T genotypes. Only after a period of 10 months of training did the expression level decrease to levels similar to those of the C/T and T/T genotypes [17]. This contradicts the simplistic hypothesis that a decreased MSTN expression leads to increased muscle mass. Theoretically, the three target polymorphisms could cause a change of MSTN expression by intron mediated enhancement [18][19][20], a distant regulatory DNA element located several hundred kilobases away [21], or by a genetic or epigenetic change of the MSTN promoter.

Novel transcription factor binding site candidates and CpG island caused by Ins227bp
It was not very surprising that the insertion of the 227 bp SINE (Ins227bp) into the promoter of the MSTN gene generated some novel putative binding sites for transcription factors. In more detail, the insertion did not erase any putative transcription factor binding site according to the analysis tools JASPAR, MatInspector and UniPROBE applied under stringent settings, but it created one, three or four novel putative transcription factor binding sites according to the pairwise intersections of the three prediction programs (Fig. 2). There was no site predicted by all three tools. More surprising, however, was the finding that the Ins227bp insertion created a novel CpG island (Fig. 3) including a downstream segment at the insertion site. Gene expression differences that are the result of SINE insertions are likely to be a recurrent theme in the study of complex traits [22]; however, so far very few studies have conclusively demonstrated exaptation of transposable elements as transcriptional regulatory regions [23].
Their functioning as nucleation centres for de novo methylation is striking in an epigenetic context [24]. Further dissecting the effects of the genetic variants will benefit the understanding of how the racing ability of Thoroughbreds is regulated. Of special interest in this regard would be to unravel whether the SINE Ins227bp of the MSTN promoter regulates MSTN expression via the generated CpG island and/or via changed target sites for transcriptional regulator(s).

Conclusion
Each of the three polymorphisms studied represents a suitable genetic marker to predict the sprinting ability of non-elite Thoroughbreds. Future experiments with large numbers of horses, from 200 to over 4500 depending on the studied trait, should address the possible role of the SINE insertion Ins227bp as a putative cis element enabling transcriptional regulation via association with trans-acting factors and/or modulation by exercise. The use of untrained age-matched controls will make it possible to exclude that methylation regulates expression of MSTN in an age-dependent manner in horses of 20 and 30 months [17].

Animals and samples
Hair roots were collected from Thoroughbreds in Austria (n = 20) and Turkey (n = 36). The lifetime performance of these horses was extracted from published race results.

Genotyping assays
The SNPs g.66493737 T ≻ C and BIEC2-417495 were typed by ARMS-qPCR [25]. The length polymorphism Ins227bp was analysed by amplicon dissociation and agarose gel electrophoresis.
Primers (Additional file 2: Table S1) were designed with the software Primer Express 2.0 (Life Technologies, Foster City, USA) and checked for dimer formation using the web tool NetPrimer (www.premierbiosoft.com/netprimer/). Their specificity was evaluated with Primer-BLAST of NCBI using the "nr" database of Equus caballus. The secondary structure of the PCR product was analysed with the Mfold software [26].
Genomic DNA was extracted from hair roots using the NucleoSpin® Tissue Kit according to the manufacturer's instructions (Macherey-Nagel GmbH & Co. KG, Düren, Germany). DNA concentration was measured spectrophotometrically using the Hellma® TrayCell (Hellma Analytics, Müllheim, Germany) on the BioPhotometer 6131 (Eppendorf, Hamburg, Germany). Sample concentrations ranged between 2 and 11 ng/μl. Amplification was performed in duplicate 20-μl reactions. A single reaction consisted of 1 × reaction buffer (70 mM Tris-HCl (pH 8.3), 50 mM KCl, 10 mM (NH4)2SO4, 0.1 mg/ml gelatin), Table S1). Cycling conditions on the StepOnePlus™ Real-Time PCR System (Life Technologies) running under the software version 2.0 were 95°C for 15 min followed by 45 cycles of 95°C for 15 s, 58°C for 20 s, and 60°C for 30 s. For dye-based qPCR (markers: Ins227bp and g.66493737 T ≻ C), amplicon dissociation analysis from 60°C to 95°C with 0.3°C/s increments and continuous acquisition of fluorescence was performed. Specific amplification was concluded when the target and the no-template control showed different melting temperatures. In addition, the amplicon of the Ins227bp assay was assessed on a 1 % agarose gel stained with a 10,000-fold dilution of the dye Midori Green Advance (Biozym Scientific GmbH, Hessisch Oldendorf, Germany) and visualised on the AlphaImager HP System (Biozym Scientific GmbH, Hessisch Oldendorf, Germany) equipped with a blue light screen.
A sample was considered homozygous or heterozygous if the difference of the quantification cycle (Cq) values obtained by the two discriminative assays of ARMS-qPCR was ≥ 7 or ≤ 2.5, respectively.
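The calling rule above can be expressed as a small decision function. The Python sketch below is an illustration under two assumptions not stated in the text: that in a homozygous sample the allele-specific assay with the lower Cq identifies the allele present, and that samples with a Cq difference between 2.5 and 7 are left uncalled. The function and parameter names are hypothetical.

```python
from typing import Optional

HOMOZYGOUS_MIN_DELTA_CQ = 7.0    # threshold from the text: difference >= 7
HETEROZYGOUS_MAX_DELTA_CQ = 2.5  # threshold from the text: difference <= 2.5


def call_genotype(cq_c_assay: float, cq_t_assay: float) -> Optional[str]:
    """Call a SNP genotype from the Cq values of the two ARMS-qPCR assays.

    Returns "C/C", "T/T" or "C/T", or None when the Cq difference falls
    between the two thresholds (assumed here to remain uncalled).
    """
    delta_cq = abs(cq_c_assay - cq_t_assay)
    if delta_cq >= HOMOZYGOUS_MIN_DELTA_CQ:
        # Assumption: the assay that amplifies earlier matches the allele present.
        return "C/C" if cq_c_assay < cq_t_assay else "T/T"
    if delta_cq <= HETEROZYGOUS_MAX_DELTA_CQ:
        return "C/T"
    return None


# Hypothetical Cq values for three samples.
for cq_c, cq_t in [(24.1, 33.0), (25.0, 26.2), (25.0, 29.5)]:
    print(cq_c, cq_t, call_genotype(cq_c, cq_t))
```

Keeping an explicit uncalled outcome makes ambiguous samples visible for repeat measurement instead of forcing them into a genotype class.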

Pairwise testing of linkage disequilibrium
Haploview 4.2 was used for pairwise testing of linkage disequilibrium [27].
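For orientation, the r2 statistic reported for each marker pair can be computed from haplotype and allele frequencies as sketched below in Python. This is a simplified illustration that assumes the two-locus haplotype frequency is already known. With unphased genotype data, as in this study, haplotype frequencies must first be estimated, a step that this sketch does not reproduce. All frequencies in the example are hypothetical.

```python
def ld_r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """Squared correlation (r^2) between alleles at two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at the first locus
    p_b:  frequency of allele B at the second locus
    """
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denominator


# Hypothetical frequencies (illustration only, not estimates from the cohort).
print(f"r2 = {ld_r_squared(p_ab=0.44, p_a=0.46, p_b=0.47):.2f}")
```

An r2 of 1 means that the allele at one locus fully predicts the allele at the other, whereas 0 indicates independence.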

Prediction of transcription factor binding sites putatively created by the Ins227bp insertion
Transcription factor binding sites putatively created by the SINE insertion Ins227bp were analysed by the software tools JASPAR (version 5.0_ALPHA) [28,29], MatInspector (version 8.2) [30] and UniPROBE (state of March 2015) [31], which call upon different databases. To report only the most likely sites, stringent thresholds were applied, namely a 90 % relative profile score threshold for JASPAR set to "CORE Vertebrata", a core similarity of 1.0 and a matrix similarity of at least 0.95 for MatInspector set to vertebrates, and a score threshold of 0.48 for UniPROBE set to mammalian, which is slightly below the maximum value of 0.50.
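To make the idea of a relative profile score threshold concrete, the following Python sketch scans a sequence with a position weight matrix and keeps windows whose score, rescaled between the lowest and highest score the matrix can produce, reaches 90 %. This rescaling is assumed here as the definition of the relative score; the toy matrix and the sequence are hypothetical, only the forward strand is scanned, and none of the three tools is reproduced.

```python
from typing import Dict, List, Tuple

Matrix = Dict[str, List[float]]

# Hypothetical 4-bp weight matrix favouring the motif ACGT (illustration only).
TOY_PWM: Matrix = {
    "A": [2.0, -1.0, -1.0, -1.0],
    "C": [-1.0, 2.0, -1.0, -1.0],
    "G": [-1.0, -1.0, 2.0, -1.0],
    "T": [-1.0, -1.0, -1.0, 2.0],
}


def relative_score(site: str, pwm: Matrix) -> float:
    """Rescale the raw matrix score of a site to the range 0..1."""
    width = len(pwm["A"])
    if len(site) != width:
        raise ValueError("site length must equal matrix width")
    raw = sum(pwm[base][i] for i, base in enumerate(site.upper()))
    best = sum(max(pwm[b][i] for b in "ACGT") for i in range(width))
    worst = sum(min(pwm[b][i] for b in "ACGT") for i in range(width))
    return (raw - worst) / (best - worst)


def scan(sequence: str, pwm: Matrix, threshold: float = 0.90) -> List[Tuple[int, str, float]]:
    """Return (position, site, relative score) for windows at or above threshold."""
    width = len(pwm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width].upper()
        score = relative_score(site, pwm)
        if score >= threshold:
            hits.append((start, site, score))
    return hits


print(scan("TTACGTAA", TOY_PWM))
```

Sites created by an insertion could then be identified by comparing the hit lists obtained for the sequence with and without the inserted element.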

CpG island prediction
The CpG island was predicted by the MethPrimer software [32] using an island size of at least 100 nucleotides, a GC percentage of at least 50 % and an observed/expected CpG ratio of more than 0.6.
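The three criteria can be checked for a single candidate segment as in the Python sketch below. It assumes the widely used definition of the observed/expected ratio, (number of CpG dinucleotides × segment length) / (number of C × number of G), and evaluates only one segment rather than searching a longer sequence for islands. The function names and test sequences are hypothetical.

```python
from typing import Tuple


def cpg_statistics(segment: str) -> Tuple[float, float]:
    """Return (GC percentage, observed/expected CpG ratio) for a DNA segment."""
    seq = segment.upper()
    length = len(seq)
    if length == 0:
        return 0.0, 0.0
    n_c = seq.count("C")
    n_g = seq.count("G")
    n_cpg = seq.count("CG")  # CpG dinucleotides on the given strand
    gc_percent = 100.0 * (n_c + n_g) / length
    # Assumed formula: observed CpG count relative to the count expected
    # from the C and G content of the segment.
    obs_exp = (n_cpg * length) / (n_c * n_g) if n_c and n_g else 0.0
    return gc_percent, obs_exp


def meets_island_criteria(
    segment: str,
    min_length: int = 100,
    min_gc_percent: float = 50.0,
    min_obs_exp: float = 0.6,
) -> bool:
    """Apply the thresholds given in the text to one candidate segment."""
    gc_percent, obs_exp = cpg_statistics(segment)
    return (
        len(segment) >= min_length
        and gc_percent >= min_gc_percent
        and obs_exp > min_obs_exp
    )


# Hypothetical sequences (illustration only, not the Ins227bp sequence).
print(meets_island_criteria("CG" * 60))  # CpG-rich 120-nt segment
print(meets_island_criteria("AT" * 60))  # no C or G at all
```

Applying such a check to the promoter sequence with and without the insertion would show whether the inserted element is what pushes a segment over the thresholds.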

Calculation of relative success scores (RSS)
The various racing distances on which the horses had performed could only suitably be clustered into two classes: sprint distance (≤1300 m) and non-sprint distance (>1300 m). An RSS was calculated for each distance class by summing all points obtained in the respective distance class and dividing the sum by the number of starts in that class. A win was given ten points, a 2nd place five, a 3rd place four, a 4th place three, a 5th place two and an unplaced start one point. In this scoring system a win counts twice as much as a second place, while awarding one point for a finished race allows the effects of frequent starts to be included and indicates a certain level of toughness. Furthermore, per genotype group the mean victories, mean places and shows, mean number of starts, mean life earnings, mean best racing distance based on highest earnings, mean best earnings in a race and mean earnings per start were calculated. The percentage of Nearctic blood in the pedigree ($F_x$) was calculated by the term $F_x = \sum (0.5)^{x_1 + x_2 + 1}$ [33], whereby $x_1$ represents the number of generations from sire(s) to Nearctic and $x_2$ the number of generations from dam(s) to Nearctic. The parameters were used to identify possible associations between Ins227bp and genotypes at BIEC2-417495 and g.66493737 T ≻ C.
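The scoring rule and the pedigree term described above can be sketched in Python as follows. The function names, the representation of a racing record as a list of finishing positions and the example values are hypothetical; the point values and the exponent follow the description in the text.

```python
from typing import List, Optional, Tuple

# Points per finishing position as described in the text;
# any other finishing position counts as an unplaced start (one point).
POINTS_BY_PLACE = {1: 10, 2: 5, 3: 4, 4: 3, 5: 2}
UNPLACED_POINTS = 1


def relative_success_score(finishing_positions: List[int]) -> Optional[float]:
    """Mean points per start within one distance class (None without starts)."""
    if not finishing_positions:
        return None
    total = sum(POINTS_BY_PLACE.get(pos, UNPLACED_POINTS) for pos in finishing_positions)
    return total / len(finishing_positions)


def nearctic_fraction(paths: List[Tuple[int, int]]) -> float:
    """Sum of 0.5 ** (x1 + x2 + 1) over all pedigree paths.

    Each path is (x1, x2): generations from sire(s) to the ancestor and
    generations from dam(s) to the ancestor.
    """
    return sum(0.5 ** (x1 + x2 + 1) for x1, x2 in paths)


# Hypothetical record: finishing positions of one horse in races of <= 1300 m.
sprint_record = [1, 3, 7, 2, 9]
print(relative_success_score(sprint_record))

# Hypothetical pedigree with two paths to the ancestor.
print(nearctic_fraction([(2, 3), (3, 3)]))
```

Computing the score separately for the sprint and non-sprint records of each horse yields the two RSS values that are compared between genotype groups.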

Statistics
Statistical analysis was performed using IBM® SPSS® version 20 (IBM Corporation, New York, United States) statistical software. All data were tested by the Shapiro-Wilk test and were found not to be normally distributed (p < 0.04). Parameter