Skip to main content

Milk yield prediction in Friesian cows using linear and flexible discriminant analysis under assumptions violations

Abstract

Background

The application of novel technologies is now widely used to assist in making optimal decisions. This study aimed to evaluate the performance of linear discriminant analysis (LDA) and flexible discriminant analysis (FDA) in classifying and predicting Friesian cattle’s milk production into low (\(\:<\)4500 kg), medium (4500–7500 kg), and high (\(\:>\)7500 kg) categories. A total of 3793 lactation records from cows calved between 2009 and 2020 were collected to examine some predictors such as age at first calving (AFC), lactation order (LO), days open (DO), days in milk (DIM), dry period (DP), calving season (CFS), 305-day milk yield (305-MY), calving interval (CI), and total breeding per conception (TBRD).

Results

The comparison between LDA and FDA models was based on the significance of coefficients, total accuracy, sensitivity, precision, and F1-score. The LDA results revealed that DIM and 305-MY were the significant (P < 0.001) contributors for data classification, while the FDA was a lactation order. Classification accuracy results showed that the FDA model performed better than the LDA model in expressing accuracies of correctly classified cases as well as overall classification accuracy of milk yield. The FDA model outperformed LDA in both accuracy and F1-score. It achieved an accuracy of 82% compared to LDA’s 71%. Similarly, the F1-score improved from a range of 0.667 to 0.79 for LDA to a higher range of 0.81 to 0.83 for FDA.

Conclusion

The findings of this study demonstrated that FDA was more resistant than LDA in case of assumption violations. Furthermore, the current study showed the feasibility and efficacy of LDA and FDA in interpreting and predicting livestock datasets.

Peer Review reports

Background

Screening milk yield by investigating factors affecting production is critical for breeders. Large datasets, particularly for all animal parities, increase the challenge of prediction due to high dimensions. So, selecting the appropriate statistical method is crucial for analyzing biomedical data. Inappropriate methodological choices can lead to misleading interpretations and unsubstantiated conclusions. Understanding the assumptions and conditions of statistical methods is essential for selecting the appropriate method for data analysis. By doing this, researchers can ensure accurate and reliable data interpretation. Along with knowledge of statistical technologies, an additional important consideration is data type and nature, as well as the study’s goal [1]. In the case of categorical dependent variables, Discriminant analysis is primarily used to categorize data into two or more groups.

Discriminant analysis focuses on identifying significant factors for group distinction and categorizing unlabeled items. It involves two stages: separation and allocation. Separation aims to establish discriminant functions to separate groups, while allocation uses these functions to allocate unclassified items to recognized categories [2]. Additionally, it is carried out by computing the weights for each explanatory variable, which are calculated as between-group variance in relation to within-group variance [3]. Discriminant analysis is critical in determining the factors influencing milk production [4,5,6].

Various discriminant functions have been investigated over time, but they all serve the same purpose. Specific discriminant functions are required in certain circumstances. Linear discriminant analysis, Quadratic discriminant analysis, and Regularized discriminant analysis are examples of parametric discriminant analysis, while Flexible discriminant analysis and K-Nearest Neighbor analysis are examples of nonparametric discriminant analysis [7]. Linear methods are commonly used but may not always be sufficient in certain cases for accurate classification and error reduction. So, selecting an alternative approach can improve classification accuracy.

The flexible discriminant model (FDA) is a generalized linear discriminant analysis (LDA) method that can create non-linear boundaries for classes. The FDA’s inspiration originates from the connection between LDA, canonical correlation analysis, as well as optimal scoring. The primary difference is that the FDA can incorporate quadratic and bilinear predictors as well as nonlinear decision boundaries [8]. FDA is a valuable data prediction and classification method, and it is expected to draw the interest of medical investigators [9].

The current study aims to predict Friesian cow milk yield using linear discriminant analysis, use a more flexible classification model (flexible discriminant model) to accommodate the assumptions violation, and compare the performance of both models in anticipating and categorizing milk yield.

Materials and methods

Data collection and management

A total of 3793 dairy records were obtained for this study to investigate the significant variables influencing milk production in Friesian cows raised in Egypt. The data were gathered from one of the large farms in the El-Dakahlia governorate of Egypt. This investigation’s materials included records of dairy cows that calved between 2009 and 2020. Cows were mechanically milked three times per day under an automatic milking system. The daily milk yield was estimated using the sum of these three values. This study included cows that had completed at least one lactation. The lactation records and all other data are computer-based and updated regularly. On the farms, all cows were occasionally kept on unclean floors, in open yards, or slenderly covered yards with access to cool spraying during the summer. Animals were fed rations with a crude protein content of about 18%, according to National Research Council [10]. Isolation units were provided for sick and suspected animals and calving units were used for parturition. Storage units were also provided for feed and equipment. The calf-rearing system was an indoor system, allowing calves to stay with their dams and suckle for a limited period each day.

Investigated traits

The research’s data were used to discover the major factors affecting milk yield in Friesians using linear and flexible discriminant models. As a result, the variables under consideration were actual total milk yield (kg), which was divided into three categories: low (< 4500 kg), medium (4500–7500 kg), and high (> 7500 kg), and considered the dependent variable of interest, while the rest variables considered as predictors include age at first calving (month), lactation order of eight parities (P1-P8), days open (days), dry period (days), days in milk (days), 305-day milk yield (kg), calving season (spring – summer – autumn- winter), calving interval (month), and number of breeding per conception (number). Furthermore, the categorization of milk yield variable and the selection of independent variables have been carried out based on previous authors’ recommendations [11,12,13,14,15,16].

Data preprocessing

Data were initially checked for missing values as well as outliers. The data showed an imbalance in the number of the three levels of the outcome variable, which ranged between 3485 for the highly-producing cows, 287 for the medium class, and 21 for the low-producing category. The data was split randomly into two parts: 80% of the data was the training set for model creation and adjustment, and the second part was the testing set, representing 20% of the data for model validation and performance. The train set was processed by random over sampling (ROS) “up sampling” using the Caret package. The chi-square test and multinomial logistic regression were used to select variables for multivariate analysis, with all variables showing high significance levels (p < 0.001). For the LDA, the values of the different parameters were also centered and scaled to minimize data variability. Linear and flexible discriminant analysis models were used to classify cow milk yield into high, moderate, and low categories, creating a powerful model for future prediction.

Linear discriminant analysis

Linear Discriminant Analysis is a parametric method used to identify the weights of independent variables that best distinguish between two or more mutually exclusive and exhaustive groups of cases. LDA is also a hard classification approach. Hard classifiers only define a hard classification boundary for each group based on the explanatory variables. A new item is assigned to a group based on characteristics that fit within the group’s characteristics [17]. Each group has a mean vector that is calculated using the attribute vectors of all the objects in the group. A test observation is assigned to the class whose centroid is the closest, where distance can be expressed as a Mahalanobis metric using the pooled within-group covariance matrix [18]. LDA functions are orthogonal to previous ones, and a number of discriminant functions of either G-1 or J, whichever is less, where G represents the categories number of the dependent variable and J represents the exploratory variables number [19]. The linear discriminant model is provided below:

$$\:{D}_{jk}={b}_{o}+\:{b}_{1}{x}_{1k}+{b}_{2}{x}_{2k}+\dots\:+{b}_{n}{x}_{ik}$$
(1)

Where: \(\:{D}_{jk}\) is the discriminant score of discriminant function j for the milk yield level k, \(\:{b}_{o}\:is\:\)a constant term, \(\:{b}_{i}\) is the \(\:{i}^{th}\) predictor’s discriminant coefficient, and \(\:{x}_{ik}\:\)represent the \(\:{i}^{th}\) explanatory variable (AFC, LO, DO, DIM, DP, CFS, 305-MY, CI, TBRD) for the milk yield level k.

The function generates discriminant scores, which estimate predicted probabilities for each categorical outcome variable. These scores, along with group means (centroids), aid in classifying cases into groups. A variable’s significance in explaining an outcome is indicated by large coefficients [20]. LDA has limitations since great accuracy can be achieved with strong multivariate normality assumptions and equal covariance matrices. However, these assumptions are uncommon in practice [2].

Flexible discriminant analysis

Flexible discriminant analysis is a non-parametric generalization of linear discriminant analysis. It can be regarded as a technique of postprocessing a multi-outcome regression with an optimal scoring approach. The FDA procedure is similar to LDA, replacing linear regression with non-parametric or semi-parametric regression, allowing for various regression tools and discrimination rules. This enable a more flexible class boundaries and better discrimination [21, 22]. FDA was used to develop a coordinate system capable of distinguishing between all three levels of milk yield. To maximize separation between the data points of various milk yield types and to minimize variation among the data points within each milk production type, the data were mapped onto a two-dimensional plane. To determine the decision boundary, a regression is run, with the parameters set to reduce the average-squared residual (ASR) [22].

$$\begin{gathered} ASR = \:\frac{1}{N}\sum {\:_{k = 1}^K} \hfill \\\left[ {\sum {\:_{i = 1}^N} {{\left( {\theta {\:_k}\left( {{g_i}} \right) - \eta {\:_k}\left( {{x_i}} \right)\:} \right)}^2} + \lambda \:\:L\:\left( {\eta {\:_k}} \right)} \right] \hfill \\ \end{gathered}$$
(2)

Where the function θ allocates scores for the most accurate prediction of transformed class labels, N denotes the number of observations, K denotes the number of categories, η is the regression fit using nonparametric regression techniques, and L denotes the regularization penalty function used to penalize roughness in the regression fit, with a weighting determined by parameter λ. The FDA permits the use of nonparametric regression techniques, including multivariate additive splines (MARS) [23], neural networks [24], and generalized additive models [22] in model fitting. MARS is a spline regression that substitutes basis functions for the initial data as predictors. The MARS basis function change permits to specifically blank out regions of a variable by setting them to zero. This enables MARS to concentrate on specific sub-regions of the data, obtain optimal variables and interactions, and handle complex data structures with high dimensions [25].

In brief, the calculations are as follows:

  1. 1.

    Multivariate nonparametric regression: provide fitted values \(\:\widehat{Y}\) for a multiresponse variable, adaptive nonparametric regression of Y on X.\(\:{S}_{\:\lambda\:\:}\) is the linear operator that best fits the final model, and \(\:{{\eta\:}^{*}}_{\mathcal{l}}\:\)is the vector of fitted regression functions, shall be considered.

  2. 2.

    Optimal scores: calculate the eigen-decomposition of \(\:{Y}^{T}\widehat{Y}={Y}^{T}{S}_{\:\lambda\:\:\:}Y\), in which the eigenvectors \(\:{\Theta\:}\) are normalized, using the formula: \(\:{{\Theta\:}}^{T}{D}_{\pi\:}{\Theta\:}\:=\:I.\) A diagonal matrix of the estimated class prior probabilities is calculated as \(\:{D}_{\pi\:}=\:{Y}^{T}Y/N\)

  3. 3.

    Update the statistical model from Step 1 by applying the optimal scores: \(\:\eta\:\left(x\right)={{\Theta\:}}^{T}{\eta\:}^{*}\left(x\right).\)

Assumptions

The choice between LDA and FDA is more dependent on assumptions beyond each approach. In theory, the FDA is more flexible in terms of assumptions, especially variable distributions, and outliers. However, they both share some assumptions, such as observation independence and the absence of multicollinearity between predictors in datasets [26]. Hence, LDA required the testing of certain fundamental assumptions before its application. The independent variables were examined for multivariate normality, so, the predictors should be ratio or interval levels. Mahalanobis distance against chi-square was used to determine the presence of outliers. A simple scatter plot was used to assess the data’s linearity assumptions between the predictor and response variables [27]. The homogeneity of covariance matrices among dependent variable levels was verified using Box’s M test [28] while the variance inflation factor (VIF) was used to examine the multicollinearity of suggested discriminators; a VIF value of 1 indicates no correlation, a VIF value between 1 and 5 indicates moderate correlation, and a VIF value greater than 5 indicates strongly correlated [29].

Classification model evaluation

According to Sokolova and Lapalme [30], the prediction model is evaluated using overall classification accuracy (ACC), which measures the ratio of correctly classified observations to total observations. Researchers often use other metrics like sensitivity (SEN) and precision (PREC) to measure the accuracy of classifiers. SEN is calculated as (TP / (TP + FN)), which is the percentage of target class labels correctly classified as desired [31, 32], while PREC is calculated as (TP / (TP + FP)), which is the proportion of correctly classified examples. TP represents true positives predicted by the classifier, TN represents true negatives, FP represents false positives, and FN represents false negatives [33, 34]. An additional metric is known as the F1-score, which is regarded as the harmonic mean of SEN and PREC. It examines the precision and sensitivity of the model in identifying positive cases (PREC) as well as its ability to prevent missing positive examples (SEN). The F1-score is defined as follows: \(\:\:\text{F}1-\text{s}\text{c}\text{o}\text{r}\text{e}=\left(\:2*SEN*PREC\right)/\left(SEN+PREC\right)\) [30].

Software and packages used

Linear discriminant analysis was performed using the MASS package while FDA using the MARS method was performed using the “fda” function from the “mda” package. All analyses were carried out using the R programming language [35] and SPSS version 25.

Results

Tests of assumptions

First, the study checked the assumptions of LDA. There are no missing values in the data. Box plot indicated that the data contains univariate outliers in the majority of the explanatory variables as illustrated in Fig. 1. Whereas multivariate outlier based on Mahalanobis distance revealed that there are many outliers, as demonstrated in Fig. 2 below.

Fig. 1
figure 1

Box plot revealed that the data contains outliers in most of the predictors

Fig. 2
figure 2

Multivariate outliers using Mahalanobis distance

The homogeneity of covariance matrices was tested using the Box’s M statistic, which showed the violation of that assumption \(\:({\text{B}\text{o}\text{x}}^{{\prime\:}}\text{s}\:\text{M}=342,\:\text{a}\text{n}\text{d}\:\text{p}<0.001\:).\:\:\)Results indicated that the univariate normality of independent variables was violated while multivariate normality was met by the Henze-Zirkler test \(\:(\text{H}\text{e}\text{n}\text{z}\text{e}-\text{Z}\text{i}\text{r}\text{k}\text{l}\text{e}\text{r}\:\:\text{v}\text{a}\text{l}\text{u}\text{e}=\:106.95,\:\text{p}\:<\:0.001)\). Nonetheless, the analysis was carried out in violation of this assumption as the discriminant analysis exhibits good results in face and object recognition, even though the normality assumption is frequently violated [36]. Scatter plots tested the linearity assumption, showing a linear relationship between the outcome variable and only a few independent variables. The VIF test was used, and there is a moderate degree of multicollinearity as \(\:\text{V}\text{I}\text{F}\:\text{b}\text{e}\text{t}\text{w}\text{e}\text{e}\text{n}\:2\:\text{t}\text{o}\:5\:\) for three parameters (DO= 5.84, DIM= 3.95, and CI=2.36). As VIF does not indicate which pair of predictors are correlated, a correlation analysis was performed to determine which parameters are correlated. The results revealed that days open have a significant relationship with days in milk (\(\:r=0.85\)), and days open are also highly correlated with calving interval (\(\:r=0.76\)). As a result, we dropped the DO variable to eliminate multicollinearity.

Linear discriminant analysis

The LDA model calculates group means and the probability of belonging to the different categories for each individual. In other words, the average of each explanatory variable within each category. These values could suggest that the DIM may have a greater influence on high milk production (61.6) than on medium (23.6) and low (37.5) milk production, as was also seen with 305-MY. The model’s group means indicate good separation ability between milk production levels (high, medium, and low), as displayed in Table (1).

Table 1 Group means for each predictor within each class using LDA

The study found that all independent variables significantly contribute to data classification, indicating a significant difference between milk production levels. The Wilks’ lambda value decreases as the independent variable becomes more significant in the discrimination process. Therefore, as shown in Table (2), the most significant variables were 305-day milk yield (\(\:\gamma\:\)= 0.610), Days in milk (\(\:\gamma\:\) = 0.909), and second parity (\(\:\gamma\:\) = 0.930).

Table 2 Predictors’ role in explaining outcome using linear discriminant analysis

Table (3) displays the model’s coefficients, which are linear combinations of predictor variables used to create the LDA decision rule. Standardized canonical discriminant coefficient was used to rank the significance of each variable. A high standardized discriminant function coefficient indicates significant differences between groups on a particular variable, so the first three parities are considered the most significant factors based on these coefficients. This result contradicts Wilk’s test results for the significance of predictors observed in Table (2), which show the impact of outliers and collinearity on the data. Therefore, we focused on the structure matrix as its results matched Wilk’s test’s rank of predictor importance. According to the factor loadings, 305-MY and DIM are important parameters of discrimination on \(\:{\text{L}\text{D}}_{1}\), whereas the first and second parities are on \(\:{\text{L}\text{D}}_{2}\).

Table 3 Standardized and unstandardized discriminant coefficients for the predictors

The canonical discriminant function is built using unstandardized coefficients:

$$\eqalign{& \>{\rm{L}}{{\rm{D}}_1} = \>\>1.21 + \>2.542{\rm{*}}\>{\rm{parity}}1\> + \>2.610{\rm{*parity}}2\> \cr & + \>2.129{\rm{*parity}}3\> + \>1.278{\rm{*parity}}4 + \>1.257{\rm{*parity}}5 \cr & + \>1.004{\rm{*parity}}6 + \>1.386{\rm{*parity}}7 + \>0.147{\rm{*Autumn}}\> \cr & - \>0.198{\rm{*Spring}}\> + \>0.222{\rm{*Summer}} - \>0.002{\rm{*DIM}} \cr & + \>0.003{\rm{*DP}}\> + \>0.002{\rm{*TBRD}}\> - \>0.027{\rm{*CI}}\> \cr & + \>0.00{\rm{*}}305{\rm{MY}} + 0.002{\rm{*AFC}}.\> \cr}$$

Similarly,

$$\eqalign{& \>{\rm{L}}{{\rm{D}}_2} = \>\>2.218 + \>2.155{\rm{*}}\>{\rm{parity}}1\> + \>0.311{\rm{*parity}}2\> \cr & + \>1.138{\rm{*parity}}3\> + \>1.136{\rm{*parity}}4 + \>2.479{\rm{*parity}}5 \cr & + \>2.787{\rm{*parity}}6 + \>3.458{\rm{*parity}}7 - \>0.805{\rm{*Autumn}}\> \cr & + \>0.457{\rm{*Spring}}\> - \>0.375{\rm{*Summer}} + \>0.001{\rm{*DIM}} \cr & - \>0.006{\rm{*DP}}\> - \>0.043{\rm{*TBRD}}\> - \>0.210{\rm{*CI}}\> \cr & + \>0.00{\rm{*}}305{\rm{MY}} - 0.012{\rm{*AFC}}. \cr}$$

There were two functions created: the first had a higher eigenvalue of 0.841 and explained 84.6% of the total variance in the outcome (milk yield level), and the second had an eigenvalue of 0.153 and explained 15.4% of the total variance in the outcome variable (Table 4). Eigenvalue is a measure of the variance that a function can explain. \(\:{R}^{2}\) (Coefficient of multiple determination) is defined as the square of the canonical correlation value.

Table 4 Eigen values and Wilks’ Lambda for linear discriminant functions

The first discriminant function had mean scores for low milk production of 0.997, medium and high levels of 0.216 and − 1.217, respectively, while the second discriminant function had mean scores of -0.352 for low milk production and 0.546 and − 0.192 for medium and high levels (Table 5).

Table 5 Group centroids of low, medium and high-level milk yield by LDA

The LDA model had a good discriminating accuracy of 71% in both the training and testing sets, which reveals that 71% of the milk production was properly classified by the model as displayed in Table (6). By 79%, 63%, and 71%, respectively, the model classifies high, medium, and low cases properly. Only 78.38% of all high cases have been identified by the model, 72.52% of all medium cases were detected by the model, and only 62.58% of all low cases were identified.

Table 6 LDA evaluation metrics for the model performance on train and test sets

The discriminant function plot revealed that the low level had the highest value on function I (Fig. 3). Function II played a minor role in the discrimination process compared to function I. Three milk yield levels membership appeared on the graph at nearly the same level, indicating poor discrimination between groups.

Fig. 3
figure 3

Linear discriminant analysis biplot revealed that the discrimination between group means isn’t good as group means are so close

Flexible discriminant analysis

Linear discriminant analysis requires a lot of assumptions to be held. As the results showed a violation of all assumptions, we applied FDA which tolerated the non-linearity of the data. Regarding multivariate normality, this assumption is rarely met. FDA using the MARS method was used, and the model discriminated the data by two dimensions with a misclassification error of 0.17. The first dimension accounted for 67.4% of the between-group variance, and by the second dimension totally reached 100%.

Table 7 displays the flexible discriminant analysis coefficients. The results revealed that the first and second parities are important parameters of discrimination on \(\:{\text{F}\text{D}}_{1}\), whereas the sixth and seventh parities are on \(\:{\text{F}\text{D}}_{2}\).

Table 7 Coefficients of the model for the first two dimensions of flexible discriminant analysis

The canonical discriminant function is built using unstandardized coefficients:

$$\eqalign{& \>F{D_1} = \>\>1.069 - 1.044*\>parity1 - 1.200*parity2 \cr & - 0.899*parity3 - 0.493*parity4 - 0.413*parity5 \cr & - 0.263*parity6 - 0.403*parity7 + 0.553*parity8\> \cr & - 0.118*Spring\> + \>0.124*Summer - 0.007*Winter \cr & + \>0.149*DP\> + \>0.071*TBRD\> - 0.088*CI \cr & - 0.026*305MY - 0.072*AFC - 0.127*DO. \cr}$$

Similarly,

$$\eqalign{& \>F{D_2} = \>\> - 0.532 + \>0.841*\>parity1\> + \>0.250*parity2 \cr & + \>0.468*parity3\> + \>0.408*parity4 + \>0.863*parity5 \cr & + \>0.935*parity6 + \>1.181*parity7 - 0.089*parity8 \cr & - 0.257*Spring\> - \>0.136*Summer - 0.005*Winter \cr & - 0.167*DP\> - \>0.079*TBRD\> - 0.029*CI\> \cr & + \>0.004*305MY - 0.104*AFC - 0.110*DO. \cr}$$

The class score means were presented in Table (8), with clear separation in the first dimension, however in the second dimension, the high and low categories are quite near. Overall, the separation appears to be satisfactory as seen in Fig (4).

Table 8 Group centroids of low, medium and high-level milk yield by FDA
Fig. 4
figure 4

Flexible discriminant analysis’s discrimination biplot. The plot showed remarkable differentiation between the three categories of milk production. (1) The color orange stands for a high milk production. (2) The color blue stands for a medium category. (3) The color green stands for a low category

The FDA model showed a superior discriminating accuracy of 82.45% and 83.19% in the training and testing sets, respectively, indicating accurate identification of milk production, as shown in Table (9). In the training set, the model correctly classifies high, medium, and low instances by 82.71%, 87.16%, and 77.39%, respectively. The model identified only 83.43% of all high cases, 75.90% of all medium cases, and 90.20% of all low cases.

Table 9 FDA evaluation metrics for the model performance on train and test sets

Comparison of classification techniques

The suggested LDA and FDA model’s robustness is validated through classification performance measures, each metric is reported for the three classes (high, moderate, and low). Notably, the FDA model achieved higher sensitivity (0.83, 0.87, and 0.77), precision (0.83, 0.759, and 0.90), and F1 score (0.83, 0.81, and 0.83) compared to LDA achieved the highest metrics for the (high) class (0.79, 0.78, and 0.79) for sensitivity, precision, and F1 score, respectively. Additionally, the FDA model exhibited a higher overall classification accuracy of 82% compared to LDA’s 71% (Tables 6 and 9). The model’s predictive power is low, and to improve results, the days open variable was removed. The findings, however, demonstrated that the assumptions were invalid. As a result, flexible discriminant analysis is created.

The evaluation on the test set revealed a clear advantage for the FDA model in terms of accuracy (82.45 − 83.6%) compared to LDA (71.09 − 76.0%). This difference suggests that LDA may be overfitting the training data, leading to lower generalizability on unseen data. While LDA achieved reasonable accuracy for the “high” class, its performance suffered for the “moderate” and “low” classes. This suggests an inability to accurately predict these categories. Conversely, the FDA model demonstrated good to high sensitivity across all classes (0.85, 0.67, and 0.67 for high, moderate, and low, respectively). While precision and F1-score for FDA decreased for moderate and low classes compared to the high class, they remained superior to LDA’s performance for all classes.

Our study aimed to evaluate the impact of assumption violations on model performance. However, we observed that both models, particularly LDA, struggled to predict some classes due to the inherent class imbalance in the data. To isolate the effect of assumption violation and ensure a fair comparison, we re-evaluated the models using a balanced test set. This step aimed to mitigate any potential confounding effect of class imbalance on our results and ensure our choice of model was based solely on the impact of assumption violations. The re-evaluated results, considering the balanced test set, further confirmed the superiority of the FDA model.

Discussion

The current study classified and predicted Friesian milk yield using two classification methods, LDA and FDA. Based on the literature review, no similar studies were observed that implement and investigate LDA assumptions violations and compares performance with the FDA in prediction and classification. Based on past studies, the accurate classification method can vary depending on the data structure. This study highlights the importance of considering data characteristics and model assumptions when selecting classification methods for biological data analysis. El-Bayomi et al. [37] emphasize the value of choosing appropriate statistical techniques to ensure reliable and accurate outcomes.

The data initially displayed a severe imbalance, addressed through up-sampling, as recommended by Hassan et al. [38]. We opted against Synthetic Minority Over-sampling Technique (SMOTE) due to its potential drawbacks for sparse minority classes [39] and using SMOTE can disrupt the class separation that discriminant analysis relies on [40].

The LDA model exhibits violation of all assumptions and data contains outliers, even though the multivariate normality assumption is rarely met but still all other assumptions are violated. Therefore, we used another adaptable model that doesn’t necessitate linearity, and the results showed that FDA was superior to LDA in prediction and discrimination, which is supported by Moisen and Frescino [41], who noticed that the MARS model performed more accurately for predicting forest aspects than linear models.

If the data set is large and assumptions are violated, the FDA outperforms the LDA. Multidimensional data with outliers and multicollinearity can make it challenging for the LDA model performance to be applied. Application of machine learning algorithms such as artificial neural networks, random forests, and other machine learning models can improve results as stated by Gouda et al. [42], who reported that machine learning can handle both multidimensional and small datasets.

Selecting appropriate evaluation metrics is crucial and challenging, especially for imbalanced data. Traditional metrics like accuracy can be misleading in such scenarios Sokolova and Lapalme [30]. A model might achieve high accuracy by simply predicting the majority class most of the time, but this wouldn’t reflect its ability to correctly classify the minority class. Additionally, the Kappa value is inappropriate evaluation metric due to its potential bias and inaccuracy, as highlighted by previous study [43]. To address this challenge, we employed precision, recall, and F1 score as primary evaluation metrics. These metrics are well-suited for imbalanced datasets providing a more comprehensive picture of model performance [44].

Previous studies assessed LDA and FDA’s effectiveness using various criteria. As stated by Solberg [18], the FDA is a less computationally intensive algorithm for non-Gaussian feature classification without advanced parameter specifications. Öztürk and Özdamar [9] stated that FDA performed better than LDA. Furthermore, they concluded that FDA methods can be applied to nonnormal and heterogeneous data sets, while LDA should be applied to normal and homogeneous data sets. LDA is effective for classification but not for problems with non-normal distributions, according to Fukunaga [45]. Naghibi et al. [46] found that LDA is sensitive to outliers and cannot ensure a linear combination between the explained variable and other explanatory variables. So, the FDA exhibited the best performance among discriminant analysis models. However, more guidelines are required to choose between these methods properly.

According to LDA, the most important predictors in the discrimination process are DIM, 305-MY. This result is in agreement with the outcomes reported by Vijayakumar et al. [47] and Akkuş et al. [12] who stated that the amount of milk produced by Holstein cows was affected by lactation length. Some studies reported a negative correlation between DIM and milk yield [48, 49], others Mellado et al. [50] stated that there is low correlation between the total milk production and the 305-MY.While parities are a major factor in the FDA’s discriminating process, this result is consistent with Munim et al. [51], M’hamdi et al. [52], and Moawed and Abd El-Aziz [16] who observed a significant \(\:(P<0.05)\) influence of parity on milk output. Nevertheless, the findings contrasted with those of Habib et al. [53], who indicated that lactation number did not affect milk output \(\:(P>0.05)\).

Conclusion

Linear discriminant analysis and flexible discriminant analysis are both suitable for classifying dairy cow data and yielding useful information, but their assumptions and methodology differ. Linear discriminant analysis requires more assumptions about underlying data, while the FDA doesn’t. Flexible discriminant analysis is more robust in cases of assumption violations. Overall, the FDA is a more powerful and adaptable classification method than LDA. However, the computing cost is higher, and the findings may be more challenging to understand.

Data availability

No datasets were generated or analysed during the current study.

Abbreviations

LDA:

Linear discriminant analysis

FDA:

Flexible discriminant analysis

AFC:

Age at first calving

LO:

Lactation order

DO:

Days open

DIM:

Days in milk

DP:

Dry period

CFS:

Calving season

305-MY:

305-day milk yield

CI:

Calving interval

TBRD:

Total breedings per conception

MARS:

Multivariate additive splines

VIF:

Variance inflation factor

ACC:

Overall classification accuracy

SEN:

Sensitivity

PREC:

Precision

SMOTE:

Synthetic Minority Over-sampling Technique

ROS:

Random over sampling

References

  1. Nayak BK, Hazra A. How to choose the right statistical test? Indian J Ophthalmol. 2011;59(2):85.

    Article  PubMed  PubMed Central  Google Scholar 

  2. Yan H, Dai Y. The comparison of five discriminant methods. 2011 International Conference on Management and Service Science; Wuhan, China: IEEE; 2011. pp. 1–4.

  3. Iyer E, Murti V, Arora V. A Comparison of Artificial Neural Network and Multiple Discriminant Analysis Models for Bankruptcy Prediction in India. Int J Allied Pract Res Rev. 2016;3(1):12–21.

    Google Scholar 

  4. Mele M, Macciotta N, Cecchinato A, Conte G, Schiavon S, Bittante G. Multivariate factor analysis of detailed milk fatty acid profile: effects of dairy system, feeding, herd, parity, and stage of lactation. J Dairy Sci. 2016;99(12):9820–33.

    Article  CAS  PubMed  Google Scholar 

  5. Moawed SA, Osman MM. The robustness of Binary Logistic Regression and Linear Discriminant Analysis for the classification and differentiation between dairy cows and Buffaloes. Int J Stat Appl. 2017;7(6):304–10.

    Google Scholar 

  6. Radwan H, El Qaliouby H, Elfadl EA. Classification and prediction of milk yield level for Holstein Friesian cattle using parametric and non-parametric statistical classification models. J Adv Veterinary Anim Res. 2020;7(3):429–35.

    Article  Google Scholar 

  7. Hastie T, Tibshirani R, Friedman JH. The elements of statistical learning: Data Mining, Inference, and Prediction. 2nd ed. Springer New York, NY; 2009.

  8. Reynès C, Sabatier R, Molinari N. Choice of B-splines with free parameters in the flexible Discriminant Analysis Context. Comput Stat Data Anal. 2006;51(3):1765–78.

    Article  Google Scholar 

  9. Öztürk A, Özdamar K. Comparison of Linear, quadratic and flexible discriminant analysis by using generated and Real Data. J Clin Pract Res. 2008;30(4):266–77.

    Google Scholar 

  10. NRC. Nutrient Requirements of Dairy Cattle. Seventh Revised Edition. Washington, DC; 2001.

  11. Akkus O, Ozkoc H. A comparison of the models over the data on the interest level in politics in Turkey and Countries that are members of the European Union: Multinomial or ordered Logit Model? Res J Appl Sci Eng Technol. 2012;4(19):2646–57.

    Google Scholar 

  12. Akkuş ö. Sevinc v, Takma ç, Güneri öi. Estimation of Parametric single index ordered Logit Model on milk yields. Kafkas Üniversitesi Veteriner Fakültesi Dergisi. 2019;25(5):597–602.

    Google Scholar 

  13. Akkus Q, Sevinc V. Use of ordered Logit Model with Time Series Data for determining the factors affecting the milk yield of Holstein friesians. Indian J Anim Res. 2020;54(4):488–93.

    Google Scholar 

  14. El-Kasrawy NI, Swelum AA, Abdel-Latif MA, Alsenosy AA, Beder NA, Alkahtani S, et al. Efficacy of different drenching regimens of gluconeogenic precursors during Transition Period on Body Condition score, production, Reproductive Performance, subclinical ketosis and economics of dairy cows. Animals. 2020;10(6):937.

    Article  PubMed  PubMed Central  Google Scholar 

  15. Fathy A, Elsayed DH, Ibrahim IM, El-Azzazi FE. Multivariate Analysis on Reproductive and productive traits of Egyptian buffaloes. J Adv Veterinary Res. 2023;13(3):394–9.

    Google Scholar 

  16. Moawed SA, Abd El-Aziz AH. The estimation and interpretation of ordered Logit models for assessing the factors connected with the Productivity of holstein–friesian dairy cows in Egypt. Trop Anim Health Prod. 2022;54(6):345.

    Article  PubMed  PubMed Central  Google Scholar 

  17. Cramer D. Advanced quantitative data analysis. 1st ed. McGraw-Hill Education (UK); 2003.

  18. Solberg AS. Texture Fusion and Classification Based on Flexible Discriminant Analysis. In Proceedings of 13th International Conference on Pattern Recognition. 1996;2:596–600.

  19. Poulsen J, French A. Discriminant function analysis. Biology 710—Advanced Biometry. San Francisco, CA, USA: San Francisco State University; 2008.

    Google Scholar 

  20. Worth AP, Cronin MTD. The use of discriminant analysis, logistic regression and classification tree analysis in the development of classification models for human health effects. J Mol Struct (Thoechem). 2003;622(1):97–111.

    Article  CAS  Google Scholar 

  21. Ripley BD. Neural networks and related methods for classification. J Roy Stat Soc B. 1994;56(3):409–37.

    Article  Google Scholar 

  22. Hastie T, Tibshirani R, Buja A. Flexible discriminant analysis by Optimal Scoring. J Am Stat Assoc. 1994;89(428):1255–70.

    Article  Google Scholar 

  23. Friedman JH. Multivariate adaptive regression splines. Annals Stat. 1991;19(1):1–67.

    Google Scholar 

  24. Hertz J, Krogh A, Palmer RG, Horner H. Introduction to the theory of neural computation. Phys Today. 1991;44(12):70.

    Article  Google Scholar 

  25. Steinberg D, Colla P, Kerry M. MARS user guide. San Diego, CA, USA: Salford Systems; 1999.

    Google Scholar 

  26. Hallgren W, Santana F, Low-Choy S, Zhao Y, Mackey B. Species distribution models can be highly sensitive to Algorithm Configuration. Ecol Model. 2019;408:108719.

    Article  Google Scholar 

  27. Utts JM. Seeing through statistics. 4th ed. Stamford, CT: Cengage; 2014.

    Google Scholar 

  28. Hahs-Vaughn DL. Applied Multivariate Statistical concepts. 1st ed. Routledge: New York, NY, USA;; 2016.

  29. Suleiman S, Badamsi S. Effect of Multicollinearity in Predicting Diabetes Mellitus using statistical neural network. Eur J Adv Eng Technol. 2019;6(6):30–8.

    CAS  Google Scholar 

  30. Sokolova M, Lapalme G. A systematic analysis of performance measures for classification tasks. Inf Process Manage. 2009;45(4):427–37.

    Article  Google Scholar 

  31. Weng SF, Reps J, Kai J, Garibaldi JM, Qureshi N. Can Machine-Learning Improve Cardiovascular Risk Prediction using Routine Clinical Data? PLoS ONE. 2017;12(4):e0174944.

    Article  PubMed  PubMed Central  Google Scholar 

  32. Kourou K, Exarchos TP, Exarchos KP, Karamouzis MV, Fotiadis DI. Machine learning applications in Cancer Prognosis and Prediction. Comput Struct Biotechnol J. 2015;13:8–17.

    Article  CAS  PubMed  Google Scholar 

  33. Higaki S, Miura R, Suda T, Andersson LM, Okada H, Zhang Y, et al. Estrous detection by continuous measurements of vaginal temperature and conductivity with supervised machine learning in cattle. Theriogenology. 2019;123:90–9.

    Article  CAS  PubMed  Google Scholar 

  34. Asri H, Mousannif H, Al Moatassime H, Noel T. Using machine learning algorithms for breast cancer risk prediction and diagnosis. Procedia Comput Sci. 2016;83:1064–9.

    Article  Google Scholar 

  35. R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2021.

    Google Scholar 

  36. Hart PE, Stork DG, Duda RO. Pattern classification. 2nd ed. Wiley Hoboken; 2000.

  37. El-Bayomi K, Mohamed F, Eltarabany M, Gouda H. Application of different biostatistical methods in Biological Data Analysis. Zagazig Veterinary J. 2019;47(2):203–12.

    Article  Google Scholar 

  38. Hassan FA, Moawed SA, El-Araby IE, Gouda HF. Machine learning based prediction for solving Veterinary data problems: a review. J Adv Veterinary Res. 2022;12(6):798–802.

    Google Scholar 

  39. Branco P, Torgo L, Ribeiro RP. A Survey of Predictive modeling on Imbalanced domains. 2016;49(2%J ACM Comput. Surv.):Article 31.

  40. Blagus R, Lusa L. SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics. 2013;14:106.

    Article  PubMed  PubMed Central  Google Scholar 

  41. Moisen G, Frescino T. Comparing five modelling techniques for Predicting Forest characteristics. Ecol Model. 2002;157(2–3):209–25.

    Article  Google Scholar 

  42. Gouda HF, Hassan FA, El-Araby EE, Moawed SA. Comparison of machine learning models for Bluetongue Risk Prediction: a seroprevalence study on small ruminants. BMC Vet Res. 2022;18(1):1–10.

    Article  Google Scholar 

  43. Li M, Gao Q, Yu T. Kappa statistic considerations in evaluating inter-rater reliability between two raters: which, when and context matters. BMC Cancer. 2023;23(1):799.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  44. Riyanto S, Sitanggang IS, Djatna T, Atikah TDJIJoACS. Applications. Comparative Analysis using Various Performance Metrics in Imbalanced Data for Multi-class Text Classification. 2023.

  45. Fukunaga K. Introduction to statistical pattern recognition. 2nd ed. Academic; 2013.

  46. Naghibi SA, Pourghasemi HR, Abbaspour K. A comparison between ten Advanced and Soft Computing models for Groundwater Qanat Potential Assessment in Iran using R and GIS. Theoretical Appl Climatology. 2018;131:967–84.

    Article  Google Scholar 

  47. Vijayakumar M, Park JH, Ki KS, Lim DH, Kim SB, Park SM, et al. The Effect of Lactation Number, Stage, length, and milking frequency on milk yield in Korean holstein dairy cows using Automatic Milking System. Asian-Australasian J Anim Sci. 2017;30(8):1093.

    Article  Google Scholar 

  48. Collier R, Dahl G, VanBaale M. Major Advances Associated with Environmental effects on dairy cattle. J Dairy Sci. 2006;89(4):1244–53.

    Article  CAS  PubMed  Google Scholar 

  49. Sabek A, Li C, Du C, Nan L, Ni J, Elgazzar E, et al. Effects of parity and days in milk on milk composition in correlation with β-hydroxybutyrate in Tropic dairy cows. Trop Anim Health Prod. 2021;53:1–8.

    Article  Google Scholar 

  50. Mellado M, Antonio-Chirino E, Meza-Herrera C, Veliz F, Arevalo J, Mellado J, et al. Effect of lactation number, year, and season of initiation of lactation on milk yield of cows hormonally induced into lactation and treated with recombinant bovine somatotropin. J Dairy Sci. 2011;94(9):4524–30.

    Article  CAS  PubMed  Google Scholar 

  51. Munim T, Hussain S, Hoque M, Khandoker M. Genetic and non-genetic effects on Productive and Reproductive traits of different genetic groups of cows. Bangladesh J Anim Sci. 2006;35(1):1–12.

    Google Scholar 

  52. M’hamdi N, Bouallegue M, Frouja S, Ressaissi Y, Brar SK, Hamouda MB. Effects of Environmental factors on milk yield, lactation length and dry period in Tunisian holstein cows. Milk Production-An Up-to-date overview of Animal Nutrition. Management and Health: IntechOpen; 2012.

    Google Scholar 

  53. Habib M, Bhuiyan A, Bhuiyan M, Khan A. Performance of Red Chittagong Cattle in Bangladesh Agricultural University Dairy Farm. Bangladesh J Anim Sci. 2003;32(1–2):101–8.

    Google Scholar 

Download references

Acknowledgements

Not applicable.

Funding

No funding was obtained for this study.

Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB).

Author information

Authors and Affiliations

Authors

Contributions

SAM: conceptualization, data curation, supervision, methodology, writing‑review & editing. EM: conceptualization, methodology, data cleaning and analyses and writing‑original draft, writing‑review &editing. AE: conceptualization, data curation, supervision, methodology, writing‑review & editing. HFG: conceptualization, data curation, supervision, methodology, writing‑review & editing. AF: conceptualization, data curation, supervision, methodology, writing‑review & editing. All authors have read and approved the manuscript.

Corresponding author

Correspondence to Esraa Mahrous.

Ethics declarations

Ethical approval and consent to participate

Not applicable.

Consent for publishing

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Moawed, S.A., Mahrous, E., Elaswad, A. et al. Milk yield prediction in Friesian cows using linear and flexible discriminant analysis under assumptions violations. BMC Vet Res 20, 392 (2024). https://doi.org/10.1186/s12917-024-04234-1

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12917-024-04234-1

Keywords