*By: Christos Palaiokostas, Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Uppsala, Sweden*
The ability to predict disease resistance using genomic information in aquaculture species has attracted considerable research efforts. In the current study, various Machine Learning (ML) models were evaluated in terms of their efficiency in detecting disease-resistant animals through their genomic profile.
Advancements in sequencing technologies over the last decade have transformed the field of aquaculture breeding and genomics. It is not uncommon nowadays for selection decisions in aquaculture breeding programs to be guided by genomic information derived either from single nucleotide polymorphism (SNP) arrays or from genotyping by sequencing (GBS) platforms.
Furthermore, a plethora of research studies in the last five years has demonstrated the value of genomic selection (GS) practices in a wide range of aquaculture species, including, amongst others, salmonids, tilapias, carps, bass, and oysters.
Current knowledge suggests that genomic information is particularly valuable in studying traits related to disease resistance. Disease outbreaks in farmed fish tend to be devastating both in economic and welfare aspects. Since there is a lack of efficient therapeutic agents for various commonly encountered diseases in aquaculture, selective breeding practices can offer solutions.
GS practices are usually considered the preferred route of action as resistance to diseases usually resembles a polygenic trait. Therefore, the most common applications of GS typically involve the usage of algorithms based on genomic best linear unbiased predictor (GBLUP) or its variants like single-step approaches and Bayesian linear regressions.
With a few exceptions, the vast majority of published studies to date have assessed the prediction efficiency of GS models for disease resistance based on data of a single generation.
The above is mainly due to two reasons:
1. Disease challenge experiments have high-cost requirements.
2. Aquaculture breeding programs are, to date, relatively new compared to their livestock counterparts, and in many cases, genomic information beyond a single generation is not available.

Therefore, most of the studies aiming to pick the best performing model for predicting disease resistance have used cross-validation strategies on animals from the same generation to train GS models and minimize the chances of overfitting. However, this approach does not necessarily identify the model that best predicts future performance, which is the overall aim of selective breeding. In contrast, in equivalent situations in livestock, it is common to train the GS models on multi-generational datasets and validate them on the latest generation(s).
“A common approach is to regard disease resistance as a binary trait. In such situations, the objective of the tested model is to efficiently classify the animals of each category (resistant vs. non-resistant) based on the available genomic information. However, limited attention has been paid to the scenario where the phenotypic distribution among resistant and non-resistant animals is skewed towards one category.”
Machine learning (ML) tools have been recently in the spotlight, finding applications in numerous real-life situations. ML algorithms have also been gaining momentum, finding applications in a wide range of prediction tasks in animal breeding. Even though no single model, whether based on ML or more affiliated with traditional animal breeding, seems to provide optimal predictions for all traits of interest and breeding schemes, ML appears to have a role in the animal breeder’s toolbox.
It should be noted that ML models, compared to commonly used animal breeding models, usually shine in scenarios where interactions amongst the model predictors influence the phenotype of interest.
In the current study, the prediction efficiency of Decision Trees (DT), Support Vector Machines (SVM), Random Forests (RF), and boosting-based approaches like AdaBoost and Extreme Gradient Boosting (XGB) was compared against GBLUP-MCMC. Each model's prediction efficiency was also evaluated in situations where the ratio of the two observed phenotypes (resistant vs. non-resistant) is unbalanced. Finally, the required computational time for training each ML model was benchmarked against GBLUP-MCMC.
Materials and methods
The QMSim software was used for simulating phenotypic datasets and their corresponding genotypic datasets. The initial historic population consisted of 2,000 generations with a constant size of 10,000 animals. The parameters used for simulating the historic population included equal sex ratio, random mating, and discrete generations. After that, ten discrete non-overlapping recent generations were simulated using a breeding design often encountered in salmonids.
“In particular, 100 sires were considered to be uniquely mated with 200 dams in each generation, with 30 animals from each family being phenotyped. The heritability of the simulated trait was equal to 0.3, with 300 biallelic and randomly located quantitative trait loci (QTL) affecting the trait. Furthermore, individuals from generations nine and ten (12,000 animals) were genotyped for 9,000 SNPs randomly distributed across a genome consisting of 30 chromosomes, each 100 cM in length.”
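The reported count of 12,000 genotyped animals follows from the breeding design. The short check below reproduces that arithmetic; it assumes each dam produces exactly one full-sib family (so 200 families per generation), which is our reading of the design rather than something stated explicitly. All variable names are illustrative.

```python
# Design figures quoted in the text.
dams_per_generation = 200
phenotyped_per_family = 30
genotyped_generations = 2   # generations nine and ten
n_snps = 9_000

# Assumption: one full-sib family per dam.
families_per_generation = dams_per_generation

animals_per_generation = families_per_generation * phenotyped_per_family
genotyped_animals = animals_per_generation * genotyped_generations

# Ratio of features (SNPs) to observations (animals) in the simulated data.
snps_per_animal = n_snps / genotyped_animals

print(f"animals per generation: {animals_per_generation}")
print(f"genotyped animals:      {genotyped_animals}")
print(f"SNPs per animal:        {snps_per_animal}")
```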
The animals were assigned into two categories using different thresholds on their true breeding value to simulate a binary phenotypic trait. The thresholds were chosen to simulate a scenario where the phenotypic distribution amongst the two categories (resistant vs. non-resistant) was approximately balanced and another scenario where the percentage of resistant and non-resistant animals was between 20 and 25 % and between 75 and 80 %, respectively.
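The thresholding step can be sketched as follows. This is a minimal illustration, not the study's code: the breeding values are drawn from a standard normal distribution as a stand-in for the simulated ones, animals with higher values are assumed to be the resistant ones, and the 22 % target is simply one hypothetical value inside the reported 20 to 25 % range.

```python
import random

def binary_phenotype(breeding_values, resistant_fraction):
    """Label the top `resistant_fraction` of animals (by breeding value) as resistant (1)."""
    ranked = sorted(breeding_values, reverse=True)
    n_resistant = round(resistant_fraction * len(ranked))
    # Lowest breeding value that is still counted as resistant.
    threshold = ranked[n_resistant - 1]
    return [1 if bv >= threshold else 0 for bv in breeding_values]

# Stand-in for the simulated true breeding values (assumption: standard normal).
rng = random.Random(42)
tbv = [rng.gauss(0.0, 1.0) for _ in range(12_000)]

balanced = binary_phenotype(tbv, 0.50)   # approximately balanced scenario
skewed = binary_phenotype(tbv, 0.22)     # skewed towards non-resistant animals

balanced_share = sum(balanced) / len(balanced)
skewed_share = sum(skewed) / len(skewed)
print(f"balanced scenario: {balanced_share:.1%} resistant")
print(f"skewed scenario:   {skewed_share:.1%} resistant")
```

Only the threshold changes between the two scenarios; the underlying breeding values, and therefore the genetic signal available to the models, stay the same.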
An intercept term (known as bias in ML terminology) and the SNP genotypes were used as predictors (known as features in ML terminology) in all the ML models. The response variable in all the tested scenarios was a vector containing the disease resistance status of each animal. In order to reduce overfitting, appropriate regularization hyperparameters for each model were applied. In the case of DT, the maximum tree depth was restricted to 8.
The magnitude of regularization in the case of SVM was controlled through the C parameter using a value of 1. For the ensembles RF and XGB, a learning rate of 0.1 was used to minimize overfitting, in addition to a maximum tree depth of 8. In the case of AdaBoost, the maximum tree depth was fixed to 1. Moreover, the ensembles (RF, AdaBoost, XGB) were fitted using a maximum of 2,000 base estimators.
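For quick reference, the regularization settings reported in this section can be collected in a plain dictionary. The values are transcribed from the text; the key names (`max_depth`, `learning_rate`, `n_estimators`, `C`) are our own labels and are not claimed to match the argument names of any particular library.

```python
# Regularization settings as reported in the text; key names are illustrative.
REPORTED_SETTINGS = {
    "DT": {"max_depth": 8},
    "SVM": {"C": 1},
    "RF": {"max_depth": 8, "learning_rate": 0.1, "n_estimators": 2000},
    "AdaBoost": {"max_depth": 1, "n_estimators": 2000},
    "XGB": {"max_depth": 8, "learning_rate": 0.1, "n_estimators": 2000},
}

# Models whose base trees are restricted to a single split (decision stumps).
stump_models = [
    name for name, settings in REPORTED_SETTINGS.items()
    if settings.get("max_depth") == 1
]

for name, settings in REPORTED_SETTINGS.items():
    print(f"{name}: {settings}")
print("stump-based:", stump_models)
```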
The prediction efficiency of each tested model was assessed using receiver operating characteristic (ROC) curves. The models were ranked based on the area under the curve (AUC) metric, which by construction ranges between zero and one, with the latter representing the perfect classifier.
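One way to make the AUC metric concrete is its rank interpretation: the probability that a randomly chosen resistant animal receives a higher predicted score than a randomly chosen non-resistant one, with ties counted as one half. The sketch below implements that definition directly in plain Python; the labels and scores are made-up toy values, and the function name `auc_score` is ours.

```python
def auc_score(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one animal from each category")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy data: 1 = resistant, 0 = non-resistant.
labels = [1, 1, 0, 0, 1, 0]
perfect = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3]   # every resistant animal outranks every other
uninformative = [0.5] * 6                  # same score for all animals
mixed = [0.9, 0.4, 0.6, 0.1, 0.7, 0.3]     # one resistant animal is ranked too low

print(auc_score(labels, perfect))
print(auc_score(labels, uninformative))
print(auc_score(labels, mixed))
```

The pairwise loop is quadratic in the number of animals, which is fine for illustration; library implementations typically use sorting instead.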
“Notably, the tested tree-based models (DT, RF, AdaBoost, XGB) provide estimates regarding the importance of each feature. With the exception of RF, the rest of these models also performed variable selection by assigning values of zero to certain features.”
Two different scenarios were tested in the current study regarding the phenotypic distribution of animals characterized as resistant or susceptible. More specifically, the model performance was tested in cases where the two recorded phenotypic categories had approximately an equal number of observations and in cases where the phenotypic distribution was skewed towards non-resistant animals. Overall, the model ranking was not affected by the ratio of resistant to non-resistant animals, with differences in AUC scores.
Carp resistance to the koi herpes virus
Model performance was inferred by following a 5-fold cross-validation scheme consisting of sets of 1,004 animals for training and 251 animals for validation purposes. The percentage of resistant animals amongst the training and validation sets ranged between 33–37%. Overall, the ranking of models was the same as in the case of the simulation datasets.
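The reported set sizes imply 1,255 animals in total (1,004 + 251), split into five folds of 251. A minimal sketch of such a split is shown below. It shuffles animal indices and does not stratify by resistance status; whether the original analysis stratified is not stated here, so this is an assumption, as are the function name and the seed.

```python
import random

def k_fold_indices(n_animals, k, seed=0):
    """Yield (training, validation) index lists for a shuffled k-fold split."""
    indices = list(range(n_animals))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes to the same fold.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

n_animals = 1_004 + 251   # total implied by the reported set sizes
splits = list(k_fold_indices(n_animals, k=5))

for fold_number, (training, validation) in enumerate(splits, start=1):
    print(f"fold {fold_number}: {len(training)} training, {len(validation)} validation")
```

Each animal appears in exactly one validation set, so every animal is predicted once by a model that never saw it during training.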
Hyperparameter tuning – computational time
The number of available hyperparameters for the ML models ranged between 5 and 18. AdaBoost had the lowest number of hyperparameters and XGB the highest. The extent to which hyperparameter tuning influenced predictive ability varied substantially amongst the tested models. Tuning had the most profound effect in the case of AdaBoost, where fixing the maximum allowed depth of the underlying DT classifiers to 1 resulted in a 40–50 % increase of the AUC score. On the other hand, changing the hyperparameter values from the default ones in the case of SVM resulted in worse predictions.
All ML models required substantially less computational time compared to GBLUP-MCMC for fitting and prediction purposes.
The ability to predict disease resistance using genomic information in aquaculture species has attracted considerable research efforts. In the current study, various ML models were evaluated regarding their efficiency in detecting disease-resistant animals through their genomic profile. Overall, promising results were obtained, with the predictions of the best performing ML models being close to or even better than the equivalent ones from GBLUP-MCMC.
Traditionally, the performance of various GS models for regression tasks in aquaculture species is mainly evaluated based on the so-called accuracy metric. This is, in fact, the Pearson correlation coefficient between the predicted values and either the true breeding values (in the case of a simulated dataset) or the phenotypic recordings, usually adjusted for fixed effects (in the case of empirical data), of the validation-test dataset. Interestingly, it was recently pointed out that reliance solely on the correlation coefficient can result in a non-optimal model selection.
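A small worked example shows one well-known limitation of this metric: the Pearson correlation is unchanged by shifting or rescaling the predictions, so a model with systematically biased predictions can still score a correlation of one. This is offered as a general illustration and is not necessarily the specific argument made in the work referred to above. The breeding values are made-up numbers.

```python
import math

def pearson_correlation(predicted, observed):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_p * var_o)

# Toy true breeding values of five validation animals (made-up numbers).
true_bv = [0.2, -1.1, 0.7, 1.5, -0.4]

# Predictions that rank the animals perfectly but are shifted and inflated.
predicted_bv = [2 * v + 10 for v in true_bv]

accuracy = pearson_correlation(predicted_bv, true_bv)
mean_error = sum(p - t for p, t in zip(predicted_bv, true_bv)) / len(true_bv)

print(f"correlation ('accuracy'): {accuracy:.3f}")
print(f"mean prediction error:    {mean_error:.2f}")
```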
The usage of the above accuracy term is the most common approach also for binary traits, even though the definition of correlation, in this case, could be deemed somewhat problematic. However, the accuracy term is also commonly encountered in a broad literature of various classification problems where it denotes the number of cases predicted successfully out of the whole prediction attempts.
Nevertheless, it can be argued that neither of the above definitions of accuracy is optimal for binary traits. More specifically, using accuracy to evaluate either GS or ML model performance in binary traits with a skewed ratio between the two observed phenotypic categories conveys limited practical value.
The results of the current study, including both simulated and empirical datasets, demonstrated that ML models could be successfully applied in classification problems relevant to breeding. According to the current results, the ranking of the tested models was not affected in the cases where an unbalanced distribution amongst the two observed phenotypes was used.
“Even though no application of XGB in aquaculture selective breeding seems to have been documented as of now in the literature, the results of the current study coupled with the fact that it is one of the most powerful ML algorithms suggest that it could be a valuable tool in future genetics studies of disease resistance in aquaculture.”
Interestingly, in other animal breeding studies XGB was amongst the best performing models in terms of prediction efficiency, both for sire conception rate in Holstein bulls and for simulated datasets. Furthermore, in the latter case, XGB ranked first in scenarios where non-additive genetic effects primarily controlled the trait of interest.
Notably, as is the case for most ML algorithms, XGB is particularly prone to overfitting, especially in datasets where the number of features (SNPs in the current case) far surpasses the number of observations. As such, XGB requires the a priori setting of regularization hyperparameters, which in the current case was achieved primarily by using the hyperparameters of learning rate and the maximum number of estimators.
“From all the tested ML models, hyperparameter fine-tuning had the most substantial effect in the case of Adaboost, where setting a single hyperparameter resulted in a 40–50 % increase of the AUC score. On the other extreme, changing hyperparameter values from the default ones resulted in worse predictions in the case of SVM, indicating that fine-tuning hyperparameters in ML is a far from trivial task.”
Especially in the case of models with a high number of hyperparameters like XGB, an exhaustive search would be particularly difficult and time-consuming. Interestingly, XGB, AdaBoost, and RF are ensemble learning algorithms relying on aggregating the outcomes of base estimators (e.g., weak learners like DT) following different optimization routes like bagging, pasting, or boosting. In all three cases, the most common base estimator is the DT, with the fundamental idea that aggregating the outcomes of several simple estimators can improve the prediction efficiency of the model compared to that of a single estimator.
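The intuition behind aggregation can be illustrated with a deliberately idealized calculation: if each base estimator is correct with the same probability and the estimators err independently of one another, the accuracy of a simple majority vote follows from the binomial distribution. Real ensembles violate the independence assumption, and boosting weights its estimators rather than counting votes equally, so the numbers below are an upper-bound style illustration only. The function name and the 0.6 accuracy are hypothetical.

```python
from math import comb

def majority_vote_accuracy(p_correct, n_estimators):
    """Probability that more than half of n independent estimators are correct."""
    if n_estimators % 2 == 0:
        raise ValueError("use an odd number of estimators to avoid tied votes")
    needed = n_estimators // 2 + 1
    return sum(
        comb(n_estimators, k) * p_correct**k * (1 - p_correct) ** (n_estimators - k)
        for k in range(needed, n_estimators + 1)
    )

# A weak learner that is right 60 % of the time (hypothetical value).
for n in (1, 11, 101):
    print(f"{n:>3} estimators: {majority_vote_accuracy(0.6, n):.3f}")
```

Under these assumptions a single weak learner stays at 0.6, while the vote improves steadily as more estimators are added.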
Even though gaining a full picture of the exact internal optimization route for each of the ensemble models is most challenging, it was evident from the acquired results that substantial differences exist in terms of the magnitude of variable selection.
Restricting our focus to the task of predicting disease resistance in aquaculture, and taking into consideration the wide variation of the underlying genetic mechanisms involved in various diseases, it is doubtful that a single model, whether from GS or ML, will be optimal for all cases.
“However, it is fair to state that GBLUP-MCMC is a robust approach, as was also clearly shown in the current study. Nevertheless, a significant advantage of the tested ML models lies in substantial reductions of computational time compared to GBLUP-MCMC in terms of model fitting.”
In the current study, a relatively high number of MCMC iterations was used, because the mixing of the MCMC chain is slow in the case of binary traits. Despite the above, it is still apparent that ML, mainly due to parallelization of the assigned tasks, clearly outperforms MCMC-based algorithms in terms of computational efficiency. Notably, more substantial differences could be expected between the two classes of models when high-performance computing (HPC) resources are used.
To illustrate why accuracy is of limited value for skewed binary traits: in a former study of genetic resistance of sea bream to pasteurellosis, where only approximately 5 % of the animals were resistant, a naïve classifier always predicting a non-resistant animal would have achieved an accuracy of about 0.95. For this reason, model assessment in the current study was performed with ROC curves using the AUC metric.
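The arithmetic behind this example is easy to verify. The sketch below builds a hypothetical population of 1,000 animals with 5 % resistant individuals (the population size is our choice for illustration) and scores a rule that labels every animal as non-resistant.

```python
# Hypothetical population: 1 = resistant (5 %), 0 = non-resistant (95 %).
labels = [1] * 50 + [0] * 950

# Naive rule: always predict "non-resistant".
predictions = [0] * len(labels)

# Accuracy: share of animals whose status is predicted correctly.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Sensitivity: share of truly resistant animals that the rule identifies.
resistant_found = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
sensitivity = resistant_found / sum(labels)

print(f"accuracy:    {accuracy:.2f}")
print(f"sensitivity: {sensitivity:.2f}")
```

The rule looks excellent by accuracy yet never identifies a single resistant animal, which is exactly the group a breeding program wants to find.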
“It is important to stress that the model evaluation was conducted based on disease resistance being simplified as a binary trait. Even though this approach is appealing from a practical perspective, it could be argued that genetic resistance against disease is a far more complicated process. As such, future studies, including information regarding the resilience and tolerance of the host against pathogens, can shed additional light and contribute to expediting genetic progress through selective breeding.”
Moreover, the performed simulations considered the genetic architecture of the trait as purely additive. Even though the latter has repeatedly proven to be a reliable approximation, it could well be the case that various interactive effects amongst the determining genetic components play an essential role in disease resistance. Interestingly, ML models usually shine in detecting non-linear patterns and interactions.
The results of the present study suggest that ML models can be valuable tools in aquaculture breeding studies that aim to predict disease-resistant animals. XGB was the model that ranked first, conveying a slight advantage over GBLUP-MCMC that ranged between 1 and 4 %. Furthermore, SVM and RF delivered competitive predictions as well. The application of DT alone is not recommended, as poor predictions were obtained consistently in all tested datasets. Finally, in terms of required computational time, all ML models outperformed GBLUP-MCMC.
*This is a summarized version developed by the Aquaculture Magazine editorial team of the original article “Predicting for disease resistance in aquaculture species using machine learning models” written by Christos Palaiokostas from the Department of Animal Breeding and Genetics at the Swedish University of Agricultural Sciences. The article was originally published through the Aquaculture Reports Journal of Elsevier in 2021 and it can be found online through this link:* https://doi.org/10.1016/j.aqrep.2021.100660