Reducing computational cost of large-scale genomic evaluation by using indirect genomic prediction

Graphical Abstract Summary: In the US dairy population, the number of genotyped animals has increased rapidly over the decade. The computation of large-scale genomic evaluations can be highly expensive, especially when conducting more frequent evaluations. One solution to reduce time and cost could be the use of indirect genomic prediction. We investigated how indirect prediction should be conducted and discovered a practical approach to calculate accurate and unbiased indirect genomic predictions using SNP marker effects from a small number of randomly selected genotyped animals. The results of this study can be applicable and useful in other breeds and species.


Graphical Abstract Summary
In the US dairy population, the number of genotyped animals has increased rapidly over the decade. The computation of large-scale genomic evaluations can be highly expensive, especially when conducting more frequent evaluations. One solution to reduce time and cost could be the use of indirect genomic prediction.
We investigated how indirect prediction should be conducted and discovered a practical approach to calculate accurate and unbiased indirect genomic predictions using SNP marker effects from a small number of randomly selected genotyped animals. The results of this study can be applicable and useful in other breeds and species.
H eavy computation is inevitable in large-scale genomic evaluations. Since the national genomic evaluation for US Holsteins started in 2009, the number of genotyped animals has increased considerably. When conducting genomic prediction for a large number of genotyped animals, solving large mixed model equations (MME) is the most time-consuming process. Constructing the inverse of the combined relationship matrix (H −1 )-which is composed of the inverse of the pedigree-based relationship matrix (A −1 ) for all animals, the inverse of the genomic relationship matrix (G −1 ), and the inverse of the pedigree-based relationship matrix A 22 1 − ( ) for all genotyped animals (Aguilar et al., 2010;Christensen and Lund, 2010)-is another key process in single-step genomic BLUP (ssGBLUP). The A −1 can be recursively obtained with Henderson's method (Henderson, 1976;Hudson et al., 1982), and the matrix by vector multiplication by iteration on data is highly efficient without constructing A −1 in solving the MME for a large number of animals. In addition, A 22 1 − can be calculated efficiently (Strandén and Mäntysaari, 2014;Masuda et al., 2017;Strandén et al., 2017). However, calculation of G −1 is expensive using a direct inversion method, even with sparse matrix operations (Pérez-Enciso et al., 1994;Masuda et al., 2015). With reference to Henderson's method, Misztal et al. (2014a) andFragomeni et al. (2015) proposed a recursive method called the algorithm of proven and young animals (APY). This method calculates the approximated G −1 for a large number of noncore genotyped animals based on the direct G −1 for a minimum number of core genotyped animals, assuming that these core animals represent most of the independent chromosome segments in the genome. As of 2020, the number of genotyped US Holsteins has reached over 3 million, with half a million genotyped animals being added every year (CDCB, 2020). Even using the APY method, the computational cost is still high when including all genotyped animals. This could create a bottleneck in the future, especially when conducting frequent evaluations (e.g., every month or every week). One way to overcome issues with computing cost is to remove old genotyped animals that have been already culled and had neither progeny, nor phenotypes, nor semen stock (Koivula et al., 2018). However, when millions of animals are genotyped in several years, removing old genotyped animals will not be a substantial solution to the heavy computation required in genomic evaluation. Another option is to remove the young genotyped animals with neither progeny nor phenotypes from the ssGBLUP and calculate indirect genomic predictions (IGP) for those animals. If SNP effects based on genomic (G)EBV for other animals have sufficient genomic information, IGP for those young genotyped animals can be calculated Reducing computational cost of large-scale genomic evaluation by using indirect genomic prediction S. Tsuruta, 1 * D. A. L. Lourenco, 1 Y. Masuda, 1 T. J. Lawlor, 2 and I. Misztal 1 by a linear function of SNP effects and contents. Therefore, it may be more practical to predict their genomic performance separately and indirectly, rather than to include them in ssGBLUP evaluations to obtain GEBV.
About 90% of the genotyped animals included in the genomic evaluation for type traits in US Holsteins are young females (heifers), which may have neither phenotypes for type traits nor progeny in the future, although they may have other phenotypes (e.g., production traits; Tsuruta et al., 2021). Including all of these animals in the main routine evaluation may not be reasonable because of the computing cost. Garcia et al. (2020) reported that the accuracy of IGP was as high as that of GEBV in American Angus data, where about 70% of the genotyped animals had phenotypes. In that study, they did not investigate which animals should be included in the computation of GEBV and IGP or how GEBV for those genotyped animals affect accuracy and bias in IGPs when those genotyped animals have neither phenotype nor progeny. When genotyped animals have phenotypes, those phenotypes should be used to calculate GEBV; therefore, those genotyped animals should not be the target for IGP. In addition, the total computing cost to obtain GEBV and IGP is unknown. The IGP, which is the genomic prediction for genotyped animals indirectly calculated from SNP marker effects based on GEBV for other genotyped animals, could be a practical choice to reduce the computational cost in genomic evaluations. This is true if the IGP has no impairment in accuracy and bias. The objective of this study was to investigate a practical approach to calculating IGP to reduce the computational cost without deteriorating accuracy and bias in genomic predictions for a large number (i.e., over 2 million) of genotyped animals in US Holsteins using 18 linear type traits.
Phenotypes for 18 linear type traits and pedigree information used in the 2018 genetic evaluation were provided by the Holstein Association USA Inc. (Brattleboro, VT). Genotypes up to 2018 were provided by the Council on Dairy Cattle Breeding (Bowie, MD). The full data set consisted of 10,946,264 repeated records for 7,216,767 cows, including 7,044,210 cows with no genotypes up to 2018 calving, 13,591,145 animals in the pedigree, and 2,334,951 animals genotyped for 79,294 SNP. Different start dates (ranging from 2014 until 2018) based on year of birth were used to create 5 different sets of genotyped animals. Genotyped animals with phenotypic records or progeny were included in each genomic data set to obtain GEBV with the ssGBLUP. Genotyped animals with neither progeny nor phenotypes and born after the start date would obtain IGP using the SNP effects based on the GEBV for other animals with phenotypes or progeny. First, genomic prediction was conducted via ssGBLUP using the full data set (i.e., benchmark). When calculating GEBV, 20K genotyped animals were randomly chosen as core animals for APY. The MME for the 18-trait animal model (Tsuruta et al., 2002) was solved via the BLUP90IOD program, which uses the preconditioned conjugate gradient method by iteration on data (Tsuruta et al., 2001). The convergence criterion of 10 −12 based on relative adjusted right-hand sides (Tsuruta et al., 2001) was used. The program was originally created to calculate BLUP and revised with the ssGBLUP feature later. Second, ssGB-LUP was conducted to predict GEBV using each genomic data set from 2014-2018 to 2018, in addition to all phenotypes and pedigree information as described before. Third, SNP effects were calculated via POSTGSF90 (Aguilar et al., 2014;Misztal et al.,  When calculating SNP effects by GEBV using all genotyped animals with phenotypes or progeny in each year group, the same 20K core animals were used for APY. In contrast, when calculating SNP effects by GEBV using randomly selected genotyped animals from those animals, ranging from 15K to 60K, the APY was not needed (i.e., no core animals), so these animals were not the same animals as in the core previously used for APY to obtain GEBV. Last, IGP for other genotyped animals, which had neither progeny nor phenotypes (i.e., young bulls or heifers) in 2014-2018, 2015-2018, 2016-2018, 2017-2018, and 2018 were calculated from the SNP marker effects as IGP = Zû (Strandén and Garrick, 2009;Garcia et al., 2020) via PREDF90 (Misztal et al., 2014b). This calculation of IGP requires accurate SNP estimates from the GEBV. To calculate the SNP marker effects, all genotyped animals or a small number of randomly selected genotyped animals can be used . In a nutshell, the steps of this process for the 2014-2018 genomic data set, for example, involved the computation of GEBV using 38% of all genotyped animals (886,176) and all nongenotyped animals with phenotypes and progeny available in ssGBLUP (Table 1). Next, SNP marker effects were calculated based on GEBV from those genotyped animals or random subset thereof. Finally, IGP for the remaining 62% of the genotyped animals (1,448,775) that had neither progeny nor phenotypes were computed by those SNP effects (Table 1). Likewise, in the 2018 genomic data, 88% of the genotyped animals directly received GEBV, all or a randomly selected subset of them were used to estimate SNP effects, and 12% of the genotyped animals were used to estimate IGP. Table 1 shows computing time (wall-clock time) in hours using 4 CPU cores and the number of iterations for genomic data sets 2014-2018, 2015-2018, 2016-2018, 2017-2018, and 2018. Strictly speaking, computing time can vary depending on the computational environment, such as software, hardware, and sharing conditions. With the full data set, the computing time for GEBV was 177 h with 1,433 iterations. The computing time was mostly consumed by calculating G −1 with APY (26 h) and solving the MME (150 h). As the number of genotyped animals for GEBV increased, the computing time for IGP in Table 1 increased substantially due to the heavy calculation of SNP effects in ˆ, u DZ'G a = − λ 1 when computing G −1 , even with APY for a large number of genotyped animals in a. On the other hand, the matrixvector multiplication in IGP = Zu took a few minutes for all data sets. The total computing time for calculations of GEBV and IGP ranged from 99 h for 2014-2018 to 223 h for 2018. In this study, the POSTGSf90 program was used to calculate SNP marker effects, and about 80% of the computing time was spent to create G −1 . However, if the SNP prediction is implemented inside the BLUP90IOD2 program, this additional computing time can be saved by avoiding creating the same G −1 twice. Another choice is To further reduce the computing time for the calculation of IGP, SNP marker effects were predicted from only a portion of the genotyped animals in a; that is, randomly selected from 38% of all genotyped animals (i.e., 886,176 from 2014 to 2018): 15K, 20K, 25K, 30K, 35K, 40K, 45K, 50K, 55K, and 60K. The computation of G −1 for those small number of animals in a was fast, even with the direct inversion, and the corresponding computing time of SNP effects took less than 1 h using any number of selected genotyped animals from 15K to 60K. As a result, the total computing time was reduced from 99 h to 73 h for the 2014-2018 data and from 223 h to 151 h for the 2018 data (Table 1). Table 2 shows b 0 and b 1 on the regression model fitting GEBV = b 0 + b 1 × IGP for genotyped animals that had neither progeny nor phenotypes in each year group, and mean absolute differences (MEAN) and maximum absolute differences (MAX) between GEBV and IGP for the genomic data from 2014 to 2018 using SNP effects in IGP from randomly selected 30K genotyped animals. These values from the 2015-2018, 2016-2018, 2017-2018, and 2018 year groups are not shown in the table because they were similar to those from 2014-2018 (e.g., b 0 ranging from 1.2 to 1.4 for males and from 1.2 to 1.3 for females). The MEAN values were similar to b 0 values except for traits with small or negative genetic gains. The sum of the b 0 value and b 1 × IGP must be reported together to avoid bias with the interpretation of IGP. The correlations of MEAN, MAX, and b 0 with standardized genetic progress (ΔG) from Tsuruta et al. (2021) were high (0.96 to 0.98) for all genomic data sets. To compare b 0 and ΔG on the same scale, MEAN, MAX, and b 0 were also standardized by dividing by each genetic standard deviation. The high positive correlations indicate underprediction of IGP when ΔG is greater or selection is more intense. This bias in IGP is attributable to the different genetic base for each trait, and it can be adjusted by the mean difference between GEBV and IGP (Lourenco et al., 2018). The whole calculation of GEBV is required to obtain the exact genetic difference; however, because the target genotyped animals for IGP from the 2014-2018 genomic data spanned 5 years, the adjustment based on b 0 or the genetic gains will be simple, rational, and practical. Without adjusting the genetic base correctly, these genotyped animals with IGP cannot be ranked together with other animals with GEBV. In this case, these animals with IGP should be ranked separately. Table 2 also shows b 1 values on the regression model, which is the scaling factor or the slope on IGP, using the genomic data from 2014 to 2018. The b 1 values from the 2015 to 2018 genomic data groups were similar to those from 2014 to 2018 on average (not shown in Table 2), ranging from 0.99 to 1.01 for males and 0.99 to 1.00 for females. The scaling factor indicates inflation (deflation) of IGP when b 1 <1.0 (>1.0). Overall, the average scaling factor showed no inflation or deflation, ranging from 0.99 to 1.01 on average for all data sets. However, b 1 ranged from 0.94 to 1.06 for males and from 0.95 to 1.05 for females for the individual 18 traits. The correlation between b 1 and ΔG was high (0.84), implying that IGP is more deflated or GEBV is more inflated when the ΔG is larger (i.e., more directional selection), and less deflated IGP or less inflated GEBV when the trait has an intermediate optimum or assortative mating is being practiced (Tsuruta et al., 2021). Table 2 also shows the coefficient of determination (R 2 ) of the regression model and the correlation between R 2 and ΔG for the genomic data from 2014 to 2018. The R 2 were high (0.950 and 0.949 for males and females on average, respectively, equivalent to correlations from 0.975 and 0.974) and ranged from 0.908 to 0.975 for males and from 0.905 to 0.974 for females for 18 traits. The R 2 did not change over the years by increasing the number of genotyped animals for calculation of GEBV, ranging from 0.95 to 0.97 for males and from 0.95 to 0.96 for females (R 2 from 2015 to 2018 are not shown in Table 2). This indicates that IGP were accurate regardless of the number of genotyped animals used to compute GEBV (for genotyped animals with phenotypes or progeny) and IGP (for genotyped animals with neither phenotypes nor progeny). Negative correlations (−0.44 and −0.55 for males and females, respectively) between R 2 and ΔG indicate that traits with more directional selection tend to have lower accuracy in IGP.
As described before, 10 randomly selected genotyped animal groups were used to reduce the computing time for calculation of IGP in the 2014-2018 data (i.e., 15K, 20K, 25K, 30K, 35K, 40K, 45K, 50K, 55K, and 60K genotyped animals from 886,176). Figure 1 shows the average b 1 for all 18 traits changing over these 10 animal groups. The b 1 indicates slight inflation of IGP when all 886,176 genotyped animals in 2014-2018 were used. However, when randomly selected genotyped animals were used, IGP Computing times in parentheses are when randomly selected genotyped animals were used to calculate IGP. In computing times for IGP and Total, "− X" indicates hours without creating the inverse of genomic relationship matrix.
showed slight deflation to inflation when the number of animals increased from 15K to 60K, indicating that the genetic variance was larger in IGP than in GEBV as more genotyped animals were selected. When 25K genotyped animals were used, b 1 was close to 1.0. This result suggests that a number of genotyped animals between 25K and 35K could be the appropriate range for IGP when 359 Tsuruta et al. | Indirect genomic prediction for US Holsteins Table 2. Intercept (b 0 ), regression coefficient b 1 , and R 2 in genomic (G)EBV = b 0 + b 1 × IGP, 1 mean absolute differences (MEAN) and maximum absolute differences (MAX) between GEBV and IGP, and correlations of ΔG 2 with each parameter, using SNP estimates for randomly selected 30K genotyped animals from 2014 to 2018 genomic data for 18 type traits 3 The values of b 0 , ΔG, MEAN, and MAX were standardized by dividing by the genetic standard deviation for each trait. we select genotyped animals randomly to calculate SNP marker effects. Figure 1 also shows the average R 2 corresponding to b 1 described above. When 35K or 40K genotyped animals were used to calculate IGP, the R 2 reached the same level as the R 2 from all 886,176 genotyped animals (i.e., 0.960 for males and 0.954 for females). These numbers from 25K to 35K could be considered appropriate as the number of core animals in APY. In our study, the number of eigenvalues explaining 98% or 99% in the variation in G was around 20K, which means that 20K genotyped animals could provide sufficient information to accurately estimate the effects of all independent chromosome segments in this population. Therefore, 20K animals can be used as the core in APY under ssGBLUP evaluations. However, some of the selected core animals may provide redundant information (i.e., some of the animals may be highly correlated), possibly increasing the required number of core animals that represent all independent chromosome segments to a range from 25K to 35K. An additional comparison using the same 20K core animals from APY to obtain GEBV and calculate SNP marker effects was conducted. However, no difference was found using these 20K core animals and the randomly selected 20K animals; therefore, the results are not presented.
Considering the computing time and both biasedness and accuracy in IGP, genomic evaluations for a large number of genotyped animals can be conducted separately in GEBV and IGP. A small number of randomly selected genotyped animals can be used to accurately estimate SNP marker effects and can dramatically reduce the computational cost for IGP. The results of this study describe a practical solution when conducting a large-scale genomic evaluation and can make more frequent evaluations less costly.