class: center, middle, inverse, title-slide # Genomic Selection in Practice ### Jinliang Yang ### April 14, 2022 --- # RR-BLUP vs. QTL mapping In RR-BLUP, usually, the number of markers ( `\(p\)` ) is greatly larger than observation ( `\(n\)` ), or `\(p >> n\)` - a large p, small n problem ### Shrinkage The marker effects with RR-BLUP are __shrinkage toward zero__. - Modeling `\(V_{M_i}\)` as constant across all loci caused an overshrinkage of effects for markers close to major QTL (large effect QTL). - And an undershrinkage of effects for markers close to minor QTL. > Meuwissen et al., 2001. -- Furthermore, for a given QTL, several marker loci will capture its effect if using the RR-BLUP method. --- # GBLUP vs. RR-BLUP ### GBLUP Using genome-wide markers to quantify the amount of information extracted from relatives. ### RR-BLUP Quantifying the effects of genome-wide markers from a population of related individuals. -- ---------- Theory has shown that GBLUP and RR-BLUP are equivalent: - If the number of QTL is large - No major QTL are present - And the QTL are evenly distributed across the genome > Fernando, 1998; Habier et al., 2007; Goddard, 2009. --- # Framework of GS <div align="center"> <img src="gs.png" height=250> </div> #### Model training In the context of GS, the lines with both genotype and phenotype data constitute a __training population__. => `\(\mathbf{\hat{m}}\)` #### Prediction Predict the performance of the untested lines using their genotype data. `\(\mathbf{\hat{y}} = \mathbf{1}\mu + \mathbf{Z} \mathbf{\hat{m}}\)` --- # Framework of GS In the model training process, a __cross-validation__ method will be used within the training population. - __k-fold__ cross-validation - The training population is divided in `\(k\)` sets (i.e., 5-fold). - __delete-one__ (or __leave-one-out__) cross-validation - Use `\(n-1\)` to train the model. -- ### Prediction accuracy The prediction accuracy (denoted by `\(r_{MG}\)`) is the correlation between the true genotypic value and the genotypic value predicted from marker effects. --- # A real world example: Loblolly pine data Loblolly pine (Pinus taeda) data >Resende Jr. et al. (2012) ([DOI: 10.1534/genetics.111.137026](http://dx.doi.org/10.1534/genetics.111.137026)) - __Mating Design__: 70 full-sib families and 951 individuals in total using _a circular mating design_ - __Phenotyping__: 17 traits with distinct heritabilities and genetic architectures - __Genotyping__: with 4,853 SNPs using the SNP array method. - The full dataset can be downloaded from the paper. --- # A real world example: Loblolly pine data In this example, we will use the breeding values of crown width across the planting beds at age 6 (CWAC6). ```r # read phenotype and SNP files pheno_file <- "https://jyanglab.com/img/data/DATA_nassau_age6_CWAC.csv" geno_file <- "https://jyanglab.com/img/data/Snp_Data.csv" pheno <- read.csv(pheno_file, header=TRUE, stringsAsFactors = FALSE) hist(pheno$Derregressed_BV, main="Crown width at Age 6", xlab="width") ``` ![](w13lab_gs2_files/figure-html/unnamed-chunk-1-1.png)<!-- --> --- # Loblolly pine data ### Remove missing phenotypes There are some accessions containing no phenotype. We need to remove these accessions first. ```r na.index <- which(is.na(pheno$Derregressed_BV)) # length(na.index) pheno <- pheno[-na.index, ] # Keep genotypes for these remaining lines geno <- geno[geno$Genotype %in% pheno$Genotype, ] # phenotypes y <- pheno$Derregressed_BV y <- matrix(y, ncol=1) # markers geno <- geno[,-1] # 861 x 4853 geno[geno == -9] <- NA ``` --- # Genotype data: SNP quality control In the `geno` matrix, row indicates individual, column indicates SNPs. ### Missingness and MAF ```r geno <- read.csv(geno_file, header=TRUE, stringsAsFactors = FALSE) dim(geno) # Keep genotypes for these remaining lines geno <- geno[geno$Genotype %in% pheno$Genotype, ] # markers geno <- geno[,-1] # 861 x 4853 geno[geno == -9] <- NA # missing rate missing <- apply(geno, 2, function(x){sum(is.na(x))/length(x)}) # minor allele frequency maf <- apply(geno, 2, function(x){ frq <- mean(x, na.rm=TRUE)/2 # 1 allele return(ifelse(frq > 0.5, 1-frq, frq)) }) ``` --- # Genotype data: SNP quality control In the `geno` matrix, row indicates individual, column indicates SNPs. #### Plot the results ```r hist(missing, breaks=100, col="blue", xlab="SNP Missing rate") hist(maf, breaks=100, col="blue", xlab="Minor Allele Freq") ``` --- # SNP quality control Removing SNPs with high missing rate (missingness > 0.2) and low MAF (MAF < 0.05) - Question: How many markers are removed? ```r idx1 <- which(missing > 0.2) #154 idx2 <- which(maf < 0.05) #1647 idx <- unique(c(idx1, idx2)) #1784 geno2 <- geno[, -idx] dim(geno2) ``` --- # SNP quality control ### Missing marker imputation Replace missing marker genotypes with __mean values__. Then store the marker genotypes in a matrix object `Z`. ```r Z <- matrix(0, ncol=ncol(geno2), nrow=nrow(geno2)) for (j in 1:ncol(geno2)){ #cat("j = ", j, '\n') Z[,j] <- ifelse(is.na(geno2[,j]), mean(geno2[,j], na.rm=TRUE), geno2[,j]) } # sum(is.na(Z)) write.table(Z, "data/Z.txt", sep="\t", row.names = FALSE, col.names=FALSE, quote=FALSE) ``` --- # Genomic relationship ### SNP Matrix standardization Standardize the genotype matrix to have a mean of zero and variance of one. Save this matrix as `Zs`. ```r Zs <- scale(Z, center = TRUE, scale = TRUE) # dimensions n <- nrow(Zs) m <- ncol(Zs) ``` --- # Genomic relationship ### Calcualte genomic relationship - Compute the second genomic relationship matrix of [VanRaden (2008)](https://www.ncbi.nlm.nih.gov/pubmed/18946147) using the entire markers. - Then add a very small positive constant (e.g., 0.001) to the diagonal elements so that `G` matrix is invertible. ```r # Given matrices x and y as arguments, return a matrix cross-product. # This is formally equivalent to (but usually slightly faster than) # the call t(x) %*% y (crossprod) or x %*% t(y) (tcrossprod). G <- tcrossprod(Zs) / ncol(Zs) # G <- Zs %*% t(Zs) /ncol(Zs) G <- G + diag(n)*0.001 ``` --- # Solve MME for GBLUP Set up mixed model equations (MME) by fitting the model: `$$\mathbf{y = 1\mu + Zu + e}$$` - where `\(\mu\)` is the intercept, - `\(\mathbf{Z}\)` is the incident matrix of individuals, - `\(\mathbf{u}\)` is the breeding value of the individuals, - and `\(\mathbf{e}\)` is the residual. Directly take the inverse of LHS to obtain the solutions for GBLUP. Report the estimates of intercept and additive genetic values. Use `\(\lambda = 1.35\)`. ```r lambda <- 1.35 # fit$Ve / fit$Vm Ginv <- solve(G) ones <- matrix(1, ncol=1, nrow=n) Z <- diag(n) # Given matrices x and y as arguments, return a matrix cross-product. #This is formally equivalent to (but usually slightly faster than) #the call t(x) %*% y (crossprod) or x %*% t(y) (tcrossprod). LHS1 <- cbind(crossprod(ones), crossprod(ones, Z)) LHS2 <- cbind(crossprod(Z, ones), crossprod(Z) + Ginv*lambda) LHS <- rbind(LHS1, LHS2) RHS <- rbind( crossprod(ones, y), crossprod(Z,y) ) sol <- solve(LHS, RHS) head(sol) tail(sol) ``` --- # R package: `rrBLUP` Fit GBLUP by using the `mixed.solve` function in the [rrBLUP](https://cran.r-project.org/web/packages/rrBLUP/index.html) R package. - Report the estimates of intercept and additive genetic values. - Do they agree with previous estimates? - Also, report the estimated genomic heritability and the ratio of variance components `\(\lambda = \frac{V_e}{V_A}\)`. ```r #install.packages("rrBLUP") library(rrBLUP) fit <- mixed.solve(y = y, K=G) # additive genetic variance fit$Vu # residual variance fit$Ve # intercept fit$beta # additive genetic values head(fit$u) tail(fit$u) # genomic h2 fit$Vu / (fit$Vu + fit$Ve) # ratio of variance components fit$Ve / fit$Vu # plot(x=sol[-1], y=fit$u) ``` --- # RR-BLUP Set up mixed model equations (MME) by fitting the model `\(\mathbf{y = 1b + Zm + e}\)`, where `\(\mathbf{b}\)` is the intercept, `\(\mathbf{Z}\)` is the standardized marker genotypes (`Zs`), `\(\mathbf{m}\)` is the additive marker genetic effects, and `\(\mathbf{e}\)` is the residual. `\begin{align*} \begin{bmatrix} \mathbf{\hat{b}} \\ \mathbf{\hat{m}} \\ \end{bmatrix} = \begin{bmatrix} \mathbf{X^{'}R^{-1}X} & \mathbf{X^{'}R^{-1}Z} \\ \mathbf{Z^{'}R^{-1}X} & \mathbf{Z^{'}R^{-1}Z} + \mathbf{I} V_e/V_{M_i} \\ \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{X^{'}R^{-1}y} \\ \mathbf{Z^{'}R^{-1}y} \\ \end{bmatrix} \end{align*}` Directly take the inverse of LHS to obtain the solutions for marker-based GBLUP (RR-BLUP). Report the estimates of intercept and marker additive genetic effects. Use `\(\lambda = 4326.212\)`. -- ```r lambda <- 4326.212 # fit$Ve / fit$Vu ones <- matrix(1, ncol=1, nrow=n) I <- diag(m) LHS1 <- cbind(crossprod(ones), crossprod(ones, Zs)) LHS2 <- cbind(crossprod(Zs, ones), crossprod(Zs) + I*lambda) LHS <- rbind(LHS1, LHS2) RHS <- rbind( crossprod(ones, y), crossprod(Zs,y) ) sol2 <- solve(LHS, RHS) head(sol2) tail(sol2) ``` --- # Use `rrBLUP` package Fit RR-BLUP by using the `mixed.solve` function in the [rrBLUP](https://cran.r-project.org/web/packages/rrBLUP/index.html) R package. - Report the estimates of intercept and marker additive genetic effects. - o they agree with the estimates with the manual calculation? - Also, report the ratio of variance components `\(\lambda = \frac{V_e}{V_A}\)`. ```r library(rrBLUP) fit2 <- mixed.solve(y = y, Z=Zs) # marker additive genetic variance fit2$Vu # residual variance fit2$Ve # intercept fit2$beta # marker additive genetic effects head(fit2$u) tail(fit2$u) # ratio of variance components fit2$Ve / fit2$Vu # plot(x=sol2[-1], y=fit2$u) ``` --- # K-fold validation Repeat GBLUP but treat the first 600 individuals as a training set and predict the additive genetic values of the remaining individuals in the testing set. - What is the predictive correlation in the testing set? Use `\(\lambda = 1.348411\)`. ```r n.trn <- 600 n.tst <- 261 y.trn <- y[1:n.trn] y.tst <- y[n.trn+1:n.tst] Zs.trn <- Zs[1:n.trn,] Zs.tst <- Zs[n.trn+1:n.tst,] Gtrn <- tcrossprod(Zs.trn) / ncol(Zs.trn) Gtrn <- Gtrn + diag(n.trn)*0.001 Gtst.trn <- tcrossprod(Zs.tst, Zs.trn) / ncol(Zs.tst) #Gtrn <- G[1:n.trn, 1:n.trn] #Gtst.trn <- G[n.trn+1:n.tst, 1:n.trn] lambda <- 1.348411 # fit$Ve / fit$Vu Ginv.trn <- solve(Gtrn) ones <- matrix(1, ncol=1, nrow=n.trn) Z <- diag(n.trn) LHS1 <- cbind(crossprod(ones), crossprod(ones, Z)) LHS2 <- cbind(crossprod(Z, ones), crossprod(Z) + Ginv.trn*lambda) LHS <- rbind(LHS1, LHS2) RHS <- rbind( crossprod(ones, y.trn), crossprod(Z,y.trn) ) sol.trn <- solve(LHS, RHS) # prediction y.hat <- Gtst.trn %*% Ginv.trn %*% matrix(sol.trn[c(2:(n.trn+1))]) cor(y.hat, y[(n.trn+1):n]) # plot(y.hat, y[(n.trn+1):n]) ``` --- # K-fold validation Repeat RR-BLUP but treat the first 600 individuals as a training set and predict the additive genetic values of the remaining individuals in the testing set. - What is the predictive correlation in the testing set? Use `\(\lambda = 4326.212\)`. - Also, compare this predictive correlation to the one from GBLUP. ```r Zs.trn <- Zs[1:n.trn, ] Zs.tst <- Zs[n.trn+1:n.tst, ] lambda <- 4326.212 # fit$Ve / fit$Vu ones <- matrix(1, ncol=1, nrow=n.trn) I <- diag(m) LHS1 <- cbind(crossprod(ones), crossprod(ones, Zs.trn)) LHS2 <- cbind(crossprod(Zs.trn, ones), crossprod(Zs.trn) + I*lambda) LHS <- rbind(LHS1, LHS2) RHS <- rbind( crossprod(ones, y.trn), crossprod(Zs.trn, y.trn) ) sol.trn <- solve(LHS, RHS) # prediction y.hat2 <- Zs.tst %*% matrix(sol.trn[-1]) # cor(y.hat2, y[(n.trn+1):n]) # plot(y.hat2, y[(n.trn+1):n]) ```