class: center, middle, inverse, title-slide # Genomic Selection ### Jinliang Yang ### April 12, 2022 --- # The Linear Mixed Model `\begin{align*} \mathbf{y} &= \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{e} \\ \end{align*}` where, - `\(\mathbf{y}\)` is a vector of observed phenotypes - `\(\mathbf{X}\)` is the __design__ or __incidence__ matrix - `\(\mathbf{b}\)` is the vector of the __fixed__ effects to be estimated - `\(\mathbf{Z}\)` is the __incidence__ matrix for random effects - `\(\mathbf{u}\)` is the vector of the __random__ effects to be predicted - `\(\mathbf{e}\)` is the vector of residuals. `\begin{align*} \begin{bmatrix} \mathbf{u} \\ \mathbf{e} \end{bmatrix} \sim N \begin{bmatrix} \mathbf{0}, & \Bigg( \begin{array}{cccc} \mathbf{G \sigma_u^2} & \mathbf{0} \\ \mathbf{0} & \mathbf{R\sigma_e^2} \\ \end{array} \Bigg) \end{bmatrix} \\ \end{align*}` Or, `\begin{align*} \mathbf{u} \sim N\mathbf{(0, \mathbf{G^*})} \\ \mathbf{e} \sim N\mathbf{(0, \mathbf{R^*})} \\ \end{align*}` --- # The covarinace matrix using pedigree | | Morex | Robust | Stander | Excel | | :-------: | :-------: | :-----------: | :-----------: | | Morex | 1 | 1/2 | 11/32 | 7/16 | | Robust | | 1 | 43/64 | 27/32 | Stander | | | 1 | 91/128 | | Excel | | | | 1 | `\begin{align*} V(\mathbf{u}) & = \mathbf{A}V_A \\ & = \begin{bmatrix} 2 & 1 & 11/16 & 7/8 \\ 1 & 2 & 43/32 & 27/16 \\ 11/16 & 43/32 & 2 & 91/64 \\ 7/8 & 27/16 & 91/64 & 2 \\ \end{bmatrix} V_A \end{align*}` Where the elements of `\(\mathbf{A}\)`, __the additive relationship matrix__, are equal to twice the `\(f_{XY}\)` among inbreds. ----------------- Genetic covariances for general relatives is `\begin{align*} Cov(X, Y) = 2f_{XY}V_A + \Delta_{XY}V_D \end{align*}` --- # Pedigree-based relationship Parental contributions to progeny may differ from those expected from pedigrees. -- - For example, F2 => Recombinant Inbred lines (RILs) - Expected to have a parental contribution of 50%, but `\(f_{XY}\)` deviates from its expected value. <div align="center"> <img src="f2_3.png" height=300> </div> --- # Genomic relationship ### Marker similarity The expected __marker similarity__ between X and Y is equal to: `\begin{align*} S_{XY} = f_{XY} - (1- f_{XY}) \theta_{XY} \\ \end{align*}` - Where `\(\theta_{XY}\)` is the probability that a marker allele from a random parent `\(X\)` and a marker allele from a random parent of `\(Y\)` are IBS, given that they are not IBD. - If `\(f_{XY}=0\)`, `\(S_{XY} = -\theta_{XY}\)` -- - `\(\theta_{XY}\)` can be estimated from unrelated individuals as - the average nucleotide diversity: `\(\theta\)` --- # Genomic relationship Rearrange the marker similarity equation and replace `\(\theta_{XY}\)` with `\(\theta\)`: `\begin{align*} f_{XY} & = \frac{S_{XY} - \theta_{XY}}{ 1- \theta_{XY}} \\ & = \frac{S_{XY} - \theta}{ 1- \theta} \end{align*}` -- __Marker-based `\(f_{XY}\)`__: the probability of identity by descent (IBD) between two individuals by accounting for the frequency of marker alleles that are identity by state (IBS). -- ### Example: 1. two unrelated individual A and B: `\(\theta_{AB}=0.5\)` 2. A and C are related: `\(S_{AC}=0.6\)` 3. therefore, `\(f_{AC} = \frac{S_{AC} - \theta}{ 1- \theta} = 0.2\)` --- # GBLUP - Step1: __Coefficients of coancestry__ calculated from genome-wide markers instead of pedigree records - Step2: __Genomic relationship (G) matrix__ replaces the additive relationship matrix or __A__ matrix - G matrix can also be described as a realized relationship matrix because it captures the realized rather than the expected relatedness. - Step3: Fit the linear mixed model or the GBLUP model. -- ----------- - Empirical results in plants have shown that GBLUP is superior to BLUP. - For maize yield, the predictive ability ranged from 0.7 to 0.8 with GBLUP vs. 0.66 to 0.79 with BLUP (Bernardo, 1994) - Recent results with high density markers: 0.10-0.25 higher (Albrecht et al., 2014; Schrag et al., 2019) --- # Genomic Selection __Genomic selection__ is a procedure that utilizes a large set of random markers in marker-based selection. - GS can be considered as marker-based selection without QTL mapping! #### Prediction: Obtaining genome-enabled prediction of genotypic value #### Selection: Selection is conducted on the basis of such predicted values. -- Meuwissen et al. (2001) published the landmark paper outlining the application of BLUP of allele effects to breeding, specifically in the context of animal breeding. > Meuwissen, et al.,2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819-1829. --- # Genomic Selection Procedure <div align="center"> <img src="gs.png" height=350> </div> > Heffner, E.L., M.E. Sorrells, and J.L. Jannink. 2009. Genomic selection for crop improvement. Crop Sci. 49:1-12. --- # A GS experimental design > Bernardo 2020. textbook - Suppose __N=150 F3 families__ are developed from the cross between two maize inbreds. - These F3 families are evaluated for their testcross performance and are genotyped with __N=384 random SNP markers__. - Pooled amplicon sequencing - Genotyping-by-sequencing - SNP array - Whole genome sequencing - The yield trials are conducted in the same set of environments and the data are assumed balanced so that the only fixed effect is the overall mean. --- # A GS example The performance of the testcrosses on an entry-mean basis can be modeled as: `\begin{align*} \mathbf{y} &= \mathbf{1} \mu + \mathbf{Zm} + \mathbf{e} \\ \end{align*}` where, - `\(\mathbf{y}\)` is a vector of testcross phenotypic means - `\(\mathbf{1}\)` is an `\(N \times 1\)` vector with all elements equal to 1 - `\(\mu\)` is the __fixed__ effect of the overall mean - `\(\mathbf{Z}\)` is the __incidence__ matrix for random effects of __SNP genotype__ - `\(\mathbf{m}\)` is the vector of the __random effects for each of the SNP markers__ - `\(\mathbf{e}\)` is the vector of residuals -- ### About Z matrix - The `\(\mathbf{Z}\)` matrix is coded (first homozygous `\(=1\)`, heterozygous `\(=0\)`, 2nd homozygous `\(=-1\)`) with marker genotype. - The __SNP effect__ is defined as the effect associated with the first homozygous inbred, coded with 1. --- # A GS example The performance of the testcrosses on an entry-mean basis can be modeled as: `\begin{align*} \mathbf{y} &= \mathbf{1} \mu + \mathbf{Zm} + \mathbf{e} \\ \end{align*}` Note that fitting marker effects as random instead of fixed does not require degrees of freedom - The number of marker loci (N=384) can exceed the population size (N=150) -- The covariance matrix of `\(\mathbf{m}\)` can be modeled as: > Meuwiseen et al., 2001. `\begin{align*} V(\mathbf{m}) &= \mathbf{I}V_{M_i} \\ & = \mathbf{I}(V_{G}/N_M) \end{align*}` - where `\(\mathbf{I}\)` is an identity matrix - `\(V_{M_i}\)` is the variance due to each marker locus - `\(V_{G}\)` is the type of genetic variance expressed among the progeny being evaluated. - `\(N_M\)` is the number of markers --- # Ridge regression-BLUP `\begin{align*} \mathbf{y} &= \mathbf{1} \mu + \mathbf{Zm} + \mathbf{e} \\ \end{align*}` `\begin{align*} V(\mathbf{m}) &= \mathbf{I}V_{M_i} \\ & = \mathbf{I}(V_{G}/N_M) \end{align*}` This convenient way to calculated the effects of genome-wide markers by BLUP is called __ridge regression BLUP__ or __RR-BLUP__. -- ### Two assumptions in RR-BLUP - (1) Each random marker is assumed to account for an equal amount of the genetic variance - This does not mean that the predicted marker effects are equal - It simply means that the marker effects have the same underlying genetic variance - (2) Epistasis is ignored for convenience in the prediction --- # The Barley Example ### Materials Four __related__ barley cultivars: - __M__orex, __R__obust, __E__xcel, __S__tander ### Experimental design - M, R, and S in 18 environments (set1) - R, E, and S in 9 environments (set2) -- | Env | Number| inbred | Grain yield | SNP1 | SNP2 | SNP3 | | :-------: | :-------: | :-----: | :--: | :--: | :--: | :--: | :--: | | Set1 | 18 | M | 4.45 | C | A | C | | Set1 | 18 | R | 4.61 | C | G | C | | Set1 | 18 | S | 5.27 | T | A | A | | Set2 | 9 | R | 5.00 | C | G | C | | Set2 | 9 | E | 5.82 | T | G | C | | Set2 | 9 | S | 5.79 | T | A | A | --- # The Linear Mixed Model `\begin{align*} \mathbf{y} &= \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{m} + \mathbf{e} \\ \end{align*}` - `\(\mathbf{y}\)` is a vector of observed grain yield phenotype - `\(\mathbf{X}\)` is the __design__ or __incidence__ matrix - `\(\mathbf{b}\)` is the vector of the __fixed__ effects due to sets of yield trials - `\(\mathbf{e}\)` is the vector of residuals. -- - `\(\mathbf{Z}\)` is the incidence matrix for __the SNP markers__ - `\(\mathbf{m}\)` is the vector of the __random effects due to SNPs__ `\begin{align*} V(\mathbf{m}) &= \mathbf{I}V_{M_i} \\ & = \mathbf{I}(V_{G}/N_M) \end{align*}` -- #### SNP Coding We will use biallelic SNPs and consider Morex (coded as 1) as our reference - `\(1\)`: If an individual is homozygous for the allele carried by Morex - `\(0\)`: heterozygous - `\(-1\)`: if homozygous for the allele not carried by Morex --- # Z Matrix | Env | Number| inbred | Grain yield | SNP1 | SNP2 | SNP3 | | :-------: | :-------: | :-----: | :--: | :--: | :--: | :--: | :--: | | Set1 | 18 | __M__ | 4.45 | __C__ | __A__ | __C__ | | Set1 | 18 | R | 4.61 | C | G | C | | Set1 | 18 | S | 5.27 | T | A | A | | Set2 | 9 | R | 5.00 | C | G | C | | Set2 | 9 | E | 5.82 | T | G | C | | Set2 | 9 | S | 5.79 | T | A | A | #### Get Z matrix into R -- ```r # let's input data by column Z <- matrix(c(1,1,-1,1,-1,-1, 1,-1,1,-1,-1,1, 1,1,-1,1,1,-1), byrow=FALSE, nrow=6) Z ``` ``` ## [,1] [,2] [,3] ## [1,] 1 1 1 ## [2,] 1 -1 1 ## [3,] -1 1 -1 ## [4,] 1 -1 1 ## [5,] -1 -1 1 ## [6,] -1 1 -1 ``` --- # Solve the MME `\begin{align*} \begin{bmatrix} \mathbf{\hat{b}} \\ \mathbf{\hat{u}} \\ \end{bmatrix} = \begin{bmatrix} \mathbf{X^{'}R^{-1}X} & \mathbf{X^{'}R^{-1}Z} \\ \mathbf{Z^{'}R^{-1}X} & \mathbf{Z^{'}R^{-1}Z} + \mathbf{G^{-1}} V_e/V_u \\ \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{X^{'}R^{-1}y} \\ \mathbf{Z^{'}R^{-1}y} \\ \end{bmatrix} \end{align*}` -- In our case, `\begin{align*} \begin{bmatrix} \mathbf{\hat{b}} \\ \mathbf{\hat{m}} \\ \end{bmatrix} = \begin{bmatrix} \mathbf{X^{'}R^{-1}X} & \mathbf{X^{'}R^{-1}Z} \\ \mathbf{Z^{'}R^{-1}X} & \mathbf{Z^{'}R^{-1}Z} + \mathbf{I} V_e/V_{M_i} \\ \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{X^{'}R^{-1}y} \\ \mathbf{Z^{'}R^{-1}y} \\ \end{bmatrix} \end{align*}` -- Here, we suppose the heritability is `\(H^2 = 0.5\)`. `\begin{align*} H^2 & = \frac{V_G}{V_G + V_e} \\ V_{M_i} & = \frac{V_G}{N_M} \\ \frac{V_e}{V_{M_i}} & = \frac{(1 - H^2) \times N_M}{H^2} \\ & = 3 \end{align*}` --- # Solve the MME ```r y <- matrix(c(4.45, 4.61, 5.27, 5.00, 5.82, 5.79), byrow=FALSE, nrow=6) X <- matrix(c(1,1,1,0,0,0, 0, 0,0, 1, 1,1), byrow=FALSE, nrow=6) R <- matrix(c(1/18,0,0,0,0,0, 0,1/18,0,0,0,0, 0,0,1/18,0,0,0, 0,0,0,1/9,0,0, 0,0,0,0,1/9,0, 0,0,0,0,0,1/9), nrow=6, byrow=T) ``` -- `\begin{align*} \begin{bmatrix} \mathbf{\hat{b}} \\ \mathbf{\hat{m}} \\ \end{bmatrix} = \begin{bmatrix} \mathbf{X^{'}R^{-1}X} & \mathbf{X^{'}R^{-1}Z} \\ \mathbf{Z^{'}R^{-1}X} & \mathbf{Z^{'}R^{-1}Z} + \mathbf{I} V_e/V_{M_i} \\ \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{X^{'}R^{-1}y} \\ \mathbf{Z^{'}R^{-1}y} \\ \end{bmatrix} \end{align*}` -- ```r solve_mme <- function(X, y, R, Z, H2, nmarker){ a11 <- t(X) %*% solve(R) %*% X a12 <- t(X) %*% solve(R) %*% Z a21 <- t(Z) %*% solve(R) %*% X a22 <- t(Z) %*% solve(R) %*% Z v = (1-H2)*nmarker/H2 a22_2 <- diag(3)*3 lhs <- rbind(cbind(a11, a12), cbind(a21, a22 + a22_2)) rhs <- rbind(t(X) %*% solve(R) %*% y, t(Z) %*% solve(R) %*% y) eff <- solve(lhs) %*% rhs return(eff) } eff <- solve_mme(X, y, R, Z, H2=0.5, nmarker=3) ``` --- # The marker effects | Env | Number| inbred | Grain yield | SNP1 | SNP2 | SNP3 | | :-------: | :-------: | :-----: | :--: | :--: | :--: | :--: | :--: | | Set1 | 18 | __M__ | 4.45 | __C__ | __A__ | __C__ | | Set1 | 18 | R | 4.61 | C | G | C | | Set1 | 18 | S | 5.27 | T | A | A | | Set2 | 9 | R | 5.00 | C | G | C | | Set2 | 9 | E | 5.82 | T | G | C | | Set2 | 9 | S | 5.79 | T | A | A | ```r eff ``` ``` ## [,1] ## [1,] 4.93475643 ## [2,] 5.41898437 ## [3,] -0.34842360 ## [4,] -0.06523449 ## [5,] -0.06061120 ``` ---------- The effect of SNP1 `\(\hat{m_1}=-0.35\)` indicates that at SNP1, the allele carried by `Morex` leads to a lower trait value. If the genotype at SNP1 changes from CC to CT, the predicted value for yield would increase by 0.35. CC -> TT, increase by 0.7.