class: center, middle, inverse, title-slide # Genome-wide association study ### Jinliang Yang ### April 21, 2022 --- # Genetic architecture of a phenotypic trait A quantitative trait is sometimes controlled jointly by - __major QTLs__ with large effects - __minor QTLs__ with small effects -- ### Prevent Shrinkage If markers that correspond to the major QTLs are known - then these markers can be treated as having fixed effects - It will prevents _shrinkage of their estimates_ - the remaining markers can be treated as having random effects - their effects can still be estimated through RR-BLUP or other approaches. --- # Major QTL Simulation results have suggested that a major QTL should have an __ `\(R^2 > 10\%\)`__ for it to be considered as having a fixed effects. > Bernardo, 2014 -- Fitting major QTL as having fixed effects has been shown to improve the prediction accuracy in genomic selection > Rutkoski et al., 2014; Herter et al., 2019; etc. -- ### Molecular Function - Identify two major QTLs conferring upright plant architecture, _UPA1_ ( _Upright Plant Architecture1_ ) and _UPA2_ ( _Upright Plant Architecture2_ ) - Altering endogenous brassinosteroid content and leaf angle. > Tian et al., 2019 --- # Ridge regression-BLUP `\begin{align*} \mathbf{y} &= \mathbf{Xb} + \mathbf{Zm} + \mathbf{e} \\ \end{align*}` `\begin{align*} V(\mathbf{m}) &= \mathbf{I}V_{M_i} \\ & = \mathbf{I}(V_{G}/N_M) \end{align*}` -- ### Fit a marker as a fixed effect With some modification of the above RR-BLUP model, a mixed-model approach can be used for assocation mapping: `\begin{align*} \mathbf{y} &= \mathbf{Xb} + \mathbf{w_i}m_i + \mathbf{Zm^*} + \mathbf{e} \\ \end{align*}` - Where `\(\mathbf{m_i}\)` is the fixed effect due to the `\(i\)`th SNP marker - `\(\mathbf{w_i}\)` is an incidence vector for the SNP marker --- # G Model `\begin{align*} \mathbf{y} &= \mathbf{Xb} + \mathbf{w_i}m_i + \mathbf{Zm^*} + \mathbf{e} \\ \end{align*}` - Where `\(\mathbf{m_i}\)` is the fixed effect due to the `\(i\)`th SNP marker - `\(\mathbf{w_i}\)` is an incidence vector for the SNP marker --------------- This G model utilizes RR-BLUP marker effects to account for variation due to QTL found on the background chromosomes. > Bernardo, 2013 - Like the QTL Composite interval mapping approach - The disadvantage of this type of approach is the uncertainty in how many background markers should be included. - If too few, the background variation will be underestimated - If too many, overfitting the model. --- # K model for GWAS `\begin{align*} \mathbf{y} &= \mathbf{Xb} + \mathbf{w_i}m_i + \mathbf{Zu} + \mathbf{e} \\ \end{align*}` In GBLUP, the covariance matrix of `\(\mathbf{u}\)` becomes equal to `\(\mathbf{A}V_A'\)` - Where `\(\mathbf{A}\)` is the additive relationship matrix, or kinship ( `\(\mathbf{K}\)` ) matrix - And `\(V_A'\)` is the portion of the additive variance that is not accounted for by `\(\mathbf{m_i}\)` - `\(V_A'\)` will need to be estimated by an iteractive procedure. -- ### Multiple marker model With multiple SNP markers - `\(\mathbf{w_i}\)` => an incidence matrix `\(\mathbf{W}\)` - `\(m_i\)` => a vector of `\(\mathbf{m}\)` --- # Multiple marker GWAS model `\begin{align*} \mathbf{y} &= \mathbf{Xb} + \mathbf{Wm} + \mathbf{Zu} + \mathbf{e} \\ \end{align*}` ### A two-step approach - First, a single marker is included at a time - The significance of individual marker effects is then tested by z-tests -- - Second, the markers found significant in the single-marker analyses are included in a multiple-marker model. - A standard model-selection procedure, such as __backward elimination__, maybe used to determine which markers should be incorporated in the final multiple-marker model. --- # Multiple subpopulations In breeding context, we define each germplasm group or heterotic pattern as a __sub-population__ of the larger pool of inbred, hybrids, or clones. - In maize, dent (Iowa Stiff Stalk Synthetic, BSSS) and flint (non-BSSS) - Barley inbreds comprise six-row and two-row types -- ### Population structure - Separate analysis: One-subpopulation-at-a-time approach - Joint analysis to account for the differences between the sub-populations --- # QK model for multiple subpopulations ### K model `\begin{align*} \mathbf{y} &= \mathbf{Xb} + \mathbf{w_i}m_i + \mathbf{Zu} + \mathbf{e} \\ \end{align*}` -- ### QK model `\begin{align*} \mathbf{y} &= \mathbf{Xb} + \mathbf{Qv} + \mathbf{w_i}m_i + \mathbf{Zu} + \mathbf{e} \\ \end{align*}` - In this model, the effects due to different subpopulations are captured by `\(\mathbf{Qv}\)` - The relatedness among lines within each subpopulation is specified by the covariance matrix of `\(\mathbf{u}\)`. --- # Use PCA method to construct Q matrix > Price et al., 2006 Proposed a method to use principal components analysis (PCA) of marker-allele frequencies and the use of PCA scores as the `\(\mathbf{Qv}\)` matrix. -- ### What is in the Q matrix: 1. The columns in Q correspond to different PCA axes 2. The rows in Q correspond to PCA scores of the lines in `\(\mathbf{y}\)` -- ### How many PCs? The first PC captures the largest amount of variation, and the 2nd captures the second-largest amount of variaiton, and so on. No fixed rule, but knowing the number of germplasm groups will help. --- # GWAS methods summary #### K model Account for relatedness using either pedigree records or marker data. - Mainly `\(\mathbf{A}\)` matrix (only considering additive relationship) - `\(\mathbf{AD}\)` matrix might be better #### QK model Account for effects of subpopulations and the relatedness within each subpopulation - K matrix estimated from marker better than from pedigree > Stich et al., 2008; Yu et al., 2006 -- ------------------- #### G model or QG model Utilizes RR-BLUP marker effects to account for variation due to QTL found on the background chromosomes. --- # GWAS methods summary <div align="center"> <img src="qg.png" height=300> </div> > Bernardo, 2013 Through simulation, - different genetic architecture (15 or 30 QTLs) - population size (N=384, 768, and 1,536) --- # The barley Example | Env | Number| inbred | Grain yield | SNP1 | SNP2 | SNP3 | | :-------: | :-------: | :-----: | :--: | :--: | :--: | :--: | :--: | | Set1 | 18 | __M__ | 4.45 | __C__ | __A__ | __C__ | | Set1 | 18 | R | 4.61 | C | G | C | | Set1 | 18 | S | 5.27 | T | A | A | | Set2 | 9 | R | 5.00 | C | G | C | | Set2 | 9 | E | 5.82 | T | G | C | | Set2 | 9 | S | 5.79 | T | A | A | ### Fit the G model `\begin{align*} \mathbf{y} &= \mathbf{Xb} + \mathbf{w_i}m_i + \mathbf{Zm^*} + \mathbf{e} \\ \end{align*}` `\begin{align*} V(\mathbf{m}) &= \mathbf{I}V_{M_i} \\ & = \mathbf{I}(V_{G}/N_M) \end{align*}` --- # The barley Example | Env | Number| inbred | Grain yield | SNP1 | SNP2 | SNP3 | | :-------: | :-------: | :-----: | :--: | :--: | :--: | :--: | :--: | | Set1 | 18 | __M__ | 4.45 | __C__ | __A__ | __C__ | | Set1 | 18 | R | 4.61 | C | G | C | | Set1 | 18 | S | 5.27 | T | A | A | | Set2 | 9 | R | 5.00 | C | G | C | | Set2 | 9 | E | 5.82 | T | G | C | | Set2 | 9 | S | 5.79 | T | A | A | ### Get matricies into R `\begin{align*} \mathbf{y} &= \mathbf{Xb} + \mathbf{w_i}m_i + \mathbf{Zm^*} + \mathbf{e} \\ \end{align*}` ```r # let's input data by column Z <- matrix(c(1,1,-1,1,-1,-1, 1,-1,1,-1,-1,1, 1,1,-1,1,1,-1), byrow=FALSE, nrow=6) y <- matrix(c(4.45, 4.61, 5.27, 5.00, 5.82, 5.79), byrow=FALSE, nrow=6) X <- matrix(c(1,1,1,0,0,0, 0, 0,0, 1, 1,1), byrow=FALSE, nrow=6) R <- matrix(c(1/18,0,0,0,0,0, 0,1/18,0,0,0,0, 0,0,1/18,0,0,0, 0,0,0,1/9,0,0, 0,0,0,0,1/9,0, 0,0,0,0,0,1/9), nrow=6, byrow=T) ## Adjust X and X X2 <- cbind(X, Z[,1]) Z2 <- Z[,-1] ``` --- # The barley Example `\begin{align*} \begin{bmatrix} \mathbf{\hat{b}} \\ \mathbf{\hat{m}} \\ \end{bmatrix} = \begin{bmatrix} \mathbf{X^{'}R^{-1}X} & \mathbf{X^{'}R^{-1}Z} \\ \mathbf{Z^{'}R^{-1}X} & \mathbf{Z^{'}R^{-1}Z} + \mathbf{I} V_e/V_{M_i} \\ \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{X^{'}R^{-1}y} \\ \mathbf{Z^{'}R^{-1}y} \\ \end{bmatrix} \end{align*}` ```r solve_mme <- function(y,X,Z,R,lambda){ a11 <- t(X) %*% solve(R) %*% X a12 <- t(X) %*% solve(R) %*% Z a21 <- t(Z) %*% solve(R) %*% X a22 <- t(Z) %*% solve(R) %*% Z a22_2 <- diag(nrow(a22))*lambda lhs <- rbind(cbind(a11, a12), cbind(a21, a22 + a22_2)) rhs <- rbind(t(X) %*% solve(R) %*% y, t(Z) %*% solve(R) %*% y) return(solve(lhs) %*% rhs) } ``` -- ### Solve the MME ```r lambda = 3 eff1 <- solve_mme(y,X,Z,R,lambda) eff2 <- solve_mme(y,X=X2,Z=Z2,R,lambda) ``` --- # The barley Example | Env | Number| inbred | Grain yield | SNP1 | SNP2 | SNP3 | __Type__ | | :---: | :---: | :--: | :--: | :--: | :--: | :--: | :--: | :------: | | Set1 | 18 | M | 4.45 | __C__ | __A__ | __C__ | 2r | | Set1 | 18 | R | 4.61 | C | G | C | 2r | | Set1 | 18 | __S__ | 5.27 | T | A | A | 6r | | Set2 | 9 | R | 5.00 | C | G | C | 2r | | Set2 | 9 | E | 5.82 | T | G | C | 2r | | Set2 | 9 | __S__ | 5.79 | T | A | A | 6r | ### Fit the QG model `\begin{align*} \mathbf{y} &= \mathbf{Xb} + \mathbf{Qv} + \mathbf{w_i}m_i + \mathbf{Zm^*} + \mathbf{e} \\ \end{align*}` `\begin{align*} V(\mathbf{m}) &= \mathbf{I}V_{M_i} \\ & = \mathbf{I}(V_{G}/N_M) \end{align*}` ```r Q <- matrix(c(1,1,-1,1,1,-1),byrow=FALSE, nrow=6) ``` --- # Solve the MME `\begin{align*} \mathbf{y} &= \mathbf{Xb} + \mathbf{Qv} + \mathbf{w_i}m_i + \mathbf{Zm^*} + \mathbf{e} \\ \end{align*}` ### Estimate the effect of the first marker ```r X3 <- cbind(X, Q, Z[,1]) eff3 <- solve_mme(y,X=X3,Z=Z2,R,lambda) eff3 ``` ``` ## [,1] ## [1,] 4.93732258 ## [2,] 5.39500000 ## [3,] -0.02848387 ## [4,] -0.38922581 ## [5,] -0.06425806 ## [6,] 0.00000000 ```