Genome-wide association study

class: center, middle, inverse, title-slide

# Genome-wide association study
### Jinliang Yang
### April 21, 2022

---

# Genetic architecture of a phenotypic trait

A quantitative trait is sometimes controlled jointly by
- __major QTLs__ with large effects 
- __minor QTLs__ with small effects

### Prevent Shrinkage

If markers that correspond to the major QTLs are known 
- then these markers can be treated as having fixed effects
  - It will prevents _shrinkage of their estimates_
  
- the remaining markers can be treated as having random effects
  - their effects can still be estimated through RR-BLUP or other approaches.

---

# Major QTL

Simulation results have suggested that a major QTL should have an __ `\(R^2 > 10\%\)`__ for it to be considered as having a fixed effects.
> Bernardo, 2014

Fitting major QTL as having fixed effects has been shown to improve the prediction accuracy in genomic selection
> Rutkoski et al., 2014; Herter et al., 2019; etc.

### Molecular Function

- Identify two major QTLs conferring upright plant architecture, _UPA1_ ( _Upright Plant Architecture1_ ) and _UPA2_ ( _Upright Plant Architecture2_ )

- Altering endogenous brassinosteroid content and leaf angle.
> Tian et al., 2019

---

# Ridge regression-BLUP

`\begin{align*}
\mathbf{y} &= \mathbf{Xb} + \mathbf{Zm} + \mathbf{e} \\
\end{align*}`

`\begin{align*}
V(\mathbf{m}) &= \mathbf{I}V_{M_i} \\
 & = \mathbf{I}(V_{G}/N_M) 
\end{align*}`

### Fit a marker as a fixed effect

With some modification of the above RR-BLUP model, a mixed-model approach can be used for assocation mapping:

`\begin{align*}
\mathbf{y} &= \mathbf{Xb} + \mathbf{w_i}m_i + \mathbf{Zm^*} + \mathbf{e} \\
\end{align*}`

- Where `\(\mathbf{m_i}\)` is the fixed effect due to the `\(i\)`th SNP marker
- `\(\mathbf{w_i}\)` is an incidence vector for the SNP marker

---

# G Model

`\begin{align*}
\mathbf{y} &= \mathbf{Xb} + \mathbf{w_i}m_i + \mathbf{Zm^*} + \mathbf{e} \\
\end{align*}`

- Where `\(\mathbf{m_i}\)` is the fixed effect due to the `\(i\)`th SNP marker
- `\(\mathbf{w_i}\)` is an incidence vector for the SNP marker

---------------

This G model utilizes RR-BLUP marker effects to account for variation due to QTL found on the background chromosomes.
> Bernardo, 2013

- Like the QTL Composite interval mapping approach

- The disadvantage of this type of approach is the uncertainty in how many background markers should be included.
 - If too few, the background variation will be underestimated
 
 - If too many, overfitting the model.

---

# K model for GWAS

`\begin{align*}
\mathbf{y} &= \mathbf{Xb} + \mathbf{w_i}m_i + \mathbf{Zu} + \mathbf{e} \\
\end{align*}`

In GBLUP, the covariance matrix of `\(\mathbf{u}\)` becomes equal to `\(\mathbf{A}V_A'\)`
- Where `\(\mathbf{A}\)` is the additive relationship matrix, or kinship ( `\(\mathbf{K}\)` ) matrix
- And `\(V_A'\)` is the portion of the additive variance that is not accounted for by `\(\mathbf{m_i}\)`
- `\(V_A'\)` will need to be estimated by an iteractive procedure.

### Multiple marker model

With multiple SNP markers
- `\(\mathbf{w_i}\)` => an incidence matrix `\(\mathbf{W}\)`

- `\(m_i\)` => a vector of `\(\mathbf{m}\)`

---

# Multiple marker GWAS model

`\begin{align*}
\mathbf{y} &= \mathbf{Xb} + \mathbf{Wm} + \mathbf{Zu} + \mathbf{e} \\
\end{align*}`

### A two-step approach

- First, a single marker is included at a time
  - The significance of individual marker effects is then tested by z-tests
  
--

- Second, the markers found significant in the single-marker analyses are included in a multiple-marker model.
  - A standard model-selection procedure, such as __backward elimination__, maybe used to determine which markers should be incorporated in the final multiple-marker model.

---
# Multiple subpopulations

In breeding context, we define each germplasm group or heterotic pattern as a __sub-population__ of the larger pool of inbred, hybrids, or clones.

- In maize, dent (Iowa Stiff Stalk Synthetic, BSSS) and flint (non-BSSS)

- Barley inbreds comprise six-row and two-row types

### Population structure

- Separate analysis: 
One-subpopulation-at-a-time approach

- Joint analysis to account for the differences between the sub-populations

---
# QK model for multiple subpopulations

### K model

`\begin{align*}
\mathbf{y} &= \mathbf{Xb} + \mathbf{w_i}m_i + \mathbf{Zu} + \mathbf{e} \\
\end{align*}`

### QK model

`\begin{align*}
\mathbf{y} &= \mathbf{Xb} + \mathbf{Qv} + \mathbf{w_i}m_i + \mathbf{Zu} + \mathbf{e} \\
\end{align*}`

- In this model, the effects due to different subpopulations are captured by `\(\mathbf{Qv}\)`

- The relatedness among lines within each subpopulation is specified by the covariance matrix of `\(\mathbf{u}\)`.

---
# Use PCA method to construct Q matrix

> Price et al., 2006

Proposed a method to use principal components analysis (PCA) of marker-allele frequencies and the use of PCA scores as the `\(\mathbf{Qv}\)` matrix.

### What is in the Q matrix:

1. The columns in Q correspond to different PCA axes
2. The rows in Q correspond to PCA scores of the lines in `\(\mathbf{y}\)`

### How many PCs?

The first PC captures the largest amount of variation, and the 2nd captures the second-largest amount of variaiton, and so on.

No fixed rule, but knowing the number of germplasm groups will help.

---
# GWAS methods summary

#### K model
Account for relatedness using either pedigree records or marker data.
- Mainly `\(\mathbf{A}\)` matrix (only considering additive relationship)
- `\(\mathbf{AD}\)` matrix might be better

#### QK model 
Account for effects of subpopulations and the relatedness within each subpopulation
- K matrix estimated from marker better than from pedigree
> Stich et al., 2008; Yu et al., 2006

-------------------

#### G model or QG model

Utilizes RR-BLUP marker effects to account for variation due to QTL found on the background chromosomes.

---
# GWAS methods summary

<div align="center">
<img src="qg.png" height=300>
</div>
> Bernardo, 2013

Through simulation,
- different genetic architecture (15 or 30 QTLs)
- population size (N=384, 768, and 1,536)

---

# The barley Example

| Env | Number| inbred | Grain yield  | SNP1 | SNP2 | SNP3 |
| :-------: | :-------: | :-----: | :--: | :--: | :--: | :--: | :--: | 
| Set1  | 18  |  __M__  | 4.45 | __C__ | __A__ | __C__ |
| Set1  | 18  |  R  | 4.61 | C | G | C |
| Set1  | 18  |  S  | 5.27 | T | A | A |
| Set2  | 9   |  R  | 5.00 | C | G | C |
| Set2  | 9   |  E  | 5.82 | T | G | C |
| Set2  | 9   |  S  | 5.79 | T | A | A |

### Fit the G model

`\begin{align*}
\mathbf{y} &= \mathbf{Xb} + \mathbf{w_i}m_i + \mathbf{Zm^*} + \mathbf{e} \\
\end{align*}`

`\begin{align*}
V(\mathbf{m}) &= \mathbf{I}V_{M_i} \\
 & = \mathbf{I}(V_{G}/N_M) 
\end{align*}`

---

# The barley Example

### Get matricies into R
`\begin{align*}
\mathbf{y} &= \mathbf{Xb} + \mathbf{w_i}m_i + \mathbf{Zm^*} + \mathbf{e} \\
\end{align*}`

```r
# let's input data by column
Z <- matrix(c(1,1,-1,1,-1,-1, 1,-1,1,-1,-1,1, 1,1,-1,1,1,-1), byrow=FALSE, nrow=6)
y <- matrix(c(4.45, 4.61, 5.27, 5.00, 5.82, 5.79), byrow=FALSE, nrow=6)
X <- matrix(c(1,1,1,0,0,0, 0, 0,0, 1, 1,1), byrow=FALSE, nrow=6)
R <- matrix(c(1/18,0,0,0,0,0, 0,1/18,0,0,0,0, 0,0,1/18,0,0,0,
              0,0,0,1/9,0,0, 0,0,0,0,1/9,0, 0,0,0,0,0,1/9), nrow=6, byrow=T)
## Adjust X and X
X2 <- cbind(X, Z[,1])
Z2 <- Z[,-1]
```

---

# The barley Example

`\begin{align*}
  \begin{bmatrix}
    \mathbf{\hat{b}} \\
    \mathbf{\hat{m}} \\
  \end{bmatrix}
  =
  \begin{bmatrix}
    \mathbf{X^{'}R^{-1}X} & \mathbf{X^{'}R^{-1}Z} \\
    \mathbf{Z^{'}R^{-1}X} & \mathbf{Z^{'}R^{-1}Z} + \mathbf{I} V_e/V_{M_i} \\
  \end{bmatrix}^{-1}
  \begin{bmatrix}
    \mathbf{X^{'}R^{-1}y} \\
    \mathbf{Z^{'}R^{-1}y} \\
  \end{bmatrix}
\end{align*}`

```r
solve_mme <- function(y,X,Z,R,lambda){
  a11 <- t(X) %*% solve(R) %*% X
  a12 <- t(X) %*% solve(R) %*% Z
  a21 <- t(Z) %*% solve(R) %*% X
  a22 <- t(Z) %*% solve(R) %*% Z 
  a22_2 <- diag(nrow(a22))*lambda
  lhs <- rbind(cbind(a11, a12), cbind(a21, a22 + a22_2))
  rhs <- rbind(t(X) %*% solve(R) %*% y, t(Z) %*% solve(R) %*% y)
  return(solve(lhs) %*% rhs)
}
```

### Solve the MME

```r
lambda = 3
eff1 <- solve_mme(y,X,Z,R,lambda)
eff2 <- solve_mme(y,X=X2,Z=Z2,R,lambda)
```

---

# The barley Example

| Env | Number| inbred | Grain yield  | SNP1 | SNP2 | SNP3 | __Type__ |
| :---: | :---: | :--: | :--: | :--: | :--: | :--: | :--: | :------: | 
| Set1  | 18  |  M  | 4.45 | __C__ | __A__ | __C__ | 2r |
| Set1  | 18  |  R  | 4.61 | C | G | C | 2r |
| Set1  | 18  |  __S__  | 5.27 | T | A | A | 6r |
| Set2  | 9   |  R  | 5.00 | C | G | C | 2r |
| Set2  | 9   |  E  | 5.82 | T | G | C | 2r |
| Set2  | 9   |  __S__  | 5.79 | T | A | A | 6r |

### Fit the QG model

`\begin{align*}
\mathbf{y} &= \mathbf{Xb} + \mathbf{Qv} + \mathbf{w_i}m_i + \mathbf{Zm^*} + \mathbf{e} \\
\end{align*}`

`\begin{align*}
V(\mathbf{m}) &= \mathbf{I}V_{M_i} \\
 & = \mathbf{I}(V_{G}/N_M) 
\end{align*}`

```r
Q <- matrix(c(1,1,-1,1,1,-1),byrow=FALSE, nrow=6)
```

---
# Solve the MME

`\begin{align*}
\mathbf{y} &= \mathbf{Xb} + \mathbf{Qv} + \mathbf{w_i}m_i + \mathbf{Zm^*} + \mathbf{e} \\
\end{align*}`

### Estimate the effect of the first marker

```r
X3 <- cbind(X, Q, Z[,1])

eff3 <- solve_mme(y,X=X3,Z=Z2,R,lambda)

eff3
```

```
##             [,1]
## [1,]  4.93732258
## [2,]  5.39500000
## [3,] -0.02848387
## [4,] -0.38922581
## [5,] -0.06425806
## [6,]  0.00000000
```