GWAS in practice

class: center, middle, inverse, title-slide

# GWAS in practice
### Jinliang Yang
### May 1, 2025

---

# GWAS practice using rice data

[Zhao et al., 2011](https://www.nature.com/articles/ncomms1467) Genome-wide association mapping reveals a rich genetic architecture of complex traits in _Oryza sativa_

### Materials and methods
- __Materials__: 413 diverse accessions (inbreds) of rice from 82 countries
- __Genotype__: 44,100 SNPs using 44 K SNP array
- __Phenotype__: One location in two years. Two reps per year in a randomized complete block design.
  - 34 traits

### Five sub-populations
- _Indica_
- _Aus_
- _Temperate Japonica_
- _Tropical Japonica_
- _Aromatic_

---

# LD Decay

Linkage disequilibrium (LD) refers to the nonrandom associations of alleles at different loci. 
- The decay of LD is affected by recombination rate and the number of generations of recombination. Therefore, investigating LD decay may reveal the population recombination history.
- For GWAS, it helps to estimate the number of markers needed.

### using PLINK to compute LD

- With `--r2`, when a table format report is requested, pairs with r2 values less than 0.2 are normally filtered out of the report.

- Use `--ld-window-r2` to adjust this threshold.

---

# LD Decay

- Using HCC to conduct the analysis

```bash
### log onto HCC
ssh YOUR_USER_ID@crane.unl.edu
# Enter your passcode

### request computing node
srun --qos=short --nodes=1 --licenses=common --ntasks=4 --mem 32G --time 6:00:00 --pty bash
# cd to your project repo
# git pull
```

- Load the `PLINK` module

```bash
module load plink
cd largedata/RiceDiversity_44K_Genotypes_PLINK/

# By default, when a limited window report is requested, 
# every pair of variants with at least (10-1) variants between them, 
# or more than 1000 kilobases apart, is ignored. 
# You can change the first threshold with `--ld-window`, and the second threshold with `--ld-window-kb`.
plink -bfile binary_sativas413 --r2 --ld-window 100 --ld-window-kb 100 --ld-window-r2 0 --out binary_sativas413 
```

---

# Summarize LD decay rate

```r
library("data.table")
# cd back to your project home dir
df <- fread("largedata/RiceDiversity_44K_Genotypes_PLINK/binary_sativas413.ld", data.table=FALSE)

BINSIZE = 100
df$dist <- df$BP_B - df$BP_A
df$bin <- round(df$dist/BINSIZE, 0)

library(plyr)

df2 <- ddply(df, .(bin), summarise,
      meanr2 = mean(R2))

write.table(df2, "cache/ld_in_100bp_bin.csv", sep=",", row.names=FALSE, quote=FALSE)
```

---

### Plot LD decay results

- Plot the figure and sync the figure through github

```r
ld <- read.csv("cache/ld_in_100bp_bin.csv")

pdf("graphs/ld_decay.pdf", width=10, height=10)
plot(ld$bin*100, ld$meanr2, xlab="Physical distance (bp)", ylab="R2", main="LD decay rate in rice")
abline(h=0.3, col="red")
dev.off()
```

- Sync the results through github and plot the figure on local computer

```r
ld <- read.csv("cache/ld_in_100bp_bin.csv")

plot(ld$bin*100, ld$meanr2, xlab="Physical distance (bp)", ylab="R2", main="LD decay rate in rice")
abline(h=0.3, col="red")
```

---

# Population structure using PCA

PCA (principal component analysis) is a method often used to compress the high dimensional data without losing as much information.

Basically, it creates linear combinations of the columns of matrix information, `$\mathbf{X}$`, and generates, at most, `$p$` linear combinations, called principal components.

`\begin{align*}
PC_1 = \mathbf{w_1X} \\
PC_2 = \mathbf{w_2X} \\
PC_p = \mathbf{w_pX} \\
\end{align*}`
Here, `$\mathbf{w_p}$` is the __eigenvector__ of `$PC_p$`.

The first PC, or PC1, captures the largest variance, the 2nd PC, or PC2, captures the 2nd largest variance, and so on.

---

# PCA using PLINK

- By default, `--pca` extracts the top 20 principal components; you can change the number by passing a numeric parameter. 
- Eigenvectors are written to `plink.eigenvec`, and top eigenvalues are written to `plink.eigenval`. 
- The 'header' modifier adds a header line to the `.eigenvec` file(s).

```bash
cd largedata/RiceDiversity_44K_Genotypes_PLINK
plink -bfile binary_sativas413 --pca 'header' --out sativas413 
cd ../../
cp largedata/RiceDiversity_44K_Genotypes_PLINK/sativas413.eigenvec cache/
```

---

# PCA using PLINK

### Plot the PCA results

```r
pca <- read.table("cache/sativas413.eigenvec", header=TRUE)
plot(pca$PC1, pca$PC2, xlab="PC1", ylab="PC2")
plot(pca$PC3, pca$PC4, xlab="PC3", ylab="PC4")
```

```r
# install.packages("scatterplot3d")
library("scatterplot3d")

fsize=16
pdf("graphs/pca_3d.pdf", width=10, height=10)
scatterplot3d(pca[,3:5], pch = 16, cex.symbol=1.2, color="#00BFC4", main="Maize Diversity Panel", angle=40)
dev.off()
```

---

# GWAS using the `gemma` software package

### Fit the QK model

`\begin{align*}
\mathbf{y} &= \mathbf{Qv} + \mathbf{w_i}m_i + \mathbf{Zu} + \mathbf{e} \\
\end{align*}`

```bash
module load gemma
# To calculate centered relatedness matrix (will take ~ 1 min):
gemma -bfile binary_sativas413 -gk 1 -o binary_sativas413
```

-9 Phenotype doesn't go well with `gemma`. We have to change a little bit.

---

# GWAS using the `gemma` software package

### Fit the QK model

```r
library("data.table")

ped <- fread("sativas413.ped", header=FALSE)
ped$V6 <- 1
fwrite(ped, "sativas413.ped", sep="\t", row.names=FALSE, col.names = FALSE, quote=FALSE)

fam <- fread("sativas413.fam", header=FALSE)
fam$V6 <- 1
fwrite(fam, "sativas413.fam", sep="\t", row.names=FALSE, col.names = FALSE, quote=FALSE)
```

```bash
module plink
plink --file sativas413 --make-bed --out binary_sativas413
# To calculate centered relatedness matrix (will take ~ 1 min):
gemma -bfile binary_sativas413 -gk 1 -o binary_sativas413
```

```r
library("data.table")
k <- fread("largedata/RiceDiversity_44K_Genotypes_PLINK/output/binary_sativas413.cXX.txt", header=FALSE)
dim(k)
```

---
# GWAS using the gemma software package

### Q matrix
If one has covariates other than the intercept and wants to adjust for those covariates simultaneously, one should provide `GEMMA` with a covariates file containing an intercept term explicitly.
> from Gemma manual

```
1 1 -1.5
1 2 0.3
1 2 0.6
1 1 -0.8
1 1 2.0
```

```r
# cd to largedata/RiceDiversity_44K_Genotypes_PLINK
pca <- read.table("sativas413.eigenvec", header=TRUE)
pca[,2] <- 1
write.table(pca[,2:5], "pc3.txt", sep="\t", row.names=FALSE, 
            quote=FALSE, col.names = FALSE)
```

---
# Phenotypic data

```r
pheno <- read.delim("http://ricediversity.org/data/sets/44kgwas/RiceDiversity_44K_Phenotypes_34traits_PLINK.txt", header=TRUE)

library(ggplot2)
ggplot(pheno, aes(x=Plant.height)) + 
  geom_histogram(aes(y=..density..), bins=50, fill="#999999")+
  geom_density(alpha=.2, fill="#FF6666") +
  labs(title="Phenotype histogram plot",x="Plant Height", y = "Density")+
  theme_classic()
```

- On HCC, write the `pheno.txt` to the genotypic data folder

```r
pheno <- read.delim("http://ricediversity.org/data/sets/44kgwas/RiceDiversity_44K_Phenotypes_34traits_PLINK.txt", header=TRUE)
write.table(pheno[, -1:-2], "largedata/RiceDiversity_44K_Genotypes_PLINK/pheno.txt", 
            sep="\t", row.names=FALSE, quote=FALSE, col.names = FALSE)
```

---
# GWAS using the gemma software package

`\begin{align*}
\mathbf{y} &= \mathbf{Qv} + \mathbf{w_i}m_i + \mathbf{Zu} + \mathbf{e} \\
\end{align*}`

```bash
gemma -bfile binary_sativas413 -c pc3.txt -k output/binary_sativas413.cXX.txt -p pheno.txt -lmm 4 -n 12 -o Plant.height -miss 0.9 -r2 1 -hwe 0 -maf 0.05

cp output/Plant.height.assoc.txt ../../cache
```

- `lmm`: specify frequentist analysis choice (default 1; valid value 1-4; 1: Wald test; 2:
likelihood ratio test; 3: score test; 4: all 1-3.)
- `n`: specify phenotype column in the phenotype file (default 1); or to specify which
phenotypes are used in the mvLMM analysis
- `o`: specify output file prefix
- `miss`: specify missingness threshold (default 0.05)
- `r2`: specify r-squared threshold (default 0.9999)
- `hwe`: specify HWE test p value threshold (default 0; no test)
- `maf`: specify minor allele frequency threshold (default 0.01)

---
# The Manhattan plot

```r
library(qqman)
library("data.table")
res <- fread("cache/Plant.height.assoc.txt")

manhattan(x = res, chr = "chr", bp = "ps", p = "p_wald", snp = "rs", col = c("blue4", "orange3"), logp = TRUE)
```

---

# rrBLUP package for GWAS

### PLINK format

```r
# install.packages("BGLR")
library("BGLR")
ped <- read_ped("data/RiceDiversity_44K_Genotypes_PLINK/sativas413.ped") 
```

Recoding the data to `0,1,2`

```r
p=ped$p
n=ped$n
out=ped$x
#Recode snp to 0,1,2 format using allele 1
# 0 --> 0
# 1 --> 1
# 2 --> NA
# 3 --> 2
out[out==2]=NA
out[out==3]=2
Z <- matrix(out, nrow=p, ncol=n, byrow=TRUE)
Z <- t(Z) 
dim(Z) # # 413 x 36901
```

---

# rrBLUP package for GWAS

### Read fam file

```r
# accession ID
fam <- read.table("data/RiceDiversity_44K_Genotypes_PLINK/sativas413.fam", header = FALSE, stringsAsFactors = FALSE)  
head(fam)
rownames(Z) <- paste0("NSFTV_", fam$V2) # 413 x 36901
```

### SNP imputation

```r
for (j in 1:ncol(Z)){
  Z[,j] <- ifelse(is.na(Z[,j]), mean(Z[,j], na.rm=TRUE), Z[,j])
}
```

---
# rrBLUP package for GWAS

```r
# install.packages("rrBLUP")
library(rrBLUP)
map <- read.table("data/RiceDiversity_44K_Genotypes_PLINK/sativas413.map", header = FALSE, stringsAsFactors = FALSE)
mygeno <- data.frame(marker=map[,2], chrom=map[,1], pos=map[,4], t(Z-1), check.names = FALSE) # Z = \in{-1, 0, 1}
pheno <- read.delim("http://ricediversity.org/data/sets/44kgwas/RiceDiversity_44K_Phenotypes_34traits_PLINK.txt", header=TRUE)

pheno$NSFTVID <- paste0("NSFTV_", pheno$NSFTVID)
mypheno <- data.frame(NSFTV_ID=pheno$NSFTVID, y=pheno$Plant.height)

res2 <- GWAS(mypheno, mygeno, fixed=NULL, n.PC=3, min.MAF=0.05, P3D=TRUE, plot=FALSE)

# Returns a data frame where the first three columns are the marker name, chromosome, and position, and subsequent columns are the marker scores (-log_{10}p) for the traits.
library(qqman)
pdf("graphs/mht_res2.pdf", width=10, height=5)
manhattan(x = res2, chr = "chrom", bp = "pos", p = "y", snp = "marker", col = c("blue4", "orange3"), logp = FALSE)
dev.off()
```