GWAS in practice

class: center, middle, inverse, title-slide

# GWAS in practice
### Jinliang Yang
### April 16th, 2020

---

# GWAS practice using rice data

[Zhao et al., 2011](https://www.nature.com/articles/ncomms1467) Genome-wide association mapping reveals a rich genetic architecture of complex traits in _Oryza sativa_

### Materials and methods
- __Materials__: 413 diverse accessions (inbreds) of rice from 82 countries
- __Genotype__: 44,100 SNPs using 44 K SNP array
- __Phenotype__: One location in two years. Two reps per year in a randomized complete block design.
  - 34 traits

### Five sub-populations
- _Indica_
- _Aus_
- _Temperate Japonica_
- _Tropical Japonica_
- _Aromatic_

---

# Geographic distribution

```r
### read in the data from step one
df <- read.csv("http://ricediversity.org/data/sets/44kgwas/RiceDiversity.44K.germplasm.csv", skip=1)
df$Latitude <- as.numeric(as.character(df$Latitude))
range(df$Latitude, na.rm = T)
```

```
## [1] -38.4161  55.7500
```

```r
df$Longitude <- as.numeric(as.character(df$Longitude))
range(df$Longitude, na.rm = T)
```

```
## [1] -102.5528  179.4144
```

---

# Geographic distribution

```r
library(ggmap)
##lowerleftlon, lowerleftlat, upperrightlon, upperrightlat
myloc <- c(-105, -40, 170, 56)

mymap <- get_map(location=myloc, source="stamen", crop=FALSE, color="bw")
ggmap(mymap) + 
    geom_point(aes(x = Longitude, y = Latitude), data = df,
               alpha = .9, size = 1, col="red")
```

---

# Weight the size of the dots

Count the number of the accessions for each country.

```r
library(plyr)
c <- ddply(df, .(Country.of.origin), nrow)
c <- subset(c, Country.of.origin != "")
df2 <- merge(c, df[, c("Country.of.origin",  "Latitude", "Longitude")], by="Country.of.origin")
df2 <- df2[!duplicated(df2$Country.of.origin), ]

mymap <- get_map(location=myloc, source="stamen", crop=FALSE, color="bw")
ggmap(mymap) + 
    geom_point(aes(x = Longitude, y = Latitude), data = df2,
               alpha = .9, size = df2$V1/10, col="red")
```

---

# Genotypic data

- `data/RiceDiversity_44K_Genotypes_PLINK/`
  - sativas413.fam
  - sativas413.map
  - sativas413.ped
- Or download data from: [zipped data](http://ricediversity.org/data/sets/44kgwas/)

--------------------

### PLINK PED File format

__.fam__: A text file with no header line, and one line per sample with the following six fields:

- Family ID ('FID')
- Within-family ID ('IID'; cannot be '0')
- Within-family ID of father ('0' if father isn't in dataset)
- Within-family ID of mother ('0' if mother isn't in dataset)
- Sex code ('1' = male, '2' = female, '0' = unknown)
- Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)

---

# Genotypic data

- `data/RiceDiversity_44K_Genotypes_PLINK/`
  - sativas413.fam
  - sativas413.map
  - sativas413.ped
- Or download data from: [zipped data](http://ricediversity.org/data/sets/44kgwas/)

--------------------

### PLINK PED File format

__.map__: A text file with no header file, and one line per variant with the following 3-4 fields:

- Chromosome code. PLINK 1.9 also permits contig names here, but most older programs do not.
- Variant identifier
- Position in morgans or centimorgans (optional; also safe to use dummy value of '0')
- Base-pair coordinate

---

# Genotypic data

- `data/RiceDiversity_44K_Genotypes_PLINK/`
  - sativas413.fam
  - sativas413.map
  - sativas413.ped
- Or download data from: [zipped data](http://ricediversity.org/data/sets/44kgwas/)

--------------------
### PLINK PED File format

__.ped__: Contains no header line, and one line per sample with 6+2V fields where V is the number of variants.

- The first six fields are the same as those in a .fam file. 
- The seventh and eighth fields are allele calls for the first variant in the .map file ('0' = no call); the 9th and 10th are allele calls for the second variant; and so on.

---

# Genotypic data manipulation

```bash
cp -r data/RiceDiversity_44K_Genotypes_PLINK largedata/

module load plink/1.90
# convert it to binary file
cd largedata/RiceDiversity_44K_Genotypes_PLINK
plink --file sativas413 --make-bed --out binary_sativas413
```

### Calculate MAF and missingness

[plink v1.9](https://www.cog-genomics.org/plink/1.9/)

- `--freq`: writes a minor allele frequency report to plink.frq
- `--missing`: produces sample-based and variant-based missing data reports.

```bash
plink -bfile binary_sativas413 --freq --missing --out sativas413

# copy results back to cache folder!
cp largedata/RiceDiversity_44K_Genotypes_PLINK/sativas413.frq cache/
cp largedata/RiceDiversity_44K_Genotypes_PLINK/sativas413.lmiss cache/
```

---
# Visulize MAF and locus missing rate

```r
# install.packages("data.table")
library("data.table")

maf <- fread("cache/sativas413.frq", header=TRUE)
lmiss <- fread("cache/sativas413.lmiss", header=TRUE)

pdf("graphs/maf_lmiss.pdf", width = 10, height=5)
par(mfrow=c(1,2))
hist(maf$MAF, breaks=50, col="#cdb79e", main="MAF (SNP = 36,901)", xlab="Minor Allele Freq")
#abline(v=0.01, lty=2, col="black", lwd=3)
abline(v=0.05, lty=2, col="red", lwd=3)

hist(lmiss$F_MISS, breaks=35, col="#cdb79e", main="Missingness (SNP = 36,901)", xlab="Missing Rate")
#abline(v=0.6, lty=2, col="red", lwd=3)
#abline(v=0.05, lty=2, col="red", lwd=3)
dev.off()
```

---
# LD Decay

- With `--r2`, when a table format report is requested, pairs with r2 values less than 0.2 are normally filtered out of the report.

- Use `--ld-window-r2` to adjust this threshold.

- By default, when a limited window report is requested, every pair of variants with at least (10-1) variants between them, or more than 1000 kilobases apart, is ignored. You can change the first threshold with `--ld-window`, and the second threshold with `--ld-window-kb`.

```bash
plink -bfile binary_sativas413 --r2 --ld-window 100 --ld-window-kb 100 --ld-window-r2 0 --out binary_sativas413 
```

---

# Summarize LD decay rate

```r
library("data.table")

df <- fread("largedata/RiceDiversity_44K_Genotypes_PLINK/binary_sativas413.ld", data.table=FALSE)

BINSIZE = 100
df$dist <- df$BP_B -df$BP_A
df$bin <- round(df$dist/BINSIZE, 0)

library(plyr)

df2 <- ddply(df, .(bin), summarise,
      meanr2 = mean(R2))

write.table(df2, "cache/ld_in_100bp_bin.csv", sep=",", row.names=FALSE, quote=FALSE)
```

### Plot LD

```r
ld <- read.csv("cache/ld_in_100bp_bin.csv")

plot(ld$bin*100, ld$meanr2, xlab="Physical distance (bp)", ylab="R2", main="LD decay rate in rice")
abline(h=0.3, col="red")
```

---

# Population structure

- By default, `--pca` extracts the top 20 principal components of the variance-standardized relationship matrix; you can change the number by passing a numeric parameter. 
- Eigenvectors are written to `plink.eigenvec`, and top eigenvalues are written to `plink.eigenval`. 
- The 'header' modifier adds a header line to the .eigenvec file(s).

```bash
plink -bfile binary_sativas413 --pca 'header' --out sativas413

cp largedata/RiceDiversity_44K_Genotypes_PLINK/sativas413.eigenvec cache/
```

### Plot the PCA results

```r
pca <- read.table("cache/sativas413.eigenvec", header=TRUE)
plot(pca$PC1, pca$PC2, xlab="PC1", ylab="PC2")
```

---
# GWAS using the gemma software package

### Fit the QK model

`\begin{align*}
\mathbf{y} &= \mathbf{Qv} + \mathbf{w_i}m_i + \mathbf{Zu} + \mathbf{e} \\
\end{align*}`

`\begin{align*}
V(\mathbf{u}) &= \mathbf{G}V_{g} \\
\end{align*}`

```bash
# To calculate centered relatedness matrix (will take ~ 1 min):
gemma -bfile binary_sativas413 -gk 1 -o binary_sativas413
```

-9 Phenotype doesn't go well with Gemma. We have to change a little bit.

```r
library("data.table")

ped <- fread("sativas413.ped", header=FALSE)
ped$V6 <- 1
fwrite(ped, "sativas413.ped", sep="\t", row.names=FALSE, col.names = FALSE, quote=FALSE)

fam <- fread("sativas413.fam", header=FALSE)
fam$V6 <- 1
fwrite(fam, "sativas413.fam", sep="\t", row.names=FALSE, col.names = FALSE, quote=FALSE)
```

---
# GWAS using the gemma software package

### Q matrix
If one has covariates other than the intercept and wants to adjust for those covariates simultaneously, one should provide GEMMA with a covariates file containing an intercept term explicitly.
> from Gemma manul

```
1 1 -1.5
1 2 0.3
1 2 0.6
1 1 -0.8
1 1 2.0
```

```r
# cd to largedata/RiceDiversity_44K_Genotypes_PLINK
pca <- read.table("sativas413.eigenvec", header=TRUE)
pca[,2] <- 1
write.table(pca[,2:4], "pc3.txt", sep="\t", row.names=FALSE, quote=FALSE, col.names = FALSE)
```

---
# Phenotypic data

```r
pheno <- read.delim("http://ricediversity.org/data/sets/44kgwas/RiceDiversity_44K_Phenotypes_34traits_PLINK.txt", header=TRUE)
write.table(pheno[, -1:-2], "largedata/RiceDiversity_44K_Genotypes_PLINK/pheno.txt", sep="\t", row.names=FALSE, quote=FALSE, col.names = FALSE)

library(ggplot2)
ggplot(pheno, aes(x=Plant.height)) + 
  geom_histogram(aes(y=..density..), bins=50, fill="#999999")+
  geom_density(alpha=.2, fill="#FF6666") +
  labs(title="Phenotype histogram plot",x="Plant Height", y = "Density")+
  theme_classic()
```

---
# GWAS using the gemma software package

`\begin{align*}
\mathbf{y} &= \mathbf{Qv} + \mathbf{w_i}m_i + \mathbf{Zu} + \mathbf{e} \\
\end{align*}`

`\begin{align*}
V(\mathbf{u}) &= \mathbf{G}V_{g} \\
\end{align*}`

```bash
gemma -bfile binary_sativas413 -k output/binary_sativas413.cXX.txt -c pc3.txt -p pheno.txt -lmm 4 -n 12 -o Plant.height -miss 0.9 -r2 1 -hwe 0 -maf 0.05

cp output/Plant.height.assoc.txt ../../../cache
```

- `lmm`: specify frequentist analysis choice (default 1; valid value 1-4; 1: Wald test; 2:
likelihood ratio test; 3: score test; 4: all 1-3.)
- `n`: specify phenotype column in the phenotype file (default 1); or to specify which
phenotypes are used in the mvLMM analysis
- `o`: specify output file prefix
- `miss`: specify missingness threshold (default 0.05)
- `r2`: specify r-squared threshold (default 0.9999)
- `hwe`: specify HWE test p value threshold (default 0; no test)
- `maf`: specify minor allele frequency threshold (default 0.01)

---
# The Manhattan plot

```r
library(qqman)
res <- fread("cache/Plant.height.assoc.txt")

manhattan(x = res, chr = "chr", bp = "ps", p = "p_wald", snp = "rs", col = c("blue4", "orange3"), logp = TRUE)
```

---

# rrBLUP package for GWAS

### PLINK format

```r
# install.packages("BGLR")
library("BGLR")
ped <- read_ped("data/RiceDiversity_44K_Genotypes_PLINK/sativas413.ped") 
```

Recoding the data to 0,1,2

```r
p=ped$p
n=ped$n
out=ped$x
#Recode snp to 0,1,2 format using allele 1
# 0 --> 0
# 1 --> 1
# 2 --> NA
# 3 --> 2
out[out==2]=NA
out[out==3]=2
Z <- matrix(out, nrow=p, ncol=n, byrow=TRUE)
Z <- t(Z) 
dim(Z) # # 413 x 36901
```

---

# rrBLUP package for GWAS

### Read fam file

```r
# accession ID
fam <- read.table("data/RiceDiversity_44K_Genotypes_PLINK/sativas413.fam", header = FALSE, stringsAsFactors = FALSE)  
head(fam)
rownames(Z) <- paste0("NSFTV_", fam$V2) # 413 x 36901
```

### SNP imputation

```r
for (j in 1:ncol(Z)){
  Z[,j] <- ifelse(is.na(Z[,j]), mean(Z[,j], na.rm=TRUE), Z[,j])
}
```

---
# rrBLUP package for GWAS

```r
# install.packages("rrBLUP")
library(rrBLUP)
map <- read.table("data/RiceDiversity_44K_Genotypes_PLINK/sativas413.map", header = FALSE, stringsAsFactors = FALSE)
?GWAS
mygeno <- data.frame(marker=map[,2], chrom=map[,1], pos=map[,4], t(Z-1), check.names = FALSE) # Z = \in{-1, 0, 1}

pheno$NSFTVID<- paste0("NSFTV_", pheno$NSFTVID)
mypheno <- data.frame(NSFTV_ID=pheno$NSFTVID, y=pheno$Plant.height)

res2 <- GWAS(mypheno, mygeno, n.PC=3, min.MAF=0.05, P3D=TRUE, plot=FALSE)

# Returns a data frame where the first three columns are the marker name, chromosome, and position, and subsequent columns are the marker scores (-log_{10}p) for the traits.
manhattan(x = res2, chr = "chrom", bp = "pos", p = "y", snp = "marker", col = c("blue4", "orange3"), logp = FALSE)
```