class: center, middle, inverse, title-slide # GWAS in practice ### Jinliang Yang ### April 16th, 2020 --- # GWAS practice using rice data [Zhao et al., 2011](https://www.nature.com/articles/ncomms1467) Genome-wide association mapping reveals a rich genetic architecture of complex traits in _Oryza sativa_ ### Materials and methods - __Materials__: 413 diverse accessions (inbreds) of rice from 82 countries - __Genotype__: 44,100 SNPs using 44 K SNP array - __Phenotype__: One location in two years. Two reps per year in a randomized complete block design. - 34 traits -- ### Five sub-populations - _Indica_ - _Aus_ - _Temperate Japonica_ - _Tropical Japonica_ - _Aromatic_ --- # Geographic distribution ```r ### read in the data from step one df <- read.csv("http://ricediversity.org/data/sets/44kgwas/RiceDiversity.44K.germplasm.csv", skip=1) df$Latitude <- as.numeric(as.character(df$Latitude)) range(df$Latitude, na.rm = T) ``` ``` ## [1] -38.4161 55.7500 ``` ```r df$Longitude <- as.numeric(as.character(df$Longitude)) range(df$Longitude, na.rm = T) ``` ``` ## [1] -102.5528 179.4144 ``` --- # Geographic distribution ```r library(ggmap) ##lowerleftlon, lowerleftlat, upperrightlon, upperrightlat myloc <- c(-105, -40, 170, 56) mymap <- get_map(location=myloc, source="stamen", crop=FALSE, color="bw") ggmap(mymap) + geom_point(aes(x = Longitude, y = Latitude), data = df, alpha = .9, size = 1, col="red") ``` --- # Weight the size of the dots Count the number of the accessions for each country. ```r library(plyr) c <- ddply(df, .(Country.of.origin), nrow) c <- subset(c, Country.of.origin != "") df2 <- merge(c, df[, c("Country.of.origin", "Latitude", "Longitude")], by="Country.of.origin") df2 <- df2[!duplicated(df2$Country.of.origin), ] mymap <- get_map(location=myloc, source="stamen", crop=FALSE, color="bw") ggmap(mymap) + geom_point(aes(x = Longitude, y = Latitude), data = df2, alpha = .9, size = df2$V1/10, col="red") ``` --- # Genotypic data - `data/RiceDiversity_44K_Genotypes_PLINK/` - sativas413.fam - sativas413.map - sativas413.ped - Or download data from: [zipped data](http://ricediversity.org/data/sets/44kgwas/) -------------------- ### PLINK PED File format __.fam__: A text file with no header line, and one line per sample with the following six fields: - Family ID ('FID') - Within-family ID ('IID'; cannot be '0') - Within-family ID of father ('0' if father isn't in dataset) - Within-family ID of mother ('0' if mother isn't in dataset) - Sex code ('1' = male, '2' = female, '0' = unknown) - Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control) --- # Genotypic data - `data/RiceDiversity_44K_Genotypes_PLINK/` - sativas413.fam - sativas413.map - sativas413.ped - Or download data from: [zipped data](http://ricediversity.org/data/sets/44kgwas/) -------------------- ### PLINK PED File format __.map__: A text file with no header file, and one line per variant with the following 3-4 fields: - Chromosome code. PLINK 1.9 also permits contig names here, but most older programs do not. - Variant identifier - Position in morgans or centimorgans (optional; also safe to use dummy value of '0') - Base-pair coordinate --- # Genotypic data - `data/RiceDiversity_44K_Genotypes_PLINK/` - sativas413.fam - sativas413.map - sativas413.ped - Or download data from: [zipped data](http://ricediversity.org/data/sets/44kgwas/) -------------------- ### PLINK PED File format __.ped__: Contains no header line, and one line per sample with 6+2V fields where V is the number of variants. - The first six fields are the same as those in a .fam file. - The seventh and eighth fields are allele calls for the first variant in the .map file ('0' = no call); the 9th and 10th are allele calls for the second variant; and so on. --- # Genotypic data manipulation ```bash cp -r data/RiceDiversity_44K_Genotypes_PLINK largedata/ module load plink/1.90 # convert it to binary file cd largedata/RiceDiversity_44K_Genotypes_PLINK plink --file sativas413 --make-bed --out binary_sativas413 ``` ### Calculate MAF and missingness [plink v1.9](https://www.cog-genomics.org/plink/1.9/) - `--freq`: writes a minor allele frequency report to plink.frq - `--missing`: produces sample-based and variant-based missing data reports. ```bash plink -bfile binary_sativas413 --freq --missing --out sativas413 # copy results back to cache folder! cp largedata/RiceDiversity_44K_Genotypes_PLINK/sativas413.frq cache/ cp largedata/RiceDiversity_44K_Genotypes_PLINK/sativas413.lmiss cache/ ``` --- # Visulize MAF and locus missing rate ```r # install.packages("data.table") library("data.table") maf <- fread("cache/sativas413.frq", header=TRUE) lmiss <- fread("cache/sativas413.lmiss", header=TRUE) pdf("graphs/maf_lmiss.pdf", width = 10, height=5) par(mfrow=c(1,2)) hist(maf$MAF, breaks=50, col="#cdb79e", main="MAF (SNP = 36,901)", xlab="Minor Allele Freq") #abline(v=0.01, lty=2, col="black", lwd=3) abline(v=0.05, lty=2, col="red", lwd=3) hist(lmiss$F_MISS, breaks=35, col="#cdb79e", main="Missingness (SNP = 36,901)", xlab="Missing Rate") #abline(v=0.6, lty=2, col="red", lwd=3) #abline(v=0.05, lty=2, col="red", lwd=3) dev.off() ``` --- # LD Decay - With `--r2`, when a table format report is requested, pairs with r2 values less than 0.2 are normally filtered out of the report. - Use `--ld-window-r2` to adjust this threshold. - By default, when a limited window report is requested, every pair of variants with at least (10-1) variants between them, or more than 1000 kilobases apart, is ignored. You can change the first threshold with `--ld-window`, and the second threshold with `--ld-window-kb`. ```bash plink -bfile binary_sativas413 --r2 --ld-window 100 --ld-window-kb 100 --ld-window-r2 0 --out binary_sativas413 ``` --- # Summarize LD decay rate ```r library("data.table") df <- fread("largedata/RiceDiversity_44K_Genotypes_PLINK/binary_sativas413.ld", data.table=FALSE) BINSIZE = 100 df$dist <- df$BP_B -df$BP_A df$bin <- round(df$dist/BINSIZE, 0) library(plyr) df2 <- ddply(df, .(bin), summarise, meanr2 = mean(R2)) write.table(df2, "cache/ld_in_100bp_bin.csv", sep=",", row.names=FALSE, quote=FALSE) ``` ### Plot LD ```r ld <- read.csv("cache/ld_in_100bp_bin.csv") plot(ld$bin*100, ld$meanr2, xlab="Physical distance (bp)", ylab="R2", main="LD decay rate in rice") abline(h=0.3, col="red") ``` --- # Population structure - By default, `--pca` extracts the top 20 principal components of the variance-standardized relationship matrix; you can change the number by passing a numeric parameter. - Eigenvectors are written to `plink.eigenvec`, and top eigenvalues are written to `plink.eigenval`. - The 'header' modifier adds a header line to the .eigenvec file(s). ```bash plink -bfile binary_sativas413 --pca 'header' --out sativas413 cp largedata/RiceDiversity_44K_Genotypes_PLINK/sativas413.eigenvec cache/ ``` ### Plot the PCA results ```r pca <- read.table("cache/sativas413.eigenvec", header=TRUE) plot(pca$PC1, pca$PC2, xlab="PC1", ylab="PC2") ``` --- # GWAS using the gemma software package ### Fit the QK model `\begin{align*} \mathbf{y} &= \mathbf{Qv} + \mathbf{w_i}m_i + \mathbf{Zu} + \mathbf{e} \\ \end{align*}` `\begin{align*} V(\mathbf{u}) &= \mathbf{G}V_{g} \\ \end{align*}` ```bash # To calculate centered relatedness matrix (will take ~ 1 min): gemma -bfile binary_sativas413 -gk 1 -o binary_sativas413 ``` -- -9 Phenotype doesn't go well with Gemma. We have to change a little bit. ```r library("data.table") ped <- fread("sativas413.ped", header=FALSE) ped$V6 <- 1 fwrite(ped, "sativas413.ped", sep="\t", row.names=FALSE, col.names = FALSE, quote=FALSE) fam <- fread("sativas413.fam", header=FALSE) fam$V6 <- 1 fwrite(fam, "sativas413.fam", sep="\t", row.names=FALSE, col.names = FALSE, quote=FALSE) ``` --- # GWAS using the gemma software package ### Q matrix If one has covariates other than the intercept and wants to adjust for those covariates simultaneously, one should provide GEMMA with a covariates file containing an intercept term explicitly. > from Gemma manul ``` 1 1 -1.5 1 2 0.3 1 2 0.6 1 1 -0.8 1 1 2.0 ``` -- ```r # cd to largedata/RiceDiversity_44K_Genotypes_PLINK pca <- read.table("sativas413.eigenvec", header=TRUE) pca[,2] <- 1 write.table(pca[,2:4], "pc3.txt", sep="\t", row.names=FALSE, quote=FALSE, col.names = FALSE) ``` --- # Phenotypic data ```r pheno <- read.delim("http://ricediversity.org/data/sets/44kgwas/RiceDiversity_44K_Phenotypes_34traits_PLINK.txt", header=TRUE) write.table(pheno[, -1:-2], "largedata/RiceDiversity_44K_Genotypes_PLINK/pheno.txt", sep="\t", row.names=FALSE, quote=FALSE, col.names = FALSE) library(ggplot2) ggplot(pheno, aes(x=Plant.height)) + geom_histogram(aes(y=..density..), bins=50, fill="#999999")+ geom_density(alpha=.2, fill="#FF6666") + labs(title="Phenotype histogram plot",x="Plant Height", y = "Density")+ theme_classic() ``` --- # GWAS using the gemma software package `\begin{align*} \mathbf{y} &= \mathbf{Qv} + \mathbf{w_i}m_i + \mathbf{Zu} + \mathbf{e} \\ \end{align*}` `\begin{align*} V(\mathbf{u}) &= \mathbf{G}V_{g} \\ \end{align*}` ```bash gemma -bfile binary_sativas413 -k output/binary_sativas413.cXX.txt -c pc3.txt -p pheno.txt -lmm 4 -n 12 -o Plant.height -miss 0.9 -r2 1 -hwe 0 -maf 0.05 cp output/Plant.height.assoc.txt ../../../cache ``` - `lmm`: specify frequentist analysis choice (default 1; valid value 1-4; 1: Wald test; 2: likelihood ratio test; 3: score test; 4: all 1-3.) - `n`: specify phenotype column in the phenotype file (default 1); or to specify which phenotypes are used in the mvLMM analysis - `o`: specify output file prefix - `miss`: specify missingness threshold (default 0.05) - `r2`: specify r-squared threshold (default 0.9999) - `hwe`: specify HWE test p value threshold (default 0; no test) - `maf`: specify minor allele frequency threshold (default 0.01) --- # The Manhattan plot ```r library(qqman) res <- fread("cache/Plant.height.assoc.txt") manhattan(x = res, chr = "chr", bp = "ps", p = "p_wald", snp = "rs", col = c("blue4", "orange3"), logp = TRUE) ``` --- # rrBLUP package for GWAS ### PLINK format ```r # install.packages("BGLR") library("BGLR") ped <- read_ped("data/RiceDiversity_44K_Genotypes_PLINK/sativas413.ped") ``` Recoding the data to 0,1,2 ```r p=ped$p n=ped$n out=ped$x #Recode snp to 0,1,2 format using allele 1 # 0 --> 0 # 1 --> 1 # 2 --> NA # 3 --> 2 out[out==2]=NA out[out==3]=2 Z <- matrix(out, nrow=p, ncol=n, byrow=TRUE) Z <- t(Z) dim(Z) # # 413 x 36901 ``` --- # rrBLUP package for GWAS ### Read fam file ```r # accession ID fam <- read.table("data/RiceDiversity_44K_Genotypes_PLINK/sativas413.fam", header = FALSE, stringsAsFactors = FALSE) head(fam) rownames(Z) <- paste0("NSFTV_", fam$V2) # 413 x 36901 ``` ### SNP imputation ```r for (j in 1:ncol(Z)){ Z[,j] <- ifelse(is.na(Z[,j]), mean(Z[,j], na.rm=TRUE), Z[,j]) } ``` --- # rrBLUP package for GWAS ```r # install.packages("rrBLUP") library(rrBLUP) map <- read.table("data/RiceDiversity_44K_Genotypes_PLINK/sativas413.map", header = FALSE, stringsAsFactors = FALSE) ?GWAS mygeno <- data.frame(marker=map[,2], chrom=map[,1], pos=map[,4], t(Z-1), check.names = FALSE) # Z = \in{-1, 0, 1} pheno$NSFTVID<- paste0("NSFTV_", pheno$NSFTVID) mypheno <- data.frame(NSFTV_ID=pheno$NSFTVID, y=pheno$Plant.height) res2 <- GWAS(mypheno, mygeno, n.PC=3, min.MAF=0.05, P3D=TRUE, plot=FALSE) # Returns a data frame where the first three columns are the marker name, chromosome, and position, and subsequent columns are the marker scores (-log_{10}p) for the traits. manhattan(x = res2, chr = "chrom", bp = "pos", p = "y", snp = "marker", col = c("blue4", "orange3"), logp = FALSE) ```