Processing math: 100%
+ - 0:00:00
Notes for current slide
Notes for next slide

Thetas and Fst calculation

Jinliang Yang

Feb. 6th, 2020

1 / 29

NGS data simulation using wgsim

Usage: wgsim [options] <in.ref.fa> <out.read1.fq> <out.read2.fq>
Options: -e FLOAT base error rate [0.020]
-d INT outer distance between the two ends [500]
-s INT standard deviation [50]
-N INT number of read pairs [1000000]
-1 INT length of the first read [70]
-2 INT length of the second read [70]
-r FLOAT rate of mutations [0.0010]
-R FLOAT fraction of indels [0.15]
-X FLOAT probability an indel is extended [0.30]
-S INT seed for random generator [-1]
-A FLOAT disgard if the fraction of ambiguous
bases higher than FLOAT [0.05]
-h haplotype mode

Type in the following command:

wgsim lambda.fa -e 0 -d 500 -N 5000 -1 100 -2 100 -r 0.01 \
-R 0 -X 0 -S 1234567 -h l1.read1.fq l1.read2.fq
2 / 29

Download Reference from EnsemblPlants

Maize reference genome

Change to largedata\lab4 folder:

cd largedata
mkdir lab4
cd lab4

Then use wget to download the reference genome:

wget ftp://ftp.ensemblgenomes.org/pub/plants/release-46/fasta/zea_mays/dna/Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa.gz
### then unzip it
gunzip Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa.gz
3 / 29

NGS data simulation using wgsim

Type in the following command:

wgsim Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa \
-e 0 -d 500 -N 5000 -1 100 -2 100 -r 0.01 \
-R 0 -X 0 -S 1234567 l1.read1.fq l1.read2.fq
  • Reference (about 500k)
    • Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa
  • 20x coverage
    • N 5000
  • PE 100bp
    • -1 100 -2 100
  • Only SNP no Indel
    • -R 0 -X 0
  • Mutation rate is low
    • -r 0.01
4 / 29

NGS data simulation using wgsim

simulate 10 individals

for i in {1..10}
do
wgsim Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa -e 0 -d 500 -N 50000 -1 100 -2 100 -r 0.1 -R 0 -X 0 l$i.read1.fq l$i.read2.fq
done
5 / 29

NGS data simulation using wgsim

simulate 10 individals

for i in {1..10}
do
wgsim Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa -e 0 -d 500 -N 50000 -1 100 -2 100 -r 0.1 -R 0 -X 0 l$i.read1.fq l$i.read2.fq
done

check how many reads

wc -l l1.read1.fq
# suppose to be 200,000 lines = 50,000 reads
6 / 29

A procedure to calculate θ and FST values

1. Align the NGS reads to the reference genome

2. Calculate SFS

3. Calculate the thetas and Fst for each site

  • ANGSD
7 / 29

A procedure to calculate θ and FST values

1. Align the NGS reads to the reference genome

module load bwa samtools
# index the reference genome
bwa index Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa
# using bwa mem to align the reads to the reference genome
# => samtools to convert into bam file
bwa mem Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa l1.read1.fq l1.read2.fq | samtools view -bSh - > l1.bam

Do alignment for 2 individuals using bash loop:

# alignment
for i in {1..2}; do bwa mem Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa l$i.read1.fq l$i.read2.fq | samtools view -bSh - > l$i.bam; done
# sort
for i in *.bam; do samtools sort $i -o sorted_$i; done
# index them
for i in sorted*.bam; do samtools index $i; done

Check mapping statistics

samtools flagstat sorted_l1.bam
8 / 29

Submit a Slurm job

  • We wrap our jobs in little batch scripts, which is nice because these also help make steps reproducible.

  • To keep your directory organized, I usually keep a scripts directory (i.e., slurm-script/ ) and log dir (i.e., slurm-log ) for Slurm’s logs.

    • Tip: use these logs, as these are very helpful in debugging. I separate them from my project because they fill up directories rather quickly.
9 / 29

Submit a Slurm job

  • We wrap our jobs in little batch scripts, which is nice because these also help make steps reproducible.

  • To keep your directory organized, I usually keep a scripts directory (i.e., slurm-script/ ) and log dir (i.e., slurm-log ) for Slurm’s logs.

    • Tip: use these logs, as these are very helpful in debugging. I separate them from my project because they fill up directories rather quickly.

Let’s look at an example slurm script header for a job called theta (which is run with script theta.sh).

#!/bin/bash -l
#SBATCH -D ~projects/your-cool-project/
#SBATCH -o ~/your-cool-project/slurm-log/steve-stdout-%j.txt
#SBATCH -e ~/your-cool-project/slurm-log/steve-stderr-%j.txt
#SBATCH -J steve
#SBATCH -t 24:00:00
set -e
set -u
# insert your script here
10 / 29

An Example Slurm Batch Script Header

#!/bin/bash -l
#SBATCH -D ~/projects/your-cool-project/
#SBATCH -o ~/your-cool-project/slurm-log/steve-stdout-%j.txt
#SBATCH -e ~/your-cool-project/slurm-log/steve-stderr-%j.txt
#SBATCH -J theta
#SBATCH -t 24:00:00
#SBATCH --mail-user=your_email_address@gmail.com
#SBATCH --mail-type=END #email if ends
#SBATCH --mail-type=FAIL #email if fails
set -e
set -u
# insert your script here
  • D sets your project directory.
  • o sets where standard output (of your batch script) goes.
  • e sets where standard error (of your batch script) goes.
  • J sets the job name.
  • t sets the time limit for the job, 24:00:00 indicates 24 hours.
  • --mail: will email you if the job is "END" or "FAIL"
11 / 29

An Example Slurm Batch Script Header

cd slurm-script
vi my_theta.sh
i # insert text
  • Copy the above header to a .sh file and make appropriate modifications

  • Insert the following:

# module load bwa samtools
# cd largedata/lab4/
# alignment
for i in {1..2}; do bwa mem Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa l$i.read1.fq l$i.read2.fq | samtools view -bSh - > l$i.bam; done
# sort
for i in *.bam; do samtools sort $i -o sorted_$i; done
# index them
for i in sorted*.bam; do samtools index $i; done
12 / 29

An Example Slurm Batch Script Header

cd slurm-script
vi my_theta.sh
i # insert text
  • Copy the above header to a .sh file and make appropriate modifications

  • Insert the following:

# module load bwa samtools
# cd largedata/lab4/
# alignment
for i in {1..2}; do bwa mem Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa l$i.read1.fq l$i.read2.fq | samtools view -bSh - > l$i.bam; done
# sort
for i in *.bam; do samtools sort $i -o sorted_$i; done
# index them
for i in sorted*.bam; do samtools index $i; done
  • submit the job via sbatch:
sbatch --licenses=common --ntasks=1 --mem=5G slurm-script/my_theta.sh
## check your job status
squeue | grep "YOUR USER ID"
13 / 29

A procedure to calculate θ values

2. Calculate SFS using ANGSD

Install ANGSD first

cd ~/bin/ # if you don't have one, do `mkdir bin`
git clone https://github.com/samtools/htslib.git
git clone https://github.com/ANGSD/angsd.git
cd htslib; make;
cd ../angsd;
make HTSSRC=../htslib

run angsd

# write the bam files to a txt file
mkdir bam_files
mv sorted*.bam bam_files
cd bam_files
ls sorted*.bam > bam.txt
14 / 29

A procedure to calculate θ values

angsd -bam bam.txt -doSaf 1 -anc ../Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa -GL 1 -out out
# use realSFS to calculate sfs
realSFS out.saf.idx > out.sfs

Copy the result to cache/ folder

## cp sfs to the cache/ folder
cp out.sfs ../../../cache/

3. Calculate the thetas for each site

The output from the above command are two files out.thetas.gz and out.thetas.idx. A formal description of these files can be found in the doc/formats.pdf in the angsd package. It is possible to extract the logscale persite thetas using the ./thetaStat print program.

angsd -bam bam.txt -out out -doThetas 1 -doSaf 1 -pest out.sfs -anc ../Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa -GL 1
thetaStat print out.thetas.idx > theta.txt
## cp theta to the cache/ folder
cp theta.txt ../../../cache/
15 / 29

A procedure to calculate FST

Create two list bam files

Two population Fst

# first calculate per pop saf for each populatoin
angsd -b pop1.txt -anc ../Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa -out pop1 -dosaf 1 -gl 1
angsd -b pop2.txt -anc ../Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa -out pop2 -dosaf 1 -gl 1
# calculate the 2dsfs prior
realSFS pop1.saf.idx pop2.saf.idx > pop1.pop2.ml
# prepare the fst for easy window analysis etc
realSFS fst index pop1.saf.idx pop2.saf.idx -sfs pop1.pop2.ml -fstout out
# get the global estimate
realSFS fst stats out.fst.idx
# below is not tested that much, but seems to work
realSFS fst stats2 out.fst.idx -win 500 -step 100 > fst_win.txt

Copy the result to cache/ folder

## cp sfs to the cache/ folder
cp fst_win.txt ../../../cache/
16 / 29

Visualize the results

In local computer, using R:

Barplot for SFS

s <- scan('../../cache/out.sfs')
s <- s[-c(1,length(s))]
s <- s/sum(s)
barplot(s,names=1:length(s), main='SFS')

Histgram distribution of the theta values

hist(t$Pairwise)

Scatter plot of the Fst values

fst <- read.table("cache/fst_win.txt", skip=1, header=FALSE)
names(fst)[c(3,5)] <- c("midp", "fst")
plot(fst$midp, fst$fst, xlab="Physical position", ylab="Fst", col="#5f9ea0", pch=16)
17 / 29

General feature format (GFF) from EnsemblPlants

Maize reference genome

change to largedata\lab4 folder:

wget ftp://ftp.ensemblgenomes.org/pub/plants/release-46/fasta/zea_mays/dna/Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa.gz
### then unzip it
gunzip Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa.gz

Similarly, we will download and unzip the Mt GFF3 file

18 / 29

General feature format (GFF) from EnsemblPlants

Maize reference genome

change to largedata\lab4 folder:

wget ftp://ftp.ensemblgenomes.org/pub/plants/release-46/fasta/zea_mays/dna/Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa.gz
### then unzip it
gunzip Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa.gz

Similarly, we will download and unzip the Mt GFF3 file

Use R to process the GFF3 file

# install.package("data.table")
library("data.table")
## simply read in wouldn't work
gff <- fread("largedata/lab4/Zea_mays.B73_RefGen_v4.46.chromosome.Mt.gff3", skip="#", header=FALSE, data.table=FALSE)
## grep -v means select lines that not matching any of the specified patterns
gff <- fread(cmd='grep -v "#" largedata/lab4/Zea_mays.B73_RefGen_v4.46.chromosome.Mt.gff3', header=FALSE, data.table=FALSE)
19 / 29

General feature format (GFF) version 3

V1 V2 V3 V4 V5 V6 V7 V8
1 Mt Gramene chromosome 1 569630 . . .
2 Mt ensembl gene 6391 6738 . + .
3 Mt NCBI mRNA 6391 6738 . + .
4 Mt NCBI exon 6391 6738 . + .
5 Mt NCBI CDS 6391 6738 . + 0
6 Mt ensembl gene 6951 8285 . + .
V9
1 ID=chromosome:Mt;Alias=AY506529.1,NC_007982.1;Is_circular=true
2 ID=gene:ZeamMp002;biotype=protein_coding;description=orf115-a1;
3 ID=transcript:ZeamMp002;Parent=gene:ZeamMp002;
4 Parent=transcript:ZeamMp002;Name=ZeamMp002.exon1;constitutive=1;ensembl_end_phase=0;
5 ID=CDS:ZeamMp002;Parent=transcript:ZeamMp002;
6 ID=gene:ZeamMp003;biotype=protein_coding;description=orf444
20 / 29

General feature format (GFF) version 3

V1 V2 V3 V4 V5 V6 V7 V8
1 Mt Gramene chromosome 1 569630 . . .
2 Mt ensembl gene 6391 6738 . + .
V9
1 ID=chromosome:Mt;Alias=AY506529.1,NC_007982.1;Is_circular=true
2 ID=gene:ZeamMp002;biotype=protein_coding;description=orf115-a1;

  • 1 sequence: The name of the sequence where the feature is located.

  • 2 source: Keyword identifying the source of the feature, like a program (e.g. RepeatMasker) or an organization (like ensembl).

  • 3 feature: The feature type name, like "gene" or "exon".

    • In a well structured GFF file, all the children features always follow their parents in a single block (so all exons of a transcript are put after their parent "transcript" feature line and before any other parent transcript line).
  • 4 start: Genomic start of the feature, with a 1-base offset.

    • This is in contrast with other 0-offset half-open sequence formats, like BED.
21 / 29

General feature format (GFF) version 3

V1 V2 V3 V4 V5 V6 V7 V8
1 Mt Gramene chromosome 1 569630 . . .
2 Mt ensembl gene 6391 6738 . + .
V9
1 ID=chromosome:Mt;Alias=AY506529.1,NC_007982.1;Is_circular=true
2 ID=gene:ZeamMp002;biotype=protein_coding;description=orf115-a1;

  • 5 end: Genomic end of the feature, with a 1-base offset.

    • This is the same end coordinate as it is in 0-offset half-open sequence formats, like BED.
  • 6 score: Numeric value that generally indicates the confidence of the source in the annotated feature.

    • A value of "." (a dot) is used to define a null value.
  • 7 strand: Single character that indicates the strand of the feature.

    • it can assume the values of "+" (positive, or 5' -> 3'),
    • "-", (negative, or 3' -> 5'), "." (undetermined).
22 / 29

General feature format (GFF) version 3

V1 V2 V3 V4 V5 V6 V7 V8
1 Mt Gramene chromosome 1 569630 . . .
2 Mt ensembl gene 6391 6738 . + .
V9
1 ID=chromosome:Mt;Alias=AY506529.1,NC_007982.1;Is_circular=true
2 ID=gene:ZeamMp002;biotype=protein_coding;description=orf115-a1;

  • 8 phase: phase of CDS (means CoDing Sequence) features.

    • The phase indicates where the feature begins with reference to the reading frame.
    • The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon.
  • 9 attributes: All the other information pertaining to this feature.

    • The format, structure and content of this field is the one which varies the most between the three competing file formats.
23 / 29

Work with GFF

names(gff) <- c("seq", "source", "feature", "start", "end", "score", "strand", "phase", "att")
table(gff$feature)

Get genes and upstream and downstream 5kb regions

g <- subset(gff, feature %in% "gene")
g$geneid <- gsub(".*gene:|;biotype.*", "", g$att)
### + strand
gp <- subset(g, strand %in% "+")
# nrow(gp) 75
### get the 5k upstream of the + strand gene model
gp_up <- gp
gp_up$end <- gp_up$start - 1
gp_up$start <- gp_up$end - 5000
### get the 5k downstream of the + strand gene model
gp_down <- gp
gp_down$start <- gp_down$end + 1
gp_down$end <- gp_down$start + 5000
24 / 29

Get genes and upstream and downstream 5kb regions

### - strand
gm <- subset(g, strand %in% "-")
dim(gm) # 82
fwrite(g, "cache/mt_gene.txt", sep="\t", row.names = FALSE, quote=FALSE)
25 / 29

Intepret the theta results

library("data.table")
library("GenomicRanges")
library("plyr")
theta <- fread("cache/theta.txt", data.table=FALSE)
names(theta)[1] <- "seq"
up5k <- read.table("cache/mt_gene_up5k.txt", header=TRUE)
### define the subject file for theta values
grc <- with(theta, GRanges(seqnames=seq, IRanges(start=Pos, end=Pos)))
### define the query file for genomic feature
grf <- with(up5k, GRanges(seqnames=seq, IRanges(start=start, end=end), geneid=geneid))
### find overlaps between the two
tb <- findOverlaps(query=grf, subject=grc)
tb <- as.matrix(tb)
out1 <- as.data.frame(grf[tb[,1]])
out2 <- as.data.frame(grc[tb[,2]])
### for each genomic feature, find the sites with non-missing data
out <- cbind(out1, out2[, "start"])
names(out)[ncol(out)] <- "pos"
26 / 29

Intepret the theta results

#define unique identifier and merge with the thetas
out$uid <- paste(out$seqnames, out$pos, sep="_")
theta$uid <- paste(theta$seq, theta$Pos, sep="_")
df <- merge(out, theta[, c(-1, -2)], by="uid")
# for each upstream 5k region, how many theta values
mx <- ddply(df, .(geneid), summarise,
Pairwise = mean(Pairwise, na.rm=TRUE),
thetaH = mean(thetaH, na.rm=TRUE),
nsites = length(uid))
27 / 29

Intepret the theta results

Copy and paste everything above, and pack it into an R function:

get_mean_theta <- function(gf_file="cache/mt_gene_up5k.txt"){
# gf_file: gene feature file [chr, ="cache/mt_gene_up5k.txt"]
theta <- fread("cache/theta.txt", data.table=FALSE)
names(theta)[1] <- "seq"
up5k <- read.table(gf_file, header=TRUE)
### define the subject file for theta values
grc <- with(theta, GRanges(seqnames=seq, IRanges(start=Pos, end=Pos)))
### define the query file for genomic feature
grf <- with(up5k, GRanges(seqnames=seq, IRanges(start=start, end=end), geneid=geneid))
### find overlaps between the two
tb <- findOverlaps(query=grf, subject=grc)
tb <- as.matrix(tb)
out1 <- as.data.frame(grf[tb[,1]])
out2 <- as.data.frame(grc[tb[,2]])
### for each genomic feature, find the sites with non-missing data
out <- cbind(out1, out2[, "start"])
names(out)[ncol(out)] <- "pos"
#define unique identifier and merge with the thetas
out$uid <- paste(out$seqnames, out$pos, sep="_")
theta$uid <- paste(theta$seq, theta$Pos, sep="_")
df <- merge(out, theta[, c(-1, -2)], by="uid")
# for each upstream 5k region, how many theta values
mx <- ddply(df, .(geneid), summarise,
Pairwise = mean(Pairwise, na.rm=TRUE),
thetaH = mean(thetaH, na.rm=TRUE),
nsites = length(uid))
return(mx)
}
28 / 29

Plot the results

Run the customized R function

### apply the function
up5k <- get_mean_theta(gf_file="cache/mt_gene_up5k.txt")
down5k <- get_mean_theta(gf_file="cache/mt_gene_down5k.txt")

And then plot the results:

library("ggplot2")
up5k$feature <- "up 5k"
down5k$feature <- "down 5k"
res <- rbind(up5k, down5k)
ggplot(res, aes(x=feature, y=Pairwise, fill=feature)) +
geom_violin(trim=FALSE)+
labs(title="Theta value", x="", y = "Log10 (theta)")+
geom_boxplot(width=0.1, fill="white")+
scale_fill_brewer(palette="Blues") +
theme_classic()
29 / 29

NGS data simulation using wgsim

Usage: wgsim [options] <in.ref.fa> <out.read1.fq> <out.read2.fq>
Options: -e FLOAT base error rate [0.020]
-d INT outer distance between the two ends [500]
-s INT standard deviation [50]
-N INT number of read pairs [1000000]
-1 INT length of the first read [70]
-2 INT length of the second read [70]
-r FLOAT rate of mutations [0.0010]
-R FLOAT fraction of indels [0.15]
-X FLOAT probability an indel is extended [0.30]
-S INT seed for random generator [-1]
-A FLOAT disgard if the fraction of ambiguous
bases higher than FLOAT [0.05]
-h haplotype mode

Type in the following command:

wgsim lambda.fa -e 0 -d 500 -N 5000 -1 100 -2 100 -r 0.01 \
-R 0 -X 0 -S 1234567 -h l1.read1.fq l1.read2.fq
2 / 29
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow