wgsim
Usage: wgsim [options] <in.ref.fa> <out.read1.fq> <out.read2.fq>Options: -e FLOAT base error rate [0.020] -d INT outer distance between the two ends [500] -s INT standard deviation [50] -N INT number of read pairs [1000000] -1 INT length of the first read [70] -2 INT length of the second read [70] -r FLOAT rate of mutations [0.0010] -R FLOAT fraction of indels [0.15] -X FLOAT probability an indel is extended [0.30] -S INT seed for random generator [-1] -A FLOAT disgard if the fraction of ambiguous bases higher than FLOAT [0.05] -h haplotype mode
wgsim lambda.fa -e 0 -d 500 -N 5000 -1 100 -2 100 -r 0.01 \-R 0 -X 0 -S 1234567 -h l1.read1.fq l1.read2.fq
Maize reference genome
largedata\lab4
folder:cd largedatamkdir lab4cd lab4
wget
to download the reference genome:wget ftp://ftp.ensemblgenomes.org/pub/plants/release-46/fasta/zea_mays/dna/Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa.gz### then unzip itgunzip Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa.gz
wgsim
wgsim Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa \-e 0 -d 500 -N 5000 -1 100 -2 100 -r 0.01 \-R 0 -X 0 -S 1234567 l1.read1.fq l1.read2.fq
Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa
N 5000
-1 100 -2 100
-R 0 -X 0
-r 0.01
wgsim
for i in {1..10}do wgsim Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa -e 0 -d 500 -N 50000 -1 100 -2 100 -r 0.1 -R 0 -X 0 l$i.read1.fq l$i.read2.fqdone
wgsim
for i in {1..10}do wgsim Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa -e 0 -d 500 -N 50000 -1 100 -2 100 -r 0.1 -R 0 -X 0 l$i.read1.fq l$i.read2.fqdone
wc -l l1.read1.fq # suppose to be 200,000 lines = 50,000 reads
module load bwa samtools# index the reference genomebwa index Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa# using bwa mem to align the reads to the reference genome # => samtools to convert into bam filebwa mem Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa l1.read1.fq l1.read2.fq | samtools view -bSh - > l1.bam
# alignmentfor i in {1..2}; do bwa mem Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa l$i.read1.fq l$i.read2.fq | samtools view -bSh - > l$i.bam; done# sortfor i in *.bam; do samtools sort $i -o sorted_$i; done# index themfor i in sorted*.bam; do samtools index $i; done
samtools flagstat sorted_l1.bam
We wrap our jobs in little batch scripts, which is nice because these also help make steps reproducible.
To keep your directory organized, I usually keep a scripts directory (i.e., slurm-script/
) and log dir (i.e., slurm-log
) for Slurm’s logs.
We wrap our jobs in little batch scripts, which is nice because these also help make steps reproducible.
To keep your directory organized, I usually keep a scripts directory (i.e., slurm-script/
) and log dir (i.e., slurm-log
) for Slurm’s logs.
Let’s look at an example slurm script header for a job called theta
(which is run with script theta.sh
).
#!/bin/bash -l#SBATCH -D ~projects/your-cool-project/#SBATCH -o ~/your-cool-project/slurm-log/steve-stdout-%j.txt#SBATCH -e ~/your-cool-project/slurm-log/steve-stderr-%j.txt#SBATCH -J steve#SBATCH -t 24:00:00set -eset -u# insert your script here
#!/bin/bash -l#SBATCH -D ~/projects/your-cool-project/#SBATCH -o ~/your-cool-project/slurm-log/steve-stdout-%j.txt#SBATCH -e ~/your-cool-project/slurm-log/steve-stderr-%j.txt#SBATCH -J theta#SBATCH -t 24:00:00#SBATCH --mail-user=your_email_address@gmail.com#SBATCH --mail-type=END #email if ends#SBATCH --mail-type=FAIL #email if failsset -eset -u# insert your script here
D
sets your project directory.o
sets where standard output (of your batch script) goes.e
sets where standard error (of your batch script) goes.J
sets the job name.t
sets the time limit for the job, 24:00:00 indicates 24 hours.--mail
: will email you if the job is "END" or "FAIL"cd slurm-scriptvi my_theta.shi # insert text
Copy the above header to a .sh
file and make appropriate modifications
Insert the following:
# module load bwa samtools# cd largedata/lab4/# alignmentfor i in {1..2}; do bwa mem Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa l$i.read1.fq l$i.read2.fq | samtools view -bSh - > l$i.bam; done# sortfor i in *.bam; do samtools sort $i -o sorted_$i; done# index themfor i in sorted*.bam; do samtools index $i; done
cd slurm-scriptvi my_theta.shi # insert text
Copy the above header to a .sh
file and make appropriate modifications
Insert the following:
# module load bwa samtools# cd largedata/lab4/# alignmentfor i in {1..2}; do bwa mem Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa l$i.read1.fq l$i.read2.fq | samtools view -bSh - > l$i.bam; done# sortfor i in *.bam; do samtools sort $i -o sorted_$i; done# index themfor i in sorted*.bam; do samtools index $i; done
sbatch
:sbatch --licenses=common --ntasks=1 --mem=5G slurm-script/my_theta.sh## check your job statussqueue | grep "YOUR USER ID"
ANGSD
cd ~/bin/ # if you don't have one, do `mkdir bin`git clone https://github.com/samtools/htslib.gitgit clone https://github.com/ANGSD/angsd.git cd htslib; make; cd ../angsd;make HTSSRC=../htslib
# write the bam files to a txt filemkdir bam_filesmv sorted*.bam bam_filescd bam_filesls sorted*.bam > bam.txt
angsd -bam bam.txt -doSaf 1 -anc ../Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa -GL 1 -out out # use realSFS to calculate sfsrealSFS out.saf.idx > out.sfs
cache/
folder## cp sfs to the cache/ foldercp out.sfs ../../../cache/
The output from the above command are two files out.thetas.gz and out.thetas.idx. A formal description of these files can be found in the doc/formats.pdf in the angsd package. It is possible to extract the logscale persite thetas using the ./thetaStat print program.
angsd -bam bam.txt -out out -doThetas 1 -doSaf 1 -pest out.sfs -anc ../Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa -GL 1thetaStat print out.thetas.idx > theta.txt## cp theta to the cache/ foldercp theta.txt ../../../cache/
# first calculate per pop saf for each populatoinangsd -b pop1.txt -anc ../Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa -out pop1 -dosaf 1 -gl 1angsd -b pop2.txt -anc ../Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa -out pop2 -dosaf 1 -gl 1# calculate the 2dsfs priorrealSFS pop1.saf.idx pop2.saf.idx > pop1.pop2.ml# prepare the fst for easy window analysis etcrealSFS fst index pop1.saf.idx pop2.saf.idx -sfs pop1.pop2.ml -fstout out# get the global estimaterealSFS fst stats out.fst.idx # below is not tested that much, but seems to workrealSFS fst stats2 out.fst.idx -win 500 -step 100 > fst_win.txt
cache/
folder## cp sfs to the cache/ foldercp fst_win.txt ../../../cache/
In local computer, using R
:
s <- scan('../../cache/out.sfs')s <- s[-c(1,length(s))]s <- s/sum(s)barplot(s,names=1:length(s), main='SFS')
hist(t$Pairwise)
fst <- read.table("cache/fst_win.txt", skip=1, header=FALSE)names(fst)[c(3,5)] <- c("midp", "fst")plot(fst$midp, fst$fst, xlab="Physical position", ylab="Fst", col="#5f9ea0", pch=16)
Maize reference genome
change to largedata\lab4
folder:
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-46/fasta/zea_mays/dna/Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa.gz### then unzip itgunzip Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa.gz
Similarly, we will download and unzip the Mt GFF3 file
Maize reference genome
change to largedata\lab4
folder:
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-46/fasta/zea_mays/dna/Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa.gz### then unzip itgunzip Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa.gz
Similarly, we will download and unzip the Mt GFF3 file
# install.package("data.table")library("data.table")## simply read in wouldn't workgff <- fread("largedata/lab4/Zea_mays.B73_RefGen_v4.46.chromosome.Mt.gff3", skip="#", header=FALSE, data.table=FALSE)## grep -v means select lines that not matching any of the specified patternsgff <- fread(cmd='grep -v "#" largedata/lab4/Zea_mays.B73_RefGen_v4.46.chromosome.Mt.gff3', header=FALSE, data.table=FALSE)
V1 V2 V3 V4 V5 V6 V7 V81 Mt Gramene chromosome 1 569630 . . .2 Mt ensembl gene 6391 6738 . + .3 Mt NCBI mRNA 6391 6738 . + .4 Mt NCBI exon 6391 6738 . + .5 Mt NCBI CDS 6391 6738 . + 06 Mt ensembl gene 6951 8285 . + . V91 ID=chromosome:Mt;Alias=AY506529.1,NC_007982.1;Is_circular=true2 ID=gene:ZeamMp002;biotype=protein_coding;description=orf115-a1;3 ID=transcript:ZeamMp002;Parent=gene:ZeamMp002;4 Parent=transcript:ZeamMp002;Name=ZeamMp002.exon1;constitutive=1;ensembl_end_phase=0;5 ID=CDS:ZeamMp002;Parent=transcript:ZeamMp002;6 ID=gene:ZeamMp003;biotype=protein_coding;description=orf444
V1 V2 V3 V4 V5 V6 V7 V81 Mt Gramene chromosome 1 569630 . . .2 Mt ensembl gene 6391 6738 . + . V91 ID=chromosome:Mt;Alias=AY506529.1,NC_007982.1;Is_circular=true2 ID=gene:ZeamMp002;biotype=protein_coding;description=orf115-a1;
1 sequence: The name of the sequence where the feature is located.
2 source: Keyword identifying the source of the feature, like a program (e.g. RepeatMasker) or an organization (like ensembl).
3 feature: The feature type name, like "gene" or "exon".
4 start: Genomic start of the feature, with a 1-base offset.
V1 V2 V3 V4 V5 V6 V7 V81 Mt Gramene chromosome 1 569630 . . .2 Mt ensembl gene 6391 6738 . + . V91 ID=chromosome:Mt;Alias=AY506529.1,NC_007982.1;Is_circular=true2 ID=gene:ZeamMp002;biotype=protein_coding;description=orf115-a1;
5 end: Genomic end of the feature, with a 1-base offset.
6 score: Numeric value that generally indicates the confidence of the source in the annotated feature.
7 strand: Single character that indicates the strand of the feature.
V1 V2 V3 V4 V5 V6 V7 V81 Mt Gramene chromosome 1 569630 . . .2 Mt ensembl gene 6391 6738 . + . V91 ID=chromosome:Mt;Alias=AY506529.1,NC_007982.1;Is_circular=true2 ID=gene:ZeamMp002;biotype=protein_coding;description=orf115-a1;
8 phase: phase of CDS (means CoDing Sequence) features.
9 attributes: All the other information pertaining to this feature.
names(gff) <- c("seq", "source", "feature", "start", "end", "score", "strand", "phase", "att")table(gff$feature)
g <- subset(gff, feature %in% "gene")g$geneid <- gsub(".*gene:|;biotype.*", "", g$att)### + strandgp <- subset(g, strand %in% "+") # nrow(gp) 75### get the 5k upstream of the + strand gene modelgp_up <- gpgp_up$end <- gp_up$start - 1gp_up$start <- gp_up$end - 5000 ### get the 5k downstream of the + strand gene modelgp_down <- gpgp_down$start <- gp_down$end + 1gp_down$end <- gp_down$start + 5000
### - strandgm <- subset(g, strand %in% "-") dim(gm) # 82fwrite(g, "cache/mt_gene.txt", sep="\t", row.names = FALSE, quote=FALSE)
library("data.table")library("GenomicRanges")library("plyr")theta <- fread("cache/theta.txt", data.table=FALSE)names(theta)[1] <- "seq"up5k <- read.table("cache/mt_gene_up5k.txt", header=TRUE)### define the subject file for theta valuesgrc <- with(theta, GRanges(seqnames=seq, IRanges(start=Pos, end=Pos)))### define the query file for genomic featuregrf <- with(up5k, GRanges(seqnames=seq, IRanges(start=start, end=end), geneid=geneid))### find overlaps between the twotb <- findOverlaps(query=grf, subject=grc)tb <- as.matrix(tb)out1 <- as.data.frame(grf[tb[,1]])out2 <- as.data.frame(grc[tb[,2]])### for each genomic feature, find the sites with non-missing dataout <- cbind(out1, out2[, "start"]) names(out)[ncol(out)] <- "pos"
#define unique identifier and merge with the thetasout$uid <- paste(out$seqnames, out$pos, sep="_")theta$uid <- paste(theta$seq, theta$Pos, sep="_")df <- merge(out, theta[, c(-1, -2)], by="uid")# for each upstream 5k region, how many theta valuesmx <- ddply(df, .(geneid), summarise, Pairwise = mean(Pairwise, na.rm=TRUE), thetaH = mean(thetaH, na.rm=TRUE), nsites = length(uid))
Copy and paste everything above, and pack it into an R
function:
get_mean_theta <- function(gf_file="cache/mt_gene_up5k.txt"){ # gf_file: gene feature file [chr, ="cache/mt_gene_up5k.txt"] theta <- fread("cache/theta.txt", data.table=FALSE) names(theta)[1] <- "seq" up5k <- read.table(gf_file, header=TRUE) ### define the subject file for theta values grc <- with(theta, GRanges(seqnames=seq, IRanges(start=Pos, end=Pos))) ### define the query file for genomic feature grf <- with(up5k, GRanges(seqnames=seq, IRanges(start=start, end=end), geneid=geneid)) ### find overlaps between the two tb <- findOverlaps(query=grf, subject=grc) tb <- as.matrix(tb) out1 <- as.data.frame(grf[tb[,1]]) out2 <- as.data.frame(grc[tb[,2]]) ### for each genomic feature, find the sites with non-missing data out <- cbind(out1, out2[, "start"]) names(out)[ncol(out)] <- "pos" #define unique identifier and merge with the thetas out$uid <- paste(out$seqnames, out$pos, sep="_") theta$uid <- paste(theta$seq, theta$Pos, sep="_") df <- merge(out, theta[, c(-1, -2)], by="uid") # for each upstream 5k region, how many theta values mx <- ddply(df, .(geneid), summarise, Pairwise = mean(Pairwise, na.rm=TRUE), thetaH = mean(thetaH, na.rm=TRUE), nsites = length(uid)) return(mx)}
Run the customized R function
### apply the functionup5k <- get_mean_theta(gf_file="cache/mt_gene_up5k.txt")down5k <- get_mean_theta(gf_file="cache/mt_gene_down5k.txt")
And then plot the results:
library("ggplot2")up5k$feature <- "up 5k"down5k$feature <- "down 5k"res <- rbind(up5k, down5k)ggplot(res, aes(x=feature, y=Pairwise, fill=feature)) + geom_violin(trim=FALSE)+ labs(title="Theta value", x="", y = "Log10 (theta)")+ geom_boxplot(width=0.1, fill="white")+ scale_fill_brewer(palette="Blues") + theme_classic()
wgsim
Usage: wgsim [options] <in.ref.fa> <out.read1.fq> <out.read2.fq>Options: -e FLOAT base error rate [0.020] -d INT outer distance between the two ends [500] -s INT standard deviation [50] -N INT number of read pairs [1000000] -1 INT length of the first read [70] -2 INT length of the second read [70] -r FLOAT rate of mutations [0.0010] -R FLOAT fraction of indels [0.15] -X FLOAT probability an indel is extended [0.30] -S INT seed for random generator [-1] -A FLOAT disgard if the fraction of ambiguous bases higher than FLOAT [0.05] -h haplotype mode
wgsim lambda.fa -e 0 -d 500 -N 5000 -1 100 -2 100 -r 0.01 \-R 0 -X 0 -S 1234567 -h l1.read1.fq l1.read2.fq
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |