class: center, middle, inverse, title-slide # Theta calculation ### Jinliang Yang ### Feb. 3rd, 2022 --- # Syncing a fork (from the web UI) 1. Click the __Fork__ button for the Git Repo `https://github.com/jyanglab/2022-agro932-lab` 2. And then clone to your own system `git clone git@github.com:YOURID/2022-agro932-lab.git` -- ### If you have __Forked__ it before: 1. On GitHub, navigate to the main page of the forked repository that you want to sync with the upstream repository. 2. Select the __Fetch upstream__ drop-down. 3. Review the details about the commits from the upstream repository, then click __Fetch and merge__. --- # Wright-Fisher Simulation - Consider a single locus with two alleles, `\(A_1\)` and `\(A_2\)` - In generation `\(t\)` there are `\(x\)` individuals carrying allele `\(A_1\)`, which is at frequency `\(p_t = x/2N\)`. - This implies that there are `\(2N - x\)` individuals carrying allele `\(A_2\)`, which is at frequency `\(q_t = 1- p_t= 1- x/2N\)`. - The sampling of alleles for the next generations is equivalent to sampling from a binomial distribution with parameters size = `\(2N\)` and prob= `\(x/2N\)`. - Therefore, the mean and variance of `\(p\)` in the next generation for the Wright-Fisher model are: `\begin{align*} E(p_{t+1}) &= p_t \\ Var(p_{t+1}) &= p_tq_t/2N \end{align*}` --- # Wright-Fisher Simulation ### Create [R function](https://www.statmethods.net/) ```r wright_fisher <- function(N=1000, A1=100, t=1000){ p <- A1/(2*N) ### make a numeric vector to hold the results freq <- as.numeric(); t=1000 ### Use for loop to run over t generations for (i in 1:t){ A1 <- rbinom(1, 2*N, p) p <- A1/(2*N) freq[i] <- p } return(freq) } ``` --- # Wright-Fisher Simulation Then apply the `wright_fisher` R function to conduct __1,000__ simulations using `crane`. ```r set.seed(1234) frq <- wright_fisher(N=1000, A1=100, t=1000) pdf("graphs/sim1000.pdf", width=8, height=8) plot(frq, type="l", ylim=c(0,1), col=3, xlab="Generations", ylab=expression(p(A[1]))) for(u in 1:1000){ frq <- wright_fisher(N=1000, A1=100, t=1000) random <- sample(1:1000,1,replace=F) randomcolor <- colors()[random] lines(frq,type="l",col=(randomcolor)) } dev.off() ``` -- Conduct version control and then get the results ```bash git add --all git commit -m "sim 1000 fig" git push ``` --- # Simulate NGS data ### Install a software on crane ```bash cd $HOME mkdir bin # https://github.com/lh3/wgsim git clone https://github.com/lh3/wgsim.git # compilation gcc -g -O2 -Wall -o wgsim wgsim.c -lz -lm ``` ### Put the software in your searching path ```bash cd $HOME vi .bash_profile ``` Then copy the following to your `.bash_profile` ```bash PATH=$PATH:~/bin/wgsim/ ``` --- # Reference genome ## EnsemblPlants - Bread Wheat: [Triticum aestivum](https://plants.ensembl.org/Triticum_aestivum/Info/Index) - Common bean: [Phaseolus vulgaris](https://plants.ensembl.org/Phaseolus_vulgaris/Info/Index) - Domesticated sunflower: [Helianthus annuus](https://plants.ensembl.org/Helianthus_annuus/Info/Index) - Maize: [Zea mays](https://plants.ensembl.org/Zea_mays/Info/Index?db=core) - Soybean: [Glycine max](http://plants.ensembl.org/Glycine_max/Info/Index) -- ## Important info - Version - Gene annotation: GFF3 --- # NGS data simulation using `wgsim` ```bash wgsim lambda.fa -N 5000 -1 100 -2 100 -r 0.01 -e 0 \ -R 0 -X 0 -S 1234567 l1.R1.fq l1.R2.fq ``` - Reference: 50k lambda genome downloaded from [NCBI](https://www.ncbi.nlm.nih.gov/nuccore/215104#feature_J02459.1) - `lambda.fa` - 20x coverage - `N 5000` - PE 100bp - `-1 100 -2 100` - Only SNP no Indel - `-R 0 -X 0` - Mutation rate is low - `-r 0.01` --- # NGS data simulation using `wgsim` ## simulate 10 individals ```bash for i in {1..10} do wgsim lambda.fa -N 5000 -1 100 -2 100 -r 0.01 -e 0 -R 0 -X 0 l$i.R1.fq l$i.R2.fq done # check how many reads wc -l l1.R1.fq ``` --- # A procedure to calculate `\(\theta\)` values ### 1. Align the NGS reads to the reference genome ```bash module load bwa samtools bcftools # index the reference genome bwa index lambda.fa # using bwa mem to align the reads to the reference genome # => samtools to convert into bam file bwa mem lambda.fa l1.R1.fq l1.R2.fq | samtools view -bSh - > l1.bam ``` #### Do alignment for 10 individuals using bash loop: ```bash # alignment for i in {1..10}; do bwa mem lambda.fa l$i.R1.fq l$i.R2.fq | samtools view -bSh - > l$i.bam; done # sort for i in *.bam; do samtools sort $i -o sorted_$i; done # index them for i in sorted*.bam; do samtools index $i; done ``` --- # A procedure to calculate `\(\theta\)` values ### 2. Calling SNPs with `samtools` ```bash ### index the genome assembly samtools faidx lambda.fa ### Run `mpileup` to generate VCF format samtools mpileup -g -f lambda.fa sorted_l1.bam sorted_l2.bam sorted_l3.bam sorted_l4.bam sorted_l5.bam > myraw.bcf ``` All we did so far (roughly) is to perform a format conversion from BAM to VCF! ```bash ### Call SNP using bcftools `call` function bcftools call myraw.bcf -cv -Ob -o snps.bcf ``` ```bash ### Extract allele frequency at each position bcftools query -f '%CHROM %POS %AF1\n' snps.bcf > frq.txt ``` --- # A procedure to calculate `\(\theta\)` values ### 3. Compute `\(\theta_{\pi}\)` ```r pi <- function(n=10, p=0.1){ return(n/(n-1)*(1-p^2-(1-p)^2)) } pi(n=10, p=0.1) frq <- read.table("largedata/frq.txt") ``` --- # A procedure to calculate `\(\theta\)` values ### 4. plot site freq spectrum (SFS) ```r maf <- c(0.1, 0.1, 0.3, 0.1, 0.3, 0.2, 0.1, 0.4, 0.1, 0.1) sfs <- table(maf) barplot(sfs, col="#cdc0b0", xlab="Minor allele frequency", ylab="No. of segregating sites", cex.axis =1.5, cex.names = 1.5, cex.lab=1.5) ``` ![](data:image/png;base64,#w3lab_files/figure-html/unnamed-chunk-15-1.png)<!-- -->