class: center, middle, inverse, title-slide # Population genomics software ### Jinliang Yang ### Mar. 3rd, 2022 --- # Population genomics - NGS and diversity measurement - `\(\theta_\pi\)`: pairwise necleotide diversity - `\(\theta_W\)`: watterson's `\(\theta\)`, using total number of segregating sites - `\(\epsilon_1 = S_1\)`: the number of derived singletons in a sample. - `\(\eta_1\)`: based on all singletons in a sample. -- - Population differentiation - `\(F_{ST}\)` -- - Scan for direct selection - `\(d_N/d_S\)` or `\(\pi_N/\pi_S\)` -- - Scan for linked selection - Tajima's D - Fu and Li's D, F, D\*, F\* --- # Population genomics - NGS and diversity measurement - `\(\theta_\pi\)`: pairwise necleotide diversity - `\(\theta_W\)`: watterson's `\(\theta\)`, using total number of segregating sites - `\(\epsilon_1 = S_1\)`: the number of derived singletons in a sample. - `\(\eta_1\)`: based on all singletons in a sample. ### [ANGSD](http://www.popgen.dk/angsd/index.php/ANGSD) ANGSD is a software for conducting population genomics analysis using next generation sequencing data. One advantages of this software is that it can handle mapped reads to imputed genotype probabilities. --- # Installation, however, is painful ### [Installation](http://www.popgen.dk/angsd/index.php/Installation) ```bash cd $HOME/bin ``` ### Install from github ```bash git clone --recursive https://github.com/samtools/htslib.git git clone https://github.com/ANGSD/angsd.git cd htslib;make;cd ../angsd ;make HTSSRC=../htslib ``` ### Install by directly downloading ```bash wget http://popgen.dk/software/download/angsd/angsd0.936.tar.gz tar xf angsd0.936.tar.gz cd htslib;make;cd .. cd angsd make HTSSRC=../htslib cd .. ``` --- # Instead, use `Conda` `Conda` is a package and environment manager! - by far the __easiest way to handle installing__ most of the tools we want to use in bioinformatics. It has been installed on the HCC. To learn more about conda, read this [introduction](https://astrobiomike.github.io/unix/conda-intro). --- # Making a new environment The simplest way we can create a new conda environment is like so: ```bash conda create -n mypopgen ``` #### Entering an environment ```bash conda activate mypopgen ``` #### Installing packages The first thing I usually do is just search in a web-browser for `conda install` plus whatever program I am looking for. ```bash conda install -c bioconda angsd ``` #### Exiting an environment ```bash conda deactivate ``` --- # NGS and diversity measurement #### Use our simulated data from `lab5` ```bash cd largedata; mkdir lab7 cp /common/jyanglab/jyang21/courses/2022-agro932-lab/largedata/lab5/bamlist.txt lab7 cp /common/jyanglab/jyang21/courses/2022-agro932-lab/largedata/lab5/sorted_l* lab7 cp /common/jyanglab/jyang21/courses/2022-agro932-lab/largedata/lab5/Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa* lab7 ``` -------- #### Activate my conda environment ```bash conda activate mypopgen angsd ``` ```bash module load angsd angsd ``` request a computer node ```bash srun --qos=short --nodes=1 --licenses=common --ntasks=4 --mem 8G --time 2:00:00 --pty bash ``` --- # ANGSD for diversity meansurement ```bash cd lab7 module load samtools samtools faidx Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa # run ANGSD to calculated folded SFS angsd -bam bamlist.txt -out output -doMajorMinor 1 -doMaf 1 -doSaf 2 -uniqueOnly 0 -anc Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa -minMapQ 30 -minQ 20 -nInd 20 -baq 1 -ref Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa -GL 1 # use realSFS to calculate sfs realSFS output.saf.idx -fold 1 > output.sfs # try this version /common/jyanglab/gxu6/software/angsd/misc/realSFS output.saf.idx -fold 1 > output.sfs ``` #### Copy the result to `cache/` folder ```bash ## cp sfs to the cache/ folder cp output.sfs ../../cache/ ``` --- # Calculate the thetas #### For each site ```bash /common/jyanglab/gxu6/software/angsd/misc/realSFS saf2theta output.saf.idx -sfs output.sfs -outname output ``` The output from the above command are two files `output.thetas.gz` and `output.thetas.idx`. - A formal description of these files can be found in the `doc/formats.pdf` in the angsd package. #### For stepping window ```bash /common/jyanglab/gxu6/software/angsd/misc/thetaStat do_stat output.thetas.idx -win 5000 -step 1000 -outnames output.theta.5k.gz # Copy the result to `cache/` folder cp output.theta.5k.gz.pestPG ../../cache/ ``` ```bash git add --all git commit -m "theta values" git push ``` --- # Visualize the results In the local computer, using `R`: #### Barplot for SFS ```r s <- scan('cache/output.sfs') s <- s[-c(1,length(s))] s <- s/sum(s) barplot(s,names=1:length(s), main='SFS') ``` #### Histgram distribution of the theta values ```r theta <- read.table("cache/output.theta.5k.gz.pestPG", header=TRUE) hist(theta$tW, xlab="theta_w", main="disverity") ``` #### Scatter plot of the Tajima's D values ```r plot(theta$WinCenter, theta$Tajima, xlab="Physical position", ylab="Tajima's D", col="#5f9ea0", pch=16) ``` --- # Fst using `vcftools` ### input data: - variant call format (or the VCF/BCF file) - need to determine the populations ```bash module load bcftools # you must be in your lab7/ folder cp /common/jyanglab/jyang21/courses/2022-agro932-lab/largedata/lab5/snps.bcf . bcftools view snps.bcf | head -n 40 ``` sorted_l10.bam sorted_l11.bam sorted_l12.bam sorted_l13.bam sorted_l14.bam sorted_l15.bam sorted_l16.bam sorted_l17.bam sorted_l18.bam sorted_l19.bam sorted_l1.bam sorted_l20.bam sorted_l2.bam sorted_l3.bam sorted_l4.bam sorted_l5.bam sorted_l6.bam sorted_l7.bam sorted_l8.bam sorted_l9.bam ```bash for ((i=1;i<=10;i++)) ; do echo "sorted_l$i.bam" >> pop1.txt; done for ((i=11;i<=20;i++)) ; do echo "sorted_l$i.bam" >> pop2.txt; done ``` --- # Fst using `vcftools` ### Window based Fst ```bash module load vcftools vcftools --bcf snps.bcf --weir-fst-pop pop1.txt --weir-fst-pop pop2.txt --fst-window-size 10000 --fst-window-step 1000 --out win_1k ``` ### Store the Weir Fst results ```bash ## cp Fst to the cache/ folder cp win_1k.windowed.weir.fst ../../cache/ ``` --- # XP-CLR approach for selection scan ### input data: - variant call format (VCF file only) - need to determine the populations ```bash module load xpclr/1.1 module load bcftools bcftools convert snps.bcf -O v -o snp.vcf xpclr --input snp.vcf --out ./xpclr_res.txt --format vcf --samplesA pop1.txt --samplesB pop2.txt --chr Mt --start 1 --stop 20000 --ld 0.7 --maxsnps 200 --minsnps 200 --size 10000 --step 5000 ``` --- # XP-CLR approach for selection scan ### using slurm script ```bash cd ../../ pwd ``` `/common/jyanglab/jyang21/courses/2022-agro932-lab` ```bash #!/bin/bash -l #SBATCH -D /common/jyanglab/jyang21/courses/2022-agro932-lab #SBATCH -o /common/jyanglab/jyang21/courses/2022-agro932-lab/slurm-log/xpclr-stdout-%j.txt #SBATCH -e /common/jyanglab/jyang21/courses/2022-agro932-lab/slurm-log/xpclr-stderr-%j.txt #SBATCH -J xpclr #SBATCH -t 10:00:00 #SBATCH --mail-user=your_email_address@gmail.com #SBATCH --mail-type=END #email if ends #SBATCH --mail-type=FAIL #email if fails set -e set -u ### your script here: module load xpclr/1.1 xpclr --input snp.vcf --out ./xpclr_res.txt --format vcf --samplesA pop1.txt --samplesB pop2.txt --chr Mt --start 1 --stop 200000 --ld 0.7 --maxsnps 200 --minsnps 200 --size 10000 --step 5000 ``` ```bash vi slurm-script/xpclr.sh i # copy and paste the above code ``` --- # XP-CLR approach for selection scan ### using slurm script ```bash sbatch --licenses=common --ntasks=2 --mem=10G slurm-script/my_theta.sh ## check your job status squeue | grep "YOUR USER ID" ``` ### Store the XP-CLR results ```bash cd largedata/lab7 cp xpclr_res.txt ../../cache/ ``` Type git command to version control your results ```bash git add --all git commit -m "Fst and XP-CLR results" git push ```