Compute Fst

class: center, middle, inverse, title-slide

# Compute Fst
### Jinliang Yang
### Feb. 10th, 2022

---

# Syncing a fork (from the web UI)

1. Click the __Fork__ button for the Git Repo `https://github.com/jyanglab/2022-agro932-lab`
2. And then clone to your own system `git clone git@github.com:YOURID/2022-agro932-lab.git`

### If you have __Forked__ it before:

1. On GitHub, navigate to the main page of the forked repository that you want to sync with the upstream repository.
2. Select the __Fetch upstream__ drop-down.
3. Review the details about the commits from the upstream repository, then click __Fetch and merge__.
4. [How to resolve a merge conflict](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/addressing-merge-conflicts/resolving-a-merge-conflict-on-github)
  
---

# NGS data simulation using `wgsim`

```
Usage:   wgsim [options] <in.ref.fa> <out.read1.fq> <out.read2.fq>

Options: -e FLOAT      base error rate [0.020]
         -d INT        outer distance between the two ends [500]
         -s INT        standard deviation [50]
         -N INT        number of read pairs [1000000]
         -1 INT        length of the first read [70]
         -2 INT        length of the second read [70]
         -r FLOAT      rate of mutations [0.0010]
         -R FLOAT      fraction of indels [0.15]
         -X FLOAT      probability an indel is extended [0.30]
         -S INT        seed for random generator [-1]
         -A FLOAT      disgard if the fraction of ambiguous 
                       bases higher than FLOAT [0.05]
         -h            haplotype mode
```

#### Type in the following command:

```bash
wgsim lambda.fa -e 0 -d 500 -N 5000 -1 100 -2 100 -r 0.01  \
-R 0 -X 0 -S 1234567 -h l1.read1.fq l1.read2.fq
```

---

# Download Reference from EnsemblPlants

Maize [reference genome](https://plants.ensembl.org/Zea_mays/Info/Index)

#### Change to `largedata\lab4` folder:

```bash
cd largedata
mkdir lab4
cd lab4
```

#### Then use `wget` to download the reference genome:

```bash
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-46/fasta/zea_mays/dna/Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa.gz

### then unzip it
gunzip Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa.gz
```

---

#### Type in the following command:

```bash
wgsim Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa \
-e 0 -d 500 -N 5000 -1 100 -2 100 -r 0.01  \
-R 0 -X 0 -S 1234567 l1.read1.fq l1.read2.fq
```

- Reference (about 500k)
  - `Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa`
- 20x coverage
  - `N 5000`
- PE 100bp
  - `-1 100 -2 100`
- Only SNP no Indel
  - `-R 0 -X 0`
- Simulate Mutations 
  - `-r 0.01`

---

# NGS data simulation using `wgsim`

## simulate 20 individals

```bash
for i in {1..20}
do
   wgsim Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa -e 0 -d 500 -N 50000 -1 100 -2 100 -r 0.1  -R 0 -X 0 l$i.read1.fq l$i.read2.fq
done
```

#### check how many reads

```bash
wc -l l1.read1.fq 
# suppose to be 200,000 lines = 50,000 reads
```

---

# A procedure to calculate `$\theta$` and `$F_{ST}$` values

### 1. Align the NGS reads to the reference genome
  - [bwa](https://github.com/lh3/bwa)
  - [samtools](https://github.com/samtools/samtools)

### 2. Obtain the SNP calls 
  - [bcftools](https://samtools.github.io/bcftools/bcftools.html)

### 3. Calculate the Fst value for each site and visualize the results
  - `R`

---
# A procedure to calculate `$\theta$` and `$F_{ST}$` values

### 1. Align the NGS reads to the reference genome

```bash
module load bwa samtools bcftools
# index the reference genome
bwa index Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa
```

#### Do alignment for 10 individuals using bash loop:

```bash
# using bwa mem to align the reads to the reference genome 
for i in {1..20}; do bwa mem Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa l$i.read1.fq l$i.read2.fq | samtools view -bSh - > l$i.bam; done
# sort
for i in *.bam; do samtools sort $i -o sorted_$i; done
# index them
for i in sorted*.bam; do samtools index $i; done
```

#### Check mapping statistics

```bash
samtools flagstat sorted_l1.bam
```

---

# Submit a Slurm job

- We wrap our jobs in little batch scripts, which is nice because these also help make steps reproducible.

- To keep your directory organized, I usually keep a scripts directory (i.e., `slurm-script/` ) and log dir (i.e., `slurm-log` )  for Slurm’s logs.
  - Tip: use these logs, as these are very helpful in debugging. I separate them from my project because they fill up directories rather quickly.

Let’s look at an example __slurm script header__ for a job called `theta` (which is run with script `theta.sh`).

```bash
#!/bin/bash -l
#SBATCH -D ~projects/your-cool-project/
#SBATCH -o ~/your-cool-project/slurm-log/steve-stdout-%j.txt
#SBATCH -e ~/your-cool-project/slurm-log/steve-stderr-%j.txt
#SBATCH -J steve
#SBATCH -t 24:00:00
set -e
set -u

# insert your script here
```

---

## An Example Slurm Batch Script Header

```bash
#!/bin/bash -l
#SBATCH -D ~/projects/your-cool-project/
#SBATCH -o ~/your-cool-project/slurm-log/steve-stdout-%j.txt
#SBATCH -e ~/your-cool-project/slurm-log/steve-stderr-%j.txt
#SBATCH -J theta
#SBATCH -t 24:00:00
#SBATCH --mail-user=your_email_address@gmail.com
#SBATCH --mail-type=END #email if ends
#SBATCH --mail-type=FAIL #email if fails
set -e
set -u

# insert your script here
```

- `D` sets your project directory.
- `o` sets where standard output (of your batch script) goes.
- `e` sets where standard error (of your batch script) goes.
- `J` sets the job name.
- `t` sets the time limit for the job, 24:00:00 indicates 24 hours.
- `--mail`: will email you if the job is "END" or "FAIL"

---

## An Example Slurm Batch Script Header

```bash
cd slurm-script
vi my_theta.sh
i # insert text
:sq # quit vi editor
```

- Copy the above header to a `.sh` file and make appropriate modifications

- Insert the following:

```bash
# module load bwa samtools
# cd largedata/lab4/

# alignment
for i in {1..20}; do bwa mem Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa l$i.read1.fq l$i.read2.fq | samtools view -bSh - > l$i.bam; done
# sort
for i in *.bam; do samtools sort $i -o sorted_$i; done
# index them
for i in sorted*.bam; do samtools index $i; done
```

- submit the job via `sbatch`:

```bash
sbatch --licenses=common --ntasks=2 --mem=10G slurm-script/my_theta.sh

## check your job status
squeue | grep "YOUR USER ID"
```

---

# A procedure to calculate `$\theta$` values

### 2. Obtain the SNP calls  with `samtools`

```bash
### index the genome assembly
samtools faidx Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa
### Run `mpileup` to generate VCF format
ls sorted_l*bam > bamlist.txt
samtools mpileup -g -f Zea_mays.B73_RefGen_v4.dna.chromosome.Mt.fa -b bamlist.txt > myraw.bcf
bcftools call myraw.bcf -cv -Ob -o snps.bcf
```

#### exact SNP information

```bash
### Extract allele frequency at each position
bcftools query -f '%CHROM %POS %AF1\n' snps.bcf > frq.txt

bcftools query -f '%CHROM %POS %REF %ALT [\t%GT]\n' snps.bcf > geno.txt
```

- Print chromosome, position, ref allele and the first alternate allele
- %TGT: Translated genotype (e.g. C/A)
- %TAG{INT}: Curly brackets to print a subfield (e.g. INFO/TAG{1}, the indexes are 0-based)

---
# A procedure to calculate `$\theta$` values

### 3. Calculate the Fst value for each site and visualize the results

```r
geno <- read.table("largedata/geno.txt", header=FALSE)
names(geno) <- c("chr", "pos", "ref", "alt", "l1", "l2", "l3", "l4", "l5")

for(i in 5:9){
  # replace slash and everything after it as nothing
  geno$newcol <- gsub("/.*", "", geno[,i] )
  # extract the line name
  nm <- names(geno)[i]
  # assign name for this allele
  names(geno)[ncol(geno)] <- paste0(nm, sep="_a1")
  
  geno$newcol <- gsub(".*/", "", geno[,i] )
  names(geno)[ncol(geno)] <- paste0(nm, sep="_a2")
}
```

---

# A procedure to calculate `$\theta$` values

### 3. Calculate the Fst value for each site and visualize the results

#### Compute p1, p2, p

```r
geno$p <- apply(geno[, 10:19], 1, function(x) {sum(as.numeric(as.character(x)))})
geno$p <- geno$p/10

geno$p1 <- apply(geno[, 10:15], 1, function(x) {sum(as.numeric(as.character(x)))})
geno$p1 <- geno$p1/6

geno$p2 <- apply(geno[, 16:19], 1, function(x) {sum(as.numeric(as.character(x)))})
geno$p2 <- geno$p2/4
```

Then finally,

```r
geno$fst <- with(geno, ((p1-p)^2 + (p2-p)^2)/(2*p*(1-p)) )
```

Output the Fst results

```r
write.table(geno, "cache/fst.csv", sep=",", row.names = FALSE, quote=FALSE)
```

---
# A procedure to calculate `$\theta$` values

### 3. Calculate the Fst value for each site and visualize the results

#### Visualize the results on my local computer

```r
fst <- read.csv("cache/fst.csv")

plot(fst$pos, fst$fst, xlab="Physical position", ylab="Fst value", main="")
```