Theta calculation

class: center, middle, inverse, title-slide

# Theta calculation
### Jinliang Yang
### Feb. 3rd, 2022

---

# Syncing a fork (from the web UI)

1. Click the __Fork__ button for the Git Repo `https://github.com/jyanglab/2022-agro932-lab`
2. And then clone to your own system `git clone git@github.com:YOURID/2022-agro932-lab.git`

### If you have __Forked__ it before:

1. On GitHub, navigate to the main page of the forked repository that you want to sync with the upstream repository.
2. Select the __Fetch upstream__ drop-down.
3. Review the details about the commits from the upstream repository, then click __Fetch and merge__.

---

# Wright-Fisher Simulation

- Consider a single locus with two alleles, `$A_1$` and `$A_2$`

- In generation `$t$` there are `$x$` individuals carrying allele `$A_1$`, which is at frequency `$p_t = x/2N$`.

- This implies that there are `$2N - x$` individuals carrying allele `$A_2$`, which is at frequency `$q_t = 1- p_t= 1- x/2N$`.

- The sampling of alleles for the next generations is equivalent to sampling from a binomial distribution with parameters size = `$2N$` and prob= `$x/2N$`.

- Therefore, the mean and variance of `$p$` in the next generation for the Wright-Fisher model are:

`\begin{align*}
E(p_{t+1}) &= p_t \\
Var(p_{t+1}) &= p_tq_t/2N 
\end{align*}`

---

# Wright-Fisher Simulation

### Create [R function](https://www.statmethods.net/)

```r
wright_fisher <- function(N=1000, A1=100, t=1000){
  p <- A1/(2*N)
  ### make a numeric vector to hold the results
  freq <- as.numeric();
  t=1000
  ### Use for loop to run over t generations
  for (i in 1:t){
    A1 <- rbinom(1, 2*N, p)
    p <- A1/(2*N)
    freq[i] <- p
  }
  return(freq)
}
```

---

# Wright-Fisher Simulation

Then apply the `wright_fisher` R function to conduct __1,000__ simulations using `crane`.

```r
set.seed(1234)
frq <- wright_fisher(N=1000, A1=100, t=1000)

pdf("graphs/sim1000.pdf", width=8, height=8)
plot(frq, type="l", ylim=c(0,1), col=3, xlab="Generations", ylab=expression(p(A[1])))
    for(u in 1:1000){
      frq <- wright_fisher(N=1000, A1=100, t=1000)
      random <- sample(1:1000,1,replace=F)
      randomcolor <- colors()[random] 
      lines(frq,type="l",col=(randomcolor))
    }
dev.off()
```

Conduct version control and then get the results

```bash
git add --all
git commit -m "sim 1000 fig"
git push
```

---

# Simulate NGS data

### Install a software on crane

```bash
cd $HOME
mkdir bin
# https://github.com/lh3/wgsim
git clone https://github.com/lh3/wgsim.git

# compilation
gcc -g -O2 -Wall -o wgsim wgsim.c -lz -lm
```

### Put the software in your searching path

```bash
cd $HOME
vi .bash_profile
```

Then copy the following to your `.bash_profile`

```bash
PATH=$PATH:~/bin/wgsim/
```

---

# Reference genome

## EnsemblPlants

- Bread Wheat: [Triticum aestivum](https://plants.ensembl.org/Triticum_aestivum/Info/Index)
- Common bean: [Phaseolus vulgaris](https://plants.ensembl.org/Phaseolus_vulgaris/Info/Index)
- Domesticated sunflower: [Helianthus annuus](https://plants.ensembl.org/Helianthus_annuus/Info/Index)
- Maize: [Zea mays](https://plants.ensembl.org/Zea_mays/Info/Index?db=core)
- Soybean: [Glycine max](http://plants.ensembl.org/Glycine_max/Info/Index)

## Important info
- Version
- Gene annotation: GFF3

---

# NGS data simulation using `wgsim`

```bash
wgsim lambda.fa -N 5000 -1 100 -2 100 -r 0.01 -e 0 \
-R 0 -X 0 -S 1234567 l1.R1.fq l1.R2.fq
```

- Reference: 50k lambda genome downloaded from [NCBI](https://www.ncbi.nlm.nih.gov/nuccore/215104#feature_J02459.1)
  - `lambda.fa`
- 20x coverage
  - `N 5000`
- PE 100bp
  - `-1 100 -2 100`
- Only SNP no Indel
  - `-R 0 -X 0`
- Mutation rate is low 
  - `-r 0.01`

---

# NGS data simulation using `wgsim`

## simulate 10 individals

```bash
for i in {1..10}
do
   wgsim lambda.fa -N 5000 -1 100 -2 100 -r 0.01 -e 0 -R 0 -X 0 l$i.R1.fq l$i.R2.fq
done

# check how many reads
wc -l l1.R1.fq
```

---

# A procedure to calculate `$\theta$` values

### 1. Align the NGS reads to the reference genome

```bash
module load bwa samtools bcftools
# index the reference genome
bwa index lambda.fa
# using bwa mem to align the reads to the reference genome 
# => samtools to convert into bam file
bwa mem lambda.fa l1.R1.fq l1.R2.fq | samtools view -bSh - > l1.bam
```

#### Do alignment for 10 individuals using bash loop:

```bash
# alignment
for i in {1..10}; do bwa mem lambda.fa l$i.R1.fq l$i.R2.fq | samtools view -bSh - > l$i.bam; done
# sort
for i in *.bam; do samtools sort $i -o sorted_$i; done
# index them
for i in sorted*.bam; do samtools index $i; done
```

---

# A procedure to calculate `$\theta$` values

### 2. Calling SNPs with `samtools`

```bash
### index the genome assembly
samtools faidx lambda.fa
### Run `mpileup` to generate VCF format
samtools mpileup -g -f lambda.fa sorted_l1.bam sorted_l2.bam sorted_l3.bam sorted_l4.bam sorted_l5.bam > myraw.bcf
```
All we did so far (roughly) is to perform a format conversion from BAM to VCF!

```bash
### Call SNP using bcftools `call` function
bcftools call myraw.bcf -cv -Ob -o snps.bcf
```

```bash
### Extract allele frequency at each position
bcftools query -f '%CHROM %POS %AF1\n' snps.bcf > frq.txt
```

---

# A procedure to calculate `$\theta$` values

### 3. Compute `$\theta_{\pi}$`

```r
pi <- function(n=10, p=0.1){
  return(n/(n-1)*(1-p^2-(1-p)^2))
}
pi(n=10, p=0.1)

frq <- read.table("largedata/frq.txt")
```

---

# A procedure to calculate `$\theta$` values

### 4. plot site freq spectrum (SFS)

```r
maf <- c(0.1, 0.1, 0.3, 0.1, 0.3, 0.2, 0.1, 0.4, 0.1, 0.1)
sfs <- table(maf)
barplot(sfs, col="#cdc0b0", xlab="Minor allele frequency", 
        ylab="No. of segregating sites", 
        cex.axis =1.5, cex.names = 1.5, cex.lab=1.5)
```

![](data:image/png;base64,#w3lab_files/figure-html/unnamed-chunk-15-1.png)