Diversity calculation

class: center, middle, inverse, title-slide

# Diversity calculation
### Jinliang Yang
### Jan. 30th, 2020

---

# Syncing your fork to the original repo

1. Open your __fork__ of the repository on Github
2. Click the __Compare__ button right next to the __Pull request__
3. Change the __base repo__ to your repository
  - You're now back to your fork but you've also asked to compare two identical repositories so GitHub thinks you care about branches not forks. 
  - Click on __compare across forks__ to get back your base fork option.
4. Change the __head repo__ to the upstream (original) repository __jyanglab/agro932-lab__
5. You will see a list of commits
  - These are the commits made by `yangjl`
6. Click __Create pull request__
  - Note that this pull request is to you!

---

# Unix Commands

- `cd`: change the working directory
- `mkdir`: make directories
- `pwd`: print name of current working directory
- `ls`: list directory contents
- `chmod`: change the access permissions to files and directories
- `head`: output the first part of files
- `tail`: output the last part of files
- `more` and `less`: display contents of large files page by page or scroll line by line up and down
- `wc`: print line, word, and byte counts for each each file
- `grep`: print lines matching a pattern
- `|`: pipe, i.e., `ls -la | head -n 100 >> new_file`

---

# Unix Commands: how to get help

If you don’t know how to use a command i.e. you don’t know about its parameters and return type etc, then you can make use of `–-help` command.

```bash
less --help
```

### How to use help command

Simply type `help`, which is used for listing all possible commands that are pre-installed.

```bash
help
help cd
```

---

# Unix Commands: how to get help

### The `man` command or man pages

Man is used when you want to get a detailed manual of a command.

```bash
man mkdir
man bash
```

---

# Slurm Commands

## Getting to know slurm

[Slurm](https://slurm.schedmd.com/overview.html) is job managing system: you submit jobs via batch scripts. These batch scripts have common headers; we will see one below.

```bash
sinfo # view information about Slurm nodes and partitions.
sinfo --help
sinfo | head -n 10
```

- Here we see our `PARTITION` and `TIME LIMIT`, and all of their inferior but still useful friends.

- Note that there is a column of `STATE`, which indicates the state of the machine.

---

# Slurm Commands

## Getting to know slurm

A better way of looking at what’s going on on each machine is with `squeue`, which is the job queue.

```bash
squeue --help
# view information about jobs located in the Slurm scheduling queue.
squeue 
squeue | wc -l # how many jobs are running
squeue | grep "jyanglab"
```

- This shows each __job ID__ (very important), partition the job is running on, name of person running the job. 
- Also note `TIME` which is how long a job has been running.

We can use `squeue` to find the job ID, allowing us to cancel a job with `scancel`. Let’s kill jyang21’s job:

```bash
scancel JOBID
```

---

# Warnings!!!

__Do not run anything on the headnode__ except cluster management tools (squeue, sbatch, etc), compilation tasks (but usually ask CSE help for big apps), or downloading files. 
- If you run anything on the headnode, you will disgrace your lab.

Instead, you can get a `bash` terminal using `srun`:

```bash
srun --nodes=1 --mem 4G --ntasks=4 --licenses=common --time=8:00:00 --pty bash
```

__Monitor your disk space__, as it can fill up quickly.

- You will see the disk usage every time you login

```bash
hcc-du
```

---

# Simulate NGS data

### Install a software on crane

```bash
# https://github.com/lh3/wgsim
git clone https://github.com/lh3/wgsim.git

# compilation
gcc -g -O2 -Wall -o wgsim wgsim.c -lz -lm
```

### Put the software in your searching path

```bash
cd ~
vi .bash_profile
```

Then copy the following to your `.bash_profile`

```bash
PATH=$PATH:~/bin/wgsim/
```

---

# Reference genome

## EnsemblPlants

- Bread Wheat: [Triticum aestivum](https://plants.ensembl.org/Triticum_aestivum/Info/Index)
- Common bean: [Phaseolus vulgaris](https://plants.ensembl.org/Phaseolus_vulgaris/Info/Index)
- Domesticated sunflower: [Helianthus annuus](https://plants.ensembl.org/Helianthus_annuus/Info/Index)
- Maize: [Zea mays](https://plants.ensembl.org/Zea_mays/Info/Index?db=core)
- Soybean: [Glycine max](http://plants.ensembl.org/Glycine_max/Info/Index)

## Important info
- Version
- Gene annotation: GFF3

---

# NGS data simulation using `wgsim`

```bash
wgsim lambda.fa -N 5000 -1 100 -2 100 -r 0.01 -e 0 \
-R 0 -X 0 -S 1234567 l1.read1.fq l1.read2.fq
```

- Reference: 50k lambda genome downloaded from [NCBI](https://www.ncbi.nlm.nih.gov/nuccore/215104#feature_J02459.1)
  - `lambda.fa`
- 20x coverage
  - `N 5000`
- PE 100bp
  - `-1 100 -2 100`
- Only SNP no Indel
  - `-R 0 -X 0`
- Mutation rate is low 
  - `-r 0.01`

---

# NGS data simulation using `wgsim`

## simulate 5 individals

```bash
for i in {1..5}
do
   wgsim lambda.fa -N 5000 -1 100 -2 100 -r 0.01 -e 0 -R 0 -X 0 l$i.read1.fq l$i.read2.fq
done

# check how many reads
wc -l l1.read1.fq
```

---

# A procedure to calculate `$\theta$` values

### 1. Align the NGS reads to the reference genome

```bash
module load bwa samtools
# index the reference genome
bwa index lambda.fa
# using bwa mem to align the reads to the reference genome 
# => samtools to convert into bam file
bwa mem lambda.fa l1.read1.fq l1.read2.fq | samtools view -bSh - > l1.bam
```

#### Do alignment for 5 individuals using bash loop:

```bash
# alignment
for i in {1..5}; do bwa mem lambda.fa l$i.read1.fq l$i.read2.fq | samtools view -bSh - > l$i.bam; done
# sort
for i in *.bam; do samtools sort $i -o sorted_$i; done
# index them
for i in sorted*.bam; do samtools index $i; done
```

---

# A procedure to calculate `$\theta$` values

### 2. Calculate SFS using `ANGSD`

#### Install ANGSD first

```bash
cd ~/bin/ # if you don't have one, do `mkdir bin`
git clone https://github.com/samtools/htslib.git
git clone https://github.com/ANGSD/angsd.git 
cd htslib; make; 
cd ../angsd;
make HTSSRC=../htslib
```

#### run angsd

```bash
# write the bam files to a txt file
mkdir bam_files
mv sorted*.bam bam_files
cd bam_files
ls sorted*.bam > bam.txt

angsd -bam bam.txt -doSaf 1 -anc ../lambda.fa -GL 1  -out out 
realSFS out.saf.idx > out.sfs
## cp sfs to the cache/ folder
cp out.sfs ../../../cache/
```

---

# A procedure to calculate `$\theta$` values

### 3. Calculate the thetas for each site

The output from the above command are two files out.thetas.gz and out.thetas.idx. A formal description of these files can be found in the doc/formats.pdf in the angsd package. It is possible to extract the logscale persite thetas using the ./thetaStat print program.

```bash
angsd -bam bam.txt -out out -doThetas 1 -doSaf 1 -pest out.sfs -anc ../lambda.fa -GL 1
thetaStat print out.thetas.idx > theta.txt
## cp theta to the cache/ folder
cp theta.txt ../../../cache/
```

---

# Visualize the results

In local computer, using `R`:

```r
s <- scan('../../cache/out.sfs')
s <- s[-c(1,length(s))]
s <- s/sum(s)
barplot(s,names=1:length(s), main='SFS')
```

### Show the histgram distribution of the theta values

```r
hist(t$Pairwise)
```