Selection-drift and Diversity

class: center, middle, inverse, title-slide

.title[
# Selection-drift and Diversity
]
.author[
### Jinliang Yang
]
.date[
### Feb. 12, 2024
]

---

# Selection and drift

### Selection counteracts drift

- Drift shifts allele frequency away from equilibrium

- This reduces the average fitness of the population

### Probability of fixation under selection

`\begin{align*}
Pr(fixation) = \frac{1 - e^{-2s}}{1- e^{-4N_es}}
\end{align*}`

> Moran, 1959

- If `\(2s\)` and `\(-4N_es\)` are small, numerator becomes `\(\approx 2s\)` and denominator `\(\approx 4N_es\)`

`\begin{align*}
Pr(fixation) = \frac{1 - e^{-2s}}{1- e^{-4N_es}} \approx \frac{2s}{4N_es} = \frac{1}{2N_e}
\end{align*}`

---

# Probability of fixation under selection

`\begin{align*}
Pr(fixation) = \frac{1 - e^{-2s}}{1- e^{-4N_es}} \approx \frac{1}{2N_e}
\end{align*}`

> Moran, 1959

- If `\(2N_es\)` is small, an advantageous mutation has about the same Pr(fixation) as a neutral mutation

### R function for the fixation equation

```r
pfix <- function(ne, s){
  # exp computes the exponential function
  num <- 1 - exp(-2*s) # numerator
  den <- 1 - exp(-4*ne*s)  # denominator
  return(num/den)
}
```

---

# Probability of fixation under selection

- In this case, `\(s\)` is selective advantage

```r
ne <- seq(1, 100, by=1)
plot(ne, 1/(2*ne), type="l", lwd=3, col="red", xlab="Effective population size (Ne)", ylab="Pr(fixation)")
lines(ne, pfix(ne, s=0.01), lty=2, lwd=3, col="blue")
lines(ne, pfix(ne, s=0.05), lty=2, lwd=3, col="black")
lines(ne, pfix(ne, s=0.1), lty=2, lwd=3, col="orange")
```

---

# Probability of fixation under selection

.pull-left[
<div align="center">
<img src="nes.png" width=500>
</div>
]

.pull-right[
`\begin{align*}
Pr(fixation) = \frac{1 - e^{-2s}}{1- e^{-4N_es}}
\end{align*}`

- `\(s\)`, the selection coefficient of __advantageous mutation__
]

- Favorable mutations likely be lost (and deleterious mutations can be fixed)
  - __Genetic costs of domestication__: the increase of the deleterious mutational load of domesticated species

- Mutation with a 1% advantage (heterozygous, `\(s=0.01\)`), has 2% probability of being fixed

- If 10% advantage `\(s=0.1\)`, probability becomes 20%

---

# The genetic load in elite maize

- Using __GERP (genomic evolutionary rate profiling)__ to measure deleteriousness

> [Yang et al., 2017](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1007019)

---

# Fixation for recessive homozygotes

### In a recessive model

- Normal homozygotes have same fitness as heterozygotes

- The equations don't fit

### For recessive advantageous mutation

`\begin{align*}
Pr(fixation) \approx 1.13 \sqrt{\frac{s}{2N_e}}
\end{align*}`

- `\(s\)` is the selective advantage of homozygote mutant

> Kimura, 1962

---

# A human genetic study

> [Simons et al., 2014](https://www.nature.com/articles/ng.2896)

---

# Polymorphism and Neutral Theory

Polymorphism is not always maintained by selection

### Mutation-drift equilibrium (in the absence of selection)

- Drift is reducing the number of alleles in the population

- While, mutation is increasing

- Eventually, equilibrium is reached with respect to the inbreeding coefficient ( `\(F\)` )

---

# Mutation-drift equilibrium

- Probability two alleles are homozygous

`\begin{align*}
F_t = \frac{1}{N_e} + (1 - \frac{1}{2N_e})F_{t-1}
\end{align*}`

- With mutation occurring at rate `\(\mu\)`, the probability two alleles are IBD is `\(F_t \times (1-\mu)^2\)`
  
  - `\((1 - \mu)^2\)` is the probability both alleles don't mutate to another allele

`\begin{align*}
F_t = [\frac{1}{N_e} + (1 - \frac{1}{2N_e})F_{t-1}] (1 - \mu )^2
\end{align*}`

- At equilibrium ( `\(F_e\)` ), `\(F_t = F_{t-1}\)`, through rearrangement:

`\begin{align*}
F_e = \frac{(1 - \mu )^2}{2N_e - (2N_e -1)(1 - \mu)^2}
\end{align*}`

- But `\(\mu\)` is small as compared to 1 and can be ignored, so:

`\begin{align*}
F_e \approx \frac{1}{4N_e\mu + 1}
\end{align*}`

---

# Mutation-drift equilibrium

`\begin{align*}
F_e \approx \frac{1}{4N_e\mu + 1}
\end{align*}`

- `\(F_e\)` is identify (or homozygosity)

- So, the expected heterozygosity at equilibrium is ( `\(1 - F_e\)` ) or

`\begin{align*}
1- F_e & \approx 1- \frac{1}{4N_e\mu + 1} \\
H_e  & \approx \frac{4N_e \mu}{4N_e\mu + 1}
\end{align*}`

- Only the __probability of homozygosity__ remains constant at equilibrium

- The actual alleles are constantly changing, mutating, and drifting

---

# Mutation-drift equilibrium

`\begin{align*}
F_e \approx \frac{1}{4N_e\mu + 1}
\end{align*}`

- The relationship between `\(N_e\)`, `\(\mu\)`, and `\(F\)` at equilibrium.

```r
fe <- function(ne, mu){
  den <- 4*ne*mu + 1  # denominator
  return(1/den)
}
```

---

# Mutation-drift equilibrium

```r
ne <- seq(1, 50000, by=10)
plot(ne, fe(ne, mu=10^-4), type="l", lwd=3, col="red", xlab="Effective population size (Ne)", ylab="F")
lines(ne, fe(ne, mu=10^-5), lty=2, lwd=3, col="blue")
lines(ne, fe(ne, mu=10^-6), lty=2, lwd=3, col="black")
```

---

# Mutation-drift equilibrium

.pull-left[

<div align="center">
<img src="ne_mu_f.png" width=450>
</div>
`\begin{align*}
1- F_e & \approx 1- \frac{1}{4N_e\mu + 1} \\
H_e  & \approx \frac{4N_e \mu}{4N_e\mu + 1}
\end{align*}`
]

- #### When the mutation rate is low (black line, `\(10^{-6}\)`), `\(F\)` is high even at very large Ne

- Mutations happen very infrequently and do not contribute much to the heterozygosity
  
  - Mutated alleles are easily lost through drift

- #### When the mutation rate is relative high (__red line__, `\(10^{-4}\)`)

- Mutation counteracts drift through introducing new alleles at a high rate

---

# From Heterozygosity to Diversity

### Expected Heterozygosity ( `\(H_{exp}\)` )

- Probability that two randomly chosen copies of a locus are different alleles

- `\(H_{exp}\)` is often reported to __compare populations__

- Predict outlook for population or infer ancestral events
  
  - For example, high `\(H_{exp}\)` = maybe in HWE, large size, no obvious drift/inbreeding

`\(H_{exp}\)` = 1 - (avg expected __homozygosity__ over all loci)

`\begin{align*}
H_{exp} = 1 - \frac{1}{m}\sum_{k=l}^{m} \sum_{i=l}^{k} p_{ki}^2
\end{align*}`

- `\(m\)` is the number of loci
  - `\(k\)` is the number of alleles at a particular locus
  - `\(p_{ki}\)` is the frq of `\(i^{th}\)` allele at `\(k^{th}\)` locus

---

# Expected Heterozygosity

| Loci | Allele   | Frq   | 
| :-------: | : ------ : | :-------: | 
| locus A   | `\(A_1\)`    | `\(0.2\)`    | 
|           | `\(A_2\)`    | `\(0.8\)`    |
| locus B   |  `\(B_1\)`        |  `\(0.3\)`        |  
|           |  `\(B_2\)`        |  `\(0.7\)`        |

`\begin{align*}
H_{exp} = 1 - \frac{1}{m}\sum_{k=l}^{m} \sum_{i=l}^{k} p_{ki}^2
\end{align*}`

- `\(m=2\)` is the number of loci
  - `\(k=2\)` is the number of alleles at a particular locus
  - `\(p_{ki}\)` is the frq of `\(i^{th}\)` allele at `\(k^{th}\)` locus

`\begin{align*}
H_{exp} & = 1 - \frac{1}{2}\sum_{k=1}^{2} \sum_{i=1}^{2} p_{ki}^2 \\
& = 1 - \frac{1}{2}(0.2^2 + 0.8^2 + 0.3^2 + 0.7^2) \\
& = 0.37
\end{align*}`

---

# Expected Heterozygosity

- If `\(H_e\)` is high, it does not mean that there are a lot of heterozygotes.

- It reflects potential diversity

- And how much __heterozygosity you would expect__ if the population were to randomly mate (no selection, migration, mutation, etc.).

# Diversity - allelic richness

- Count of how many alleles are in a sample

- Informative when comparing populations

- A population with fewer alleles may be prone to inbreeding
  
  - A locus with fewer alleles compared to others may indicate it is selected for/against (i.e., minor alleles tends to be deleterious alleles)

---

# Nucleotide Diversity ( `\(\pi\)` )

- Average of nucleotide difference between any two copies of a locus

- When you look at a nucleotide position of two random copies of a gene, diversity  ( `\(\pi\)` ) is the __probability they are different__

`\begin{align*}
\Pi = \frac{n}{n-1} \sum_{ij} x_i x_j \pi_{ij}
\end{align*}`

- `\(x_i\)` and `\(x_j\)` is the freq of the `\(i^{th}\)` and `\(j^{th}\)` haplotype
- `\(\pi_{ij}\)` is the proportion of nucleotide differences between `\(i^{th}\)` and `\(j^{th}\)` haplotype

---

# An example of Nucleotide Diversity

Here, we sequenced a population of 5 maize inbred lines for 500 bp and identified SNP variants as below.

--------------

- Total number of bp: `\(N_t = 500\)`
- Number of polymorphic: `\(N_p = 5\)`

- Proportion of polymorphic site: `\(P_n = \frac{N_p}{N_t} = \frac{5}{500} = 0.01\)`

---

# An example of Nucleotide Diversity

--------------

`\begin{align*}
\Pi = \frac{n}{n-1} \sum_{ij} x_i x_j \pi_{ij}
\end{align*}`

- `\(x_i\)` and `\(x_j\)` is the freq of the `\(i^{th}\)` and `\(j^{th}\)` haplotype
- `\(\pi_{ij}\)` is the proportion of nucleotide differences between `\(i^{th}\)` and `\(j^{th}\)` haplotype

---

# An example of Nucleotide Diversity

--------------

`\begin{align*}
\Pi = \frac{n}{n-1} \sum_{ij} x_i x_j \pi_{ij}
\end{align*}`

- `\(x_i\)` and `\(x_j\)` is the freq of the `\(i^{th}\)` and `\(j^{th}\)` haplotype
- `\(\pi_{ij}\)` is the proportion of nucleotide differences between `\(i^{th}\)` and `\(j^{th}\)` haplotype

- `\(\pi_{12} =3/500\)`, `\(\pi_{13} =3/500\)`, `\(\pi_{14} =3/500\)`, `\(\pi_{14} =2/500\)`
- `\(\pi_{23} =3/500\)`, `\(\pi_{24} =2/500\)`, `\(\pi_{14} =3/500\)`
- ...

---

# An example of Nucleotide Diversity

--------------

```r
df <- data.frame(hap=paste0("line", 1:5), snp1=c(0,0,1,1,1), 
                 snp2=c(1,1,0,1,1), snp3=c(0,1,1,1,0), 
                 snp4=c(1,0,1,0,1), snp5=c(0,1,1,0,1))
snp <- t(df[,-1])
getpi <- function(snp, totbp=500){
  nhap = ncol(snp) # number of hap
  td <- c()
  for(i in 1:(nhap -1)){
    for(j in (i+1):nhap){
      cn <- sum(abs(snp[,i] - snp[,j])) # count the nucleotide differences
      td <- c(td, cn)
    }
  }
  return(nhap/(nhap -1)*(1/nhap)^2*sum(td)/totbp)
}
```

---

# An example of Nucleotide Diversity

```r
getpi(snp, totbp=500)
```

```
## [1] 0.0028
```

Heterozygosity at this SNP site is:

`\begin{align*}
h = \frac{n}{n-1}(1- p^2 - q^2)
\end{align*}`

- `\(n\)` is the number of haplotypes in a population

then the sum of site heterozygosities is:

`\begin{align*}
\sum_{i=1}^S h_i
\end{align*}`

- Here, `\(S\)` is the number of polymorphic sites
- `\(h_i\)` is the heterozygosity at the `\(i\)`th site

---

# An example of Nucleotide Diversity

--------------

```r
geth <- function(snp, totbp=500){
  nhap = ncol(snp) # number of hap
  q <- apply(snp, 1, sum)/nhap
  p <- 1 -q
  return(nhap/(nhap -1) * sum(rep(1, nhap) - p^2 - q^2)/totbp/2)
}
geth(snp, totbp=500)
```

```
## [1] 0.0028
```
---
# An example of Nucleotide Diversity

# Heterozygosity and Nucleotide Diversity

This estimate of heterozygosity gives us the average number of pairwise necleotide difference.

`\begin{align*}
\Pi = \sum_{i=1}^S h_i
\end{align*}`