Measures of sequence diversity

class: center, middle, inverse, title-slide

.title[
# Measures of sequence diversity
]
.author[
### Jinliang Yang
]
.date[
### Feb. 4, 2025
]

---

# Common estimators of `\(\theta\)` under the infinite sites model

For an individual SNP, one allele has sample frequency `\(p\)`, alternative allele frequency is `\(q\)`, such that `\(p +q =1\)`.

__Heterozygosity__ at this SNP site is:

`\begin{align*}
h = \frac{n}{n-1}(1 - p^2 - q^2)
\end{align*}`

- where `\(n\)` is the number of sequences in the sample.

.pull-left[
<div align="center">
<img src="p4.png" height=150>
</div>
]

.pull-right[
For the first SNP site:
`\begin{align*}
h = & \frac{n}{n-1}(1 - p^2 - q^2) \\
= & \frac{4}{4-1}(1 - (1/4)^2 - (3/4)^2) \\
= & 0.5
\end{align*}`
]

---

# Common estimators of `\(\theta\)` under the infinite sites model

For an individual SNP, one allele has sample frequency `\(p\)`, alternative allele frequency is `\(q\)`, such that `\(p +q =1\)`.

__Heterozygosity__ at this SNP site is:

`\begin{align*}
h = \frac{n}{n-1}(1 - p^2 - q^2)
\end{align*}`

- where `\(n\)` is the number of sequences in the sample.

.pull-left[
<div align="center">
<img src="p4.png" height=150>
</div>
]

.pull-right[
For the __2nd SNP site__:
`\begin{align*}
h = & \frac{n}{n-1}(1 - p^2 - q^2) \\
= & \frac{4}{4-1}(1 - (2/4)^2 - (2/4)^2) \\
= & 0.667
\end{align*}`
]

---
# Common estimators of `\(\theta\)` under the infinite sites model

Given this measure of diversity at a single site, we now consider measures of diversity across an entire sequence.

We can define the __sum of site heterozygosities__ as:

`\begin{align*}
\pi = & \sum_{j=1}^{S}h_j \\
\end{align*}`

- Where `\(S\)` is the number of segregating sites
- `\(h_j\)` is the heterozygosity at the `\(j\)`th SNP site.

Under the __infinite sites model__ for a diploid Wright-Fisher population at equilibrium,

`\begin{align*}
E(\pi) = & \theta \\
\end{align*}`

Which is why this statistic is sometimes called `\(\theta_\pi\)`.

---
# Common estimators of `\(\theta\)` under the infinite sites model

- For the 1st SNP site, `\(h_1=0.5\)`
- For the 2nd SNP site, `\(h_2=0.667\)`

`\begin{align*}
\pi = & \sum_{j=1}^{S}h_j \\
= & 0.5 + 0.667 \\
= & 1.167
\end{align*}`

We often give this value per site, in which case `\(\pi = 1.167/10=0.1167\)`

---
# Common estimators of `\(\theta\)` under the infinite sites model

`\begin{align*}
\pi = & \frac{\sum_{i < j}{k_{ij}}}{n(n-1)/2} \\
\end{align*}`

- Where `\(k_{ij}\)` is the number of nucleotide differences between the `\(i\)`th and `\(j\)`th sequences.

This measure of diversity gives us the average number of pairwise necleotide difference between any two sequences.

`\begin{align*}
P1:P2, 0; P1:P3, 2; P1:P4, 1; P2:P3, 2; P2:P4, 1; P3:P4, 1
\end{align*}`

---
# An Alternative method: Watterson' theta

In this method, we summarize SNPs using the total number of segregating sites, `\(S\)`, in the sample.

However, because larger sample sizes will result in larger values of `\(S\)`, we must adjust the statistic to be

`\begin{align*}
\theta_W = & \frac{S}{a} \\
\end{align*}`

Where `\(a\)` is,

`\begin{align*}
a=\sum_{i=1}^{n-1}\frac{1}{i}
\end{align*}`

Or, combine them together

`\begin{align*}
\theta_W = & \frac{S}{\sum_{i=1}^{n-1}\frac{1}{i}} \\
\end{align*}`

---

# An Alternative method: Watterson' theta

`\begin{align*}
\theta_W = & \frac{S}{\sum_{i=1}^{n-1}\frac{1}{i}} \\
= & 2/(1/1 + 1/2 + 1/3) \\
= & 1.09
\end{align*}`

The per site measure would be `\(1.09/10=0.109\)`, which is very similar to `\(\theta_\pi\)`

---
# The frequency spectrum of alleles

In this case, among these 10 haplotypes are 10 segregating sites, each of which can have a frequency between `\(1/n\)` and `\((n-1)/n\)`

- Note that if an allele is present on all n individuals, it is not polymorhpic.

If we donot know which allele is ancestral and which is derived, a simple way would be using the __minor allele frequency (MAF)__: the frequency of the less common allele.

Visually summarize the MAF of all segregating sites using the __allele frequency spectrum__

---
# The frequency spectrum of alleles

The allele frequency specturm is also referred to as the __site frequency spectrum (SFS)__.

.pull-left[
<div align="center">
<img src="p10.png" height=300>
</div>
]

.pull-left[

``` r
maf <- c(0.1, 0.1, 0.3, 0.1, 0.3, 
         0.2, 0.1, 0.4, 0.1, 0.1)
sfs <- table(maf)
barplot(sfs, col="#cdc0b0", xlab="Minor allele frequency", 
        ylab="No. of segregating sites", 
        cex.axis =1.5, cex.names = 1.5, cex.lab=1.5)
```

<img src="hahn_c3_theta_files/figure-html/unnamed-chunk-1-1.png" width="80%" style="display: block; margin: auto;" />
]

---
# The frequency spectrum of alleles

- The allele frequency specturm is also referred to as the __site frequency spectrum (SFS)__.
- Use the sequences of one or more closely related species, we can get the ancestral state.
- Therefore, we can describe variation at each site using the __derived allele frequency (DAF)__.

.pull-left[
<div align="center">
<img src="daf.png" height=300>
</div>
]

.pull-left[

``` r
maf <- c(0.1, 0.1, 0.3, 0.1, 0.3, 
         0.2, 0.1, 0.6, 0.9, 0.9)
sfs <- table(maf)
barplot(sfs, col="#cdc0b0", xlab="Derived allele frequency", ylab="No. of segregating sites", 
        cex.axis =1.5, cex.names = 1.5, cex.lab=1.5)
```

<img src="hahn_c3_theta_files/figure-html/unnamed-chunk-2-1.png" width="50%" style="display: block; margin: auto;" />
]

---
# The frequency spectrum of alleles

We now consider several statistics summarizing sequencing diversity that use informtion about the frequency of derived alleles, as these capture more information about our sequencing data.

Fu and Li (1993) defined a statistic, `\(\epsilon_1\)`, based on the number of __derived singletons__ in a sample.

`\begin{align*}
\epsilon_1 = S_1 \\
\end{align*}`

- Where `\(S_1\)` is the number of segregating site with derived alleles found on only one haplotype.

If we don't know the ancestral status, we can aslo define a statistic, `\(\eta_1\)`, based on all singletons in a sample

`\begin{align*}
\eta_1 = S_1^*\frac{n-1}{n} \\
\end{align*}`

- Where `\(S_1^*\)` is all the singletons.

---
# The frequency spectrum of alleles

A second summary statistic of diversity that uses ancestral state information is `\(\theta_H\)`:

`\begin{align*}
\theta_H = \frac{\sum_{i=1}^{n-1} i^2S_i}{n(n-1)/2} \\
\end{align*}`

- Where `\(S_i\)` is again the number of segregating sites where `\(i\)` haplotypes carry the derived allele (Fay and Wu, 2000).

.pull-left[
<div align="center">
<img src="daf.png" height=300>
</div>
]

.pull-left[
`\begin{align*}
\theta_H = & \frac{\sum_{i=1}^{n-1} i^2S_i}{n(n-1)/2} \\
= & \frac{(1^2 \times 4 + 2^2 \times 1 + 3^2 \times 2 + 6^2 \times 1 + 9^2 \times 2)}{10(10-1)/2} \\
= 4.98
\end{align*}`
]

---

# Summary

All of these statistics --- `\(\pi, \epsilon_1, \eta_1, \theta_H\)` --- are estimators of `\(\theta\)` 
- in a __Wright-Fisher population__
- at __mutation-drift__ equilibrium 
- under an __infinite sites__ mutational model

Specifically,

`\begin{align*}
E(\epsilon_1) = E(\eta_1) = E(\theta_H)
\end{align*}`

These relationships arise because we know the expected shape of the allele frequency distribution under our standard neutral assumptions.