class: center, middle, inverse, title-slide

.title[
# Test for selection
]
.author[
### Jinliang Yang
]
.date[
### Feb. 23, 2024
]

---

# Diversity measurement

We now consider several statistics summarizing sequencing diversity that use information about __the frequency of derived alleles__

- These statistics capture more information about our sequencing data.

--

Fu and Li (1993) defined a statistic, `\(\epsilon_1\)`, based on the number of __derived singletons__ in a sample.

`\begin{align*}
\epsilon_1 = S_1 \\
\end{align*}`

- Where `\(S_1\)` is the number of segregating sites with derived alleles found on only one haplotype.

--

If we don't know the ancestral state, we can also define a statistic, `\(\eta_1\)`, based on __all singletons__ in a sample

`\begin{align*}
\eta_1 = S_1^*\frac{n-1}{n} \\
\end{align*}`

- Where `\(S_1^*\)` is the number of sites where the rarer allele is found on only one haplotype, and `\(n\)` is the number of haplotypes sampled.

---

# Diversity measurement

A second summary statistic of diversity that uses ancestral state information is `\(\theta_H\)`:

`\begin{align*}
\theta_H = \frac{\sum_{i=1}^{n-1} i^2S_i}{n(n-1)/2} \\
\end{align*}`

- Where `\(S_i\)` is again the number of segregating sites where `\(i\)` haplotypes carry the derived allele (Fay and Wu, 2000).

--

.pull-left[

<div align="center">
<img src="daf.png" height=300>
</div>

]

--

.pull-left[

`\begin{align*}
\theta_H = & \frac{\sum_{i=1}^{n-1} i^2S_i}{n(n-1)/2} \\
= & \frac{(1^2 \times 4 + 2^2 \times 1 + 3^2 \times 2 + 6^2 \times 1 + 9^2 \times 2)}{10(10-1)/2} \\
= & 4.98
\end{align*}`

]

---

# Summary of the `\(\theta\)` statistics

All of these statistics (`\(\epsilon_1, \eta_1, \theta_H\)`) are estimators of `\(\theta\)`

- at __mutation-drift__ equilibrium
- under an __infinite sites__ mutational model

--

Specifically,

`\begin{align*}
E(\epsilon_1) = E(\eta_1) = E(\theta_H) = \theta
\end{align*}`

These relationships arise because we know the expected shape of the allele frequency distribution under our standard neutral assumptions.
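---

# Diversity measurement: a worked sketch

The three estimators can be computed directly from a table of derived-allele counts. The Python below is a minimal sketch and not part of the original material: the function names and the dictionary layout (`sfs[i]` is the number of segregating sites where `i` haplotypes carry the derived allele) are illustrative assumptions. The counts are the ones used in the `\(\theta_H\)` worked example with `\(n = 10\)`.

```python
def epsilon_1(sfs):
    """Fu and Li's epsilon_1: the number of derived singletons, S_1."""
    return sfs.get(1, 0)


def eta_1(sfs, n):
    """eta_1 = S_1^* (n - 1) / n, where S_1^* counts all singletons.

    Without ancestral-state information a singleton is any site whose rarer
    allele sits on one haplotype, i.e. derived count 1 or derived count n - 1.
    """
    singletons = sfs.get(1, 0)
    if n - 1 != 1:  # avoid counting the same class twice when n == 2
        singletons += sfs.get(n - 1, 0)
    return singletons * (n - 1) / n


def theta_h(sfs, n):
    """Fay and Wu's theta_H = sum_i i^2 S_i / (n (n - 1) / 2)."""
    return sum(i ** 2 * s for i, s in sfs.items()) / (n * (n - 1) / 2)


# Counts taken from the theta_H worked example (n = 10 haplotypes).
n = 10
example_sfs = {1: 4, 2: 1, 3: 2, 6: 1, 9: 2}

print(epsilon_1(example_sfs))            # 4
print(eta_1(example_sfs, n))             # 5.4
print(round(theta_h(example_sfs, n), 2)) # 4.98
```

The three values differ here because a single small sample is noisy; they are only expected to agree on average under the neutral assumptions listed earlier.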
---

# Detecting selection using the SFS

## The effects of positive selection

.pull-left[

<div align="center">
<img src="sfs1.png" height=300>
</div>

> Hahn, 2020

]

.pull-right[

- After a sweep has ended, new mutations start to accumulate.
- These new mutations are by definition __singletons__
  - there is only one origin in the sample for each derived allele.

]

The SFS can be skewed toward an excess of low-frequency polymorphisms relative to the neutral spectrum.

---

# Detecting selection using the SFS

## The effects of balancing selection

Here we consider a simple scenario with a single biallelic site that has been under balancing selection for a long time.

- Variation within each allelic class has been able to __build up__ and __reach equilibrium__

.pull-left[

<div align="center">
<img src="sfs3.png" height=200>
</div>

> Bitarello et al., 2018

]

.pull-right[

- Neutral mutations have accumulated both within and between allelic classes
- Overall variation is higher
- SNPs at intermediate frequency show __a distinctive "bump"__ in the SFS.

]

---

# Detecting selection using SFS

A straightforward way would be to test for a difference between two SFSs, for example the observed spectrum and the neutral expectation.

- However, linkage among sites means that __SNPs at a locus are not independent__, which violates the assumptions made by almost all such tests.

--

### Instead, we use `\(\theta\)` to detect deviations.

- `\(\theta_\pi\)`: pairwise nucleotide diversity.

--

- `\(\theta_W\)`: Watterson's `\(\theta\)`, using the total number of segregating sites

--

- `\(\epsilon_1 = S_1\)`: the number of derived singletons in a sample.
- `\(\eta_1\)`: based on all singletons in a sample.

--

Under the standard neutral model, all of these estimators share the same expectation, `\(\theta\)`, so the difference between any two of them is expected to have a mean of 0.

---

# Tajima's D and related tests

Tajima (1989) constructed the first test to detect a difference between the observed SFS and its neutral expectation.
His statistic, `\(D\)`, was defined as:

`\begin{align*}
D = \frac{\theta_\pi - \theta_W}{\sqrt{Var(\theta_\pi - \theta_W)}}
\end{align*}`

--

Fu and Li (1993) created similar statistics. These are known as Fu and Li's `\(D\)`, `\(F\)`, `\(D^*\)`, and `\(F^*\)`.

`\begin{align*}
D = \frac{\theta_W - \epsilon_1}{\sqrt{Var(\theta_W - \epsilon_1)}}
\end{align*}`

`\begin{align*}
F = \frac{\theta_\pi - \epsilon_1}{\sqrt{Var(\theta_\pi - \epsilon_1)}}
\end{align*}`

`\begin{align*}
D^* = \frac{\theta_W - \eta_1}{\sqrt{Var(\theta_W - \eta_1)}}
\end{align*}`

`\begin{align*}
F^* = \frac{\theta_\pi - \eta_1}{\sqrt{Var(\theta_\pi - \eta_1)}}
\end{align*}`

---

# Tajima's D and related tests

Tajima (1989) constructed the first test to detect a difference between the observed SFS and its neutral expectation.

His statistic, `\(D\)`, was defined as:

`\begin{align*}
D = \frac{\theta_\pi - \theta_W}{\sqrt{Var(\theta_\pi - \theta_W)}}
\end{align*}`

These statistics were originally designed to fit a normal distribution; however, none of them fits a parametric distribution very well.

### Calculation

- Only variable sites at each locus are needed
- The number of invariant sites does not figure into any calculations.

---

# Interpreting values of the test statistics

Tajima's `\(D\)`, Fu and Li's `\(D, F, D^*, F^*\)`:

`\begin{align*}
D = \frac{\theta_\pi - \theta_W}{\sqrt{Var(\theta_\pi - \theta_W)}}
\end{align*}`

After a sweep, all SNPs are low in frequency, so `\(\theta_\pi\)` will be much lower than expected, while statistics based on counts of segregating sites (like `\(\theta_W\)`) will be much closer to their expected values.

--

------

- All __negative__ when there has been a sweep

---

# Interpreting values of the test statistics

Tajima's `\(D\)`, Fu and Li's `\(D, F, D^*, F^*\)`:

`\begin{align*}
D = \frac{\theta_\pi - \theta_W}{\sqrt{Var(\theta_\pi - \theta_W)}}
\end{align*}`

Balancing selection leads to __an excess of intermediate frequency neutral variation__ surrounding a selected site.
In such cases, `\(\theta_\pi\)` will be greater than `\(\theta_W\)` and the other statistics.

--

------

- All __negative__ when there has been a sweep
- All __positive__ when there is balancing selection

--

- Are usually __significant__ when the values are `\(> +2\)` or `\(< -2\)`
  - The exact thresholds depend on sample size, number of SNPs, etc.

---

# The power of the SFS

The time window for detecting positive selection is limited.

- #### Too early during the sweep
  - The signal will not be strong enough

--

- #### Too late after the sweep
  - Both levels and frequencies of variants will have returned to normal

--

Power is also determined by the distance between the studied loci and the location of the selected site.

- This is because of the effect of recombination.
- Move far enough away and there will be no signal of selection at all.

---

# The Hudson, Kreitman, Aguadé (HKA) Test

#### H0: If two loci evolve neutrally, they should have a similar ratio of polymorphism (within species) to divergence (between species)

--

The test teases apart what is responsible for `\(\theta = 4N_e \mu\)`, and so tests for selection

Under a neutral model:

- All loci share the same `\(N_e\)`
- Neutral mutation rate `\(\mu\)` varies for different loci, but should be constant for the same locus in different species

--

- High neutral mutation rate = high polymorphism and greater divergence between species
- Low rate = low polymorphism and less divergence

---

# The Hudson, Kreitman, Aguadé (HKA) Test

#### H0: If two loci evolve neutrally, they should have a similar ratio of polymorphism (within species) to divergence (between species)

To determine the level of divergence

- Examine fixed differences between species
- Then look at the number of polymorphisms within species

--

### The _Adh_ locus example

| | _Adh_ | Control locus |
| :-------: | :------: | :-------: |
| Polymorphism within species (S) | 0.101 | 0.022 |
| Divergence between species (D) | 0.056 | 0.052 |
| Ratio S/D (within/between) | 1.80 | 0.42 |
| `\(\chi^2\)` P-value | 0.016 | |

--

Conclusion: Excess polymorphism at _Adh_ may be due to balancing selection, but HKA can’t determine the true cause

---

# Cautions and Prospects

- A review paper: [Detecting Natural Selection in Genomic Data](https://www.annualreviews.org/doi/abs/10.1146/annurev-genet-111212-133526) by Vitti, Grossman, and Sabeti, 2013.
  - Frequency-based methods (e.g., Tajima's D, Fay & Wu's H, Ewens-Watterson test)
  - Gene-based methods (e.g., Ka/Ks, MKT)
  - Mutation rate-based methods (e.g., HKA test)

--

Rejection of the null hypothesis (H0: neutral theory) is not necessarily the same as demonstrating natural selection!

The tests rest on strong assumptions:

1. Constant Ne
2. Populations are at mutation-drift equilibrium
3. No gene flow
4. No recombination

...

--

Most of these tests are weak! They have little power to detect deviations unless these deviations are large.
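---

# Tajima's D: a worked sketch

Tajima's `\(D\)` compares `\(\theta_\pi\)` with `\(\theta_W\)` and scales the difference by its standard deviation. The Python below is a minimal sketch of that calculation, not part of the original material. It assumes the variance constants from Tajima (1989); the function names and the dictionary layout (`sfs[i]` is the number of sites where `i` of the `n` haplotypes carry the derived allele) are illustrative assumptions, and the two "sweep-like" and "balancing-like" spectra are made up purely to show the sign of the statistic.

```python
from math import sqrt


def theta_pi(sfs, n):
    """Pairwise diversity: a site with i derived copies differs in i*(n-i) pairs."""
    pairs = n * (n - 1) / 2
    return sum(i * (n - i) * s for i, s in sfs.items()) / pairs


def theta_w(sfs, n):
    """Watterson's theta: segregating sites divided by a1 = sum_{i<n} 1/i."""
    a1 = sum(1 / i for i in range(1, n))
    return sum(sfs.values()) / a1


def tajimas_d(sfs, n):
    """Tajima's D = (theta_pi - theta_W) / sqrt(Var(theta_pi - theta_W))."""
    s = sum(sfs.values())  # total number of segregating sites
    if s == 0:
        raise ValueError("D is undefined without segregating sites")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    return (theta_pi(sfs, n) - theta_w(sfs, n)) / sqrt(variance)


n = 10
example_sfs = {1: 4, 2: 1, 3: 2, 6: 1, 9: 2}  # counts from the theta_H example
sweep_like = {1: 10}      # hypothetical: every SNP is a singleton
balancing_like = {5: 10}  # hypothetical: every SNP at intermediate frequency

for name, sfs in [("example", example_sfs),
                  ("sweep-like", sweep_like),
                  ("balancing-like", balancing_like)]:
    print(name, round(tajimas_d(sfs, n), 2))
```

Only variable sites enter the calculation, and the sign follows the interpretation given earlier: the singleton-only spectrum gives a negative value and the intermediate-frequency spectrum a positive one. Because the statistic does not fit a parametric distribution well, a sketch like this gives the value of `\(D\)` but not its significance.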