class: center, middle, inverse, title-slide # Direct Selection ### Jinliang Yang ### Feb. 15th, 2020 --- # Direct and linked selection Mutations are __advantageous__, __deleterious__, or __neutral__. ### Direct selection The effects of selection on these mutation themselves ### Linked selection The effects of selection on mutations closely linked to those under selection. -- ------- Because the __expected patterns of polymorphism__ are often different, and therefore different methods will be used to detect one type of selection or the other. --- # The accumulation of sequence divergence ### Cumulative number of mutations in different timescales - 100 years - Different lines, varieties, etc. -- - 10,000 years - wild ancestors, modern breeding lines -- - 10 million years - Two different species, i.e., the grass (_Poaceae_) subfamilies diverged from a common ancestor 60-80 My. - Grasses (wheat, barley, Brachypodium, Sorghum, Oryza, Zea) -- ------------ Species differ because new alleles arise and are fixed. --- # The accumulation of sequence divergence ## Necleotide substituion rate ( `\(k\)` ) The variable `\(k\)` is defined as the substitution rate of __new alleles__ - The rate of alleles that are fixed over long periods of time. - It determines how quickly two squences are expected to diverge over time. -- ## Sequence divergence ( `\(d\)` ) We define `\(d\)` as the genetic distance between two orthologous sequences. - We generally calculate `\(d\)` by taking a single sequence from each species and counting the number of positions that differ between them, divided by the total number of aligned necleotides. --- # The accumulation of sequence divergence The contribution of the __rate of substituion ( `\(k\)` )__ to the expected amount of __divergence ( `\(d\)` )__ can be seen in the following equation: `\begin{align*} E(d) = k2t + \theta_{Anc} \end{align*}` - Where `\(k\)` represents the allele substitution rate. - `\(t\)` is the time since the species split - We use `\(2t\)` because substitutions can occur on both branches of the phylogenetic tree. - `\(\theta_{Anc}\)`: average amount of nucleotide variation expected between two sequences in the ancestor. - Because at the time of speciation there differences have already accumulated along the two linages. -- Simplified as below if assuming divergence levels are much greater than the expected levels of polymorphism in the ancestral species, `\begin{align*} E(d) = k2t \end{align*}` --- # What affects `\(k\)`? Two quantities determine the rate of substitution ( `\(k\)` ). ### The probability of fixation of any mutation ( `\(u\)` ). ### The total number of mutations that arise and can possibly be fixed. --- # Fixation rate of new mutation #### Neutral mutation ( `\(u_0\)` ) If a mutation has no effect on fitness, the probability of fixing is equal to __its current frequency__. -- New mutations always begin at frequency `\(\frac{1}{2N}\)`, therefore, `\begin{align*} u_0 = \frac{1}{2N} \end{align*}` -- #### Advantageous mutations ( `\(u_a\)` ) For new, advantageous mutations ( `\(s > 0\)` ) and large effective population sizes, the probability of fixation is `\begin{align*} u_a \approx 2s_a \end{align*}` according to Haldane 1927; Fisher 1930; Wright 1931. - `\(s_a\)` is the __selective advantage of the new allele in a heterozygote__ and `\(2s_a\)` in a homozygote. --- # Fixation rate of new mutation #### Deleterious mutations ( `\(u_d\)` ) For new, deleterious mutations ( `\(s < 0\)` ) that don't have large effects, the probability of fixation is (Kimura 1957): `\begin{align*} u_d \approx \frac{2s_d}{1 - e^{(-4N s_d)}} \end{align*}` - Here `\(s_d\)` is the __deleterious effect of the new allele in a heterozygote__ and `\(2s_d\)` is the effect in a homozygote. --- # Fixation rate of new mutation Probability of fixation, relative to a neutral allele, of new, selected mutations: `\begin{align*} u/u_0 \approx \frac{2s}{1 - e^{(-4N s)}} / \frac{1}{2N} = \frac{4Ns}{1 - e^{(-4N s)}} \end{align*}` .pull-left[ ```r ns <- seq(from = -1, to =1, by=0.01) plot(ns, 4*ns/(1 - exp(-4*ns)), xlab="Ns", ylab="") abline(v=0, lty=2, lwd=2) ``` <img src="w5class_files/figure-html/unnamed-chunk-1-1.png" width="80%" style="display: block; margin: auto;" /> ] -- .pull-right[ - `\(Ns=0\)`, neutral mutations - `\(Ns > 0\)`, slightly advantageous mutations are not that much more likely to fix than neutral mutations. - `\(Ns < 0\)`, slightly deleterious mutations have some probability of fixing. ] --- # What affects `\(k\)`? Two quantities determine the rate of substitution ( `\(k\)` ). ### The probability of fixation of any mutation ( `\(u\)` ). `\begin{align*} u_0 & = \frac{1}{2N} \\ u_a & \approx 2s_a \\ u_d & \approx \frac{2s_d}{1 - e^{(-4N s_d)}} \end{align*}` ### The total number of mutations that arise and can possibly be fixed. --- # The total number of mutations If the probability of a mutation at a nucleotide in each generation is `\(\nu\)`, then in a population of `\(N\)` diploid individuals, there will be __ `\(2N\nu\)` new mutations per generation at a single site__. -- - with `\(f_0\)` representing the fraction of neutral mutations. - __ `\(2N \nu f_0\)`__ will be neutral -- - The remaining will be advantageous (__ `\(f_a\)` fraction__) and deleterious (__ `\(f_d\)` fraction__). - __ `\(2N \nu f_a\)`__ new advantageous mutations - __ `\(2N \nu f_d\)`__ new deleterious mutations -- If advantageous and deleterious mutations have no contribution, then the subsitution rate is a function of only _the total number_ of neutral mutations that arise and the _probability that each of them fixes_. `\begin{align*} k = (2N \nu f_0) \frac{1}{2N} = \nu f_0 \end{align*}` --- # The total number of mutations `\begin{align*} k = (2N \nu f_0) \frac{1}{2N} = \nu f_0 \end{align*}` Substitute the symbol `\(\mu\)` for the total rate at which neutral mutations arise, `\(\mu = \nu f_0\)`, the rate of neutral mutations is: `\begin{align*} k = \mu \end{align*}` -- When considering only neutral mutations, __the substitution rate is equal to the neutral mutation rate ( `\(\mu\)` )__, regardless of population size. - While more mutations arise in large populations, each of them has a smaller chance of eventually going to fixation - Likewise, it is more likely that any single new mutation will fix in a small population, but there are fewer mutations overall. --- # Advantageous and deleterious mutations The rate of subsitution for advantageous mutations: `\begin{align*} k = (2N \nu f_a) 2s_a = 4N \nu f_as_a \end{align*}` -- The rate of subsitution for deleterious mutations: `\begin{align*} k & = (2N \nu f_d) \times \frac{2s_d}{1 - e^{(-4N s_d)}} \\ & = \frac{4N \nu f_d s_d}{1 - e^{(-4N s_d)}} \end{align*}` -- ---------------- - The population size ( `\(N\)` ) plays an important role in the rate of substitution of selected mutations. - More advantageous mutations will fix in larger populations than in smaller populations - More deleterious mutation will fix in smaller populations relative to larger populations. --- # Detecting selection using divergence In coding regions, we measure divergenece that is due to nonsynonymous and synoymous changes. - `\(d_N\)` as the number of nonsynonymous difference per nonsynonymous site - `\(d_S\)` as the number of synonymous differences per synonymous site -- Note that natural selection has __a profound effect__ on the number of nonsynonymous mutations that are fixed. `\begin{align*} E(d_N) & = k2t \\ & = 2t (\nu f_0 + 4N \nu f_as_a + \frac{4N \nu f_d s_d}{1 - e^{(-4N s_d)}}) \\ & = \nu 2t (f_0 + 4N f_as_a + \frac{4N f_d s_d}{1 - e^{(-4N s_d)}}) \\ \end{align*}` The total nonsynonymous divergence in a region is due to all three types of mutations, therefore, our expression for `\(d_N\)` includes all three terms. --- # Detecting selection using divergence `\begin{align*} E(d_N) & = \nu 2t (f_0 + 4N f_as_a + \frac{4N f_d s_d}{1 - e^{(-4N s_d)}}) \\ \end{align*}` - A higher underlying mutation rate, `\(\nu\)`, and longer divergence times, `\(t\)`, will increase the amount of divergence - The proportion of advantageous mutations fixed will be a function of the frequency at which they arise and their average selective effect - The deleterious mutations can also contribute to divergence if selection is weak enough. --- # Synonymous mutations Here we assume all synonymous changes are neutral. - That is, `\(f_0 =1\)` and `\(f_a = f_d =0\)` -- The total expected amount of synonymous divergence between two sequences is: `\begin{align*} E(d_S) & = \nu 2t \\ \end{align*}` For neutral mutations, the substitution rate is simply equal to the mutation rate. --- # The ratio of nonsynonymous to synonymous divergence Because both `\(\nu\)` and `\(t\)` will be approximately the same of nonsynonymous and synonymous sites in the same gene, dividing above equations gives `\begin{align*} \frac{E(d_N)}{E(d_S)} & = f_0 + 4N f_as_a + \frac{4N f_d s_d}{1 - e^{(-4N s_d)}} \\ \end{align*}` -- - Relative to synonymous divergence, the level of nonsynonymous divergence is again due to the fractions of mutations that are __neutral__, __advantageous__, and __deleterious__. - Note that here, `\(f_0\)` represents only the __nonsynonymous mutations__. --- # Some general guidelines __ `\(d_N/d_S << 1\)`__ The vast majority of nonsynonymous mutations are deleterious, and negative (purifying) selection is predominant. -- __ `\(d_N/d_S < 1\)`__ The majority of nonsynonymous mutations are deleterious, but here may be some unknown fraction of advantageous mutations. -- __ `\(d_N/d_S = 1\)`__ This situation can occur in two cases: - First, there is no selection and all nonsynoymous mutations are neutral. - Second, there is simply a large number of neutral and advantageous mutations (as well as deleterious mutations). -- __ `\(d_N/d_S > 1\)`__ There are many advantageous nonsynonymous mutations and positive selection is predominant, but there are still many deleterious mutations. --- # Interpreting `\(d_N/d_S\)` <div align="center"> <img src="dnds.png" height=300> </div> > Yang and Gaut, 2011. _Arabidopsis thaliana_ and _A. Lyrata_ The mean values for individal genes vary from 0 to >2 This range indicates that at least 75% to 85% of nonsynonymous mutations are deleterious and do not fix. --- # Detecting selection using polymorphism On timescales shorter than those required for mutations to fix, selection will change the mean frequency of alleles in a population. For new mutations, the density of polymorphisms found at frequency `\(q\)` is given by (Wright 1969): `\begin{align*} f(q) & = \frac{2 \nu}{q(1-q)} \frac{1 - e^{(-4Ns)(1-q)}}{1 - e^{(-4N s)}} \\ \end{align*}` - Where `\(\nu\)` is again the total rate of mutation. - `\(s\)` is the fitness effect. Advantageous mutations have `\(s > 0\)` and deleterious mutations have `\(s <0\)` --- # The expected frequency spectra `\begin{align*} f(q) & = \frac{2 \nu}{q(1-q)} \frac{1 - e^{(-4Ns)(1-q)}}{1 - e^{(-4N s)}} \\ \end{align*}` ```r # expected freq spectra f <- function(q, ns){ frq = 2/(q*(1-q)) * (1 - exp(-4*ns*(1-q))) / (1 - exp(-4*ns)) return(frq)} q <- seq(from = 0.01, to =0.99, by=0.01) ## Ploting function plot(q, f(q, ns=0.01), type="l", lty=1, lwd=3, xlab="Ns", ylab="No. of polymorhpic sites", cex.lab=2) lines(q, f(q, ns=-50), type="l", lty=1, lwd=3, col="red") lines(q, f(q, ns=-5), type="l", lty=2, lwd=3, col="red") lines(q, f(q, ns=5), type="l", lty=1, lwd=3, col="blue") lines(q, f(q, ns=50), type="l", lty=2, lwd=3, col="blue") legend(0.6, 200, title="Ns", legend=c("-50", "5", "0", "-5", "50"), col=c("red", "red", "black", "blue", "blue"), lty=c(1,2,1,1,2), cex=2, lwd=3) ``` --- # The expected frequency spectra `\begin{align*} f(q) & = \frac{2 \nu}{q(1-q)} \frac{1 - e^{(-4Ns)(1-q)}}{1 - e^{(-4N s)}} \\ \end{align*}` .pull-left[ <img src="w5class_files/figure-html/unnamed-chunk-3-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right[ - Deleterious alleles => lower frequencies - most strongly deleterious mutations are immediately removed from the population - Advantage alleles shifted toward higher frequencies - most strongly advantageous mutations fix very rapidly. ] --- # `\(\pi_N/\pi_S\)` Within a species, by analogy with the logic of the comparison of `\(d_N\)` and `\(d_S\)`, we can compare the average number of non-synonymous differences per nonsynoymous site ( `\(\pi_N\)` ) to the average number of synonymous differences per synonymous site ( `\(\pi_S\)` ). - Combining the methods for calculating `\(\pi\)` - With the methods for calculating nonsynonymous and synonymous changes. -- ### Interpretation of the ratio - Values of `\(\pi_N/\pi_S\)` below 1 are again evidence for the predominance of purifying selection, and the vast majority of all coding loci show `\(\pi_N/\pi_S < 1\)` - However, interpretation of `\(\pi_N/\pi_S > 1\)` is different. --- # `\(\pi_N/\pi_S\)` ### Interpretation of the ratio - Since positive selction will rapidly fix advantageous mutations, these adaptive changes will rarely be found in studies of polymorphism - Instead, balancing selection will result in `\(\pi_N/\pi_S > 1\)` - heterozygote advantage (heterosis) - Therefore `\(d_N/d_S > 1\)` for strong evidence of positive selection - `\(\pi_N/\pi_S > 1\)` is a very strict criterion for detecting balancing selection. - Single sites under very strong selection will never contribute enough to values of `\(\pi_N\)` to push `\(\pi_N/\pi_S\)` greater than 1.