Statistical Foundations

Jinliang Yang

Oct. 3rd, 2022

Population Genetics vs. Quantitative genetics?

Population genetics

  • Pop-gen is the study of evolution.

  • The language of pop-gen is Mathematics.

Quantitative genetics

  • Quant-gen is the study of complex traits, or phenotypes.
  • The language of Quant-gen is Statistics.

Quantitative genetics is almost synonymous with statistics

  • R. A. Fisher was a founder of quantitative genetics and also of the analysis of variance and randomization procedures in statistics.
  • The early geneticist Karl Pearson developed the mathematical theory of regression and correlation, building on Francis Galton's ideas.

In the next couple of weeks, we will be deeply involved with the statistical evaluation of the basic quantitative genetic models.

Quantitative genetics vs. statistics

  • Many genetic factors

    • The quantitative traits they determine are almost always normally distributed
  • Genetic factors act in pairs (two alleles per locus)

    • Explanatory variables
    • Each with two or three levels of variation
  • Genetic factors are passed on to progeny at random

    • Random events
  • Genetic factors sometimes show independent assortment

    • Independence

Quantitative traits: statistical notation

Conceptual Notation

Phenotype = Genotype + error

Matrix Notation

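$$
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}_{n \times 1}
=
\begin{bmatrix}
X_{11} & X_{12} & \cdots & X_{1m} \\
X_{21} & X_{22} & \cdots & X_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
X_{n1} & X_{n2} & \cdots & X_{nm}
\end{bmatrix}_{n \times m}
\begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_m \end{bmatrix}_{m \times 1}
+
\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}_{n \times 1}
$$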

Statistical Notation

$$Y_i = \sum_{j=1}^{m} X_{ij}\alpha_j + \epsilon_i$$
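
As a rough illustration of how the matrix and summation forms agree, the sketch below builds a small toy example in R. The dimensions (n = 5, m = 3), the 0/1 coding of X, and the error standard deviation of 0.1 are arbitrary assumptions chosen for the example, not values from the lecture.

```r
set.seed(1)
n <- 5   # number of individuals (assumed toy size)
m <- 3   # number of loci (assumed toy size)

# X: n x m matrix of allele indicators (0/1); alpha: m effects; eps: n errors
X     <- matrix(rbinom(n * m, size = 1, prob = 0.5), nrow = n, ncol = m)
alpha <- runif(m)
eps   <- rnorm(n, mean = 0, sd = 0.1)

# Matrix notation: Y = X alpha + eps
Y_matrix <- as.vector(X %*% alpha + eps)

# Statistical notation: Y_i = sum_j X_ij * alpha_j + eps_i
Y_sum <- sapply(seq_len(n), function(i) sum(X[i, ] * alpha) + eps[i])

# TRUE if both notations give the same phenotypes
all.equal(Y_matrix, Y_sum)
```

The matrix product does for all n individuals at once what the summation does one individual at a time.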

Why the normal distribution?

The Central Limit Theorem (CLT)

  • The CLT states that the sum of a large number of independent random variables ($X_1, X_2, X_3, ..., X_n$) is approximately normally distributed, no matter which distribution the individual $X$'s were sampled from, as long as they were sampled from identical distributions with finite variance.

A simulation experiment

$$Y_i = \sum_{j=1}^{m} X_{ij}\alpha_j + \epsilon_i$$

  • For a given individual ( $i=1$ ) with a number of loci ( $m=1,000$ )

  • At each locus $j$, the allele $X_j \in \{A, a\}$ occurs with probability $p$ (for $A$) or $q=1-p$ (for $a$)

  • The effect of the $j$th allele ( $\alpha_j$ ) can be sampled from any distribution (e.g., a uniform distribution)

According to the CLT, if $m$ is sufficiently large, the sum is approximately normally distributed.

m <- 1000
## for each locus, the chance of allele A (coded 1) or a (coded 0) is 0.5
x <- rbinom(n=m, size=1, prob=0.5)
## sample allele effects from a uniform distribution:
a <- runif(n=m)
## phenotype = sum of allele effects, with the error term set to 0
y <- sum(x*a) + 0
y
## [1] 243.8146

A simulation experiment

$$Y_i = \sum_{j=1}^{m} X_{ij}\alpha_j + \epsilon_i$$

set.seed(1234)    # seed for the random number generator
m <- 1000         # number of loci
n <- 2000         # simulate a population of 2,000 individuals
out <- numeric(n) # preallocate the vector of phenotypes
for (i in seq_len(n)) {
  x <- rbinom(n=m, size=1, prob=0.5) # for each allele, the chance of A = 0.5
  a <- runif(n=m)                    # sample effects from a uniform distribution
  out[i] <- sum(x*a)
}
# shapiro.test(out) # W = 0.99928, p-value = 0.6622
hist(out, breaks=50, col="#b8860b", main="Phenotype Distribution", xlab="")

A simulation experiment

$$Y_i = \sum_{j=1}^{m} X_{ij}\alpha_j + \epsilon_i$$

set.seed(1234)    # seed for the random number generator
m <- 2            # number of loci (too few for the CLT to apply)
n <- 2000         # simulate a population of 2,000 individuals
out <- numeric(n) # preallocate the vector of phenotypes
for (i in seq_len(n)) {
  x <- rbinom(n=m, size=1, prob=0.5) # for each allele, the chance of A = 0.5
  a <- runif(n=m)                    # sample effects from a uniform distribution
  out[i] <- sum(x*a)
}
# shapiro.test(out) # W = 0.91117, p-value < 2.2e-16
hist(out, breaks=50, col="#b8860b", main="Phenotype Distribution", xlab="")
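
The two runs above differ only in m, so the experiment can be wrapped in a small helper and repeated over several values of m. This is only a sketch: the function name `simulate_pheno` and the grid of m values are made up for illustration, and the Shapiro-Wilk W statistic is used here as a rough indicator of normality rather than a formal conclusion.

```r
# Hypothetical helper: simulate n phenotypes, each the sum of m allele effects
simulate_pheno <- function(m, n = 2000) {
  replicate(n, {
    x <- rbinom(n = m, size = 1, prob = 0.5)  # allele indicator at each locus
    a <- runif(n = m)                         # allele effects from Uniform(0, 1)
    sum(x * a)
  })
}

set.seed(1234)
m_grid <- c(2, 10, 100, 1000)   # assumed grid of locus counts

# Shapiro-Wilk W statistic for each value of m
W <- sapply(m_grid, function(m) {
  unname(shapiro.test(simulate_pheno(m))$statistic)
})

# W closer to 1 suggests the simulated phenotypes look more nearly normal
data.frame(m = m_grid, W = round(W, 4))
```

Under these assumptions, W should move toward 1 as m grows, in line with the two histograms.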

Probability Density

For a continuous trait, e.g., kernel number per ear (ranging from 0 to 1,000), what is $Pr(Y=100)$?

Assume wheat plant height in a population is normally distributed with mean = 30 inches and sd = 5 inches. Question: what is $Pr(30 < Y \le 50)$?

Using the probability density function (or pdf) $f(y)$ and integration, we can calculate the probability that $Y$ falls within a given interval as:

$$Pr(30 < Y \le 50) = \int_{30}^{50} f(y)\,dy$$
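
For a continuous variable the probability of any single exact value is zero, which is why the question is posed over an interval. A minimal sketch of the wheat-height calculation in R follows, using the mean and sd stated above. It computes the integral two ways, through the normal CDF (`pnorm`) and through numerical integration of the density (`dnorm`). The variable names are illustrative.

```r
mu    <- 30   # mean height in inches (from the example above)
sigma <- 5    # standard deviation in inches (from the example above)

# Via the normal CDF: Pr(30 < Y <= 50) = F(50) - F(30)
p_cdf <- pnorm(50, mean = mu, sd = sigma) - pnorm(30, mean = mu, sd = sigma)

# Via numerical integration of the density f(y) from 30 to 50
p_int <- integrate(dnorm, lower = 30, upper = 50, mean = mu, sd = sigma)$value

c(cdf = p_cdf, integral = p_int)
```

Both routes should give a value just under 0.5, because the interval runs from the mean to four standard deviations above it.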

Expectation and variance

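Define the random variable $X$ (i.e., the genotype) as the number of copies of allele $A$:

$$
X = \begin{cases}
2 & \text{if } AA \text{ with frequency } p^2 \\
1 & \text{if } Aa \text{ with frequency } 2p(1-p) \\
0 & \text{if } aa \text{ with frequency } (1-p)^2
\end{cases}
$$

where $p$ is the allele frequency of $A$.

Then, according to Formula (1) in the Note:

$$E(f(X)) = \sum_{i=1}^{k} f(x_i) Pr(X = x_i)$$

Expected value of $X$:

$$E[X] = 0 \times (1-p)^2 + 1 \times [2p(1-p)] + 2 \times p^2 = 2p$$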

Expected value of $X^2$:

$$E[X^2] = 0^2 \times (1-p)^2 + 1^2 \times [2p(1-p)] + 2^2 \times p^2 = 2p(1-p) + 4p^2$$

Thus, the variance of allelic counts is

$$Var(X) = E[X^2] - E[X]^2 = 2p(1-p) + 4p^2 - (2p)^2 = 2p(1-p)$$
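
The derivation can be checked numerically by applying the expectation formula directly to the three genotype classes. The sketch below assumes an allele frequency of p = 0.3 purely for illustration. Any value between 0 and 1 should give the same agreement.

```r
p  <- 0.3                                  # assumed allele frequency of A
x  <- c(0, 1, 2)                           # copies of allele A: aa, Aa, AA
pr <- c((1 - p)^2, 2 * p * (1 - p), p^2)   # genotype frequencies

EX   <- sum(x * pr)      # E[X]   = sum of x_i * Pr(X = x_i)
EX2  <- sum(x^2 * pr)    # E[X^2] = sum of x_i^2 * Pr(X = x_i)
VarX <- EX2 - EX^2       # Var(X) = E[X^2] - E[X]^2

c(EX = EX, closed_form = 2 * p)
c(VarX = VarX, closed_form = 2 * p * (1 - p))
```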

Alternative coding

Define the random variable X as below:

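$$
X = \begin{cases}
1 & \text{if } AA \text{ with frequency } p^2 \\
0 & \text{if } Aa \text{ with frequency } 2p(1-p) \\
-1 & \text{if } aa \text{ with frequency } (1-p)^2
\end{cases}
$$

where $p$ is the allele frequency of $A$.

Then,

$$E[X] = -1 \times (1-p)^2 + 0 \times [2p(1-p)] + 1 \times p^2 = -(1 - 2p + p^2) + p^2 = 2p - 1$$

$$E[X^2] = (-1)^2 \times (1-p)^2 + 0^2 \times [2p(1-p)] + 1^2 \times p^2 = 1 - 2p + p^2 + p^2 = 2p^2 - 2p + 1$$

Thus, the variance of allelic counts is

$$Var(X) = E[X^2] - E[X]^2 = 2p^2 - 2p + 1 - (4p^2 - 4p + 1) = -2p^2 + 2p = 2p(1-p)$$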
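
A quick simulation can illustrate that recoding the genotype shifts the mean but leaves the variance unchanged. The sketch below assumes p = 0.3 and a sample of 100,000 genotypes, both arbitrary choices. It draws the 0/1/2 coding as the number of A alleles among two independent draws, which reproduces the genotype frequencies given above.

```r
set.seed(2022)
p <- 0.3                                      # assumed allele frequency of A
g <- rbinom(n = 100000, size = 2, prob = p)   # 0/1/2 coding: copies of allele A
z <- g - 1                                    # -1/0/1 alternative coding

c(mean_z = mean(z), closed_form = 2 * p - 1)
c(var_g = var(g), var_z = var(z), closed_form = 2 * p * (1 - p))
```

Subtracting a constant from every observation cannot change the spread, so `var(g)` and `var(z)` are identical, and both should sit close to 2p(1-p).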

Examples for probabilities

Two variables: Genotype and Milk Yield (MY)

Genotype (G)     MY ≤ 100   100 < MY ≤ 300   MY > 300   Marginal Pr(G)
aa               0.10       0.04             0.02       0.16
Aa               0.14       0.18             0.16       0.48
AA               0.06       0.10             0.20       0.36
Marginal Pr(MY)  0.30       0.32             0.38       1.00

Joint Probability

The probability that two random variables take given values together.

  • What is the joint probability $Pr(G=aa, MY>300)$?

Marginal Probability

The probability of one variable alone, obtained by summing joint probabilities over a mutually exclusive and exhaustive set of events.

  • What is the marginal probability $Pr(G=Aa)$?

Conditional Probability

$$Pr(X=x \mid Y=y) = \frac{Pr(X=x, Y=y)}{Pr(Y=y)}$$

  • What is the conditional probability $Pr(MY \le 100 \mid G=Aa)$?
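
The three kinds of probability can be read off the genotype by milk-yield table with a few lines of R. The sketch below enters the joint probabilities from the table as a matrix. The object names and the column labels are illustrative choices.

```r
# Joint probabilities Pr(G, MY) taken from the table
joint <- matrix(
  c(0.10, 0.04, 0.02,
    0.14, 0.18, 0.16,
    0.06, 0.10, 0.20),
  nrow = 3, byrow = TRUE,
  dimnames = list(G  = c("aa", "Aa", "AA"),
                  MY = c("MY<=100", "100<MY<=300", "MY>300"))
)

pr_G  <- rowSums(joint)   # marginal Pr(G): sum each row over the MY classes
pr_MY <- colSums(joint)   # marginal Pr(MY): sum each column over genotypes

joint["aa", "MY>300"]                  # joint: Pr(G = aa, MY > 300)
pr_G["Aa"]                             # marginal: Pr(G = Aa)
joint["Aa", "MY<=100"] / pr_G["Aa"]    # conditional: Pr(MY <= 100 | G = Aa)
```

The conditional probability is the joint cell divided by its row total, following the definition above.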