class: center, middle, inverse, title-slide

# Statistical Foundations (2)
### Jinliang Yang
### Oct. 7th, 2020

---

# Expectation and Variance

### Expectation `\(E(X)\)`:

`\begin{align*}
E(f(X)) = \sum\limits_{i=1}^kf(x_i)Pr(X = x_i)
\end{align*}`

For example, if `\(X\)` takes the values 0, 1, and 2 with probabilities `\((1-p)^2\)`, `\(2p(1-p)\)`, and `\(p^2\)` (e.g., the number of copies of an allele with frequency `\(p\)` in a genotype):

`\begin{align*}
E[X] &= 0 \times (1 - p)^2 + 1 \times [2p(1-p)] + 2 \times p^2 = 2p \\
\end{align*}`

--

### Variance `\(Var(X)\)`:

`\begin{align*}
Var(X) &= E[X^2] - E[X]^2 \\
&= 2p(1-p) \\
\end{align*}`

---

# Probabilities

### Joint Probability

The probability that two random variables **occur together**.

In the `Milk Yield` example, what is the joint probability `\(Pr(G=aa, MY > 300)\)`?

--

### Marginal Probability

The sum over a **mutually exclusive** and **exhaustive** set of events.

What is the marginal probability `\(Pr(G=Aa)\)`, summing over all possible MY classes?

---

# Examples for probabilities

Two variables: Genotype and Milk Yield (MY)

| Genotype (G) | `\(MY \leq 100\)` | `\(100 < MY \leq 300\)` | `\(MY > 300\)` | Marginal `\(Pr(G)\)` |
| :-------: | ------- | ------- | ------- | ------- |
| aa | 0.10 | 0.04 | 0.02 | 0.16 |
| Aa | 0.14 | 0.18 | 0.16 | 0.48 |
| AA | 0.06 | 0.10 | 0.20 | 0.36 |
| Marg. Prob.| 0.30 | 0.32 | 0.38 | 1.00 |

--

### Conditional Probability

$$
Pr(X = x | Y = y) = \frac{Pr(X = x, Y = y)}{ Pr(Y = y)}
$$

--

- What is the conditional probability `\(Pr(MY \leq 100 | G=Aa)\)`?

---

# Genotype (X) and Milk Yield (MY)

| Genotype (G) | `\(MY \leq 100\)` | `\(100 < MY \leq 300\)` | `\(MY > 300\)` | Marginal `\(Pr(G)\)` |
| :-------: | ------- | ------- | ------- | ------- |
| aa | 0.10 | 0.04 | 0.02 | 0.16 |
| Aa | 0.14 | 0.18 | 0.16 | 0.48 |
| AA | 0.06 | 0.10 | 0.20 | 0.36 |
| Marg. Prob.| 0.30 | 0.32 | 0.38 | 1.00 |

--

### Statistical Independence

$$
`\begin{align*}
& Pr(X = x_i|Y = y_j) \\
& = Pr(X = x_i|Y = y_k) \\
& = Pr(X = x_i) \\
\end{align*}`
$$

--

`\begin{equation}
Pr(X = x_i, Y = y_j) = Pr(X = x_i) \times Pr(Y = y_j)
\end{equation}`

--

$$
`\begin{align*}
& Pr(MY > 300, X_{AA}) = 0.20 \\
& Pr(MY > 300) \times Pr(X_{AA}) = 0.38 \times 0.36 = 0.1368 \\
\end{align*}`
$$

Since `\(0.20 \neq 0.1368\)`, genotype and milk yield are **not** independent.

---

# Genotype (X) and Milk Yield (MY)

| Genotype (G) | `\(MY = 100\)` | `\(MY = 150\)` | `\(MY = 300\)` | Marginal `\(Pr(G)\)` |
| :-------: | ------- | ------- | ------- | ------- |
| aa | 0.10 | 0.04 | 0.02 | 0.16 |
| Aa | 0.14 | 0.18 | 0.16 | 0.48 |
| AA | 0.06 | 0.10 | 0.20 | 0.36 |
| Marg. Prob.| 0.30 | 0.32 | 0.38 | 1.00 |

What are the genotype effects, i.e., `\(E(MY | X_{AA})\)`, `\(E(MY | X_{aa})\)`, and `\(E(MY | X_{Aa})\)`?

--

### Conditional Expectation

The expectation (=mean) of variable `\(X\)` conditional on variable `\(Y=y\)` is:

`\begin{align*}
E(X|Y = y) & = \sum\limits_{i=1}^k x_i Pr(X = x_i | Y = y) \\
& = \sum\limits_{i=1}^k x_i \frac{Pr(X = x_i, Y = y)}{Pr(Y = y)} \\
\end{align*}`

---

# Genotype (X) and Milk Yield (MY)

| Genotype (G) | `\(MY = 100\)` | `\(MY = 150\)` | `\(MY = 300\)` | Marginal `\(Pr(G)\)` |
| :-------: | ------- | ------- | ------- | ------- |
| aa | 0.10 | 0.04 | 0.02 | 0.16 |
| Aa | 0.14 | 0.18 | 0.16 | 0.48 |
| AA | 0.06 | 0.10 | 0.20 | 0.36 |
| Marg. Prob.| 0.30 | 0.32 | 0.38 | 1.00 |

What are the genotype effects, i.e., `\(E(MY | X_{AA})\)`, `\(E(MY | X_{aa})\)`, and `\(E(MY | X_{Aa})\)`?
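These genotype effects can be cross-checked in R before working them out by hand. The sketch below is not part of the original slides: the matrix `jp` simply transcribes the joint probabilities from the table above, and the object names `jp` and `my` are hypothetical.

```r
# Joint probabilities Pr(G, MY) transcribed from the table above
jp <- matrix(c(0.10, 0.04, 0.02,
               0.14, 0.18, 0.16,
               0.06, 0.10, 0.20),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("aa", "Aa", "AA"), c("100", "150", "300")))
my <- c(100, 150, 300)
# E(MY | G) = sum_i MY_i * Pr(MY = MY_i, G) / Pr(G), computed for all genotypes at once
rowSums(sweep(jp, 2, my, "*")) / rowSums(jp)
```

This should return 137.5 for `aa`, about 185.4 for `Aa`, and 225 for `AA`, matching the hand calculations that follow.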
--

$$
`\begin{align*}
E(MY| X_{AA}) & = \sum\limits_{i=1}^3 MY_i Pr(MY = MY_i | X = X_{AA}) \\
& = 100 \times 0.06/0.36 + 150 \times 0.10/0.36 + 300 \times 0.20/0.36 = 81/0.36 = 225
\end{align*}`
$$

--

$$
`\begin{align*}
E(MY| X_{aa}) & = \sum\limits_{i=1}^3 MY_i Pr(MY = MY_i | X = X_{aa}) \\
& = 100 \times 0.10/0.16 + 150 \times 0.04/0.16 + 300 \times 0.02/0.16 = 22/0.16 = 137.5
\end{align*}`
$$

$$
`\begin{align*}
E(MY| X_{Aa}) & = \sum\limits_{i=1}^3 MY_i Pr(MY = MY_i | X = X_{Aa}) \\
& = 100 \times 0.14/0.48 + 150 \times 0.18/0.48 + 300 \times 0.16/0.48 = 89/0.48 = 185.4
\end{align*}`
$$

---

# Covariance

Covariance quantifies the extent to which two variables **co-vary**.

$$
`\begin{aligned}
Cov(X, Y) & = E([X - E(X)][Y - E(Y)]) \\
& = E(XY) - E(X)E(Y) \\
\end{aligned}`
$$

where,

$$
`\begin{aligned}
E(XY) = \sum_i \sum_j x_i y_j Pr(X = x_i, Y = y_j)
\end{aligned}`
$$

---

# A plant example

The number of florets per spikelet:

`\begin{align*}
Y_{i} = \sum\limits_{j=1}^{m} X_{ij} \alpha_{j} + E_i = G_i + E_i
\end{align*}`

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Genotype </th>
   <th style="text-align:right;"> P </th>
   <th style="text-align:right;"> G </th>
   <th style="text-align:right;"> E </th>
   <th style="text-align:right;"> Prob </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> t/t </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.8 </td>
   <td style="text-align:right;"> 0.2 </td>
   <td style="text-align:right;"> 0.20 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> t/t </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 2.8 </td>
   <td style="text-align:right;"> -0.8 </td>
   <td style="text-align:right;"> 0.05 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/t </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.6 </td>
   <td style="text-align:right;"> 0.4 </td>
   <td style="text-align:right;"> 0.30 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/t </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 2.6 </td>
   <td style="text-align:right;"> -0.6 </td>
   <td style="text-align:right;"> 0.20 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/T </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.2 </td>
   <td style="text-align:right;"> 0.8 </td>
   <td style="text-align:right;"> 0.05 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/T </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 2.2 </td>
   <td style="text-align:right;"> -0.2 </td>
   <td style="text-align:right;"> 0.20 </td>
  </tr>
</tbody>
</table>

--

### What is the covariance between G and P?

$$
`\begin{aligned}
Cov(G,P) &= E(GP) - E(G)E(P) \\
\end{aligned}`
$$

---

### What is the covariance between G and P?
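As a quick aside (a sketch that is not part of the original slides; the data frame `dd` below simply re-creates the table from the previous slide), the definitional form `\(E([G - E(G)][P - E(P)])\)` and the shortcut form `\(E(GP) - E(G)E(P)\)` give the same number:

```r
# Re-create the plant-example probabilities (hypothetical object name `dd`)
dd <- data.frame(P    = c(3, 2, 3, 2, 3, 2),
                 G    = c(2.8, 2.8, 2.6, 2.6, 2.2, 2.2),
                 Prob = c(0.20, 0.05, 0.30, 0.20, 0.05, 0.20))
EG <- sum(dd$G * dd$Prob)   # E(G)
EP <- sum(dd$P * dd$Prob)   # E(P)
sum((dd$G - EG) * (dd$P - EP) * dd$Prob)   # E([G - E(G)][P - E(P)])
sum(dd$G * dd$P * dd$Prob) - EG * EP       # E(GP) - E(G)E(P)
```

Both lines should return 0.0475, the value derived step by step below.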
$$
`\begin{aligned}
Cov(G,P) &= E(GP) - E(G)E(P) \\
\end{aligned}`
$$

```r
library(knitr) # for kable(); typically loaded in a setup chunk
dt <- data.frame(Genotype=c("t/t", "t/t", "T/t", "T/t", "T/T", "T/T"),
                 P=c(3, 2, 3, 2, 3, 2),
                 G=c(2.8, 2.8, 2.6, 2.6, 2.2, 2.2),
                 E=c(0.2, -0.8, 0.4, -0.6, 0.8, -0.2),
                 Prob=c(0.20, 0.05, 0.30, 0.20, 0.05, 0.20))
kable(dt) # print the table
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Genotype </th>
   <th style="text-align:right;"> P </th>
   <th style="text-align:right;"> G </th>
   <th style="text-align:right;"> E </th>
   <th style="text-align:right;"> Prob </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> t/t </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.8 </td>
   <td style="text-align:right;"> 0.2 </td>
   <td style="text-align:right;"> 0.20 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> t/t </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 2.8 </td>
   <td style="text-align:right;"> -0.8 </td>
   <td style="text-align:right;"> 0.05 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/t </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.6 </td>
   <td style="text-align:right;"> 0.4 </td>
   <td style="text-align:right;"> 0.30 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/t </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 2.6 </td>
   <td style="text-align:right;"> -0.6 </td>
   <td style="text-align:right;"> 0.20 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/T </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.2 </td>
   <td style="text-align:right;"> 0.8 </td>
   <td style="text-align:right;"> 0.05 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/T </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 2.2 </td>
   <td style="text-align:right;"> -0.2 </td>
   <td style="text-align:right;"> 0.20 </td>
  </tr>
</tbody>
</table>

---

### What is the covariance between G and P?

$$
`\begin{aligned}
Cov(G,P) &= E(GP) - E(G)E(P) \\
\end{aligned}`
$$

```r
dt <- data.frame(Genotype=c("t/t", "t/t", "T/t", "T/t", "T/T", "T/T"),
                 P=c(3, 2, 3, 2, 3, 2),
                 G=c(2.8, 2.8, 2.6, 2.6, 2.2, 2.2),
                 E=c(0.2, -0.8, 0.4, -0.6, 0.8, -0.2),
                 Prob=c(0.20, 0.05, 0.30, 0.20, 0.05, 0.20))
*kable(dt[1,]) # only print the first row to save space
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Genotype </th>
   <th style="text-align:right;"> P </th>
   <th style="text-align:right;"> G </th>
   <th style="text-align:right;"> E </th>
   <th style="text-align:right;"> Prob </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> t/t </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.8 </td>
   <td style="text-align:right;"> 0.2 </td>
   <td style="text-align:right;"> 0.2 </td>
  </tr>
</tbody>
</table>

--

```r
*sum(with(dt, G*P*Prob)) # E(GP)
```

```
## [1] 6.55
```

```r
*sum(with(dt, G*Prob)) # E(G)
```

```
## [1] 2.55
```

```r
*sum(with(dt, P*Prob)) # E(P)
```

```
## [1] 2.55
```

---

### What is the covariance between G and P?

$$
`\begin{aligned}
Cov(G,P) &= E(GP) - E(G)E(P) \\
&= 6.55 - (2.55)^2 = 0.0475 \\
\end{aligned}`
$$

--

Similarly, to calculate `\(Cov(G, E) = E(GE) - E(G)E(E)\)`:

```r
*sum(dt$G * dt$E * dt$Prob) # E(GE)
```

```
## [1] -5.551115e-17
```

(The value `-5.551115e-17` is zero up to floating-point rounding.)

```r
*sum(dt$E * dt$Prob) # E(E)
```

```
## [1] 0
```

$$
`\begin{aligned}
Cov(G, E) &= E(GE) - E(G)E(E) \\
&= 0 - (2.55) \times 0 = 0 \\
\end{aligned}`
$$

---

# Correlation

Correlation measures the strength of the mutual (linear) relationship between two variables.
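In R, this probability-weighted correlation can be computed directly from the moments used above. This is a minimal sketch, not part of the original slides: the helper functions `Ew`, `Varw`, and `Covw` are hypothetical names, and `dt` is the data frame defined on the covariance slides.

```r
Ew   <- function(x, p) sum(x * p)                             # E(X)
Varw <- function(x, p) Ew(x^2, p) - Ew(x, p)^2                # Var(X)
Covw <- function(x, y, p) Ew(x * y, p) - Ew(x, p) * Ew(y, p)  # Cov(X, Y)
# Correlation between G and P, weighting by the class probabilities
with(dt, Covw(G, P, Prob) / sqrt(Varw(G, Prob) * Varw(P, Prob)))
```

This should return about 0.438, matching the hand derivation on the next slides.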
$$
`\begin{align*}
r_{XY} & = Corr(X, Y) \\
& = \frac{Cov(X, Y)}{\sqrt{Var(X)Var(Y)} }\\
\end{align*}`
$$

--

### The correlation coefficient between G and P

$$
`\begin{align*}
r_{GP} & = \frac{Cov(G, P)}{\sqrt{Var(G)Var(P)} }\\
\end{align*}`
$$

---

$$
`\begin{align*}
r_{GP} & = \frac{Cov(G, P)}{\sqrt{Var(G)Var(P)} }\\
\end{align*}`
$$

`\(Cov(G, P) = 0.0475\)`

--

`\(Var(G) = E(G^2) - E(G)^2\)`

```r
sum(dt$G^2 * dt$Prob) - sum(dt$G * dt$Prob)^2
```

```
## [1] 0.0475
```

--

`\(Var(P) = E(P^2) - E(P)^2\)`

```r
sum(dt$P^2 * dt$Prob) - sum(dt$P * dt$Prob)^2
```

```
## [1] 0.2475
```

--

$$
`\begin{align*}
r_{GP} & = \frac{Cov(G, P)}{\sqrt{Var(G)Var(P)} }\\
& = \frac{0.0475}{\sqrt{0.0475 \times 0.2475}} \\
& = 0.438
\end{align*}`
$$

---

# Regression

The regression of `\(Y\)` on `\(X\)`:

$$
\hat{y} = E(Y|X)
$$

This is also called the **best predictor** of `\(Y\)` given `\(X\)`.

--

Regression can be used to define **a linear model**:

$$
y = \hat{y} + e
$$

where `\(e\)` is called the residual.

--

Another definition of the simple linear regression model:

$$
`\begin{aligned}
y = \bar{y} & + \beta_{YX}(x - \bar{x})+e \\
\text{with } & \bar{y} = E(Y) \\
& \beta_{YX} = \frac{Cov(Y, X)}{Var(X)} \\
\end{aligned}`
$$

---

# Predict G based on P

$$
`\begin{aligned}
G & = \bar{G} + \beta_{GP}(P - \bar{P}) + e \\
\beta_{GP} & = \frac{Cov(G, P)}{Var(P)} \\
\end{aligned}`
$$

--

$$
`\begin{aligned}
\beta_{GP} & = \frac{Cov(G, P)}{Var(P)} = 0.0475/0.2475 = 0.192 \\
\bar{P} & = E(P) = 2.55 \\
\bar{G} & = E(G) = 2.55 \\
\end{aligned}`
$$

--

$$
`\begin{aligned}
G & = \bar{G} + \beta_{GP}(P - \bar{P}) + e \\
\hat{G} & = \bar{G} + \beta_{GP}(P - \bar{P}) \\
& = 2.55 + 0.192(P-2.55) \\
\end{aligned}`
$$

---

# Prediction Model

$$
`\begin{aligned}
\hat{G} & = 2.55 + 0.192(P-2.55) \\
& = 2.0604 + 0.192 \times P
\end{aligned}`
$$

```r
plot(x=1, y=1, ylim=c(1, 5), xlim=c(2, 3), type="n", xlab="P", ylab="G")
# a, b : single values specifying the intercept and the slope of the line
abline(a=2.0604, b=0.192, lwd=3, col="red")
```

<img src="Ch0_c2_files/figure-html/unnamed-chunk-7-1.png" width="40%" style="display: block; margin: auto;" />

---

# Get predicted G ( `\(\hat{G}\)` )

Using the prediction model:

$$
`\begin{aligned}
\hat{G} = 2.0604 + 0.192 \times P
\end{aligned}`
$$

```r
dt$ghat <- 2.0604 + 0.192*dt$P # predicted G for each row
kable(dt)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Genotype </th>
   <th style="text-align:right;"> P </th>
   <th style="text-align:right;"> G </th>
   <th style="text-align:right;"> E </th>
   <th style="text-align:right;"> Prob </th>
   <th style="text-align:right;"> ghat </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> t/t </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.8 </td>
   <td style="text-align:right;"> 0.2 </td>
   <td style="text-align:right;"> 0.20 </td>
   <td style="text-align:right;"> 2.6364 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> t/t </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 2.8 </td>
   <td style="text-align:right;"> -0.8 </td>
   <td style="text-align:right;"> 0.05 </td>
   <td style="text-align:right;"> 2.4444 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/t </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.6 </td>
   <td style="text-align:right;"> 0.4 </td>
   <td style="text-align:right;"> 0.30 </td>
   <td style="text-align:right;"> 2.6364 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/t </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 2.6 </td>
   <td style="text-align:right;"> -0.6 </td>
   <td style="text-align:right;"> 0.20 </td>
   <td style="text-align:right;"> 2.4444 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/T </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.2 </td>
   <td style="text-align:right;"> 0.8 </td>
   <td style="text-align:right;"> 0.05 </td>
   <td style="text-align:right;"> 2.6364 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/T </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 2.2 </td>
   <td style="text-align:right;"> -0.2 </td>
   <td style="text-align:right;"> 0.20 </td>
   <td style="text-align:right;"> 2.4444 </td>
  </tr>
</tbody>
</table>
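As a cross-check (a sketch that is not part of the original slides), the same intercept and slope can be recovered with a probability-weighted least-squares fit in base R; the small difference from 2.0604 reflects the rounding of the slope to 0.192 above.

```r
# Probability-weighted least squares reproduces beta_GP = Cov(G, P)/Var(P)
fit <- lm(G ~ P, data = dt, weights = Prob)
coef(fit)  # intercept ~ 2.0606, slope ~ 0.1919
```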
---

# Accuracy of prediction

The accuracy of the prediction is equal to the **correlation of `\(\hat{y}\)` with its true value `\(y\)`**.

Because `\(\hat{y} = \bar{y} + \beta_{YX}(x - \bar{x})\)`, we have `\(Cov(\hat{y}, y) = \beta_{YX}Cov(x, y)\)` and `\(Var(\hat{y}) = \beta_{YX}^2 Var(x)\)`, so the accuracy can be derived as:

$$
`\begin{aligned}
r_{\hat{y}y} & = \frac{Cov(\hat{y}, y)}{\sqrt{Var(\hat{y})Var(y)}} \\
& = \frac{\beta_{YX} Cov(x, y)}{\sqrt{\beta_{YX}^2 Var(x) Var(y)}} \\
& = \frac{Cov(x, y)}{\sqrt{Var(x) Var(y)}} \\
& = r_{XY} \\
\end{aligned}`
$$

---

# Accuracy of prediction

### Method One:

$$
`\begin{aligned}
r_{\hat{G}G} & = \frac{Cov(\hat{G}, G)}{\sqrt{Var(\hat{G})Var(G)}} \\
\end{aligned}`
$$

```r
vg <- sum(dt$G^2 * dt$Prob) - sum(dt$G * dt$Prob)^2 # Var(G)
vghat <- sum(dt$ghat^2 * dt$Prob) - sum(dt$ghat * dt$Prob)^2 # Var(G-hat)
cov_g_ghat <- sum(dt$ghat * dt$G * dt$Prob) - sum(dt$G * dt$Prob) * sum(dt$ghat * dt$Prob) # Cov(G-hat, G)
r_ghat_g <- cov_g_ghat / sqrt(vg*vghat)
r_ghat_g
```

```
## [1] 0.4380858
```

---

# Accuracy of prediction

### Method Two:

$$
`\begin{aligned}
r_{GP} = \frac{Cov(G, P)}{\sqrt{Var(G) Var(P)}}
\end{aligned}`
$$

```r
vp <- sum(dt$P^2 * dt$Prob) - sum(dt$P * dt$Prob)^2 # Var(P)
cov_g_p <- sum(dt$P * dt$G * dt$Prob) - sum(dt$G * dt$Prob) * sum(dt$P * dt$Prob) # Cov(G, P)
r_g_p <- cov_g_p / sqrt(vg*vp)
r_g_p
```

```
## [1] 0.4380858
```

---

# Decomposition of variance

$$
Y = G + E
$$

The variance of `\(Y\)` can be decomposed into **explained** and **unexplained** components:

$$
`\begin{aligned}
Var(Y) = r_{XY}^2 Var(Y) + (1 - r_{XY}^2) Var(Y)
\end{aligned}`
$$

Applying this to the plant example, the phenotypic variance `\(Var(P)\)` splits into a component explained by the genotypic value and a residual component:

```r
vp
```

```
## [1] 0.2475
```

```r
r_g_p^2 * vp # explained variance
```

```
## [1] 0.0475
```

```r
(1 - r_g_p^2) * vp # unexplained variance
```

```
## [1] 0.2
```
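As a final cross-check (a short sketch, not part of the original slides; the object name `ve` is hypothetical), the decomposition agrees with the model `\(Y = G + E\)` term by term:

```r
# Variance of the environmental deviations
ve <- sum(dt$E^2 * dt$Prob) - sum(dt$E * dt$Prob)^2
ve         # Var(E) = 0.2
vg + ve    # equals vp = 0.2475, since Cov(G, E) = 0
```

Because `\(Cov(G, E) = 0\)`, the explained component above is exactly `\(Var(G) = 0.0475\)` and the unexplained component is exactly `\(Var(E) = 0.2\)`.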