class: center, middle, inverse, title-slide

# Statistical Foundations (2)
### Jinliang Yang
### Oct. 7th, 2020

---

# Expectation and Variance

### Expectation `\(E(X)\)`:

`\begin{align*}
E(f(X)) = \sum\limits_{i=1}^kf(x_i)Pr(X = x_i)
\end{align*}`

For example, if `\(X\)` takes the values 0, 1, and 2 with probabilities `\((1-p)^2\)`, `\(2p(1-p)\)`, and `\(p^2\)` (e.g., the number of copies of an allele with frequency `\(p\)` in a genotype):

`\begin{align*}
E[X] &= 0 \times (1 - p)^2 + 1 \times [2p(1-p)] + 2 \times p^2 = 2p \\
\end{align*}`

--

### Variance `\(Var(X)\)`:

`\begin{align*}
Var(X) &= E[X^2] - E[X]^2 \\
&= 2p(1-p) \\
\end{align*}`

---

# Probabilities

### Joint Probability

The probability that two random variables **occur together**.

In the `Milk Yield` example, what is the joint probability `\(Pr(G=aa, MY > 300)\)`?

--

### Marginal Probability

The sum over a **mutually exclusive** and **exhaustive** set of events.

What is the marginal probability `\(Pr(G=Aa)\)`, summing over all possible MY classes?

---

# Examples for probabilities

Two variables: Genotype and Milk Yield (MY)

| Genotype (G) | `\(MY \leq 100\)` | `\(100 < MY \leq 300\)` | `\(MY > 300\)` | Marginal `\(Pr(G)\)` |
| :-------: | ------- | ------- | ------- | ------- |
| aa | 0.10 | 0.04 | 0.02 | 0.16 |
| Aa | 0.14 | 0.18 | 0.16 | 0.48 |
| AA | 0.06 | 0.10 | 0.20 | 0.36 |
| Marg. Prob.| 0.30 | 0.32 | 0.38 | 1.00 |

--

### Conditional Probability

$$
Pr(X = x | Y = y) = \frac{Pr(X = x, Y = y)}{ Pr(Y = y)}
$$

--

- What is the conditional probability `\(Pr(MY \leq 100 | G=Aa)\)`?

---

# Genotype (X) and Milk Yield (MY)

| Genotype (G) | `\(MY \leq 100\)` | `\(100 < MY \leq 300\)` | `\(MY > 300\)` | Marginal `\(Pr(G)\)` |
| :-------: | ------- | ------- | ------- | ------- |
| aa | 0.10 | 0.04 | 0.02 | 0.16 |
| Aa | 0.14 | 0.18 | 0.16 | 0.48 |
| AA | 0.06 | 0.10 | 0.20 | 0.36 |
| Marg. Prob.| 0.30 | 0.32 | 0.38 | 1.00 |

--

### Statistical Independence

$$
`\begin{align*}
& Pr(X = x_i|Y = y_j) \\
& = Pr(X = x_i|Y = y_k) \\
& = Pr(X = x_i) \\
\end{align*}`
$$

--

`\begin{equation}
Pr(X = x_i, Y = y_j) = Pr(X = x_i) \times Pr(Y = y_j)
\end{equation}`

--

$$
`\begin{align*}
& Pr(MY > 300, X_{AA}) = 0.20 \\
& Pr(MY > 300) \times Pr(X_{AA}) = 0.38 \times 0.36 = 0.1368 \\
\end{align*}`
$$

Since `\(0.20 \neq 0.1368\)`, genotype and milk yield are **not** independent.

---

# Genotype (X) and Milk Yield (MY)

| Genotype (G) | `\(MY = 100\)` | `\(MY = 150\)` | `\(MY = 300\)` | Marginal `\(Pr(G)\)` |
| :-------: | ------- | ------- | ------- | ------- |
| aa | 0.10 | 0.04 | 0.02 | 0.16 |
| Aa | 0.14 | 0.18 | 0.16 | 0.48 |
| AA | 0.06 | 0.10 | 0.20 | 0.36 |
| Marg. Prob.| 0.30 | 0.32 | 0.38 | 1.00 |

What are the genotype effects, i.e., `\(E(MY | X_{AA})\)`, `\(E(MY | X_{aa})\)`, and `\(E(MY | X_{Aa})\)`?

--

### Conditional Expectation

The expectation (=mean) of variable `\(X\)` conditional on variable `\(Y=y\)` is:

`\begin{align*}
E(X|Y = y) & = \sum\limits_{i=1}^k x_i Pr(X = x_i | Y = y) \\
& = \sum\limits_{i=1}^k x_i \frac{Pr(X = x_i, Y = y)}{Pr(Y = y)} \\
\end{align*}`

---

# Genotype (X) and Milk Yield (MY)

| Genotype (G) | `\(MY = 100\)` | `\(MY = 150\)` | `\(MY = 300\)` | Marginal `\(Pr(G)\)` |
| :-------: | ------- | ------- | ------- | ------- |
| aa | 0.10 | 0.04 | 0.02 | 0.16 |
| Aa | 0.14 | 0.18 | 0.16 | 0.48 |
| AA | 0.06 | 0.10 | 0.20 | 0.36 |
| Marg. Prob.| 0.30 | 0.32 | 0.38 | 1.00 |

What are the genotype effects, i.e., `\(E(MY | X_{AA})\)`, `\(E(MY | X_{aa})\)`, and `\(E(MY | X_{Aa})\)`?
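These genotype effects can be cross-checked in R before working them out by hand. The sketch below is not part of the original slides: the matrix `jp` simply transcribes the joint probabilities from the table above, and the object names `jp` and `my` are hypothetical.

```r
# Joint probabilities Pr(G, MY) transcribed from the table above
jp <- matrix(c(0.10, 0.04, 0.02,
               0.14, 0.18, 0.16,
               0.06, 0.10, 0.20),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("aa", "Aa", "AA"), c("100", "150", "300")))
my <- c(100, 150, 300)
# E(MY | G) = sum_i MY_i * Pr(MY = MY_i, G) / Pr(G), computed for all genotypes at once
rowSums(sweep(jp, 2, my, "*")) / rowSums(jp)
```

This should return 137.5 for `aa`, about 185.4 for `Aa`, and 225 for `AA`, matching the hand calculations that follow.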
--

$$
`\begin{align*}
E(MY| X_{AA}) & = \sum\limits_{i=1}^3 MY_i Pr(MY = MY_i | X = X_{AA}) \\
& = 100 \times 0.06/0.36 + 150 \times 0.10/0.36 + 300 \times 0.20/0.36 = 81/0.36 = 225
\end{align*}`
$$

--

$$
`\begin{align*}
E(MY| X_{aa}) & = \sum\limits_{i=1}^3 MY_i Pr(MY = MY_i | X = X_{aa}) \\
& = 100 \times 0.10/0.16 + 150 \times 0.04/0.16 + 300 \times 0.02/0.16 = 22/0.16 = 137.5
\end{align*}`
$$

$$
`\begin{align*}
E(MY| X_{Aa}) & = \sum\limits_{i=1}^3 MY_i Pr(MY = MY_i | X = X_{Aa}) \\
& = 100 \times 0.14/0.48 + 150 \times 0.18/0.48 + 300 \times 0.16/0.48 = 89/0.48 = 185.4
\end{align*}`
$$

---

# Covariance

Covariance quantifies the extent to which two variables **co-vary**.

$$
`\begin{aligned}
Cov(X, Y) & = E([X - E(X)][Y - E(Y)]) \\
& = E(XY) - E(X)E(Y) \\
\end{aligned}`
$$

where,

$$
`\begin{aligned}
E(XY) = \sum_i \sum_j x_i y_j Pr(X = x_i, Y = y_j)
\end{aligned}`
$$

---

# A plant example

The number of florets per spikelet:

`\begin{align*}
Y_{i} = \sum\limits_{j=1}^{m} X_{ij} \alpha_{j} + E_i = G_i + E_i
\end{align*}`

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Genotype </th>
   <th style="text-align:right;"> P </th>
   <th style="text-align:right;"> G </th>
   <th style="text-align:right;"> E </th>
   <th style="text-align:right;"> Prob </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> t/t </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.8 </td>
   <td style="text-align:right;"> 0.2 </td>
   <td style="text-align:right;"> 0.20 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> t/t </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 2.8 </td>
   <td style="text-align:right;"> -0.8 </td>
   <td style="text-align:right;"> 0.05 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/t </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.6 </td>
   <td style="text-align:right;"> 0.4 </td>
   <td style="text-align:right;"> 0.30 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/t </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 2.6 </td>
   <td style="text-align:right;"> -0.6 </td>
   <td style="text-align:right;"> 0.20 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/T </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.2 </td>
   <td style="text-align:right;"> 0.8 </td>
   <td style="text-align:right;"> 0.05 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/T </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 2.2 </td>
   <td style="text-align:right;"> -0.2 </td>
   <td style="text-align:right;"> 0.20 </td>
  </tr>
</tbody>
</table>

--

### What is the covariance between G and P?

$$
`\begin{aligned}
Cov(G,P) &= E(GP) - E(G)E(P) \\
\end{aligned}`
$$

---

### What is the covariance between G and P?
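As a quick aside (a sketch that is not part of the original slides; the data frame `dd` below simply re-creates the table from the previous slide), the definitional form `\(E([G - E(G)][P - E(P)])\)` and the shortcut form `\(E(GP) - E(G)E(P)\)` give the same number:

```r
# Re-create the plant-example probabilities (hypothetical object name `dd`)
dd <- data.frame(P    = c(3, 2, 3, 2, 3, 2),
                 G    = c(2.8, 2.8, 2.6, 2.6, 2.2, 2.2),
                 Prob = c(0.20, 0.05, 0.30, 0.20, 0.05, 0.20))
EG <- sum(dd$G * dd$Prob)   # E(G)
EP <- sum(dd$P * dd$Prob)   # E(P)
sum((dd$G - EG) * (dd$P - EP) * dd$Prob)   # E([G - E(G)][P - E(P)])
sum(dd$G * dd$P * dd$Prob) - EG * EP       # E(GP) - E(G)E(P)
```

Both lines should return 0.0475, the value derived step by step below.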
$$
`\begin{aligned}
Cov(G,P) &= E(GP) - E(G)E(P) \\
\end{aligned}`
$$

```r
library(knitr) # for kable(); typically loaded in a setup chunk
dt <- data.frame(Genotype=c("t/t", "t/t", "T/t", "T/t", "T/T", "T/T"),
                 P=c(3, 2, 3, 2, 3, 2),
                 G=c(2.8, 2.8, 2.6, 2.6, 2.2, 2.2),
                 E=c(0.2, -0.8, 0.4, -0.6, 0.8, -0.2),
                 Prob=c(0.20, 0.05, 0.30, 0.20, 0.05, 0.20))
kable(dt) # print the table
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Genotype </th>
   <th style="text-align:right;"> P </th>
   <th style="text-align:right;"> G </th>
   <th style="text-align:right;"> E </th>
   <th style="text-align:right;"> Prob </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> t/t </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.8 </td>
   <td style="text-align:right;"> 0.2 </td>
   <td style="text-align:right;"> 0.20 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> t/t </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 2.8 </td>
   <td style="text-align:right;"> -0.8 </td>
   <td style="text-align:right;"> 0.05 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/t </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.6 </td>
   <td style="text-align:right;"> 0.4 </td>
   <td style="text-align:right;"> 0.30 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/t </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 2.6 </td>
   <td style="text-align:right;"> -0.6 </td>
   <td style="text-align:right;"> 0.20 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/T </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.2 </td>
   <td style="text-align:right;"> 0.8 </td>
   <td style="text-align:right;"> 0.05 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/T </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 2.2 </td>
   <td style="text-align:right;"> -0.2 </td>
   <td style="text-align:right;"> 0.20 </td>
  </tr>
</tbody>
</table>

---

### What is the covariance between G and P?

$$
`\begin{aligned}
Cov(G,P) &= E(GP) - E(G)E(P) \\
\end{aligned}`
$$

```r
dt <- data.frame(Genotype=c("t/t", "t/t", "T/t", "T/t", "T/T", "T/T"),
                 P=c(3, 2, 3, 2, 3, 2),
                 G=c(2.8, 2.8, 2.6, 2.6, 2.2, 2.2),
                 E=c(0.2, -0.8, 0.4, -0.6, 0.8, -0.2),
                 Prob=c(0.20, 0.05, 0.30, 0.20, 0.05, 0.20))
*kable(dt[1,]) # only print the first row to save space
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Genotype </th>
   <th style="text-align:right;"> P </th>
   <th style="text-align:right;"> G </th>
   <th style="text-align:right;"> E </th>
   <th style="text-align:right;"> Prob </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> t/t </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.8 </td>
   <td style="text-align:right;"> 0.2 </td>
   <td style="text-align:right;"> 0.2 </td>
  </tr>
</tbody>
</table>

--

```r
*sum(with(dt, G*P*Prob)) # E(GP)
```

```
## [1] 6.55
```

```r
*sum(with(dt, G*Prob)) # E(G)
```

```
## [1] 2.55
```

```r
*sum(with(dt, P*Prob)) # E(P)
```

```
## [1] 2.55
```

---

### What is the covariance between G and P?

$$
`\begin{aligned}
Cov(G,P) &= E(GP) - E(G)E(P) \\
&= 6.55 - (2.55)^2 = 0.0475 \\
\end{aligned}`
$$

--

Similarly, to calculate `\(Cov(G, E) = E(GE) - E(G)E(E)\)`:

```r
*sum(dt$G * dt$E * dt$Prob) # E(GE)
```

```
## [1] -5.551115e-17
```

(The value `-5.551115e-17` is zero up to floating-point rounding.)

```r
*sum(dt$E * dt$Prob) # E(E)
```

```
## [1] 0
```

$$
`\begin{aligned}
Cov(G, E) &= E(GE) - E(G)E(E) \\
&= 0 - (2.55) \times 0 = 0 \\
\end{aligned}`
$$

---

# Correlation

Correlation measures the strength of the mutual (linear) relationship between two variables.
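In R, this probability-weighted correlation can be computed directly from the moments used above. This is a minimal sketch, not part of the original slides: the helper functions `Ew`, `Varw`, and `Covw` are hypothetical names, and `dt` is the data frame defined on the covariance slides.

```r
Ew   <- function(x, p) sum(x * p)                             # E(X)
Varw <- function(x, p) Ew(x^2, p) - Ew(x, p)^2                # Var(X)
Covw <- function(x, y, p) Ew(x * y, p) - Ew(x, p) * Ew(y, p)  # Cov(X, Y)
# Correlation between G and P, weighting by the class probabilities
with(dt, Covw(G, P, Prob) / sqrt(Varw(G, Prob) * Varw(P, Prob)))
```

This should return about 0.438, matching the hand derivation on the next slides.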
$$
`\begin{align*}
r_{XY} & = Corr(X, Y) \\
& = \frac{Cov(X, Y)}{\sqrt{Var(X)Var(Y)} }\\
\end{align*}`
$$

--

### The correlation coefficient between G and P

$$
`\begin{align*}
r_{GP} & = \frac{Cov(G, P)}{\sqrt{Var(G)Var(P)} }\\
\end{align*}`
$$

---

$$
`\begin{align*}
r_{GP} & = \frac{Cov(G, P)}{\sqrt{Var(G)Var(P)} }\\
\end{align*}`
$$

`\(Cov(G, P) = 0.0475\)`

--

`\(Var(G) = E(G^2) - E(G)^2\)`

```r
sum(dt$G^2 * dt$Prob) - sum(dt$G * dt$Prob)^2
```

```
## [1] 0.0475
```

--

`\(Var(P) = E(P^2) - E(P)^2\)`

```r
sum(dt$P^2 * dt$Prob) - sum(dt$P * dt$Prob)^2
```

```
## [1] 0.2475
```

--

$$
`\begin{align*}
r_{GP} & = \frac{Cov(G, P)}{\sqrt{Var(G)Var(P)} }\\
& = \frac{0.0475}{\sqrt{0.0475 \times 0.2475}} \\
& = 0.438
\end{align*}`
$$

---

# Regression

The regression of `\(Y\)` on `\(X\)`:

$$
\hat{y} = E(Y|X)
$$

This is also called the **best predictor** of `\(Y\)` given `\(X\)`.

--

Regression can be used to define **a linear model**:

$$
y = \hat{y} + e
$$

where `\(e\)` is called the residual.

--

Another definition of the simple linear regression model:

$$
`\begin{aligned}
y = \bar{y} & + \beta_{YX}(x - \bar{x})+e \\
\text{with } & \bar{y} = E(Y) \\
& \beta_{YX} = \frac{Cov(Y, X)}{Var(X)} \\
\end{aligned}`
$$

---

# Predict G based on P

$$
`\begin{aligned}
G & = \bar{G} + \beta_{GP}(P - \bar{P}) + e \\
\beta_{GP} & = \frac{Cov(G, P)}{Var(P)} \\
\end{aligned}`
$$

--

$$
`\begin{aligned}
\beta_{GP} & = \frac{Cov(G, P)}{Var(P)} = 0.0475/0.2475 = 0.192 \\
\bar{P} & = E(P) = 2.55 \\
\bar{G} & = E(G) = 2.55 \\
\end{aligned}`
$$

--

$$
`\begin{aligned}
G & = \bar{G} + \beta_{GP}(P - \bar{P}) + e \\
\hat{G} & = \bar{G} + \beta_{GP}(P - \bar{P}) \\
& = 2.55 + 0.192(P-2.55) \\
\end{aligned}`
$$

---

# Prediction Model

$$
`\begin{aligned}
\hat{G} & = 2.55 + 0.192(P-2.55) \\
& = 2.0604 + 0.192 \times P
\end{aligned}`
$$

```r
plot(x=1, y=1, ylim=c(1, 5), xlim=c(2, 3), type="n", xlab="P", ylab="G")
# a, b : single values specifying the intercept and the slope of the line
abline(a=2.0604, b=0.192, lwd=3, col="red")
```

<img src="Ch0_c2_files/figure-html/unnamed-chunk-7-1.png" width="40%" style="display: block; margin: auto;" />

---

# Get predicted G ( `\(\hat{G}\)` )

Using the prediction model:

$$
`\begin{aligned}
\hat{G} = 2.0604 + 0.192 \times P
\end{aligned}`
$$

```r
dt$ghat <- 2.0604 + 0.192*dt$P # predicted G for each row
kable(dt)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Genotype </th>
   <th style="text-align:right;"> P </th>
   <th style="text-align:right;"> G </th>
   <th style="text-align:right;"> E </th>
   <th style="text-align:right;"> Prob </th>
   <th style="text-align:right;"> ghat </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> t/t </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.8 </td>
   <td style="text-align:right;"> 0.2 </td>
   <td style="text-align:right;"> 0.20 </td>
   <td style="text-align:right;"> 2.6364 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> t/t </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 2.8 </td>
   <td style="text-align:right;"> -0.8 </td>
   <td style="text-align:right;"> 0.05 </td>
   <td style="text-align:right;"> 2.4444 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/t </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.6 </td>
   <td style="text-align:right;"> 0.4 </td>
   <td style="text-align:right;"> 0.30 </td>
   <td style="text-align:right;"> 2.6364 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/t </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 2.6 </td>
   <td style="text-align:right;"> -0.6 </td>
   <td style="text-align:right;"> 0.20 </td>
   <td style="text-align:right;"> 2.4444 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/T </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.2 </td>
   <td style="text-align:right;"> 0.8 </td>
   <td style="text-align:right;"> 0.05 </td>
   <td style="text-align:right;"> 2.6364 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> T/T </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 2.2 </td>
   <td style="text-align:right;"> -0.2 </td>
   <td style="text-align:right;"> 0.20 </td>
   <td style="text-align:right;"> 2.4444 </td>
  </tr>
</tbody>
</table>
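As a cross-check (a sketch that is not part of the original slides), the same intercept and slope can be recovered with a probability-weighted least-squares fit in base R; the small difference from 2.0604 reflects the rounding of the slope to 0.192 above.

```r
# Probability-weighted least squares reproduces beta_GP = Cov(G, P)/Var(P)
fit <- lm(G ~ P, data = dt, weights = Prob)
coef(fit)  # intercept ~ 2.0606, slope ~ 0.1919
```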
---

# Accuracy of prediction

The accuracy of the prediction is equal to the **correlation of `\(\hat{y}\)` with its true value `\(y\)`**.

Because `\(\hat{y} = \bar{y} + \beta_{YX}(x - \bar{x})\)`, we have `\(Cov(\hat{y}, y) = \beta_{YX}Cov(x, y)\)` and `\(Var(\hat{y}) = \beta_{YX}^2 Var(x)\)`, so the accuracy can be derived as:

$$
`\begin{aligned}
r_{\hat{y}y} & = \frac{Cov(\hat{y}, y)}{\sqrt{Var(\hat{y})Var(y)}} \\
& = \frac{\beta_{YX} Cov(x, y)}{\sqrt{\beta_{YX}^2 Var(x) Var(y)}} \\
& = \frac{Cov(x, y)}{\sqrt{Var(x) Var(y)}} \\
& = r_{XY} \\
\end{aligned}`
$$

---

# Accuracy of prediction

### Method One:

$$
`\begin{aligned}
r_{\hat{G}G} & = \frac{Cov(\hat{G}, G)}{\sqrt{Var(\hat{G})Var(G)}} \\
\end{aligned}`
$$

```r
vg <- sum(dt$G^2 * dt$Prob) - sum(dt$G * dt$Prob)^2 # Var(G)
vghat <- sum(dt$ghat^2 * dt$Prob) - sum(dt$ghat * dt$Prob)^2 # Var(G-hat)
cov_g_ghat <- sum(dt$ghat * dt$G * dt$Prob) - sum(dt$G * dt$Prob) * sum(dt$ghat * dt$Prob) # Cov(G-hat, G)
r_ghat_g <- cov_g_ghat / sqrt(vg*vghat)
r_ghat_g
```

```
## [1] 0.4380858
```

---

# Accuracy of prediction

### Method Two:

$$
`\begin{aligned}
r_{GP} = \frac{Cov(G, P)}{\sqrt{Var(G) Var(P)}}
\end{aligned}`
$$

```r
vp <- sum(dt$P^2 * dt$Prob) - sum(dt$P * dt$Prob)^2 # Var(P)
cov_g_p <- sum(dt$P * dt$G * dt$Prob) - sum(dt$G * dt$Prob) * sum(dt$P * dt$Prob) # Cov(G, P)
r_g_p <- cov_g_p / sqrt(vg*vp)
r_g_p
```

```
## [1] 0.4380858
```

---

# Decomposition of variance

$$
Y = G + E
$$

The variance of `\(Y\)` can be decomposed into **explained** and **unexplained** components:

$$
`\begin{aligned}
Var(Y) = r_{XY}^2 Var(Y) + (1 - r_{XY}^2) Var(Y)
\end{aligned}`
$$

Applying this to the plant example, the phenotypic variance `\(Var(P)\)` splits into a component explained by the genotypic value and a residual component:

```r
vp
```

```
## [1] 0.2475
```

```r
r_g_p^2 * vp # explained variance
```

```
## [1] 0.0475
```

```r
(1 - r_g_p^2) * vp # unexplained variance
```

```
## [1] 0.2
```
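As a final cross-check (a short sketch, not part of the original slides; the object name `ve` is hypothetical), the decomposition agrees with the model `\(Y = G + E\)` term by term:

```r
# Variance of the environmental deviations
ve <- sum(dt$E^2 * dt$Prob) - sum(dt$E * dt$Prob)^2
ve         # Var(E) = 0.2
vg + ve    # equals vp = 0.2475, since Cov(G, E) = 0
```

Because `\(Cov(G, E) = 0\)`, the explained component above is exactly `\(Var(G) = 0.0475\)` and the unexplained component is exactly `\(Var(E) = 0.2\)`.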