2016-11-20
遺伝子検査を伝えるための遺伝統計学基礎

遺伝子検査メンデル遺伝 Cancer Syndrome 複合遺伝性疾患発現プロファイル
こちらの続き
遺伝子検査を実臨床に用いること・リスク情報提供することを主眼において、そのような検査の妥当性がどのようにして得られるのか、その妥当性に照らしてどのように検査結果を解釈するのか、ということを理解するための統計学的背景の理解を目指す
それをよく理解し、今後の分野の進展にもついていくのに有用なスキルの一つはシミュレーションして統計解析をしてみる、というものと思うので、講義資料は、「自然言語」と「コンピュータ言語(R) 」のバイリンガルにする
以下のRmdファイルは、「自然言語」の部分を英語のみにしたドラフト版。最終的には、「自然言語(日英)」＋「コンピュータ言語」として完成する予定
ドラフトだけれど、kindleに登録
３年で一巡する感じで提供している「統計遺伝学」。2016年度は概念を中心としていますが、2017年度は、もっとも基礎的な内容に戻ります。
- メンデル遺伝
- (BRCA1/2のような)Cancer syndromes
- (SNPを中心とした)複合遺伝性疾患
- 発現プロファイルでの疾患サブタイプ分け
の４タイプはそれぞれ以下のような点に留意して実施予定です。
- 古典的な遺伝子診断とNGS時代の遺伝子診断
- 予防的乳腺摘出など遺伝情報活用が「強烈な」決断を迫る
- 23&me等のDTC(Direct-To-Consumer)遺伝子キットの結果解釈
- バイオマーカー選定とその臨床活用のためのバリデーション
４タイプのうち３タイプはすでに実臨床に導入されていますし、残りの１タイプもDTC的には現実的な問題です。
これらについて一通りの統計遺伝学的な背景を学びます。
「三年で一巡のうちの、一番簡単」な内容となりますが、Rのコードを実行して、条件設定をし直してシミュレーションする、それを宿題として実施する、などの部分は、徒手空拳では難しいですが、ティーチングアシスタントなどもつけますので、ご期待ください。
StatGenet2017Text: Use of Genetic Data in Clinics (English Edition)
作者: ryamada
発売日: 2016/11/27
メディア: Kindle版
この商品を含むブログ (1件) を見る
---
title: "StatGenet2017"
author: "ryamada"
date: "Tuesday, November 15, 2016"
output: 
  html_document:
    toc: true
    toc_depth: 6
    number_section: true
---

# Mendelian traits

## Pedigree

Are you new to "R"? -> Go through "Basics of R".
In Japanese http://www.slideshare.net/m884/japan-r-15432969 
In English http://www.cyclismo.org/tutorial/R/ 

Draw a trio of parents and a male offspring. Mother and the child are affected.

ID

1 : father

2 : mother

3 : male offspring

sex

1 : male

2 : female

affected

0 : non-affected

1 : affected

```{r}
#install.packages("kinship2")
library(kinship2)
id  <- c(1, 2, 3)
mom <- c(0, 0, 2)
dad <- c(0, 0, 1)
sex <- c(1, 2, 1)
affected <- c(0,1,1)

ptemp<-pedigree(id=id,dadid=dad,momid=mom,sex=sex,affected=affected)
plot(ptemp)
```

```{}
[Exercise]
Make vectors of id, mom, dad, sex for a family with 10 or more members and 3 or more generations.
Draw a pedigree manually on a sheet of paper, then make vectors.

ex. 
id <- 1:11
mom <- c(0,0,2,2,0,0,5,5,6,6,6)
dad <- c(0,0,1,1,0,0,4,4,7,7,7)
sex <- c(1,2,1,1,2,2,1,2,2,1,2)
affected <- sample(0:1,11,replace=TRUE)
print(affected)
ptemp<-pedigree(id=id,dadid=dad,momid=mom,sex=sex,affected=affected)

plot(ptemp)
```

## Genotypes and phenotypes of mendelian traits

### Genotypes
"00", "01", and "11".

0 : "00"

0.5 : "01"

1 : "11"

Assign genotypes to the pedigree memebers.
You can assign genotypes to the members without paretns.
You can assign genotypes to the members with parents with restriction.

```{r}
gen <- c(1,0,0.5)
```

### Phenotypes (non-affected/affected)

"0" and "1".

### Genetic models

Dominant model means

$$
phenotype \ = 0 \ when \ genotype \ = 0 \\
phenotype \ = 1 \ when \ genotype \ = \ 0.5 \ or \  1
$$

```{r}
my.dom <- function(g){
  ph <- rep(0,length(g))
  for(i in 1:length(g)){
    if(g[i]>0){
      ph[i] <- 1
    }
  }
  return(ph)
}
my.dom(gen)
```
```{r}
affected <- my.dom(gen)
ptemp<-pedigree(id=id,dadid=dad,momid=mom,sex=sex,affected=affected)
plot(ptemp)
```

```{}
[Exercise]
Make a function, my.rec(), that returns phenotype in recessive model.
```

```{r, echo=FALSE}
my.rec <- function(g){
  ph <- rep(0,length(g))
  for(i in 1:length(g)){
    if(g[i]==1){
      ph[i] <- 1
    }
  }
  return(ph)
}
```

```{}
[Exercise]
Assigne genotypes (0,0.5,1) to the members of your pedigree with 10 or more members and 3 or more generations. Then calculate their phenotype in dominant and recessive models and draw pedigrees, respectively.

eg.
id <- 1:11
mom <- c(0,0,2,2,0,0,5,5,6,6,6)
dad <- c(0,0,1,1,0,0,4,4,7,7,7)
sex <- c(1,2,1,1,2,2,1,2,2,1,2)
gen <- c(0,1,0.5,0.5,1,0,1,0.5,0.5,0.5,0.5)
ph.dom <- my.dom(gen)
affected <- ph.dom
ptemp<-pedigree(id=id,dadid=dad,momid=mom,sex=sex,affected=affected)
plot(ptemp)

ph.rec <- my.rec(gen)
affected <- ph.rec
ptemp<-pedigree(id=id,dadid=dad,momid=mom,sex=sex,affected=affected)
plot(ptemp)
```

### Penetrance and phenocopy

Penetrance is the probability that phenotype "1" appears for people with corresponding genotype(s).

Probability to develop phenotype "1" is 0, penetrance, and penetrance for three genotypes 0, 0.5 and 1, respectively, when dominant.


```{r}
my.dom2 <- function(g,penet){
  ph <- rep(0,length(g))
  prob <- c(0,penet,penet)
  for(i in 1:length(g)){
    if(runif(1)<= prob[g[i]]){
      ph[i] <- 1
    }
  }
  return(ph)
}
```

```{}
[Exercise]
Answer probability to develope phenotype "1" for recessive model with penetrance.

Make a function, my.rec2(), for the case.
```

```{r, echo=FALSE}
my.rec2 <- function(g,penet){
  ph <- rep(0,length(g))
  prob <- c(0,0,penet)
  for(i in 1:length(g)){
    if(runif(1)<= prob[g[i]]){
      ph[i] <- 1
    }
  }
  return(ph)
}
```

Phenocopy is the probablity that phenotype "1" appears for people without corresponding genotype(s).

Probability to develop phenotype "1"is, phenocopy, 1, and 1 for three genetypes 0, 0.5 and 1, respectively, when dominant.

```{}
[Exercise]
Make a function, my.dom3(), that returns phenotype for dominant model with phenocopy.

Make a function, my.rec3(), that returns phenotype for recessive model with phenocopy.
```

```{r, echo=FALSE}
my.dom3 <- function(g,pheno){
  ph <- rep(0,length(g))
  prob <- c(pheno,1,1)
  for(i in 1:length(g)){
    if(runif(1)<= prob[g[i]]){
      ph[i] <- 1
    }
  }
  return(ph)  
}
my.rec3 <- function(g,pheno){
  ph <- rep(0,length(g))
  prob <- c(pheno,pheno,1)
  for(i in 1:length(g)){
    if(runif(1)<= prob[g[i]]){
      ph[i] <- 1
    }
  }
  return(ph)  
}
```
Phenotype developing probability vector for dominant model is

(pheno, penet, penet). The vector for recessive model is (pheno, pheno, penet).

```{}
[Exercise]

Make a function, my.dom4(), for dominant model reflecting phenocopy and penetrance.

Make a function, my.rec4(), for recessive model.
```

```{r,echo=FALSE}
my.dom4 <- function(g,penet,pheno){
  ph <- rep(0,length(g))
  prob <- c(pheno,penet,penet)
  for(i in 1:length(g)){
    if(runif(1)<= prob[g[i]]){
      ph[i] <- 1
    }
  }
  return(ph)  
}
my.rec4 <- function(g,penet,pheno){
  ph <- rep(0,length(g))
  prob <- c(pheno,pheno,penet)
  for(i in 1:length(g)){
    if(runif(1)<= prob[g[i]]){
      ph[i] <- 1
    }
  }
  return(ph)  
}
```

### Genotype-specific probability to develop phenotype "1"

Dominant and recessive models with penetrance and phenocopy specify phenotype developing probability vector.

(pheno, penet, penet)

(pheno, pheno, penet),

respectively.

We can make a function that takes the probability vecrot and returns phenotype.

```{r}
my.pheno <- function(g,prob){
  ph <- rep(0,length(g))
  #prob <- c(pheno,pheno,penet)
  for(i in 1:length(g)){
    if(runif(1)<= prob[g[i]]){
      ph[i] <- 1
    }
  }
  return(ph)    
}
```

## NGS and Disease-responsible mutations

A review "Identifying Highly Penetrant Disease Causal Mutations Using Next Generation Sequencing: Guide to Whole Process".

https://www.hindawi.com/journals/bmri/2015/923491/

### Data handling for variant calls of NGS

If you have no background on next generation sequencing technology, read a review on it. 

Sequencing

Mapping

Variant calling

Filtering

Ranking



### Variant calling ~ Coverage depth and variant detection

When reading a heterozygous locus with depth x50, the number of reads of one allele distributes in binomial distribution.

```{r}
depth <- 50
n.allele <- 0:depth
ch <- dbinom(n.allele,depth,prob=c(0.5,0.5))
plot(n.allele,ch,type="h")
```

NGS produce errors.
Therefore some reads at a homozygous locus may look heterozygous-like.

```{r}
error.p <- 0.05

err <- dbinom(n.allele,depth,prob=error.p)
plot(n.allele,err,type="h")
```

Overlap the plots.

```{r}
plot(n.allele,err,type="h",col=2)
points(n.allele+0.2,ch,type="h")
```

### Distinguish heterozygous signals and error signals

#### Likelihood and likelihood ratio
If your result is 25 vs. 25, likelihoods of both assumptions are:

```{r}
k <- 25
ch[which(n.allele==k)]
err[which(n.allele==k)]
```

Likelihood ratio is:
```{r}
ch[which(n.allele==k)]/err[which(n.allele==k)]
```

Log-likelihood ratio is :

```{r}
log10(ch[which(n.allele==k)]/err[which(n.allele==k)])
```

```{}
[Exercise]
Calculate likelihood ratio and log-likelihood ratio for observations 0 to 50 and plot them.
```

```{r, echo=FALSE}
LL <- ch/err
plot(n.allele,LL)
logLL <- log10(LL)
plot(n.allele,logLL)
abline(h=c(-2,2),col=2)
```

When n.allele is less than a cut-off value 1, the locus might be called "homozygous" and more than a cut-off value 2, "heterozygous", otherwise uncertain.

#### Prior 1 ~ Majority loci are homozygous

If we have a good reason to believe that the locus can be heterogygous and homozygous evenly, then the prior probability of both assumptions are 0.5, respectively.

Throughout the genome, heterozygous sites in an individual are one out of several hundreds or so... then, the prior should be (0.003, 0.997).

```{r}
prior <- c(0.003,0.997)
post.ch <- ch[which(n.allele==k)] * prior[1]
post.err <- err[which(n.allele==k)] * prior[2]
post.ch/post.err
log10(post.ch/post.err)
```

```{}
[Exercise]
Drow likelihood ratio and log-likelihood ratio based on the prior and draw them.
```
```{r,echo=FALSE}
post.LL <- (ch*prior[1])/(err*prior[2])
plot(n.allele,post.LL)
post.logLL <- log10(post.LL)
plot(n.allele,post.logLL)
abline(h=c(-2,2),col=2)
```

When n.allele is less than a cut-off value 1, the locus might be called "homozygous" and more than a cut-off value 2, "heterozygous", otherwise uncertain.

```{}
[Exercise]
Do the same for different depth values, including x5, x10, x20, x30, x40.
Check the cut-off values and plot the cut-off values along the depth values.
```

#### Prior 2 ~ Bayesian estimation of heterozygous frequency of known variant loci

If this site is known to be a single nucleotide variant locus, then what is the prior?

##### Allele frequency and HWE

When allele frequency of the SNP is given and HWE is assumed:

$$
1 = (p+(1-p))^2 = p^2 + \ 2\times p (1-p) + \ (1-p)^2
$$

```{r}
allele.freq <- 0.3
another.freq <- 1-allele.freq
another.freq

my.diplofreq.hwe <- function(allele.freq){
  another.freq <- 1-allele.freq
  ret <- c(allele.freq^2,2*allele.freq*another.freq,another.freq^2)
  return(ret)
}
my.diplofreq.hwe(allele.freq)
```

```{}
[Exercise]
Calculate heterozygous frequency in HWE for allele frequency ranging from 0 to 1 and plot the results.
```
```{r,echo=FALSE}
allele.freq <- seq(from=0,to=1,length=100)
het.freq <- rep(0,length(allele.freq))
for(i in 1:length(allele.freq)){
  het.freq[i] <- my.diplofreq.hwe(allele.freq[i])[2]
}
plot(allele.freq,het.freq)
```

Prior shouldbe $ 1-2p(1-p) vs. 2p(1-p) $.

##### Diplotype frequency and HWD

When number of individuals per diplotype is given as $g_{00},g_{01},g_{11}$,$G=g_{00}+g_{01}+g_{11}$,

allele frequency is $f = \frac{g_{00} + g_{01}/2}{G}$.

Observed diplotype frequency : $g_{00},g_{01},g_{11}$ .

Expected diplotype frequency under the assumption of HWE : $G f^2,2Gf(1-f),G(1-f)^2$.

The deviation of the observed counts from the expected values can be measured with $\chi^2$ statistics.

```{r}
g <- c(10,20,15)
G <- sum(g)
f <- (g[1]+g[2]/2)/G
f
g.hwe <- my.diplofreq.hwe(f) * G
chi2 <- sum((g-g.hwe)^2/g.hwe)
chi2
```
```{r}
my.hwe.test <- function(g){
  G <- sum(g)
  f <- (g[1]+g[2]/2)/G
  g.hwe <- my.diplofreq.hwe(f)*G
  chi2 <- sum((g-g.hwe)^2/g.hwe)
  return(chi2)
}
```

##### Random sampling from null hypothesis, HWE

Assume a population whose allele frequency is f and who is in HWE.
We can sample N individuals from the population at random and observe the genotype counts and calculate $chi^2$ value to test HWE.

```{r}
f <- 0.3
g.hwe <- my.diplofreq.hwe(f)
N <- 100
n.iter <- 10^4
chi2out <- rep(0,n.iter)
for(i in 1:n.iter){
  smpl <- sample(0:2,N,replace=TRUE,prob=g.hwe)
  tmp.g <- tabulate(smpl+1,nbins=3)
  chi2out[i] <- my.hwe.test(tmp.g)
}
hist(chi2out)
```
$chi^2$ distribution: https://en.wikipedia.org/wiki/Chi-squared_distribution 

The plot of df = 1 is most similar.

```{r}
p <- pchisq(chi2out,df=1,lower.tail=FALSE)
hist(p)
plot(sort(p))
```

Degree of freedom is 1.

$$
g_{obs,00} = g_{hwe,00} + \delta \\
g_{obs,01} = g_{hwe,01} - 2 \delta \\
g_{obs,11} = g_{hwe,11} + \delta\\

g_{hwe} = (f^2,2f(1-f),(1-f)^2)\\
f = (g_{obs,00}+g_{obs,01}/2)/(g_{obs,00}+g_{obs,01}+g_{obs,11})
$$
Besides $\delta$ all other values are determined by the observation values $g_{obs}$, which means only 1 parameter is free ... degree of freedom = 1.

```{}
[Exercise]
Check yourself, 
```
$$
g_{obs,00} = g_{hwe,00} + \delta \\
g_{obs,01} = g_{hwe,01} - 2 \delta \\
g_{obs,11} = g_{hwe,11} + \delta+ /delta
$$
```{}
are true.
```

##### Bayesian estimation of heterozygous frequency

When observed n samples and m out of n were heterozygous:

Probability to observe (m, n-m) when frequency of heterozygous is f is:
$$
\begin{pmatrix} n \\ m \end{pmatrix} f^m (1-f)^{n-m}
$$

This is also likelihood of f with observation of (m, n-m).

When observation is binomial, beta distribution is its likelihood function.

```{r}
n <- 10
m <- 3
f <- seq(from=0,to=1,length=100)
d <- dbeta(f,m+1,(n-m)+1)
plot(f,d,type="l")
```

We may use this distribution of f as the prior.





### Ranking ~ in silico inference of functionality

```{}
[Exercise]
Report how to score variants for their functional importance.

Ref: https://www.ncbi.nlm.nih.gov/pubmed/16824020 
```

# Cancer-syndromes

## Basics

Cancers are common diseases and therea are subsets among them that are strongly hereditary, which are called cancer syndromes.

BRCA1 and BRCA2 are such examples causing breast/ovarian cancers frequently.

BRCA1, BRCA2など、特定の癌の発病リスクを高める遺伝子変異が知られている。

遺伝要因が特に高いサブポピュレーションである。

The recommendations for cancer-syndromes are:

このようなサブポピュレーションへの対応は、

１　Score risk possessing the responsible variants without prior genetic testings. 遺伝要因を保有しているリスクが高いと考えられる場合に

２　Do genetic counseling 遺伝的カウンセリングを行ったうえで

３　Then do genetic testing 遺伝子検査を実施し

４　If positive, decside what to do. 結果に応じて予防対策を決定する

というやり方が推奨されている。https://www.uspreventiveservicestaskforce.org/Page/Document/RecommendationStatementFinal/brca-related-cancer-risk-assessment-genetic-counseling-and-genetic-testing　


For the step 1, two methods, a simple scoring system and a Bayesian risk estimation, BRCA-Pro are handled and for the step 4, a web-based decision tool, BRCA tool is handled.


１を行うツール例として、Scoring system (項目別得点形式)とBRCA-Pro　http://bcb.dfci.harvard.edu/bayesmendel/brcapro.php　を、

４を行うツール例として、BRCA decision tool http://brcatool.stanford.edu/brca.html 

について理解する

## Risk assessment

### Ontario Family History Assessment Tool

http://www.wikidoc.org/index.php/BRCA_screening_tools 

How do you read the table?

How do you describe the scores?

```{}
[Exercise]
Why "Breast and ovarian" and "Breast" are separate?

Why "Breast and ovaria" has higher scores than "Breast (only)"?

How do you describe the difference in score values of "Breast (only)" and "Ovarian (only)"?

How do you describe the score values of "Mother", "Sibling" and "Second-/third-degree relative" ?

Why is "Male relative (Breast)" specifically handled?

Describe the score values of "age/life stage" information.
```

#### Likelihood ratio gets bigger with multiple events

Two hypotheses: $p1$ to cause a cancer vs. $p2$ to cause a cancer; $p1 < p2$.

The hypothesis with $p1$ is "no responsible variat" but "phenocopy".

The hypothesis with $p2$ is "yes responsible variant" with "penetrance < 1".

When one cancer is observed, likelihood of two hypotheses are $p1$ and $p2$.

When two cancers are observed, likelihood are $p1^2$ and $p2^2$.

Likelihood ratio gets bigger:

$$
\frac{p2^2}{p1^2}  = \frac{p2}{p1} \times \frac{p2}{p1} > \frac{p2}{p1}
$$

#### Breat vs. Ovary; Female vs. Male

$p1$ represents probability to develop it without the variant, in other words, the baseline probability ... prevalence.

Prevalence of breast vs. ovary ?

Prevalence of breast cancer in female vs. male?

Prevalences of two cancer types; $p1.1 < p1.2$. 

$$
\frac{p2}{p1.1}  > \frac{p2}{p1.2}
$$
#### Age; early-onset

If rare events happen at random, it follows Poisson distribution.

Assume you get cold m times per year in average from age 0 to 100.

```{r}
age.box <- 0:100
m <- 4
events <- rpois(length(age.box),m)
plot(age.box, events,type="h")
table(events)
```

Much infrequent event, eg. cancer...

```{r}
m <- 0.1
events <- rpois(length(age.box),m)
plot(age.box, events,type="h")
table(events)
```

The first cancer episode should be focused because cancer is more significant than common colds.

```{r}
n.ind <- 10^4
first.cancer <- rep(-9,n.ind)
for(i in 1:n.ind){
  events <- rpois(length(age.box),m)
  tmp <- which(events > 0)
  if(length(tmp)>0){
    first.cancer[i] <- tmp[1] - 1
  }
}
hist(first.cancer)
```

This is in exponential distribution.

This distribution does not fit to one of age of onset of adult cancers.

Assumption of multi-hit-accumulation:

```{r}
n.hit <- 10
first.cancer <- rep(-9,n.ind)
for(i in 1:n.ind){
  events <- rpois(length(age.box),m)
  cumsum.events <- cumsum(events)
  tmp <- which(cumsum.events >= n.hit)
  if(length(tmp)>0){
    first.cancer[i] <- tmp[1] - 1
  }
}
hist(first.cancer,breaks=seq(from=-10,to=100,by=10))
```

Many people are free from cancer and age of onset of first cancer distributes in a good shape.

The possession of responsible variant increases the average value m.
```{r}
n.hit <- 10
m2 <- 0.2
first.cancer2 <- rep(-9,n.ind)
for(i in 1:n.ind){
  events <- rpois(length(age.box),m2)
  cumsum.events <- cumsum(events)
  tmp <- which(cumsum.events >= n.hit)
  if(length(tmp)>0){
    first.cancer2[i] <- tmp[1] - 1
  }
}
h1 <- hist(first.cancer,breaks=seq(from=-10,to=100,by=10),density=25)
h2 <- hist(first.cancer2,breaks=seq(from=-10,to=100,by=10), add=TRUE,col=2,density=17)
```
```{r}
h1$counts
h2$counts
```

```{}
[Exercise]
How many people did survive till age 100 without cancer with m=0.1 and m=0.2, respectively?

Answer likelihood ratio m=0.2 over m=0.1 for the survivors?

How about for the patients who developed cancer in their 40s?

Calculate likelihoods for other xxties and plot them.
Pay attention to the age the more likely model changes from one to the other.
```
```{r,echo=FALSE}
plot((h2$counts/h1$counts)[c(2:10,1)])
```

#### kinship relationship

(1) Phenotype of mother -> likeliness of mother having the variant

(2) likeliness of mother having the variant -> likeliness of self having the variant.

Blood relationship affects (2).

```{}
[Excersize]
Make a pedigree that contains, (mother,daughter), (sisters), (aunt,niece),(female-cousins) and draw a pedigree.
Using kinship() function, calculate kinship between the pairs.

eg.
library(kinship2)
example(kinship)
plot(tped)
```

#### Calculation of scores


Each factor has a point.

$$
score = \sum_{i=1}^{n.factors} p_i \times x_i \ ; \ x_i \in \{0,1\}
$$

```{}
[Excercise]
Generate a function to calculate Ontario Family Hisory Assessment Tool score.
```
```{r,echo=FALSE}
factors <- 1:20
points <- c(10,7,5,4,3,2,2,6,4,2,2,3,7,4,3,6,4,2,1,1)

yes.factors <- c(0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0)

sum(yes.factors * points)
```

Score measures or estimates how likely an individual posesses the variant.

The formula above is in the shape of linear regression.

#### What is a scoring system?

##### Quantitative measure

Monotonic relation between a measure and risk.

x : measurable item

y : risk

```{r}
x <- seq(from=0,to=10,length=100)
y <- sin(x) + 1.5*x
plot(x,y,type="l")
```

Measuring x and returning y as a score.

Risk itself is not known but we know that x and risk are in monotonic relation.

Then, higher x should have higher score.

Value of x might be used as score itself but...

```{r}
x <- c(rnorm(10000),rnorm(5000,2,3))
hist(x)

length(which(x<=5))/length(x)
```

Standardization:

The minimum x gets a score 1/N.

The median x gets a score 0.5.

The maximum x gets a score 1.

##### Two dimensional measures.

```{r}
x1 <- rnorm(1000) * 5
x2 <- rnorm(1000) * 3
plot(x1,x2,asp=1)
a1 <- 2.3
a2 <- 1.2
b <- 5
y <- a1 * x1 + a2 * x2 + b

y. <- (y-min(y))/(max(y)-min(y))

plot(x1,x2,pch=20,col=rgb(y.,1-y.,1),asp=1)
```

```{r}
plot(x1,x2,pch=20,col=rgb(y.,1-y.,1),asp=1)
abline(0,a2/a1)
abline(-10,-a1/a2)
abline(-5,-a1/a2)
abline(0,-a1/a2)
abline(5,-a1/a2)
abline(10,-a1/a2)
```

$$
y ~ a1 \times x1 + a2 \times x2
$$

Linear regression

```{r}
lm.out <- lm(y~x1 + x2)
lm.out
lm.out$coefficients

#predict(lm.out,2,3)

```

##### Regression

Target variable.

Explanatory variables.

Making a formula to calculate target value with explanatory variables' values.

How to design a scoring system?


Multiple items

A function to take multiple items and returns a scaler value.



##### Linear regression vs. Non-linear regression

The likelihood ratios for age of onset bins:

```{r}
h1 <- hist(first.cancer,breaks=seq(from=-10,to=100,by=10),density=25)
h2 <- hist(first.cancer2,breaks=seq(from=-10,to=100,by=10), add=TRUE,col=2,density=17)
plot((h2$counts/h1$counts)[c(2:10,1)])
```

Contribution of age (or age bins) is not simple.

Score of somebody whose mother and a sister developed breast cancer in early age should be higher than one whose mother developed in late but a sister in early.

Factor of relatives and factor of age of onset are mutually interrelated.

Those information would be more informative to estimate likeliness to possess the variant.

Bayesian network method can integrate all the information together.


### BRCA-PRO ~ Bayesian network method

http://bcb.dfci.harvard.edu/bayesmendel/brcapro.php

Risk express : http://www.crahealth.com/risk-express 

```{}
[Exercise]
Using "Risk express", evaluate the effect of various factors as below.

Make a pedigree with cancer patients.
Change the self-person in the pedigree and compare the results of each evaluation.
```

#### What is Bayesian network?

Assume a pedigree in which a hereditary phenotype is running.

The appearance of the phenotype is age-dependent.

Bayesian network method can estimate individuals' possession of responsible variant and genetic model with phenocopy and penetrance with age-depending function of disease development.

##### Further reading : Bayesian Network for BRCAPro

Sub text : "BayesianNetwork4BRCAPro.Rmd" @ http://www.statgenet.med.kyoto-u.ac.jp/StatGenet/lectures/2017/Statgenet2017/BayesianNetwork4BRCAPro.Rmd  

or

http://www.statgenet.med.kyoto-u.ac.jp/StatGenet/lectures/2017/Statgenet2017/BayesianNetwork4BRCAPro.html

## Decision tools

### BRCA decision tool

#### Use the tool
BRCA decision tool http://brcatool.stanford.edu/brca.html 

```{}
[Exercise]
Play with the tool and report the results.
```

#### Monte-carlo simulation to estimate outcomes

Paper: https://www.ncbi.nlm.nih.gov/pubmed/22231042

```{}
[Exercise]
Describe Table 1
```

Multiple factors for Monte-Carle simulation.

Stochastic effects of BRCA1/2 on the factors are based on previous studies.

```{}
[Exercise]
Describe how the BRCA decision tool simulate the life of individuals using the factors in the table.
```


# Complex genetic traits

## Genetic models

### Dominant,recessive,additve, multiplicative  遺伝モデル〜優性・劣性・相加・相乗〜

Assume a SNP whose minor allele (m) increases probability over the major allele (M) to develop a phenotype as:

MM : p0

Mm : p1 >= p0

mm : p2 >= p1

```{}
[Exercise]
Answer p0,p1 and p2 when genetic mode is dominant and penetrance is a and phenocopy is b.

Answer p0,p1 and p2 when genetic mode is recessive and penetrance is a and phenocopy is b.
```

Dominant model is $p1 = p2$.

Recessive model is $p0 = p1$.

Non-mendelian traits, $p0 < p1 < p2$.

```{}
[Exercise]
Describe molecular mechanisms that cause pure dominant or pure recessive traits.

Describe molecular mechanisms that cause non-pure dominant and non-pure recessive traits.
```

```{}
What kind of molecular mechanisms can produce additive model and multiplicative model, respectively?
```

## Population, cohort 集団、コホート

In case of rare mendelian tratis, the frequency of responsible variant in population is very rare and the prevalence of the trait is also extremely rare.

Howevere common genetic traits are common (~ 1 %  ~ half) in population and risk allele frequency is high.

Now we are to make a function to calculate fractions of genotypes per phenotypes of the whole population.

```{r}
prevalence <- 0.01
allele.f <- 0.3
# HWE as a whole
genotype.f <- c(allele.f^2,2*allele.f*(1-allele.f),(1-allele.f)^2)

genotypic.risk <- c(4,2,1)
relative.frac.case <- genotypic.risk * genotype.f
# the sum of frac.case should be prevalence
frac.case <- relative.frac.case * prevalence / sum(relative.frac.case)

frac.cont <- genotype.f - frac.case

frac.whole <- rbind(frac.case,frac.cont)

frac.whole
colSums(frac.whole)
rowSums(frac.whole)
```
```{r}
my.snp.pop <- function(allele.f,prevalence,genotypic.risk){
  genotype.f <- c(allele.f^2,2*allele.f*(1-allele.f),(1-allele.f)^2)
  relative.frac.case <- genotypic.risk * genotype.f
  # the sum of frac.case should be prevalence
  frac.case <- relative.frac.case * prevalence / sum(relative.frac.case)
  frac.cont <- genotype.f - frac.case
  frac.whole <- rbind(frac.case,frac.cont)
  frac.whole
}
allele.f <- 0.2
prevalence <- 0.1
genotypic.risk <- c(3,2,1)
pop <- my.snp.pop(allele.f,prevalence,genotypic.risk)
pop
colSums(pop)
rowSums(pop)
```

#### Genotypic risk ratio and addtive model and multiplicative model

GRR : (9,3,1) means probabilities to develop a phenotype of homozygous and heterozygous genotypes are $\times 3 \times 3$ and $\times 3$ of baseline probability 1, respectively.

Calculation of risks is "multiplication".

GRR : (3,2,1) means probalities to develop a phenotype of homozygous and heterozygous genotpypes are $+ 1 + 1$ and $+1$ from baseline 1, respectively.

Calcualtion of risks is "addition".

```{}
[Exercise]
Answer genotypic risk ratio of dominant model with phenocopy = p and prevalence q.

Answer genotypic risk ratio of recessive model with phenocopy = p and prevalence q.
```

#### Genotypic risk ratio and odds ratio

Pick a genotype as a baseline genotype and calculate odds ratios for the other two genotypes.

Odds ratio are close to the genotypic ratio whose base line riks = 1.

However the values are not exactly identical.

```{r}
allele.f <- 0.2
prevalence <- 0.01
genotypic.risk <- c(3,2,1)
pop <- my.snp.pop(allele.f,prevalence,genotypic.risk)
pop
(pop[1,1]*pop[2,3])/(pop[2,1]*pop[1,3])
(pop[1,2]*pop[2,3])/(pop[2,2]*pop[1,3])
```

```{}
[Exercise]
Fix allele freq and parameterize prevalence.
Evaluate the relation between prevalence and discrepancy in genotypic risk ratios and genotypic risk ratios.
```


## SNP 2x3 table and association tests

In the case-control sampling designs, we sample cases from case population and controls from control population.

```{r}
n.case <- 1000
n.control <- 1000

allele.f <- 0.2
prevalence <- 0.01
genotypic.risk <- c(2,1.5,1)
pop <- my.snp.pop(allele.f,prevalence,genotypic.risk)
pop.case <- pop[1,]
pop.control <- pop[2,]

sample.case <- rmultinom(1,n.case,prob=pop.case)
sample.control <- rmultinom(1,n.control,prob=pop.control)

tab <- rbind(c(sample.case),c(sample.control))
tab
```

```{}
[Exercise]
Test sum of cases and controls for hwe.

Test cases and controls individually for hwe.

Generate a sample tables with low prevalence and high prevalence (0.5) and also test cases, controls and sum of cases and samples for hwe and discuss their deviation from hwe.
```


### Table tests with df =2 or df=1

df=2 test will test any difference between cases and controls.

This test returns $chi^2$ value and it should be converted to p-value with df=2 $chi^2$ distribution.
```{r}
chisq.test(tab)
```

When you want to test dominant model, you can do as follow.
This time you can use df=1 $chi^2$ distribution.
```{r}
test.weight <- c(1,1,0)
prop.trend.test(tab[1,],tab[1,]+tab[2,],test.weight)
```

For recessive model, the weight vector is $(1,0,0)$.

For additive model, $(1,0.5,0)$.

```{r}
test.weight <- c(1,0,0)
prop.trend.test(tab[1,],tab[1,]+tab[2,],test.weight)

test.weight <- c(1,0.5,0)
prop.trend.test(tab[1,],tab[1,]+tab[2,],test.weight)
```

#### df=2 vs. df=1

Generate tables at random in null hypothesis and generate $chi^2$ value distribution for df=2 and df=1 tests.

```{r}
allele.f <- 0.2
prevalence <- 0.01
genotypic.risk <- c(1,1,1) # null hypothesis
pop <- my.snp.pop(allele.f,prevalence,genotypic.risk)
pop.case <- pop[1,]
pop.control <- pop[2,]

n.iter <- 10^3
sample.case <- rmultinom(n.iter,n.case,prob=pop.case)
sample.control <- rmultinom(n.iter,n.control,prob=pop.control)


chi2 <- chidom <- chirec <- chiadd <- rep(0,n.iter)
w.dom <- c(1,1,0)
w.rec <- c(1,0,0)
w.add <- c(1,0.5,0)
for(i in 1:n.iter){
  tab <- rbind(c(sample.case[,i]),c(sample.control[,i]))
  chi2[i] <- chisq.test(tab)$statistic
  chidom[i] <- prop.trend.test(tab[1,],tab[1,]+tab[2,],w.dom)$statistic
  chirec[i] <- prop.trend.test(tab[1,],tab[1,]+tab[2,],w.rec)$statistic
  chiadd[i] <- prop.trend.test(tab[1,],tab[1,]+tab[2,],w.add)$statistic
  
}
hist(chi2)
hist(chidom)
hist(chirec)
hist(chiadd)
```

```{r}
plot(data.frame(chi2,chidom,chirec,chiadd))
```

$chi^2$ of df=2 are always larger than other three types of $chi^2$(df=1).

```{r}
hist(pchisq(chi2,df=1,lower.tail=FALSE))
hist(pchisq(chi2,df=2,lower.tail=FALSE))
```

p-values based on df=1 for $chi^2(df=2)$ are not in uniform distribution but p-values based on df=2 are.

```{r}
hist(pchisq(chidom,df=1,lower.tail=FALSE))
hist(pchisq(chirec,df=1,lower.tail=FALSE))
hist(pchisq(chirec,df=1,lower.tail=FALSE))
```

##### df=2 selects the best fit df=1 model

Weight vectors of dominant, recessive and addtive are (1,1,0), (1,0,0), and (1,0.5,0).

In general (1,x,0).

```{r}
x <- seq(from= -20,to=20,length=1000)
chidf1 <- rep(0,length(x))
chi2 <- chisq.test(tab)$statistic
for(i in 1:length(x)){
  tmp.weight <- c(1,x[i],0)
  chidf1[i] <- prop.trend.test(tab[1,],tab[1,]+tab[2,],tmp.weight)$statistic
}
plot(x,chidf1,type="l")
abline(h=chi2)
```

df=2 $chi^2$ value is the biggest $chi^2$ value of df=1 tests with arbitrary model vectors.

This is because df=2 $chi^2$ values should correspond to larger p-values compared with $chi^2$ values of df=1.

df=2 tests also cares the models whose weight vectors that are not difficult to be described biologically in general, such as (1,2,0) and (1, -1, 0). 

```{}
n.case <- 1000
n.control <- 1000

allele.f <- 0.2
prevalence <- 0.1
genotypic.risk <- c(1,0.5,0)+0.1
pop <- my.snp.pop(allele.f,prevalence,genotypic.risk)
pop.case <- pop[1,]
pop.control <- pop[2,]

sample.case <- rmultinom(1,n.case,prob=pop.case)
sample.control <- rmultinom(1,n.control,prob=pop.control)

tab <- rbind(c(sample.case),c(sample.control))
tab
x <- seq(from= -2,to=2,length=1000)
chidf1 <- rep(0,length(x))
chi2 <- chisq.test(tab)$statistic
for(i in 1:length(x)){
  tmp.weight <- c(1,x[i],0)
  chidf1[i] <- prop.trend.test(tab[1,],tab[1,]+tab[2,],tmp.weight)$statistic
}
plot(x,chidf1,type="l")
abline(h=chi2)
abline(v=0.5)
```

### Logistic Regression 

When explanatory variable(s) stochastically determine (0,1) phenotype, logistic regression is the choice.

```{r}
n.case <- 1000
n.control <- 1000

allele.f <- 0.2
prevalence <- 0.1
genotypic.risk <- c(4,2,1)
pop <- my.snp.pop(allele.f,prevalence,genotypic.risk)
pop.case <- pop[1,]
pop.control <- pop[2,]
sample.case <- sample(2:0,n.case,replace=TRUE,prob=pop.case)
sample.control <- sample(2:0,n.control,replace=TRUE,prob=pop.control)

genotypes <- c(sample.case,sample.control)
phenotypes <- c(rep(1,n.case),rep(0,n.control))
lm.out <- glm(phenotypes~genotypes,family="binomial")
lm.out

coef <- lm.out$coefficients
exp(coef[2])
```

```{r}
n.iter <- 1000
est.effect <- rep(0,n.iter)
for(i in 1:n.iter){
  sample.case <- sample(2:0,n.case,replace=TRUE,prob=pop.case)
  sample.control <- sample(2:0,n.control,replace=TRUE,prob=pop.control)

  genotypes <- c(sample.case,sample.control)
  phenotypes <- c(rep(1,n.case),rep(0,n.control))
  lm.out <- glm(phenotypes~genotypes,family="binomial")
  lm.out

  coef <- lm.out$coefficients
  est.effect[i] <- exp(coef[2])
}
hist(est.effect)
```


Regression model easily incorporate covariates.

Genotypes, sex and weight contribute phenotype development.

```{r}
n.samples <- 1000
geno.freq <- c(0.09,0.42,0.49)
g <- sample(0:2,n.samples,replace=TRUE,prob=geno.freq)
sex <- sample(0:1,n.samples,replace=TRUE,prob=c(0.5,0.5))
weight <- rnorm(n.samples,60,10)

a.g <- 5
a.sex <- 3
a.weight <- 1/60
risk <- a.g * g + a.sex * sex + a.weight * weight
risk <- (risk-min(risk))/(max(risk)-min(risk))
hist(risk,density=15)

pheno <- runif(n.samples) < risk
glm.out <- glm(pheno ~ g + sex + weight,family="binomial")
glm.out$coefficients
```

## Multi-locus model

Genotypes and other covariates can contribute in additive fashion.

Multiple genetic loci can also contribute in additive fashion.

They may have interacttive effects but the simplest additive model seem to work for risk locus-search; or interactive effects seem to be too complicated to be studied with appropriate confidence for now.

### Quantitative traits and normal distribution

Let's increase number of loci and count the number of risk allele copies.

```{r}
n.locus <- 100
allele.freq <- runif(n.locus)
geno.freq <- cbind(allele.freq^2,2*allele.freq*(1-allele.freq),(1-allele.freq)^2)

n.samples <- 10^3
G <- matrix(0,n.samples,n.locus)
for(i in 1:n.locus){
  G[,i] <- sample(0:2,n.samples,replace=TRUE,prob=geno.freq[i,])
}

n.risk.copy <- t(apply(G,1,cumsum))

for(i in 1:4){
  hist(n.risk.copy[,i])
}
hist(n.risk.copy[,25])
hist(n.risk.copy[,100])
```

In reality each loci's effect sizes vary.

We can assume that all SNPs are somehow effective and their effect sizes are normally distributted.

Currently a fraction of loci with higher effect seems to be identified.

```{r}
eff <- rnorm(n.locus)

#n.risk.copy <- t(apply(G,1,cumsum))
G.risk <- t(t(G) * eff)
risk.copy <- t(apply(G.risk,1,cumsum))

for(i in 1:4){
  hist(risk.copy[,i])
}
hist(risk.copy[,25])
hist(risk.copy[,100])
```

### Heritability and missing heritability

How strongly genetic factors determine heredity is measured with heritability.

Heritability is a fraction of variance explained by genetic factors out of whole variance of phenotype.


```{r}
n.samples <- 10^6
var.gen <- 8
var.environment <- 2
gen <- rnorm(n.samples, 0, sqrt(var.gen))
env <- rnorm(n.samples, 0 , sqrt(var.environment))
phen <- gen + env
hist(phen)

var(phen)
var.gen + var.environment
```

Variance of phenotype is sum of two variances.
The heritability is 0.8.

#### Heritability and missing heritability

Assume 20 loci contibuted additively with unknown heritability.

```{r}
n.samples <- 10^5

n.locus <- 20
allele.freq <- runif(n.locus)
geno.freq <- cbind(allele.freq^2,2*allele.freq*(1-allele.freq),(1-allele.freq)^2)

eff <- rnorm(n.locus, 5,1)

G <- matrix(0,n.samples,n.locus)
for(i in 1:n.locus){
  G[,i] <- sample(0:2,n.samples,replace=TRUE,prob=geno.freq[i,])
}

G.risk <- t(t(G) * eff)
risk.copy <- t(apply(G.risk,1,cumsum))
gen.risk.total <- risk.copy[,n.locus]

Pheno <- gen.risk.total + rnorm(n.samples,0,2)

hist(Pheno)
```

If all 20 loci are known and their effects are also known, heritability is calculated as;

```{r}
var.gen <- var(gen.risk.total)
var.phen <- var(Pheno)
var.gen/var.phen
```

In case a fraction of loci and their effects are known, the calculation is;

```{r}
n.locus.known <- 5
gen.risk.partial <- risk.copy[,n.locus.known]
var.gen.partial <- var(gen.risk.partial)
var.gen.partial/var.phen
```

The calculated heritability is smaller than the truth.

Actually the fraction of total heritability that is not explained by know loci is called "missing heritability".
```{r}
1- (var.gen.partial/var.phen) / (var.gen/var.phen)
```

#### Calculation of missing heritability


Heritability is estimated by familial anaylyses.

Risk loci are identified by linkage analysis/LD mapping.

The effect size is estimated also.

```{r}
lm.out <- lm(Pheno ~ G[,1:n.locus.known])
lm.out

predicted <- predict(lm.out)

plot(predicted, Pheno)
plot(gen.risk.partial,predicted) 
```

Genetic contribution of known loci seems well estimated and it partially explained phenotype variation.

Heritability explained by know loci is;

```{r}
var(predicted)/var(Pheno)
```

When this value is lower than heritability value estimated from familial studies, it is believed the rest of heritability is missing.

Missing heritability might be explained by unidentified loci or combinatorial effects of known with or wityhout unknown loci.

Also assumed model ( simple additive model for each loci and for multiple loci) deviated from the truth can be the source of the missing.


### Dichotomous traits and threshold model

Affected vs. non-affected, dichotomous phenotype is somewhat different from the above.

Multiple loci generate total risk and environmental risk is added, then each individual is believed to have "risk to develop" the phenotype.

The threshold model is the model where individuals with risk more than a threshold value are affected and others not.

The risk can determine probability to develop the phenotype. The threshold model has a stochastic component that is buried into environmental contribution. 

```{r}
thres <- quantile(Pheno,0.8)
Pheno.bin <- Pheno > thres
hist(Pheno)
abline(v=thres,col=2)
```

```{r}
br <- seq(from=0,to=200,by=10)
hist(gen.risk.total,breaks=br)
hist(gen.risk.total[which(Pheno.bin==0)],col=2,density=15,breaks=br,add=TRUE)
hist(gen.risk.total[which(Pheno.bin==1)],col=3,density=23,breaks=br,add=TRUE)
```

#### Missing heritability problem of dichotomous traits

Heritability is the faraction of phenotype variance explained by genetic factors.

Heritability for dichotomous traitns itself is problematic because heritability was defined for quantitative traits in normal distribution initially.

Therefore missing heritability, in other words, difficulty in explanation of heritability with loci, is reasonable.


Multiple methods have been proposed to define heritability of dichotomous traits.

A few important ideas for the methods are described below.

##### Variance of phenotype vs. variance of liability

Variance of dichotomous traits are K(1-K), where K is prevalence in population.

Genetic factors determine individual liability, genetic risk to develop phenotpe that takes "normal" distribution; much different from the {0,1}.

Heritability for liability can be estimated in the same way with quantitative traits. 

Unfortunately liability is not observed.

##### Ascertainment problem

Variance of dichotomous traits are K(1-K), where K is prevalence in population.

Samples' phenotype tariance is P(1-P), where P is proportion of cases in case-control design.

K(1-K) and P(1-P) differ a lot...

Then liability distribution is deformed.

```{r}
br <- seq(from=0,to=200,by=10)
sample.case <- which(Pheno.bin==1)
sample.cont <- which(Pheno.bin==0)

case.control.Case <- sample(sample.case,1000,replace=TRUE)
case.control.Cont <- sample(sample.cont,1000,replace=TRUE)

hist(gen.risk.total[c(case.control.Cont,case.control.Case)],breaks=br)

hist(gen.risk.total[c(case.control.Cont,case.control.Case)],breaks=br)
hist(gen.risk.total[case.control.Cont],col=2,density=15,breaks=br,add=TRUE)
hist(gen.risk.total[case.control.Case],col=3,density=23,breaks=br,add=TRUE)
```

To treat these issues appropriately, model-based estimation procudures are being used.

Also GWAS/ whole exome/genome sequence data are to be utilized to incorporate similarlity infromation among individuals and markers.

#### Further readings

http://www.well.ox.ac.uk/dtc/CurrentProblems.pdf 

http://www.pnas.org/content/111/49/E5272.short 


### If we have time...

#### Multiple testing correction

##### Bonferroni correction

##### Permutation test

#### Meta-analysis

#### Linkage disequilibrium

##### Haplotype estimation

##### LD block




# Transcriptome ~ Subtyping cancers based on expression profile

## Molecular profiling for cancers ~ Prognostic information

Review of molecular profiling for breast cancers: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3825646/ 

### eg. Mamma print 

#### MammaPrint at a glance

In Japanese : https://www.dna-chip.co.jp/diagnostic/mammaprint/patient1.html 

In English : http://www.agendia.com/healthcare-professionals/breast-cancer/mammaprint/ 

#### Identification of a set of prognostic genes

https://www.ncbi.nlm.nih.gov/pubmed/11823860

```{}
[Exercise]
Define the clinical target of studies.

Describe definition of two subgroups.
```

## Differential expression analysis (DEA)

### Statistical test and measure of difference

Individual gene's expression is compared.

t-test for rejection of null hypothesis.

Correlation to measure strength of relation.

Slope of linear regression .

Mean expression ratio.

```{r}
exp1 <- rnorm(100,100,1)
exp2 <- rnorm(100,110,1)
gr <- c(rep(0,length(exp1)),rep(1,length(exp2)))
boxplot(cbind(exp1,exp2))
t.test(exp1,exp2)
cor(gr,c(exp1,exp2))
lm(c(exp1,exp2)~gr)
mean(exp1)/mean(exp2)
```

#### Evaluate genes with p-values and measure of difference

Volcano plot.

width of difference and smallness of p-value are correlated but not in 1-to-1 correspondence.

```{r}
n.gene <- 1000
n1 <- 100
n2 <- 100
gr <- c(rep(0,n1),rep(1,n2))

ps <- ratios <- rep(0,n.gene)
for(i in 1:n.gene){
  m <- rnorm(1,100,20)
  exp1 <- matrix(rnorm(n.gene*n1,m),ncol=n.gene)
  exp2 <- matrix(rnorm(n.gene*n2,m),ncol=n.gene)
  ps[i] <- t.test(exp1[,i],exp2[,i])$p.value
  ratios[i] <- mean(exp1[,i])/mean(exp2[,i])
}

plot(log2(ratios),-log10(ps))
```

### Log-transformation

Normal distribution.

```{r}
r <- rnorm(100000,0,0.3)
hist(r)
```

Log-normal distribution.

Values should be 0 or more.

Expression intensity of genes is log-normally distributed.

```{r}
r2 <- exp(r)
hist(r2)
```
Log-normally distributed values can be transformed in normal distribution by taking their logarithm.

```{r}
r3 <- log(r2)
hist(r3)
```

#### t-testing log-normally distributed data sets.

t-test is designed to test data sets that are in normal distribution.

What would happen when t-test is applied to data sets that are in normal and log-normal distributions?

```{r}
n.iter <- 1000
p.normal <- p.log.normal <- rep(0,n.iter)
for(i in 1:n.iter){
  d1 <- rnorm(20)
  d2 <- rnorm(20,0.6) # small difference
  exp.d1 <- exp(d1)
  exp.d2 <- exp(d2)
  tmp <- t.test(d1,d2)
  tmp.exp <- t.test(exp.d1,exp.d2)
  p.normal[i] <- tmp$p.value
  p.log.normal[i] <- tmp.exp$p.value
}

hist(p.normal,col=2,density=25,breaks=seq(from=0,to=1,by=0.05))
hist(p.log.normal,add=TRUE,col=3,density=31,breaks=seq(from=0,to=1,by=0.05))

plot(p.normal,p.log.normal)
```

### False Discovery Rate (FDR)

When many genes are tested and when there is a good reason to believe that some fraction of genes should be positively assosicated with a phenotype, FDR is the choice to pick up genes with association.

```{r}
n.genes <- 10^3
frac.assoc <- 0.2
n.assoc <- n.genes*frac.assoc
n.non.assoc <- n.genes-n.assoc

n1 <- n2 <- 100
d1.non.assoc <- matrix(rnorm(n1*n.non.assoc),nrow=n1)
d2.non.assoc <- matrix(rnorm(n2*n.non.assoc),nrow=n2)

d1.assoc <- matrix(rnorm(n1*n.assoc,0.3),nrow=n1) # Exercise: plot with 0.3 -> 2
d2.assoc <- matrix(rnorm(n2*n.assoc,0),nrow=n2)

d1 <- cbind(d1.non.assoc,d1.assoc)
d2 <- cbind(d2.non.assoc,d2.assoc)

d.all <- rbind(d1,d2)
heatmap(d.all)

```
```{}
[Exercise]
Plot a heatmap with "d1.assoc <- matrix(rnorm(n1*n.assoc,0.3),nrow=n1) # plot with 0.3 -> 2"
```

We can t-test genes one by one.
```{r}
ps <- rep(0,n.genes)
for(i in 1:n.genes){
  ps[i] <- t.test(d1[,i],d2[,i])$p.value
}
plot(ps)
hist(ps[1:n.non.assoc])
hist(ps[(n.non.assoc+1):n.genes])
plot(sort(ps))
```

Genes with smaller p-values are more likely to be truely positive.

```{r}
ord <- order(ps)
col <- c(rep(1,n.non.assoc),rep(2,n.assoc))
plot(ps[ord],col=col[ord],type="h")
```

The fraction of false positives (False Discovery Rate) depends on cut-off values.

```{r}
cut_offs <- seq(from=0,to=1,length=100)
ps.non.assoc <- ps[1:n.non.assoc]
ps.assoc <- ps[(n.non.assoc+1):n.genes]
n.true.positive <- n.false.positive <- rep(0,length(cut_offs))
for(i in 1:length(cut_offs)){
  n.true.positive[i] <- length(which(ps.assoc < cut_offs[i]))
  n.false.positive[i] <- length(which(ps.non.assoc < cut_offs[i]))
}
matplot(cut_offs,cbind(n.false.positive,n.true.positive),type="l")
```

```{r}
false.discovery.rate <- n.false.positive/(n.false.positive+n.true.positive)
false.negative.rate <- (n.assoc-n.true.positive)/n.assoc
matplot(cut_offs,cbind(false.discovery.rate,false.negative.rate),type="l")
```

We can tell false discovery rate when we know which genes are truely/falsely associated, but in reality we don't know the truth.

We want to have a method that only uses test results and pick cut-off values for various false discovery rate values.

```{r}
qs <- p.adjust(ps,"fdr")
plot(qs)
hist(qs[1:n.non.assoc])
hist(qs[(n.non.assoc+1):n.genes])
ord <- order(qs)
col <- c(rep(1,n.non.assoc),rep(2,n.assoc))
plot(qs[ord],col=col[ord],type="h")

cut_offs <- seq(from=0,to=1,length=100)
qs.non.assoc <- qs[1:n.non.assoc]
qs.assoc <- qs[(n.non.assoc+1):n.genes]
n.true.positive <- n.false.positive <- rep(0,length(cut_offs))
for(i in 1:length(cut_offs)){
  n.true.positive[i] <- length(which(qs.assoc < cut_offs[i]))
  n.false.positive[i] <- length(which(qs.non.assoc < cut_offs[i]))
}
matplot(cut_offs,cbind(n.false.positive,n.true.positive),type="l")
false.discovery.rate <- n.false.positive/(n.false.positive+n.true.positive)
false.negative.rate <- (n.assoc-n.true.positive)/n.assoc
matplot(cut_offs,cbind(false.discovery.rate,false.negative.rate),type="l")
abline(0,1,col=3)
```
Green line indicates "perfect FDR control".

How about Bonferroni's method?

```{r}
qs <- p.adjust(ps,"bonferroni")
plot(qs)
hist(qs[1:n.non.assoc])
hist(qs[(n.non.assoc+1):n.genes])
ord <- order(qs)
col <- c(rep(1,n.non.assoc),rep(2,n.assoc))
plot(qs[ord],col=col[ord],type="h")

cut_offs <- seq(from=0,to=1,length=100)
qs.non.assoc <- qs[1:n.non.assoc]
qs.assoc <- qs[(n.non.assoc+1):n.genes]
n.true.positive <- n.false.positive <- rep(0,length(cut_offs))
for(i in 1:length(cut_offs)){
  n.true.positive[i] <- length(which(qs.assoc < cut_offs[i]))
  n.false.positive[i] <- length(which(qs.non.assoc < cut_offs[i]))
}
matplot(cut_offs,cbind(n.false.positive,n.true.positive),type="l")
false.discovery.rate <- n.false.positive/(n.false.positive+n.true.positive)
false.negative.rate <- (n.assoc-n.true.positive)/n.assoc
matplot(cut_offs,cbind(false.discovery.rate,false.negative.rate),type="l")
abline(0,1,col=3)
```

#### How to correct p-values with Bonferroni and FDR (BH) methods

Bonferroni
$$
q = p \times n
$$
All p-values should be evenly handled.

FDR
$$
q_i = p_i \times n / i
$$

Smaller p-values should be adjusted more.
No adjustment of the biggest p-value.

The order of q-values should be the same with the order of p-values.


```{r}
my.bonferroni <- function(ps){
  n <- length(ps)
  qs <- ps * n
  return(qs)
}
# ps should be sorted
my.fdr <- function(ps){
  n <- length(ps)
  tmp <- ps * n/(1:n) # smaller p is adjusted more
  qs <- rep(0,n)
  for(i in 1:n){
    qs[i] <- min(tmp[i:n]) # orders of q and p are the same
  }
  return(list(qs=qs,tmp=tmp))
}

ps. <- sort(sample(ps,100))
qs.bonferroni <- my.bonferroni(ps.)
qs.fdr <- my.fdr(ps.)

plot(ps.,qs.bonferroni,ylim=c(0,max(qs.bonferroni)))
abline(h=1,col=2)
plot(ps.,qs.fdr$tmp,ylim=c(0,max(qs.fdr$tmp)))
abline(h=1,col=2)
plot(ps.,qs.fdr$qs,ylim=c(0,max(qs.fdr$tmp)))
abline(h=1,col=2)
plot(qs.fdr$qs,qs.fdr$tmp)
```

When p-values are from pure null hypotheses, FDR-q-values should be 1 throughout.

```{r}
ps. <- sort(runif(1000))
qs.bonferroni <- my.bonferroni(ps.)
qs.fdr <- my.fdr(ps.)

plot(ps.,qs.bonferroni,ylim=c(0,max(qs.bonferroni)))
abline(h=1,col=2)
plot(ps.,qs.fdr$tmp,ylim=c(0,max(qs.fdr$tmp)))
abline(h=1,col=2)
plot(ps.,qs.fdr$qs,ylim=c(0,max(qs.fdr$tmp)))
abline(h=1,col=2)
```


## Clustering and heatmap

```{r}
n.genes <- 10^2
frac.assoc <- 0.2
n.assoc <- n.genes*frac.assoc
n.non.assoc <- n.genes-n.assoc

n1 <- n2 <- 50
d1.non.assoc <- matrix(rnorm(n1*n.non.assoc),nrow=n1)
d2.non.assoc <- matrix(rnorm(n2*n.non.assoc),nrow=n2)

d1.assoc <- matrix(rnorm(n1*n.assoc,2),nrow=n1) 
d2.assoc <- matrix(rnorm(n2*n.assoc,0),nrow=n2)

d1 <- cbind(d1.non.assoc,d1.assoc)
d2 <- cbind(d2.non.assoc,d2.assoc)

d.all <- rbind(d1,d2)
heatmap(d.all)

```

#### Distance, similarlity

```{r}
s1 <- c(3,0)
s2 <- c(0,4)
plot(rbind(s1,s2),xlim = c(0,5),ylim=c(0,5))
```

Distance

```{r}
plot(rbind(s1,s2),xlim = c(0,5),ylim=c(0,5))
segments(s1[1],s1[2],s2[1],s2[2])
sqrt(sum((s1-s2)^2))
```

Angle measures similarlity as well.

```{r}
s1 <- c(1,0)
s2 <- c(1/2,sqrt(3)/2)
plot(rbind(s1,s2),xlim = c(0,2),ylim=c(0,2))
segments(0,0,s1[1],s1[2])
segments(0,0,s2[1],s2[2])

x <- sum(s1*s2)/sqrt(sum(s1^2)*sum(s2^2))
theta <- acos(x)
theta/pi
s2. <- c(cos(theta),sin(theta))
s2.
s2
```

#### Distance matrix and trees

```{r}
n1 <- 10
d <- 2
cl1 <- matrix(rnorm(n1*d),ncol=d)
n2 <- 8
cl2 <- matrix(rnorm(n2*d),ncol=d) + 5

dat <- rbind(cl1,cl2)
plot(dat,col=c(rep(1,n1),rep(2,n2)))
```

Distance matirx.

Red: close

Yellow : Apart

```{r}
dm <- as.matrix(dist(dat))
dm
image(dm)
```

Clustering
```{r}
plot(hclust(dist(dm)))
```


#### Tree generation

There are multiple ways to generate cluster trees.

Neighbor joining method is one of them.
It is good to understande what the tree shapes and length of edges mean.


When there are three items to be clustered, distance values are maintained in the tree.
```{r}
# install.packages("ape")
library(ape)
n.tips <- 3
k <- 2 # dimension
d <- matrix(rnorm(n.tips*k),ncol=k)
dm <- as.matrix(dist(d))
cl <- nj(dm)
plot(cl,type="unroot")
cl$edge
cl$edge.length
dm
```

```{}
[Exercise]
Check the IDs of tip nodes and the node inserted and calculate distances amon the tips in the tree yourself.
```
A tree cluster is generated.
Two nodes are inserted and four original points are tips of the tree.

The distance along the tree from a tip to another tip is the same with the value in the distance matrix.

You can check it by making a graph object with edges and edge length information.

```{r}
library(igraph)
g <- graph.edgelist(cl$edge)
E(g)$weight <- cl$edge.length
plot(g)
dm.from.g <- distances(g)
dm.from.g
dm.from.g[1:3,1:3]
dm
dm.from.g[1:3,1:3] - dm
```

The tree and its edge lenght represent distance values of all pairs.


```{r}
n.tips <- 100
k <- 4
d <- matrix(rnorm(n.tips*k),ncol=k)
dm <- as.matrix(dist(d))
cl <- nj(dm)
plot(cl,type="unroot")
#cl$edge
#cl$edge.length
g <- graph.edgelist(cl$edge)
E(g)$weight <- cl$edge.length
plot(g)
dm.from.g <- distances(g)
plot(c(dm),c(dm.from.g[1:n.tips,1:n.tips]))
#range(dm.from.g[1:n.tips,1:n.tips] - dm)
```

#### Heatmap ~ clustering samples and clustering genes

```{r}
n.genes <- 10^2
frac.assoc <- 0.2
n.assoc <- n.genes*frac.assoc
n.non.assoc <- n.genes-n.assoc

n1 <- n2 <- 50
d1.non.assoc <- matrix(rnorm(n1*n.non.assoc),nrow=n1)
d2.non.assoc <- matrix(rnorm(n2*n.non.assoc),nrow=n2)

d1.assoc <- matrix(rnorm(n1*n.assoc,2),nrow=n1) 
d2.assoc <- matrix(rnorm(n2*n.assoc,0),nrow=n2)

d1 <- cbind(d1.non.assoc,d1.assoc)
d2 <- cbind(d2.non.assoc,d2.assoc)

d.all <- rbind(d1,d2)
heatmap(d.all)
```

d.all: 

rows : samples

columns : genes

```{r}
dm.samples <- dist(d.all)
dm.genes <- dist(t(d.all))

plot(hclust(dm.samples))

plot(hclust(dm.genes))
```

```{}
[Exercise]
Use (1-cos(theta)), where theta is an angle between two vectors as dissimilarity measure (definition of distance) and make a function that makes a distance matrix.
```

#### Non-supervised

The clusters were generated just based on the data set. This means the clustering procedure is non-supervised, or no-true-answer-information.

## Supervised clustering

### Data simulation

```{r}
n.genes <- 10^3
frac.assoc <- 0.5
n.assoc <- n.genes*frac.assoc
n.non.assoc <- n.genes-n.assoc

n1 <- n2 <- 50
d1.non.assoc <- matrix(rnorm(n1*n.non.assoc),nrow=n1)
d2.non.assoc <- matrix(rnorm(n2*n.non.assoc),nrow=n2)

# Associated genes have various effect size
d1.assoc <- matrix(rnorm(n1*n.assoc,2),nrow=n1)
for(i in 1:n.assoc){
  d1.assoc[,i] <- rnorm(n1,rnorm(1,0,0.1))
}
d2.assoc <- matrix(rnorm(n2*n.assoc,0),nrow=n2)

d1 <- cbind(d1.non.assoc,d1.assoc)
d2 <- cbind(d2.non.assoc,d2.assoc)

d.all <- rbind(d1,d2)
heatmap(d.all)
```





```{r}
# Gene expression levels are assumed to be in log-normal distribution.
ps <- ratios <- rep(0,n.gene)
for(i in 1:n.genes){
  ps[i] <- t.test(d1[,i],d2[,i])$p.value
  ratios[i] <- mean(exp(d1[,i]))/mean(exp(d2[,i]))
}

plot(log2(ratios),-log10(ps))
```

### Supervised model fitting

We want to predict phenotypes with gene expression data.

We have phenotype data, that are "answers".

We can generate models with "answers", supervised.

Rank genes based on the ratio.

```{r}
ord <- order(abs(log2(ratios)),decreasing=TRUE)
plot(abs(log2(ratios))[ord])
```

From the top of the absolute ratio values to the bottom, genes are to be added to logistic regression model.

```{r}
phenotype <- c(rep(0,n2),rep(1,n2))
n.factor <- 30
predicted <- matrix(0,n1+n2,n.factor)
for(i in 1:n.factor){
  log.lm.out <- glm(phenotype~d.all[,ord[1:i]],family = binomial(link='logit'))
  predicted[,i] <- predict(log.lm.out)
}
```

```{r}
matplot(t(predicted),type="l",col=c(rep(1,n1),rep(2,n2)))
```

With the increase in the number of genes, the segregation performance becomes better.

### Over-fitting

```{r}
n.genes <- 10^3
frac.assoc <- 0.5
n.assoc <- n.genes*frac.assoc
n.non.assoc <- n.genes-n.assoc

n1 <- n2 <- 50
N1 <- n1 * 2
N2 <- n2 * 2
d1.non.assoc <- matrix(rnorm(N1*n.non.assoc),nrow=N1)
d2.non.assoc <- matrix(rnorm(N2*n.non.assoc),nrow=N2)

# Associated genes have various effect size
d1.assoc <- matrix(rnorm(N1*n.assoc,2),nrow=N1)
for(i in 1:n.assoc){
  d1.assoc[,i] <- rnorm(N1,rnorm(1,0,0.1))
}
d2.assoc <- matrix(rnorm(N2*n.assoc,0),nrow=N2)

D1 <- cbind(d1.non.assoc,d1.assoc)
D2 <- cbind(d2.non.assoc,d2.assoc)

D.all <- rbind(d1,d2)

d1.training <- D1[1:n1,]
d2.training <- D2[1:n2,]

d1.test <- D1[(1+n1):N1,]
d2.test <- D2[(1+n2):N2,]
d.training <- rbind(d1.training,d2.training)
d.test <- rbind(d1.test,d2.test)

```

```{r}
# Gene expression levels are assumed to be in log-normal distribution.
ps <- ratios <- rep(0,n.gene)
for(i in 1:n.genes){
  ps[i] <- t.test(d1.training[,i],d2.training[,i])$p.value
  ratios[i] <- mean(exp(d1.training[,i]))/mean(exp(d2.training[,i]))
}

plot(log2(ratios),-log10(ps))
```

```{r}
ord <- order(abs(log2(ratios)),decreasing=TRUE)
plot(abs(log2(ratios))[ord])

phenotype <- c(rep(0,n2),rep(1,n2))
n.factor <- 30
predicted.training <- predicted.test <- matrix(0,n1+n2,n.factor)
for(i in 1:n.factor){
  log.lm.out <- glm(phenotype~d.training[,ord[1:i]],family = binomial(link='logit'))
  predicted.training[,i] <- apply(log.lm.out$coefficients[-1] * t(d.training[,ord[1:i]]),2,sum)
  predicted.test[,i] <- apply(log.lm.out$coefficients[-1] * t(d.test[,ord[1:i]]),2,sum)
}
```

```{r}
matplot(t(predicted.training),type="l",col=c(rep(1,n1),rep(2,n2)))
matplot(t(predicted.test),type="l",col=c(rep(1,n1),rep(2,n2)))
```

With the increase in the number of genes included the model, the classification of samples used to generate the model becomes better.

```{r}
n.upper.half <- rep(0,n.factor)
for(i in 1:n.factor){
  tmp.pred0 <- predicted.training[,i]
  tmp.pred <- predicted.training[1:n1,i]
  n.upper.half[i] <- length(which(tmp.pred<quantile(tmp.pred0,0.5)))
}
plot(1:n.factor,n.upper.half)
```

On the contrary, the prediction for the samples that were not used to generate the models becomes better when the number of genes are relatively few, but it gets worse with larger number of genes.

```{r}
n.upper.half <- rep(0,n.factor)
for(i in 1:n.factor){
  tmp.pred0 <- predicted.test[,i]
  tmp.pred <- predicted.test[1:n1,i]
  n.upper.half[i] <- length(which(tmp.pred<quantile(tmp.pred0,0.5)))
}
plot(1:n.factor,n.upper.half)
```

The modeld became too good to the particular samples of training but they were deviated from the truth.

It is called overfitting.

### How to avoid overfitting

#### Training and testing

Generate models with a fraction of samples (training samples).

Check the models' performance with the rest of the samples (test samples).

The good model(s) should segregate the test samples well.

#### Cross-validation methods

Generation of models are better with more samples.

We can separate the whole samples into training and testing in many ways by randomly assigning them to two subsets. 

For each separation, the "best" gene set is identified. The final selection of "predictive gene set" should be the consensus of many attempts.

https://www.r-bloggers.com/cross-validation-for-predictive-analytics-using-r/

```{}
[Exercise]
Run the examples of cross-valiadtion function mentioned in the r-blogger article.
```

## Validation for its clinical use

http://link.springer.com/article/10.1007%2Fs10549-010-0814-2 
```{}
[Exercise]
Summarize how the paper validated the utility of the set of genes.
```