GeneTrail2 1.5
Statistical analysis of molecular signatures
Regulatory effect analysis
The gene expression patterns in nearly all cell types are altered when environmental factors (e.g. diseases or drug treatment) change. The majority of these alterations are controlled by transcriptional regulators like transcription factors and chromatin modifiers. Hence it is of utmost importance to identify those regulators that might be responsible for the changes in gene expression between two experimental conditions (disease vs. control). Here we describe several methods that can be used for this purpose.
Overrepresentation analysis (ORA)
In the last few years several methods have been proposed that use overrepresentation analysis (ORA) to find important regulators. The goal of all these approaches is to identify those regulators that have more targets in a list of deregulated genes than expected by chance. For all methods we assume that we have a regulator with $k$ targets in a test set $T=\{t_1,t_2,...,t_n\}$ and $l$ targets in our collection of regulatortarget interactions (RTIs) $R=\{t_1,t_2,...,t_m\}.$ The target genes in the RTI collection form the background set for the ORA.
Hypergeometric test
Essaghir et al. [1] proposed to calculate an uppertailed pvalue based on the hypergeometric test in order to identify transcription factors with significantly more targets than expected:
$$ p_{hyper}=\sum_{i=k}^{n}\frac{\binom{l}{i}\binom{ml}{ni}}{\binom{m}{n}} $$Fisher's exact test
Alternatively, the Fisher's exact test can also be used for this task. This test has the advantage that it can also be used if test set $T$ is not a subset of $R$.
$$ p_{fisher}=\sum_{i=k}^{n}\frac{\binom{n}{i}\binom{m}{l+ki}}{\binom{m+n}{l+k}} $$Hypergeometric test or Fisher's exact test
Since it might not always be obvious for the user which of the two tests should be used, we additionally implemented a combined version of the test that was originally presented by Backes et al. [2]. Here, the authors use the hypergeometric test if $T$ is a subset of $R$ and the Fisher's exact test if this is not the case.
Binomial test
Alternatively Yang et. al [3] proposed to use the binomial test.
$$ p_{binom}=\sum_{i=k}^{l}\binom{l}{i}\left(\frac{n}{i}\right)^i\left(1\frac{n}{m}\right)^{ni} $$Regulatory impact factors (RIF1 + RIF2)
Reverter et al. [4] introduced the concept of regulatory impact factors. In their paper they propose two metrics RIF1 and RIF2 that both integrate the change in correlation between regulator and the differentially expressed (DE) target genes, the amount of differential expression as well as the abundance of DE genes. According to the authors, a " RIF analysis assigns an extreme score to those TF that are consistently most differentially coexpressed with the highly abundant and highly DE genes (RIF1), and to those TF with the most altered ability to predict the abundance of DE genes (RIF2) ".
Both metrics are defined as follows [4]:
Assume we are given a transcription factor $i$ with $n$ DE target genes and a gene expression data set with two conditions being compared. Then $a_j$ is the mean expression of the $j$th DE gene, $e1_j$ and $e2_j$ are the averaged expression values of condition $1$ and $2$ respectively, $d_j$ is the estimated differential expression of the $j$th DE gene and $r1_{ij}$ as well as $r2_{ij}$ are the correlation coefficients between the $i$th TF and the $j$th DE gene in conditions $1$ and $2$.
$$ RIF1 = \frac{1}{n} \sum_{j=1}^{n} a_j \times d_j \times (r1_{ij}  r2_{ij}) $$ $$ RIF2 = \frac{1}{n} \sum_{j=1}^{n} \left[ (e1_j \times r1_{ij})^2  (e2_j \times r2_{ij})^2 \right] $$Regulatorgene association enrichment (REGGAE) analysis
We recently developed an alternative approach that is based on nonparametric enrichment analysis. Given a matrix of gene expression values where the samples belong to two conditions (e.g. disease and control), a list of genes $D=\{g_1,g_2,...,g_n\}$ for which influential regulators should be found, and a collection of regulatortarget interactions (RTIs), REGGAE measures the effect of transcriptional regulators by performing the following steps:
Step 1: Calculating the influence of regulators for every target gene
For each deregulated gene $g_i \in D$, our collection of RTIs contains a list of regulators $R_{g_i}=\{r_{i1}, r_{i_2}, ...\}$ that may influence its expression. For every regulatortarget pair, we calculate the correlation between the two variables in order to estimate the influence of the regulator. Each gene regulator list $R_{g_i}$ is now sorted with respect to this influence (cf. Figure 1A).
Step 2: Creating the sorted regulator list
Based on a sorted list $D=\{g_1,g_2,...,g_n\}$ and the associated regulator list $R_{g_i}=\{r_{i1}, r_{i_2}, ...\}$, we create a new list $L=\{r_{11}, r_{12}, ..., r_{n1}, r_{n2}, ...\}$ that sorts the involved regulators in a columnwise fashion (cf. Transition Figure 1A > Figure 1B).
Step 3: Enrichment analysis
Please note that each regulator can appear up to $n$ times in list $L$ and we assume that the most influential ones are enriched at the top of the list. Hence, for each regulator in our RTI database, we carry out an enrichment analysis on by using either the Wilcoxon ranksum test [5] or the unweighted version of the KolmogorovSmirnov test [2] (cf. Figure 1C). Finally, all regulators are sorted with respect to their adjusted pvalues.

The Wilcoxon ranksum test [5] is a nonparametric alternative to the independent twosample ttest which is based solely on the order of the values in the two samples. It can be used to test if two samples are drawn from populations with the same underlying distribution. The test statistic can be obtained as follows:
The results of the two groups $X = (x_{1},x_{2},\ldots,x_{n})$ and $Y = (y_{1},y_{2},\ldots,y_{m})$ are combined and sorted increasingly. Each element in the sorted list receives its rank as new value. In case multiple entries have the same score, the mean of the available rank numbers is assigned to all of them. Based on this information the test statistic is defined as:
$$W = \sum_{i=1}^{n}R(x_{i}) $$$R(x)$ is the rank of value $x_i$ in the sorted and pooled list of values.
For $n > 25$ and $m > 25$, $W_{m,n}$ is approximately normally distributed [5].
The Zscore in this case is:
$$Z = \frac{W  m_W}{s_W}$$ $$m_W = \frac{n(n + m +1)}{2}$$ $$s_W = \sqrt{\frac{n \cdot m (n + m + 1)}{12}}$$A pvalue for the $Z$score can be derived from the standard normal distribution. If a normal approximation is not possible, the pvalues for test statistic $W$ can be looked up in a table [6].

KolmogorovSmirnov test a nonparametric hypothesis test, which is based solely on the order of an input list L. Focusing on ranks rather than on the absolute value has the advantage that the method is more robust and can penalize outliers, which might otherwise have a big influence on the results. We use this test to check if the targets of a transcription factor are randomly distributed or accumulated on top or bottom of the list [7].
The test statistic for a transcriptional regulator can be defined as a running sum statistic based on its targets occurring in a sorted list $L = \{l_{1},l_{2},\ldots,l_{n}\}$.
$$ RS(i)=\begin{cases} 0 &, \text{if }i=0\\ RS(i1) + n  j &, \text{if }l_i \text{ belongs to C}\\ RS(i1) j &, \text{if }l_i \text{ does not belong to C} \end{cases} $$The test statistic $RS_C$ is the maximal deviation from 0 of $RS(i)$.
$RS_C = \max\limits_{1 \le i \le n} \{RS(i)\}$$A huge advantage of the unweighted GSEA is that an exact pvalue for test statistic $RS_C$ can be computed via the dynamic programming algorithm presented in the next section.
Keller et al. [8] showed that a pvalue for GSEA can be computed as the probability that a random running sum reaches a maximal deviation greater or equal than $RS_C$.
The authors developed an algorithm, which is able to compute the number of running sum statistics that are smaller than $RS_C$. This can be used to compute an exact pvalue for $RS_C$ based on the complement of this event.
$p = 1\frac{X}{Y},$where X is the number of running sum statistics with a maximum deviation of at most $RS1$ and Y is the number of all possible running sum statistics. Y can be computed as $\binom{n}{j}$, where n is the length of list L and j is the number of entries that belong to category C.
The value of X can be calculated via a dynamic programming algorithm. The algorithm iterates over a matrix M with $(j+1) \times (n  j +1)$ entries. $M(i,k)$ is the number of running sum statistics with i members of the considered category and k nonmembers whose maximum deviation of zero is less than $RS_C1$.
The matrix is initialized with:
$$ M(i,0)=\begin{cases} 1&, \text{if } i\cdot(nk) < RS_C\\ 0& \text{else} \end{cases} $$ $$ M(0,j)=\begin{cases} 1&, \text{if } j\cdot k< RS_C\\ 0& \text{else} \end{cases} $$The value for X can be found via dynamic programming over matrix M for $i=0$ to $j$ and for $k=0$ to $nj$:
$$ M(i,k)=\begin{cases} M(i1,k) + M (i,k1)&, \text{if } (*)\\ 0& \text{else} \end{cases}, $$where (*) is $ i\cdot(nj)  k \cdot j < RS_C$ for an uppertailed pvalue and $RS_C < i\cdot(nj)  k \cdot j$ for a lowertailed pvalue.
Step 4 (optional): Bootstrapping
In order to account for technical noise that might influence gene expression values, correlation coefficients and, ultimately, the order of the regulators, we propose to perform $B \in [1000,10000]$ bootstrap replications. In each replication run a new test statistic is calculated. In the end an exact pvalue is calculated for the median value in the test statistic list.
Bibliography
 Transcription factor regulation can be accurately predicted from the presence of target gene signatures in microarray gene expression data Nucleic acids research Oxford Univ Press
 GeneTrail—advanced gene set enrichment analysis Nucleic acids research Oxford Univ Press (View online)
 DCGL v2. 0: an R package for unveiling differential regulation from differential coexpression PLoS One Public Library of Science
 Regulatory impact factors: unraveling the transcriptional regulation of complex traits from expression data Bioinformatics Oxford Univ Press
 Taschenbuch der Statistik Harri Deutsch Verlag (View online)
 100 statistical tests Sage (View online)
 Nonparametric statistical methods John Wiley and Sons
 Computation of significance scores of unweighted Gene Set Enrichment Analyses BMC Bioinformatics (View online)