Regulatory effect analysis

The gene expression patterns in nearly all cell types are altered when environmental factors (e.g. diseases or drug treatment) change. The majority of these alterations are controlled by transcriptional regulators like transcription factors and chromatin modifiers. Hence it is of utmost importance to identify those regulators that might be responsible for the changes in gene expression between two experimental conditions (disease vs. control). Here we describe several methods that can be used for this purpose.

Over-representation analysis (ORA)

In the last few years several methods have been proposed that use over-representation analysis (ORA) to find important regulators. The goal of all these approaches is to identify those regulators that have more targets in a list of deregulated genes than expected by chance. For all methods we assume that we have a regulator with $k$ targets in a test set $T=\{t_1,t_2,...,t_n\}$ and $l$ targets in our collection of regulator-target interactions (RTIs) $R=\{t_1,t_2,...,t_m\}.$ The target genes in the RTI collection form the background set for the ORA.

Hypergeometric test

Essaghir et al. [1] proposed to calculate an upper-tailed p-value based on the hypergeometric test in order to identify transcription factors with significantly more targets than expected:

$$ p_{hyper}=\sum_{i=k}^{n}\frac{\binom{l}{i}\binom{m-l}{n-i}}{\binom{m}{n}} $$

Fisher's exact test

Alternatively, the Fisher's exact test can also be used for this task. This test has the advantage that it can also be used if test set $T$ is not a subset of $R$.

$$ p_{fisher}=\sum_{i=k}^{n}\frac{\binom{n}{i}\binom{m}{l+k-i}}{\binom{m+n}{l+k}} $$

Hypergeometric test or Fisher's exact test

Since it might not always be obvious for the user which of the two tests should be used, we additionally implemented a combined version of the test that was originally presented by Backes et al. [2]. Here, the authors use the hypergeometric test if $T$ is a subset of $R$ and the Fisher's exact test if this is not the case.

Binomial test

Alternatively Yang et. al [3] proposed to use the binomial test.

$$ p_{binom}=\sum_{i=k}^{l}\binom{l}{i}\left(\frac{n}{i}\right)^i\left(1-\frac{n}{m}\right)^{n-i} $$

Regulatory impact factors (RIF1 + RIF2)

Reverter et al. [4] introduced the concept of regulatory impact factors. In their paper they propose two metrics RIF1 and RIF2 that both integrate the change in correlation between regulator and the differentially expressed (DE) target genes, the amount of differential expression as well as the abundance of DE genes. According to the authors, a " RIF analysis assigns an extreme score to those TF that are consistently most differentially co-expressed with the highly abundant and highly DE genes (RIF1), and to those TF with the most altered ability to predict the abundance of DE genes (RIF2) ".

Both metrics are defined as follows [4]:

Assume we are given a transcription factor $i$ with $n$ DE target genes and a gene expression data set with two conditions being compared. Then $a_j$ is the mean expression of the $j$th DE gene, $e1_j$ and $e2_j$ are the averaged expression values of condition $1$ and $2$ respectively, $d_j$ is the estimated differential expression of the $j$th DE gene and $r1_{ij}$ as well as $r2_{ij}$ are the correlation coefficients between the $i$th TF and the $j$th DE gene in conditions $1$ and $2$.

$$ RIF1 = \frac{1}{n} \sum_{j=1}^{n} a_j \times d_j \times (r1_{ij} - r2_{ij}) $$ $$ RIF2 = \frac{1}{n} \sum_{j=1}^{n} \left[ (e1_j \times r1_{ij})^2 - (e2_j \times r2_{ij})^2 \right] $$

Regulator-gene association enrichment (REGGAE) analysis

We recently developed an alternative approach that is based on non-parametric enrichment analysis. Given a matrix of gene expression values where the samples belong to two conditions (e.g. disease and control), a list of genes $D=\{g_1,g_2,...,g_n\}$ for which influential regulators should be found, and a collection of regulator-target interactions (RTIs), REGGAE measures the effect of transcriptional regulators by performing the following steps:

Figure 1. - A general overview of the REGGAE scheme. A: The correlations between all regulator-target pairs are computed and sorted per gene. B: The sorted lists of regulators across genes are now concatenated in a column-wise manner. C: Based on this sorted list, an enrichment analysis is performed.

Step 1: Calculating the influence of regulators for every target gene

For each deregulated gene $g_i \in D$, our collection of RTIs contains a list of regulators $R_{g_i}=\{r_{i1}, r_{i_2}, ...\}$ that may influence its expression. For every regulator-target pair, we calculate the correlation between the two variables in order to estimate the influence of the regulator. Each gene regulator list $R_{g_i}$ is now sorted with respect to this influence (cf. Figure 1A).

Step 2: Creating the sorted regulator list

Based on a sorted list $D=\{g_1,g_2,...,g_n\}$ and the associated regulator list $R_{g_i}=\{r_{i1}, r_{i_2}, ...\}$, we create a new list $L=\{r_{11}, r_{12}, ..., r_{n1}, r_{n2}, ...\}$ that sorts the involved regulators in a column-wise fashion (cf. Transition Figure 1A -> Figure 1B).

Step 3: Enrichment analysis

Please note that each regulator can appear up to $n$ times in list $L$ and we assume that the most influential ones are enriched at the top of the list. Hence, for each regulator in our RTI database, we carry out an enrichment analysis on by using either the Wilcoxon rank-sum test [5] or the unweighted version of the Kolmogorov-Smirnov test [2] (cf. Figure 1C). Finally, all regulators are sorted with respect to their adjusted p-values.

  • The Wilcoxon rank-sum test [5] is a non-parametric alternative to the independent two-sample t-test which is based solely on the order of the values in the two samples. It can be used to test if two samples are drawn from populations with the same underlying distribution. The test statistic can be obtained as follows:

    The results of the two groups $X = (x_{1},x_{2},\ldots,x_{n})$ and $Y = (y_{1},y_{2},\ldots,y_{m})$ are combined and sorted increasingly. Each element in the sorted list receives its rank as new value. In case multiple entries have the same score, the mean of the available rank numbers is assigned to all of them. Based on this information the test statistic is defined as:

    $$W = \sum_{i=1}^{n}R(x_{i}) $$

    $R(x)$ is the rank of value $x_i$ in the sorted and pooled list of values.

    For $n > 25$ and $m > 25$, $W_{m,n}$ is approximately normally distributed [5].

    The Z-score in this case is:

    $$Z = \frac{W - m_W}{s_W}$$ $$m_W = \frac{n(n + m +1)}{2}$$ $$s_W = \sqrt{\frac{n \cdot m (n + m + 1)}{12}}$$

    A p-value for the $Z$-score can be derived from the standard normal distribution. If a normal approximation is not possible, the p-values for test statistic $W$ can be looked up in a table [6].

  • Kolmogorov-Smirnov test a non-parametric hypothesis test, which is based solely on the order of an input list L. Focusing on ranks rather than on the absolute value has the advantage that the method is more robust and can penalize outliers, which might otherwise have a big influence on the results. We use this test to check if the targets of a transcription factor are randomly distributed or accumulated on top or bottom of the list [7].

    The test statistic for a transcriptional regulator can be defined as a running sum statistic based on its targets occurring in a sorted list $L = \{l_{1},l_{2},\ldots,l_{n}\}$.

    $$ RS(i)=\begin{cases} 0 &, \text{if }i=0\\ RS(i-1) + n - j &, \text{if }l_i \text{ belongs to C}\\ RS(i-1)- j &, \text{if }l_i \text{ does not belong to C} \end{cases} $$

    The test statistic $RS_C$ is the maximal deviation from 0 of $RS(i)$.

    $RS_C = \max\limits_{1 \le i \le n} \{|RS(i)|\}$$

    A huge advantage of the unweighted GSEA is that an exact p-value for test statistic $RS_C$ can be computed via the dynamic programming algorithm presented in the next section.

    Keller et al. [8] showed that a p-value for GSEA can be computed as the probability that a random running sum reaches a maximal deviation greater or equal than $RS_C$.

    The authors developed an algorithm, which is able to compute the number of running sum statistics that are smaller than $RS_C$. This can be used to compute an exact p-value for $RS_C$ based on the complement of this event.

    $p = 1-\frac{X}{Y},$

    where X is the number of running sum statistics with a maximum deviation of at most $RS-1$ and Y is the number of all possible running sum statistics. Y can be computed as $\binom{n}{j}$, where n is the length of list L and j is the number of entries that belong to category C.

    The value of X can be calculated via a dynamic programming algorithm. The algorithm iterates over a matrix M with $(j+1) \times (n - j +1)$ entries. $M(i,k)$ is the number of running sum statistics with i members of the considered category and k non-members whose maximum deviation of zero is less than $RS_C-1$.

    The matrix is initialized with:

    $$ M(i,0)=\begin{cases} 1&, \text{if } i\cdot(n-k) < |RS_C|\\ 0& \text{else} \end{cases} $$ $$ M(0,j)=\begin{cases} 1&, \text{if } j\cdot k< |RS_C|\\ 0& \text{else} \end{cases} $$

    The value for X can be found via dynamic programming over matrix M for $i=0$ to $j$ and for $k=0$ to $n-j$:

    $$ M(i,k)=\begin{cases} M(i-1,k) + M (i,k-1)&, \text{if } (*)\\ 0& \text{else} \end{cases}, $$

    where (*) is $ i\cdot(n-j) - k \cdot j < |RS_C|$ for an upper-tailed p-value and $-|RS_C| < i\cdot(n-j) - k \cdot j$ for a lower-tailed p-value.

Step 4 (optional): Bootstrapping

In order to account for technical noise that might influence gene expression values, correlation coefficients and, ultimately, the order of the regulators, we propose to perform $B \in [1000,10000]$ bootstrap replications. In each replication run a new test statistic is calculated. In the end an exact p-value is calculated for the median value in the test statistic list.


  1. Essaghir, Ahmed and Toffalini, Federica and Knoops, Laurent and Kallin, Anders and van Helden, Jacques and Demoulin, Jean-Baptiste Transcription factor regulation can be accurately predicted from the presence of target gene signatures in microarray gene expression data Nucleic acids research Oxford Univ Press
  2. Backes, Christina and Keller, Andreas and Kuentzer, Jan and Kneissl, Benny and Comtesse, Nicole and Elnakady, Yasser A and Müller, Rolf and Meese, Eckart and Lenhof, Hans-Peter GeneTrail—advanced gene set enrichment analysis Nucleic acids research Oxford Univ Press (View online)
  3. Yang, Jing and Yu, Hui and Liu, Bao-Hong and Zhao, Zhongming and Liu, Lei and Ma, Liang-Xiao and Li, Yi-Xue and Li, Yuan-Yuan DCGL v2. 0: an R package for unveiling differential regulation from differential co-expression PLoS One Public Library of Science
  4. Reverter, Antonio and Hudson, Nicholas J and Nagaraj, Shivashankar H and Perez-Enciso, Miguel and Dalrymple, Brian P Regulatory impact factors: unraveling the transcriptional regulation of complex traits from expression data Bioinformatics Oxford Univ Press
  5. Rinne, Horst Taschenbuch der Statistik Harri Deutsch Verlag (View online)
  6. Gopal K Kanji 100 statistical tests Sage (View online)
  7. Hollander, Myles and Wolfe, Douglas A and Chicken, Eric Nonparametric statistical methods John Wiley and Sons
  8. Keller, A. and Backes, C. and Lenhof, H. P. Computation of significance scores of unweighted Gene Set Enrichment Analyses BMC Bioinformatics (View online)