## Regulatory effect analysis

The gene expression patterns in nearly all cell types are altered when environmental factors (e.g. diseases or drug treatment) change. The majority of these alterations are controlled by transcriptional regulators like transcription factors and chromatin modifiers. Hence it is of utmost importance to identify those regulators that might be responsible for the changes in gene expression between two experimental conditions (disease vs. control). Here we describe several methods that can be used for this purpose.

### Over-representation analysis (ORA)

In the last few years several methods have been proposed that use over-representation analysis (ORA) to find important regulators. The goal of all these approaches is to identify those regulators that have more targets in a list of deregulated genes than expected by chance. For all methods we assume that we have a regulator with $k$ targets in a test set $T=\{t_1,t_2,...,t_n\}$ and $l$ targets in our collection of regulator-target interactions (RTIs) $R=\{t_1,t_2,...,t_m\}.$ The target genes in the RTI collection form the background set for the ORA.

#### Hypergeometric test

Essaghir et al.  proposed to calculate an upper-tailed p-value based on the hypergeometric test in order to identify transcription factors with significantly more targets than expected:

$$p_{hyper}=\sum_{i=k}^{n}\frac{\binom{l}{i}\binom{m-l}{n-i}}{\binom{m}{n}}$$

#### Fisher's exact test

Alternatively, the Fisher's exact test can also be used for this task. This test has the advantage that it can also be used if test set $T$ is not a subset of $R$.

$$p_{fisher}=\sum_{i=k}^{n}\frac{\binom{n}{i}\binom{m}{l+k-i}}{\binom{m+n}{l+k}}$$

#### Hypergeometric test or Fisher's exact test

Since it might not always be obvious for the user which of the two tests should be used, we additionally implemented a combined version of the test that was originally presented by Backes et al. . Here, the authors use the hypergeometric test if $T$ is a subset of $R$ and the Fisher's exact test if this is not the case.

#### Binomial test

Alternatively Yang et. al  proposed to use the binomial test.

$$p_{binom}=\sum_{i=k}^{l}\binom{l}{i}\left(\frac{n}{i}\right)^i\left(1-\frac{n}{m}\right)^{n-i}$$

### Regulatory impact factors (RIF1 + RIF2)

Reverter et al.  introduced the concept of regulatory impact factors. In their paper they propose two metrics RIF1 and RIF2 that both integrate the change in correlation between regulator and the differentially expressed (DE) target genes, the amount of differential expression as well as the abundance of DE genes. According to the authors, a " RIF analysis assigns an extreme score to those TF that are consistently most differentially co-expressed with the highly abundant and highly DE genes (RIF1), and to those TF with the most altered ability to predict the abundance of DE genes (RIF2) ".

Both metrics are defined as follows :

Assume we are given a transcription factor $i$ with $n$ DE target genes and a gene expression data set with two conditions being compared. Then $a_j$ is the mean expression of the $j$th DE gene, $e1_j$ and $e2_j$ are the averaged expression values of condition $1$ and $2$ respectively, $d_j$ is the estimated differential expression of the $j$th DE gene and $r1_{ij}$ as well as $r2_{ij}$ are the correlation coefficients between the $i$th TF and the $j$th DE gene in conditions $1$ and $2$.

$$RIF1 = \frac{1}{n} \sum_{j=1}^{n} a_j \times d_j \times (r1_{ij} - r2_{ij})$$ $$RIF2 = \frac{1}{n} \sum_{j=1}^{n} \left[ (e1_j \times r1_{ij})^2 - (e2_j \times r2_{ij})^2 \right]$$

### Regulator-gene association enrichment (REGGAE) analysis

We recently developed an alternative approach that is based on non-parametric enrichment analysis. Given a matrix of gene expression values where the samples belong to two conditions (e.g. disease and control), a list of genes $D=\{g_1,g_2,...,g_n\}$ for which influential regulators should be found, and a collection of regulator-target interactions (RTIs), REGGAE measures the effect of transcriptional regulators by performing the following steps: Figure 1. - A general overview of the REGGAE scheme. A: The correlations between all regulator-target pairs are computed and sorted per gene. B: The sorted lists of regulators across genes are now concatenated in a column-wise manner. C: Based on this sorted list, an enrichment analysis is performed.

#### Step 1: Calculating the influence of regulators for every target gene

For each deregulated gene $g_i \in D$, our collection of RTIs contains a list of regulators $R_{g_i}=\{r_{i1}, r_{i_2}, ...\}$ that may influence its expression. For every regulator-target pair, we calculate the correlation between the two variables in order to estimate the influence of the regulator. Each gene regulator list $R_{g_i}$ is now sorted with respect to this influence (cf. Figure 1A).

#### Step 2: Creating the sorted regulator list

Based on a sorted list $D=\{g_1,g_2,...,g_n\}$ and the associated regulator list $R_{g_i}=\{r_{i1}, r_{i_2}, ...\}$, we create a new list $L=\{r_{11}, r_{12}, ..., r_{n1}, r_{n2}, ...\}$ that sorts the involved regulators in a column-wise fashion (cf. Transition Figure 1A -> Figure 1B).

#### Step 3: Enrichment analysis

Please note that each regulator can appear up to $n$ times in list $L$ and we assume that the most influential ones are enriched at the top of the list. Hence, for each regulator in our RTI database, we carry out an enrichment analysis on by using either the Wilcoxon rank-sum test  or the unweighted version of the Kolmogorov-Smirnov test  (cf. Figure 1C). Finally, all regulators are sorted with respect to their adjusted p-values.

• The Wilcoxon rank-sum test  is a non-parametric alternative to the independent two-sample t-test which is based solely on the order of the values in the two samples. It can be used to test if two samples are drawn from populations with the same underlying distribution. The test statistic can be obtained as follows:

The results of the two groups $X = (x_{1},x_{2},\ldots,x_{n})$ and $Y = (y_{1},y_{2},\ldots,y_{m})$ are combined and sorted increasingly. Each element in the sorted list receives its rank as new value. In case multiple entries have the same score, the mean of the available rank numbers is assigned to all of them. Based on this information the test statistic is defined as:

$$W = \sum_{i=1}^{n}R(x_{i})$$

$R(x)$ is the rank of value $x_i$ in the sorted and pooled list of values.

For $n > 25$ and $m > 25$, $W_{m,n}$ is approximately normally distributed .

The Z-score in this case is:

$$Z = \frac{W - m_W}{s_W}$$ $$m_W = \frac{n(n + m +1)}{2}$$ $$s_W = \sqrt{\frac{n \cdot m (n + m + 1)}{12}}$$

A p-value for the $Z$-score can be derived from the standard normal distribution. If a normal approximation is not possible, the p-values for test statistic $W$ can be looked up in a table .

• Kolmogorov-Smirnov test a non-parametric hypothesis test, which is based solely on the order of an input list L. Focusing on ranks rather than on the absolute value has the advantage that the method is more robust and can penalize outliers, which might otherwise have a big influence on the results. We use this test to check if the targets of a transcription factor are randomly distributed or accumulated on top or bottom of the list .

The test statistic for a transcriptional regulator can be defined as a running sum statistic based on its targets occurring in a sorted list $L = \{l_{1},l_{2},\ldots,l_{n}\}$.

$$RS(i)=\begin{cases} 0 &, \text{if }i=0\\ RS(i-1) + n - j &, \text{if }l_i \text{ belongs to C}\\ RS(i-1)- j &, \text{if }l_i \text{ does not belong to C} \end{cases}$$

The test statistic $RS_C$ is the maximal deviation from 0 of $RS(i)$.

### Bibliography

1. Essaghir, Ahmed and Toffalini, Federica and Knoops, Laurent and Kallin, Anders and van Helden, Jacques and Demoulin, Jean-Baptiste Transcription factor regulation can be accurately predicted from the presence of target gene signatures in microarray gene expression data Nucleic acids research Oxford Univ Press
2. Backes, Christina and Keller, Andreas and Kuentzer, Jan and Kneissl, Benny and Comtesse, Nicole and Elnakady, Yasser A and Müller, Rolf and Meese, Eckart and Lenhof, Hans-Peter GeneTrail—advanced gene set enrichment analysis Nucleic acids research Oxford Univ Press (View online)
3. Yang, Jing and Yu, Hui and Liu, Bao-Hong and Zhao, Zhongming and Liu, Lei and Ma, Liang-Xiao and Li, Yi-Xue and Li, Yuan-Yuan DCGL v2. 0: an R package for unveiling differential regulation from differential co-expression PLoS One Public Library of Science
4. Reverter, Antonio and Hudson, Nicholas J and Nagaraj, Shivashankar H and Perez-Enciso, Miguel and Dalrymple, Brian P Regulatory impact factors: unraveling the transcriptional regulation of complex traits from expression data Bioinformatics Oxford Univ Press
5. Rinne, Horst Taschenbuch der Statistik Harri Deutsch Verlag (View online)
6. Gopal K Kanji 100 statistical tests Sage (View online)
7. Hollander, Myles and Wolfe, Douglas A and Chicken, Eric Nonparametric statistical methods John Wiley and Sons
8. Keller, A. and Backes, C. and Lenhof, H. P. Computation of significance scores of unweighted Gene Set Enrichment Analyses BMC Bioinformatics (View online)