# GeneTrail2 1.5

#### Statistical analysis of molecular signatures

## TEPIC + INVOKE

Epigenetics data contains a wealth of information on gene regulation. It was shown that especially data on open chromatin is well suited to build predictive, linear models of gene-expression [1]. Interpreting these models allows the inference of regulators that may play a key role in gene expression regulation.

Here, we offer an integrated analysis of epigenetics data, i.e. open chromatin data (DNase1-seq, ATAC-seq, NOMe-seq) and gene expression data to determine key transcriptional regulators in the analysed sample.

Note that although incorporating epigenetic data greatly improved the performance of TF binding predictions, both, computing TF binding predictions and linking TFs to genes are still unsolved problems and all predictions should be seen as suggestions and not as the absolute truth.

INVOKE (the IdeNtification Of Key transcriptional regulators using Epigenetics data) comprises two steps that are detailed in the following sections:

- Computation of transcription factor (TF)-gene scores on the basis of epigenetic data using
*TEPIC*. - Learning of a linear regression model to predict gene expression from TF-gene scores computed in step 1.

### Computing transcription factor scores using TEPIC

#### Motivation

The main advantage of considering epigenetics data for the task of TF binding prediction is that the number of false positive predictions can be reduced [2], although not completely eliminated. One way of incorporating epigenetics data is to reduce the genomic search space to a few candidate regions of TF binding. As shown before, genome-wide candidate sites for TF binding can be determined by open-chromatin experiments [3], [4], [5], [6], e.g. peaks or footprints in DNase1-seq data and/or by considering histone marks [7], [4], e.g. H3K4me3.

Here, we compute TF affinities for a species specific set of *Position Specific Energy Matrices (PSEM)* using *TRAP*
[8] which is based on a biophysical model of TF binding [9].
The main advantage of affinity based predictions compared to hit-based methods like Fimo [10] is that
low-affinity binding sites can be included [11], [8].
Using the *TEPIC* method, we compute TF gene scores by aggregating TF predictions calculated for a user defined set of candidate regions.
The scores, either per peak/region or gene, can be interpreted as a quantitative measurement of TF binding.

#### Computing TF gene scores

Currently, we offer the annotation of five different species, including the most common model organisms:
*Homo sapiens*, *Mus musculus*, *Rattus norvegicus*, *Drosophila melanogaster*, and *Caenorhabditis elegans*.
Using our collections of species specific *PSEMs*, *TRAP* computes TF binding affinities in all user provided regions
that could be found in the reference genomes of the respective species and
overlap with a window of user defined size $w$ that is centered at the most $5'$ TSS of all annotated genes in the considered organism.
Then, TF gene scores are computed by incorporating all candidate binding sites within the window centered around the $5'$ TSS of genes in the final score.
The contribution of the individual sites is weighted by their distance to the selected TSS with an exponential decay function [12].

Formally, the TF gene score $a_{g,i}$ for gene $g$ and TF $i$ is computed as

$$a_{g,i}^{w}=\sum_{p \in P_{g,w}} {a_{p,i}e^{-\frac{d_{p,g}}{d_0}}}$$where, $a_{p,i}$ is the affinity of TF $i$ in peak $p$, the set $P_{g,x}$ contains all open-chromatin peaks in a window of size $w$ around gene $g$, $d_{p,g}$ is the distance from the center of peak $p$ to the TSS of gene $g$, and $d_0$ is a constant fixed at $5000$bp [12]. The workflow of TEPIC is depicted in Figure 1.

#### Required input

To compute TF gene scores a user needs to specify:

- a reference genome,
- a set of
*PSEMs*, - a set of genomic regions in BED format.

Note that the chromosome identifiers in the BED file must match the identifiers used in the reference genomes, neglecting the *chr* prefix.
Otherwise they can not be considered. Special care should be taken for *Caenorhabditis elegans*, as Roman digits are used for enumeration.

#### Output

This step generates the following output, which is available for download:

- TF affinities for all selected
*PSEMs*in the regions provided by the user that passed the filtering step. - TF gene scores for all selected
*PSEMs*calculated as described above. - A meta data file listing all used parameters.

Note that the interactive view on the website does not contain all the aforementioned results.

### Linear regression to predict gene expression

#### Motivation

In order to learn about potentially important regulators, we build a linear, interpretable regression model,
comparable to methods proposed in [11], [13], [14], [15].
Here, we use TF gene scores computed with *TEPIC* as features in a linear regression setup to predict gene expression.
In such a *per sample* approach, we stick to the simplifying assumption that all genes are regulated similarly.
Features with a high regression coefficient can be suggested to be key regulators in the analysed sample, as they seem to effect the expression
of a large portion of the genes under consideration. However, the results of this method should be seen as suggestions for possible regulators and not as the absolute truth.

Details on the learning setup and on the available regularization methods are provided in the next section.

#### Available regularization methods

We offer three different regularization techniques:

- $\hat{\beta}=\underset{\beta}{\arg\,min} ||y-X\beta||^2 + ||\beta||,$
- $\hat{\beta}=\underset{\beta}{\arg\,min} ||y-X\beta||^2 + ||\beta||^2,$
- $\hat{\beta}=\underset{\beta}{\arg\,min} ||y-X\beta||^2 + \alpha||\beta||^2 + (1-\alpha)||\beta||,$

where, $\beta$ represents the regression coefficient vector, $\hat{\beta}$ represents the estimated coefficients, $X$ is the feature matrix, $y$ is the response vector, and the parameter $\alpha$ controls the distribution between Ridge and Lasso penalty in the elastic net.

Using Lasso regularization, models are sparse and can be learned very fast. But, Lasso can not properly deal with correlated features, e.g. instead of distributing the coefficients among them, only one is selected. Also, Lasso solutions are not stable and therefore should be interpreted with caution. Nevertheless, Lasso regularization is good to get a first impression of model performance.

The disadvantage of Ridge regression is that it can not produce sparse models (many coefficients being exactly 0), which may hinder interpretability.

Elastic net regularization was designed to overcome the limitations of both regularization techniques mentioned above. It resolves the correlation between features by distributing the feature weights among them, and simultaneously leads to sparse and stable models [16]. However, learning a model using elastic net penalty is slower than using either only Lasso or Ridge regularization.

#### Details on the learning setup

The data matrix $X$, containing TF gene scores, and the response vector $y$, containing gene expression values, are log-transformed, with a pseudo-count of $1$, centered and scaled to fit them as. Regression coefficients are computed in a inner cross validation, the $\alpha$ parameter of elastic net regularization is optimized with a default step size of $0.1$.

We offer two ways to use our learning pipeline:

- Learn a model for feature interpretation without computing performance measures: In order to provide a time efficient way of obtaining an interpretable model and to prevent a potential loss of information by considering only a portion of the full data set for model training, the regression coefficients are determined on the entire data set.
- Learn a model for feature interpretation and compute model performance: Nested cross-validation is used to learn the models and to assess their performance. Per default, $20\%$ of the data are used as test data and $80\%$ are used as training data. Model performance is assessed in an outer cross validation. We report the mean pearson correlation, the mean spearman correlation, and the mean squared error over the outer folds as measures of model performance. Additionally, a model is learned on the entire data set as described in (1) for interpretation of the coefficients.

All parameters mentioned in this section can be changed by the user. The learning process is sketched in Figure 2.

#### Required input

In addition to the input required for the computation of TF gene scores in TEPIC, a file containing gene expression data must be provided. This file should be structured such that column $1$ contains the gene identifiers and column $2$ holds expression values. Besides, we support the upload of a matrix containing gene expression data for several samples. In that case, the user has to select the column/sample that should be used for model construction.

#### Output and hints for interpretation

The user is always provided with the following files:

- a list of regression coefficients computed on the entire data set,
- a bar plot showing the regression coefficients with an absolute value $> 0.025$.

The larger a regression coefficient, the stronger is the inferred effect of the corresponding TF on gene expression. Positive coefficients suggest an activating influence of TFs, negative coefficients suggest a inhibiting effect

If model performance was assessed, the following is available in addition:

- a summary on model performance containing the aforementioned measures (pearson correlation, spearman correlation, mean squared error)
- a list of regression coefficients determined in the outer cross validation,
- a heatmap visualizing the regression coefficients determined in the outer cross validation for at most the top $10$ positive and negative features, sorted according to their mean.
- an image showing a box plot for pearson and spearman correlation respectively (for download only).
- scatter plots showing the predicted vs the measured gene expression for each outer cross validation fold (for download only).

The heatmap can be easily used to judge model performance, as it shows the regression coefficients of all outer-cross validation runs. The box plots provide further insights into model performance and stability across the outer folds of the cross validation. Note that the interactive view on the website does not contain all of the aforementioned output files.

### Bibliography

- Combining transcription factor binding affinities with open-chromatin data for accurate gene expression prediction Nucleic Acids Research Oxford Univ Press
- Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data Genome Res.
- Explicit DNase sequence bias modeling enables high-resolution transcription factor footprint detection Nucleic Acids Res.
- Detection of active transcription factor binding sites with the combination of DNase hypersensitivity and histone modifications Bioinformatics
- Epigenetic priors for identifying active transcription factor binding sites Bioinformatics
- Using DNase digestion data to accurately identify transcription factor binding sites Pac Symp Biocomput
- Predicting expression: the complementary power of histone modification and transcription factor binding data Epigenetics Chromatin
- Predicting trancription factor affinities to DNA from a biophysical model Bioinformatics
- On the specificity of DNA-protein interactions Proceedings of the National Academy of Sciences National Acad Sciences
- FIMO: Scanning for occurrences of a given motif Bioinformatics (View online)
- Combining transcription factor binding affinities with open-chromatin data for accurate gene expression prediction Nucleic Acids Res.
- ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells Proc. Natl. Acad. Sci. U.S.A.
- Predicting cell-type-specific gene expression from regions of open chromatin Genome Res.
- Predictive modelling of gene expression from transcriptional regulatory elements Brief. Bioinformatics
- Genome-wide in silico prediction of gene expression Bioinformatics
- Regularization and variable selection via the Elastic Net Journal of the Royal Statistical Society, Series B