Sequence To Affinity Prediction (STAP)

Description

The program uses a biophysical model to analyze transcription factor (TF)-DNA binding data, such as ChIP-chip or ChIP-seq data. It assumes that the affinity of a sequence to a TF (TF_exp), as measured in a ChIP-chip or ChIP-seq experiment, is determined by: 1) the number and strength of the TF_exp binding sites in the sequence; 2) the presence of nearby sites of other factors that may interact cooperatively with the TF_exp sites. Specifically, it takes as input a set of DNA sequences, their experimentally measured binding affinities to TF_exp, and the position weight matrices (PWMs) of a set of TFs, including TF_exp. It then learns the parameters of the biophysical model, including those of TF-DNA interaction and those of TF-TF cooperative interaction. The program can be used for several purposes:

(1) Test whether a given TF binding motif can predict the binding affinities of the sequences. The program predicts the binding sites based on this motif and computes the theoretical binding affinities of the sequences. The predicted values are then compared with the observations to judge the success of the model.

(2) When multiple motifs are given as input, the program assumes that the first motif is that of the experimental TF (TF_exp) and that the rest are motifs of factors that may interact cooperatively with TF_exp (meaning that adjacent sites of the other factors can facilitate DNA binding of TF_exp). The program will learn which motifs are likely to interact cooperatively with TF_exp.

(3) Once a biophysical model is learned, it can be applied to predict the affinities of sequences not used in training the model. This would be useful, for example, for analyzing sequences in a different organism.

Downloads:
Source Code, Manual, NanogMotif

Installation

The program requires the GNU Scientific Library (GSL). If it is not installed on your system, download it from: http://www.gnu.org/software/gsl/

After installing GSL, add its library directory to LD_LIBRARY_PATH in the start-up script of your shell, e.g., .bash_profile in your home directory if you are using bash. For example, if the GSL libraries are installed in /raid/apps/gsl-1.8/lib, add:

LD_LIBRARY_PATH=/raid/apps/gsl-1.8/lib:$LD_LIBRARY_PATH

export LD_LIBRARY_PATH

After extracting the program, change the GSL directory in src/Makefile, e.g.:

GSL_DIR = my_gsl_dir

Then simply type:

make

Running the program

Usage: ./seq2binding -s <seqFile> -d <dataFile> -m <motifFile>

The program takes three arguments as input:

seqFile: the sequences, in FASTA format. See examples/Nanog_top_500.fa.

>chr1:136351629-136351631 136351630 -250 +250

gtggtgatgcccaaccacagaattattttgttgctactttataactgtaattttgatcct

>chr3:122137593-122137593 122137593 -250 +250

atttctagttccagtgactgggagactgaaacaagagagtcacttgagtacaggagtgca

dataFile: the binding data of all sequences in seqFile. The first column is the sequence ID (the IDs must be the same as those used in seqFile, and in the same order), and the second column is the measured binding strength. See examples/Nanog_top_500.txt.

chr1:136351629-136351631 312

chr3:122137593-122137593 307

motifFile: the motifs of the TFs. The file may contain multiple motifs. The header line of each motif consists of the motif name, length, and pseudocount (0.5 should be OK for most motifs). The first motif should be that of TF_exp, and the rest should be the motifs of TFs that putatively interact cooperatively with TF_exp. See examples/Nanog_Oct4_Sox2.wtmx and examples/Nanog.wtmx.

>Nanog 9 0.5

20 225 46 209

70 0 19 411

50 66 381 3

434 45 0 21

55 5 66 374

17 32 222 229

74 18 325 83

8 243 146 103

48 145 6 301
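As a rough illustration of the motifFile layout described above, the following Python sketch reads motif records and converts one row of counts to frequencies. It is not part of the program: the function names are hypothetical, and the way the pseudocount is applied (added to every cell before normalizing the row) is an assumption made for illustration, since the exact convention is not stated here.

```python
def parse_wtmx(lines):
    """Parse motif records: a '>name length pseudocount' header, then one row of four counts per position."""
    motifs = []  # a list keeps the file order, so the first entry is TF_exp
    current = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0].startswith(">"):
            current = {
                "name": fields[0][1:],
                "length": int(fields[1]),
                "pseudocount": float(fields[2]),
                "counts": [],
            }
            motifs.append(current)
        elif current is not None and len(fields) == 4:
            current["counts"].append([float(v) for v in fields])
        # any other line (not a header, not four fields) is ignored
    for m in motifs:
        if len(m["counts"]) != m["length"]:
            raise ValueError(
                f"{m['name']}: expected {m['length']} rows, found {len(m['counts'])}"
            )
    return motifs


def row_frequencies(counts, pseudocount):
    """ASSUMPTION: add the pseudocount to every cell, then normalize the row to sum to 1."""
    total = sum(counts) + pseudocount * len(counts)
    return [(c + pseudocount) / total for c in counts]


# The Nanog motif shown above.
nanog_lines = """>Nanog 9 0.5
20 225 46 209
70 0 19 411
50 66 381 3
434 45 0 21
55 5 66 374
17 32 222 229
74 18 325 83
8 243 146 103
48 145 6 301
""".splitlines()

motifs = parse_wtmx(nanog_lines)
first = motifs[0]
print(first["name"], first["length"], first["pseudocount"])
print([round(p, 3) for p in row_frequencies(first["counts"][0], first["pseudocount"])])
```

A check like this can catch a header whose stated length does not match the number of rows before the file is passed to the program.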

Output:

(1) Estimated parameters: the binding parameter of each TF (how strongly the TF binds to its binding sites) and the interaction parameter for each pair of TFs, reported as a matrix in which the motifs follow the order defined in motifFile. An interaction parameter is greater than 1 for a favorable interaction, less than 1 for an unfavorable interaction, and equal to 1 for no interaction.

(2) Pearson correlation between predicted binding and observed binding.
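For readers who want to recompute the reported correlation from their own predicted and observed values, here is a minimal Python sketch of the standard Pearson correlation coefficient. The function name and the sample numbers are hypothetical and are not produced by the program.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values, for illustration only.
observed = [312.0, 307.0, 250.0, 198.0]
predicted = [0.91, 0.85, 0.60, 0.58]
print(round(pearson(predicted, observed), 3))
```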

Examples:

(1) Run with a single factor: test whether the provided Nanog motif explains the binding of the top 500 Nanog sequences in the experiment (run from the examples/ directory):

../src/seq2binding -m Nanog.wtmx -s Nanog_top_500.fa -d Nanog_top_500.txt

(2) Run with multiple factors (TF_exp and other factors): test whether Nanog interacts cooperatively with Oct4 and Sox2 in the top 500 Nanog sequences (run from the examples/ directory):

../src/seq2binding -m Nanog_Oct4_Sox2.wtmx -s Nanog_top_500.fa -d Nanog_top_500.txt
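Because dataFile must list the same sequence IDs as seqFile, in the same order, it may be worth checking the two files before running either example. The Python sketch below is one way to do so. It assumes that the ID of a FASTA record is the first whitespace-delimited token of its header line, which matches the sample files shown earlier but is not confirmed as the program's own rule. All function names are hypothetical.

```python
def fasta_ids(lines):
    """Return the ID (first token after '>') of each FASTA record, in file order."""
    ids = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    return ids


def data_ids(lines):
    """Return the first column of each non-empty dataFile line, in file order."""
    return [line.split()[0] for line in lines if line.strip()]


def first_mismatch(seq_lines, data_lines):
    """Return None if the IDs agree in order, else the index of the first disagreement."""
    s, d = fasta_ids(seq_lines), data_ids(data_lines)
    for i, (a, b) in enumerate(zip(s, d)):
        if a != b:
            return i
    if len(s) != len(d):
        return min(len(s), len(d))  # one file has extra records
    return None


# The sample records shown earlier in this document.
seq_lines = [
    ">chr1:136351629-136351631 136351630 -250 +250",
    "gtggtgatgcccaaccacagaattattttgttgctactttataactgtaattttgatcct",
    ">chr3:122137593-122137593 122137593 -250 +250",
    "atttctagttccagtgactgggagactgaaacaagagagtcacttgagtacaggagtgca",
]
data_lines = [
    "chr1:136351629-136351631 312",
    "chr3:122137593-122137593 307",
]
print(first_mismatch(seq_lines, data_lines))
```

To check real files, pass `open(path).readlines()` for each file in place of the sample lists.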

Advanced options

-ts <testSeqFile> -td <testDataFile>: test the trained model on additional test data. The formats of testSeqFile and testDataFile are the same as those of seqFile and dataFile.

-n <nExps>: the number of experiments being analyzed. The default value is 1, i.e., only one experiment (the binding data of one TF) is analyzed. When analyzing the binding data of multiple TFs, set nExps to the number of TFs. In this case, seqFile and dataFile are assumed to contain the concatenation of the data of the multiple factors. Each TF is assumed to have the same number of records, so no explicit delimiter is needed between the data of different TFs.

-cv <K>: K-fold cross-validation; reports the average performance (correlation).

-p <trainPredictionFile>: print the predicted binding intensities (of the training sequences in seqFile) to the file trainPredictionFile.

-co <coopOption>: the option for cooperative binding. 0 - no cooperativity at all; 1 - no self-cooperativity, but hetero-cooperativity; 2 - allow all cooperativities.

-io <interactionOption>: the option for modeling factor-factor interaction. 0 - binary; 1 - linear; 2 - periodic.

-dt <d_max>: the maximum distance of interaction (beyond which there will be no interaction)
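The layout expected by -n (records of several TFs concatenated, with the same number of records per TF) can be pictured with the short Python sketch below. It only illustrates the equal-block assumption described for that option. The function name is hypothetical, and the code is not taken from the program.

```python
def split_by_experiment(records, n_exps):
    """Split concatenated records into n_exps consecutive blocks of equal size."""
    if n_exps < 1:
        raise ValueError("n_exps must be at least 1")
    if len(records) % n_exps != 0:
        raise ValueError("record count is not a multiple of n_exps")
    size = len(records) // n_exps
    return [records[i * size:(i + 1) * size] for i in range(n_exps)]


# Hypothetical IDs: two TFs with three records each, concatenated.
records = ["tf1_a", "tf1_b", "tf1_c", "tf2_a", "tf2_b", "tf2_c"]
for block in split_by_experiment(records, 2):
    print(block)
```

If the total number of records is not a multiple of nExps, the files cannot satisfy the equal-count assumption, which is what the second check reports.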

There are other parameters that control the factor-factor interaction model. The file utils/run_pair.sh contains examples of using these parameters. In most cases, you probably do not need to set these parameters.
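To make the idea behind the -cv option concrete, the following Python sketch shows a generic K-fold split: each record is held out for testing exactly once, and the model is trained on the remaining records. How the program itself assigns records to folds is not described here, so the contiguous folds below are an assumption, and the function name is hypothetical.

```python
def kfold_indices(n, k):
    """Return k (train, test) pairs of index lists covering range(n)."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    base, extra = divmod(n, k)  # the first `extra` folds get one more record
    folds = []
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, test))
        start += size
    return folds


# Ten records, three folds: the test folds have sizes 4, 3 and 3.
for train, test in kfold_indices(10, 3):
    print(len(train), test)
```

The average performance would then be the mean of the per-fold correlations between predicted and observed binding on each held-out fold.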

Utilities

The utils/ directory contains some useful scripts. However, not all of them will run on your system as provided. Some are included only to demonstrate the use of the program.

run_pair.sh: demonstrates the use of the program (for analyzing cooperative interactions using the binding data of two TFs).

create_null_distr.sh: suppose we want to find factors that cooperate with one TF (TF_exp). First run the program using TF_exp and the test motif, and obtain the correlation coefficient (CC) of the test motif. Then run this script to get the null distribution of the CC. The script samples random motifs from a specified collection of motifs and calculates the CC of each random motif.

shuffle_wtmx.pl: randomly shuffles a motif; used by create_null_distr.sh.

split_wtmx.pl: splits a file of many PWMs into multiple files, each of which contains a single motif.
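Once create_null_distr.sh has produced the CCs of the random motifs, one common way to compare the test motif's CC against that null distribution is an empirical p-value. The Python sketch below shows this generic calculation. It is not necessarily the procedure used by the authors, the function name is hypothetical, and the numbers are made up for illustration.

```python
def empirical_p_value(observed_cc, null_ccs):
    """Fraction of null CCs at least as large as the observed CC, with a +1 correction."""
    if not null_ccs:
        raise ValueError("the null distribution is empty")
    exceed = sum(1 for cc in null_ccs if cc >= observed_cc)
    # The +1 in numerator and denominator keeps the estimate away from exactly 0.
    return (exceed + 1) / (len(null_ccs) + 1)


# Hypothetical values: the CC of the test motif and the CCs of four random motifs.
null_ccs = [0.10, 0.20, 0.60, 0.30]
print(empirical_p_value(0.50, null_ccs))
```

A small p-value suggests that the test motif improves the correlation more than random motifs typically do.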

Reference:

A biophysical model for analysis of transcription factor interaction and binding site arrangement from genome-wide binding data
Xin He, Chieh-Chun Chen, Feng Hong, Fang Fang, Saurabh Sinha, Huck-Hui Ng, Sheng Zhong. PLoS ONE, accepted. This paper will be presented at RECOMB Regulatory Genomics 09.

Contact Chieh-Chun Chen (cchen63 AT illinois DOT edu) with any problems or comments concerning STAP.