Sequence To Affinity Prediction
Description
The program use a biophysical
model to analyzes transcription factor (TF)-DNA binding data, such as ChIP-chip
or ChIPSeq data. The program assumes that the measured affinity of a sequence
to a TF (TF_exp) in some ChIP-chip or
ChIP-seq experiment is determined by: 1) the number and strength of binding
sites of TF_exp in this sequence; 2)
the presence of other sites that may interact cooperatively with the sites of TF_exp in the neighborhood.
Specifically, it takes as input a set of DNA sequences, their binding
affinities to some TF as measured by experiments (TF_exp), and the position weight matrices (PWMs) of a set of TFs,
including TF_exp. It will learn the relevant parameters of the biophysical
model of TF-DNA interaction, including those of TF-DNA interaction and those of
TF-TF cooperative interactions. The program can be used for several
purposes:
(1)
Test if a given TF binding motif can predict the binding affinities of the
sequences. It predicts the binding sites based on this motif, and computes the
theoretical values of the binding affinities of the sequences. The predicted
values will be compared with the observations to judge the sucess of the model.
(2)
When multiple motifs are given as inputs, the program assumes the first motif
is the one of the experimetnal TF (TF_exp),
and the rest are the motifs that may cooperatively interact with TF_exp
(meaning that the adjacent sites of other factors can facilicate DNA binding of
TF_exp). The program will learn which
motifs are likely to interact cooperatively with TF_exp.
(3)
Once a biophysical model is learned, it can be applied to predict affinities of
sequences not used in training the model. This would be useful, for example,
for analyze sequences in a different organism.
Downloads:
Source Code , Manual, NanogMotif
Installation
The
program needs GNU Scientific Library (GSL). If it is not installed in your
system, go to: http://www.gnu.org/software/gsl/
Note
that after installing GSL, you need to change the start-up script of your
shell, e.g., .bash_profile at your home directory if you are using bash.
Suppose the GSL installation directory is /raid/apps/gsl-1.8/lib:
LD_LIBRARY_PATH=/raid/apps/gsl-1.8/lib:$LD_LIBRARY_PATH
export
LD_LIBRARY_PATH
After
extracting the program, change the GSL directory in src/Makefile, e.g.:
GSL_DIR
= my_gsl_dir
Then
simply type:
Make
Running the program
Usage: ./seq2binding -s <seqFile> -d
<dataFile> -m <motifFile>
The
program takes three arguments as input:
seqFile: the FASTA format file of
sequences. See examples/Nanog_top_500.fa.
>chr1:136351629-136351631 136351630 -250 +250
gtggtgatgcccaaccacagaattattttgttgctactttataactgtaattttgatcct
>chr3:122137593-122137593 122137593 -250 +250
atttctagttccagtgactgggagactgaaacaagagagtcacttgagtacaggagtgca
dataFile: the binding data of all
sequences in the seqFile. The first column is the sequence id (must be the same
as those used in seqFile, and in the same order), and the second column is the
measured strength of binding. See examples/Nanog_top_500.txt.
chr1:136351629-136351631 312
chr3:122137593-122137593 307
motifFile: the motifs of the TFs. It could
contain multiple motifs. The header line consists of motif name, length and
pseudocount (0.5 should be OK for most motifs). The first motif should be the
one of TF_exp, and the rest are putative TFs that interact cooperatively with
TF_exp. See examples/Nanog_Oct4_Sox2.wtmx and examples/Nanog.wtmx.
>Nanog
9
0.5
20 225 46 209
70 0 19 411
50 66 381 3
434 45 0 21
55 5 66 374
17 32 222 229
74 18 325 83
8 243 146 103
48 145 6 301
<
Output:
(1) Estimated parameters: binding parameter (how
strongly the TF binds with its binding site); the interaction parameters
between any pair of TFs (the order of motifs in the matrix follows the order defined
in motifFile): greater than 1 if favorable interaction, less than 1
unfavorable, 1 if no interaction.
(2) Pearson correlation between predicted binding
and observed binding.
Examples:
(1) Run with a single factor: test if the provided
Nanog motif explains the binding of top 500 Nanog sequences in experiments:
(under examples/ directory)
../src/seq2binding -m Nanog.wtmx -s Nanog_top_500.fa
-d Nanog_top_500.txt
(2) Run with multiple factors (TF_exp, and other
factors): test if Nanog interacts cooperatively with Oct4 and Sox2 in the top
500 Nanog sequences: (under examples/ directory)
../src/seq2binding -m Nanog_Oct4_Sox2.wtmx -s
Nanog_top_500.fa -d Nanog_top_500.txt
Advanced options
-ts <testSeqFile> -td <testDataFile>:
test the trained model in additional testing data. The format of testSeqFile
and testDatafile is the same as seqFile and dataFile.
-n <nExps>: the number of experiments being
analyzed. The default value is 1, i.e. only one experiment (binding data of one
TF) is analyzed. When analyzing binding data of multiple TFs, set nExps as the
number of TFs. In this case, it is assumed that seqFile and dataFile contain
the concantenation of data of multiple factors (assume the number of records of
each TF is equal, thus no explicit delimiter is needed between data of
different TFs).
-cv <K>: K-fold cross validation, report the
average performance (correlation)
-p <trainPredictionFile>: print the predicted
binding intensities (of the training sequences in seqFile) in the file
trainPredictionFile.
-co coopOption: the option of cooperative binding. 0
- no cooperativity at all; 1 - no self-cooperativity, but hetero-cooperativity;
2 - allow all cooperativities
-io interactionOption: the option for modeling
factor-factor interaction. 0 - binary; 1 - linear; 2 - periodic.
-dt <d_max>: the maximum distance of
interaction (beyond which there will be no interaction)
There
are other parameters that control the factor-factor interaction model. The file
utils/run_pair.sh contains examples of using these parameters. In most cases,
you probably do not need to set these parameters.
Utilities
In
utils/ directory, some useful scripts are included. However, not all of them
can be ready to execute in your system. Some of them are included only for the
purpose of demonstrating the use of program.
run_pair.sh: demonstrate the use of program (for
analyzing cooperative interactions using binding data of two TFs).
create_null_distr.sh: suppose we want to find
cooperative factors of one TF (TF_exp). First run the program using TF_exp and
the test motif, and obtain the correlation coefficient (CC) of the test motif.
Then run this script to get the null distribution of the CC. The script will
sample random motifs from a specified collection of motif, and calculate CC of
the random motifs.
shuffle_wtmx.pl: random shuffling of a motif, used
by create_null_distr.sh.
split_wtmx.pl: split a file of many PWMs into
multiple files, each of which contains a single motif.
Reference:
A biophysical model for analysis of transcription factor
interaction and binding site arrangement from genome-wide binding data
Xin He, Chieh-Chun Chen, Feng Hong, Fang Fang, Saurabh Sinha, Huck-Hui Ng,
Sheng Zhong. PLoS ONE, accepted. This paper will be presented on RECOMB Regulatory Genomics 09.
Contact Chieh-Chun
Chen (cchen63 AT