APEG: Affinity Prediction by Epigenome and Genome

 

Description

The program uses a biophysical model to analyze transcription factor (TF)-DNA binding data, such as ChIP-seq data, by incorporating epigenomic modifications and genome sequence data. The model can learn synergistic and antagonistic interactions between specific TFs and epigenomic modifications from genome-wide TF binding and epigenomic data.

 

Download

Source code, Example data

 

Installation

The program requires the GNU Scientific Library (GSL). If it is not installed on your system, get it from http://www.gnu.org/software/gsl/. After installing GSL, add its library directory to LD_LIBRARY_PATH in the start-up script of your shell, e.g., .bash_profile in your home directory if you are using bash. For example, if the GSL library directory is /raid/apps/gsl-1.15/lib:

 

LD_LIBRARY_PATH=/raid/apps/gsl-1.15/lib:$LD_LIBRARY_PATH

export LD_LIBRARY_PATH

 

After extracting the program, change the GSL directory in src/Makefile, e.g.:

GSL_DIR = my_gsl_dir

 

Then simply type:

make

 

Run the program

 

Usage:

 ./seq2binding -s <seqFile> -d <dataFile> -m <motifFile> -nep <number of epigenomic marks> -ep <epiFile1> ( <epiFile2> ..)

 

If you have multiple epifiles, list them after the -ep option, separated by spaces.
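
As an illustration of the usage line above, the following minimal Python sketch assembles the argument list for a run with several epifiles. The helper name build_command and all file names are hypothetical, and the sketch assumes that each epifile holds one epigenomic mark, so that -nep equals the number of files.

```python
from typing import List, Sequence


def build_command(seq_file: str, data_file: str, motif_file: str,
                  epi_files: Sequence[str],
                  program: str = "./seq2binding") -> List[str]:
    """Assemble the argument list following the usage line above."""
    cmd = [program, "-s", seq_file, "-d", data_file, "-m", motif_file,
           # Assumption: one epigenomic mark per epifile.
           "-nep", str(len(epi_files)), "-ep"]
    cmd.extend(epi_files)  # epifiles follow -ep, separated by spaces
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_command("train.fa", "train.txt", "nanog.motif",
                    ["h3k4me3.wig", "h3k27ac.wig"])
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to launch the program from a script.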

 

The program takes the following input files. See example dataset for reference.

seqFile: the sequences in FASTA format.

>chr1:136351629-136351631

gtggtgatgcccaaccacagaattattttgttgctactttataactgtaattttgatcct

>chr3:122137593-122137593

atttctagttccagtgactgggagactgaaacaagagagtcacttgagtacaggagtgca

 

dataFile: the binding data in bed format, with one line for every sequence in seqFile. The first column is the sequence id (which must match the ids used in seqFile, in the same order), and the second column is the measured binding strength. The columns are separated by a tab, shown here as [tab].

chr1:136351629-136351631[tab]312

chr3:122137593-122137593[tab]307
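
Because the ids in dataFile must match those in seqFile and appear in the same order, it can help to check the two files before a run. The following Python sketch is one way to do that; it is not part of APEG, the function names are hypothetical, and it assumes the text after ">" on each FASTA header line is the sequence id.

```python
from typing import List, Sequence


def fasta_ids(lines: Sequence[str]) -> List[str]:
    """Return the sequence ids, i.e. the text after '>' on header lines."""
    return [ln.strip()[1:].strip() for ln in lines
            if ln.strip().startswith(">")]


def data_ids(lines: Sequence[str]) -> List[str]:
    """Return the first tab-separated column of each non-empty line."""
    return [ln.split("\t")[0].strip() for ln in lines if ln.strip()]


def check_same_order(fasta_lines: Sequence[str],
                     data_lines: Sequence[str]) -> List[str]:
    """List the mismatches found; an empty list means the files agree."""
    seq, dat = fasta_ids(fasta_lines), data_ids(data_lines)
    problems = []
    if len(seq) != len(dat):
        problems.append(f"{len(seq)} sequences but {len(dat)} data rows")
    for i, (s, d) in enumerate(zip(seq, dat), start=1):
        if s != d:
            problems.append(f"entry {i}: seqFile has {s!r}, dataFile has {d!r}")
    return problems


# The two example records shown above, with [tab] written as a real tab.
fasta = [">chr1:136351629-136351631", "gtggtgatgcccaacc",
         ">chr3:122137593-122137593", "atttctagttccagtg"]
data = ["chr1:136351629-136351631\t312", "chr3:122137593-122137593\t307"]
print(check_same_order(fasta, data))  # prints []
```

For real files, pass open(path).readlines() for each argument.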

 

motifFile: the motif of the TF. The header line consists of the motif name, length, and pseudocount (0.5 should be OK for most motifs). Each following line gives the four nucleotide counts at one motif position.

 

>Nanog  9       0.5

20      225     46      209

70      0       19      411

50      66      381     3

434     45      0       21

55      5       66      374

17      32      222     229

74      18      325     83

8       243     146     103

48      145     6       301
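
To show what the pseudocount in the header is for, the sketch below reads a motif in the layout above and turns the counts into per-position frequencies. This page does not state how APEG applies the pseudocount; the sketch assumes it is added to every count before each row is normalized, which keeps zero counts from becoming zero frequencies. The function names are hypothetical, and no nucleotide order is assumed for the columns.

```python
from typing import List, Tuple


def parse_motif(text: str) -> Tuple[str, int, float, List[List[float]]]:
    """Read the header (name, length, pseudocount) and the count rows."""
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, count_rows = rows[0], rows[1:]
    name = header[0].lstrip(">")
    length, pseudocount = int(header[1]), float(header[2])
    counts = [[float(x) for x in row] for row in count_rows]
    if len(counts) != length:
        raise ValueError(f"header says {length} rows, found {len(counts)}")
    return name, length, pseudocount, counts


def to_frequencies(counts: List[List[float]],
                   pseudocount: float) -> List[List[float]]:
    """Assumption: add the pseudocount to each cell, then normalize rows."""
    freqs = []
    for row in counts:
        total = sum(row) + pseudocount * len(row)
        freqs.append([(c + pseudocount) / total for c in row])
    return freqs


# First three rows of the Nanog example, with the length adjusted to match.
motif_text = """>Nanog  3       0.5
20      225     46      209
70      0       19      411
50      66      381     3
"""
name, length, pseudocount, counts = parse_motif(motif_text)
freqs = to_frequencies(counts, pseudocount)
print(name, length, pseudocount)
print([round(f, 4) for f in freqs[1]])
```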

 

epiFile: the epigenomic mark data in wig format. The file should cover the whole genome; otherwise, the program will not be able to find the epigenomic data needed to annotate the binding sites.

 

track name=mouse

fixedStep chrom=chr1 start=0 step=25

0.00

0.00

1.50

1.50

1.50

0.00

0.00

0.00
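
The sketch below illustrates how a value in a fixedStep wig file such as the one above maps to a genomic position. It is not APEG's own reader. It assumes that each value covers the whole step-wide bin starting at start, it takes the start value in the example literally, and it ignores any span field; the function names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Block = Tuple[int, int, List[float]]  # (start, step, values)


def parse_fixed_step(lines: Sequence[str]) -> Dict[str, List[Block]]:
    """Group the values of each fixedStep declaration by chromosome."""
    blocks: Dict[str, List[Block]] = {}
    current: Optional[List[float]] = None
    for ln in lines:
        ln = ln.strip()
        if not ln or ln.startswith("track"):
            continue
        if ln.startswith("fixedStep"):
            fields = dict(f.split("=", 1) for f in ln.split()[1:])
            current = []
            blocks.setdefault(fields["chrom"], []).append(
                (int(fields["start"]), int(fields.get("step", "1")), current))
        elif current is not None:
            current.append(float(ln))
    return blocks


def value_at(blocks: Dict[str, List[Block]], chrom: str,
             pos: int) -> Optional[float]:
    """Return the value of the bin containing pos, or None if uncovered."""
    for start, step, values in blocks.get(chrom, []):
        # Assumption: value i covers [start + i*step, start + (i+1)*step).
        if pos >= start and (pos - start) // step < len(values):
            return values[(pos - start) // step]
    return None


wig = ["track name=mouse", "fixedStep chrom=chr1 start=0 step=25",
       "0.00", "0.00", "1.50", "1.50", "1.50", "0.00", "0.00", "0.00"]
blocks = parse_fixed_step(wig)
print(value_at(blocks, "chr1", 60))   # third bin (positions 50 to 74)
print(value_at(blocks, "chr2", 60))   # chromosome not in the file
```

A None result corresponds to the situation described above, where the file does not cover a binding site.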

 

Output:

(1) Estimated parameters: the binding parameter (how strongly the TF binds to its binding site) and the interaction parameters between the TF and the epigenomic marks. An interaction parameter is greater than 1 for a favorable interaction, less than 1 for an unfavorable interaction, and equal to 1 for no interaction.

(2) The Pearson correlation between predicted and observed binding.
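
For reference, the Pearson correlation can be recomputed from a list of predicted and observed binding values with the standard formula, as in the short Python sketch below. The numbers are made up for illustration, and the sketch makes no claim about how APEG computes the statistic internally.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Made-up values, for illustration only.
observed = [312.0, 307.0, 150.0, 95.0]
predicted = [290.0, 310.0, 170.0, 80.0]
r = pearson(predicted, observed)
print(f"Pearson r = {r:.3f}")
```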

 

Advanced options

-ts <testSeqFile>: test the trained model on additional testing data. The format of testSeqFile is the same as seqFile.

-td <testDataFile>: test the trained model on additional testing data. The format of testDataFile is the same as dataFile.

-p <trainPredictionFile>: print the predicted binding intensities (of the training sequences in seqFile) to the file trainPredictionFile.

-tp <testPredictionFile>: print the predicted binding intensities (of the testing sequences in testSeqFile) to the file testPredictionFile.

 

 

Contact Chieh-Chun Chen (cchen63 AT illinois DOT edu) with any problems concerning APEG.