An Evolution Model of Transcription Networks


 

Reference

Towards an evolutionary model of transcription networks. Dan Xie*, Chieh-Chun Chen*, Xin He, Xiaoyi Cao, Sheng Zhong. PLoS Computational Biology, 7(6): e1002064, 2011.


Downloads
  
   Source Code, Simulation Data, Yeast Data, Mammalian Data

 

Running the program

 

Usage:  ./EvoRgNet -mf miscfile -bs seqfile0 -fs seqfile1 -e cluster [-r realdata]

 

The program takes five input files; the first four are required and realdata is optional:

 

miscfile: This file contains the branch lengths of the phylogenetic tree and the initial probabilities of each cluster being unregulated or regulated, denoted Q and P in the main text. A sample file is shown below.

 

0.02     0.05     0.13     0.13

0.1       0.225   0.225   0.225   0.225

0.1       0.225   0.225   0.225   0.225

0.1       0.225   0.225   0.225   0.225

0.80     0.05     0.05     0.05     0.05

0.80     0.05     0.05     0.05     0.05

0.80     0.05     0.05     0.05     0.05

 

The first line gives the branch lengths of the phylogenetic tree (here, for the mammalian data).

 

The second to fourth lines are the probabilities of the clusters being unregulated by the transcription factor (TF) of interest. Each line represents one species and lists the probability for each of its five expression clusters.

The fifth to seventh lines are the probabilities of the clusters being regulated by the TF of interest. Each line represents one species and lists the probability for each of its five expression clusters. The order of the species should be the same throughout this file. Note that the number of clusters does not need to be the same for all species. All numbers are separated by tabs.
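As a rough illustration of the layout described above, the following Python sketch splits a miscfile into its three parts. The helper name parse_miscfile and the n_species argument are hypothetical and not part of EvoRgNet; the sketch assumes three species, as in the sample file.

```python
# Hypothetical helper (not part of EvoRgNet): split a miscfile into
# branch lengths, Q (unregulated) rows and P (regulated) rows.
MISC_SAMPLE = """\
0.02\t0.05\t0.13\t0.13
0.1\t0.225\t0.225\t0.225\t0.225
0.1\t0.225\t0.225\t0.225\t0.225
0.1\t0.225\t0.225\t0.225\t0.225
0.80\t0.05\t0.05\t0.05\t0.05
0.80\t0.05\t0.05\t0.05\t0.05
0.80\t0.05\t0.05\t0.05\t0.05
"""

def parse_miscfile(text, n_species=3):
    # Skip blank lines; fields are separated by tabs (any whitespace works here).
    rows = [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]
    expected = 1 + 2 * n_species
    if len(rows) != expected:
        raise ValueError(f"expected {expected} lines, got {len(rows)}")
    branch_lengths = rows[0]
    q = rows[1:1 + n_species]   # one row per species: unregulated probabilities
    p = rows[1 + n_species:]    # one row per species: regulated probabilities
    for s in range(n_species):
        # Cluster counts may differ between species, but Q and P must agree
        # for the same species.
        if len(q[s]) != len(p[s]):
            raise ValueError(f"species {s}: Q and P list different numbers of clusters")
    return branch_lengths, q, p

branch_lengths, q, p = parse_miscfile(MISC_SAMPLE)
print(branch_lengths)
print(len(q), len(p))
```

In the sample file each Q row and each P row sums to 1; the sketch does not enforce this.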

 

seqfile0: This file contains the sequence scores under the background distribution. A sample file is shown below.

 

1          1          1

1          1          1

1          1          1

1          1          1

1          1          1

 

Each column represents a species and each row represents a gene. The numbers are separated by tabs.

 

seqfile1: This file contains the sequence scores generated by a hidden Markov model with both motif and background states, as described in the Methods section of the main text. The format is the same as seqfile0. A sample file is shown below.

 

0.222879028843425   0.222879028843425   0.222879028843425

5.10540162563391     3.08836285206605     24.9876539355866

0.222879028843425   1.90894545956532     2.127779957246

2.16253350422447     0.222879028843425   4.4694497289581

0.222879028843425   0.222879028843425   0.222879028843425
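Since seqfile0 and seqfile1 share one format (genes in rows, species in columns), a quick consistency check before running the program might look like the sketch below. The function names read_matrix and shape are hypothetical helpers, not part of EvoRgNet.

```python
# Hypothetical helpers (not part of EvoRgNet): read the two score files
# and confirm they describe the same genes and species.
SEQ0_SAMPLE = "1\t1\t1\n" * 5
SEQ1_SAMPLE = """\
0.222879028843425\t0.222879028843425\t0.222879028843425
5.10540162563391\t3.08836285206605\t24.9876539355866
0.222879028843425\t1.90894545956532\t2.127779957246
2.16253350422447\t0.222879028843425\t4.4694497289581
0.222879028843425\t0.222879028843425\t0.222879028843425
"""

def read_matrix(text):
    # One row per gene, one column per species; blank lines are ignored.
    rows = [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]
    if len({len(r) for r in rows}) > 1:
        raise ValueError("rows have different numbers of columns")
    return rows

def shape(matrix):
    # (number of genes, number of species)
    return (len(matrix), len(matrix[0]) if matrix else 0)

bg = read_matrix(SEQ0_SAMPLE)
fg = read_matrix(SEQ1_SAMPLE)
if shape(bg) != shape(fg):
    raise ValueError(f"shape mismatch: {shape(bg)} vs {shape(fg)}")
print(shape(fg))
```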

 

cluster: This file contains the cluster index of each gene. A sample file is shown below. Each column represents a species and each row represents a gene. Each number is the index of the expression cluster that the gene belongs to in the corresponding species.

 

1          2          1

2          2          1

0          4          2

2          2          3

0          0          4

 

Note that cluster indices start from 0. The order of the clusters in each species should be the same as in the miscfile.
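Because cluster indices start from 0 and must line up with the cluster columns of the miscfile, it can help to check that every index is in range. The sketch below is only an illustration; check_cluster_indices is a hypothetical helper, and clusters_per_species stands for the number of probabilities listed per species in the miscfile.

```python
# Hypothetical helper (not part of EvoRgNet): find cluster indices that
# do not correspond to any cluster column of the miscfile.
CLUSTER_SAMPLE = [
    [1, 2, 1],
    [2, 2, 1],
    [0, 4, 2],
    [2, 2, 3],
    [0, 0, 4],
]

def check_cluster_indices(cluster_rows, clusters_per_species):
    """Return (gene_row, species_column, index) for every out-of-range index."""
    bad = []
    for g, row in enumerate(cluster_rows):
        for s, idx in enumerate(row):
            # Valid indices are 0 .. n-1 for a species with n clusters.
            if not 0 <= idx < clusters_per_species[s]:
                bad.append((g, s, idx))
    return bad

# Five clusters per species, as in the sample miscfile: no problems expected.
all_ok = check_cluster_indices(CLUSTER_SAMPLE, [5, 5, 5])
# If the second species had only four clusters, index 4 would be invalid.
one_bad = check_cluster_indices(CLUSTER_SAMPLE, [5, 4, 5])
print(all_ok, one_bad)
```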

 

realdata: This file contains the tree state of each gene. A sample file is shown below. Each line gives the tree state of one gene.

 

1

8

1

1

6

 

Each state corresponds to a different regulatory pattern across the three species, as shown in the following table, where 0 means unregulated and 1 means regulated.

State      Species 1  Species 2  Species 3
1          0          0          0
2          0          0          1
3          0          1          0
4          0          1          1
5          1          0          0
6          1          0          1
7          1          1          0
8          1          1          1

           

Note that the realdata file is only required if you want to run cross validation to check the performance of the inferred regulatory relationships against known regulatory data.
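The state numbering in the table above follows a binary pattern: state minus 1, written as three bits, gives the regulation flags for Species 1, 2 and 3 in that order. A small sketch of the conversion, with hypothetical function names decode_state and encode_state:

```python
# Hypothetical helpers (not part of EvoRgNet): convert between a tree
# state (1..8) and per-species flags (0 = unregulated, 1 = regulated).
def decode_state(state, n_species=3):
    if not 1 <= state <= 2 ** n_species:
        raise ValueError(f"state must be in 1..{2 ** n_species}")
    bits = state - 1
    # Species 1 is the most significant bit, Species 3 the least.
    return tuple((bits >> (n_species - 1 - i)) & 1 for i in range(n_species))

def encode_state(flags):
    bits = 0
    for f in flags:
        bits = (bits << 1) | (1 if f else 0)
    return bits + 1

# The sample realdata file lists states 1, 8, 1, 1, 6.
decoded = [decode_state(s) for s in (1, 8, 1, 1, 6)]
print(decoded)
```

For example, state 6 decodes to (1, 0, 1): regulated in Species 1 and Species 3, unregulated in Species 2.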

 

 

Output:

If you want to infer the regulatory relationship without known regulatory data, the program produces:

(1) The likelihood output, containing the likelihoods of the eight possible regulatory states

(2) The regulatory state output, containing the inferred regulatory state for each species

 

If you want to infer the regulatory relationship with known regulatory data, you can choose cross validation to test the prediction of regulatory relationship:

(1) The performance of cross validation.

 

 

Examples:

(1) Run without known regulatory data

./EvoRgNet -mf test_misc_file1 -bs SeqScore0 -fs SeqScore1 -e ExpIndex

 

(2) Run with known regulatory data

./EvoRgNet -mf test_misc_file1 -bs SeqScore0 -fs SeqScore1 -e ExpIndex -r realdata -cv 4
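When the program is called from a script, the two example invocations can be assembled as argument lists. The sketch below uses only the options documented on this page, written with plain hyphens (-mf, -bs, -fs, -e, -r, -cv); the function build_command is a hypothetical wrapper, and the executable path ./EvoRgNet is assumed to be in the current directory.

```python
# Hypothetical wrapper (not part of EvoRgNet): build the argument list
# for the two example invocations.
def build_command(miscfile, seqfile0, seqfile1, cluster,
                  realdata=None, cv=None, exe="./EvoRgNet"):
    cmd = [exe, "-mf", miscfile, "-bs", seqfile0, "-fs", seqfile1, "-e", cluster]
    if realdata is not None:
        cmd += ["-r", realdata]
        if cv is not None:
            cmd += ["-cv", str(cv)]  # K-fold cross validation
    elif cv is not None:
        # Cross validation compares predictions with known regulatory data.
        raise ValueError("cross validation requires a realdata file")
    return cmd

plain = build_command("test_misc_file1", "SeqScore0", "SeqScore1", "ExpIndex")
with_cv = build_command("test_misc_file1", "SeqScore0", "SeqScore1", "ExpIndex",
                        realdata="realdata", cv=4)
print(" ".join(plain))
print(" ".join(with_cv))
# To actually run it: import subprocess; subprocess.run(with_cv, check=True)
```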

 

 

 

Advanced options

-bt <beta>: The tuning parameter for the weight of the expression data. The default value is 2. We recommend values between 0.1 and 30.

 

-gm <gamma>: The tuning parameter for the weight of the sequence data. The default value is 1. We recommend values between 0.1 and 30.

 

-o <output>: The output filename for the regulatory relationships. The default name is "output.txt".

 

-lo <likelihoodoutput>: The output filename for the likelihoods in all possible states. The default name is "output_likilihood.txt".

 

-cv <K>: Run K-fold cross validation and report the average performance (correlation).

 

-co <cvoutput>: The output filename for cross validation. The default name is "output_crossvalidation.txt".

 

-l <lambda>: The transition rate (from 0: unregulated to 1: regulated) during evolution. The default value is 0.0053, estimated from yeast data. The estimated value for mammalian data is 0.17.

 

-u <mu>: The transition rate (from 1: regulated to 0: unregulated) during evolution. The default value is 0.0742, estimated from yeast data. The estimated value for mammalian data is 0.51.

 

-px <px1>: The prior probability. The default value is 0.25.

 

-al <alpha>: The initial proportion of genes regulated by the TF. The default value is 0.5.
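To give a feel for the -l and -u rates, the sketch below computes gain and loss probabilities along a branch under the assumption that regulation evolves as a standard two-state continuous-time Markov chain with gain rate lambda and loss rate mu. This is an assumption made here for illustration; the main text defines the model actually used by the program, and transition_matrix is a hypothetical helper.

```python
import math

# Hypothetical helper (not part of EvoRgNet). Assumes a standard two-state
# continuous-time Markov chain: 0 -> 1 at rate lam, 1 -> 0 at rate mu.
def transition_matrix(lam, mu, t):
    """Return [[P(0->0), P(0->1)], [P(1->0), P(1->1)]] after branch length t."""
    total = lam + mu
    decay = math.exp(-total * t)
    p01 = lam / total * (1.0 - decay)  # gain of regulation
    p10 = mu / total * (1.0 - decay)   # loss of regulation
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]

# Rates quoted above for the yeast data, on a branch of length 0.13
# (one of the branch lengths in the sample miscfile).
yeast = transition_matrix(0.0053, 0.0742, 0.13)
# Rates quoted above for the mammalian data, same branch length.
mammal = transition_matrix(0.17, 0.51, 0.13)
print(yeast)
print(mammal)
```

Under this assumption, the long-run fraction of regulated genes is lambda / (lambda + mu), and on any branch the loss probability exceeds the gain probability whenever mu is larger than lambda, as it is for both quoted rate pairs.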

 


Contact Chieh-Chun Chen (cchen63 AT illinois DOT edu) with any problems or comments.