

   

GATE is a model-based, open-source tool for predicting chromatin states from time-course epigenetic mark data.

 


 

Running the Program

  1. Launch R and change the working directory to a new folder. (Note that you need to run the 64-bit R executable if you want to use the 64-bit GATE package.)

  2. Extract "FMM-HMM-functions.R" and "FMM-HMM.dll" from the GATE package file. Put these two files and your data file (in the tab-delimited format described in "Data File Format") into the new folder from step 1.
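    Before sourcing the functions, it can help to confirm that the session is 64-bit and that the three files are in place. The lines below are a minimal sketch in base R, not part of GATE; the folder name "gate_run" and the data file name "data.txt" are placeholders.

```r
# Sketch only: "gate_run" and "data.txt" are placeholder names, not GATE requirements.
is_64bit <- .Machine$sizeof.pointer == 8        # TRUE in a 64-bit R session

workdir <- file.path(tempdir(), "gate_run")     # replace with your own new folder
dir.create(workdir, showWarnings = FALSE)
# setwd(workdir)                                # then make it the working directory

required <- c("FMM-HMM-functions.R", "FMM-HMM.dll", "data.txt")
missing_files <- required[!file.exists(file.path(workdir, required))]

if (!is_64bit) message("This is not a 64-bit R session.")
if (length(missing_files) > 0) {
  message("Still missing from ", workdir, ": ", paste(missing_files, collapse = ", "))
}
```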

  3. Open a new R code file and execute:
    >source("FMM-HMM-functions.R")

  4. Preprocess the data and split it into epigenetic data and location information, using the command in step 5:

  5. >data=Inputdata("data.txt",nmark,ntime)
    Here "nmark" is the number of epigenetic marks used for prediction and "ntime" is the number of time points in your experiment. This step gives two output variables, data$location and data$observation. The first contains the location of each window within the genome; the second contains the processed epigenetic count information.
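    To make the two outputs concrete, the sketch below splits a toy tab-delimited table into a location part and an observation part using base R only. It is an illustration, not the Inputdata implementation. The layout (3 location columns followed by nmark*ntime count columns) and the names "toy" and "toy_data" are assumptions made for this example; the real layout is the one given in "Data File Format".

```r
# Assumed layout for illustration: chr, start, end, then nmark*ntime count columns.
nmark <- 2
ntime <- 3
set.seed(1)

toy <- data.frame(chr = "chr1",
                  start = seq(0, 800, by = 200),
                  end = seq(200, 1000, by = 200))
counts <- matrix(rpois(5 * nmark * ntime, lambda = 4), nrow = 5)
toy <- cbind(toy, counts)

# Write and re-read the table as a tab-delimited file, like a real data file.
path <- tempfile(fileext = ".txt")
write.table(toy, path, sep = "\t", quote = FALSE, row.names = FALSE)
raw <- read.delim(path, header = TRUE)

# Split into a location part and an observation part.
toy_data <- list(location = raw[, 1:3],
                 observation = as.matrix(raw[, -(1:3)]))
dim(toy_data$observation)   # windows by (nmark * ntime)
```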

  6. Run the key function in GATE and set the relevant function parameters.
    There are 6 parameters in total in the key function call of GATE:
    >results=FMM.HMM.program(data$observation,ncluster,maxiteration,nstep,ndistance,initial)
    1. observation is the second variable generated by the Inputdata function; here it is data$observation.
    2. ncluster is the number of clusters to give the model. Similar clusters are grouped in the next step, so this can be a relatively large number.
    3. maxiteration is the maximum total number of iterations used to train the parameters if the stopping thresholds are not met. 2000-4000 is recommended.
    4. nstep is the maximum number of iterations used each time the HMM parameters are optimized given the cluster information. 20-40 is recommended.
    5. ndistance is the distance threshold between two transition parameter sets, used each time the HMM parameters are optimized given the cluster information. The training iterations stop when the distance between the transition parameter sets of two adjacent iterations falls below ndistance, whether or not the iteration number has reached nstep.
    6. initial specifies which initial clustering to use. "initial=1" uses a simple k-means clustering as the initial clustering; any other number gives the model a random initial clustering.
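    The interplay between nstep and ndistance can be pictured with a toy loop. The sketch below is not GATE code: the update rule is a placeholder, and the Euclidean distance is an assumption, since the distance measure GATE uses is not stated here. It only shows the stopping logic, where iteration ends as soon as the change in the transition matrix falls below ndistance, or after nstep iterations at most.

```r
# Toy stopping rule; the update step and the distance measure are assumptions.
nstep <- 30
ndistance <- 1e-4

target <- matrix(c(0.9, 0.1,
                   0.2, 0.8), nrow = 2, byrow = TRUE)  # value the toy update converges to
trans <- matrix(0.5, nrow = 2, ncol = 2)               # starting transition matrix
iterations_used <- 0

for (i in seq_len(nstep)) {
  new_trans <- (trans + target) / 2          # placeholder update, not GATE's
  d <- sqrt(sum((new_trans - trans)^2))      # distance between adjacent iterations
  trans <- new_trans
  iterations_used <- i
  if (d < ndistance) break                   # stop early once the change is small
}
```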

    Several results can be obtained from this main function:
    1. results$cluster gives the cluster assignment of each genomic window. It is a 1*nregion vector, where nregion is the number of genomic windows in "data.txt".
    2. results$lambda gives the emission Poisson parameters of the HMM in each cluster. It is a matrix with ncluster rows and 2*nmark columns. The column order is "mark1_Hidden0, mark1_Hidden1, mark2_Hidden0, mark2_Hidden1, ...".
    3. results$hidden gives the hidden state (0-undifferentiated, 1-differentiated) of each genomic window. It is a matrix with nregion rows and ntime columns.
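    The interleaved column order of results$lambda is easy to misread, so labelling the columns may help. The sketch below uses a random toy matrix as a stand-in for results$lambda (the values are made up) and base R only; the variable names are chosen for this example.

```r
# Toy stand-in for results$lambda: ncluster rows, 2*nmark columns (values are made up).
nmark <- 3
ncluster <- 4
set.seed(2)
lambda <- matrix(round(runif(ncluster * 2 * nmark, 0.5, 20), 2), nrow = ncluster)

# Label the columns in the documented order: mark1_Hidden0, mark1_Hidden1, mark2_Hidden0, ...
colnames(lambda) <- paste0("mark", rep(seq_len(nmark), each = 2),
                           "_Hidden", rep(0:1, times = nmark))

# Separate the Poisson rates of the two hidden states (odd columns = 0, even columns = 1).
lambda_hidden0 <- lambda[, seq(1, 2 * nmark, by = 2), drop = FALSE]
lambda_hidden1 <- lambda[, seq(2, 2 * nmark, by = 2), drop = FALSE]
```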

       
  7. Group clusters with similar patterns and combine the cluster and group information with the location information in one variable.
    >results.data.cl.gr=grouping.clusters(data$location, results, ngroup)
    Here, data$location is one output of step 5, results is the output of step 6, and ngroup is the number of groups you want to specify.
    The output results.data.cl.gr contains location information in the first 3 columns, cluster information in the 4th column and group information in the 5th column.
    Also, a file called "result_cl-gr-hid.txt" will be generated with the cluster, group and hidden state information for each region.
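    Given that column layout, a quick summary of how clusters fall into groups takes a few lines of base R. The sketch below uses a small made-up table in place of results.data.cl.gr; only the column positions (location in columns 1-3, cluster in column 4, group in column 5) come from the description above, and the column names are assumptions.

```r
# Made-up stand-in for results.data.cl.gr; only the column positions follow the text.
toy <- data.frame(chr = "chr1",
                  start = seq(0, 1000, by = 200),
                  end = seq(200, 1200, by = 200),
                  cluster = c(1, 2, 2, 3, 4, 4),
                  group = c(1, 1, 1, 2, 2, 2))

# Number of genomic windows assigned to each group (5th column).
regions_per_group <- table(toy[, 5])

# Which clusters (4th column) ended up in each group.
clusters_in_group <- tapply(toy[, 4], toy[, 5], function(x) sort(unique(x)))
```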

  8. Generate a color-coded BED file containing cluster, group and hidden state information for visualization in the UCSC Genome Browser.
    >col.BED(results.data.cl.gr,results$hidden)
    Here, results.data.cl.gr is the output of step 7 and results$hidden is one of the outputs of step 6.
    The function writes a BED file, which can be uploaded to the UCSC Genome Browser after adding this line as a header on the first line:

    track name=predicted_cl_gr_hid description='predicted_cl_gr_hid' visibility=2 itemRgb='On'
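    The header can also be added from R instead of a text editor. The sketch below is one possible way, using base R only. The file path and the single BED line are placeholders created for the example, since the name of the file written by col.BED is not stated here; point bed_path at the real output file instead.

```r
# Placeholder BED file for the example; use the file written by col.BED instead.
bed_path <- tempfile(fileext = ".bed")
writeLines("chr1\t0\t200\tcl1_gr1\t0\t+\t0\t200\t255,0,0", bed_path)  # toy BED9 line

header <- "track name=predicted_cl_gr_hid description='predicted_cl_gr_hid' visibility=2 itemRgb='On'"

# Prepend the track line unless the file already starts with one.
body <- readLines(bed_path)
if (!startsWith(body[1], "track ")) {
  writeLines(c(header, body), bed_path)
}
```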
    Below is a figure description for the visualization in the UCSC Genome Browser:

     

 

Last updated on 07/01/2013 by Pengfei Yu