GATE is a model-based, open-source tool for predicting chromatin states
from time-course epigenetic mark data.
Running the Program
- Launch R and change the working directory to a new folder. (Note that you need to run the 64-bit R executable
if you want to use the 64-bit GATE package.)
- Extract "FMM-HMM-functions.R" and "FMM-HMM.dll" from the GATE package file. Put these two files and your data file
(in tab-delimited format, as described in "Data File Format") into the new folder mentioned above.
- Open a new R code file and execute:
>source("FMM-HMM-functions.R")
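The setup steps above can be checked with a short sketch before any GATE function is called. It uses only base R; the folder name "gate_run" under the temporary directory is a hypothetical placeholder for your own folder, and the file check only reports whether the two GATE files are present.

```r
# Check that the running R session is 64-bit (pointers are 8 bytes wide).
is_64bit <- .Machine$sizeof.pointer == 8
cat("64-bit R session:", is_64bit, "\n")

# Create a fresh folder and make it the working directory.
# "gate_run" is a hypothetical name; replace it with your own folder.
workdir <- file.path(tempdir(), "gate_run")
dir.create(workdir, showWarnings = FALSE, recursive = TRUE)
setwd(workdir)

# Report whether the two GATE files have been copied into the folder.
required <- c("FMM-HMM-functions.R", "FMM-HMM.dll")
present <- file.exists(required)
names(present) <- required
print(present)
```

If either entry prints FALSE, copy the missing file into the folder before running source().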
- Preprocess the data and split it into two parts: the epigenetic data and the location
information.
>data=Inputdata("data.txt",nmark,ntime)
Here "nmark" is the number of epigenetic marks used for prediction and "ntime" is the number of time points in your experiments.
This step gives two output variables, data$location and data$observation. The first one
contains the location of each window within the genome; the second one
contains the processed epigenetic count information.
- Run the key function in GATE and set the relevant function
parameters.
The key function call of GATE takes the following parameters:
>results=FMM.HMM.program(data$observation,ncluster,maxiteration,nstep,ndistance,initial)
- observation is the second variable generated by the Inputdata function; here it is data$observation;
- ncluster is the number of clusters to fit in the model. Similar clusters are grouped in the next step, so it can be a relatively large number;
- maxiteration is the maximum total number of iterations used to train the parameters if the stopping thresholds are not met. 2000-4000 is recommended;
- nstep is the maximum number of iterations used to optimize the HMM parameters each time, given the cluster information. 20-40 is recommended;
- ndistance is the distance threshold between two transition parameter sets when optimizing the HMM parameters each time, given the cluster information. The training iterations stop once the distance between the transition parameter sets of two adjacent iterations falls below ndistance, whether or not the iteration number has reached nstep;
- initial specifies which initial clustering to use. "initial=1" uses a simple k-means clustering as the initial clustering; any other number gives the model a random initial clustering.
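The interplay of nstep and ndistance described above can be illustrated with a minimal sketch. This is not GATE's internal code: update_fn is a hypothetical stand-in for one HMM re-estimation pass, the Euclidean norm is an assumed distance measure, and the toy update and target values are made up for the demonstration.

```r
# Illustrative stopping rule only; this is not GATE's internal code.
# update_fn is a hypothetical stand-in for one HMM re-estimation pass.
iterate_until_stable <- function(trans, update_fn, nstep, ndistance) {
  for (step in seq_len(nstep)) {
    new_trans <- update_fn(trans)
    # Assumption: distance is the Euclidean norm of the parameter change.
    d <- sqrt(sum((new_trans - trans)^2))
    trans <- new_trans
    if (d < ndistance) {
      # Stop early, even though fewer than nstep iterations were run.
      return(list(trans = trans, steps = step, converged = TRUE))
    }
  }
  # The iteration cap was reached without meeting the distance threshold.
  list(trans = trans, steps = nstep, converged = FALSE)
}

# Toy update that moves the parameters halfway towards a fixed target.
target <- c(0.9, 0.1, 0.2, 0.8)
toy_update <- function(p) p + 0.5 * (target - p)

fit <- iterate_until_stable(c(0.5, 0.5, 0.5, 0.5), toy_update,
                            nstep = 30, ndistance = 1e-4)
print(fit$steps)
print(fit$converged)
```

A smaller ndistance makes the loop run longer before stopping, while nstep caps the cost of each optimization round regardless of convergence.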
Several results can be obtained from this main function:
- results$cluster gives the cluster assignment of each genomic window. It is a 1*nregion vector, where nregion is the number of genomic windows in "data.txt";
- results$lambda gives the Poisson emission parameters of the HMM in each cluster. It is a matrix with ncluster rows and 2*nmark columns. The order of the columns is "mark1_Hidden0, mark1_Hidden1, mark2_Hidden0, mark2_Hidden1, ...";
- results$hidden gives the hidden state (0-undifferentiated, 1-differentiated) of each genomic window. It is a matrix with ntime columns and nregion rows.
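The column order of results$lambda can be made explicit by labelling the columns. The sketch below uses a toy matrix in place of a real results$lambda (the values, ncluster = 3 and nmark = 2 are made up) and shows one way to separate the Hidden0 and Hidden1 parameters.

```r
# Toy stand-in for results$lambda: ncluster rows, 2 * nmark columns.
ncluster <- 3
nmark <- 2
lambda <- matrix(seq_len(ncluster * 2 * nmark), nrow = ncluster)

# Label the columns in the documented order:
# mark1_Hidden0, mark1_Hidden1, mark2_Hidden0, mark2_Hidden1, ...
colnames(lambda) <- paste0("mark", rep(seq_len(nmark), each = 2),
                           "_Hidden", rep(0:1, times = nmark))

# Split into one matrix per hidden state (odd columns = 0, even = 1).
lambda_hidden0 <- lambda[, seq(1, 2 * nmark, by = 2), drop = FALSE]
lambda_hidden1 <- lambda[, seq(2, 2 * nmark, by = 2), drop = FALSE]
print(colnames(lambda))
```

With a real run, replace the toy matrix with results$lambda and set nmark to the value passed to Inputdata.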
- Group clusters with similar patterns and integrate the cluster and group
information with the location information in one variable.
>results.data.cl.gr=grouping.clusters(data$location,results,ngroup)
Here, data$location is one output of step 4, results is the output of step 5, and
ngroup is the number of groups you want to specify.
The output results.data.cl.gr contains location information in the first 3 columns,
cluster information in the 4th column and group information in the 5th column.
Also, a file called "result_cl-gr-hid.txt" will
be generated with cluster, group and hidden state information for each region.
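Given the column layout described above, a quick summary of the grouping can be sketched as follows. The toy data frame stands in for results.data.cl.gr; its column names, the chromosome/start/end interpretation of the three location columns, and the 200 bp window size are assumptions made for illustration only.

```r
# Toy stand-in for results.data.cl.gr: 3 location columns, then
# cluster (4th column) and group (5th column). All values are made up.
toy <- data.frame(
  chr = "chr1",
  start = seq(0, by = 200, length.out = 6),
  end = seq(200, by = 200, length.out = 6),
  cluster = c(1, 1, 2, 3, 3, 3),
  group = c(1, 1, 1, 2, 2, 2)
)

# Number of genomic windows assigned to each group (5th column).
windows_per_group <- table(toy[[5]])

# Number of distinct clusters (4th column) merged into each group.
clusters_per_group <- tapply(toy[[4]], toy[[5]],
                             function(x) length(unique(x)))

print(windows_per_group)
print(clusters_per_group)
```

Indexing by position rather than by name keeps the sketch independent of whatever column names the real output carries.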
- Generate a color-coded BED file containing cluster, group and hidden state
information for visualization in the UCSC Genome Browser.
>col.BED(results.data.cl.gr,results$hidden)
Here, results.data.cl.gr is the output of step 6, and results$hidden is one of the outputs of
step 5.
The function writes a BED file which can be uploaded to the
UCSC Genome Browser after adding this line as a header on the first line:
track name=predicted_cl_gr_hid description='predicted_cl_gr_hid' visibility=2 itemRgb='On'
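Adding the header line can be done in R instead of a text editor. The sketch below is one possible way, using base R only; add_track_header is a hypothetical helper (not part of GATE), and the two-line BED file is made-up toy content, since the name and exact columns of the file written by col.BED are not stated here.

```r
# Hypothetical helper (not part of GATE): write a copy of a BED file
# with the UCSC track line prepended as the first line.
add_track_header <- function(bed_in, bed_out) {
  header <- paste("track name=predicted_cl_gr_hid",
                  "description='predicted_cl_gr_hid'",
                  "visibility=2 itemRgb='On'")
  writeLines(c(header, readLines(bed_in)), bed_out)
  invisible(bed_out)
}

# Toy demonstration with a temporary two-line BED file (made-up content).
bed_in <- tempfile(fileext = ".bed")
bed_out <- tempfile(fileext = ".bed")
writeLines(c("chr1\t0\t200\tc1_g1\t0\t+\t0\t200\t255,0,0",
             "chr1\t200\t400\tc2_g1\t0\t+\t200\t400\t0,0,255"), bed_in)
add_track_header(bed_in, bed_out)
print(readLines(bed_out)[1])
```

Writing to a separate output file leaves the original BED file untouched in case the header needs to be changed later.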
Below is a figure description for the visualization in UCSC genome browser:
Last updated on 07/01/2013 by Pengfei Yu