Reads processing

Regardless if a library was prepared using pxMARGI or diMARGI, the following procedures were the same.

First, we pre-processed the reads according to the pipeline depicted in Fig 1, where we: (1) assessed the sequencing properties of RNA and DNA reads (FastQC), and (2) de-duplicated read-pairs (fastuniq) to avoid potential artifacts due to PCR over-amplification.

https://sysbio.ucsd.edu/public/rivasas2/responseToReviewers/RNAchrom/quality.png

Figure 1: Depiction of the quality assessment and de-duplication of reads. We analyzed the quality of the RNA and DNA reads independently. As our sequencing protocol attaches two random bases at the 5’ end of the RNA reads, we removed the first two nucleotides of each RNA sequence before undergoing this step. Any sequencing adaptor found among RNA or DNA reads were trimmed using FastX. After this, for every library, we check for duplicated RNA-DNA pairs and collapsed them using fastuniq. For this, we did not remove the two randomly attached bases at the 5’ end of the RNA reads as they help to distinguish false from true PCR duplicates. For all downstream analyses, we used only single-copy read pairs.

Mapping

Our mapping pipeline, which uses a splice tolerant aligner (STAR 2.5.1) is depicted in Fig 2.

https://sysbio.ucsd.edu/public/rivasas2/responseToReviewers/RNAchrom/mapping.png

Figure 2: Depiction of the read mapping procedure. Unlike traditional paired-end alignments where both mates of each pair are simultaneously inputted to the aligner tool, we mapped, for each library, the RNA and DNA reads independently. We allowed splice-junctions among RNA reads (where splicing naturally occurs) but not among DNA reads. As the bases at the 3’ end of either RNA or DNA reads may correspond to a chimeric fragment, in both cases we allowed the aligner to clip bases at their 3’ end. Also, in the case of RNA reads we took care of removing the two random bases attached to their 5’ end. After mapping the RNA and DNA reads, the corresponding mates were merged into a single BAM file using a python script (fixBAMmates).

We ran STAR using the following options: –alignEndsType Extend5pOfRead1, –seedSearchStartLmax 25, –outFilterScoreMin 21, –outFilterScoreMinOverLread 0.13, –outFilterMatchNmin 21, –outFilterMatchNminOverLread 0.13.

To run STAR we built two genomic indexes: one with spliced-junction information (gtf file from ENSEMBL hg38 release 84), and another without such information. The former was used when aligning RNA reads, and the second when aligning DNA reads. To prevent spliced-junctions among DNA reads we used the option –alignIntronMax 1.

For later analysis including the identification of pxRNAs and diRNAs, we only considered RNA reads whose DNA mate was located in a different chromosome, or, if in the same chromosome, located no closer than 2000 bases)

Identified caRNAs

To identify pxRNAs and diRNAs, we tested the significance of their observed RNA counts under the null hypothesis that RNA reads are distributed proportionally to gene length. That is, for a gene \(i\) we modelled its counts of RNA reads, \(R_i\), as:

\[R_i \sim \mbox{Poisson}(R_i|M \cdot l_i/L)\]

where \(M\) is the total number or RNA reads distributed among all genes (to avoid the effects of outliers, we discarded any RNA read count above the 99% quantile), \(l_i\) is the exonic length of gene \(i\), and \(L\) is the combined length of all genes’ exons. Thus, the P-value of gene \(i\) is computed as:

\[\mbox{P-value}_i = \sum_{R_j>=R_i} \mbox{Poisson}(R_j|M \cdot l_i/L)\]

As we are doing multiple independent tests, we corrected for Type I errors by computing q-values (using the R package qvalue) for each gene:

Below you can find the lists of MARGI-identified caRNAs. Each list contain 6 columns:

  1. gene_id: ENSEMBL gene identifier.
  2. gene_name: Gene symbol.
  3. biotype: The biotype of the gene extracted from the ENSMEBL gtf file.
  4. rna_count: Count of RNA ends (R1 reads) overlapping the exonic region of the caRNA
  5. pvalue: P-value of the observed rna_count.
  6. qvalue Q-value of the observed rna_count.

Here the results using pxMARGI:

and diMARGI:

Identified RNA-DNA associations

Below you can find the MARGI-identified RNA-DNA associations. Each table contains 11 columns:

  1. chrom_RNA: Chromosome of RNA end.
  2. start_RNA: Start coordinate of RNA end.
  3. end_RNA: End coordinate of RNA end.
  4. strand_RNA: Strand of RNA end
  5. associated_RNA: Identifier (gene_id|gene_name|biotype) of the RNA overlapping the RNA end
  6. chrom_DNA: Chromosome of the DNA end.
  7. start_DNA: Start coordinate of the DNA end.
  8. end_DNA: End coordinate of the DNA end.
  9. strand_DNA: Strand of the coordinate end.
  10. read_ID: Identifier of the read (extracted from the Fastq files)
  11. type_of_link: Type of links formed by the read pair. It can be proximal or distal if the distance between pairs is below, or above or equal to 2kbp, respectively, or interChrom if the RNA and DNA ends are located in different chromosomes.

Here the results using pxMARGI:

and diMARGI: