m6AConquer: Methodological Details


Content:

Ⅰ. Introduction

The m6AConquer database offers curated quantitative m6A profiling sequencing data sets. This page provides a brief overview of the processing workflows (Figure 1) utilized to convert raw reads into the data available in this database.

Figure 1: The workflow of m6AConquer

Ⅱ. Data Processing

1. Data gathering

m6AConquer summarises sequencing datasets from 10 different m6A profiling techniques. Both wild type and perturbation samples are included in the database. The perturbation types mainly include the manipulation of m6A-related genes, such as the over-expression of FTO and the knockout of Mettl3. IVT control samples are also included for quantification calibration.

All the sequencing datasets are downloaded from three central database (GEO, GSA, and EMBL-EBI).

2. Quality Control

FastQ read files for NGS techniques are retrieved and downloaded accordingly. Most reads undergo initial trimming using tools like Cutadapt, Trim Galore, and Flexbar to eliminate adaptors and low-quality reads. Some reads are deduplicated through Samtools, Fastx Collapser, and UMI-tools to remove PCR duplications. Fastp is used for both trimming and deduplication.

As for Oxford Nanopore sequencing data, the raw Fast5 files are obtained from the Singapore Nanopore-Expression Project. Quality control is conducted during the basecalling process with Guppy.

The table below summarize the software and tools used in data cleaning and read mapping:

TechniquesData cleaningRead mapping
DART-seqFastp v.0.23.4HISAT-3N v.2.2.1-3n-0.0.3 (C → T)
eTAM-seqCutadapt v.2.9, UMI-tools v.1.1.4HISAT-3N v.2.2.1-3n-0.0.3 (A → G)
GLORITrim Galore v.0.6.6, Fastp v.0.23.4STAR v.2.7.11a, bowtie v.1.3.1
m6A-REF-seqCutadapt v.2.9HISAT2 v.2.1.0
m6A-SAC-seqCutadapt v.2.9, Fastx_collapser v.0.0.13STAR v.2.7.11a
m6ACEFastp v.0.23.4STAR v.2.7.11a
MAZTER-seqTrim Galore v.0.6.6STAR v.2.7.11a
MeRIP-seqTrim Galore v.0.6.6HISAT2 v.2.1.0
Oxford-nanoporeGuppy v.3.1.5 (Basecalling)Minimap2 v.2.17
scDART-seqFlexbarSTAR v.2.7.11a

3. Alignment

For NGS sequencing data, 4 aligners (STAR, Bowtie, HISAT2, and HISAT-3N) are utilized for different techniques based on their original pipelines. Bowtie is employed explicitly for transcriptome alignment, while the other three aligners are utilized for genome alignment. The genome index of HISAT2 can be obtained from the provided link: https://genome-idx.s3.amazonaws.com/hisat/grch38_genome.tar.gz. Meanwhile, the genome indexes for STAR and HISAT-3N are constructed as per the specific requirements. All sequencing reads are aligned to the GRCh38 or hg38 human genome, and various versions of gene annotation from RefSeq, Genecode, and Ensemble are employed under their original pipelines. The codes for index building can be accessed on GitHub.

In the context of Oxford Nanopore sequencing data, the aligner minimap2 is utilized for transcriptome alignment. The Ensemble transcript reference of GRCh38 is indexed using the default parameters of minimap2.

4. Feature counting

The m6AConquer database summarizes quantitative m6A profiling results in the form of SummarizedExperiment (SE) objects. These SE objects summarize the total and m6A read coverages at a consistent range of genomic locations and gene expression levels. The consistent genomic locations consist of 2526220 adenosines (A) at the center of exonic DRACH motifs and 262485 GLORI m6A sites discovered at other genomic locations reported in the supplementary data of GLORI. The m6AConquer database also includes gene expression matrices in SE objects on 58560 genes (Transctipt-Omics).

5. Omics data frameworks

The SummarizedExperiment objects consist of four primary components: row ranges, column data, assays matrices, and metadata.

Figure 2: Data structure of SummarizedExperiment objects for m6A-Omics.

In SE objects for m6A-omics (Figure 2),

  • Assays: The matrices for m6A counts and total read coverages at this candidate A sites, m6A site probabilities predicted by beta-binomial mixture models, and the Benjamini–Hochberg method corrected p-values (FDR).
  • Row ranges: Genomic ranges of consistent A locations.
  • Column data: Sequencing data information provided by the data generators in the public repositories.
  • Metadata: Fitted parameters of beta-binomial mixture models and IDR outcomes.

Figure 3: Data structure of SummarizedExperiment objects for Transcript-Omics.

In SE objects for transcript-omics (Figure 3),

  • Assays: The matrices for read counts and RPKM at consistent gene features.
  • Row ranges: Genomic ranges of gene features.
  • Column data: Sequencing data information provided by the data generators in the public repositories, identical to m6A-Omics SE.

Figure 4: Data structure of Multi-Assay Experiment objects for Multi-Omics.

Apart from these single-omics data in SE format, multi-omics data in the form of MultiAssayExperiment (MAE) for each profiling technique are also provided. The MAE files are the integration of m6A-Omics and Transcrit-Omics SE objects. The MAE contains distinct components such as experiments, sample maps, and column data for MAE.

In MAE objects for multi-omits (Figure 4),

  • Experiments: m6A-Omics and Transcrit-Omics SE objects
  • Column data in MAE: The union of column data in all experiment SEs.
  • Sample Map: The map relating the column data in experiments to the column data in MAE.

All three types of files (multi-omics, m6A-omics, and transcript-omics) can all be downloaded from the Download page.

5. IVT calibration

In the m6AConquer, we have incorporated in vitro transcription (IVT) control samples from GLORI, eTAM-seq, and MeRIP-seq datasets. The IVT control sequencing data from GLORI and eTAM-seq employ a beta-binomial mixture model to call m6A sites separately. Sites with a false discovery rate (FDR) below 0.05 are considered false positive m6A sites for the respective GLORI or eTAM-seq datasets. In the case of IVT control sequencing data from MeRIP-seq, m6ACali is used to identify false positive m6A sites. For other sequencing data within these three datasets, the m6A counts and total counts at their identified false positive m6A sites are set to 0.

III. Site-calling and Normalization: Beta-binomial mixture model

Beta-binomial mixture model

The two-component beta-binomial mixture model is a statistical model that assumes the data is generated from a mixture of two beta distributions. Each beta distribution represents a different underlying mechanism or subgroup within the population. The probability of observing success for each individual is assumed to come from one of the two beta distributions, and the choice of distribution is determined by a latent (unobserved) variable that follows a binomial distribution:

xπBeta(α1,β1)+(1π)Beta(α2,β2)\begin{equation} x \sim \pi \cdot \text {Beta}(\alpha_1, \beta_1) + (1-\pi)\cdot \text {Beta}(\alpha_2, \beta_2)\end{equation}

where π\pi and 1π1-\pi are the proportion of two beta distribution components, and α\alpha and β\beta are beta distribution parameters.

Expectation-Maximization (EM) algorithm

In m6AConquer, we model the m6A methylation ratios detected in each sequencing sample as a two-component beta-binomial mixture distribution representing two possible methylation states: methylated state (foreground, “fg\text {fg}”) and unmethylated (background, “bg\text {bg}”) state. The fitting of the beta-binomial mixture model follows the Expectation-Maximization (EM) algorithm. The EM algorithm is an iterative method that starts with initial estimates for the parameters and then changes between the E-step (Expectation) and the M-step (Maximization). In the E-step, the log-likelihood is computed the based on the current parameter estimates. In the M-step, the log-likelihood is maximized to update the parameter estimates.

1. Initialization

For each site with m6A read coverages, let mim_i and nin_i be the m6A and total read coverages at site ii. Let {xi=mini}, i=1,2,N\{x_i = \frac{m_i}{n_i}\},\ i=1,2\ldots,N, denotes the m6A methylation ratios at site ii, where NN is the number of sites with m6A read coverage.

To obtain a closer initial estimate for π\pi, a two-component binomial mixture distribution is fitted for xix_i:

xπBinomial(N,pfg)+(1π)Binomial(N,pbg)\begin{equation} x \sim \pi \cdot \text {Binomial}(N, p_{\text {fg}}) + (1-\pi)\cdot \text {Binomial}(N, p_{\text {bg}}) \end{equation}

The fitting of this primitive binomial mixture distribution is also implemented by EM algorithm.

To obtain initial estimates for αfg\alpha_{\text {fg}}, βfg\beta_{\text {fg}}, αbg\alpha_{\text {bg}} , and βbg\beta_{\text {bg}}, weighted method of moments is firstly used for estimation (18-21), following the weighted maximum likelihood estimation (8-17) to obtain a closer estimation. The details of these two estimation methods are introduced in the M-step.

2. E-step

In the E-step, the algorithm calculates the value of the likelihood function concerning the conditional distribution of the latent variable given the observed data and current estimate of the parameters. This is where posterior probabilities are computed:

ri,fg=p(mini,αfg,βfg)p(αfg,βfg)p(miθ)=p(mini,αfg,βfg)πp(mini,αfg,βfg)π+p(mini,αbg,βbg)(1π)\begin{equation}\begin{aligned} {r_i}_{\text {,fg}} &= \frac{p(m_i|n_i,\alpha_{\text {fg}}, \beta_{\text {fg}})\cdot p(\alpha_{\text {fg}}, \beta_{\text {fg}})}{p(m_i|\theta)} \\&= \frac{p(m_i|n_i,\alpha_{\text {fg}}, \beta_{\text {fg}})\cdot\pi}{{p(m_i|n_i,\alpha_{\text {fg}}, \beta_{\text {fg}})\cdot\pi}+{p(m_i|n_i,\alpha_{\text {bg}}, \beta_{\text {bg}})\cdot(1-\pi)}} \end{aligned}\end{equation}
p(mini,αj,βj)j{fg,bg}=(nimi)B(mi+αj,nimi+βj)B(αj,βj)\begin{equation} \underset{j\in\{\text {fg},\text {bg}\}}{p(m_i|n_i,\alpha_j, \beta_j)} = \binom{n_i}{m_i} \frac{\text B(m_i + \alpha_j, n_i - m_i + \beta_j)}{\text B(\alpha_j, \beta_j)} \end{equation}
ri,bg=1ri,fg\begin{equation}{r_i}_{\text {,bg}} = 1-{r_i}_{\text {,fg}} \end{equation}

where ri,fg{r_i}_{\text {,fg}} and ri,bg{r_i}_{\text {,bg}} denotes the posterior probability of xix_i belonging to the foreground and background component respectively, mim_i and nin_i be the m6A and total read coverages, (4) denotes the probability mass function of beta distribution component, and B\text B denotes beta function.

3. M-step

In the M-step, the algorithm finds the parameters that maximize the likelihood. By default, weighted maximum likelihood estimation (WMLE) is used to obtained the estimated parameters. If the WMLE fails to converge, weighted method of moments is used in current iteration as the estimation method.

4. Convergence

The algorithm checks for convergence by testing if the absolute differences between the current and the previous parameter values are smaller than a certain tolerance. If not, the parameters are updated, and the next iteration starts. If yes, the process stops, and the final estimates for the parameters are returned.

Model outputs

The beta-binomial mixture model in m6AConquer outputs two statistics: m6A site probabilities and false discovery rates.

  • m6A site probabilities: These are the posterior probability (responsibilities) of the sites being assigned to the foreground component (3, 4) with the final fitted model. This can be regarded as the normalized m6A ratios.
  • False discovery rates (FDR): These are the Benjamini-Hochberg (BH) method corrected p-values. P-values are the probabilities of the true m6A counts equal or larger than the observed m6A counts in the background beta-binomial component. This can be calculated through the cumulative density function (CDF) of the beta-binomial distribution with the final fitted parameters for the background component:

P-value=1F(x;αbg,βbg)\begin{equation} \text {P-value} = 1-F(x;\alpha_{bg}, \beta_{bg})\end{equation}
F(x;αbg,βbg)=f(0)++f(x)\begin{equation} F(x;\alpha_{bg}, \beta_{bg}) = f(0)+ \dots + f(x)\end{equation}
f(x)=k=1x(N+1kk)(αbg+x1)!(βbg+Nx1)!B(αbg,βbg)(αbg+βbg+N+1)!\begin{equation} \begin{aligned} f(x) = &\prod_{k = 1}^x(\frac{N+1-k}{k}) \\ &\cdot {\frac{(\alpha_{bg}+x-1)!\cdot (\beta_{bg}+N-x-1)! \cdot \text B(\alpha_{bg}, \beta_{bg})}{(\alpha_{bg}+\beta_{bg}+N+1)!}}\end{aligned} \end{equation}

where xx is the m6A counts, F(x;αbg,βbg)F(x;\alpha_{bg}, \beta_{bg}) and f(x)f(x) is the cumulative density function (CDF) and probability mass function (PMF) of the background beta-binomial component respectively, NN is the number of sites with m6A read coverage and B\text B is the beta function. The CDF of the background component is calculated through the sum of PMF, which can be implemented with a recursive algorithm.

IV. Technical Orthogonal Integration: IDR analysis

IDR analysis

To identify reliable modification sites and investigate differences between techniques, m6AConquer implements Irreproducible Discovery Rate (IDR) analysis, a method commonly used in CHIP-seq data to identify consistent peaks in replicates. In our implementation, we evaluate the consistency of identified sites between different techniques, particularly those employing orthogonal mechanisms to quantify m6A abundances. IDR analysis involves three key steps to assess the reproducibility of methylation sites in various m6A profiling techniques (Figure 5).

Figure 5: The workflow of IDR analysis

1. Site ranking

The overlapped sites between each pair of profiling techniques are ranked based on the value of interest. In m6AConquer, m6A site probabilities are selected as the evaluation value. Therefore, the overlapped sites are ranked based on posterior probabilities, and the ranks are used as modelling statistics in the following step.

2. Modelling

The underlying assumption in IDR analysis is that reproducibility sites should rank higher and consistently in each pair of techniques. To model the heterogeneity, a mixture framework is employed, where a positively correlated bivariate Gaussian distribution denotes the component of reproducible sites and a zero-correlated bivariate Gaussian distribution denotes the component of irreproducible sites.

(x,y)pNormal((μμ),(σ2ρσ2ρσ2σ2))+(1p)Normal((00),(1001))\begin{equation}\begin{aligned} (x,y) \sim &p \cdot \text {Normal}(\begin{pmatrix}\mu \\ \mu \end{pmatrix}, \begin{pmatrix} \sigma^2 & \rho\cdot \sigma^2\\ \rho\cdot \sigma^2 & \sigma^2 \end{pmatrix}) \\&+ (1-p)\cdot \text {Normal}(\begin{pmatrix} 0\\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0\\0 & 1 \end{pmatrix})\end{aligned}\end{equation}

where (x,y)(x,y) denotes site ranks of each pair of techniques, pp denotes the proportion of the reproducible site component, and μ\mu, σ2\sigma^2, and ρ\rho denote the parameters for the positively correlated bivariate Gaussian distribution.

A two-component semiparametric copula mixture model is employed to assess the dependence structure between the ranks of paired techniques. The associations among the variables are modelled by a parametric joint distribution, while the marginal distributions are estimated nonparametrically based on their ranks. The Gaussian copula family is employed to model dependence structures, treating the observed data as being generated from underlying latent variables (zx,zy)(z_x, z_y). The underlying latent variables reflect the biological replicates and are assumed to have the same marginal distribution. Therefore, the joint parametric model for copula with latent variables (zx,zy)(z_x, z_y) is:

(zxzy)K=kNormal((μkμk),(σk2ρkσk2ρkσ2σk2)),k=0,1\begin{equation}\begin{pmatrix}z_{x} \\ z_{y}\end{pmatrix} | K = k \sim \text {Normal} \left( \begin{pmatrix} \mu_k \\ \mu_k \end{pmatrix}, \begin{pmatrix} \sigma^2_k & \rho_k \cdot \sigma^2_k \\ \rho_k \cdot \sigma^2 & \sigma^2_k \end{pmatrix} \right), \qquad k = 0,1 \end{equation}

where KiBernoulli(p)K_i \sim \text {Bernoulli} (p), μ0=0\mu_0 = 0, μ1>0\mu_1 > 0, ρ0=0\rho_0 = 0, and 0<ρ1<10<\rho_1<1.

The IDR value is the posterior probability that a rank pair ii denoted as (xi,yi)(x_i, y_i) belongs to the irreproducible component:

IDR=CirrepCirrep+Crep\begin{equation} {\text {IDR}} = {\frac{C_\text{irrep}}{C_\text{irrep} + C_\text{rep}}} \end{equation}
Cirrep=ph0(G1(Fx(x)),G1(Fy(y)))\begin{equation} {C_\text{irrep}} = p \cdot h_0 (G^{-1}(F_x(x)), G^{-1}(F_y(y))) \end{equation}
Crep=(1p)h1(G1(Fx(x)),G1(Fy(y)))\begin{equation} {C_\text{rep}} = (1-p) \cdot h_1 (G^{-1}(F_x(x)), G^{-1}(F_y(y))) \end{equation}

where h0Normal((00),(1001))h_0 \sim \text{Normal} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right) and h1Normal((μ1μ1),(σ12ρ1σ12ρ1σ12σ12))h_1 \sim \text{Normal} \left( \begin{pmatrix} \mu_1 \\ \mu_1 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho_1 \cdot \sigma_1^2 \\ \rho_1 \cdot \sigma_1^2 & \sigma_1^2 \end{pmatrix} \right). FxF_x and FyF_y are the unknown marginal distributions of ranks xx and yy respectively. G1G^{-1} is are the right-continuous inverses of GG which is defined as:

uxG(zx)=pσ1Φ(zxμ1σ1)+(1p)Φ(zx)\begin{equation} u_{x} \equiv G(z_{x}) = \frac{p}{\sigma_1} \cdot \Phi\left(\frac{z_{x} - \mu_1}{\sigma_1}\right) + (1-p) \cdot \Phi(z_{x}) \end{equation}
uyG(zy)=pσ1Φ(zyμ1σ1)+(1p)Φ(zy)\begin{equation} u_{y} \equiv G(z_{y}) = \frac{p}{\sigma_1} \cdot \Phi\left(\frac{z_{y} - \mu_1}{\sigma_1}\right) + (1-p) \cdot \Phi(z_{y}) \end{equation}
x=Fx1(ux)\begin{equation} x = F_x^{-1}(u_x) \end{equation}
y=Fy1(uy)\begin{equation} y = F_y^{-1}(u_y) \end{equation}

where Fx1F_x^{-1} and Fy1F_y^{-1} are the right-continuous inverses of FxF_x and FyF_y.

To fit procedure for this model is implemented with a EM algorithm coupled with “pseudo-likelihood” likelihood approach. For detailed fitting procedure, please refer to Li et al. (2011).

3. Threshold selection

By setting a threshold for the IDR value, the irreproducible discoveries can be controlled. In m6AConquer, this threshold is set to 0.05, which means the overlapped sites with posterior probabilities of belonging to the irreproducible component of less than 5% are considered reproducible.

Technical orthogonal integration

IDR analysis is conducted between orthogonal technique pairs. Here, 10 m6A quantitative profiling techniques are classified into 4 groups based on their sequencing principles:

Figure 6: 4 sequencing strategies used by m6A quantitative profiling techniques

Technqiues in different strategy groups are considered orthogonal technique pairs. One exception can be the techniques in antibody-assisted and direct RNA sequencing strategy groups. They are not considered orthogonal because the computational detection method in direct RNA sequencing techniques incorporates m6ACE-seq sequencing results as its training labels. m6ACE-seq is classified in the antibody-assisted strategy group.

In each orthogonal technique pair, overall and condition-specific sample integration methods can be employed. The overall integration method is the sum of m6A counts and total counts at each site from all samples within each technique. As for the condition-specific integration method, summation takes place on samples with common cell lines or tissues.

The IDR analysis-based integration results are presented in Genomic Ranges and CSV format in the m6AConquer database. For each orthogonal technique pair, the results contain the genomic locations and IDR values of all reproducible sites, in addition to their m6A methylation ratios, posterior probabilities, and Benjamini-Hochberg (BH) corrected p-values as modelled by each technique. Furthermore, a merged Genomic Range and CSV files are provided, containing the union of reproducible sites identified in all orthogonal technique pairs, along with the names and numbers of supported orthogonal technique pairs. Users can access and download these results from the Download page.

V. Codes

The codes for processing sequencing data is available on GitHub.

The fitting of beta-binomial mixture model and other candidate models can be implemented through OmixM6A R package, which is also available on GitHub.

VI. References

Morgan M, Obenchain V, Hester J, Pagès H (2024). SummarizedExperiment: SummarizedExperiment container. R package version 1.34.0, https://bioconductor.org/packages/SummarizedExperiment.

Ramos M, Schiffer L, Re A, Azhar R, Basunia A, Rodriguez Cabrera C, Chan T, Chapman P, Davis S, Gomez-Cabrero D, Culhane A, Haibe-Kains B, Hansen K, Kodali H, Louis M, Mer A, Reister M, Morgan M, Carey V, Waldron L (2017). “Software For The Integration Of Multi-Omics Experiments In Bioconductor.” Cancer Research77(21), e39-42. doi:10.1158/0008-5472.CAN-17-0344https://cancerres.aacrjournals.org/content/77/21/e39.

Qunhua Li. James B. Brown. Haiyan Huang. Peter J. Bickel. "Measuring reproducibility of high-throughput experiments." Ann. Appl. Stat. 5 (3) 1752 - 1779, September 2011. https://doi.org/10.1214/11-AOAS466

Konstantin Krismer, Yuchun Guo, David K Gifford, IDR2D identifies reproducible genomic interactions, Nucleic Acids Research, Volume 48, Issue 6, 06 April 2020, Page e31, https://doi.org/10.1093/nar/gkaa030