[Paper Note] Multi-omics Integration
1. Overview
This note summarizes a representative paper on multi-omics integration for cancer, where histology images are combined with bulk RNA-seq and other molecular profiles. The core idea is to learn a shared latent space that captures complementary information from each modality while controlling for batch effects and clinical confounders.
I focus less on every implementation detail and more on: (i) how they formalize the integration problem, (ii) how they design the objective, and (iii) what lessons might transfer to computational pathology + spatial / single-cell data.
At a high level, the paper aims to learn a latent representation z that aligns image and multi-omics views, is predictive of clinical outcomes, and disentangles biological signal from batch / technical noise.
2. Data & Modalities
2.1 Cohort
The study uses a cohort of several hundred cancer patients, each with:
- FFPE or frozen H&E whole-slide images (WSIs)
- Bulk RNA-seq (TPM / counts) and basic clinical variables
- Optional: copy number profiles or mutation data (used in a subset of analyses)
2.2 Preprocessing
WSIs are tiled into patches and fed into a pretrained histology encoder (e.g., a ResNet or a pathology foundation model). Patch features are aggregated into slide-level representations (attention pooling / simple mean pooling).
For RNA, the authors use log-transformed expression of selected genes (either highly variable genes or a curated panel). All omics features are z-score normalized across samples.
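A minimal sketch of this feature-preparation step, assuming patch features were already extracted by the frozen encoder (the function names and the variance-based HVG selection are my placeholders, not the paper's exact recipe):

```python
import numpy as np

def slide_embedding(patch_feats):
    """Mean-pool patch features (n_patches, d) into one slide-level vector."""
    return patch_feats.mean(axis=0)

def preprocess_rna(expr, n_hvg=2000):
    """Log-transform, select highly variable genes, z-score across samples.
    expr: (n_samples, n_genes) TPM or count matrix."""
    x = np.log1p(expr)
    hvg = np.argsort(x.var(axis=0))[-n_hvg:]   # top-variance genes as a simple HVG proxy
    x = x[:, hvg]
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)  # per-gene z-score
```

Attention pooling would replace `slide_embedding`; a learnable variant is sketched in Section 5 below.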
3. Method & Objective
3.1 Latent space
The method learns a shared latent vector z for each patient.
Two encoders map image features and omics features into this common space:
f_img(x_img) → z, f_omics(x_omics) → z

The goal is that z captures biology that is consistent across modalities, while also being informative for downstream tasks such as subtype classification or prognosis. More precisely, the model learns f_img, f_omics and a task head g such that

z_img = f_img(x_img), z_omics = f_omics(x_omics),

training encourages z_img ≈ z_omics, and g(z) predicts labels y (subtype / survival).
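A minimal PyTorch sketch of this two-encoder setup (the dimensions, MLP depth, and number of classes are my assumptions, not values from the paper):

```python
import torch
import torch.nn as nn

class MLPEncoder(nn.Module):
    """Maps one modality's feature vector into the shared latent space."""
    def __init__(self, in_dim, latent_dim=128, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

f_img = MLPEncoder(in_dim=1024)    # slide-level histology features
f_omics = MLPEncoder(in_dim=2000)  # selected-gene expression vector
g = nn.Linear(128, 4)              # task head, e.g. 4 molecular subtypes

z_img = f_img(torch.randn(8, 1024))      # batch of 8 patients
z_omics = f_omics(torch.randn(8, 2000))
logits = g(z_img)                        # or g(z_omics), or g of a fused z
```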
3.2 Loss design
The training objective typically includes:
- Alignment loss (e.g., contrastive loss, cosine similarity, or CCA-style loss) to bring image- and omics-derived embeddings of the same patient close in the latent space.
- Reconstruction or prediction loss, e.g. predicting gene expression from image embedding or vice versa, to encourage cross-modal predictability.
- Task-specific loss (e.g., cross-entropy for subtype labels, Cox loss for survival) so that the latent space is clinically meaningful.
- Optional regularization terms to control batch effects or known confounders (age, site, technical batch).
Conceptually, this is close to a supervised or semi-supervised multi-view representation learning framework.
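A sketch of one plausible instantiation of the combined objective, assuming InfoNCE for alignment, cross-entropy for subtype, and a Cox partial likelihood for survival (the weights `w_*` and exact loss forms are my guesses; the paper may use different variants):

```python
import torch
import torch.nn.functional as F

def info_nce(z_img, z_omics, tau=0.1):
    """Symmetric InfoNCE: the matched patient pair is the positive,
    all other patients in the batch serve as negatives."""
    z_img = F.normalize(z_img, dim=1)
    z_omics = F.normalize(z_omics, dim=1)
    logits = z_img @ z_omics.t() / tau                    # (B, B) similarity matrix
    targets = torch.arange(z_img.size(0), device=z_img.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def cox_loss(risk, time, event):
    """Negative Cox partial log-likelihood (no tie handling).
    risk: (B,) predicted log-risk; event: (B,) 1.0 observed, 0.0 censored."""
    order = torch.argsort(time, descending=True)          # prefix = risk set
    risk, event = risk[order], event[order]
    log_risk_set = torch.logcumsumexp(risk, dim=0)
    return -((risk - log_risk_set) * event).sum() / event.sum().clamp(min=1.0)

def total_loss(z_img, z_omics, logits, y, risk, time, event,
               w_align=1.0, w_ce=1.0, w_cox=1.0):
    return (w_align * info_nce(z_img, z_omics)
            + w_ce * F.cross_entropy(logits, y)
            + w_cox * cox_loss(risk, time, event))
```

A reconstruction term (e.g., MSE from a decoder that predicts expression from z_img) would slot in as one more weighted term.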
| Model | Alignment loss | Task loss | Notes |
|---|---|---|---|
| Image-only | – | CE / Cox | Baseline WSI model |
| Omics-only | – | CE / Cox | Baseline transcriptomics model |
| Joint (paper) | Contrastive | CE + Cox | Multi-omics integrated latent space |
4. Key Results
- The integrated latent representations outperform unimodal baselines (image-only or RNA-only) on classification of molecular subtypes.
- Survival models built on the integrated space show improved risk stratification compared to clinical covariates alone.
- When visualizing the latent space (t-SNE / UMAP), clusters often align with both morphology patterns and expression-defined subgroups.
- Cross-modal prediction (e.g., predicting RNA from image embeddings) is not perfect but recovers major axes such as immune vs. stromal vs. tumor signals.
5. My Notes & Takeaways
- The framework is flexible: in principle, additional modalities (ATAC, methylation, spatial transcriptomics) could be added as more encoders.
- The choice of loss terms is crucial. An overly strong alignment loss can oversmooth genuine modality-specific signal; an overly weak one can fail to remove technical noise.
- For pathology, patch-level signals are heterogeneous. Slide-level aggregation might hide local patterns that are important for prognosis. This suggests combining multi-omics integration with multiple instance learning or region-of-interest modeling (see the attention-pooling sketch after this list).
- It is still unclear how much of the survival gain comes from “true biology” vs. better regularization and feature compression.
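On the MIL point above, a minimal sketch of attention-based MIL pooling in the style of Ilse et al. (2018), which could replace the mean pooling in the Section 3.1 encoder sketch (dimensions are assumptions):

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Attention-based MIL pooling: learns per-patch weights so the
    slide embedding can emphasize locally informative regions."""
    def __init__(self, in_dim=1024, attn_dim=256):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(in_dim, attn_dim), nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )

    def forward(self, patch_feats):                        # (n_patches, in_dim)
        a = torch.softmax(self.attn(patch_feats), dim=0)   # (n_patches, 1)
        return (a * patch_feats).sum(dim=0), a             # slide vector, weights
```

The weights `a` also give a crude interpretability handle: high-attention patches can be inspected against the omics-defined subgroups.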
6. Open Questions for My Future Work
- How to extend this framework to spatial and single-cell data, where each slide contains many cells/spots with partially matched omics?
- Can we design a latent space that explicitly separates shared vs. modality-specific components (e.g., via structured VAEs or flow matching)? (A rough shared/private encoder sketch follows this list.)
- For foundation models in pathology, is it better to: (a) first pretrain a strong image-only encoder, then align with omics; or (b) train a multimodal foundation model from scratch?
- How to evaluate whether the learned space truly captures causal biology, instead of correlational patterns tied to the cohort?
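Not from the paper, but one way to make the shared/private split from the second question explicit, as a deterministic simplification of the structured-VAE idea (dimensions are placeholders):

```python
import torch
import torch.nn as nn

class SharedPrivateEncoder(nn.Module):
    """Each modality's encoder emits a shared block (aligned across modalities)
    and a private block (free to keep modality-specific signal)."""
    def __init__(self, in_dim, shared_dim=64, private_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU())
        self.to_shared = nn.Linear(512, shared_dim)
        self.to_private = nn.Linear(512, private_dim)

    def forward(self, x):
        h = self.backbone(x)
        return self.to_shared(h), self.to_private(h)

# Only the shared blocks would enter the alignment loss; a reconstruction loss
# on torch.cat([z_shared, z_private], dim=1) keeps the private factors useful.
```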
I would like to revisit this paper when designing my own framework for multi-omics + histology integration, especially the loss design and the way they handle confounders.