Data Preparation¶
This how-to explains how to prepare biological data for multiverse.
Multiverse expects a well-formed MuData file, plus explicit metadata keys that say which column represents batch and which column represents biological labels.
Data State: What multiverse Assumes¶
Multiverse does not perform scientific preprocessing decisions for you. It preserves and records the choices you make before registration.
| Requirement | Practical guidance |
|---|---|
| Cells are observations | Cell barcodes should be in .obs_names. |
| Features are variables | Genes, peaks, proteins, etc. should be in .var_names. |
| Modalities are explicit | Use MuData modalities such as rna, atac, and adt for multimodal studies. |
batch_key metadata is present |
Use a string or categorical .obs column for donor, chemistry, lane, site, or another nuisance source. If not available, we will assign the same default value for all samples. |
cell_type_key metadata is present when available |
Use a string or categorical .obs column. |
| Preprocessing | Store the matrix state as raw counts, perform the processing expected by the selected models in the implementation through the GUI/manifest file, and document normalization, filtering, and HVG selection. |
Raw Counts, Normalized Data, and HVGs¶
Different models make different assumptions. Multiverse records what you run, but it does not convert one scientific data state into another without your knowledge.
Recommended practice:
- Keep raw counts in a layer or adjacent file when possible.
- For PCA baselines, normalized and log-transformed matrices with HVG selection are often appropriate.
- For probabilistic count models, preserve count-like input unless the model documentation says otherwise.
Metadata Columns¶
Your registered batch and cell_type keys control which metrics are possible.
adata.obs["batch"] = adata.obs["donor_id"].astype(str)
adata.obs["cell_type"] = adata.obs["manual_annotation"].astype(str)
Use stable labels:
- Good
batchvalues:donor_1,donor_2,10x_v2,10x_v3,site_a. - Good
cell_typevalues:CD4 T,B cell,monocyte,NK. - Avoid mixed types, empty strings, and columns with many accidental spellings.
If cell_type or batch is missing, Multiverse can still run models, but the evaluation metrics pipeline cannot run without having both available. To circumvent this limitation in the scib-metrics source code, we assign random labels for the samples for each of the missing keys (for now). Therefore, in case a label is misisng, the results shown for that missing label might be misleading.
Prepare RNA AnnData¶
from pathlib import Path
import scanpy as sc
adata = sc.read_h5ad("notebooks/processed_pbmc.h5ad")
adata.obs["batch"] = adata.obs["donor_id"].astype(str)
adata.obs["cell_type"] = adata.obs["annotation"].astype(str)
# Example PCA-oriented state. Adjust to your study and selected models.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=3000)
dataset_dir = Path("store/datasets/pbmc_rna")
data_dir = dataset_dir / "data"
data_dir.mkdir(parents=True, exist_ok=True)
adata.write_h5ad(data_dir / "rna.h5ad")
Create the manifest:
import yaml
manifest = {
"name": "PBMC RNA",
"omics": ["rna"],
"raw_files": {
"rna": "data/rna.h5ad",
},
"metadata_keys": {
"batch": "batch",
"cell_type": "cell_type",
},
}
with open(dataset_dir / "dataset.yaml", "w") as f:
yaml.safe_dump(manifest, f, sort_keys=False)
Register Through the GUI¶
Sequential GUI workflow:
- Start multiverse and open
http://localhost:28501. - Open the Registry tab.
- Expand Register New Dataset.
- If you already wrote
dataset.yaml, keep Build manifest from fields off and enter the manifest path. - If you want the GUI to create the manifest, switch Build manifest from fields on and fill in dataset name, omics, file paths,
batch_key, andcell_type_key. - Click Register Dataset.
- Click Refresh Registry.
- Confirm the dataset appears as
READY.
Read Results Back into Jupyter¶
Each successful run writes embeddings.h5 with a dataset named latent.
from pathlib import Path
import h5py
import mudata as md
import scanpy as sc
artifact_dir = Path("store/artifacts/run_output/store/artifacts/<artifact-id>")
# Copy the exact path from the Results tab.
with h5py.File(artifact_dir / "embeddings.h5", "r") as f:
latent = f["latent"][:]
mdata = md.read_h5mu("store/datasets/pbmc_multiome/data/processed.h5mu")
adata = mdata["rna"].copy()
adata.obsm["X_multiverse_multivi"] = latent
sc.pp.neighbors(adata, use_rep="X_multiverse_multivi")
sc.tl.umap(adata)
sc.pl.umap(adata, color=["batch", "cell_type"])
For MuData, attach the embedding to a modality or to a notebook-side AnnData object used for plotting. The important point is that row order must match the cells used by the model run.
Publishing Checklist¶
Before writing the paper, keep a copy of the files that make the run reproducible:
dataset.yamlfor each dataset;- the notebook or script that produced the registered
.h5ador.h5mu; run_manifest.yaml;- each successful run's
job_spec.json; metrics.json;run.logandcontainer.log;provenance.jsonor other provenance artifacts when present in the run directory.
In the Methods section, report:
- matrix state for each modality;
- filtering thresholds;
- normalization and transformation choices;
- HVG or peak selection rules;
- batch and cell-type metadata keys;
- model names and versions;
- random seed;
- metric families used for bio-conservation and batch correction.
Recommended wording:
All integration benchmarks were executed with multiverse. The benchmark plan, including
dataset identifiers, model selections, hyperparameters, random seed, and requested
metrics, is provided as `run_manifest.yaml` in the Supplementary Material. Per-run
runtime instructions and provenance files are provided with the corresponding model
artifacts.