Data Registration

This how-to explains how to make a prepared dataset visible to multiverse through the Streamlit Registry tab. The CLI equivalent is make register slug=<slug> or uv run multiverse register-dataset --slug <slug>.

See Data Preparation for notebook-side formatting details. This page covers only how to onboard the prepared files into the platform.

What Registration Does

Registration tells multiverse:

  • what the dataset is called;
  • which modalities are available;
  • where the prepared files live;
  • which .obs column is the batch key;
  • which .obs column is the cell-type key.

[IMAGE: Registry Tab Ingestion Wizard]

Tutorial: Register a Dataset Visually

  1. Start multiverse and open the Streamlit GUI.
  2. Open the Registry tab.
  3. Expand Register New Dataset.
  4. Switch on Build manifest from fields if you do not already have a dataset.yaml.
  5. Enter a descriptive dataset name, for example PBMC Multiome RNA+ATAC.
  6. Select available omics: rna, atac, adt, or other.
  7. Enter the path to each prepared .h5ad or .h5mu file.
  8. Enter batch_key, for example donor_id, sample, or chemistry.
  9. Enter cell_type_key, for example cell_type, annotation, or cell_ontology_class.
  10. Click Register Dataset.
  11. Click Refresh Registry.
  12. Confirm the dataset appears with status READY.
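
The wizard fields above correspond to the entries of a dataset.yaml manifest. The following is a minimal sketch of that mapping, not multiverse's own code: `build_manifest` is a hypothetical helper, the path `data/atac.h5ad` is made up for illustration, and the key names are assumed to match the manifest format documented on this page.

```python
from typing import Dict, List, Optional


def build_manifest(
    name: str,
    omics: List[str],
    raw_files: Dict[str, str],
    batch_key: Optional[str] = None,
    cell_type_key: Optional[str] = None,
) -> dict:
    """Assemble a dataset.yaml-shaped dict from the wizard fields (hypothetical helper)."""
    # Every selected modality needs a prepared file path (steps 6 and 7).
    missing = [modality for modality in omics if modality not in raw_files]
    if missing:
        raise ValueError(f"no file path given for modalities: {missing}")

    manifest = {"name": name, "omics": list(omics), "raw_files": dict(raw_files)}

    # batch_key and cell_type_key are stored under metadata_keys (steps 8 and 9).
    metadata_keys = {}
    if batch_key:
        metadata_keys["batch"] = batch_key
    if cell_type_key:
        metadata_keys["cell_type"] = cell_type_key
    if metadata_keys:
        manifest["metadata_keys"] = metadata_keys
    return manifest


example = build_manifest(
    name="PBMC Multiome RNA+ATAC",
    omics=["rna", "atac"],
    raw_files={"rna": "data/rna.h5ad", "atac": "data/atac.h5ad"},  # illustrative paths
    batch_key="donor_id",
    cell_type_key="cell_type",
)
print(example)
```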

Hello World Dataset

Minimal RNA-only dataset manifest:

name: "Hello PBMC RNA"
omics: ["rna"]
raw_files:
  rna: "data/rna.h5ad"
metadata_keys:
  batch: "batch"
  cell_type: "cell_type"

Folder layout:

store/datasets/hello_pbmc/
  dataset.yaml
  data/
    rna.h5ad

Notebook-side sanity check:

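import scanpy as sc

# Load the prepared RNA file from the dataset folder.
adata = sc.read_h5ad("store/datasets/hello_pbmc/data/rna.h5ad")

# The columns named under metadata_keys must exist in .obs.
assert "batch" in adata.obs
assert "cell_type" in adata.obs

# The matrix must contain at least one cell and one feature.
assert adata.n_obs > 0
assert adata.n_vars > 0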

Two Registration Modes

A dataset can be registered in one of two ways:

  • Raw ingestion: declare raw_files, as in the Hello World manifest above, and run preprocessing to fuse the files into data/processed.h5mu.
  • Processed registration: you already have a processed .h5mu or .h5ad file and want to register it directly, skipping preprocessing. Declare processed_path instead of raw_files:

name: "Hello PBMC (processed)"
omics: ["rna"]
processed_path: "data/processed.h5mu"
metadata_keys:
  batch: "batch"
  cell_type: "cell_type"

The manifest must provide exactly one of raw_files or processed_path. Model runs always consume the processed .h5mu; raw_files belongs only to the raw-ingestion workflow.
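
The exactly-one rule can be checked before registration. The following is a minimal sketch under stated assumptions: `manifest_problems` is a hypothetical helper, not multiverse's validator, and it operates on a manifest that has already been loaded into a plain dict.

```python
from typing import List


def manifest_problems(manifest: dict) -> List[str]:
    """Return human-readable problems found in a manifest dict (hypothetical helper)."""
    problems = []

    # name and omics are required in both registration modes.
    for field in ("name", "omics"):
        if not manifest.get(field):
            problems.append(f"missing required field: {field}")

    # Exactly one of raw_files or processed_path must be declared.
    has_raw = "raw_files" in manifest
    has_processed = "processed_path" in manifest
    if has_raw == has_processed:
        problems.append("provide exactly one of raw_files or processed_path")

    return problems


processed_manifest = {
    "name": "Hello PBMC (processed)",
    "omics": ["rna"],
    "processed_path": "data/processed.h5mu",
    "metadata_keys": {"batch": "batch", "cell_type": "cell_type"},
}
print(manifest_problems(processed_manifest))
```

A manifest that declares both fields, or neither, yields one problem string from the last check.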

Register from the CLI

For a dataset stored at store/datasets/hello_pbmc/dataset.yaml:

uv run multiverse register-dataset --slug hello_pbmc
# or with an explicit manifest path
uv run multiverse register-dataset --manifest store/datasets/hello_pbmc/dataset.yaml

Use --update when you intentionally changed an existing manifest:

uv run multiverse register-dataset --slug hello_pbmc --update

Reference: dataset.yaml Fields

| Field | Required | Meaning | Example |
|---|---|---|---|
| name | Yes | Human-readable dataset name. | PBMC Multiome RNA+ATAC |
| omics | Yes | Modalities available in the dataset. | ["rna", "atac"] |
| raw_files | Conditional | Mapping from modality to raw file path relative to the dataset folder. Required for raw ingestion; omit when using processed_path. | rna: "data/rna.h5ad" |
| processed_path | Conditional | Path (relative to the dataset folder) to an already-processed .h5mu/.h5ad. Required for processed registration; omit when using raw_files. | data/processed.h5mu |
| metadata_keys.batch | Recommended | .obs column used for batch-correction metrics. | donor_id |
| metadata_keys.cell_type | Optional | .obs column used for supervised bio-conservation metrics. | cell_type |

Explanation: Why Metadata Keys Matter

The same embedding can look good or bad depending on the biological question. batch_key (metadata_keys.batch in dataset.yaml) tells multiverse which technical or donor grouping should be mixed. cell_type_key (metadata_keys.cell_type in dataset.yaml) tells multiverse which biological labels should be preserved.

flowchart TD
    A[Prepared AnnData or MuData] --> B[batch_key]
    A --> C[cell_type_key]
    B --> D[Batch-correction metrics]
    C --> E[Bio-conservation metrics]
    D --> F[Comparison report]
    E --> F

Common Errors

| Symptom | Likely cause | What to do |
|---|---|---|
| Dataset does not appear after registration | Registry view is cached. | Click Refresh Registry. |
| Registration fails with missing file | raw_files path is wrong. | Make paths relative to store/datasets/<slug>/. |
| Batch metrics are skipped | Batch column missing or has one value. | Check adata.obs[batch_key].value_counts() in Jupyter. |
| Label metrics are skipped | cell_type_key missing or misspelled. | Confirm the column name exactly matches .obs. |
| A model is incompatible | Dataset modalities do not match model requirements. | Choose a compatible model in Configure. |
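
The two skipped-metrics rows can be diagnosed with one small check. This is a sketch rather than multiverse's own logic: `metadata_column_problem` is a hypothetical helper that accepts any iterable of labels, and it does not treat NaN values specially.

```python
from collections import Counter
from typing import Iterable, Optional


def metadata_column_problem(values: Optional[Iterable]) -> Optional[str]:
    """Return a reason the column is unusable for metrics, or None if it looks fine."""
    if values is None:
        return "column is missing"
    # Count distinct labels, ignoring entries that are None.
    counts = Counter(value for value in values if value is not None)
    if not counts:
        return "column is empty"
    if len(counts) < 2:
        return "column has only one distinct value"
    return None


# In a notebook, the labels could come from adata.obs.get("batch"),
# which returns None when the column does not exist.
print(metadata_column_problem(["donor_a", "donor_a", "donor_b"]))
print(metadata_column_problem(["donor_a", "donor_a"]))
```

The first call passes the check and the second reports a single distinct value, which is the case where batch metrics are skipped.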

How to Cite Registered Data

For publications, archive the dataset.yaml file with the notebook that produced the .h5ad or .h5mu. In Methods, report the matrix state, filtering, normalization, batch key, and cell-type key.