Data Registration¶

This how-to explains how to make a prepared dataset visible to multiverse through the Streamlit Registry tab. The CLI equivalent is `make register slug=<slug>` or `uv run multiverse register-dataset --slug <slug>`.

See Data Preparation for notebook-side formatting details. This page covers onboarding the prepared files into the platform.

What Registration Does¶

Registration tells multiverse:

what the dataset is called;
which modalities are available;
where the prepared files live;
which .obs column is the batch key;
which .obs column is the cell-type key.

[IMAGE: Registry Tab Ingestion Wizard]

Tutorial: Register a Dataset Visually¶

Start multiverse and open the Streamlit GUI.
Open the Registry tab.
Expand Register New Dataset.
Switch on Build manifest from fields if you do not already have a dataset.yaml.
Enter a descriptive dataset name, for example PBMC Multiome RNA+ATAC.
Select available omics: rna, atac, adt, or other.
Enter the path to each prepared .h5ad or .h5mu file.
Enter batch_key, for example donor_id, sample, or chemistry.
Enter cell_type_key, for example cell_type, annotation, or cell_ontology_class.
Click Register Dataset.
Click Refresh Registry.
Confirm the dataset appears with status READY.
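
The Build manifest from fields option can be pictured as assembling a small mapping from the form values. The sketch below is only an illustration under that assumption: `build_manifest` is a hypothetical helper, not part of multiverse, the `data/atac.h5ad` path is made up for the example, and the result is printed as JSON purely for inspection. The field names follow the dataset.yaml layout described on this page.

```python
import json


def build_manifest(name, files, batch_key, cell_type_key):
    """Assemble a manifest mapping from wizard-style fields.

    Hypothetical helper for illustration; not part of multiverse.
    `files` maps a modality name (e.g. "rna") to a prepared file path.
    """
    if not name.strip():
        raise ValueError("dataset name must not be empty")
    if not files:
        raise ValueError("at least one modality file is required")
    return {
        "name": name,
        "omics": list(files),  # modalities in the order they were entered
        "raw_files": dict(files),
        "metadata_keys": {"batch": batch_key, "cell_type": cell_type_key},
    }


manifest = build_manifest(
    name="PBMC Multiome RNA+ATAC",
    files={"rna": "data/rna.h5ad", "atac": "data/atac.h5ad"},  # example paths
    batch_key="donor_id",
    cell_type_key="cell_type",
)
print(json.dumps(manifest, indent=2))
```

Deriving `omics` from the keys of `files` keeps the two fields from drifting apart, which is one plausible reason to build the manifest from fields rather than by hand.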

Hello World Dataset¶

Minimal RNA-only dataset manifest:

name: "Hello PBMC RNA"
omics: ["rna"]
raw_files:
  rna: "data/rna.h5ad"
metadata_keys:
  batch: "batch"
  cell_type: "cell_type"

Folder layout:

store/datasets/hello_pbmc/
  dataset.yaml
  data/
    rna.h5ad

Notebook-side sanity check:

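import scanpy as sc

# Load the prepared RNA file from the dataset folder.
adata = sc.read_h5ad("store/datasets/hello_pbmc/data/rna.h5ad")

# The columns named in metadata_keys must exist in .obs.
assert "batch" in adata.obs.columns, "missing .obs column: batch"
assert "cell_type" in adata.obs.columns, "missing .obs column: cell_type"

# The matrix must contain at least one cell and one feature.
assert adata.n_obs > 0, "dataset has no cells"
assert adata.n_vars > 0, "dataset has no features"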
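
A slightly stricter variant of this check can also catch a batch column that holds a single value, which this page lists as a reason batch metrics are skipped. The sketch below is a hypothetical helper, not multiverse's own validation. It accepts any mapping from column name to values, so a plain dict works for a dry run and `adata.obs` should work in a notebook.

```python
def check_obs(obs, batch_key, cell_type_key):
    """Return a list of problems found in an .obs-like table.

    `obs` is any mapping from column name to values, for example a
    dict of lists or a pandas DataFrame such as `adata.obs`.
    Hypothetical helper; multiverse may run different checks.
    """
    problems = []
    for key in (batch_key, cell_type_key):
        if key not in obs:
            problems.append(f"column '{key}' is missing from .obs")
    # A batch column with a single value leaves nothing to mix.
    if batch_key in obs and len(set(obs[batch_key])) < 2:
        problems.append(
            f"batch column '{batch_key}' has fewer than two distinct values"
        )
    return problems


# Toy stand-in for adata.obs with only one batch.
obs = {
    "batch": ["b1", "b1", "b1"],
    "cell_type": ["T cell", "B cell", "T cell"],
}
problems = check_obs(obs, "batch", "cell_type")
print(problems)
```

Returning a list instead of raising on the first failure lets a notebook report every issue in one pass.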

Two Registration Modes¶

A dataset can be registered in one of two ways:

Raw ingestion: declare raw_files and run preprocessing to fuse them into data/processed.h5mu, as in the Hello World manifest above.
Processed registration: register an already-processed .h5mu or .h5ad directly and skip preprocessing. Declare processed_path instead of raw_files:

name: "Hello PBMC (processed)"
omics: ["rna"]
processed_path: "data/processed.h5mu"
metadata_keys:
  batch: "batch"
  cell_type: "cell_type"

The manifest must provide exactly one of raw_files or processed_path. Model runs always consume the processed .h5mu; raw_files belongs only to the raw-ingestion workflow.
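
The exactly-one rule is easy to check before registering. The following is a minimal sketch that assumes the manifest has already been loaded into a Python dict. `registration_mode` is a hypothetical name that mirrors the rule stated above; it is not multiverse's own validator.

```python
def registration_mode(manifest):
    """Return "raw" or "processed" for a manifest mapping.

    Raises ValueError unless exactly one of `raw_files` or
    `processed_path` is present and non-empty.
    Hypothetical helper; not part of multiverse.
    """
    has_raw = bool(manifest.get("raw_files"))
    has_processed = bool(manifest.get("processed_path"))
    if has_raw == has_processed:  # both present or both absent
        raise ValueError(
            "manifest must provide exactly one of raw_files or processed_path"
        )
    return "raw" if has_raw else "processed"


raw_manifest = {"name": "Hello PBMC RNA", "raw_files": {"rna": "data/rna.h5ad"}}
processed_manifest = {
    "name": "Hello PBMC (processed)",
    "processed_path": "data/processed.h5mu",
}
print(registration_mode(raw_manifest))
print(registration_mode(processed_manifest))
```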

Register from the CLI¶

For a dataset whose manifest is stored at store/datasets/hello_pbmc/dataset.yaml:

uv run multiverse register-dataset --slug hello_pbmc
# or with an explicit manifest path
uv run multiverse register-dataset --manifest store/datasets/hello_pbmc/dataset.yaml

Use --update when you intentionally changed an existing manifest:

uv run multiverse register-dataset --slug hello_pbmc --update
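
When registration is scripted, for example from a notebook, the same commands can be assembled in Python. The sketch below only builds the argument list and uses just the flags shown on this page (`--slug`, `--manifest`, `--update`). `register_command` is a hypothetical wrapper. Treating `--slug` and `--manifest` as mutually exclusive, and allowing `--update` with either, are assumptions of this sketch, not documented behavior.

```python
import shlex


def register_command(slug=None, manifest=None, update=False):
    """Build the argv list for the register-dataset command.

    Hypothetical wrapper around the commands shown above. Assumes that
    exactly one of `slug` or `manifest` should be given.
    """
    if (slug is None) == (manifest is None):
        raise ValueError("pass exactly one of slug or manifest")
    cmd = ["uv", "run", "multiverse", "register-dataset"]
    cmd += ["--slug", slug] if slug is not None else ["--manifest", manifest]
    if update:
        cmd.append("--update")
    return cmd


cmd = register_command(slug="hello_pbmc", update=True)
print(shlex.join(cmd))
```

To execute the command, pass the list to `subprocess.run(cmd, check=True)`. Keeping the arguments as a list avoids shell quoting problems with paths that contain spaces.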

Reference: `dataset.yaml` Fields¶

Field	Required	Meaning	Example
`name`	Yes	Human-readable dataset name.	`PBMC Multiome RNA+ATAC`
`omics`	Yes	Modalities available in the dataset.	`["rna", "atac"]`
`raw_files`	Conditional	Mapping from modality to raw file path relative to the dataset folder. Required for raw ingestion; omit when using `processed_path`.	`rna: "data/rna.h5ad"`
`processed_path`	Conditional	Path (relative to the dataset folder) to an already-processed `.h5mu`/`.h5ad`. Required for processed registration; omit when using `raw_files`.	`data/processed.h5mu`
`metadata_keys.batch`	Recommended	`.obs` column used for batch-correction metrics.	`donor_id`
`metadata_keys.cell_type`	Optional	`.obs` column used for supervised bio-conservation metrics.	`cell_type`

Explanation: Why Metadata Keys Matter¶

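The same embedding can look good or bad depending on the biological question. batch_key (metadata_keys.batch in dataset.yaml) tells multiverse which technical or donor grouping should be mixed. cell_type_key (metadata_keys.cell_type in dataset.yaml) tells multiverse which biological labels should be preserved. For example, if donors are the batch, a good embedding interleaves cells from different donors while keeping distinct cell types apart.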

flowchart TD
    A[Prepared AnnData or MuData] --> B[batch_key]
    A --> C[cell_type_key]
    B --> D[Batch-correction metrics]
    C --> E[Bio-conservation metrics]
    D --> F[Comparison report]
    E --> F
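
To make the distinction concrete, here is a toy sketch that scores one clustering against two different keys. `mean_purity` is a hypothetical helper and a deliberately simple stand-in. It is not one of the metrics multiverse reports, and the cluster assignments are invented for the example.

```python
from collections import Counter, defaultdict


def mean_purity(clusters, labels):
    """Share of cells that carry the majority label of their cluster.

    1.0 means every cluster holds a single label. Lower values mean
    labels are mixed within clusters. Toy score for illustration only.
    """
    by_cluster = defaultdict(list)
    for cluster, label in zip(clusters, labels):
        by_cluster[cluster].append(label)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in by_cluster.values()
    )
    return majority_total / len(labels)


# Eight cells grouped into two clusters by some embedding.
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
cell_type = ["T", "T", "T", "T", "B", "B", "B", "B"]
batch = ["d1", "d2", "d1", "d2", "d1", "d2", "d1", "d2"]

label_purity = mean_purity(clusters, cell_type)  # high is good: labels preserved
batch_purity = mean_purity(clusters, batch)      # low is good: batches mixed
print(label_purity, batch_purity)
```

The same clustering is judged in opposite directions depending on which column is supplied, which is why swapping or misnaming the two keys can make a good embedding look poor.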

Common Errors¶

Symptom	Likely cause	What to do
Dataset does not appear after registration	Registry view is cached.	Click Refresh Registry.
Registration fails with missing file	`raw_files` or `processed_path` points to the wrong location.	Make paths relative to `store/datasets/<slug>/`.
Batch metrics are skipped	Batch column missing or has one value.	Check `adata.obs[batch_key].value_counts()` in Jupyter.
Label metrics are skipped	`cell_type_key` missing or misspelled.	Confirm the name exactly matches a column in `.obs`.
A model is incompatible	Dataset modalities do not match model requirements.	Choose a compatible model in Configure.
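
The missing-file case can be checked before registering. The sketch below resolves every declared path against the dataset folder and lists the ones that do not exist. `missing_files` is a hypothetical helper, the manifest is assumed to be loaded into a dict, and the demo uses a temporary folder with a made-up `data/atac.h5ad` entry so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def missing_files(dataset_dir, manifest):
    """List declared paths that do not exist relative to the dataset folder.

    Hypothetical helper; checks both `raw_files` and `processed_path`.
    """
    dataset_dir = Path(dataset_dir)
    declared = list(manifest.get("raw_files", {}).values())
    if "processed_path" in manifest:
        declared.append(manifest["processed_path"])
    return [p for p in declared if not (dataset_dir / p).is_file()]


# Demo: a temporary dataset folder that contains only the RNA file.
with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp) / "hello_pbmc"
    (dataset_dir / "data").mkdir(parents=True)
    (dataset_dir / "data" / "rna.h5ad").touch()
    manifest = {"raw_files": {"rna": "data/rna.h5ad", "atac": "data/atac.h5ad"}}
    missing = missing_files(dataset_dir, manifest)

print(missing)
```

In a real project, `dataset_dir` would be `store/datasets/<slug>/`, the folder that holds dataset.yaml.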

How to Cite Registered Data¶

For publications, archive the dataset.yaml file with the notebook that produced the .h5ad or .h5mu. In Methods, report the matrix state, filtering, normalization, batch key, and cell-type key.