Data Registration¶
This how-to explains how to make a prepared dataset visible to multiverse through the Streamlit Registry tab. The CLI equivalent is make register slug=<slug> or uv run multiverse register-dataset --slug <slug>.
Use Data Preparation for notebook-side formatting details. This page focuses on onboarding the prepared files into the platform.
What Registration Does¶
Registration tells multiverse:
- what the dataset is called;
- which modalities are available;
- where the prepared files live;
- which .obs column is the batch key;
- which .obs column is the cell-type key.
[IMAGE: Registry Tab Ingestion Wizard]
Tutorial: Register a Dataset Visually¶
- Start multiverse and open the Streamlit GUI.
- Open the Registry tab.
- Expand Register New Dataset.
- Switch on Build manifest from fields if you do not already have a dataset.yaml.
- Enter a descriptive dataset name, for example PBMC Multiome RNA+ATAC.
- Select available omics: rna, atac, adt, or other.
- Enter the path to each prepared .h5ad or .h5mu file.
- Enter batch_key, for example donor_id, sample, or chemistry.
- Enter cell_type_key, for example cell_type, annotation, or cell_ontology_class.
- Click Register Dataset.
- Click Refresh Registry.
- Confirm the dataset appears with status READY.
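The fields collected by the wizard map onto a small manifest structure. The following is a minimal sketch of that mapping, not the platform's own code: `build_manifest` is a hypothetical helper, and the `data/atac.h5ad` path is illustrative. It emits JSON, which most YAML parsers accept, so the output can serve as a starting point for a dataset.yaml.

```python
import json

# Modalities offered by the wizard's "available omics" selector.
ALLOWED_OMICS = {"rna", "atac", "adt", "other"}


def build_manifest(name, files_by_omic, batch_key, cell_type_key):
    """Assemble a dataset.yaml-shaped dict from the wizard's form fields.

    Hypothetical helper for illustration; multiverse may build its
    manifest differently.
    """
    if not name.strip():
        raise ValueError("dataset name must not be empty")
    unknown = set(files_by_omic) - ALLOWED_OMICS
    if unknown:
        raise ValueError(f"unknown omics: {sorted(unknown)}")
    return {
        "name": name,
        "omics": list(files_by_omic),
        "raw_files": dict(files_by_omic),
        "metadata_keys": {"batch": batch_key, "cell_type": cell_type_key},
    }


manifest = build_manifest(
    name="PBMC Multiome RNA+ATAC",
    # Illustrative paths; use the locations of your own prepared files.
    files_by_omic={"rna": "data/rna.h5ad", "atac": "data/atac.h5ad"},
    batch_key="donor_id",
    cell_type_key="cell_type",
)
print(json.dumps(manifest, indent=2))
```

Rejecting unknown modalities early mirrors the fixed choices in the omics selector and keeps typos out of the manifest.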
Hello World Dataset¶
Minimal RNA-only dataset manifest:
name: "Hello PBMC RNA"
omics: ["rna"]
raw_files:
  rna: "data/rna.h5ad"
metadata_keys:
  batch: "batch"
  cell_type: "cell_type"
Folder layout:
store/datasets/hello_pbmc/
  dataset.yaml
  data/
    rna.h5ad
Notebook-side sanity check:
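import scanpy as sc

# Load the prepared RNA file that raw_files.rna points to.
adata = sc.read_h5ad("store/datasets/hello_pbmc/data/rna.h5ad")

# Both metadata_keys columns must exist in .obs.
assert "batch" in adata.obs
assert "cell_type" in adata.obs

# The matrix must contain at least one cell and one feature.
assert adata.n_obs > 0
assert adata.n_vars > 0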
Two registration modes¶
A dataset can be registered in one of two ways:
- Raw ingestion: declare raw_files and run preprocessing to fuse them into data/processed.h5mu (shown above).
- Processed registration: you already have a processed .h5mu/.h5ad and want to register it directly, skipping preprocessing. Declare processed_path instead of raw_files:
name: "Hello PBMC (processed)"
omics: ["rna"]
processed_path: "data/processed.h5mu"
metadata_keys:
  batch: "batch"
  cell_type: "cell_type"
The manifest must provide exactly one of raw_files or processed_path.
Model runs always consume the processed .h5mu; raw_files belongs only to
the raw-ingestion workflow.
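The exactly-one rule can be checked before registering. This is a minimal sketch, assuming the manifest has already been loaded into a plain dict; `registration_mode` is a hypothetical name, not part of the multiverse API.

```python
def registration_mode(manifest):
    """Return "raw" or "processed" for a manifest dict.

    Raises ValueError when the manifest declares both raw_files and
    processed_path, or neither. Hypothetical helper for illustration.
    """
    has_raw = bool(manifest.get("raw_files"))
    has_processed = bool(manifest.get("processed_path"))
    # Equal flags mean both keys are set or both are absent: either is invalid.
    if has_raw == has_processed:
        raise ValueError(
            "manifest must provide exactly one of raw_files or processed_path"
        )
    return "raw" if has_raw else "processed"


print(registration_mode({"raw_files": {"rna": "data/rna.h5ad"}}))    # raw
print(registration_mode({"processed_path": "data/processed.h5mu"}))  # processed
```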
Register from the CLI¶
For a dataset stored at store/datasets/hello_pbmc/dataset.yaml:
uv run multiverse register-dataset --slug hello_pbmc
# or with an explicit manifest path
uv run multiverse register-dataset --manifest store/datasets/hello_pbmc/dataset.yaml
Pass --update to the same command when you have intentionally changed an existing manifest.
Reference: dataset.yaml Fields¶
| Field | Required | Meaning | Example |
|---|---|---|---|
| name | Yes | Human-readable dataset name. | PBMC Multiome RNA+ATAC |
| omics | Yes | Modalities available in the dataset. | ["rna", "atac"] |
| raw_files | Conditional | Mapping from modality to raw file path relative to the dataset folder. Required for raw ingestion; omit when using processed_path. | rna: "data/rna.h5ad" |
| processed_path | Conditional | Path (relative to the dataset folder) to an already-processed .h5mu/.h5ad. Required for processed registration; omit when using raw_files. | data/processed.h5mu |
| metadata_keys.batch | Recommended | .obs column used for batch-correction metrics. | donor_id |
| metadata_keys.cell_type | Optional | .obs column used for supervised bio-conservation metrics. | cell_type |
Explanation: Why Metadata Keys Matter¶
The same embedding can look good or bad depending on the biological question. batch_key tells multiverse which technical or donor grouping should be mixed. cell_type_key tells multiverse which biological labels should be preserved.
flowchart TD
A[Prepared AnnData or MuData] --> B[batch_key]
A --> C[cell_type_key]
B --> D[Batch-correction metrics]
C --> E[Bio-conservation metrics]
D --> F[Comparison report]
E --> F
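A metadata key is only useful for metrics when its column exists and separates cells into more than one group. The sketch below classifies a key against an obs-like mapping of column name to per-cell values. `check_metadata_key` and its status strings are hypothetical, and the toy dict stands in for adata.obs; a pandas DataFrame supports the same membership and indexing operations, so the function should also accept adata.obs directly.

```python
from collections import Counter


def check_metadata_key(obs, key):
    """Classify one metadata column of an obs-like mapping.

    Returns "missing" if the column is absent, "single_value" if it has
    at most one distinct value, and "ok" otherwise.
    Hypothetical helper for illustration.
    """
    if key not in obs:
        return "missing"
    counts = Counter(obs[key])
    return "ok" if len(counts) > 1 else "single_value"


# Toy stand-in for adata.obs: column name -> per-cell values.
obs = {
    "donor_id": ["d1", "d1", "d2", "d2"],
    "chemistry": ["v3", "v3", "v3", "v3"],
    "cell_type": ["T", "B", "T", "B"],
}
print(check_metadata_key(obs, "donor_id"))    # ok
print(check_metadata_key(obs, "chemistry"))   # single_value
print(check_metadata_key(obs, "annotation"))  # missing
```

In this toy example, chemistry would be a poor batch_key because every cell shares one value, so there is nothing to mix.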
Common Errors¶
| Symptom | Likely cause | What to do |
|---|---|---|
| Dataset does not appear after registration | Registry view is cached. | Click Refresh Registry. |
| Registration fails with missing file | raw_files path is wrong. | Make paths relative to store/datasets/<slug>/. |
| Batch metrics are skipped | Batch column missing or has one value. | Check adata.obs[batch_key].value_counts() in Jupyter. |
| Label metrics are skipped | cell_type_key missing or misspelled. | Confirm the column name exactly matches .obs. |
| A model is incompatible | Dataset modalities do not match model requirements. | Choose a compatible model in Configure. |
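The missing-file case can be caught before registration by resolving each raw_files entry against the dataset folder. This is a minimal sketch using only the standard library: `missing_raw_files` is a hypothetical helper, and the temporary directory merely imitates the store/datasets/<slug>/ layout.

```python
import tempfile
from pathlib import Path


def missing_raw_files(dataset_dir, raw_files):
    """Map each modality to its resolved path when that file does not exist.

    Paths in raw_files are interpreted relative to dataset_dir.
    Hypothetical helper for illustration.
    """
    root = Path(dataset_dir)
    return {
        omic: str(root / rel)
        for omic, rel in raw_files.items()
        if not (root / rel).is_file()
    }


# Imitate store/datasets/hello_pbmc/ in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "store" / "datasets" / "hello_pbmc"
    (root / "data").mkdir(parents=True)
    (root / "data" / "rna.h5ad").touch()

    print(missing_raw_files(root, {"rna": "data/rna.h5ad"}))  # {}
    # A path that forgets the data/ prefix is reported as missing.
    print(sorted(missing_raw_files(root, {"rna": "rna.h5ad"})))  # ['rna']
```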
How to Cite Registered Data¶
For publications, archive the dataset.yaml file with the notebook that produced the .h5ad or .h5mu. In Methods, report the matrix state, filtering, normalization, batch key, and cell-type key.