Skip to content

Evaluation Metrics

This reference explains the biological and technical metrics used to compare model outputs. The platform evaluates successful model runs from their saved latent embeddings, then records metrics for comparison in the results tables and MLflow when tracking is enabled.

Required Dataset Metadata

Evaluation depends on metadata registered with the dataset:

Metadata key Purpose
batch_key Observation column identifying experimental batch, donor, technology, or another known nuisance grouping. Required for batch-correction metrics.
cell_type_key Observation column identifying biological labels. Used by supervised bio-conservation metrics.

Important note: If cell_type or batch is missing, Multiverse can still run models, but the evaluation metrics pipeline cannot run without having both available. To circumvent this limitation in the scib-metrics source code, we assign random labels for the samples for each of the missing keys (for now). Therefore, in case a label is misisng, the results shown for that missing label might be misleading.

We use Bio-conservation and Batch-correction metrics from scib-metrics. For more information about these metrics, please consult the scib-metrics package.

Interpretation Notes

Use bio-conservation and batch-correction metrics together. A model can mix batches well by erasing meaningful cell-type structure, or preserve biology while leaving strong technical separation. The most defensible interpretation is therefore comparative: inspect both metric groups across the same dataset, metadata keys, and selected model set.

Metric availability is part of the result. If a supervised metric is absent, first check whether the dataset was registered with a valid cell_type_key and contains more than one unique value. If batch-correction metrics are absent, check that the registered batch_key exists and contains more than one unique value.

Model-Level Metrics

Model-level metrics come from each model wrapper and are written to metrics.json before cross-model evaluation. They are useful for diagnostics but should not be treated as interchangeable biological benchmark scores.

Model Default model-level metrics
PCA total_variance
MOFA total_variance
MultiVI silhouette_score when labels are available
TotalVI elbo_train, reconstruction_loss_train
Mowgli ot_loss
Cobolt loss

Losses and ELBO values are model-specific training diagnostics. They can help compare trials of the same model, especially during Optuna sweeps, but they should not be used alone to rank different model families.

How to Cite Metric Results

When reporting metrics, cite multiverse and the underlying metric or model method where appropriate. Archive run_manifest.yaml, metrics.json, and provenance files so readers can connect each reported value to its run recipe.