Observability¶
multiverse ships two observability services as part of docker-compose.yml: an MLflow tracking server and an Optuna Dashboard. They are launched by make services-up and stopped by make services-down. The Streamlit Analysis tab embeds both, but each is also a fully featured standalone web UI.
This page documents what gets logged where, how the components are wired, and how to verify connectivity. It is written for platform operators; researchers can usually rely on the embedded views without thinking about the plumbing.
Service Map¶
| Service | URL | Container | Backing store | Purpose |
|---|---|---|---|---|
| MLflow | http://localhost:25000 |
mvr-mlflow |
store/mlflow.db (SQLite) + store/artifacts/ |
Cross-run parameter and metric comparison, artifact browser. |
| Optuna Dashboard | http://localhost:28080 |
mvr-optuna |
store/optuna.db (SQLite) |
Sweep visualization, parameter importance, pruning history. |
Streamlit (optional, profile gui) |
http://localhost:28501 |
mvr-streamlit |
host bind-mount + Docker socket | Runs the GUI inside a container for shared lab installs. |
Host ports default to the high range in the repo-root .env file (25000, 28080, 28501) so they are less likely to clash with other users on a shared machine. Override MLFLOW_PORT, OPTUNA_PORT, STREAMLIT_PORT, or MLFLOW_TRACKING_URI in .env or your shell. All three services bind-mount ./store to /data, so they read and write the same SQLite databases as the host orchestrator.
MLflow Run Model¶
Each containerized model execution corresponds to one MLflow run, opened by the host orchestrator before the container launches and closed after it exits. The run captures four kinds of data:
- Hyperparameters and tags — logged at run start by
multiverse.tracking.start_parent_mlflow_run(), so they are visible in MLflow before training begins. - System metrics — CPU, GPU, and RAM utilization sampled by MLflow's built-in monitor while the parent run is open (i.e., the duration of the container).
- Per-epoch metrics — streamed from inside the container by
multiverse.worker.EpochLogger. The host injectsMLFLOW_RUN_IDinto the container environment, andEpochLoggerattaches to that run instead of opening a duplicate. Models without per-epoch hooks (e.g. PCA) skip this step. - Final scalars and artifacts — appended by the host after the container exits via
log_successful_run_to_mlflow(), which sanitisesNaN/±Inf, flattens nested metric dictionaries, and then closes the run with statusFINISHEDorFAILED.
Optuna sweeps appear as child runs under a parent MLflow run that represents the study, so a sweep's trials remain navigable as a group.
Network Configuration¶
The Docker runner forwards MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_NAME into each model container. On Linux, localhost / 127.0.0.1 URIs are rewritten to host.docker.internal and the container is started with --add-host=host.docker.internal:host-gateway, so containers can reach an MLflow server bound on the host loopback.
If the MLflow service is itself running inside Docker (the default after make services-up), the URI resolves to http://mlflow:5000 over the compose network.
If your MLflow server is bound to 127.0.0.1 only on the host, start it with --host 0.0.0.0 so the gateway can route container traffic.
Optuna Study Model¶
When a manifest specifies globals.run_gridsearch: true, multiverse/runner/tuner.py creates one Optuna study per job. Studies are persisted to store/optuna.db and surfaced in the Optuna Dashboard.
| Concept | Where it lives |
|---|---|
| Study name | <experiment_name>__<dataset_slug>__<model_slug> |
| Trial parameters | Sampled from the model's hyperparameter schema; sweepable fields are flagged x-sweepable: true. |
| Trial metric | Configured via globals.metrics; defaults to a primary bio-conservation metric. |
| Pruning | MedianPruner by default; configurable per-job in future releases. |
Each trial also logs to MLflow as a child of the study's parent run, so the same numerical comparison is available in either UI.
Local Sidecars¶
Even with both services running, each successful run writes two local sidecars to its artifact directory:
| File | Contents |
|---|---|
metrics.json |
Final scalars and (optionally) a history block. |
metrics.jsonl |
One JSON object per epoch (step, timestamp, metrics) emitted by EpochLogger. Survives crashes. |
The artifact tree is therefore self-contained: even if mlflow.db is lost, the per-run record is recoverable from disk.
Verifying the Wiring¶
make services-up
make status # docker compose ps with port bindings
curl -sf http://localhost:25000/health # MLflow health endpoint
curl -sf http://localhost:28080/ # Optuna Dashboard root
Inside a model container, the SDK's EpochLogger will log a warning and fall back to JSONL-only mode if MLFLOW_TRACKING_URI is unset or unreachable. A run is never failed solely because tracking is unavailable.
Troubleshooting¶
| Symptom | Likely cause | What to do |
|---|---|---|
Analysis tab is blank |
Service not running. | make services-up; check make status. |
| MLflow has no entry for a successful run | Container could not reach the tracking server. | Confirm MLFLOW_TRACKING_URI resolves from inside the container. |
| Duplicate runs in MLflow | A model container opened its own run instead of attaching to MLFLOW_RUN_ID. |
Use EpochLogger from multiverse.worker; do not call mlflow.start_run() directly in container code. |
| Optuna Dashboard empty | No study has been created yet. | Run a manifest with run_gridsearch: true. |
store/mlflow.db-wal growing large |
Many concurrent writes. | Expected during heavy benchmarking; SQLite checkpoints periodically. |