Operator guide
Day-two operations for a standard Podium deployment.
For small-team, single-VM use, see Small team instead.
Capacity planning
Baseline (10K artifacts, 100 QPS, 1 GB Postgres, 500 GB object storage on a 3-replica deployment with db.m5.large equivalent) is the starting point. Beyond that:
| Dimension | Threshold | What to do |
|---|---|---|
| Artifacts | 100K | Increase Postgres instance size; review pgvector index parameters; consider sharding embeddings. |
| QPS | 1K | Scale registry replicas horizontally; put a CDN in front of object storage for resource bytes. |
| QPS | 10K | Review search query patterns; consider dedicated Elasticsearch for BM25 with pgvector (or Pinecone/Weaviate/Qdrant) for vector. |
| Tenants | 50 | Confirm RegistryStore connection pool is sized appropriately; increase pgbouncer pool if used. |
| Audit volume | 1M events/day | Set retention explicitly; ship the registry audit stream to an external SIEM by setting PODIUM_AUDIT_LOG_PATH to an http(s):// endpoint, which selects the registry endpoint sink instead of the on-disk hash-chained file. |
Embeddings dominate Postgres growth at scale. Each artifact’s text projection becomes a float vector whose dimension depends on the configured provider (768 for nomic-embed-text via Ollama, 1024 for voyage-3 or embed-v4, 1536 for text-embedding-3-small). At ~6 KB per row including metadata at 1536 dim, 100K artifacts is ~600 MB of embeddings.
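The sizing arithmetic above can be reproduced with a short sketch. It assumes 4 bytes per dimension (single-precision floats, as pgvector's vector type stores them); the function name and the `overhead_per_row` parameter are hypothetical and exist only for illustration.

```python
BYTES_PER_FLOAT32 = 4  # assumption: single-precision floats, 4 bytes per dimension


def embedding_bytes(artifacts: int, dim: int, overhead_per_row: int = 0) -> int:
    """Rough size in bytes of `artifacts` embedding rows of dimension `dim`.

    `overhead_per_row` is a hypothetical knob for per-row metadata.
    """
    return artifacts * (dim * BYTES_PER_FLOAT32 + overhead_per_row)


# Dimensions quoted in the text for each provider.
for name, dim in [
    ("nomic-embed-text", 768),
    ("voyage-3", 1024),
    ("text-embedding-3-small", 1536),
]:
    mb = embedding_bytes(100_000, dim) / 1e6
    print(f"{name}: {dim} dim, ~{mb:.0f} MB per 100K artifacts")
```

At 1536 dimensions the vector alone is 6,144 bytes per row, which is where the ~6 KB per row and ~600 MB per 100K artifacts figures come from.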
Object-storage growth is dominated by bundled resources. Most teams’ p99 artifact size sits well under the 256 KB inline cutoff, so the inline manifest body fits in Postgres and only larger resources go to S3.
Monitoring
Both the registry and the MCP server expose Prometheus metrics. The reference Grafana dashboard ships with the registry. Key signals:
Registry:
- podium_request_duration_seconds{endpoint}: per-endpoint latency histogram. Watch load_domain, search_domains, search_artifacts, load_artifact, and load_artifacts against the SLOs (p99 < 200ms / 200ms / 200ms / 500ms manifest / 2s with resources).
- podium_request_total{endpoint} and podium_request_errors_total{endpoint}: request volume and error count per endpoint. The error counter increments for any response status at or above 400, so the per-endpoint error rate is rate(podium_request_errors_total) / rate(podium_request_total).
- podium_visibility_denied_total: reads rejected by visibility filtering. This signal is informational; a sudden spike usually means a layer config error rather than an authorization issue.
- podium_cache_hits_total and podium_cache_misses_total: server-side cache hits and misses. A low hit ratio often indicates a CDN or import-glob misconfiguration.
- podium_ingest_success_total and podium_ingest_failure_total: ingest attempts that succeeded or failed. Flag a recent uptick in the failure counter.
- podium_vector_outbox_depth: pending rows in the external-vector-backend outbox. A rising depth indicates the drain worker is falling behind. The gauge reads 0 on a collocated backend that uses no outbox.
MCP server:
- podium_mcp_requests_total{tool} and podium_mcp_request_errors_total{tool}: per-tool call volume and error count at the bridge.
- podium_mcp_request_duration_seconds{tool}: per-tool call latency at the bridge.
Alerting
A reasonable starting set, tuned for the baseline deployment:
# Critical: page on-call
- alert: PodiumDown
  expr: up{job="podium-registry"} == 0
  for: 2m
- alert: PodiumLoadArtifactSLOBreached
  expr: histogram_quantile(0.99, rate(podium_request_duration_seconds_bucket{endpoint="load_artifact"}[5m])) > 0.5
  for: 5m
- alert: PodiumHighErrorRate
  expr: sum(rate(podium_request_errors_total[5m])) / sum(rate(podium_request_total[5m])) > 0.05
  for: 5m
# Warning: investigate within hours
- alert: PodiumIngestFailing
  expr: increase(podium_ingest_failure_total[1h]) > 5
  for: 15m
- alert: PodiumVectorOutboxBacklog
  expr: podium_vector_outbox_depth > 1000
  for: 10m
# Informational: review weekly
- alert: PodiumLowCacheHitRatio
  expr: sum(rate(podium_cache_hits_total[1h])) / (sum(rate(podium_cache_hits_total[1h])) + sum(rate(podium_cache_misses_total[1h]))) < 0.5
The Helm chart ships these as a starter; tune thresholds to your SLOs.
Backup and restore
- Postgres. Managed services handle this. Enable point-in-time recovery (PITR) with at least 7 days of retention. For self-run Postgres, run logical backups (pg_dump) daily and physical backups (base backups + WAL archiving) for PITR.
- Object storage. Enable cross-region replication or daily snapshots. Resources are content-addressed and immutable, so restore is straightforward: replace the bucket contents from the snapshot.
- Default RPO 1h / RTO 4h for a managed-Postgres + replicated-S3 setup. Tighten by reducing PITR granularity or replicating at higher frequency; loosen by extending the PITR window.
Test restores quarterly. The runbook procedure:
1. Spin up a non-production registry pointed at a fresh Postgres + a fresh
S3 bucket.
2. Restore Postgres from PITR to T-1h.
3. Sync the production S3 bucket to the fresh one (rclone or aws s3 sync).
4. Run `podium admin verify --check audit-chain --check signatures` against
the restored deployment. Fix any reported gaps.
5. Spot-check `load_artifact` for a known-good artifact; should match the
pre-restore content_hash.
Upgrade procedure
Schema migrations are bundled in the registry binary and applied additively on startup: a new version creates tables and columns when absent and never drops or rewrites existing ones, so an upgrade migrates the database forward in place. Recommended cadence:
- Pre-upgrade. Read the changelog and the migration notes for the target version. If a migration is non-trivial (reshuffling embeddings, changing the audit schema), schedule a maintenance window.
- Canary. Roll one registry replica to the new version. Watch metrics for 30 min and confirm latency, error rate, and cache hit rate are unchanged.
- Roll. Roll the rest of the replicas. Because migrations are additive, an older replica ignores the new tables and columns, so old and new replicas coexist during the roll.
- Verify. After the roll completes, run podium admin verify --check schema --check audit-chain.
Roll back by reverting the binary. The additive schema stays forward-compatible with the previous version’s binary, so an older binary continues to run against the migrated database.
Read-only mode
When the Postgres primary becomes unreachable but a read replica is up, the registry falls back to read-only mode: read endpoints continue to serve from the replica; write endpoints (ingest webhooks, layer admin operations, freeze toggles, admin grants, login-driven token issuance) are rejected with the structured error registry.read_only.
A health-state machine governs the transition. The registry probes the primary every 5 s and flips to read-only after three consecutive failures (tunable via PODIUM_READONLY_PROBE_INTERVAL and PODIUM_READONLY_PROBE_FAILURES). It flips back automatically after three consecutive probe successes once the primary is reachable again.
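The transition rule can be sketched as a small state machine. This is an illustration of the behavior described above, not the registry's implementation; the class and field names are hypothetical, and the thresholds mirror the stated defaults of three consecutive failures and three consecutive successes.

```python
from dataclasses import dataclass


@dataclass
class ReadOnlyStateMachine:
    """Illustrative only: flips state after N consecutive contradicting probes."""

    failure_threshold: int = 3  # cf. PODIUM_READONLY_PROBE_FAILURES
    success_threshold: int = 3  # recovery threshold stated in the text
    read_only: bool = False
    _streak: int = 0  # consecutive probes that contradict the current state

    def observe(self, primary_ok: bool) -> bool:
        """Record one probe result and return the resulting read_only flag."""
        contradicts = primary_ok if self.read_only else not primary_ok
        if not contradicts:
            # Any probe that agrees with the current state resets the streak.
            self._streak = 0
            return self.read_only
        self._streak += 1
        limit = self.success_threshold if self.read_only else self.failure_threshold
        if self._streak >= limit:
            self.read_only = not self.read_only
            self._streak = 0
        return self.read_only


sm = ReadOnlyStateMachine()
probes = [False, False, True, False, False, False, True, True, True]
states = [sm.observe(ok) for ok in probes]
```

In this run the single success after two failures resets the streak, so the flip to read-only happens only on the third consecutive failure, and the flip back only on the third consecutive success.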
Read responses in read-only mode carry two additional headers:
- X-Podium-Read-Only: true
- X-Podium-Read-Only-Lag-Seconds: <n>: observed replication lag.
Audit events for state transitions (registry.read_only_entered, registry.read_only_exited) are logged like any other admin action and carry the same hash-chain integrity guarantees.
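For readers unfamiliar with hash-chained logs, the sketch below shows what such an integrity guarantee means in general. The record layout, the genesis value, and the use of SHA-256 over a canonical JSON encoding are all assumptions made for illustration; the actual audit format is not described here.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical anchor value preceding the first record


def chain_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous link together with a canonical encoding of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(log: list, event: dict) -> None:
    """Append an event, linking it to the hash of the previous record."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev_hash": prev, "hash": chain_hash(prev, event)})


def first_broken_link(log: list):
    """Return the index of the first record that fails verification, or None."""
    prev = GENESIS
    for i, rec in enumerate(log):
        if rec["prev_hash"] != prev or rec["hash"] != chain_hash(prev, rec["event"]):
            return i
        prev = rec["hash"]
    return None
```

Because each record commits to the one before it, editing or deleting any record makes verification fail at that position, which is the property a chain check relies on.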
Security review checklist
Walk through these before launching to a tenant that handles sensitive content.
| Item | Check |
|---|---|
| OAuth identity flow | Device-code flow tested for every IdP in production use. Token lifetimes set to ≤15 min. Revocation propagates within 60s. |
| OIDC group claim mapping | Group claims actually produced by your IdP arrive in the registry’s audit log. Test with a non-admin user. |
| Per-layer visibility | Each layer’s visibility: declaration is correct. Test by impersonating a non-member identity (via injected-session-token test harness). |
| Sensitivity enforcement | PODIUM_VERIFY_SIGNATURES is medium-and-above (or stricter). Test that a tampered artifact fails materialization with materialize.signature_invalid. |
| Audit hash chain | Run podium admin verify --check audit-chain weekly via cron. Detect gaps automatically. |
| Webhook signing | Git provider webhook HMAC secret is unique per layer. Test with an invalid signature; expect ingest.webhook_invalid. |
| Sandbox profile honoring | The hosts in production honor sandbox_profile for non-unrestricted artifacts. Test with a read-only-fs artifact and confirm the host enforces. |
| Object-storage credentials | IAM roles or short-lived credentials, never static keys. Bucket policy denies public access. |
| Backup encryption | Postgres backups + S3 object versioning encrypted at rest. PITR window matches your RTO. |
| Scope preview gating | tenant.expose_scope_preview is set deliberately per tenant; false for tenants where aggregate visibility counts would leak signal. |
Re-run the checklist after every major release and after any change to layer config, IdP, or sandbox enforcement settings.
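For the webhook-signing row, a test harness needs to produce both a valid and an invalid signature. The sketch below assumes a GitHub-style scheme (HMAC-SHA256 over the raw body, sent as `sha256=<hex>`); the helper names are hypothetical, and other Git providers sign differently.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute a GitHub-style signature header value for a raw request body."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_valid(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Constant-time comparison of the expected and presented signatures."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode("utf-8"), signature_header.encode("utf-8"))


body = b'{"ref": "refs/heads/main"}'
layer_secret = b"per-layer-secret"  # placeholder; use the layer's real secret in a test
good = sign(layer_secret, body)
bad = sign(b"some-other-layers-secret", body)
```

Sending `bad` should produce the ingest.webhook_invalid rejection named in the checklist. Signing with a different layer's secret also confirms that secrets are not shared across layers.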
Common operational pitfalls
These come up a few times a year for most operators:
- Embedding provider rate limits. OpenAI and Voyage rate-limit aggressively under bulk reingest. Stagger podium layer reingest across layers, or switch to ollama pointed at a local model server for inference during reingest storms.
- pgvector index bloat. Embedding churn bloats the vector index over time. REINDEX it quarterly, or tune autovacuum aggressively.
- MCP server cache pinning (PODIUM_CACHE_DIR on slow disks). Developer machines with cache on a network filesystem will see materialization latency well above the SLO. Default to ~/.podium/cache/ on local disk.
- Webhook retries during read-only mode. GitHub will retry webhooks for ~24 h with exponential backoff. If your read-only window exceeds that, ingests will be permanently lost. Trigger manual podium layer reingest after recovery.
- Force-push on a Git source layer. Default policy is tolerant (layer.history_rewritten event emitted, prior commits preserved in the content store). If force_push_policy: strict is configured, expect ingest rejections after force-pushes. Coordinate with authors.
- OIDC token clock skew. The registry tolerates ±60s of skew. NTP drift on a registry node beyond that window causes intermittent auth.token_expired errors. Monitor clock skew on registry hosts.
- SCIM lag. OIDC group membership changes propagate via SCIM push from the IdP. If your IdP doesn’t push, group membership only updates on the user’s next login. Force a refresh with podium admin scim-sync --user <id>.
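The clock-skew pitfall is easier to reason about with the tolerance written out. This sketch is an assumption about how a ±60s window is typically applied to a token's validity interval, not the registry's actual check; the function and parameter names are hypothetical.

```python
SKEW_SECONDS = 60  # tolerance stated in the text


def token_time_valid(
    now: float, not_before: float, expires_at: float, skew: float = SKEW_SECONDS
) -> bool:
    """True if `now` falls inside the validity window widened by `skew` on both sides."""
    return (not_before - skew) <= now <= (expires_at + skew)
```

Under this model, a node whose clock runs 90 s fast starts rejecting a token 30 s before its real expiry, which shows up as intermittent expiry errors on only the drifting node.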
Public-mode misconfiguration
A misconfigured public-mode deployment is the most common security-relevant operational anomaly, and it is easy to miss because nothing looks broken: the registry serves correctly. It simply serves to everyone.
Detection:
- /healthz returns mode: public.
- Audit events for read calls show caller.identity: "system:public" and the flag caller.public_mode: true.
- The startup banner shows the public-mode warning.
- podium status surfaces the flag.
Mitigation:
- Confirm public mode was the intended deployment posture. If it was, no action needed; the audit log already records the intent.
- If public mode was not intended (a misconfigured environment variable, copy-pasted CLI flag, or accidental container image tag), stop the registry, remove --public-mode or unset PODIUM_PUBLIC_MODE, and restart. The registry refuses mid-run flips, so a restart is mandatory.
- If public mode was running on an internet-exposed registry (which the safety check should have prevented unless --allow-public-bind was set), treat as a security incident: rotate any signing keys that were in scope, audit the access log for unfamiliar IPs, and proceed per the org’s incident-response procedure.
Prevention. Container-image and Helm-chart consumers should set PODIUM_NO_AUTOSTANDALONE=1 and use --strict to refuse anything but explicitly-configured deployments. Production CI templates should fail-fast on the presence of PODIUM_PUBLIC_MODE in environment lists.
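The fail-fast CI check can be sketched as a function over already-parsed environment lists. The input shape (a mapping from deployment name to its environment variables) and all names below are hypothetical; a real template would first extract these from its own manifests.

```python
FORBIDDEN = "PODIUM_PUBLIC_MODE"


def public_mode_violations(env_lists: dict) -> list:
    """Return the names of deployments whose environment defines PODIUM_PUBLIC_MODE.

    Presence alone counts, whatever the value, matching the guidance above.
    """
    return sorted(name for name, env in env_lists.items() if FORBIDDEN in env)


# Hypothetical parsed manifests.
manifests = {
    "registry-prod": {"PODIUM_NO_AUTOSTANDALONE": "1"},
    "registry-demo": {"PODIUM_PUBLIC_MODE": "true"},
}
violations = public_mode_violations(manifests)
# A CI job would exit non-zero here when `violations` is non-empty.
```

Checking for presence rather than a truthy value keeps the rule simple and also catches a variable that is set to "false" in one environment and flipped in another.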
When to escalate to support / open an issue
- Audit chain gap detected (podium admin verify --check audit-chain reports a hash mismatch). Treat as a security incident; capture evidence before any cleanup.
- Repeated materialize.signature_invalid for authored artifacts. Either the signing pipeline broke or someone is tampering. Investigate before continuing.
- Sustained latency degradation that doesn’t track CPU / memory / DB load. Often indicates a query-plan regression after a Postgres major upgrade.
- Out-of-band ingest events (artifacts appear in the registry without a corresponding artifact.published outbound webhook). Indicates webhook config or processing failure.
For all of these, capture: relevant log lines (with trace IDs), the affected tenant id, the affected artifact id(s), and a brief timeline. The more of those you have ready, the faster the fix.