Glass Vault — Storage Blueprint (aligned with lemm-core-specs/docs/data-safety)

This document translates the LEMM Vault data-safety architecture into a pragmatic storage stack for Glass Vault.


1) Storage domains (what must exist)

The spec separates concerns into distinct stores:

  • VaultDB: authoritative metadata + policy state (rights, consent, revocations, lineage)
  • Encrypted ObjectStore: raw audio and large artifacts (never served as “browsable content”)
  • FingerprintIndex: similarity infrastructure (exact fingerprint lookup + ANN vector search)
  • Append-only Logs: auditable event trail for access/training/generation policy enforcement

2) Summary table

| Domain | What it stores | Primary recommendation | Alternatives | Notes |
|---|---|---|---|---|
| VaultDB (OLTP) | Track + rights + contributor + submissions + lineage + pack registry | Managed PostgreSQL | Postgres-compatible serverless; Document DB (if strongly preferred) | Relational model matches rights + lineage + pack membership. |
| ObjectStore | Encrypted audio blobs, derived features, exports | S3-compatible object storage (S3 / GCS / Azure Blob / R2 / MinIO) | Any encrypted object store with signed URL support | Keep audio out of the DB; provide short-lived signed URLs/tokens only. |
| Fingerprint exact index | fp_exact: hash -> track_id[] | Redis (or DynamoDB-like KV) | FoundationDB, RocksDB service, Cassandra | Extremely fast point lookups; supports “candidate set” for ANN narrowing. |
| Vector ANN index | embed_music: embedding -> nearest neighbors (HNSW) | Qdrant (HNSW) or Milvus | HNSWlib service; managed vector DB | Keep separate from VaultDB. Tune recall/latency for block/warn thresholds. |
| Audit & policy logs | Immutable event stream + queryable store | Kafka/NATS + object-store sink + ClickHouse/OpenSearch | BigQuery/Snowflake; Loki (ops-only) | Append-only first; query store for investigations & dashboards. |

3) Architecture (high-level)

```mermaid
flowchart TB
  subgraph API["Glass Vault API"]
    A1[Ingestion Service]
    A2[Vault Admin/Policy Service]
    A3[Output Safety Service]
    A4[Training Orchestrator]
  end

  subgraph DB["Core Stores"]
    P[(VaultDB: PostgreSQL)]
    O[(ObjectStore: Encrypted audio)]
    L[(Append-only Log)]
  end

  subgraph IDX["Similarity Indexes"]
    KV[(Exact FP KV: Redis/DynamoDB)]
    VEC[(Vector ANN: Qdrant/Milvus)]
  end

  subgraph Q["Query / Analytics"]
    CH[(ClickHouse / OpenSearch)]
  end

  A1 -->|metadata + rights| P
  A1 -->|encrypted blob| O
  A1 -->|fingerprints| KV
  A1 -->|embeddings| VEC
  A2 -->|policy updates| P
  A3 -->|reads candidates| KV
  A3 -->|ANN search| VEC
  A3 -->|fetch track metadata| P
  A4 -->|select packs + runs| P
  A4 -->|read training blobs| O

  A1 -->|events| L
  A2 -->|events| L
  A3 -->|events| L
  A4 -->|events| L
  L --> CH
```

4) Data flows (critical ones)

4.1 Ingestion flow

```mermaid
sequenceDiagram
  participant U as Uploader/Contributor
  participant S as Ingestion Service
  participant P as VaultDB (Postgres)
  participant O as ObjectStore (Encrypted)
  participant KV as Exact FP KV
  participant V as Vector ANN
  participant L as Append-only Log

  U->>S: Upload track + rights/consent
  S->>O: Store encrypted audio blob
  S->>P: Store Track + RightsRecord + SubmissionHistory
  S->>S: Compute fingerprints + embeddings
  S->>KV: Upsert fp_exact -> track_id[]
  S->>V: Upsert embed_music vector
  S->>L: Emit vault_access / ingest events
```
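
The ingestion steps above can be sketched in Python with in-memory stand-ins. This is a minimal illustration, not the spec's implementation: `InMemoryStores`, `toy_encrypt`, `toy_fingerprint`, `toy_embedding`, and `ingest` are all hypothetical names, the XOR "encryption" is a placeholder for real envelope encryption, and the SHA-256-derived fingerprint and embedding stand in for real audio analysis.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class InMemoryStores:
    """Hypothetical in-memory stand-ins for the stores in the diagram."""
    object_store: dict = field(default_factory=dict)  # blob key -> encrypted bytes
    vault_db: dict = field(default_factory=dict)      # track_id -> metadata + rights
    fp_exact: dict = field(default_factory=dict)      # fingerprint -> [track_id, ...]
    vectors: dict = field(default_factory=dict)       # track_id -> embedding
    log: list = field(default_factory=list)           # append-only event list


def toy_encrypt(data: bytes) -> bytes:
    # Placeholder only: XOR is NOT encryption. A real system would use
    # an authenticated cipher with keys managed outside the application.
    return bytes(b ^ 0x5A for b in data)


def toy_fingerprint(audio: bytes) -> str:
    # Assumption: a content hash stands in for an acoustic fingerprint.
    return hashlib.sha256(audio).hexdigest()


def toy_embedding(audio: bytes, dims: int = 4) -> list:
    # Assumption: hash bytes stand in for a learned audio embedding.
    digest = hashlib.sha256(audio).digest()
    return [digest[i] / 255.0 for i in range(dims)]


def ingest(stores: InMemoryStores, track_id: str, audio: bytes, rights: dict) -> None:
    """Follow the diagram's order: blob, metadata, indexes, then the log."""
    blob_key = f"audio/{track_id}"
    stores.object_store[blob_key] = toy_encrypt(audio)
    stores.vault_db[track_id] = {"blob_key": blob_key, "rights": rights}

    fingerprint = toy_fingerprint(audio)
    track_ids = stores.fp_exact.setdefault(fingerprint, [])
    if track_id not in track_ids:  # upsert: re-ingesting must not duplicate IDs
        track_ids.append(track_id)
    stores.vectors[track_id] = toy_embedding(audio)

    stores.log.append({"event": "ingest", "track_id": track_id})
```

Calling `ingest` twice with identical audio under two track IDs leaves one `fp_exact` entry that lists both IDs, which is the "candidate set" behaviour the summary table describes.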

4.2 Generation-time safety check (non-reproduction)

```mermaid
sequenceDiagram
  participant G as Generator (model output)
  participant OSS as Output Safety Service
  participant KV as Exact FP KV
  participant V as Vector ANN
  participant P as VaultDB (Postgres)
  participant L as Append-only Log

  G->>OSS: Candidate output (audio/embedding/fingerprint)
  OSS->>KV: fp_exact lookup -> candidate track IDs
  OSS->>V: ANN search (HNSW) -> nearest neighbors
  OSS->>P: Fetch rights/policy for candidates
  OSS->>OSS: Score + apply thresholds (allow/warn/block)
  OSS->>L: Log output_similarity_check (+ decision)
  OSS-->>G: Decision + optional warnings
```
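
A rough Python sketch of the scoring step in this flow follows. It makes several assumptions that are not in the spec: the thresholds `WARN_THRESHOLD` and `BLOCK_THRESHOLD` are invented values, the `protected` policy flag is a hypothetical field, and a brute-force cosine scan replaces the HNSW ANN search so the example stays self-contained. The returned dictionary is what the service would log as `output_similarity_check`.

```python
import math

# Hypothetical thresholds; real values need tuning against recall/latency goals.
WARN_THRESHOLD = 0.85
BLOCK_THRESHOLD = 0.95


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def check_output(fingerprint, embedding, fp_exact, vectors, policies, top_k=5):
    """Return an allow/warn/block decision for one candidate output."""
    # Step 1: exact fingerprint lookup -> candidate track IDs.
    exact_ids = list(fp_exact.get(fingerprint, []))

    # Step 2: nearest neighbours. Brute force here; an ANN index (e.g. HNSW)
    # would return an approximate version of the same top-k list.
    neighbours = sorted(
        ((cosine(embedding, vec), track_id) for track_id, vec in vectors.items()),
        reverse=True,
    )[:top_k]

    # Step 3: consult policy. "protected" is an assumed flag, not a spec field.
    def protected(track_id):
        return policies.get(track_id, {}).get("protected", False)

    exact_hits = [t for t in exact_ids if protected(t)]
    best = max((score for score, t in neighbours if protected(t)), default=0.0)

    # Step 4: apply thresholds.
    if exact_hits or best >= BLOCK_THRESHOLD:
        decision = "block"
    elif best >= WARN_THRESHOLD:
        decision = "warn"
    else:
        decision = "allow"
    return {"decision": decision, "exact_hits": exact_hits, "best_similarity": best}
```

An exact hit on a protected track blocks regardless of the vector score, which mirrors the "strict matching" role of the exact index.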

5) Why this stack matches the spec

VaultDB: Postgres

  • Best fit for rights, revocations, lineage, pack membership, and audit linkage.
  • Keeps “what’s allowed” strongly consistent (transactions, constraints).
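
To illustrate what "strongly consistent" buys here, the sketch below uses Python's built-in `sqlite3` as a stand-in for Postgres so it runs without a server. The table and column names (`track`, `rights_record`, `pack_member`, `training_allowed`, `revoked_at`) are hypothetical, not the spec's schema. The point is that a revocation and the matching pack removal commit together or not at all, and that constraints reject rows referring to unknown tracks.

```python
import sqlite3


def open_vault_db():
    """Create an in-memory database with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this enabled per connection
    conn.executescript(
        """
        CREATE TABLE track (
            track_id TEXT PRIMARY KEY,
            title    TEXT NOT NULL
        );
        CREATE TABLE rights_record (
            track_id         TEXT PRIMARY KEY REFERENCES track(track_id),
            training_allowed INTEGER NOT NULL CHECK (training_allowed IN (0, 1)),
            revoked_at       TEXT
        );
        CREATE TABLE pack_member (
            pack_id  TEXT NOT NULL,
            track_id TEXT NOT NULL REFERENCES track(track_id),
            PRIMARY KEY (pack_id, track_id)
        );
        """
    )
    return conn


def revoke(conn, track_id, when):
    """Revoke training rights and drop pack membership in one transaction."""
    with conn:  # commits on success, rolls back if either statement raises
        conn.execute(
            "UPDATE rights_record SET training_allowed = 0, revoked_at = ? "
            "WHERE track_id = ?",
            (when, track_id),
        )
        conn.execute("DELETE FROM pack_member WHERE track_id = ?", (track_id,))
```

With this shape, a training orchestrator that selects packs never sees a track that is revoked but still listed as a pack member.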

ObjectStore: encrypted, token-gated access

  • Raw audio must not be treated as a queryable dataset.
  • Access is mediated through signed tokens/URLs with short TTL and tight scoping.
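
The idea of a short-lived, tightly scoped token can be sketched with the standard library alone. This is an illustrative HMAC scheme, not the signed-URL format of any particular object store; `SECRET`, `issue_token`, and `verify_token` are hypothetical names, and a real deployment would typically rely on the storage provider's own pre-signed URL mechanism.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-only-secret"  # hypothetical; a real key would come from a KMS


def issue_token(object_key, ttl_seconds=60, now=None):
    """Sign a token that is valid for one object key until an expiry time."""
    now = time.time() if now is None else now
    claims = {"key": object_key, "exp": int(now) + ttl_seconds}
    payload = json.dumps(claims, sort_keys=True).encode()
    signature = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(signature).decode()
    )


def verify_token(token, object_key, now=None):
    """Accept only an untampered, unexpired token scoped to object_key."""
    now = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return False
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):  # constant-time comparison
        return False
    claims = json.loads(payload)
    return claims["key"] == object_key and now < claims["exp"]
```

Scoping the token to a single object key means a leaked token exposes one blob for at most `ttl_seconds`, rather than the whole bucket.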

FingerprintIndex: two complementary indexes

  • Exact FP (hash -> IDs) narrows search quickly and supports strict matching.
  • Vector ANN catches perceptual similarity and “near-copy” detection.

Logs: append-only, queryable

  • You need an immutable trail for policy changes, vault access, training batches, and output similarity checks.
  • Separate the “write once” log from the “query many” analytics store.
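
One common way to make an append-only trail tamper-evident is to chain each entry to the hash of the previous one. The spec does not prescribe this; hash chaining is an assumption introduced here for illustration, and `AppendOnlyLog` is a hypothetical class rather than an existing component.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body):
    """Stable SHA-256 over an entry body (keys sorted for determinism)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AppendOnlyLog:
    """In-memory, hash-chained event log (illustrative only)."""

    def __init__(self):
        self._entries = []

    def append(self, event_type, payload):
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        body = {
            "seq": len(self._entries),
            "type": event_type,
            "payload": payload,
            "prev": prev,
        }
        entry = dict(body, hash=_digest(body))
        self._entries.append(entry)
        return entry

    def verify(self):
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS
        for index, entry in enumerate(self._entries):
            body = {k: entry[k] for k in ("seq", "type", "payload", "prev")}
            if entry["seq"] != index or entry["prev"] != prev:
                return False
            if _digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

The query store (ClickHouse/OpenSearch in this blueprint) would be loaded from such a log, so investigations never need write access to the trail itself.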

6) Operational recommendations

Availability

  • For production, run Postgres and the vector store as always-on services.
  • If using serverless Postgres, avoid scale-to-zero (auto-suspend) configurations unless Glass Vault is non-critical; a cold start delays every request that waits on it.

Connection model

  • Use connection pooling (app pool or PgBouncer-style) and keep DB connections bounded.
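
The "bounded connections" idea can be shown with a small application-side pool. This is a sketch under stated assumptions: `BoundedPool` is a hypothetical class, and `sqlite3` stands in for a Postgres driver so the example runs anywhere. In practice a driver-provided pool or PgBouncer would do this job.

```python
import queue
import sqlite3
from contextlib import contextmanager


class BoundedPool:
    """Fixed-size pool: callers wait (or time out) instead of opening more connections."""

    def __init__(self, size, connect):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=1.0):
        # Raises queue.Empty if no connection frees up within the timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back


# Usage: at most two connections ever exist, however many callers there are.
pool = BoundedPool(2, lambda: sqlite3.connect(":memory:"))
with pool.connection() as conn:
    conn.execute("SELECT 1")
```

The cap is the important part: a burst of requests queues inside the application instead of exhausting the database's connection limit.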

Backups / DR

  • VaultDB: point-in-time recovery (PITR) with regularly tested restores.
  • ObjectStore: versioning + lifecycle rules + cross-region replication (if required).
  • Indexes: treat as reconstructable from VaultDB + ObjectStore, but still snapshot them if a full rebuild would take too long.

Monitoring (minimum)

  • Postgres: disk/WAL usage, replication lag, autovacuum health, connection count
  • ObjectStore: 4xx/5xx rates, encryption failures, token issuance
  • Vector/KV: latency p95/p99, QPS, memory/disk
  • Logs: ingestion lag, dropped events, query store freshness
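
For the latency percentiles listed for Vector/KV, a nearest-rank calculation over a window of samples is enough to get started. The helper names and the 50 ms budget below are hypothetical; a metrics system would normally compute these from histograms instead.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_report(samples_ms, p99_budget_ms=50):
    """Summarise one window of latencies; the budget is an assumed value."""
    p95 = percentile(samples_ms, 95)
    p99 = percentile(samples_ms, 99)
    return {"p95_ms": p95, "p99_ms": p99, "over_budget": p99 > p99_budget_ms}
```

Tracking p99 alongside p95 matters for the safety path, since a slow tail on the ANN search delays the allow/warn/block decision for those requests.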

7) Suggested implementation milestones

v0 (lean but correct separation)

  • Postgres (VaultDB)
  • Encrypted object storage
  • One ANN service (Qdrant) + basic fp_exact store (Redis)
  • Append-only logs written to object storage; add a query store (ClickHouse/OpenSearch) later

v1 (scale + governance)

  • Pack registry + training run provenance in Postgres
  • Event stream + structured audit dashboards
  • External catalog integration for similarity checks (if applicable)

8) Open decisions (to resolve for Glass Vault)

  • Expected QPS and dataset size (impacts vector store sizing)
  • Required RPO/RTO (drives DB replication/DR)
  • Multi-tenancy model (row-level security vs per-tenant DB/schema)
  • Encryption model: envelope encryption (KMS) vs app-managed keys