Glass Vault: Storage Blueprint (aligned with lemm-core-specs/docs/data-safety)
This document translates the LEMM Vault data-safety architecture into a pragmatic storage stack for Glass Vault.
1) Storage domains (what must exist)
The spec separates concerns into distinct stores:
- VaultDB: authoritative metadata + policy state (rights, consent, revocations, lineage)
- Encrypted ObjectStore: raw audio and large artifacts (never served as “browsable content”)
- FingerprintIndex: similarity infrastructure (exact fingerprint lookup + ANN vector search)
- Append-only Logs: auditable event trail for access/training/generation policy enforcement
2) Recommended default stack
Summary table
| Domain | What it stores | Primary recommendation | Alternatives | Notes |
|---|---|---|---|---|
| VaultDB (OLTP) | Track + rights + contributor + submissions + lineage + pack registry | Managed PostgreSQL | Postgres-compatible serverless; Document DB (if strongly preferred) | Relational model matches rights + lineage + pack membership. |
| ObjectStore | Encrypted audio blobs, derived features, exports | S3-compatible object storage (S3 / GCS / Azure Blob / R2 / MinIO) | Any encrypted object store with signed URL support | Keep audio out of the DB; provide short-lived signed URLs/tokens only. |
| Fingerprint exact index | fp_exact: hash -> track_id[] | Redis (or DynamoDB-like KV) | FoundationDB, RocksDB service, Cassandra | Fast point lookups; returns a candidate set that narrows the ANN search. |
| Vector ANN index | embed_music: embedding -> nearest neighbors (HNSW) | Qdrant (HNSW) or Milvus | HNSWlib service; managed vector DB | Keep separate from VaultDB. Tune recall/latency for block/warn thresholds. |
| Audit & policy logs | Immutable event stream + queryable store | Kafka/NATS + object-store sink + ClickHouse/OpenSearch | BigQuery/Snowflake; Loki (ops-only) | Append-only first; query store for investigations & dashboards. |
3) Architecture (high-level)
```mermaid
flowchart TB
  subgraph API["Glass Vault API"]
    A1[Ingestion Service]
    A2[Vault Admin/Policy Service]
    A3[Output Safety Service]
    A4[Training Orchestrator]
  end
  subgraph DB["Core Stores"]
    P[(VaultDB: PostgreSQL)]
    O[(ObjectStore: Encrypted audio)]
    L[(Append-only Log)]
  end
  subgraph IDX["Similarity Indexes"]
    KV[(Exact FP KV: Redis/DynamoDB)]
    VEC[(Vector ANN: Qdrant/Milvus)]
  end
  subgraph Q["Query / Analytics"]
    CH[(ClickHouse / OpenSearch)]
  end
  A1 -->|metadata + rights| P
  A1 -->|encrypted blob| O
  A1 -->|fingerprints| KV
  A1 -->|embeddings| VEC
  A2 -->|policy updates| P
  A3 -->|reads candidates| KV
  A3 -->|ANN search| VEC
  A3 -->|fetch track metadata| P
  A4 -->|select packs + runs| P
  A4 -->|read training blobs| O
  A1 -->|events| L
  A2 -->|events| L
  A3 -->|events| L
  A4 -->|events| L
  L --> CH
```
4) Data flows (critical ones)
4.1 Ingestion flow
```mermaid
sequenceDiagram
  participant U as Uploader/Contributor
  participant S as Ingestion Service
  participant P as VaultDB (Postgres)
  participant O as ObjectStore (Encrypted)
  participant KV as Exact FP KV
  participant V as Vector ANN
  participant L as Append-only Log
  U->>S: Upload track + rights/consent
  S->>O: Store encrypted audio blob
  S->>P: Store Track + RightsRecord + SubmissionHistory
  S->>S: Compute fingerprints + embeddings
  S->>KV: Upsert fp_exact -> track_id[]
  S->>V: Upsert embed_music vector
  S->>L: Emit vault_access / ingest events
```
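The write order in the ingestion flow can be sketched in Python with in-memory stand-ins for the five stores. Everything named here (`InMemoryVault`, `ingest`, the row shape, the `audio/<track_id>` key layout) is hypothetical, and SHA-256 is only a placeholder for a real audio fingerprint; `encrypt` and `embed` are injected so the sketch does not assume any particular cipher or model.

```python
import hashlib
from collections import defaultdict


class InMemoryVault:
    """In-memory stand-ins; each attribute maps to one storage domain."""

    def __init__(self):
        self.object_store = {}             # blob key -> ciphertext
        self.vault_db = {}                 # track_id -> metadata row
        self.fp_exact = defaultdict(set)   # fingerprint hash -> {track_id}
        self.vectors = {}                  # track_id -> embedding
        self.log = []                      # append-only event list


def ingest(vault, track_id, audio, rights, encrypt, embed):
    # 1. The encrypted blob goes to the object store, never to the DB.
    blob_key = f"audio/{track_id}"
    vault.object_store[blob_key] = encrypt(audio)

    # 2. Metadata and rights go to VaultDB and reference the blob by key.
    vault.vault_db[track_id] = {"blob_key": blob_key, "rights": rights}

    # 3. Fingerprint and embedding are computed, then upserted into the indexes.
    #    SHA-256 is a placeholder for a real audio fingerprint function.
    fp = hashlib.sha256(audio).hexdigest()
    vault.fp_exact[fp].add(track_id)
    vault.vectors[track_id] = embed(audio)

    # 4. An ingest event is appended to the log last.
    vault.log.append({"type": "ingest", "track_id": track_id, "fingerprint": fp})
    return fp
```

Because `fp_exact` maps one hash to a set of track IDs, ingesting the same audio twice under different IDs extends the set instead of overwriting it.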
4.2 Generation-time safety check (non-reproduction)
```mermaid
sequenceDiagram
  participant G as Generator (model output)
  participant OSS as Output Safety Service
  participant KV as Exact FP KV
  participant V as Vector ANN
  participant P as VaultDB (Postgres)
  participant L as Append-only Log
  G->>OSS: Candidate output (audio/embedding/fingerprint)
  OSS->>KV: fp_exact lookup -> candidate track IDs
  OSS->>V: ANN search (HNSW) -> nearest neighbors
  OSS->>P: Fetch rights/policy for candidates
  OSS->>OSS: Score + apply thresholds (allow/warn/block)
  OSS->>L: Log output_similarity_check (+ decision)
  OSS-->>G: Decision + optional warnings
```
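The score-and-threshold step can be sketched in Python under some loud assumptions: the threshold values, the `Decision` type, and the rule that any exact fingerprint hit blocks are illustrative and not taken from the spec. A brute-force cosine scan stands in for the HNSW search, and the rights lookup and logging steps are omitted.

```python
import math
from dataclasses import dataclass

# Hypothetical thresholds; the spec does not state numeric values.
WARN_THRESHOLD = 0.85
BLOCK_THRESHOLD = 0.95


@dataclass
class Decision:
    action: str      # "allow", "warn", or "block"
    track_ids: list  # vault tracks that triggered the decision
    score: float     # highest similarity observed


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def check_output(fingerprint, embedding, fp_exact, vectors):
    # Step 1: an exact fingerprint hit is treated as a strict match (assumption).
    exact_hits = fp_exact.get(fingerprint, [])
    if exact_hits:
        return Decision("block", sorted(exact_hits), 1.0)

    # Step 2: nearest-neighbour search. Brute force here; an ANN index
    # (HNSW) would replace this loop in a real deployment.
    best_id, best_score = None, 0.0
    for track_id, vec in vectors.items():
        score = cosine(embedding, vec)
        if score > best_score:
            best_id, best_score = track_id, score

    # Step 3: map the best score onto allow / warn / block.
    if best_score >= BLOCK_THRESHOLD:
        return Decision("block", [best_id], best_score)
    if best_score >= WARN_THRESHOLD:
        return Decision("warn", [best_id], best_score)
    return Decision("allow", [], best_score)
```

Checking the exact index first keeps the common "verbatim copy" case cheap; the vector search only runs when no hash matches.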
5) Why this stack matches the spec
VaultDB: Postgres
- A relational schema fits rights, revocations, lineage, pack membership, and audit linkage, which are all relationships between records.
- Transactions and constraints keep policy state (“what’s allowed”) strongly consistent.
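As a rough illustration of what transactions and constraints buy here, the sketch below uses Python's built-in `sqlite3` as a stand-in for Postgres. The table and column names (`track`, `rights_record`, `status`) and both helper functions are hypothetical; only the idea (a track and its rights record commit together or not at all, and invalid rights states are rejected by the database) is the point.

```python
import sqlite3


def open_vault_db():
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE track (id TEXT PRIMARY KEY);
        CREATE TABLE rights_record (
            track_id TEXT NOT NULL REFERENCES track(id),
            status   TEXT NOT NULL CHECK (status IN ('granted', 'revoked'))
        );
        """
    )
    return conn


def register_track(conn, track_id, status):
    # "with conn" commits on success and rolls back on any exception,
    # so a track never exists without a valid rights record.
    with conn:
        conn.execute("INSERT INTO track (id) VALUES (?)", (track_id,))
        conn.execute(
            "INSERT INTO rights_record (track_id, status) VALUES (?, ?)",
            (track_id, status),
        )
```

Calling `register_track(conn, "t2", "maybe")` raises `sqlite3.IntegrityError` from the `CHECK` constraint, and the already-inserted `track` row is rolled back with it.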
ObjectStore: encrypted, token-gated access
- Raw audio must not be treated as a queryable dataset.
- Access is mediated through signed tokens/URLs with short TTL and tight scoping.
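To make "short TTL and tight scoping" concrete, here is a minimal HMAC-signed token sketch using only the Python standard library. It is not the presigned-URL scheme of any particular object store; the function names, the claim names (`key`, `act`, `exp`), and the token layout are all assumptions made for illustration.

```python
import base64
import hashlib
import hmac
import json
import time


def issue_token(secret, object_key, action, ttl_seconds, now=None):
    """Sign a claim set scoped to one object and one action."""
    now = time.time() if now is None else now
    claims = {"key": object_key, "act": action, "exp": int(now) + ttl_seconds}
    payload = json.dumps(claims, sort_keys=True).encode()
    sig = hmac.new(secret, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(sig).decode()
    )


def verify_token(secret, token, object_key, action, now=None):
    """Return True only if the signature, scope, and expiry all check out."""
    now = time.time() if now is None else now
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # malformed token or bad base64
        return False
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return False
    claims = json.loads(payload)
    return (
        claims["key"] == object_key
        and claims["act"] == action
        and now < claims["exp"]
    )
```

A token issued for `("audio/t1", "read")` fails verification for any other object or action, and stops working once `exp` has passed.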
FingerprintIndex: two complementary indexes
- Exact FP (hash -> IDs) narrows search quickly and supports strict matching.
- Vector ANN catches perceptual similarity and “near-copy” detection.
Logs: append-only, queryable
- You need an immutable trail for policy changes, vault access, training batches, and output similarity checks.
- Separate the “write once” log from the “query many” analytics store.
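The write-once / query-many split can be sketched as two small Python classes. This is an in-memory illustration only: `AppendOnlyLog`, `QueryStore`, and the event type `policy_change` are hypothetical names (only `output_similarity_check` and `vault_access` appear in the flows above), and a real deployment would use a durable stream and an analytics store instead.

```python
from collections import defaultdict


class AppendOnlyLog:
    """Write path: events can be appended and read, never updated or deleted."""

    def __init__(self):
        self._events = []

    def append(self, event_type, payload):
        offset = len(self._events)
        # Copy the payload so later mutation by the caller cannot alter history.
        self._events.append((offset, event_type, dict(payload)))
        return offset

    def read_from(self, offset):
        return list(self._events[offset:])


class QueryStore:
    """Read path: a derived index that can always be rebuilt by replaying the log."""

    def __init__(self):
        self.by_type = defaultdict(list)
        self.next_offset = 0

    def sync(self, log):
        # Consume only events not seen yet, so repeated syncs are idempotent.
        for offset, event_type, payload in log.read_from(self.next_offset):
            self.by_type[event_type].append(payload)
            self.next_offset = offset + 1
```

Because the query store tracks its own offset, it can lag behind or be dropped and rebuilt without touching the log, which is the property the separation is meant to preserve.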
6) Operational recommendations
Availability
- For production, run Postgres and the vector store as always-on services.
- If using serverless Postgres, avoid configurations that suspend to zero unless Glass Vault is non-critical.
Connection model
- Use connection pooling (app pool or PgBouncer-style) and keep DB connections bounded.
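What "bounded" means in practice can be shown with a small pool built on the Python standard library. `BoundedPool` and its `connect` factory argument are hypothetical; a real service would more likely rely on its driver's pool or a PgBouncer-style proxy, but the behaviour is the same: callers wait for a free connection and never open extra ones.

```python
import queue
from contextlib import contextmanager


class BoundedPool:
    """Fixed-size pool: at most max_size connections ever exist."""

    def __init__(self, connect, max_size):
        self._idle = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=5.0):
        # Blocks until a connection is free; raises queue.Empty on timeout
        # instead of opening a new connection.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller raised.
            self._idle.put(conn)
```

With `max_size=2`, a third concurrent caller waits up to `timeout` seconds and then gets `queue.Empty`, which surfaces overload as backpressure on the application instead of as connection exhaustion on the database.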
Backups / DR
- VaultDB: PITR (point-in-time recovery), tested restores.
- ObjectStore: versioning + lifecycle rules + cross-region replication (if required).
- Indexes: treat as reconstructable from VaultDB + ObjectStore, but still snapshot if rebuild time is large.
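"Reconstructable" can be read as: the exact-fingerprint index is a pure function of VaultDB rows plus the blobs they point to. A minimal Python sketch of such a rebuild follows; the row shape (`track_id`, `blob_key`), the `decrypt` callable, and the use of SHA-256 in place of a real audio fingerprint are all assumptions.

```python
import hashlib
from collections import defaultdict


def rebuild_fp_exact(vault_db_rows, object_store, decrypt):
    """Recompute fp_exact (hash -> track_id[]) from the authoritative stores."""
    index = defaultdict(list)
    # Sort by track ID so the rebuilt index is deterministic.
    for row in sorted(vault_db_rows, key=lambda r: r["track_id"]):
        audio = decrypt(object_store[row["blob_key"]])
        # Placeholder fingerprint; a real rebuild would call the same
        # fingerprint function that ingestion uses.
        fp = hashlib.sha256(audio).hexdigest()
        index[fp].append(row["track_id"])
    return dict(index)
```

Timing a rebuild like this against a realistic dataset is one way to decide whether index snapshots are worth keeping.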
Monitoring (minimum)
- Postgres: disk/WAL usage, replication lag, autovacuum health, connection count
- ObjectStore: 4xx/5xx rates, encryption failures, token issuance
- Vector/KV: latency p95/p99, QPS, memory/disk
- Logs: ingestion lag, dropped events, query store freshness
7) Suggested implementation milestones
v0 (lean but correct separation)
- Postgres (VaultDB)
- Encrypted object storage
- One ANN service (Qdrant) + basic fp_exact store (Redis)
- Append-only logs written to object storage; a query store (ClickHouse/OpenSearch) can be added later
v1 (scale + governance)
- Pack registry + training run provenance in Postgres
- Event stream + structured audit dashboards
- External catalog integration for similarity checks (if applicable)
8) Open decisions (to resolve for Glass Vault)
- Expected QPS and dataset size (impacts vector store sizing)
- Required RPO/RTO (drives DB replication/DR)
- Multi-tenancy model (row-level security vs per-tenant DB/schema)
- Encryption model: envelope encryption (KMS) vs app-managed keys