Community Dataset & Safety
1. LEMM Vault mission & constraints document¶
1.1 Mission statement¶
Mission
The LEMM Vault is a community-owned, verifiably clean music dataset and safety system that:
- collects original tracks from contributors under explicit, documented rights,
- stores them in a controlled vault for training and evaluation of music models only,
- enforces non-reproduction of those tracks and major external catalogs at generation time,
- and provides end-users with commercially usable licensed outputs, while protecting contributors’ interests.
1.2 Definitions¶
“Clean”
Operationally:
- Every track has:
  - a known contributor identity (or a pseudonym tied to a verified account),
  - an explicit rights grant for:
    - training + evaluation of models,
    - optional internal demo / research use,
  - no “non-commercial only” or unclear licensing.
- No bulk scraping; all content is consciously contributed.
- No redistribution of raw audio outside the vault, except:
  - short internal debug snippets (with strict policy),
  - synthetic/diagnostic snippets that cannot substitute for the track.
“Community-owned”
- Governance: decisions about policies, thresholds, and admission rules are controlled by a small v0 governance group, with an explicit path to community participation (contributors and reviewers).
- Contributors:
  - keep copyright and moral rights in their works,
  - grant LEMM-AI a non-exclusive license for training/eval,
  - can revoke future use of their tracks (subject to v0 limits on retroactive model retraining).
“Training-only use”
- Raw tracks are used only to:
  - train and evaluate models,
  - build derived indexes (fingerprints, embeddings, features).
- No:
  - public download endpoints for source audio,
  - bundling LEMM Vault tracks as stock libraries, sample packs, etc.
1.3 Rights model → system behavior¶
Per-track rights record drives:
- Ingestion:
  - Only tracks with `consent.training_allowed == true` and a compatible license type enter the “trainable” subset.
- Training selection:
  - Training jobs query the vault for batches filtered by:
    - `training_allowed == true`, `revoked == false`,
    - required scope (e.g. genre, language).
- Evaluation & internal demos:
  - Separate flags, e.g. `internal_demo_ok`.
- Personalization / adapters:
  - Additional explicit flags, e.g. `allow_per_artist_adapters`.
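A minimal sketch of how a training-batch selection could apply these flags, assuming each candidate row combines the consent flags with basic track metadata in one dict; the field and function names are illustrative:

```python
from typing import Iterable, Optional

def trainable_track_ids(candidates: Iterable[dict],
                        required_genres: Optional[set] = None) -> list:
    """Return track IDs eligible for a training batch (v0 rules)."""
    eligible = []
    for rec in candidates:
        if not rec.get("training_allowed", False):
            continue                     # explicit opt-in required
        if rec.get("revoked", False):
            continue                     # revoked tracks never enter new jobs
        if required_genres and not required_genres & set(rec.get("genre_tags", [])):
            continue                     # optional scope filter (genre, language, ...)
        eligible.append(rec["track_id"])
    return eligible
```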
Revocation behavior
- When a contributor revokes:
  - The track moves to `revoked = true` and is immediately excluded from future training.
- For v0:
  - Models trained before revocation may continue to be served, but:
    - LEMM-AI records which models trained on that track,
    - policy describes when retraining / fine-tune removal happens (e.g. for high-risk models).
2. Contribution & consent pipeline spec¶
2.1 End-to-end flow¶
1. Contributor onboarding
   - User creates an account (email / OAuth).
   - Must affirm:
     - “I am the rights holder or have authority to license this audio for training & evaluation.”
2. Track submission (UI)
   - Upload audio file(s) (v0: stereo mix only, no stems).
   - Fill metadata: title, artist name, collaborators, release status, genre, language, etc.
   - Declare rights:
     - “This is an original work, not a cover or remix of copyrighted material I don’t control.”
   - Set consent flags:
     - `training_allowed` (required for inclusion), `internal_demo_ok` (optional), `allow_per_artist_adapters` (optional).
   - Accept LEMM Vault contribution terms.
3. Automated checks (ingestion worker)
   - Audio sanity checks (duration, format).
   - External similarity checks.
   - Internal similarity checks.
   - Basic heuristic content flags (e.g. explicit content, if needed later).
4. Review queue
   - Tracks with no issues may auto-accept or require light reviewer approval (configurable).
   - Tracks with similarity hits or inconsistent metadata go into “manual review.”
5. Decision
   - Accept → track enters the LEMM Vault as `status=accepted`.
   - Reject → track stored only in a quarantine area (or removed entirely per policy).
6. Post-acceptance
   - Features/fingerprints are finalized and added to internal indexes.
   - Track becomes eligible for training batches.
2.2 Minimal submission UI & endpoints¶
Primary endpoints (pseudo-REST)
- `POST /tracks/submit`
- `GET /tracks/{track_id}`
- `GET /tracks/{track_id}/status`
- `POST /tracks/{track_id}/revoke`
- `GET /contributors/me/tracks`
Example: POST /tracks/submit
Request (simplified):
{
"audio_upload_token": "upload-123",
"metadata": {
"title": "Glass City",
"artist_display_name": "Nova Arc",
"is_collaboration": true,
"collaborator_names": ["DJ Ray", "A. Rivera"],
"genre_tags": ["electronic", "ambient"],
"language": "instrumental",
"release_status": "unreleased" // or "released", "work_for_hire"
},
"rights": {
"rights_holder_type": "individual", // individual|label|publisher
"is_original": true,
"is_cover": false,
"has_third_party_samples": false,
"territorial_scope": "worldwide",
"term": "perpetual", // v0: perpetual for training-only
"training_allowed": true,
"internal_demo_ok": true,
"allow_per_artist_adapters": false
}
}
Response (simplified): the per-track consent representation
{
"track_id": "trk_abc123",
"consent": {
"version": 1,
"training_allowed": true,
"internal_demo_ok": true,
"allow_per_artist_adapters": false,
"revoked": false,
"revoked_at": null,
"revocation_reason": null
}
}
Revocation endpoint: POST /tracks/{id}/revoke
Request:
{
"reason": "Leaving the platform",
"scope": "all_future_use" // v0: only future training; document clearly
}
Behavior:
- Sets `revoked=true`, stores timestamp & reason.
- Excludes track from future training jobs.
- Emits event: `track.revoked`.
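A minimal sketch of this behavior, assuming a thin `VaultDB` wrapper and an append-only event log; the method names are placeholders, not a fixed API:

```python
from datetime import datetime, timezone

def revoke_track(db, events, track_id: str, reason: str) -> dict:
    """Mark a track as revoked and emit the track.revoked event."""
    now = datetime.now(timezone.utc).isoformat()
    consent = db.get_rights_record(track_id)      # assumed VaultDB accessor
    consent.update({
        "revoked": True,
        "revoked_at": now,
        "revocation_reason": reason,
    })
    db.save_rights_record(track_id, consent)      # assumed VaultDB writer
    # Future training jobs filter on revoked == False, so nothing else has to
    # change here beyond persisting the flag and logging the event.
    events.append({"type": "track.revoked", "track_id": track_id, "at": now})
    return consent
```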
2.3 Protecting smaller/independent artists¶
Addressing the prompt’s concerns:
- Clarity:
  - Contribution terms are written in plain language with bolded bullets:
    - “We use your track to train models,”
    - “We do not redistribute your original audio,”
    - “You can revoke future use later.”
- Easy revocation:
  - Revocation is a one-click action under “My tracks,” not hidden in settings.
- Bulk-submission constraints (v0 simple rule):
  - Heuristic: a single entity that uploads >X% of the LEMM Vault or thousands of tracks may require:
    - additional verification,
    - governance review before inclusion in training subsets.
  - Prevents one catalog owner from effectively hijacking the “community-owned” identity.
3. Vetting, voting & similarity detection design¶
3.1 External similarity pipeline¶
Scope for v0:
- Up to 2 external sources (e.g. “big commercial catalog via fingerprint API” + one open corpus).
Features & scores
For each submitted track:
- Compute audio fingerprint for external API (e.g. robust hash over spectral landmarks).
- Call external API(s) → receive:
  - `match_confidence` (0–1), `overlap_start`, `overlap_end` (seconds),
  - matched track metadata (title, artist, ISRC if available).
- Optionally compute embedding similarity via a music embedding model:
  - cosine similarity against nearest neighbors in an open-corpus index.
Decision table (external)
| External score / hit | Conditions | Action |
|---|---|---|
| No hit | No match > 0.4 confidence | Continue to internal checks. |
| Weak hit | 0.4 ≤ score < 0.7, or short overlap (<15s) | Flag for reviewer; allow conditional admit. |
| Strong hit | score ≥ 0.7 and overlap ≥ 15s | Auto-reject & send to dispute queue. |
Numbers are illustrative; v0 will tune them using experiments described later.
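The same table expressed as a small decision function, using those illustrative thresholds; `auto_reject` routes the track to the dispute queue described in 3.4:

```python
def external_similarity_action(match_confidence: float, overlap_sec: float) -> str:
    """Map an external similarity hit to an ingestion action (v0, illustrative)."""
    if match_confidence >= 0.7 and overlap_sec >= 15:
        return "auto_reject"       # strong hit -> reject & dispute queue
    if match_confidence >= 0.4:
        return "flag_for_review"   # weak hit -> reviewer; conditional admit
    return "continue"              # no hit -> proceed to internal checks
```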
3.2 Internal similarity pipeline¶
Internal goals: detect exact duplicates and trivially modified re-uploads.
Features per LEMM Vault track
- `fp_exact`: strong audio fingerprint (e.g. Chromaprint-like).
- `fp_robust`: fingerprint variant tolerant to minor EQ/encoding changes.
- `embed_music`: 512-D embedding (e.g. CLAP-style audio embedding).
- Derived stats: duration, loudness profile.
Indexing
- Fingerprint index:
  - Key: `fp_exact` → track IDs.
  - Use hash tables / key-value store.
- Embedding index:
  - ANN (HNSW) over `embed_music`.
  - Returns top-K nearest neighbors with similarity scores.
Internal decision table
| Internal condition | Action |
|---|---|
| Exact fp match to existing accepted track | Treat as duplicate: auto-reject or ask user if same track. |
| High robust-fp + high embedding similarity | Flag as “near-duplicate”; manual review. |
| Medium embedding similarity only | No immediate block; optional reviewer note. |
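A sketch of the internal check; a brute-force cosine scan stands in for the ANN index, the robust-fingerprint axis is omitted for brevity, and the 0.9 / 0.75 thresholds are placeholders:

```python
import math

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def internal_duplicate_action(fp_exact: str, embedding: list,
                              fp_index: dict, embed_index: dict,
                              high: float = 0.9, medium: float = 0.75) -> str:
    """Decide how to treat a new submission against existing accepted tracks.

    fp_index maps fp_exact hashes to track IDs; embed_index maps track IDs
    to stored embeddings. Thresholds are illustrative.
    """
    if fp_exact in fp_index:
        return "duplicate"               # exact match: auto-reject or confirm with user
    best = max((cosine(embedding, e) for e in embed_index.values()), default=0.0)
    if best >= high:
        return "near_duplicate_review"   # flag for manual review
    if best >= medium:
        return "reviewer_note"           # no block, optional note
    return "ok"
```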
3.3 Submission state machine¶
Based on the prompt’s state list.
States
- `submitted`
- `pending_automated_checks`
- `pending_review`
- `accepted`
- `rejected`
- `removed` (post-acceptance removal)
Transitions (v0)
- `submitted → pending_automated_checks`
  - Trigger: track uploaded.
- `pending_automated_checks → pending_review`
  - Trigger: automated checks finished.
  - If no flags, may auto-mark as “auto-approve candidate.”
- `pending_review → accepted`
  - Conditions:
    - No strong external hit,
    - No strong internal duplicate,
    - Reviewer(s) approve OR auto-approval threshold satisfied.
- `pending_review → rejected`
  - Conditions (any of):
    - Strong external match confirmed,
    - Internal duplicate confirmed,
    - Reviewer rejects.
- `accepted → removed`
  - Trigger: revocation or a sustained post-hoc infringement claim.
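A minimal transition table plus guard, which is also the shape the automated tests in section 10.3 would exercise; the guard conditions (similarity results, votes) are evaluated by the caller:

```python
ALLOWED_TRANSITIONS = {
    "submitted": {"pending_automated_checks"},
    "pending_automated_checks": {"pending_review"},
    "pending_review": {"accepted", "rejected"},
    "accepted": {"removed"},
}

def transition(current: str, target: str) -> str:
    """Validate a submission state change against the v0 state machine."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```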
3.4 Voting & disputes¶
From prompt’s questions 14–15.
Voting
- v0 roles allowed to vote: `reviewer`, `admin`.
- Voting interface:
  - Buttons: `approve`, `reject`, `needs_more_info`.
- Decision rule:
  - Single reviewer approval is enough if no automated flags.
  - If flagged:
    - Need at least two approvals or one admin approval.
    - Admin has override authority.
Disputes
- False positives (rejected but contributor disagrees):
  - Contributor opens a dispute: `POST /tracks/{id}/appeal`.
  - Track moves into `dispute_pending`.
  - A separate reviewer or admin re-evaluates, potentially with:
    - additional similarity runs,
    - manual listening.
  - Outcome: `accepted` or `rejected_final`.
- False negatives (later infringement claim):
  - A report from an external party creates an `infringement_claim`.
  - System:
    - Immediately locks the track (`status=under_investigation`) and removes it from training batches.
    - Triggers similarity re-checks.
  - Reviewer/admin decides to keep, reject, or negotiate.
4. Vault architecture & governance design¶
4.1 Minimal vault data model¶
From prompt.
Core entities
- `Track`
- `RightsRecord`
- `FingerprintRecord`
- `SubmissionHistory`
- `Contributor`
- `ModelTrainingRun` (for lineage)
Track (simplified)
{
"track_id": "trk_abc123",
"storage_uri": "lemm://audio-bucket/trk_abc123.flac",
"metadata": {
"title": "Glass City",
"primary_artist_id": "contrib_42",
"display_artist_name": "Nova Arc",
"collaborator_ids": ["contrib_77"],
"genre_tags": ["electronic", "ambient"],
"language": "instrumental",
"duration_sec": 182.3,
"bpm": 110,
"mood_tags": ["dreamy", "slow_build"],
"created_at": "2025-01-01T12:00:00Z",
"release_status": "unreleased"
},
"rights_record_id": "rights_999",
"status": "accepted",
"submission_history_id": "subm_555"
}
RightsRecord
{
"rights_record_id": "rights_999",
"track_id": "trk_abc123",
"contributor_id": "contrib_42",
"license_type": "lemm_training_only_v1",
"territorial_scope": "worldwide",
"term": "perpetual",
"rights_holder_type": "individual",
"training_allowed": true,
"internal_demo_ok": true,
"allow_per_artist_adapters": false,
"revoked": false,
"revoked_at": null,
"consent_version": 1
}
FingerprintRecord
{
"track_id": "trk_abc123",
"fp_exact": "hash1",
"fp_robust": "hash2",
"embedding_music": "base64-encoded-vector",
"external_similarity_summary": {
"source_a_max_score": 0.23,
"source_b_max_score": 0.17
}
}
SubmissionHistory
Tracks decision path: states, timestamps, reviewer IDs, votes, and reasons.
4.2 Raw audio storage & protection¶
Prompt: storage, access control, logging.
- Storage:
  - Encrypted object store (cloud-agnostic in design).
  - Paths are internal URIs (`lemm://...`), not raw HTTPS URLs.
- Access:
  - Only `system`-role services: `TrainingGatewayService`, `FingerprintService`, `AdminTooling` (restricted).
  - Access via signed short-lived tokens, issued by a `VaultAccessService`.
- Logging:
  - Every audio fetch logs: `who` (service + role), `when`, `why` (job/run ID), `track_id`.
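A sketch of how a `VaultAccessService` could mint and verify short-lived signed tokens; HMAC over a small payload is one option, and key management is out of scope here:

```python
import hashlib
import hmac
import time

def mint_access_token(secret: bytes, service: str, track_id: str,
                      run_id: str, ttl_sec: int = 300) -> str:
    """Create a short-lived token authorizing one service to fetch one track."""
    expires = int(time.time()) + ttl_sec
    payload = f"{service}:{track_id}:{run_id}:{expires}"
    sig = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"

def verify_access_token(secret: bytes, token: str) -> bool:
    """Check the signature and expiry; the caller still logs who/when/why."""
    payload, _, sig = token.rpartition(":")
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    expires = int(payload.rsplit(":", 1)[-1])
    return time.time() < expires
```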
4.3 Roles & permissions¶
From prompt.
Roles
- `contributor`
- `reviewer`
- `admin`
- `system` (non-human)
Permissions (v0)
| Role | Read | Write/Update |
|---|---|---|
| contributor | Their own tracks’ metadata & status | Submit tracks, revoke their own tracks |
| reviewer | Track metadata, similarity summaries | Vote on submissions, add review notes |
| admin | All metadata, logs (not raw audio by default) | Change policies, manage roles, resolve disputes |
| system | Raw audio (as needed), fingerprints, training configs | Update indexes, logs, training lineage |
Minimal governance:
- Only admins can:
  - change policy configs (thresholds, license templates),
  - promote/demote reviewers,
  - approve bulk ingest sources.
- All policy changes are logged with:
  - author, timestamp, old/new values.
4.4 Logging & audit trails¶
- Policy changes: `policy_change` events (who, what, when).
- Data access: `vault_access` events (service, track_id, purpose).
- Training usage: `training_batch` events (run_id, track_ids or pack IDs).
- Inference checks: `output_similarity_check` events (model_id, output_id, similarity scores).
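A minimal sketch of an append-only writer for these event types; JSON-lines storage and the helper name are assumptions:

```python
import json
from datetime import datetime, timezone

def log_event(log_path: str, event_type: str, **fields) -> None:
    """Append one audit event (policy_change, vault_access, training_batch,
    output_similarity_check) as a JSON line."""
    entry = {
        "event_type": event_type,
        "at": datetime.now(timezone.utc).isoformat(),
        **fields,
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")

# Example:
# log_event("audit.log", "vault_access", service="TrainingGatewayService",
#           track_id="trk_abc123", purpose="run_42")
```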
5. Training-only usage & enforcement spec¶
5.1 High-level architecture¶
Prompt wants an architecture diagram; here’s the textual equivalent.
Components
- `VaultDB` (tracks, rights, metadata, fingerprints)
- `ObjectStore` (encrypted audio)
- `TrainingGatewayService`
- `FingerprintService`
- `TrainingOrchestrator` (e.g. job scheduler)
- `ModelRegistry`
- `LineageStore`
Flow
1. Training job request:
   - `TrainingOrchestrator` sends a data spec to `TrainingGatewayService`, including:
     - model_type,
     - desired data packs/filters (e.g. “LEMM Vault core, ambient only”),
     - run_id.
2. `TrainingGatewayService`:
   - Resolves the allowed track set:
     - rights: `training_allowed == true`, `revoked == false`,
     - pack filters (if using the pack abstraction),
     - optional genre/time filters.
   - Returns:
     - list of track IDs or sample manifests,
     - ephemeral signed URLs / tokens for audio or precomputed training chunks.
3. Training workers:
   - Stream audio using those signed URLs, but cannot list Vault contents themselves.
4. After training:
   - `TrainingOrchestrator` writes a `ModelTrainingRun` record linking:
     - model version,
     - run_id,
     - data selection spec,
     - resolved track set (or pack IDs).
5.2 Rules for mixing LEMM Vault data with other datasets¶
Borrowing “data packs” pattern from the main system doc.
- Dataset registry: each dataset (LEMM Vault, open corpora, licensed packs) is a named pack with:
  - allowed uses,
  - license constraints.
- Model config:
  - Each run must declare which packs are used, e.g. `["LEMM_Vault_v1_core", "PD_CLASSICAL_v2"]`.
- Constraints:
  - For models advertised as “powered by clean data from the LEMM Vault”:
    - only packs with compatible licenses are allowed.
- Strict logging:
  - each run stores the exact set of packs used.
Enforcement
- Training configs validated server-side.
- No “ad-hoc” mixing; if you want to include another dataset, it must exist as a pack entry with metadata.
6. Generation-time safety & non-reproduction spec¶
6.1 Output similarity pipeline¶
Prompt: check against the LEMM Vault + one external reference index.
Steps per generation:
- Generation completed (or preview ready).
- `OutputSafetyService`:
  - Computes `fp_exact`, `fp_robust`, `embed_music` for the output.
  - Queries internal LEMM Vault fingerprint and embedding indexes.
  - Queries external fingerprint API for near-duplicates.
- Aggregates similarity metrics:
  - `sim_internal_max`, `sim_external_max`.
- Applies decision thresholds (v0, conservative):
| Condition | Action |
|---|---|
| both max < 0.4 | Output allowed silently. |
| any in [0.4, 0.7) | Output allowed but flagged; optional soft warning. |
| ≥ 0.7 against LEMM Vault or external track | Output blocked; user asked to regenerate. |
This aligns with broad “multi-axis similarity” direction from the main system doc (audio fingerprint as backstop, more detail later).
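The v0 gate over the aggregated scores, as a sketch; thresholds mirror the table above and are conservative placeholders:

```python
def output_policy(sim_internal_max: float, sim_external_max: float) -> str:
    """Return allow / warn / block for a generated output."""
    worst = max(sim_internal_max, sim_external_max)
    if worst >= 0.7:
        return "block"   # too similar to a LEMM Vault or external track
    if worst >= 0.4:
        return "warn"    # allowed but flagged; optional soft warning
    return "allow"       # allowed silently
```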
6.2 User-facing behavior when outputs are too similar¶
From questions 25–26.
Block behavior
- Show a message like:
  - “This output is too similar to an existing song and can’t be used as-is. Try regenerating with a different prompt or variation.”
- UX:
  - Offer a `Regenerate` button.
  - Optionally, suggest prompt tweaks (“less like X,” “more ambient,” etc.).
Warning behavior (mid-range similarity)
- Subtle warning:
  - “We detected some similarity to existing works. If you plan to use this commercially, consider regenerating.”
- Still allow export, but:
  - Log event with `similarity_level = warning`.
Logging & notifications
- For every blocked/warned output:
  - log `output_similarity_check` (scores, matched track IDs, model version),
  - optionally notify internal moderation / admin if repeated hits occur for the same user or pattern.
6.3 Prompts like “make it like [known song]” / “in the style of X”¶
From prompt.
Policy in v0:
- Explicit prompts referencing specific commercial songs:
  - moderate / disallow:
    - either reject outright (“We can’t generate directly in the style of specific copyrighted works”), or
    - accept but combine with stricter similarity thresholds.
- Prompts referencing broad genres or eras:
  - allowed (e.g. “80s synthwave,” “lo-fi hip hop”).
- Named artists:
  - For non-participating artists: treat similarly to specific songs; at least a prompt-level warning.
Enforcement:
- Prompt moderation layer:
  - simple rule-based plus model-assisted classification of:
    - “explicit style cloning” vs generic genre.
- Output checks remain the final gate; prompt rules only reduce risk.
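A sketch of the rule-based half of that moderation layer; the patterns and reference lists are purely illustrative placeholders and would sit alongside the model-assisted classifier:

```python
import re

# Purely illustrative reference data; real lists would come from catalog metadata.
KNOWN_SONG_TITLES = {"some famous song title"}
NON_PARTICIPATING_ARTISTS = {"some famous artist"}

STYLE_CLONE_PATTERNS = [
    r"\b(sounds? like|in the style of|like the song)\b",
    r"\bcover of\b",
]

def classify_prompt(prompt: str) -> str:
    """Return 'style_clone', 'named_artist', or 'generic' for a user prompt."""
    lowered = prompt.lower()
    if any(re.search(p, lowered) for p in STYLE_CLONE_PATTERNS) \
            or any(t in lowered for t in KNOWN_SONG_TITLES):
        return "style_clone"    # reject outright or apply stricter output thresholds
    if any(a in lowered for a in NON_PARTICIPATING_ARTISTS):
        return "named_artist"   # at least a prompt-level warning
    return "generic"            # broad genres/eras ("80s synthwave") pass through
```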
7. Auditability & lineage plan¶
Questions 27–29.
7.1 What lineage data is stored in v0?¶
For each model version:
- `model_id`, `model_name`
- `type` (symbolic core, audio renderer, end-to-end, etc.)
- `training_runs`: list of run IDs
For each run:
- `run_id`
- `timestamp_start`, `timestamp_end`
- `data_packs` used (e.g. `LEMM_Vault_v1_core`)
- high-level hyperparameters (batch size, epochs, lr)
- hash of training config
- Optional, if feasible at v0 scale:
  - `track_ids_sampled` (or hashed IDs) used in that run.
For each track:
- `track_id`
- `training_runs_included`: list of run IDs
This matches the data-pack lineage ideas in the main system doc.
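A sketch of the run-level lineage record and the query it needs to support, with field names following the lists above; persistence is out of scope:

```python
from dataclasses import dataclass, field

@dataclass
class ModelTrainingRun:
    run_id: str
    model_id: str
    timestamp_start: str
    timestamp_end: str
    data_packs: list
    config_hash: str
    track_ids_sampled: list = field(default_factory=list)  # optional at v0 scale

def packs_used_by_model(model_id: str, runs: list) -> set:
    """Union of data packs across all runs of one model version.

    Supports the 7.2 proof: compare this set against the dataset registry
    entries for the packs the model is advertised as using.
    """
    return {pack for run in runs if run.model_id == model_id for pack in run.data_packs}
```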
7.2 Reconstructing proofs later¶
Needed proofs per prompt:
- “Model M was trained only on the LEMM Vault + explicitly listed packs”
  - Show:
    - `ModelTrainingRun` records for M,
    - each run’s `data_packs` list,
    - dataset registry entries for those packs.
- “Output O was checked against the LEMM Vault + at least one external catalog”
  - Show:
    - the `output_similarity_check` log for O:
      - `checked_against_internal = true`,
      - `checked_against_external_catalogs = ["catalogA"]`.
7.3 Handling rights changes over time¶
From prompt.
- Immediate effects:
  - Mark revoked tracks as:
    - `revoked = true` in RightsRecord,
    - `status = removed` or `inactive` in Track.
  - Remove from:
    - training-eligible views,
    - future similarity reference sets (or keep for safety-only usage if contracts allow).
- Existing models:
  - v0 policy:
    - Document that models already trained on the track will not be retroactively retrained by default.
  - For particular high-risk contexts, consider:
    - training new models excluding revoked packs,
    - limiting where older models are offered.
8. v0 creator licensing & assurances¶
Questions 30–32.
8.1 Minimal licensing flow for generated outputs¶
Non-lawyer, product-safe description (we’re not drafting a contract).
License type (v0)
- Standard non-exclusive license for generated outputs:
  - Worldwide,
  - Allows commercial use,
  - User owns rights in their output subject to:
    - no claim over underlying training data or models,
    - no use of outputs to train competitors without permission (optional clause, if desired),
    - compliance with content and usage policies.
UX flow
- At first export from a model:
  - Show compact license summary (“Your rights in generated music”), with link to full terms.
  - Require checkbox: “I agree to these terms” before enabling download.
- For returning users:
  - License acceptance is stored per account and shown as a small reminder on the export screen.
We log for each output:
- `user_id`
- `output_id`
- `model_id`
- `license_version_accepted`
8.2 Assurances to end-users¶
Per prompt.
We can honestly promise that v0:
- Runs similarity checks against:
  - the LEMM Vault, and
  - at least one external catalog, before releasing outputs.
- Applies conservative blocking on near-duplicates.
- Logs all:
  - outputs,
  - checks,
  - model versions used.
- Has internal processes to:
  - review complaints,
  - re-run similarity checks with updated tools,
  - temporarily block suspect outputs or models.
We do not promise:
- that no generated output will ever raise a dispute,
- or that similarity detection is perfect.
8.3 Contributor protections¶
From prompt.
- We do not redistribute original tracks.
- We do not claim ownership of their works.
- We avoid promising things that are not technically enforceable (e.g. “zero chance of similarity”).
- We provide:
  - a revocation path,
  - visibility into how many models used their tracks (from lineage),
  - a documented “future optional rewards” section without committing to specific economics in v0.
9. Risk register & minimal experiments¶
9.1 Key risks (ranked)¶
1. Similarity false negatives
   - Real-world infringement risk from missed matches.
2. Similarity false positives
   - Unfair rejections / blocks that frustrate contributors or users.
3. Governance capture
   - A few entities effectively control what “community-owned” means.
4. User misunderstanding of rights
   - Contributors misread terms; users overestimate safety.
5. Implementation drift
   - Engineering shortcuts that bypass gates (e.g. direct raw audio access).
6. Index scaling / cost
   - Similarity checks become slow or expensive at thousands of tracks.
9.2 Minimal experiments (3–6, per prompt)¶
1. Prototype external similarity pipeline
   - Build a small index of ~1k known commercial tracks plus ~200 synthetic tracks.
   - Test:
     - detection rate for identical copies and lightly edited variants,
     - thresholds that balance precision/recall.
   - Success: >95% detection of exact/near copies with <5% false positive rate on synthetic originals.
2. Internal near-duplicate detection test
   - Generate synthetic pairs:
     - same track re-encoded,
     - EQ’d, time-stretched versions.
   - Evaluate:
     - performance of exact vs robust fingerprint,
     - embedding-based similarity.
   - Outcome: candidate thresholds for “internal high similarity.”
3. Non-reproduction filter test
   - Train a small music model on a tiny subset of tracks from the LEMM Vault.
   - Prompt it aggressively for reproduction of that data.
   - Run output through the safety pipeline:
     - measure how often near-copies are blocked.
   - Adjust thresholds and multi-axis scoring as per main-system IP risk suggestions.
4. Contributor UX pilot
   - Fake UI with 10–20 independent musicians.
   - Measure:
     - whether they understand contribution terms,
     - how many are willing to grant training rights,
     - whether the revocation flow feels discoverable.
5. Governance simulation
   - Simulate a mixed set of submissions:
     - obvious originals,
     - obvious infringements,
     - ambiguous cases.
   - Have a small reviewer group use the v0 tools.
   - Examine:
     - inter-reviewer agreement,
     - time-to-decision,
     - where thresholds or language need refinement.
6. Load and cost profiling of the similarity system
   - Benchmark:
     - ingestion-time checks (N tracks/day),
     - generation-time checks (N outputs/day).
   - Ensure:
     - latencies are acceptable for interactive UX,
     - index updates fit in budget.
10. v0 prototype blueprint & build-ready prompt¶
10.1 Component-level blueprint¶
Services
- `AuthService`
  - User accounts, roles.
- `ContributionAPI`
  - Endpoints: `POST /tracks/submit`, `GET /tracks/{id}`, `POST /tracks/{id}/revoke`.
- `IngestionWorker`
  - Consumes submission queue.
  - Runs:
    - format checks,
    - feature extraction,
    - external + internal similarity checks.
  - Updates `VaultDB` and `FingerprintIndex`.
- `ReviewService`
  - UI + APIs for reviewers/admins:
    - list pending tracks,
    - show similarity summaries,
    - record votes.
- `VaultDB` + `ObjectStore`
  - As in section 4.
- `FingerprintService`
  - Extracts audio features and maintains indexes.
- `TrainingGatewayService`
  - Implements training-only data access.
- `ModelRegistry` + `LineageStore`
  - Records model versions and their data use.
- `OutputSafetyService`
  - Generation-time similarity checks.
- `Logging/AuditService`
  - Central log of policy changes, access, training usage, and safety decisions.
Data stores
- `VaultDB` (OLTP relational or document store).
- `FingerprintIndex` (ANN structures + hash tables).
- `Logs` (append-only, queryable).
10.2 Prioritized implementation phases¶
From prompt’s “Phase 1: submission + vault + external similarity MVP” hint.
Phase 1 – Submission + Vault + External Similarity
- Implement:
  - Auth + contributor onboarding.
  - `POST /tracks/submit`.
  - ObjectStore upload path and storage.
  - External similarity check via single catalog.
  - Minimal reviewer UI.
  - Vault data model and simple indexes.
- Goal:
  - Accept/reject tracks with logged decisions.
  - Have a small vetted LEMM Vault dataset (hundreds of tracks).
Phase 2 – Internal Similarity + Training Gateway + Lineage
- Implement:
  - Internal fingerprint & embedding index.
  - Internal duplicate detection.
  - TrainingGatewayService & nailing down “train-only” enforcement.
  - ModelRegistry + basic lineage logging.
- Goal:
  - Train first prototype models on packs from the LEMM Vault with recorded lineage.
Phase 3 – Generation-time Safety + Licensing UX
- Implement:
  - OutputSafetyService in the inference pipeline.
  - Blocking/warning flows.
  - User-facing license acceptance for exports.
- Goal:
  - End-to-end pipeline: models trained on the LEMM Vault, safe-ish generation, and logged checks.
Phase 4 – Governance & Dispute Tools
- Implement:
  - Dispute handling,
  - more advanced reviewer dashboards,
  - policy editing UI for admins.
10.3 Build-ready prompt (to give engineers / code-capable assistant)¶
You wanted something like: “Given this architecture and data model, implement v0…”
Here’s a concise version:
Build prompt – LEMM Vault v0 Submission, Vault & Safety Core
Implement v0 of the LEMM Vault with the following scope:
1. Auth & Roles
   - User accounts with roles: `contributor`, `reviewer`, `admin`.
2. Contribution API
   - Endpoints:
     - `POST /tracks/submit`: accept metadata, rights flags, and an upload token; create `Track`, `RightsRecord`, and enqueue an ingestion job.
     - `GET /tracks/{track_id}`: return track metadata and status to the owner or reviewers.
     - `POST /tracks/{track_id}/revoke`: mark `revoked=true` in `RightsRecord`, log a `track.revoked` event.
3. Vault & Storage
   - `VaultDB` schemas for `Track`, `RightsRecord`, `FingerprintRecord`, `SubmissionHistory`, `Contributor`, and `ModelTrainingRun` as described above.
   - Store audio in an object store using an internal URI scheme; do NOT expose raw audio URLs publicly.
4. Ingestion & Similarity
   - Ingestion worker that:
     - validates basic audio properties,
     - calls `FingerprintService` to compute fingerprints & embeddings,
     - queries one external catalog API for matches and writes similarity scores to `FingerprintRecord`,
     - checks internal indexes for duplicates,
     - updates submission state to `pending_review` with flags.
5. Review Workflow
   - Reviewer UI/API to:
     - list pending submissions with similarity summaries,
     - approve/reject with comments,
     - transition state to `accepted` or `rejected` according to the state machine.
6. Training Gateway & Lineage
   - Service that, given a training config (requested data packs, filters, run_id), returns an iterable of track IDs and signed URLs for audio where `training_allowed == true` and `revoked == false`.
   - Write `ModelTrainingRun` records linking model IDs to data packs and run IDs.
7. Output Safety
   - Service that:
     - receives generated audio (or a reference to it), computes fingerprints/embeddings,
     - queries the LEMM Vault and the external catalog,
     - applies thresholds: block at ≥0.7 similarity, warn for [0.4, 0.7),
     - returns a policy decision (`allow`, `warn`, `block`) and logs all checks.
8. Logging & Audit
   - Central logging for:
     - policy changes,
     - track submission decisions,
     - training data selection (per run),
     - output similarity checks.
The implementation should include automated tests for:
- the submission state machine,
- enforcement that revoked tracks are never served to new training jobs,
- correct blocking behavior when similarity thresholds are exceeded.
There you go: mission, pipeline, safety, governance, lineage, risks, and a blueprint that an actual engineer could start coding against. Try not to feed it to something that writes everything in a single 5,000-line file.