Research prompt
Title
LEMM: a community-owned clean music dataset and safety system, designed for a lean v0 prototype
1. Context and mission
We want to design LEMM: a community-owned, verifiably clean music dataset and associated governance/safety system for training generative music models.
Key ideas:
- Contributors submit original tracks to LEMM under clear, explicit terms.
- Submitted works go into a controlled vault, not a public dump.
- There is a vetting/voting process on submissions, combining:
  - Automated similarity detection against external catalogs and internal content.
  - Community and/or expert review.
- Accepted tracks are used only for model training and evaluation, not for redistribution of the raw data.
- Trained models are required to:
  - Not reproduce existing songs in the vault or in external catalogs.
  - Provide users with outputs that can be safely licensed and used commercially.
This research is not only conceptual. It must end in a lean, technically precise v0 prototype design for LEMM, with:
- Clear technical architecture (components, APIs, data models).
- Minimal but realistic first implementation scope.
- Either:
  - A concrete “build prompt” that can be handed to an engineering team or code-capable assistant to implement the prototype; or
  - An explicit v0 implementation spec ready to be executed as-is.
2. Objectives
- Formalize LEMM’s mission and constraints
  - Precisely define what “clean”, “community-owned”, and “training-only use” mean in operational, legal, and technical terms.
- Design the contribution and consent pipeline
  - Define how tracks are submitted, described, and authorized for use.
  - Specify the minimal v0 UX and backend needed to support this.
- Specify the vetting, voting, and similarity-detection system
  - Design external and internal similarity checks.
  - Define the decision logic for accepting/rejecting/flagging tracks.
  - Scope a lean v0 implementation (e.g. 1–2 external sources, simple thresholds).
- Design the LEMM vault and governance model
  - Data model and storage for tracks, metadata, rights, and fingerprints.
  - Minimal governance flows (who can do what in v0).
- Define the training-only usage model and technical enforcement
  - How models can be trained on LEMM in v0.
  - How to technically prevent raw data leakage.
- Design generation-time safety and non-reproduction mechanisms
  - Operational definition of “too similar”.
  - Concrete, implementable v0 checks on generated outputs.
- Define auditability and data/model lineage mechanisms
  - Minimal metadata and logging needed to track which data trained which model.
- Design an initial creator licensing flow
  - How end-users obtain clear licenses to generated outputs in v0.
  - How contributors are protected.
- Identify the riskiest assumptions and design minimal experiments
  - Plan small-scale tests that validate similarity detection, non-reproduction, contributor UX, and governance decisions.
- Produce a lean v0 prototype blueprint and a build-ready prompt
  - A technically detailed design for a first working LEMM prototype.
  - A short, concrete implementation prompt that can be used to drive actual prototyping.
3. Questions to answer
3.1 Product and policy definition
- What is LEMM’s mission statement in precise, implementable terms?
  - For contributors: what they contribute to, what they allow, what they keep.
  - For downstream users: what guarantees they get about the models and outputs.
- Define “clean dataset” for LEMM:
  - Minimum rights LEMM must have for each track (e.g. explicit consent for training, no hidden encumbrances).
  - Rights LEMM must not claim (e.g. no right to redistribute or resell raw tracks).
  - How revocation and changes of consent affect data and models.
- Define “community-owned” in practice for v0:
  - What structure is assumed (e.g. foundation-like governance vs a minimal committee).
  - Who decides on:
    - Admission/removal of tracks.
    - Policy updates.
    - Use of any revenues.
- Define “training-only use”:
  - Allowed: training generative models, evaluation, internal safety research.
  - Disallowed: streaming LEMM’s raw tracks, sublicensing the dataset, exposing per-track audio to third parties.
- What concrete promises does v0 LEMM make, and how are they technically supported?
  - To contributors.
  - To model developers.
  - To end-users.
3.2 Contribution and consent pipeline
- What does the v0 submission flow look like, end-to-end? For each submission:
  - Inputs: audio file(s) + basic metadata (artist, collaborators, title, genre, language, year, explicit originality claim).
  - Steps:
    - User identity handling (even if minimal).
    - Consent capture: what exactly they agree to.
    - Confirmation/receipt.
- What is the minimal identity/authenticity handling for v0?
  - Do we accept pseudonymous submissions with only an email or account?
  - Do we need optional stronger verification for certain tiers?
  - How do we record and store “who submitted what” in the system?
- How is consent represented technically?
  - Per-track rights record: fields, enums, flags.
  - How is a change in consent (e.g. revocation) represented and propagated?
- How does v0 protect smaller/independent artists?
  - Clear consent language.
  - An easy revocation or correction path.
  - Optionally, constraints on bulk submissions from single entities to keep the dataset diverse.
- What is the minimal submission UI/UX we assume, and what backend endpoints does it call?
  - Sketch the endpoints: POST /tracks/submit, etc.
  - List required, optional, and derived fields.
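To make the pseudo-schema concrete, here is a minimal Python sketch of a submission payload and per-track rights record. All field names, enum values, and the payload shape are illustrative assumptions for the POST /tracks/submit endpoint, not a fixed v0 schema:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class ConsentScope(Enum):
    # Assumed v0 consent scopes: what the contributor authorizes.
    TRAINING_ONLY = "training_only"          # model training + evaluation
    TRAINING_AND_SAFETY = "training_safety"  # plus internal safety research

@dataclass
class RightsRecord:
    consent_version: str       # version of the consent text agreed to
    scope: ConsentScope
    originality_claimed: bool  # explicit originality claim from the submitter
    revoked: bool = False      # flipped when the contributor revokes

@dataclass
class TrackSubmission:
    # Body of a hypothetical POST /tracks/submit request.
    submitter_id: str          # pseudonymous account or email reference
    audio_ref: str             # upload handle for the audio file(s)
    title: str
    artist: str
    genre: str
    language: str
    year: int
    collaborators: list[str] = field(default_factory=list)
    rights: Optional[RightsRecord] = None

# Example: a minimal valid submission with training-only consent.
sub = TrackSubmission(
    submitter_id="user-42",
    audio_ref="upload://abc123",
    title="Night Drive",
    artist="Example Artist",
    genre="synthwave",
    language="en",
    year=2024,
    rights=RightsRecord("v1", ConsentScope.TRAINING_ONLY, originality_claimed=True),
)
```

Keeping the rights record as a separate, versioned object makes later consent changes (e.g. revocation) a matter of updating and propagating one record rather than rewriting the whole track entry.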
3.3 Vetting, voting, and similarity detection
- What is the v0 external similarity pipeline?
  - Choose 1–2 external sources (abstractly) to check against (e.g. a major catalog via a fingerprinting API + a public open corpus).
  - For each:
    - What kind of fingerprint/feature is computed (audio fingerprint, embedding, etc.)?
    - What similarity score is computed?
    - What thresholds are used to flag a track?
  - Deliverable: a simple decision table for “no hit / weak hit / strong hit” and corresponding actions.
- What is the v0 internal similarity pipeline?
  - How do we detect:
    - Exact duplicates (same audio).
    - Very similar tracks (e.g. a re-upload or trivial modification).
  - What features/fingerprints do we compute for each LEMM track?
  - How is the internal index stored and maintained?
- What is the v0 decision flow for submissions?
  - Define a state machine:
    - States: submitted → pending checks → pending review → accepted → rejected → removed.
    - For each state transition: trigger conditions (e.g. automated checks, community vote).
  - What happens when:
    - External similarity is high?
    - Internal similarity is high?
    - The community flags a track?
- How does v0 voting work?
  - Who can vote (contributors, reviewers, a small trusted core)?
  - What interface: simple approve/reject?
  - What thresholds (e.g. N approvals, no strong objections)?
- How does v0 handle disputes and errors?
  - False positives: a process to appeal and re-evaluate a rejected track.
  - False negatives: a process for when someone later claims infringement:
    - Temporarily lock the track.
    - Re-run checks.
    - Follow an escalation path.
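As one way to make the decision table and admission state machine concrete, a minimal Python sketch; the thresholds, state names, and event names are illustrative assumptions, not fixed policy:

```python
# "no hit / weak hit / strong hit" decision table, driven by two
# assumed similarity thresholds.
WEAK_HIT = 0.60    # above this, flag the track for human/community review
STRONG_HIT = 0.85  # above this, reject automatically

def similarity_action(score: float) -> str:
    if score >= STRONG_HIT:
        return "reject"           # strong hit: auto-reject
    if score >= WEAK_HIT:
        return "flag_for_review"  # weak hit: route to review
    return "pass"                 # no hit: proceed to voting

# Admission state machine: state -> {event -> next state}.
TRANSITIONS = {
    "submitted":      {"checks_started": "pending_checks"},
    "pending_checks": {"checks_passed": "pending_review",
                       "strong_hit": "rejected"},
    "pending_review": {"vote_approved": "accepted",
                       "vote_rejected": "rejected"},
    "accepted":       {"revoked_or_flagged": "removed"},
}

def step(state: str, event: str) -> str:
    # Reject transitions the v0 flow does not allow.
    try:
        return TRANSITIONS[state][event]
    except KeyError:
        raise ValueError(f"illegal transition: {state} -> {event}")
```

For example, `similarity_action(0.9)` yields `"reject"`, while an event the flow does not allow, such as `step("accepted", "vote_approved")`, raises an error instead of silently changing state.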
3.4 Vault architecture and governance
- What is the minimal vault data model for v0?
  - For each track, define:
    - Track ID.
    - Storage location.
    - Metadata (artist, title, etc.).
    - Rights record (consent version, scope, revocation state).
    - Similarity fingerprints/features.
    - Submission and decision history.
  - For the dataset overall:
    - Global indexes (by artist, genre, date).
    - A fingerprint index for fast similarity queries.
- How is raw audio stored and protected?
  - Storage technology (abstract).
  - Access control: which services/roles can read original audio?
  - Logging of access events.
- What are the v0 roles and permissions?
  - Example roles: contributor, reviewer, admin, system.
  - What each role can:
    - Read (e.g. track metadata vs audio vs logs).
    - Write/update (e.g. rights, metadata, flags).
  - Minimal governance rule-set: who can change policies and how that is recorded.
- What logging and audit trails exist in v0?
  - Submission and decision logs.
  - Rights changes.
  - Access to sensitive data (audio, detailed fingerprints).
  - Minimal structure of logs (fields, retention, searchability).
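The role/permission questions above can be sketched as a simple matrix plus a check function. The roles match the examples in this section, but the resources and specific grants are assumptions for illustration only:

```python
# Illustrative v0 role/permission matrix: role -> action -> resources.
PERMISSIONS = {
    "contributor": {"read": {"own_metadata"},
                    "write": {"own_submission", "own_rights"}},
    "reviewer":    {"read": {"metadata", "fingerprints"},
                    "write": {"flags", "votes"}},
    "admin":       {"read": {"metadata", "fingerprints", "logs"},
                    "write": {"rights", "metadata", "policies"}},
    # "system" stands for automated pipeline services; in this sketch it
    # is the only role allowed to read raw audio.
    "system":      {"read": {"audio", "metadata", "fingerprints"},
                    "write": {"fingerprints", "logs"}},
}

def can(role: str, action: str, resource: str) -> bool:
    # Default-deny: unknown roles, actions, or resources are refused.
    return resource in PERMISSIONS.get(role, {}).get(action, set())
```

A default-deny check like this keeps the "which services can read original audio" question answerable in one place, and every call to it is a natural point to emit the access-event log entries this section asks for.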
3.5 Training-only usage model
- How are LEMM tracks used to train models in v0?
  - Define a simple pipeline:
    - Export a training subset (internally).
    - Transform audio into features/representations for model training.
  - Distinguish:
    - LEMM-internal training experiments.
    - Models that are intended to become user-facing products.
- How is training-only usage enforced technically?
  - Which services are allowed to access raw audio vs only features?
  - How to prevent:
    - Copying raw audio out of the core environment.
    - Third parties from obtaining the dataset.
- How does LEMM interact with other datasets in v0?
  - Can models be co-trained on LEMM + other licensed sets?
  - How do we keep clear metadata about:
    - Which model used which combination of datasets.
    - Additional constraints inherited from those other datasets.
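One lightweight way to keep the dataset-mix metadata honest is to record, per model, the union of constraints inherited from every dataset used in training. The dataset names and constraint labels below are assumptions, not a defined registry:

```python
# Assumed registry: dataset id -> constraints it imposes on any model
# trained on it.
DATASET_CONSTRAINTS = {
    "lemm-v0":        {"training_only", "no_redistribution"},
    "licensed-set-A": {"training_only", "attribution_required"},
}

def model_constraints(datasets: list[str]) -> set[str]:
    # A model inherits the union of the constraints of every dataset it
    # was co-trained on; an unknown dataset raises rather than passing
    # silently.
    out: set[str] = set()
    for d in datasets:
        out |= DATASET_CONSTRAINTS[d]
    return out
```

Raising on unknown dataset IDs is deliberate: a training run that cannot name its datasets should fail fast rather than produce a model with undocumented obligations.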
3.6 Generation-time safety and non-reproduction
- Define “too similar” for generated outputs in v0:
  - Which measures:
    - Audio fingerprints?
    - Embeddings?
    - Symbolic similarity (if symbolic data is available)?
  - What thresholds or rules define:
    - Allowable similarity (stylistic, generic patterns).
    - Disallowed similarity (near-duplicate melody, identical audio fragments).
- What v0 mechanisms are feasible to implement?
  - Training-time: basic anti-memorization strategies (e.g. avoiding overfitting, regularization).
  - Inference-time:
    - Generate a candidate output → compute its fingerprint → compare against:
      - The LEMM vault.
      - One external reference index.
    - Block or warn if similarity exceeds a threshold.
- What is the v0 user-facing behavior when outputs are too similar?
  - Do we:
    - Reject with a message and suggest regeneration?
    - Provide a high-level warning (“too similar to existing works”)?
  - What logging and internal notifications are triggered?
- How does v0 handle prompts like “make it like [known song]” or “in the style of X”?
  - The policy stance for v0.
  - How that stance is partly enforced via:
    - Moderation of prompts.
    - Similarity checks on outputs.
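The inference-time pipeline above (generate → fingerprint → compare → block/warn) can be sketched as a small gating function over precomputed similarity scores; the two thresholds are illustrative assumptions:

```python
# Gate for a candidate output whose similarity has already been scored
# against the LEMM vault and one external reference index.
BLOCK_AT = 0.85  # assumed "too similar" threshold: reject + log internally
WARN_AT = 0.60   # assumed "possibly similar" threshold: deliver with warning

def gate_output(score_vault: float, score_external: float) -> str:
    # Judge by the worst (highest) similarity across both reference sets.
    worst = max(score_vault, score_external)
    if worst >= BLOCK_AT:
        return "block"  # reject with a message and suggest regeneration
    if worst >= WARN_AT:
        return "warn"   # deliver with a "similar to existing works" notice
    return "allow"
```

Taking the maximum over both reference sets means a clean external check can never mask a vault hit, and vice versa; the same three-way result also decides which log entries and internal notifications to emit.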
3.7 Auditability and lineage
- What lineage data is stored in v0?
  - For each model version:
    - Which subset of LEMM (and other datasets) was used.
    - Training run ID, date, and core parameters.
  - For each track:
    - Which training runs included it.
- How can we later reconstruct:
  - A proof that a model was trained only on LEMM + explicitly listed other datasets.
  - Evidence that an output was checked against:
    - The LEMM vault.
    - At least one external catalog.
- How does v0 handle rights changes over time?
  - If a contributor revokes a track, what happens:
    - Immediately (e.g. mark the track as inactive, block it from further training).
    - For existing models (document the policy: e.g. no retroactive retraining in v0, or retraining for certain high-risk uses).
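A minimal revocation handler, assuming the track-level lineage above (track ID → training run IDs) is already recorded, might look like this sketch; the field names are assumptions:

```python
def revoke(track: dict, lineage: dict[str, list[str]]) -> dict:
    # track: a minimal rights record; lineage: track_id -> run_ids.
    # Immediately mark the track inactive so no future training export
    # can include it (the v0 policy: no retroactive retraining).
    track = dict(track, active=False, revoked=True)
    # Surface which past runs used the track, for the audit trail and
    # for deciding whether any model needs follow-up under policy.
    affected = lineage.get(track["track_id"], [])
    return {"track": track, "affected_runs": affected}
```

Returning the affected runs rather than acting on them keeps the v0 stance explicit: revocation blocks future use immediately, while what happens to existing models is a documented policy decision, not an automatic retrain.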
3.8 Creator licensing and economic layer
- What is the minimal licensing flow for generated outputs in v0?
  - What type of license (e.g. standard non-exclusive commercial use)?
  - Basic terms (no need for full legal drafting, but key points and constraints).
  - UX: how a user “obtains” this license (e.g. click-through at download).
- What assurances do we provide to users in v0?
  - That reasonable steps are taken to:
    - Avoid reproducing tracks in LEMM.
    - Avoid reproducing tracks in major external catalogs.
  - That we have:
    - Logs and checks for generated outputs.
    - Processes for addressing complaints.
- How are contributors protected in v0?
  - We do not redistribute their original tracks.
  - We do not promise more than we can technically enforce.
  - Optionally: a simple, documented path for future reward-sharing ideas, even if not implemented yet.
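The click-through step could be as simple as issuing a per-output license record tied to the safety-check log entry that cleared the output. The license type and fields below are assumptions for illustration, not drafted terms:

```python
import time
import uuid

def issue_license(user_id: str, output_id: str, check_log_id: str) -> dict:
    # Issued at download time, after the output passed the similarity gate.
    return {
        "license_id": str(uuid.uuid4()),
        "type": "non_exclusive_commercial",  # assumed v0 license type
        "user_id": user_id,
        "output_id": output_id,
        "safety_check": check_log_id,  # ties the license to the check log
        "issued_at": int(time.time()),
    }
```

Linking each license to its safety-check log entry is what backs the user assurances above: for any licensed output, there is a record showing it was checked before release.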
4. Prototype focus and constraints
The research must converge on a lean v0 prototype design for LEMM, constrained as follows:
- Scope:
  - Music audio tracks only (no lyrics-only submissions, no other media).
  - A reasonable initial dataset size (conceptually thousands of tracks, not millions).
- External similarity:
  - Assume at most 1–2 external catalogs/APIs for v0.
- Governance:
  - Minimal but clear roles (e.g. contributors, reviewers, admins).
- Technical stack:
  - Abstracted (no specific cloud vendor required), but concrete enough to:
    - Define services/components.
    - Define interfaces and data models.
- Safety:
  - Implementable v0 similarity checks for ingestion and generation, even if approximate.
5. Artifacts and deliverables
The research should produce:
- LEMM mission and constraints document
  - A clear, concise statement of mission, rights, and promises, plus how they map to system behavior.
- Contribution and consent pipeline spec
  - An end-to-end flow chart.
  - API endpoints and data structures for submission and consent (pseudo-schemas).
- Vetting and similarity detection design
  - Description of:
    - External and internal similarity features.
    - Indexing strategy.
    - A threshold table for decisions.
  - The v0 decision-state machine for track admission.
- Vault architecture and governance design
  - Data model for tracks, rights, fingerprints, and lineage.
  - Role/permission matrix.
  - Logging and auditing outline.
- Training-only usage and enforcement spec
  - An architecture diagram showing:
    - How training jobs access LEMM.
    - How raw data access is restricted.
  - Rules for mixing LEMM with other datasets.
- Generation-time safety and non-reproduction spec
  - The v0 similarity check pipeline for outputs.
  - Blocking/warning rules and user-facing behavior.
- Auditability and lineage plan
  - Minimal metadata/logs to track model/data relationships.
  - Policy for handling rights changes.
- v0 creator licensing flow outline
  - A simple licensing UX and core terms.
  - How this integrates with safety checks.
- Risk register and minimal experiments
  - A ranked list of key risks.
  - 3–6 small experiments (e.g. prototype the similarity pipeline, test non-reproduction filters, test contributor UX).
- v0 prototype blueprint and build-ready prompt
  - A concise, implementation-oriented blueprint:
    - Components, APIs, storage, indexes, background jobs.
    - Prioritized implementation steps (e.g. “Phase 1: submission + vault + external similarity MVP”).
  - A short “build prompt” that could be given to an engineering team or code-capable assistant, phrased like: “Given this architecture and data model, implement v0 of LEMM’s submission and vault system with endpoints X, Y, Z…”
6. Process guidance
- Start with a one-page mission and constraints summary for LEMM.
- Design the v0 submission and consent flow first, with concrete data fields and endpoints.
- Specify v0 similarity and decision logic using simple tables and state machines.
- Design the vault data model and access control with logs and roles.
- Define v0 training-time and inference-time interaction with LEMM, including where similarity checks run.
- Draft the v0 prototype architecture:
  - Services, storage, indexes, and background jobs.
  - Interfaces between them.
- Build the risk register and minimal experiment plan.
- Finally, distill everything into the build-ready implementation prompt and a clear v0 roadmap.
7. Non-goals
This research does not need to:
- Provide country-specific legal contracts.
- Implement or choose specific vendors/technologies.
- Define a full economic model or revenue-sharing scheme.
- Solve all long-term governance challenges.
The goal is a coherent, technically detailed, prototype-ready design for LEMM’s data, rights, and safety system, with enough specificity to begin implementation.
8. Success criteria
The research is successful if it yields:
- A clear, internally consistent definition of LEMM’s mission, rights model, and constraints.
- A realistic v0 design for:
  - Submission, consent, vetting, and similarity checks.
  - Vault architecture, governance, and auditability.
  - Training-only usage and generation-time safety.
- A concrete v0 prototype blueprint with:
  - Component diagram.
  - Data models.
  - APIs and flows.
  - Prioritized implementation steps.
- A short, explicit build prompt that could be used to drive the first LEMM prototype implementation.