Metadata Module
PacificAnalytics/unified-roundtrip-roadmap
Epic: Metadata Module & Cohort Discovery
(This design document supports and expands upon Point 14: Metadata Module — Annotating Data for Contrast Building from the primary roadmap.md.)
| Attribute | Details |
|---|---|
| Owner | Samyak Jain |
| Contributors | team |
| Version | 1.0 (March 4, 2026) |
User Journey
The user wants to explore and build cohorts over the entirety of the NCBI SRA database without being restricted to pre-defined diseases. They navigate to the Cohort Builder, apply a wide range of varied filters (e.g., specific sequencing platforms, collection dates, organisms, diseases, and more), and get results. Once they have their cohorts defined, they can create contrasts for analysis. They can either:
- Contrast between two distinct cohorts: (e.g., a cohort of MRSA patients vs. a cohort of healthy controls).
- Contrast within a single cohort: By defining a primary cohort and then sub-filtering it into arms (e.g., a single cohort of all patients, split into 'treated' and 'untreated' arms).
Please keep in mind that the Ares integration with cohorts is still under discussion; we have not yet decided whether it belongs in the first iteration of the metadata module.
The system automatically links these logical cohort members to physical DRS objects stored in Ares, annotating them with both intrinsic (assay type, organism) and experimental (cohort arm) metadata, getting them ready for workflow execution.
Requirements
- Comprehensive SRA Data Indexing: Expand the MDI database to ingest and index the entirety of the NCBI SRA database, removing the current restriction to specific diseases (Alzheimer's, Parkinson's, MRSA, etc.).
- Currently, the MDI database is restricted to specific diseases (Alzheimer's, Parkinson's, MRSA, etc.), since we only ingest a disease-based subset of SRA. We need to expand the MDI database to ingest and index the entirety of the NCBI SRA database, along with associated databases such as BioProject, BioSample, PubMed, and GEO.
The database choice is also still up in the air: whether we stick with Postgres or move to something like https://clickhouse.com/ or https://duckdb.org/.
- Flexible Cohort Contrasts: Upgrade the Cohorts service API and data models to explicitly support saving "Contrasts". A contrast can either be between two saved cohorts, or defined dynamically within a single cohort using sub-filters.
- Performance Optimization: Since the data we will be gathering is massive, we must implement a caching mechanism within the Cohorts service to ensure that complex, high-cardinality metadata queries return quickly (the data is, after all, static).
- Ares Integration: Connect the logical samples filtered in Cohorts with the physical file metadata (DRS objects) managed by the Ares service, enabling proper metadata annotation for workflow execution. (still in discussion)
Scope
In Scope
- Expanding MDI data ingestion pipelines to process all public SRA data.
- Introducing a `Contrast` data model in the Cohorts service to handle inter-cohort and intra-cohort experiment designs.
- Updating Cohorts service data models and query builders to support advanced filtering (e.g., date ranges, geographic origins, assay types, and other metadata).
- Implementing a caching layer in the Cohorts service.
Middle Ground
- Building the Metadata API endpoints to annotate DRS objects in Ares with intrinsic and experimental metadata.
- Establishing the data flow between Cohorts (logical contrast definition) and Ares (physical file metadata).
Out of scope
- Live synchronization of private user-uploaded data with SRA public data.
- Automated pipeline to update MDI database from SRA on a highly frequent or real-time basis.
- Note: We will not initially have an automated pipeline that fetches and updates the MDI database from SRA on a highly frequent or real-time basis. For the MVP, the complete SRA snapshot ingested initially will be available, but live up-to-the-minute SRA updates are deferred.
- Generating samplesheets for different Nextflow modules.
- Note: We first need to map different workflows to their specific samplesheet types. This mapping work has not been completed yet, so automated samplesheet generation for varied modules is deferred for this phase.
Design Tradeoffs
Full SRA Ingestion vs. On-Demand Fetching: Extracting the entire NCBI SRA database (though only metadata) into our internal MDI Postgres instance requires significant storage (potentially spanning TBs) and upfront ETL processing time. However, this tradeoff is necessary because on-demand fetching via the NCBI Entrez APIs is too slow for the interactive cohort discovery experience we want to provide. By bringing the data in-house, we can build efficient materialized views and indexes.
The NCBI Entrez API allows at most 10 requests per second with an API key.
For context, the anticipated volume of data to ingest and index is approximately 149 million records:
| Entity | Record Count |
|---|---|
| bioproject | 1,021,958 |
| biosample | 52,924,838 |
| sra | 43,024,773 |
| gds | 8,653,935 |
| pubmed | 40,196,164 |
| assembly | 3,451,225 |
| TOTAL | 149,272,893 |
Implementation Plan
1. Workstream: SRA Full Ingestion (MDI)
Description: Modify the MDI ETL pipelines to systematically fetch, transform, and load the entire SRA dataset (along with associated databases like BioProject, BioSample, PubMed, GEO, etc.).
[TASK-01-A] Unconstrained Ingestion & Resiliency
- Description: Update the NCBI fetchers in `mDataIndexing` to systematically download all available SRA/BioSample accessions, with built-in checkpointing to resume interrupted fetches.
- Requirements:
- Implement a paginator that handles NCBI Entrez API history and chunking for the entire SRA database, or investigate whether a single query can retrieve all the data at once.
- Create a checkpointing system (e.g., tracking the last successfully downloaded and ingested ID/cursor) so that if a download crashes, it resumes from the point of failure instead of re-fetching or re-ingesting data.
- Constraints: Must respect NCBI API rate limits to prevent IP blocks.
- Acceptance Criteria: Running the ingestion pipeline downloads a complete, unconstrained snapshot of current SRA metadata into the staging tables, and can successfully resume exactly from where it left off if interrupted.
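The resumable, rate-limited fetch loop described above could be sketched as follows. This is an illustration only: the function and file names (`ingest_all`, `sra_ingest_checkpoint.json`) are assumptions, and `fetch_page` stands in for the real Entrez WebEnv/history pagination call in `mDataIndexing`.

```python
import json
import time
from pathlib import Path

PAGE_SIZE = 500
MAX_REQUESTS_PER_SEC = 10  # NCBI Entrez limit with an API key


def load_checkpoint(ckpt: Path) -> int:
    """Return the next page to fetch (0 when starting fresh)."""
    return json.loads(ckpt.read_text())["next_page"] if ckpt.exists() else 0


def save_checkpoint(ckpt: Path, next_page: int) -> None:
    """Persist progress atomically so a crash never loses more than one page."""
    tmp = ckpt.with_suffix(".tmp")
    tmp.write_text(json.dumps({"next_page": next_page}))
    tmp.replace(ckpt)


def ingest_all(fetch_page, total_pages: int, ckpt: Path) -> int:
    """Resumable, rate-limited ingestion loop.

    `fetch_page(page, size)` is a hypothetical stand-in for the Entrez
    history/chunking call; it returns the records for one page.
    """
    min_interval = 1.0 / MAX_REQUESTS_PER_SEC
    ingested = 0
    for page in range(load_checkpoint(ckpt), total_pages):
        started = time.monotonic()
        records = fetch_page(page, PAGE_SIZE)
        ingested += len(records)         # load into staging tables here
        save_checkpoint(ckpt, page + 1)  # checkpoint only after a successful load
        elapsed = time.monotonic() - started
        if elapsed < min_interval:       # stay under the 10 req/s limit
            time.sleep(min_interval - elapsed)
    return ingested
```

Because the checkpoint is written only after a page lands in staging, an interrupted run resumes from the first unfinished page rather than re-fetching or re-ingesting data, which is exactly the acceptance criterion above.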
[TASK-01-B] Scale Postgres Loaders
- Description: Optimize the SQLAlchemy bulk-insert and upsert logic to handle the drastically increased data volume without memory bloat.
- Requirements:
- Refactor loaders to batch transactions and optimize indices for large-scale ingestion (approx. 150M records).
- Update the Post-ETL step to generate summary tables efficiently against the full dataset.
- Constraints: The ingestion process must not lock the database for extended periods, allowing dependent services to continue querying historical data.
- Acceptance Criteria: The MDI database successfully houses the ~150M records, and the Post-ETL script runs to completion, generating all required materialized views for downstream consumption.
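The batching pattern for the loaders might look like the sketch below. The names are hypothetical; `insert_batch` stands in for a short-lived SQLAlchemy transaction (e.g., a bulk insert followed by a commit), so memory stays flat and no single transaction holds locks for long.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield fixed-size chunks without materializing the full ~150M-row stream."""
    it = iter(records)
    while chunk := list(islice(it, batch_size)):
        yield chunk


def bulk_load(records: Iterable[dict],
              insert_batch: Callable[[List[dict]], None],
              batch_size: int = 10_000) -> int:
    """Stream records into the database one bounded transaction at a time.

    `insert_batch` is a placeholder for the real loader step, e.g.
    `session.execute(insert(SraRecord), chunk)` plus commit.
    """
    total = 0
    for chunk in batched(records, batch_size):
        insert_batch(chunk)  # commit per chunk keeps locks and WAL bounded
        total += len(chunk)
    return total
```

Committing per chunk is what lets dependent services keep querying historical data during ingestion, per the constraint above.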
2. Workstream: Advanced Cohort Filtration
Description: Expose the rich SRA data via the Cohorts API so the UI can construct complex queries.
[TASK-02-A] Dynamic Query Builder
- Description: Update the `Cohorts` service backend to support varied filters mapping dynamically to the MDI Postgres schema.
- Requirements:
- Extend the existing query translation logic to support new fields such as `geographic_origin`, `collection_date`, `platform`, and more.
- Ensure the resulting SQL queries or ORM calls against the MDI database use the newly created materialized views/indexes.
- Constraints: Query builders must prevent SQL injection and gracefully handle null or missing metadata fields.
- Acceptance Criteria: The Cohorts internal API successfully returns sample lists when provided with combination filters previously unsupported (e.g., date ranges + specific assay types).
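One way to satisfy the injection constraint is a whitelist-plus-parameters translation layer, sketched below. The filter names, column mappings, and the `sample_search_mv` materialized view are assumptions for illustration, not the actual MDI schema.

```python
# Hypothetical whitelist mapping UI filter names to indexed MDI columns.
ALLOWED_FILTERS = {
    "platform": "sra.platform",
    "organism": "biosample.organism",
    "collection_date": "biosample.collection_date",
    "geographic_origin": "biosample.geo_loc_name",
}


def build_query(filters: dict) -> tuple:
    """Translate a filter dict into parameterized SQL against the MDI views.

    Only whitelisted fields are accepted, and every value is bound as a
    parameter, so user input never reaches the SQL text (no injection).
    """
    clauses, params = [], []
    for field, value in filters.items():
        column = ALLOWED_FILTERS.get(field)
        if column is None:
            raise ValueError(f"unsupported filter: {field}")
        if isinstance(value, dict):    # range filter, e.g. collection_date
            clauses.append(f"{column} BETWEEN %s AND %s")
            params.extend([value["min"], value["max"]])
        elif isinstance(value, list):  # multi-select categorical filter
            placeholders = ", ".join(["%s"] * len(value))
            clauses.append(f"{column} IN ({placeholders})")
            params.extend(value)
        else:
            clauses.append(f"{column} = %s")
            params.append(value)
    where = " AND ".join(clauses) if clauses else "TRUE"
    return f"SELECT sample_id FROM sample_search_mv WHERE {where}", params
```

Missing or null metadata simply fails to match a clause rather than erroring, and an unknown field is rejected outright instead of being interpolated.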
[TASK-02-B] Filter API Endpoints (with dynamic counts)
- Description: Create or enhance `GET /filters` endpoints to return the available distinct values and ranges for the new metadata fields.
- Requirements:
- Implement endpoints that query MDI for unique categorical values (e.g., all distinct sequencing platforms) and min/max ranges for continuous variables.
- Structure the JSON response to enable the frontend to easily map these to dynamic dropdowns and slider UI components.
- Constraints: These endpoints must respond quickly to ensure the UI remains snappy.
- Acceptance Criteria: A `GET /filters` request returns a comprehensive JSON schema of all available SRA filter criteria and their valid ranges/options.
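The response shape that the frontend maps to widgets could look like the sketch below. The field names and `widget` hints are assumptions; in practice the inputs would come from cached MDI aggregate queries (distinct values and min/max).

```python
def build_filters_response(distinct_values: dict, ranges: dict) -> dict:
    """Assemble a GET /filters payload.

    `distinct_values` holds categorical options (mapped to dropdowns) and
    `ranges` holds (min, max) pairs for continuous fields (mapped to sliders).
    """
    return {
        "categorical": [
            {"field": field, "widget": "dropdown", "options": sorted(options)}
            for field, options in distinct_values.items()
        ],
        "continuous": [
            {"field": field, "widget": "slider", "min": lo, "max": hi}
            for field, (lo, hi) in ranges.items()
        ],
    }
```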
3. Workstream: Flexible Cohort Contrasts
Description: Upgrade the Cohorts service to explicitly define, save, and resolve "Contrasts" used for analysis, supporting both inter-cohort and intra-cohort comparisons.
[TASK-03-A] Contrast Data Modeling
- Description: Introduce a new `Contrast` ORM model in `cohorts/api/v1/db.py` to persist experiment designs.
- Requirements:
- The model must support Inter-Cohort Contrasts: linking two existing `cohort_id`s (Group A vs Group B).
- The model must support Intra-Cohort Contrasts: linking a single `cohort_id`, along with two distinct JSON filter definitions that sub-divide the primary cohort into Group A and Group B.
- Constraints: Must ensure referential integrity; if an underlying Cohort is deleted, the dependent Contrasts must be cascade-deleted.
- Acceptance Criteria: A user can successfully save and retrieve both types of contrasts via the API.
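A minimal sketch of the shape the model could take, shown as a plain dataclass for clarity. All field names are assumptions; the real version would be a SQLAlchemy ORM class with `ON DELETE CASCADE` foreign keys to the cohorts table, as the constraint above requires.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contrast:
    """Illustrative Contrast model: exactly one of the two designs is set."""
    id: str
    name: str
    # Inter-cohort design: both cohort ids set, sub-filters unused.
    cohort_a_id: Optional[str] = None
    cohort_b_id: Optional[str] = None
    # Intra-cohort design: one primary cohort plus two JSON sub-filters.
    primary_cohort_id: Optional[str] = None
    group_a_filter: Optional[dict] = None
    group_b_filter: Optional[dict] = None

    def __post_init__(self):
        inter = self.cohort_a_id is not None and self.cohort_b_id is not None
        intra = (self.primary_cohort_id is not None
                 and self.group_a_filter is not None
                 and self.group_b_filter is not None)
        if inter == intra:  # exactly one of the two designs must be populated
            raise ValueError("Contrast must be inter-cohort XOR intra-cohort")

    @property
    def kind(self) -> str:
        return "inter" if self.cohort_a_id else "intra"
```

Validating the XOR at construction time keeps malformed contrasts (neither or both designs populated) out of the database entirely.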
[TASK-03-B] Contrast Resolution Endpoint
- Description: Build an endpoint that resolves a saved `contrast_id` into the two explicit lists of sample IDs ready for analysis.
- Requirements:
- Create `GET /contrasts/{contrast_id}/resolve`.
- For inter-cohort contrasts, the endpoint fetches the samples for Cohort A and Cohort B independently.
- For intra-cohort contrasts, the endpoint fetches the primary cohort's samples, then applies the two sub-filters in-memory or via a refined database query to split the samples into the two arms.
- Constraints: Must return the resolved data rapidly, leveraging the Cohort caching layer where applicable.
- Acceptance Criteria: Given any valid `contrast_id`, the endpoint returns a structured JSON payload containing `group_a_samples` and `group_b_samples`.
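The resolution branching above can be sketched as a pure function. The contrast is shown as a plain dict, and `fetch_cohort_samples` / `apply_filter` are hypothetical stand-ins for the cohort query layer (ideally cache-backed) and sub-filter evaluation.

```python
def resolve_contrast(contrast: dict, fetch_cohort_samples, apply_filter) -> dict:
    """Resolve a saved contrast into the /resolve payload shape.

    `fetch_cohort_samples(cohort_id)` returns a cohort's sample IDs;
    `apply_filter(samples, filter_def)` applies one intra-cohort sub-filter.
    """
    if contrast.get("cohort_a_id"):  # inter-cohort design
        group_a = fetch_cohort_samples(contrast["cohort_a_id"])
        group_b = fetch_cohort_samples(contrast["cohort_b_id"])
    else:                            # intra-cohort design
        samples = fetch_cohort_samples(contrast["primary_cohort_id"])
        group_a = apply_filter(samples, contrast["group_a_filter"])
        group_b = apply_filter(samples, contrast["group_b_filter"])
    return {"group_a_samples": group_a, "group_b_samples": group_b}
```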
Still under Discussion:
4. Workstream: Ares & Cohorts Integration (Metadata Module)
Description: Connect the logical cohort samples to their corresponding physical metadata, preparing them for downstream workflow execution.
[TASK-04-A] DRS Metadata Endpoints in Ares
- Description: Implement the endpoints in `Ares` to store and retrieve metadata associated with specific DRS objects.
- Requirements:
- Extend the Ares DRS schema to support arbitrary internal metadata.
- Expose API routes to store and retrieve these intrinsic properties (`organism`, `assay_type`, etc.) associated with specific DRS objects.
- Constraints: Must adhere strictly to the internal API authorization (OPA) policies.
- Acceptance Criteria: A valid POST request successfully attaches a JSON payload of metadata to a DRS object, which is identical when fetched via GET.
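The POST/GET round-trip in the acceptance criterion could be sketched with an in-memory stand-in for the store. The class and route shapes are assumptions; the real version would live behind Ares's extended DRS schema with OPA authorization enforced before every call.

```python
class DrsMetadataStore:
    """In-memory illustration of the proposed DRS metadata endpoints."""

    def __init__(self):
        self._metadata = {}

    def attach(self, drs_id: str, metadata: dict) -> None:
        """POST handler body: merge intrinsic properties onto a DRS object."""
        self._metadata.setdefault(drs_id, {}).update(metadata)

    def get(self, drs_id: str) -> dict:
        """GET handler body: return exactly what was stored."""
        return self._metadata.get(drs_id, {})
```

Merging rather than replacing lets intrinsic metadata (organism, assay type) and experimental metadata (cohort arm) be attached in separate calls without clobbering each other.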
[TASK-04-B] Contrast Resolution
- Description: Build the logic in the `Cohorts` service to bind physical file pointers to logical contrast labels.
- Requirements:
- Create internal functions that take a saved cohort contrast (Group A vs Group B).
- Query `Ares` for the DRS URIs corresponding to the sample IDs.
- Pair these physical URIs with the `cohort_arm` label to prepare the data structure for the future Samplesheet Generator.
- Constraints: The Cohorts service must only resolve URIs for samples the user is authorized to access.
- Acceptance Criteria: Given a contrast ID, the API returns a structured list containing sample IDs, their DRS URIs, and their assigned `cohort_arm`, ready for downstream consumption.
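The binding step could be sketched as follows, assuming a resolved contrast payload as input. `drs_uri_lookup` is a hypothetical stand-in for the authorized Ares query, and the output row shape is an illustration of what the future Samplesheet Generator might consume.

```python
def bind_contrast_to_drs(resolved: dict, drs_uri_lookup) -> list:
    """Pair resolved contrast arms with physical DRS URIs from Ares.

    `resolved` is the {"group_a_samples": [...], "group_b_samples": [...]}
    payload; `drs_uri_lookup(sample_id)` must only return URIs the user is
    authorized to access.
    """
    rows = []
    for arm, samples in (("group_a", resolved["group_a_samples"]),
                         ("group_b", resolved["group_b_samples"])):
        for sample_id in samples:
            rows.append({
                "sample_id": sample_id,
                "drs_uri": drs_uri_lookup(sample_id),
                "cohort_arm": arm,
            })
    return rows
```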