Metadata Module
PacificAnalytics/unified-roundtrip-roadmap
Epic: Metadata Module & Cohort Discovery
(This design document supports and expands upon Point 14: Metadata Module — Annotating Data for Contrast Building from the primary roadmap.md.)
| Attribute | Details |
|---|---|
| Owner | Samyak Jain |
| Contributors | team |
| Version | 1.0 (March 4, 2026) |
User Journey
The user wants to explore and build cohorts over the entirety of the NCBI SRA database without being restricted to pre-defined diseases. They navigate to the Cohort Builder, apply a wide range of varied filters (e.g., specific sequencing platforms, collection dates, organisms, diseases, and more), and get results. Once they have their cohorts defined, they can create contrasts for analysis. They can either:
- Contrast between two distinct cohorts: (e.g., a cohort of MRSA patients vs. a cohort of healthy controls).
- Contrast within a single cohort: By defining a primary cohort and then sub-filtering it into arms (e.g., a single cohort of all patients, split into 'treated' and 'untreated' arms).
Please keep in mind that the Ares integration with cohorts is still under discussion; we have not yet decided whether it belongs in the first iteration of the metadata module.
The system automatically links these logical cohort members to physical DRS objects stored in Ares, annotating them with both intrinsic (assay type, organism) and experimental (cohort arm) metadata, getting them ready for workflow execution.
Requirements
- Comprehensive SRA Data Indexing: Expand the MDI database to ingest and index the entirety of the NCBI SRA database, removing the current restriction to specific diseases (Alzheimer's, Parkinson's, MRSA, etc.).
- Currently, the MDI database is restricted to specific diseases (Alzheimer's, Parkinson's, MRSA, etc.), since we only ingest a disease-based subset of SRA. We need to expand the MDI database to ingest and index the entirety of the NCBI SRA database, along with associated databases such as BioProject, BioSample, PubMed, and GEO.
The database choice is also still up in the air: whether we stick with Postgres or move to something like https://clickhouse.com/ or https://duckdb.org/.
- Flexible Cohort Contrasts: Upgrade the Cohorts service API and data models to explicitly support saving "Contrasts". A contrast can either be between two saved cohorts, or defined dynamically within a single cohort using sub-filters.
- Performance Optimization: Since the data we will be gathering is massive, we must implement a caching mechanism within the Cohorts service to ensure that complex, high-cardinality metadata queries return quickly (the data is, after all, static).
- Ares Integration: Connect the logical samples filtered in Cohorts with the physical file metadata (DRS objects) managed by the Ares service, enabling proper metadata annotation for workflow execution. (still in discussion)
Scope
In Scope
- Expanding MDI data ingestion pipelines to process all public SRA data.
- Introducing a `Contrast` data model in the Cohorts service to handle inter-cohort and intra-cohort experiment designs.
- Updating Cohorts service data models and query builders to support advanced filtering (e.g., date ranges, geographic origins, assay types, and other metadata).
- Implementing a caching layer in the Cohorts service.
Middle Ground
- Building the Metadata API endpoints to annotate DRS objects in Ares with intrinsic and experimental metadata.
- Establishing the data flow between Cohorts (logical contrast definition) and Ares (physical file metadata).
Out of scope
- Live synchronization of private user-uploaded data with SRA public data.
- Automated pipeline to update MDI database from SRA on a highly frequent or real-time basis.
- Note: We will not initially have an automated pipeline that fetches and updates the MDI database from SRA on a highly frequent or real-time basis. For the MVP, the complete SRA snapshot ingested initially will be available, but live up-to-the-minute SRA updates are deferred.
- Generating samplesheets for different Nextflow modules.
- Note: We first need to map different workflows to their specific samplesheet types. This mapping work has not been completed yet, so automated samplesheet generation for varied modules is deferred for this phase.
Design Tradeoffs
Full SRA Ingestion vs. On-Demand Fetching: Extracting the entire NCBI SRA database (though only metadata) into our internal MDI Postgres instance requires significant storage (potentially spanning TBs) and upfront ETL processing time. However, this tradeoff is necessary because on-demand fetching via the NCBI Entrez APIs is too slow for the interactive cohort discovery experience we want to provide. By bringing the data in-house, we can build efficient materialized views and indexes.
The NCBI Entrez API allows at most 10 requests per second with an API key.
For context, the anticipated volume of data to ingest and index is approximately 149 million records:
| Entity | Record Count |
|---|---|
| bioproject | 1,021,958 |
| biosample | 52,924,838 |
| sra | 43,024,773 |
| gds | 8,653,935 |
| pubmed | 40,196,164 |
| assembly | 3,451,225 |
| TOTAL | 149,272,893 |
Implementation Plan
1. Workstream: SRA Full Ingestion (MDI)
Description: Modify the MDI ETL pipelines to systematically fetch, transform, and load the entire SRA dataset (along with associated databases like BioProject, BioSample, PubMed, GEO, etc.).
[TASK-01-A] Unconstrained Ingestion & Resiliency
- Description: Update the NCBI fetchers in `mDataIndexing` to systematically download all available SRA/BioSample accessions, with built-in checkpointing to resume interrupted fetches.
- Requirements:
- Implement a paginator that handles NCBI Entrez API history and chunking for the entire SRA database, or investigate whether a single query can retrieve all the data at once.
- Create a checkpointing system (e.g., tracking the last successfully downloaded and ingested ID/cursor) so that if a download crashes, it resumes from the point of failure instead of re-fetching or re-ingesting data.
- Constraints: Must respect NCBI API rate limits to prevent IP blocks.
- Acceptance Criteria: Running the ingestion pipeline downloads a complete, unconstrained snapshot of current SRA metadata into the staging tables, and can successfully resume exactly from where it left off if interrupted.
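The resumable, rate-limited fetch loop described above could be sketched as follows. This is an illustration only: the function and file names (`ingest_all`, `sra_ingest_checkpoint.json`) are assumptions, and `fetch_page` stands in for the real Entrez WebEnv/history pagination call in `mDataIndexing`.

```python
import json
import time
from pathlib import Path

PAGE_SIZE = 500
MAX_REQUESTS_PER_SEC = 10  # NCBI Entrez limit with an API key


def load_checkpoint(ckpt: Path) -> int:
    """Return the next page to fetch (0 when starting fresh)."""
    return json.loads(ckpt.read_text())["next_page"] if ckpt.exists() else 0


def save_checkpoint(ckpt: Path, next_page: int) -> None:
    """Persist progress atomically so a crash never loses more than one page."""
    tmp = ckpt.with_suffix(".tmp")
    tmp.write_text(json.dumps({"next_page": next_page}))
    tmp.replace(ckpt)


def ingest_all(fetch_page, total_pages: int, ckpt: Path) -> int:
    """Resumable, rate-limited ingestion loop.

    `fetch_page(page, size)` is a hypothetical stand-in for the Entrez
    history/chunking call; it returns the records for one page.
    """
    min_interval = 1.0 / MAX_REQUESTS_PER_SEC
    ingested = 0
    for page in range(load_checkpoint(ckpt), total_pages):
        started = time.monotonic()
        records = fetch_page(page, PAGE_SIZE)
        ingested += len(records)         # load into staging tables here
        save_checkpoint(ckpt, page + 1)  # checkpoint only after a successful load
        elapsed = time.monotonic() - started
        if elapsed < min_interval:       # stay under the 10 req/s limit
            time.sleep(min_interval - elapsed)
    return ingested
```

Because the checkpoint is written only after a page lands in staging, an interrupted run resumes from the first unfinished page rather than re-fetching or re-ingesting data, which is exactly the acceptance criterion above.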
[TASK-01-B] Scale Postgres Loaders
- Description: Optimize the SQLAlchemy bulk-insert and upsert logic to handle the drastically increased data volume without memory bloat.
- Requirements:
- Refactor loaders to batch transactions and optimize indices for large-scale ingestion (approx. 150M records).
- Update the Post-ETL step to generate summary tables efficiently against the full dataset.
- Constraints: The ingestion process must not lock the database for extended periods, allowing dependent services to continue querying historical data.
- Acceptance Criteria: The MDI database successfully houses the ~150M records, and the Post-ETL script runs to completion, generating all required materialized views for downstream consumption.
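The batching pattern for the loaders might look like the sketch below. The names are hypothetical; `insert_batch` stands in for a short-lived SQLAlchemy transaction (e.g., a bulk insert followed by a commit), so memory stays flat and no single transaction holds locks for long.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield fixed-size chunks without materializing the full ~150M-row stream."""
    it = iter(records)
    while chunk := list(islice(it, batch_size)):
        yield chunk


def bulk_load(records: Iterable[dict],
              insert_batch: Callable[[List[dict]], None],
              batch_size: int = 10_000) -> int:
    """Stream records into the database one bounded transaction at a time.

    `insert_batch` is a placeholder for the real loader step, e.g.
    `session.execute(insert(SraRecord), chunk)` plus commit.
    """
    total = 0
    for chunk in batched(records, batch_size):
        insert_batch(chunk)  # commit per chunk keeps locks and WAL bounded
        total += len(chunk)
    return total
```

Committing per chunk is what lets dependent services keep querying historical data during ingestion, per the constraint above.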
2. Workstream: Advanced Cohort Filtration
Description: Expose the rich SRA data via the Cohorts API so the UI can construct complex queries.
[TASK-02-A] Dynamic Query Builder
- Description: Update the `Cohorts` service backend to support varied filters mapping dynamically to the MDI Postgres schema.
- Requirements:
- Extend the existing query translation logic to support new fields such as `geographic_origin`, `collection_date`, `platform`, and more.
- Ensure the resulting SQL queries or ORM calls against the MDI database use the newly created materialized views/indexes.
- Constraints: Query builders must prevent SQL injection and gracefully handle null or missing metadata fields.
- Acceptance Criteria: The Cohorts internal API successfully returns sample lists when provided with combination filters previously unsupported (e.g., date ranges + specific assay types).
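One way to satisfy the injection constraint is a whitelist-plus-parameters translation layer, sketched below. The filter names, column mappings, and the `sample_search_mv` materialized view are assumptions for illustration, not the actual MDI schema.

```python
# Hypothetical whitelist mapping UI filter names to indexed MDI columns.
ALLOWED_FILTERS = {
    "platform": "sra.platform",
    "organism": "biosample.organism",
    "collection_date": "biosample.collection_date",
    "geographic_origin": "biosample.geo_loc_name",
}


def build_query(filters: dict) -> tuple:
    """Translate a filter dict into parameterized SQL against the MDI views.

    Only whitelisted fields are accepted, and every value is bound as a
    parameter, so user input never reaches the SQL text (no injection).
    """
    clauses, params = [], []
    for field, value in filters.items():
        column = ALLOWED_FILTERS.get(field)
        if column is None:
            raise ValueError(f"unsupported filter: {field}")
        if isinstance(value, dict):    # range filter, e.g. collection_date
            clauses.append(f"{column} BETWEEN %s AND %s")
            params.extend([value["min"], value["max"]])
        elif isinstance(value, list):  # multi-select categorical filter
            placeholders = ", ".join(["%s"] * len(value))
            clauses.append(f"{column} IN ({placeholders})")
            params.extend(value)
        else:
            clauses.append(f"{column} = %s")
            params.append(value)
    where = " AND ".join(clauses) if clauses else "TRUE"
    return f"SELECT sample_id FROM sample_search_mv WHERE {where}", params
```

Missing or null metadata simply fails to match a clause rather than erroring, and an unknown field is rejected outright instead of being interpolated.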
[TASK-02-B] Filter API Endpoints (with dynamic counts)
- Description: Create or enhance `GET /filters` endpoints to return the available distinct values and ranges for the new metadata fields.
- Requirements:
- Implement endpoints that query MDI for unique categorical values (e.g., all distinct sequencing platforms) and min/max ranges for continuous variables.
- Structure the JSON response to enable the frontend to easily map these to dynamic dropdowns and slider UI components.
- Constraints: These endpoints must respond quickly to ensure the UI remains snappy.
- Acceptance Criteria: A `GET /filters` request returns a comprehensive JSON schema of all available SRA filter criteria and their valid ranges/options.
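The response shape that the frontend maps to widgets could look like the sketch below. The field names and `widget` hints are assumptions; in practice the inputs would come from cached MDI aggregate queries (distinct values and min/max).

```python
def build_filters_response(distinct_values: dict, ranges: dict) -> dict:
    """Assemble a GET /filters payload.

    `distinct_values` holds categorical options (mapped to dropdowns) and
    `ranges` holds (min, max) pairs for continuous fields (mapped to sliders).
    """
    return {
        "categorical": [
            {"field": field, "widget": "dropdown", "options": sorted(options)}
            for field, options in distinct_values.items()
        ],
        "continuous": [
            {"field": field, "widget": "slider", "min": lo, "max": hi}
            for field, (lo, hi) in ranges.items()
        ],
    }
```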
3. Workstream: Flexible Cohort Contrasts
Description: Upgrade the Cohorts service to explicitly define, save, and resolve "Contrasts" used for analysis, supporting both inter-cohort and intra-cohort comparisons.
[TASK-03-A] Contrast Data Modeling
- Description: Introduce a new `Contrast` ORM model in `cohorts/api/v1/db.py` to persist experiment designs.
- Requirements:
- The model must support Inter-Cohort Contrasts: linking two existing `cohort_id`s (Group A vs Group B).
- The model must support Intra-Cohort Contrasts: linking a single `cohort_id`, along with two distinct JSON filter definitions that sub-divide the primary cohort into Group A and Group B.
- Constraints: Must ensure referential integrity; if an underlying Cohort is deleted, the dependent Contrasts must be cascade-deleted.
- Acceptance Criteria: A user can successfully save and retrieve both types of contrasts via the API.
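A minimal sketch of the shape the model could take, shown as a plain dataclass for clarity. All field names are assumptions; the real version would be a SQLAlchemy ORM class with `ON DELETE CASCADE` foreign keys to the cohorts table, as the constraint above requires.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contrast:
    """Illustrative Contrast model: exactly one of the two designs is set."""
    id: str
    name: str
    # Inter-cohort design: both cohort ids set, sub-filters unused.
    cohort_a_id: Optional[str] = None
    cohort_b_id: Optional[str] = None
    # Intra-cohort design: one primary cohort plus two JSON sub-filters.
    primary_cohort_id: Optional[str] = None
    group_a_filter: Optional[dict] = None
    group_b_filter: Optional[dict] = None

    def __post_init__(self):
        inter = self.cohort_a_id is not None and self.cohort_b_id is not None
        intra = (self.primary_cohort_id is not None
                 and self.group_a_filter is not None
                 and self.group_b_filter is not None)
        if inter == intra:  # exactly one of the two designs must be populated
            raise ValueError("Contrast must be inter-cohort XOR intra-cohort")

    @property
    def kind(self) -> str:
        return "inter" if self.cohort_a_id else "intra"
```

Validating the XOR at construction time keeps malformed contrasts (neither or both designs populated) out of the database entirely.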
[TASK-03-B] Contrast Resolution Endpoint
- Description: Build an endpoint that resolves a saved `contrast_id` into the two explicit lists of sample IDs ready for analysis.
- Requirements:
- Create `GET /contrasts/{contrast_id}/resolve`.
- For inter-cohort contrasts, the endpoint fetches the samples for Cohort A and Cohort B independently.
- For intra-cohort contrasts, the endpoint fetches the primary cohort's samples, then applies the two sub-filters in-memory or via a refined database query to split the samples into the two arms.
- Constraints: Must return the resolved data rapidly, leveraging the Cohort caching layer where applicable.
- Acceptance Criteria: Given any valid `contrast_id`, the endpoint returns a structured JSON payload containing `group_a_samples` and `group_b_samples`.
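The resolution branching above can be sketched as a pure function. The contrast is shown as a plain dict, and `fetch_cohort_samples` / `apply_filter` are hypothetical stand-ins for the cohort query layer (ideally cache-backed) and sub-filter evaluation.

```python
def resolve_contrast(contrast: dict, fetch_cohort_samples, apply_filter) -> dict:
    """Resolve a saved contrast into the /resolve payload shape.

    `fetch_cohort_samples(cohort_id)` returns a cohort's sample IDs;
    `apply_filter(samples, filter_def)` applies one intra-cohort sub-filter.
    """
    if contrast.get("cohort_a_id"):  # inter-cohort design
        group_a = fetch_cohort_samples(contrast["cohort_a_id"])
        group_b = fetch_cohort_samples(contrast["cohort_b_id"])
    else:                            # intra-cohort design
        samples = fetch_cohort_samples(contrast["primary_cohort_id"])
        group_a = apply_filter(samples, contrast["group_a_filter"])
        group_b = apply_filter(samples, contrast["group_b_filter"])
    return {"group_a_samples": group_a, "group_b_samples": group_b}
```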
Still under Discussion:
4. Workstream: Ares & Cohorts Integration (Metadata Module)
Description: Connect the logical cohort samples to their corresponding physical metadata, preparing them for downstream workflow execution.
[TASK-04-A] DRS Metadata Endpoints in Ares
- Description: Implement the endpoints in `Ares` to store and retrieve metadata associated with specific DRS objects.
- Requirements:
- Extend the Ares DRS schema to support arbitrary internal metadata.
- Expose API routes to store and retrieve these intrinsic properties (`organism`, `assay_type`, etc.) associated with specific DRS objects.
- Constraints: Must adhere strictly to the internal API authorization (OPA) policies.
- Acceptance Criteria: A valid POST request successfully attaches a JSON payload of metadata to a DRS object, which is identical when fetched via GET.
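The POST/GET round-trip in the acceptance criterion could be sketched with an in-memory stand-in for the store. The class and route shapes are assumptions; the real version would live behind Ares's extended DRS schema with OPA authorization enforced before every call.

```python
class DrsMetadataStore:
    """In-memory illustration of the proposed DRS metadata endpoints."""

    def __init__(self):
        self._metadata = {}

    def attach(self, drs_id: str, metadata: dict) -> None:
        """POST handler body: merge intrinsic properties onto a DRS object."""
        self._metadata.setdefault(drs_id, {}).update(metadata)

    def get(self, drs_id: str) -> dict:
        """GET handler body: return exactly what was stored."""
        return self._metadata.get(drs_id, {})
```

Merging rather than replacing lets intrinsic metadata (organism, assay type) and experimental metadata (cohort arm) be attached in separate calls without clobbering each other.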
[TASK-04-B] Contrast Resolution
- Description: Build the logic in the `Cohorts` service to bind physical file pointers to logical contrast labels.
- Requirements:
- Create internal functions that take a saved cohort contrast (Group A vs Group B).
- Query `Ares` for the DRS URIs corresponding to the sample IDs.
- Pair these physical URIs with the `cohort_arm` label to prepare the data structure for the future Samplesheet Generator.
- Constraints: The Cohorts service must only resolve URIs for samples the user is authorized to access.
- Acceptance Criteria: Given a contrast ID, the API returns a structured list containing sample IDs, their DRS URIs, and their assigned `cohort_arm`, ready for downstream consumption.
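The binding step could be sketched as follows, assuming a resolved contrast payload as input. `drs_uri_lookup` is a hypothetical stand-in for the authorized Ares query, and the output row shape is an illustration of what the future Samplesheet Generator might consume.

```python
def bind_contrast_to_drs(resolved: dict, drs_uri_lookup) -> list:
    """Pair resolved contrast arms with physical DRS URIs from Ares.

    `resolved` is the {"group_a_samples": [...], "group_b_samples": [...]}
    payload; `drs_uri_lookup(sample_id)` must only return URIs the user is
    authorized to access.
    """
    rows = []
    for arm, samples in (("group_a", resolved["group_a_samples"]),
                         ("group_b", resolved["group_b_samples"])):
        for sample_id in samples:
            rows.append({
                "sample_id": sample_id,
                "drs_uri": drs_uri_lookup(sample_id),
                "cohort_arm": arm,
            })
    return rows
```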