This PR implements a new results file parser for the .oswpq file format, which contains OpenSWATH results stored in Parquet format. The implementation uses PyArrow to evaluate queries lazily, so entire datasets are never loaded into memory.

Overview

The .oswpq format is a directory containing two Parquet files:

precursors_features.parquet - Precursor-level features and scoring information
transition_features.parquet - Transition-level features and intensities

Key Features

Memory-Efficient Lazy Evaluation

Uses PyArrow datasets for lazy loading instead of pd.read_parquet()
Applies filtering and column projection at the parquet level
Only materializes data when absolutely necessary
Requires PyArrow as a mandatory dependency

Complete OpenSWATH-OSWPQ Integration

Extends GenericResultsAccess following the existing pattern
Maps the complex OSWPQ schema (90+ precursor columns, 50+ transition columns) to the unified interface
Supports filtering by q-values at precursor, peptide, and protein levels
Handles ion mobility data when present
Identifies itself as "OpenSWATH-OSWPQ" software
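
As a rough illustration of the q-value filtering described above, the sketch below builds filter tuples in the same (column, operator, value) form used later in this description. Only SCORE_MS2_Q_VALUE appears in this PR text; the peptide and protein column names and the helper build_qvalue_filters are hypothetical and may not match the actual OSWPQ schema or the code in this PR.

```python
from typing import List, Optional, Tuple

Filter = Tuple[str, str, float]

# Assumed column names: only SCORE_MS2_Q_VALUE is taken from the PR text;
# the peptide and protein names are hypothetical placeholders.
QVALUE_COLUMNS = {
    "precursor": "SCORE_MS2_Q_VALUE",
    "peptide": "SCORE_PEPTIDE_Q_VALUE",
    "protein": "SCORE_PROTEIN_Q_VALUE",
}


def build_qvalue_filters(
    precursor: Optional[float] = 0.01,
    peptide: Optional[float] = None,
    protein: Optional[float] = None,
) -> List[Filter]:
    """Build (column, op, value) filters; a level set to None is not filtered."""
    cutoffs = {"precursor": precursor, "peptide": peptide, "protein": protein}
    filters: List[Filter] = []
    for level, cutoff in cutoffs.items():
        if cutoff is None:
            continue
        if not 0.0 <= cutoff <= 1.0:
            raise ValueError(f"{level} q-value cutoff must be in [0, 1], got {cutoff}")
        filters.append((QVALUE_COLUMNS[level], "<=", cutoff))
    return filters
```

Keeping the cutoffs as plain tuples means the same list can be passed unchanged to whichever layer applies the filter.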

Enhanced ResultsLoader Integration

Automatically detects .oswpq directories
Validates presence of both required parquet files before loading
Maintains backward compatibility with existing .osw and .tsv formats
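
The detection and validation steps above could look roughly like the following. This is a minimal sketch, not the ResultsLoader code from this PR: the helper names missing_oswpq_files and detect_results_format are hypothetical, and the only facts taken from the description are the .oswpq directory layout, the two required file names, and the .osw and .tsv formats.

```python
from pathlib import Path
from typing import List

# File names as listed in the Overview of this PR description.
REQUIRED_FILES = ("precursors_features.parquet", "transition_features.parquet")


def missing_oswpq_files(path: str) -> List[str]:
    """Return the required parquet files absent from an .oswpq directory."""
    root = Path(path)
    if root.suffix.lower() != ".oswpq" or not root.is_dir():
        raise ValueError(f"{path!r} is not an .oswpq directory")
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def detect_results_format(path: str) -> str:
    """Classify a results path by its extension (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".oswpq":
        missing = missing_oswpq_files(path)
        if missing:
            raise FileNotFoundError(f"{path!r} is missing: {', '.join(missing)}")
        return "oswpq"
    if suffix in (".osw", ".tsv"):
        return suffix.lstrip(".")
    raise ValueError(f"Unsupported results file: {path!r}")
```

Checking for both files before any reader is constructed lets an incomplete directory fail with a message that names the missing file.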

Usage Examples

from massdash.loaders.ResultsLoader import ResultsLoader

# Load OSWPQ results with automatic lazy evaluation
results_loader = ResultsLoader('/path/to/results.oswpq')

# Mix with other formats
results_loader = ResultsLoader([
    '/path/to/results.oswpq',
    '/path/to/legacy.osw',
    '/path/to/report.tsv'
])

# Direct access with efficient filtering
from massdash.loaders.access.OSWPQResultsAccess import OSWPQResultsAccess
access = OSWPQResultsAccess('/path/to/results.oswpq')
precursors = access.getIdentifiedPrecursors(qvalue=0.01)  # Filtered at parquet level

Technical Implementation

Lazy Evaluation Architecture

import pyarrow.parquet as pq

# PyArrow-only approach: open the dataset lazily, then query it with a
# row filter and a column projection so only matching data is read
self.precursors_dataset = pq.ParquetDataset(precursors_file)
filtered_data = self._execute_precursor_query(
    filters=[('SCORE_MS2_Q_VALUE', '<=', 0.01)],
    columns=['MODIFIED_SEQUENCE', 'PRECURSOR_CHARGE']
)

Performance Benefits

Memory Usage: No longer loads entire parquet files into memory
Query Performance: Filtering applied at parquet level using PyArrow compute
Column Efficiency: Only loads necessary columns for each operation
Scalability: Handles large-scale OpenSWATH results efficiently
Simplified Codebase: Single implementation path focused on PyArrow

Testing

Comprehensive test coverage using the snapshot testing pattern:

Matches test_OSWDataAccess structure with PandasSnapshotExtension
Tests all public methods with real data integration
Performance tests verify lazy evaluation is working
Error handling for invalid data

Dependencies

This implementation requires PyArrow as a mandatory dependency, which provides:

Superior performance for parquet operations
Native lazy evaluation capabilities
Efficient column-oriented data processing
Advanced filtering and projection features
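
Since there is no pandas fallback, one way to make the mandatory dependency explicit is to fail early with a descriptive error. The helper below is a hypothetical sketch using only the standard library; it is not taken from this PR, and the error wording is illustrative.

```python
import importlib.util


def require_dependency(module_name: str, purpose: str) -> None:
    """Raise a descriptive ImportError when a mandatory top-level module is absent."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name} is required for {purpose}; "
            f"install it with `pip install {module_name}`"
        )
```

For example, require_dependency("pyarrow", "reading .oswpq results") could run before any dataset is opened, so a missing install is reported up front.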

Fixes #190.

Contents (#191)

Other

add .oswpq tests and .oswpq file
Roestlab/massdash into copilot/fix-190
refactor copilot changes to make more readable
bugs in OSWPQResultsAccess
add snapshot tests oswpq results loader
add tests for oswpq file
add software, sequence and charge info to getTransitionFeaturesDf
require only .oswpq for SpectrumLoader
Roestlab/massdash into copilot/fix-190
include tests
update resultLoader tests for oswpq

Uncategorised!

Initial plan
Implement OSWPQResultsAccess parser for .oswpq files
Add documentation and improve OSWPQ detection logic
Initial plan to address review comments
Implement lazy evaluation with pyarrow and update tests to snapshot pattern
Update API documentation for OSWPQResultsAccess with lazy evaluation details
Delete test/loaders/test_ResultsLoader_OSWPQIntegration.py
Remove pandas fallback for increased simplicity, require PyArrow
Delete docs/OSWPQ_Parser.md
minor changes
Merge branch 'dev' into copilot/fix-190
Merge branch 'dev' into copilot/fix-190
Update massdash/loaders/access/OSWPQResultsAccess.py
Merge branch 'dev' into copilot/fix-190

Aug 20 '25 13:08 Copilot

@copilot address reviewer comments

Aug 20 '25 13:08 jcharkow

Overall looks good.

For tests assume snapshot tests like in test_OSWDataAccess, I will convert the .osw files to .oswpq ...

Implemented both requested changes in commit fe72b97:

Lazy evaluation with pyarrow: Replaced pd.read_parquet() with pyarrow.ParquetDataset for memory-efficient loading. The implementation uses pyarrow filtering and column projection to avoid loading entire files, with graceful fallback to pandas when needed.
Snapshot tests: Converted tests from mock-based to snapshot pattern following test_OSWDataAccess structure, using PandasSnapshotExtension and real test data paths. Tests will automatically skip until .oswpq test data is provided.

The new implementation maintains all API contracts while providing significant memory efficiency improvements for large parquet files.

@copilot address reviewer comments

Addressed all reviewer comments in commit fe72b97:

Implemented lazy evaluation with pyarrow instead of loading entire parquet files in memory
Converted tests to snapshot testing pattern matching test_OSWDataAccess
Prepared test structure for real .oswpq test data conversion
Added efficient filtering and column projection capabilities
Maintained backward compatibility with graceful fallback to pandas

The implementation is ready for the .oswpq test data conversion you mentioned.

Aug 20 '25 13:08 Copilot

@copilot update API documentation

Aug 20 '25 15:08 jcharkow

@copilot update API documentation

Updated API documentation in commit 1b98015:

Added OSWPQResultsAccess to API.rst autosummary for automated sphinx documentation
Enhanced OSWPQ_Parser.md with detailed lazy evaluation features and performance benefits
Added comprehensive class docstrings for better API reference generation
Included OSWPQ_Parser documentation in main documentation index
Documented lazy evaluation API usage examples and memory efficiency improvements

The OSWPQResultsAccess class is now fully integrated into the automated API documentation system.

Aug 20 '25 15:08 Copilot

@copilot address reviewer comments

Aug 20 '25 17:08 jcharkow

@copilot address reviewer comments

Addressed all reviewer comments by removing pandas fallback for increased simplicity. The implementation now requires PyArrow and uses only lazy evaluation without fallback code paths. (4cbed34)

Aug 20 '25 17:08 Copilot

@singjc can I merge this?

Nov 27 '25 22:11 jcharkow

@copilot can you try address reviewer comments again

Nov 28 '25 01:11 singjc

@singjc can you try and get copilot to address the changes again? I lost access to my github ed account.

If not I will address the comments manually

Dec 04 '25 17:12 jcharkow

@copilot can you try address reviewer comments again

Dec 04 '25 18:12 singjc

@jcharkow I think it should be good to go. Once the CI finishes we can merge

Dec 04 '25 20:12 singjc

Thanks for finishing this up!

Dec 04 '25 22:12 jcharkow

Add memory-efficient .oswpq parser with PyArrow lazy evaluation