Add memory-efficient .oswpq parser with PyArrow lazy evaluation
This PR implements a new results file parser for the .oswpq file format, which contains OpenSWATH results stored in Parquet format. The implementation uses PyArrow for memory-efficient lazy evaluation without loading entire datasets into memory.
Overview
The .oswpq format is a directory containing two Parquet files:
- precursors_features.parquet: Precursor-level features and scoring information
- transition_features.parquet: Transition-level features and intensities
Key Features
Memory-Efficient Lazy Evaluation
- Uses PyArrow datasets for lazy loading instead of pd.read_parquet()
- Applies filtering and column projection at the parquet level
- Only materializes data when absolutely necessary
- Requires PyArrow as a mandatory dependency
Complete OpenSWATH-OSWPQ Integration
- Extends GenericResultsAccess following the existing pattern
- Maps the complex OSWPQ schema (90+ precursor columns, 50+ transition columns) to the unified interface
- Supports filtering by q-values at precursor, peptide, and protein levels
- Handles ion mobility data when present
- Identifies itself as "OpenSWATH-OSWPQ" software
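One way to picture the multi-level q-value filtering is as a list of filter tuples in the same style as the `filters=[...]` argument shown under Technical Implementation. The sketch below is hypothetical: `build_qvalue_filters` is not a function from this PR, and only SCORE_MS2_Q_VALUE appears in this description. The peptide and protein column names are placeholders and would need to be replaced with the real OSWPQ schema names.

```python
# Only SCORE_MS2_Q_VALUE is taken from the PR text; the peptide and protein
# column names are hypothetical placeholders, not the real schema.
QVALUE_COLUMNS = {
    "precursor": "SCORE_MS2_Q_VALUE",
    "peptide": "SCORE_PEPTIDE_Q_VALUE",  # hypothetical name
    "protein": "SCORE_PROTEIN_Q_VALUE",  # hypothetical name
}


def build_qvalue_filters(precursor=None, peptide=None, protein=None):
    """Return (column, op, value) tuples for every level that has a cutoff."""
    thresholds = {"precursor": precursor, "peptide": peptide, "protein": protein}
    return [
        (QVALUE_COLUMNS[level], "<=", cutoff)
        for level, cutoff in thresholds.items()
        if cutoff is not None  # levels without a cutoff are left unfiltered
    ]


filters = build_qvalue_filters(precursor=0.01, protein=0.05)
```

A list of tuples in this form is interpreted by PyArrow's parquet readers as a conjunction, so every listed cutoff must hold for a row to be returned.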
Enhanced ResultsLoader Integration
- Automatically detects .oswpq directories
- Validates presence of both required parquet files before loading
- Maintains backward compatibility with existing .osw and .tsv formats
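The detection and validation step could look roughly like the sketch below. It assumes the check is based on the directory suffix plus the two file names listed in the Overview; `missing_oswpq_files` is a hypothetical helper, not the function used by ResultsLoader.

```python
import tempfile
from pathlib import Path

# File names as listed in the Overview section.
REQUIRED_FILES = ("precursors_features.parquet", "transition_features.parquet")


def missing_oswpq_files(path):
    """Hypothetical check: list the required parquet files absent from an .oswpq directory."""
    root = Path(path)
    if root.suffix != ".oswpq" or not root.is_dir():
        raise ValueError(f"{root} is not an .oswpq directory")
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Demonstration on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "results.oswpq"
    run_dir.mkdir()
    (run_dir / "precursors_features.parquet").touch()
    missing_before = missing_oswpq_files(run_dir)
    (run_dir / "transition_features.parquet").touch()
    missing_after = missing_oswpq_files(run_dir)
```

Returning the list of missing files, rather than a boolean, lets the caller report exactly which file is absent.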
Usage Examples
from massdash.loaders.ResultsLoader import ResultsLoader
# Load OSWPQ results with automatic lazy evaluation
results_loader = ResultsLoader('/path/to/results.oswpq')
# Mix with other formats
results_loader = ResultsLoader([
    '/path/to/results.oswpq',
    '/path/to/legacy.osw',
    '/path/to/report.tsv'
])
# Direct access with efficient filtering
from massdash.loaders.access.OSWPQResultsAccess import OSWPQResultsAccess
access = OSWPQResultsAccess('/path/to/results.oswpq')
precursors = access.getIdentifiedPrecursors(qvalue=0.01) # Filtered at parquet level
Technical Implementation
Lazy Evaluation Architecture
# PyArrow-only approach - lazy dataset with efficient querying
self.precursors_dataset = pq.ParquetDataset(precursors_file)
filtered_data = self._execute_precursor_query(
    filters=[('SCORE_MS2_Q_VALUE', '<=', 0.01)],
    columns=['MODIFIED_SEQUENCE', 'PRECURSOR_CHARGE']
)
Performance Benefits
- Memory Usage: No longer loads entire parquet files into memory
- Query Performance: Filtering applied at parquet level using PyArrow compute
- Column Efficiency: Only loads necessary columns for each operation
- Scalability: Handles large-scale OpenSWATH results efficiently
- Simplified Codebase: Single implementation path focused on PyArrow
Testing
Comprehensive test coverage using snapshot testing pattern:
- Matches test_OSWDataAccess structure with PandasSnapshotExtension
- Tests all public methods with real data integration
- Performance tests verify lazy evaluation is working
- Error handling for invalid data
Dependencies
This implementation requires PyArrow as a mandatory dependency, which provides:
- Superior performance for parquet operations
- Native lazy evaluation capabilities
- Efficient column-oriented data processing
- Advanced filtering and projection features
Fixes #190.
Contents (#191)
Other
- add .oswpq tests and .oswpq file
- Roestlab/massdash into copilot/fix-190
- refactor copilot changes to make more readable
- bugs in OSWPQResultsAccess
- add snapshot tests oswpq results loader
- add tests for oswpq file
- add software, sequence and charge info to getTransitionFeaturesDf
- require only .oswpq for SpectrumLoader
- Roestlab/massdash into copilot/fix-190
- include tests
- update resultLoader tests for oswpq
Uncategorised!
- Initial plan
- Implement OSWPQResultsAccess parser for .oswpq files
- Add documentation and improve OSWPQ detection logic
- Initial plan to address review comments
- Implement lazy evaluation with pyarrow and update tests to snapshot pattern
- Update API documentation for OSWPQResultsAccess with lazy evaluation details
- Delete test/loaders/test_ResultsLoader_OSWPQIntegration.py
- Remove pandas fallback for increased simplicity, require PyArrow
- Delete docs/OSWPQ_Parser.md
- minor changes
- Merge branch 'dev' into copilot/fix-190
- Merge branch 'dev' into copilot/fix-190
- Update massdash/loaders/access/OSWPQResultsAccess.py
- Merge branch 'dev' into copilot/fix-190
@copilot address reviewer comments
Overall looks good.
- For tests assume snapshot tests like in test_OSWDataAccess, I will convert the .osw files to .oswpq ...
Implemented both requested changes in commit fe72b97:
- Lazy evaluation with pyarrow: Replaced pd.read_parquet() with pyarrow.ParquetDataset for memory-efficient loading. The implementation uses pyarrow filtering and column projection to avoid loading entire files, with graceful fallback to pandas when needed.
- Snapshot tests: Converted tests from mock-based to snapshot pattern following test_OSWDataAccess structure, using PandasSnapshotExtension and real test data paths. Tests will automatically skip until .oswpq test data is provided.
The new implementation maintains all API contracts while providing significant memory efficiency improvements for large parquet files.
@copilot address reviewer comments
Addressed all reviewer comments in commit fe72b97:
- Implemented lazy evaluation with pyarrow instead of loading entire parquet files in memory
- Converted tests to snapshot testing pattern matching test_OSWDataAccess
- Prepared test structure for real .oswpq test data conversion
- Added efficient filtering and column projection capabilities
- Maintained backward compatibility with graceful fallback to pandas
The implementation is ready for the .oswpq test data conversion you mentioned.
@copilot update API documentation
Updated API documentation in commit 1b98015:
- Added OSWPQResultsAccess to API.rst autosummary for automated sphinx documentation
- Enhanced OSWPQ_Parser.md with detailed lazy evaluation features and performance benefits
- Added comprehensive class docstrings for better API reference generation
- Included OSWPQ_Parser documentation in main documentation index
- Documented lazy evaluation API usage examples and memory efficiency improvements
The OSWPQResultsAccess class is now fully integrated into the automated API documentation system.
@copilot address reviewer comments
Addressed all reviewer comments by removing pandas fallback for increased simplicity. The implementation now requires PyArrow and uses only lazy evaluation without fallback code paths. (4cbed34)
@singjc can I merge this?
@copilot can you try address reviewer comments again
@singjc can you try and get copilot to address the changes again? I lost access to my github ed account.
If not I will address the comments manually
@copilot can you try address reviewer comments again
@jcharkow I think it should be good to go. Once the CI finishes we can merge
Thanks for finishing this up!