Siegfried Gessulat
The PsmSchema definition is currently implemented via dataclasses. It's great to have the ability to validate a dataframe with a schema! Pydantic is a library for defining schemas and validation...
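
To illustrate the kind of validation Pydantic offers, here is a minimal sketch that checks row records one by one. The model name `PsmRow`, its fields, and the helper `validate_rows` are hypothetical and are not taken from the existing PsmSchema; only the `BaseModel` and `ValidationError` names come from Pydantic itself.

```python
from pydantic import BaseModel, ValidationError


class PsmRow(BaseModel):
    # Hypothetical columns, chosen only for illustration.
    spectrum_id: int
    peptide: str
    score: float
    is_target: bool


def validate_rows(rows):
    """Return (valid_rows, errors) for an iterable of row dicts."""
    valid, errors = [], []
    for index, row in enumerate(rows):
        try:
            valid.append(PsmRow(**row))
        except ValidationError as exc:
            # Keep the row index so the offending record can be located.
            errors.append((index, str(exc)))
    return valid, errors


rows = [
    {"spectrum_id": 1, "peptide": "PEPTIDEK", "score": 12.5, "is_target": True},
    {"spectrum_id": 2, "peptide": "ACDEFGHIK", "score": "not a number", "is_target": False},
]
valid, errors = validate_rows(rows)
```

With a pandas dataframe, the row dicts could be obtained via `df.to_dict("records")`. Validating row by row is slow for large inputs, so this is meant to show the schema style rather than a recommended implementation.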
Mokapot is a workflow that broadly consists of the following steps - data preprocessing: optionally subsetting the input data and then doing a 3-fold split to generate training data...
For very large datasets, single-threaded IO operations are currently a speed bottleneck. Pyarrow datasets natively support: - partitioning a dataframe - [multi-threaded read](https://arrow.apache.org/cookbook/py/io.html#reading-partitioned-data) - [multi-threaded](https://github.com/apache/arrow/blob/main/python/pyarrow/dataset.py#L873C1-L875C62) [write](https://arrow.apache.org/cookbook/py/io.html#writing-partitioned-datasets) - [specifying number of...
@sambenfredj's pull request introduces streaming at several places in the workflow, but the intermediary file formats are not yet specified or documented. In addition, switching to a binary format...
As [discussed here](https://github.com/wfondrie/mokapot/pull/119#discussion_r1646591660), currently when the model is loaded, a StandardScaler is not loaded but newly instantiated. It would be desirable to give the user the flexibility to decide what...
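
The difference between restoring a fitted scaler and instantiating a new one can be shown with plain scikit-learn and `pickle`. This is a generic sketch, not mokapot's actual save or load code; the payload layout (a dict with `estimator` and `scaler` keys) and the toy data are assumptions.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = rng.normal(loc=5.0, scale=2.0, size=(50, 3))
labels = (features[:, 0] > 5.0).astype(int)

scaler = StandardScaler().fit(features)
estimator = LogisticRegression().fit(scaler.transform(features), labels)

# Persist both objects together so the fitted scaler travels with the model.
payload = pickle.dumps({"estimator": estimator, "scaler": scaler})

restored = pickle.loads(payload)
restored_scaler = restored["scaler"]

# A freshly instantiated scaler carries no fitted statistics ...
fresh_scaler = StandardScaler()
fresh_is_fitted = hasattr(fresh_scaler, "mean_")

# ... while the restored scaler reproduces the training-time scaling.
same_scaling = np.allclose(
    restored_scaler.transform(features), scaler.transform(features)
)
```

Storing the scaler next to the estimator is one option; refitting a new scaler on the data being rescored is another, and the two generally give different feature values.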
The issue was started by this [discussion](https://github.com/wfondrie/mokapot/pull/119#discussion_r1646580990) regarding MSAID's streaming branch. For streaming, several chunk sizes are defined and currently hard-coded. It would be desirable to be able to modify them. Best would...
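
One way to make chunk sizes adjustable, sketched here with the standard library only, is to collect them in a small settings object whose defaults can be overridden. The class `ChunkSizes`, its field names and default values, and the environment variable names are all hypothetical; they do not reflect the values used in the streaming branch.

```python
import os
from dataclasses import dataclass
from itertools import islice


@dataclass(frozen=True)
class ChunkSizes:
    # Hypothetical setting names and defaults, not mokapot's actual values.
    read_rows: int = 1000
    write_rows: int = 500

    @classmethod
    def from_env(cls, env=None):
        """Override defaults from (hypothetical) environment variables."""
        env = os.environ if env is None else env
        return cls(
            read_rows=int(env.get("CHUNK_READ_ROWS", cls.read_rows)),
            write_rows=int(env.get("CHUNK_WRITE_ROWS", cls.write_rows)),
        )


def iter_chunks(rows, chunk_size):
    """Yield lists of at most `chunk_size` items from any iterable."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Pass a plain dict instead of os.environ to keep the example deterministic.
sizes = ChunkSizes.from_env({"CHUNK_READ_ROWS": "4"})
chunks = list(iter_chunks(range(10), sizes.read_rows))
```

Environment variables are just one override mechanism; command-line options or function arguments would work with the same settings object.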
This issue stems from [this discussion](https://github.com/wfondrie/mokapot/pull/119#discussion_r1646603295). MSAID's streaming branch introduces the OnDiskPSMDataset class, which enables the chunk-wise streaming of a PSMDataset and returns these chunks as LinearPSMDataset. To make...
MSAID's streaming feature introduced a regression so that the suffixes of the psms output files are not `psms.txt` anymore but only `psms`. This is not desirable because it makes it...
This pull request (https://github.com/wfondrie/mokapot/pull/119) brings SQLite support for MSAID's internal format. This should live outside of (the core) Mokapot. Options: - move this code to MSAID's repos - move this code to...