alphabase
remove_unused_fragments() disrupts the order of other dataframe attributes like `nAA`
Describe the bug
In compress_fragment_indices(), frag_start_col has already been sorted in ascending order, so there is no need to call sort_values() before compress_fragment_indices. Doing so may disrupt the order of nAA or other sequential attributes.
Similarly, using sort_index() can compromise the ordered nature of frag_start_col. To prevent this, it should be replaced with reset_index().
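A minimal, self-contained sketch of this claim (toy values, not alphabase code): when the pandas index is not the clean 0..N-1 range, `sort_index()` breaks a freshly compressed `frag_start_idx`, while `reset_index(drop=True)` preserves it.

```python
import pandas as pd

# Hypothetical precursor frame: frag_start_idx is already ascending,
# but the pandas index is not the clean 0..N-1 range.
df = pd.DataFrame({"frag_start_idx": [10, 25, 40]}, index=[1, 0, 2])

# sort_values is a no-op here (frag_start_idx is already sorted) ...
df = df.sort_values(["frag_start_idx"], ascending=True)

# ... pretend the indices were then compressed to a dense range:
df["frag_start_idx"] = [0, 2, 4]

# sort_index() reorders rows by the old index and breaks the new ordering:
restored = df.sort_index()
print(restored["frag_start_idx"].is_monotonic_increasing)  # False

# reset_index(drop=True) keeps the frag_start_idx ordering intact:
kept = df.reset_index(drop=True)
print(kept["frag_start_idx"].is_monotonic_increasing)  # True
```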
Changes
- Fix the bug
- Add a test and test data
Hi, can you give a minimal example which shows the bug? As I understand it, we pass a precursor df in a certain order and it's returned in a different order?
OK [GeorgWa](https://github.com/GeorgWa), I will clarify this bug in more detail.
When running the following code:
```python
precursor_df = refine_precursor_df(precursor_df)
print(
    f"precursor_df.frag_start_idx.is_monotonic_increasing: {precursor_df.frag_start_idx.is_monotonic_increasing}\n",
    f"(precursor_df.frag_start_idx.iloc[1:].values == precursor_df.frag_stop_idx.iloc[:-1].values).all(): "
    f"{(precursor_df.frag_start_idx.iloc[1:].values == precursor_df.frag_stop_idx.iloc[:-1].values).all()}\n",
    f"precursor_df.nAA.is_monotonic_increasing: {precursor_df.nAA.is_monotonic_increasing}.",
)
precursor_df, _ = remove_unused_fragments(precursor_df, [fragment_intensity_df])
print(
    f"precursor_df.frag_start_idx.is_monotonic_increasing: {precursor_df.frag_start_idx.is_monotonic_increasing}\n",
    f"(precursor_df.frag_start_idx.iloc[1:].values == precursor_df.frag_stop_idx.iloc[:-1].values).all(): "
    f"{(precursor_df.frag_start_idx.iloc[1:].values == precursor_df.frag_stop_idx.iloc[:-1].values).all()}\n",
    f"precursor_df.nAA.is_monotonic_increasing: {precursor_df.nAA.is_monotonic_increasing}.",
)
```
The results are as follows:
```
False
False
True
True
True
False
```
However, the desired results should be:
```
False
False
True
True
True
True
```
This is just a specific and straightforward example. The key issue is that the remove_unused_fragments() function should maintain the order of the nAA column in precursor_df for any input, regardless of whether the nAA column is ordered or unordered initially.
The bug was initially discovered during testing of AlphaPeptDeep. When the model performs inference, there are two possible ways to store the prediction results:
- Assigning the entire block at once
- Assigning results batch by batch
For the entire block assignment to work, the following conditions must be met:
```python
(len(precursor_df) == 0) or (
    (precursor_df.index.values[0] == 0)
    and precursor_df.frag_start_idx.is_monotonic_increasing
    and (precursor_df.frag_start_idx.iloc[1:].values == precursor_df.frag_stop_idx.iloc[:-1].values).all()
    and precursor_df.nAA.is_monotonic_increasing
    and np.all(np.diff(precursor_df.index.values) == 1)
)
```
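For clarity, these conditions can be collected into a small predicate. This is a sketch assuming the `precursor_df` columns discussed in this thread (`frag_start_idx`, `frag_stop_idx`, `nAA`), not code from AlphaPeptDeep:

```python
import numpy as np
import pandas as pd

def can_assign_whole_block(precursor_df: pd.DataFrame) -> bool:
    """Return True if predictions can be written as one contiguous block.

    Mirrors the conditions quoted above: a clean 0..N-1 index, monotonically
    increasing frag_start_idx, back-to-back start/stop ranges, and sorted nAA.
    """
    if len(precursor_df) == 0:
        return True
    return bool(
        precursor_df.index.values[0] == 0
        and precursor_df.frag_start_idx.is_monotonic_increasing
        and (
            precursor_df.frag_start_idx.iloc[1:].values
            == precursor_df.frag_stop_idx.iloc[:-1].values
        ).all()
        and precursor_df.nAA.is_monotonic_increasing
        and np.all(np.diff(precursor_df.index.values) == 1)
    )
```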
For further details, you can refer to the pDeepModel._set_batch_predict_data() function in the AlphaPeptDeep repository:
https://github.com/MannLabs/alphapeptdeep/blob/main/peptdeep/model/ms2.py.
I hope my understanding is correct, and I apologize if my explanation has caused you any confusion.🥺
The Bug
At the moment, remove_unused_fragments works as intended. In alphaBase, there is no requirement that the dataframe be "refined". Refinement can be useful for some applications of alphaBase like peptdeep, but if we made sure that everything is refined after every operation, it would be very compute intensive.
Instead, a dataframe can be refined by the user, which allows functions to use optimizations. The function for this is refine_precursor_df. We currently do not have a function to refine a fragment df.
Furthermore, remove_unused_fragments works as intended in that it returns the dataframe in the same order as it was passed. This is why we sort first and restore the order later. Also, remove_unused_fragments currently does not create the fragment df in a continuous manner, because this hasn't been implemented yet.
Your Solution
First, remove_unused_fragments should always return the precursor df in the same order. I think the better solution would be to call refine_precursor_df first (which orders nAA) and call remove_unused_fragments second.
The solution you provided does not ensure that frag_start_idx is monotonic BEFORE it is passed to compress_fragment_indices, even though this is an optimized function which requires it. So it most likely won't work.
See docstring:
```python
@nb.njit()
def compress_fragment_indices(frag_idx):
    """
    recalculates fragment indices to remove unused fragments. Can be used to compress a fragment library.
    Expects fragment indices to be ordered by increasing values (!!!).
    It should be O(N) runtime with N being the number of fragment rows.
    """
```
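For illustration, here is a plain-Python sketch of the single-pass compression the docstring describes (the actual numba implementation may differ, e.g. in how it handles shared or overlapping ranges). It also hints at why sorted input matters: a single forward pass can only assign correct shifts when ranges arrive in increasing order.

```python
import numpy as np

def compress_indices_sketch(frag_idx: np.ndarray) -> np.ndarray:
    """Remap sorted (start, stop) fragment ranges onto a dense 0..M range.

    frag_idx: (N, 2) array of [start, stop) rows, sorted by start.
    Returns a same-shape array with the unused gaps between ranges removed.
    One forward pass, hence O(N); with unsorted rows this pass would assign
    new ranges in row order and scramble the fragment table mapping.
    """
    new_idx = np.empty_like(frag_idx)
    offset = 0  # next free position in the compressed fragment table
    for i in range(len(frag_idx)):
        start, stop = frag_idx[i]
        length = stop - start
        new_idx[i, 0] = offset
        new_idx[i, 1] = offset + length
        offset += length
    return new_idx

# compress_indices_sketch(np.array([[0, 4], [10, 13], [20, 26]]))
# -> [[0, 4], [4, 7], [7, 13]]
```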
Lastly, we need a unittest which checks that the fragment-precursor pairing is still valid; otherwise, we can't be sure that the fragments are still the same. Instead of using random fragment samples, assign for example 123, 456, 789, etc., and test whether the changed frag_start_idx and frag_stop_idx still point to the same numbers.
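A self-contained sketch of that sentinel-value check (the column name `b_z1` and the helper are hypothetical, and the actual `remove_unused_fragments` call is not shown):

```python
import pandas as pd

def assert_pairing_intact(precursor_df, fragment_df, sentinels):
    """Check each precursor's [start, stop) range still holds its sentinel.

    sentinels maps precursor row position -> the value planted at the first
    row of that precursor's fragment block (123, 456, 789, ...).
    """
    for pos, expected in sentinels.items():
        start = int(precursor_df["frag_start_idx"].iloc[pos])
        assert fragment_df["b_z1"].iloc[start] == expected, (
            f"precursor {pos} no longer points at its fragments"
        )

# Sentinel fragment table ('b_z1' is a hypothetical intensity column):
fragment_df = pd.DataFrame({"b_z1": [123.0, 0, 0, 456.0, 0, 789.0, 0, 0]})
precursor_df = pd.DataFrame(
    {"frag_start_idx": [0, 3, 5], "frag_stop_idx": [3, 5, 8]}
)

# After calling remove_unused_fragments on such frames, this same check
# should still pass with the remapped indices.
assert_pairing_intact(precursor_df, fragment_df, {0: 123.0, 1: 456.0, 2: 789.0})
```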
Suggested Solution
Let's split this into two tasks. First, the user (or peptdeep) calls refine_precursor_df, which orders nAA; then we call remove_unused_fragments.
We also have to change remove_unused_fragments and compress_fragment_indices so that they not only remove unused fragments but also reorder the fragments while maintaining the mapping.
Thx, GeorgWa
- As mentioned earlier, this is just a specific and straightforward example; I believe it is a minimal one. My primary use of `remove_unused_fragments()` is to obtain a `precursor_df` sorted by `nAA`. However, I noticed that `remove_unused_fragments()` modifies the input order, as shown in the following line:
  ```python
  precursor_df = precursor_df.sort_values([frag_start_col], ascending=True)
  ```
  This step illustrates that the input order is altered during the process.
- I have reviewed the docstring in `compress_fragment_indices()` and noticed the statement: "Expects fragment indices to be ordered by increasing values." However, I am unclear why this strict requirement is necessary for the function. Could you provide a simple example to demonstrate the potential issues or dangers if this condition is not met?
- Splitting this process into two tasks is undoubtedly the correct approach. I have attempted to combine `refine_precursor_df()` with the original `remove_unused_fragments()`, but it did not seem to work as expected. Perhaps there is a specific detail I may have overlooked?
- It should have no effect on the input order as we restore it afterwards.
  ```python
  # order by frag_start_col
  # (calling reset_index would be even better to make sure the index is clean)
  precursor_df = precursor_df.sort_values([frag_start_col], ascending=True)
  # ... some code
  # restore old input order
  precursor_df = precursor_df.sort_index()
  ```
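A quick self-contained demonstration of this round trip on toy data. Note it relies on the incoming index being clean, which is exactly why calling reset_index first would be even better:

```python
import pandas as pd

# Hypothetical precursor frame in an arbitrary input order (clean index):
df = pd.DataFrame({"frag_start_idx": [20, 5, 12], "nAA": [9, 7, 8]})
original_order = df["nAA"].tolist()

# order by frag_start_idx, do some work, then restore the input order:
work = df.sort_values(["frag_start_idx"], ascending=True)
restored = work.sort_index()

print(restored["nAA"].tolist() == original_order)  # True
```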
- Yes, this is why I was concerned about the implementation. To my knowledge this is necessary (given the explicit mention in the docstring). If you think it's not, let's remove it from the docstring and add tests which "prove" it. If you think the reordering is not necessary, this would solve the problem.
What do you think about:
- Adding a unittest to check whether `compress_fragment_indices()` works without an ordered `frag_start_idx`
- Adding a unittest on the fragment level which checks whether `remove_unused_fragments` works without an ordered `frag_start_idx`
Sorry for the confusion, I hope this way we can address it and merge your PR :)