[bug] when loading mgf files without metadata key "scans"
Hi,
This is part of a function in the fileloading.py module. I believe there is a bug when reading MGF files without the metadata key 'scans': the peaks of a single spectrum do not all receive the same scan number. The running number i from the outer loop is overwritten by the inner loop, which causes the problem. I added comments next to the responsible code lines.
    def _load_data_mgf(input_filename):
        file = load_from_mgf(input_filename)
        ms2mz_list = []
        for i, spectrum in enumerate(file):  # A: scan number == i, here it is correctly defined
            if len(spectrum.peaks.mz) == 0:
                continue
            mz_list = list(spectrum.peaks.mz)
            i_list = list(spectrum.peaks.intensities)
            i_max = max(i_list)
            i_sum = sum(i_list)
            for i in range(len(mz_list)):  # B: here the scan number i is overwritten
                if i_list[i] == 0:
                    continue
                peak_dict = {}
                peak_dict["i"] = i_list[i]
                peak_dict["i_norm"] = i_list[i] / i_max
                peak_dict["i_tic_norm"] = i_list[i] / i_sum
                peak_dict["mz"] = mz_list[i]
                # Handling malformed mgf files
                try:
                    peak_dict["scan"] = spectrum.metadata["scans"]  # this works correctly because it uses the same spectrum object
                except:
                    peak_dict["scan"] = i + 1  # here the scan number is assigned. But it is based on B and not on A how it should be I believe
                ....
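One possible way to fix this is to give the two loops different variable names and resolve the scan number once per spectrum. The following is only a minimal sketch, not the actual patch: collect_peaks and make_spectrum are hypothetical names, make_spectrum is a stand-in for the spectrum objects returned by load_from_mgf, and I assume spectrum.metadata behaves like a dict.

```python
from types import SimpleNamespace


def collect_peaks(spectra):
    """Flatten spectra into per-peak dicts with one scan number per spectrum."""
    peaks = []
    for spectrum_index, spectrum in enumerate(spectra):
        mz_list = list(spectrum.peaks.mz)
        if len(mz_list) == 0:
            continue
        i_list = list(spectrum.peaks.intensities)
        i_max = max(i_list)
        i_sum = sum(i_list)
        # Resolve the scan once per spectrum. Fall back to the running
        # spectrum number (1-based) when the "scans" key is missing.
        # Assumption: metadata supports dict-style .get().
        scan = spectrum.metadata.get("scans", spectrum_index + 1)
        # The inner loop no longer reuses the outer loop variable.
        for mz, intensity in zip(mz_list, i_list):
            if intensity == 0:
                continue
            peaks.append({
                "i": intensity,
                "i_norm": intensity / i_max,
                "i_tic_norm": intensity / i_sum,
                "mz": mz,
                "scan": scan,
            })
    return peaks


def make_spectrum(mz, intensities, metadata=None):
    """Hypothetical stand-in for a loaded spectrum object (illustration only)."""
    return SimpleNamespace(
        peaks=SimpleNamespace(mz=mz, intensities=intensities),
        metadata=metadata or {},
    )
```

With this version, all peaks of the first spectrum in a file without 'scans' get scan 1, all peaks of the second get scan 2, and so on, while files that do provide 'scans' keep their own values.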
This is a good point. The iterator needs renaming. Could you throw a PR for this together with an example file? Then we can write a small unit test to make sure we don’t have this issue anymore.
Ming
In reply to: j-a-dietrich (Jonas Dietrich) created an issue (mwang87/MassQueryLanguage#249) on Fri, Feb 21, 2025 at 11:39 PM: https://github.com/mwang87/MassQueryLanguage/issues/249
Sure, but I already forked this repository for another project and therefore cannot throw a PR from a fork.
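For the example file and the unit test, something as small as the following might be enough. This is a rough sketch under my own assumptions: the file content, write_example_mgf and scans_per_spectrum are hypothetical and not part of the repository. The helper only regroups the flat peak list, so the actual loading call still has to be plugged in by whoever writes the test.

```python
import os

# Two spectra, deliberately without a SCANS= line (hypothetical example data).
MGF_WITHOUT_SCANS = """\
BEGIN IONS
TITLE=spectrum_a
PEPMASS=500.25
CHARGE=1+
100.0 10.0
200.0 20.0
300.0 30.0
END IONS

BEGIN IONS
TITLE=spectrum_b
PEPMASS=600.30
CHARGE=1+
150.0 5.0
250.0 15.0
END IONS
"""


def write_example_mgf(directory):
    """Write the example file into the given directory and return its path."""
    path = os.path.join(directory, "no_scans.mgf")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(MGF_WITHOUT_SCANS)
    return path


def scans_per_spectrum(peak_dicts, peaks_per_spectrum):
    """Regroup a flat, file-ordered peak list and collect the scan values of each spectrum."""
    groups = []
    start = 0
    for count in peaks_per_spectrum:
        groups.append({peak["scan"] for peak in peak_dicts[start:start + count]})
        start += count
    return groups
```

A test could then assert that scans_per_spectrum(peaks, [3, 2]) equals [{1}, {2}]. With the current loop I would expect [{1, 2, 3}, {1, 2}] for this file, because the inner index is used as the scan number.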