MassQueryLanguage [bug] when loading mgf files without metadata key "scans"

Hi,

This is part of a function in the fileloading.py module. I believe there is a bug when reading MGF files that lack the metadata key 'scans': the peaks of one spectrum do not all get the same scan number. The running index of the outer loop is overwritten by the inner loop, which causes the problem. I added comments next to the responsible code lines.

def _load_data_mgf(input_filename):
    file = load_from_mgf(input_filename)

    ms2mz_list = []
    for i, spectrum in enumerate(file):  # A: scan number == i, here it is correctly defined
        if len(spectrum.peaks.mz) == 0:
            continue

        mz_list = list(spectrum.peaks.mz)
        i_list = list(spectrum.peaks.intensities)
        i_max = max(i_list)
        i_sum = sum(i_list)

        for i in range(len(mz_list)):  # B: here the scan number i is overwritten
            if i_list[i] == 0:
                continue

            peak_dict = {}
            peak_dict["i"] = i_list[i]
            peak_dict["i_norm"] = i_list[i] / i_max
            peak_dict["i_tic_norm"] = i_list[i] / i_sum
            peak_dict["mz"] = mz_list[i]

            # Handling malformed mgf files
            try:
                peak_dict["scan"] = spectrum.metadata["scans"]  # this works correctly because it uses the same spectrum object
            except:
                peak_dict["scan"] = i + 1  # here the scan number is assigned, but it is based on B and not on A, as I believe it should be

....
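A minimal sketch of how the fix could look, assuming the only change needed is to stop reusing `i` in the inner loop. It does not use the real loader: `load_peaks`, `spectrum_index` and the plain-dict spectra (with "mz", "intensities" and "metadata" keys) are hypothetical stand-ins for the spectrum objects returned by `load_from_mgf`, chosen so the example runs on its own.

```python
def load_peaks(spectra):
    """Flatten spectra into one dict per peak.

    `spectra` is a hypothetical stand-in for the loaded MGF spectra:
    an iterable of dicts with "mz", "intensities" and "metadata" keys.
    """
    rows = []
    # The outer index gets its own name, so nothing below can overwrite it.
    for spectrum_index, spectrum in enumerate(spectra):
        mz_list = list(spectrum["mz"])
        if len(mz_list) == 0:
            continue

        i_list = list(spectrum["intensities"])
        i_max = max(i_list)
        i_sum = sum(i_list)

        # Fall back to the 1-based spectrum position when "scans" is missing.
        scan = spectrum["metadata"].get("scans", spectrum_index + 1)

        # Iterate over the peaks directly instead of over a second index.
        for mz, intensity in zip(mz_list, i_list):
            if intensity == 0:
                continue
            rows.append({
                "i": intensity,
                "i_norm": intensity / i_max,
                "i_tic_norm": intensity / i_sum,
                "mz": mz,
                "scan": scan,
            })
    return rows


# Two spectra without a "scans" key; the zero-intensity peak is skipped.
example = [
    {"mz": [100.0, 200.0, 300.0], "intensities": [10.0, 0.0, 30.0], "metadata": {}},
    {"mz": [150.0, 250.0], "intensities": [5.0, 15.0], "metadata": {}},
]
rows = load_peaks(example)
print([row["scan"] for row in rows])  # [1, 1, 2, 2]
```

With the original code, the same input would give each peak its own position within the spectrum as the scan number, rather than one number per spectrum.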

Feb 22 '25 07:02 j-a-dietrich

This is a good point. The iterator needs renaming. Could you throw a PR for this, together with an example file? Then we can write a small unit test to make sure we don’t have this issue anymore.

Ming


Feb 22 '25 19:02 mwang87

Sure, but I already forked this repository for another project and therefore cannot throw a PR from a fork.

Feb 24 '25 15:02 j-a-dietrich