ParseError with single-line files stored as io.BytesIO

Open william-watson-swri opened this issue 1 year ago • 0 comments

Version pymzml: 2.5.10 Python: 3.11.7

Description I'm receiving mzML files as bytes, wrapping these in io.BytesIO, and then passing that to pymzml.run.Reader:

reader = pymzml.run.Reader(io.BytesIO(mzml_bytes))

This sometimes raises the following exception:

ParseError: no element found: line 1, column 0

Why Some of the mzML files I'm using have no line breaks, i.e. the whole document is on a single line, and the _guess_encoding function mishandles these. Looking at the pymzml source, io.BytesIO objects pass through this line, which in turn calls the culprit, _guess_encoding:

match = regex_patterns.FILE_ENCODING_PATTERN.search(mzml_file.readline())

After the .readline(), there's no data left in the BytesIO if the file has no line breaks, and thus the later XML parsing fails.
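The behaviour can be reproduced without pymzml. The following is a minimal sketch using only the standard library; the payload `SINGLE_LINE` and both helper names are made up for illustration, and `xml.etree.ElementTree` stands in for whatever parser pymzml uses internally.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical payload: a complete XML document on one line, no line breaks.
SINGLE_LINE = b'<?xml version="1.0" encoding="utf-8"?><mzML><run id="r1"/></mzML>'


def bytes_left_after_readline(data: bytes) -> int:
    """Return how many bytes remain in the buffer after one readline() call."""
    buf = io.BytesIO(data)
    buf.readline()  # on a single-line document this consumes everything
    return len(buf.read())


def parse_after_readline(data: bytes) -> str:
    """Call readline() and then parse whatever is left, without rewinding."""
    buf = io.BytesIO(data)
    buf.readline()
    return ET.parse(buf).getroot().tag


if __name__ == "__main__":
    print(bytes_left_after_readline(SINGLE_LINE))
    try:
        parse_after_readline(SINGLE_LINE)
    except ET.ParseError as exc:
        print("ParseError:", exc)
```

With a line break after the XML declaration, the same two helpers leave the document body in the buffer and the parse succeeds.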

Workaround/fix I'm currently inserting a line break after the XML declaration at the start of the data before passing it to pymzml:

data = re.sub(br'(<\?xml[^>]+>)', br'\1\n', mzml_bytes, count=1)
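To show why the substitution helps, here is a small sketch that wraps it in a helper and checks what `readline()` leaves behind. The function name `split_declaration` and the sample payload are hypothetical; the regular expression is the one from the line above.

```python
import io
import re


def split_declaration(mzml_bytes: bytes) -> bytes:
    """Insert a newline after the XML declaration, if one is present."""
    return re.sub(br'(<\?xml[^>]+>)', br'\1\n', mzml_bytes, count=1)


if __name__ == "__main__":
    # Hypothetical single-line payload.
    raw = b'<?xml version="1.0" encoding="utf-8"?><mzML><run id="r1"/></mzML>'
    buf = io.BytesIO(split_declaration(raw))
    print(buf.readline())  # only the declaration line is consumed
    print(buf.read())      # the document body is still available
```

Input without an XML declaration is returned unchanged, so the workaround does nothing for such files.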

I believe this could also be fixed by just adding mzml_file.seek(0) after the offending line.
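A rough sketch of what that change might look like follows. This is not pymzml's actual code: `ENCODING_PATTERN` is a hypothetical stand-in for `regex_patterns.FILE_ENCODING_PATTERN`, and `guess_encoding_and_rewind` is an invented name. It also assumes the stream is seekable and that the document starts at offset 0.

```python
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for pymzml's regex_patterns.FILE_ENCODING_PATTERN.
ENCODING_PATTERN = re.compile(rb'encoding="(?P<encoding>[A-Za-z0-9_\-]+)"')


def guess_encoding_and_rewind(mzml_file, default: str = "utf-8") -> str:
    """Read the first line to find the declared encoding, then rewind."""
    match = ENCODING_PATTERN.search(mzml_file.readline())
    mzml_file.seek(0)  # proposed fix: make the full document readable again
    if match is None:
        return default
    return match.group("encoding").decode("ascii")


if __name__ == "__main__":
    # Hypothetical single-line payload.
    raw = b'<?xml version="1.0" encoding="utf-8"?><mzML><run id="r1"/></mzML>'
    buf = io.BytesIO(raw)
    print(guess_encoding_and_rewind(buf))
    print(ET.parse(buf).getroot().tag)  # parsing now sees the whole document
```

Rewinding works whether or not the file contains line breaks, so callers would not need to preprocess the bytes.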

william-watson-swri, Sep 26 '24 18:09