fido icon indicating copy to clipboard operation
fido copied to clipboard

Fido hanging on skeleton stream (fmt/1000)

Open ross-spencer opened this issue 4 years ago • 3 comments

Bug spotted while writing docs, the following will hang when calling identify_stream(...).

# -*- coding: utf-8 -*-

"""Reference for Fido-based format identification

1. Create a byte-stream with a known magic number.
2. Call identify_stream(...) to identify the file against Fido's known formats.
"""

from __future__ import print_function

import io

from fido import fido

# Magic-number for fmt/1000.
BIN = b"\x5A\x58\x54\x61\x70\x65\x21\x1A\x01"

# Create the stream object with the known magic-number.
fstream = io.BytesIO(BIN)

# Create a Fido instance and call identify_stream. The identify_stream function
# will work on the stream as-is. This could be an open file handle that the
# caller is managing for itself.
f = fido.Fido()
f.identify_stream(fstream, "filename to display", extension=False)

I haven't looked at the code myself, but Carl has and identified we're getting to the end of the data during a loop and then it's blocking as we've nothing left to consume, but no way to exit. And it sounds like he has a fix, so that's good!

ross-spencer avatar May 14 '20 01:05 ross-spencer

Of interest to https://github.com/openpreserve/fido/issues/94

ross-spencer avatar May 14 '20 01:05 ross-spencer

The source of the error is in the blocking_read() function, and the problem is in this loop:

        while bytes_read < bytes_to_read:
            readbuffer = file.read(bytes_to_read - bytes_read)
            buffer += readbuffer
            bytes_read = len(buffer)
            # break out if EOF is reached.
            if readbuffer == '':
                break
        return buffer

specifically with the exit condition if readbuffer == '': which is never satisfied, I believe but haven't tested, that the test should be for None. What's happening is caused by the stream to identify's length been (much) shorter than the default buffer size, bytes_to_read = 131072. The first read slurps all five bytes, but that still leaves bytes_read < bytes_to_read by a margin of 131067. The loop then sticks, and this is demonstrated by this Travis job that hangs for both Python 3 tasks. The Python 2 version doesn't hang, which probably explains why the issue's appeared. It also demonstrates that we're a few tests short of a reliable build process.

I have a working fix for the tests but note that these tests are deficient, which segues into the link to #94 that @ross-spencer alluded to. I believe that the identify_file and identify_stream methods should return the matched set of results. They currently return nothing and pass the results to the CLI display function further down the line. A real API would allow these tests to examine some basic identification behaviour and check the results. A nice payoff from unit tests is that they force you to write testable code, which is usually well designed.

carlwilson avatar May 14 '20 23:05 carlwilson

Hackathon 2023 Review: Selected, @darrendignam @sromkey this should be fixable from here I think.

carlwilson avatar Jul 17 '23 13:07 carlwilson