fido
fido copied to clipboard
Fido hanging on skeleton stream (fmt/1000)
Bug spotted while writing docs, the following will hang when calling identify_stream(...)
.
# -*- coding: utf-8 -*-
"""Reference for Fido-based format identification
1. Create a byte-stream with a known magic number.
2. Call identify_stream(...) to identify the file against Fido's known formats.
"""
from __future__ import print_function
import io
from fido import fido
# Magic-number for fmt/1000.
BIN = b"\x5A\x58\x54\x61\x70\x65\x21\x1A\x01"
# Create the stream object with the known magic-number.
fstream = io.BytesIO(BIN)
# Create a Fido instance and call identify_stream. The identify_stream function
# will work on the stream as-is. This could be an open file handle that the
# caller is managing for itself.
f = fido.Fido()
f.identify_stream(fstream, "filename to display", extension=False)
I haven't looked at the code myself, but Carl has and identified we're getting to the end of the data during a loop and then it's blocking as we've nothing left to consume, but no way to exit. And it sounds like he has a fix, so that's good!
Of interest to https://github.com/openpreserve/fido/issues/94
The source of the error is in the blocking_read()
function, and the problem is in this loop:
while bytes_read < bytes_to_read:
readbuffer = file.read(bytes_to_read - bytes_read)
buffer += readbuffer
bytes_read = len(buffer)
# break out if EOF is reached.
if readbuffer == '':
break
return buffer
specifically with the exit condition if readbuffer == '':
which is never satisfied, I believe but haven't tested, that the test should be for None
. What's happening is caused by the stream to identify's length been (much) shorter than the default buffer size, bytes_to_read = 131072
. The first read slurps all five bytes, but that still leaves bytes_read < bytes_to_read
by a margin of 131067. The loop then sticks, and this is demonstrated by this Travis job that hangs for both Python 3 tasks. The Python 2 version doesn't hang, which probably explains why the issue's appeared. It also demonstrates that we're a few tests short of a reliable build process.
I have a working fix for the tests but note that these tests are deficient, which segues into the link to #94 that @ross-spencer alluded to. I believe that the identify_file
and identify_stream
methods should return the matched set of results. They currently return nothing and pass the results to the CLI display function further down the line. A real API would allow these tests to examine some basic identification behaviour and check the results. A nice payoff from unit tests is that they force you to write testable code, which is usually well designed.
Hackathon 2023 Review: Selected, @darrendignam @sromkey this should be fixable from here I think.