binaryornot
Add kwarg for chunk length to is_binary
Added a kwarg to is_binary for the number of bytes to read from the file being checked; it is passed along to helpers.get_starting_chunk. For some of the files I was checking, doubling the default to 2048 bytes improved the results of the heuristics.
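For illustration, here is a rough, self-contained sketch of how such a kwarg might be threaded through. It is not the code from this PR: the kwarg name chunk_length is hypothetical, and looks_binary is a deliberately simplified stand-in for the library's real heuristic. Only the idea of forwarding the length to get_starting_chunk, and the 1024 default implied by "doubling to 2048", come from the description above.

```python
import os
import tempfile


def get_starting_chunk(filename, length=1024):
    """Read at most `length` bytes from the start of the file."""
    with open(filename, "rb") as f:
        return f.read(length)


# Printable ASCII plus common whitespace/control characters.
_TEXT_BYTES = bytes(range(32, 127)) + b"\n\r\t\f\b"


def looks_binary(chunk):
    """Simplified stand-in heuristic (an assumption, NOT binaryornot's own)."""
    if not chunk:
        return False  # treat empty files as text
    if b"\x00" in chunk:
        return True  # a null byte is a strong hint for binary data
    # Delete every "texty" byte and see how much is left over.
    non_text = chunk.translate(None, _TEXT_BYTES)
    return len(non_text) / len(chunk) > 0.3


def is_binary(filename, chunk_length=1024):
    """`chunk_length` is a hypothetical kwarg name, forwarded unchanged."""
    chunk = get_starting_chunk(filename, length=chunk_length)
    return looks_binary(chunk)


# Synthetic file: 1500 bytes of ASCII followed by 100 null bytes, so the
# binary part only starts after the first 1024 bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a" * 1500 + b"\x00" * 100)
    path = tmp.name

default_result = is_binary(path)                    # inspects 1024 bytes
larger_result = is_binary(path, chunk_length=2048)  # inspects the whole file
os.remove(path)
```

The synthetic file is contrived so that the two chunk lengths disagree; it only shows that the amount of data read can change the verdict, not that a larger chunk is better in general.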
Sorry for taking a while to get back.
- that is, of course, a good point. Unfortunately, when I created this PR I was not yet experienced enough to know that the first thing you do when you encounter a problem is to specify and/or implement a test that reliably reproduces it. I've dug out my notes from back in the day, and the troublesome files appear to have been binary dumps of Fortran arrays. Unfortunately, all of the files in question that I still have access to are correctly identified as binary, and I wasn't able to produce a new one that isn't. Closing this as "cannot reproduce" may, therefore, be the right step at this point.
- I had (and have) no way of telling whether the increase always improves the result :-). I wasn't checking huge amounts of vastly different files, after all. It would surprise me if the heuristic performed worse with more data, but I don't feel qualified to make that call.