cctbx_project
cctbx_project copied to clipboard
embeded: pycbf and SWIG_PYTHON_STRICT_BYTE_CHAR
Hello, while preparing the dials Debian package, which contain cctbx, I discovered this issue.
The pycbf embeded is compiled with SWIG_PYTHON_STRICT_BYTE_CHAR. whereas the official one is not.
due to this difference I have this sort of error message when running the tests.
_______________________________________________________________________________________ test_beam _______________________________________________________________________________________
def test_beam():
dxtbx_dir = dxtbx.__path__[0]
image = os.path.join(dxtbx_dir, "tests", "phi_scan_001.cbf")
> assert BeamFactory.imgCIF(image)
tests/test_beam.py:13:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
model/beam.py:230: in imgCIF
cbf_handle.read_widefile(cif_file.encode(), pycbf.MSG_DIGEST)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <pycbf.cbf_handle_struct; proxy of <Swig Object of type 'handle *' at 0x7fe45de7dc30> >, filename = b'/usr/lib/python3/dist-packages/dxtbx/tests/phi_scan_001.cbf', headers = 8
def read_widefile(self, filename, headers):
"""
Returns :
*args : String filename,Integer headers
C prototype: int cbf_read_widefile (cbf_handle handle, FILE *file, int flags);
CBFLib documentation:
DESCRIPTION
cbf_read_file reads the CBF or CIF file file into the CBF object
specified by handle, using the CIF 1.0 convention of 80 character
lines. cbf_read_widefile reads the CBF or CIF file file into the CBF
object specified by handle, using the CIF 1.1 convention of 2048
character lines. A warning is issued to stderr for ascii lines over
the limit. No test is performed on binary sections.
Validation is performed in three ways levels: during the lexical
scan, during the parse, and, if a dictionary was converted, against
the value types, value enumerations, categories and parent-child
relationships specified in the dictionary.
flags controls the interpretation of binary section headers, the
parsing of brackets constructs and the parsing of treble-quoted
strings.
MSG_DIGEST: Instructs CBFlib to check that the digest
of the binary section matches any header digest value. If the digests
do not match, the call will return CBF_FORMAT. This evaluation and
comparison is delayed (a "lazy " evaluation) to ensure maximal
processing efficiency. If an immediately evaluation is required, see
MSG_DIGESTNOW, below. MSG_DIGESTNOW: Instructs CBFlib to
check that the digest of the binary section matches any header
digeste value. If the digests do not match, the call will return
CBF_FORMAT. This evaluation and comparison is performed during
initial parsing of the section to ensure timely error reporting at
the expense of processing efficiency. If a more efficient delayed (
"lazy ") evaluation is required, see MSG_DIGEST, above.
MSG_DIGESTWARN: Instructs CBFlib to check that the digest
of the binary section matches any header digeste value. If the
digests do not match, a warning message will be sent to stderr, but
processing will attempt to continue. This evaluation and comparison
is first performed during initial parsing of the section to ensure
timely error reporting at the expense of processing efficiency. An
mismatch of the message digest usually indicates a serious error, but
it is sometimes worth continuing processing to try to isolate the
cause of the error. Use this option with caution. MSG_NODIGEST:
Do not check the digest (default). PARSE_BRACKETS:
Accept DDLm bracket-delimited [item,item,...item] or
{item,item,...item} or (item,item,...item) constructs as valid,
stripping non-quoted embedded whitespace and comments. These
constructs may span multiple lines. PARSE_LIBERAL_BRACKETS: Accept
DDLm bracket-delimited [item,item,...item] or {item,item,...item} or
(item,item,...item) constructs as valid, stripping embedded
non-quoted, non-separating whitespace and comments. These constructs
may span multiple lines. In this case, whitespace may be used as an
alternative to the comma. PARSE_TRIPLE_QUOTES: Accept DDLm
triple-quoted " " "item,item,...item " " " or
'''item,item,...item''' constructs as valid, stripping embedded
whitespace and comments. These constructs may span multiple lines. If
this flag is set, then ''' will not be interpreted as a quoted
apoptrophe and " " " will not be interpreted as a quoted double
quote mark and PARSE_NOBRACKETS: Do not accept DDLm
bracket-delimited [item,item,...item] or {item,item,...item} or
(item,item,...item) constructs as valid, stripping non-quoted
embedded whitespace and comments. These constructs may span multiple
lines. PARSE_NOTRIPLE_QUOTES: No not accept DDLm triple-quoted "
" "item,item,...item " " " or '''item,item,...item''' constructs
as valid, stripping embedded whitespace and comments. These
constructs may span multiple lines. If this flag is set, then '''
will be interpreted as a quoted apostrophe and " " " will be
interpreted as a quoted double quote mark.
CBFlib defers reading binary sections as long as possible. In the
current version of CBFlib, this means that:
1. The file must be a random-access file opened in binary mode (fopen
( ,
"""
> return _pycbf.cbf_handle_struct_read_widefile(self, filename, headers)
E TypeError: in method 'cbf_handle_struct_read_widefile', argument 2 of type 'char *'
../pycbf.py:3285: TypeError
I can solve this issue by recompiling the cbflib on Debian with this parameter but since this is not the default of cbflib, I think that this is wrong. All other projects depending on pycbf would be affected by this.
From what I understand, bytes should be used when a string is expected to represent a filepath.
So I am wondering if a way to solve this issue would to use the cbflib upsteam default, and convert the string to bytes when calling pycbf methods which expect a filepath.
I would like your opinion about this.
thanks
Frederic
Frederic
Hi, without reading too carefully the specific issue here, I can point to the commit where I added this flag:
https://github.com/cctbx/dxtbx/commit/44b6d7240ff21e07f06e1972dc471362e8ba57e2
That commit has a lot of explanation that might be helpful.
Ok, if I understand this you changed the string behaviour for this call
image_string = cbf.get_realarray_as_string()
right ?
Right, that call gave mangled results on Py3, if I recall correctly.
In that case, maybe the best solution would be to ask pycbf upstream to provide as_bytes instead of as_string for Python3. this is a real array of bytes after all. do you know the upstream of pycbf ?
My suspicion is that PyString
is used where PyBytes
should be used, and simply replacing the former with the latter may resolve the issue. Whether the function is called as_bytes
or as_string
then doesn't really matter (although I would rename it to as_bytes
and leave as_string
as a deprecated alias).
I have no idea of swig (and modifying a generated file is pointless), so I'll leave the experimentation to someone else.
Indeed it would be necessary to regenerate the swig files after adding a new function, not something I know how to do. Generally @yayahjb is open to pull requests though.
For what it's worth, I don't know what the official pycbf build is you are referring to. Are you saying on Debian one can install pycbf on the command line? What does that look like?
According to the documentation for swig, the default is always to encode strings as UTF-8 on python 3, which cannot be useful for pycbf which reads data files. It seems to me that anyone using pycbf on py3 would have hit this problem and would have presumably solved it the same way we did, by recompiling with SWIG_PYTHON_STRICT_BYTE_CHAR. It's possible the right solution is to add this flag to the cbflib Makefile.
Yes you can in Debian unstable[1] you can do this
apt install pyton3-pycbf
I agreed that the purpose of pycbf is to read images, so not to return utf8 encoded strings :). This is just that sometime string are real string. I do not know in the pycbf API is there is real strings.
and I discoverd recently that on linux file names are in fact bytestring [2].
So it would be great if the pycbf default would suite our use case. I agreed also that there is problem with pycbf ans Python3 due to this string conversion. so the as_string method should return bytes and not string.
I know nothing about swing in order to solve this issue, but it seems to ma that the real fix should be a little bit more specific that using thi big hamer SWIG_PYTHON_STRICT_BYTE_CHAR :).
[1] https://tracker.debian.org/pkg/cbflib [2] https://unix.stackexchange.com/questions/2089/what-charset-encoding-is-used-for-filenames-and-paths-on-linux
Interesting that debian has ported this. They have an interesting stack of patches.
In any case, they seem to have definitely taken the "don't use SWIG_PYTHON_STRICT_BYTE_CHAR
" approach, which makes the basic string-calls work without .encode
but means that any actual binary data you request gets returned as a mangled unicode string.
It's worth noting that the debian-unstable build is still somewhat broken, and running the pycbf_test4.py
test actually causes python to SEGV - you can get the mangled string, but passing it back into cbf is a bit more dicey. But this test is currently broken in the same way on normal builds, so it's no more broken than others ways of building.
This behaviour made perfect sense before python 3 because "string" was the de-facto binary representation, without getting into more complicated representations like memory buffers.
Fixing this would require fixing cbflib (e.g. https://github.com/yayahjb/cbflib/pull/19 is one proposed way, there are others) AND fixing dxtbx e.g. ndevenish/dxtbx:string_cbf, which now relies on this behaviour.