cctbx_project embeded: pycbf and SWIG_PYTHON_STRICT_BYTE

Hello, while preparing the dials Debian package, which contain cctbx, I discovered this issue.

The pycbf embeded is compiled with SWIG_PYTHON_STRICT_BYTE_CHAR. whereas the official one is not.

due to this difference I have this sort of error message when running the tests.

_______________________________________________________________________________________ test_beam _______________________________________________________________________________________

    def test_beam():
        dxtbx_dir = dxtbx.__path__[0]
    
        image = os.path.join(dxtbx_dir, "tests", "phi_scan_001.cbf")
>       assert BeamFactory.imgCIF(image)

tests/test_beam.py:13: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
model/beam.py:230: in imgCIF
    cbf_handle.read_widefile(cif_file.encode(), pycbf.MSG_DIGEST)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pycbf.cbf_handle_struct; proxy of <Swig Object of type 'handle *' at 0x7fe45de7dc30> >, filename = b'/usr/lib/python3/dist-packages/dxtbx/tests/phi_scan_001.cbf', headers = 8

    def read_widefile(self, filename, headers):
        """
    
        Returns :
        *args   : String filename,Integer headers
    
        C prototype: int cbf_read_widefile (cbf_handle handle, FILE *file, int flags);
    
        CBFLib documentation:
        DESCRIPTION
        cbf_read_file reads the CBF or CIF file file into the CBF object
        specified by handle, using the CIF 1.0 convention of 80 character
        lines. cbf_read_widefile reads the CBF or CIF file file into the CBF
        object specified by handle, using the CIF 1.1 convention of 2048
        character lines. A warning is issued to stderr for ascii lines over
        the limit. No test is performed on binary sections.
        Validation is performed in three ways levels: during the lexical
        scan, during the parse, and, if a dictionary was converted, against
        the value types, value enumerations, categories and parent-child
        relationships specified in the dictionary.
        flags controls the interpretation of binary section headers, the
        parsing of brackets constructs and the parsing of treble-quoted
        strings.
        MSG_DIGEST:               Instructs CBFlib to check that the digest
        of the binary section matches any header digest value. If the digests
        do not match, the call will return CBF_FORMAT. This evaluation and
        comparison is delayed (a  "lazy " evaluation) to ensure maximal
        processing efficiency. If an immediately evaluation is required, see
        MSG_DIGESTNOW, below. MSG_DIGESTNOW:            Instructs CBFlib to
        check that the digest of the binary section matches any header
        digeste value. If the digests do not match, the call will return
        CBF_FORMAT. This evaluation and comparison is performed during
        initial parsing of the section to ensure timely error reporting at
        the expense of processing efficiency. If a more efficient delayed (
        "lazy ") evaluation is required, see MSG_DIGEST, above.
        MSG_DIGESTWARN:           Instructs CBFlib to check that the digest
        of the binary section matches any header digeste value. If the
        digests do not match, a warning message will be sent to stderr, but
        processing will attempt to continue. This evaluation and comparison
        is first performed during initial parsing of the section to ensure
        timely error reporting at the expense of processing efficiency. An
        mismatch of the message digest usually indicates a serious error, but
        it is sometimes worth continuing processing to try to isolate the
        cause of the error. Use this option with caution. MSG_NODIGEST:
              Do not check the digest (default). PARSE_BRACKETS:
        Accept DDLm bracket-delimited [item,item,...item] or
        {item,item,...item} or (item,item,...item) constructs as valid,
        stripping non-quoted embedded whitespace and comments. These
        constructs may span multiple lines. PARSE_LIBERAL_BRACKETS:   Accept
        DDLm bracket-delimited [item,item,...item] or {item,item,...item} or
        (item,item,...item) constructs as valid, stripping embedded
        non-quoted, non-separating whitespace and comments. These constructs
        may span multiple lines. In this case, whitespace may be used as an
        alternative to the comma. PARSE_TRIPLE_QUOTES:      Accept DDLm
        triple-quoted  " " "item,item,...item " " " or
        '''item,item,...item''' constructs as valid, stripping embedded
        whitespace and comments. These constructs may span multiple lines. If
        this flag is set, then ''' will not be interpreted as a quoted
        apoptrophe and  " " " will not be interpreted as a quoted double
        quote mark and PARSE_NOBRACKETS:         Do not accept DDLm
        bracket-delimited [item,item,...item] or {item,item,...item} or
        (item,item,...item) constructs as valid, stripping non-quoted
        embedded whitespace and comments. These constructs may span multiple
        lines. PARSE_NOTRIPLE_QUOTES:    No not accept DDLm triple-quoted  "
        " "item,item,...item " " " or '''item,item,...item''' constructs
        as valid, stripping embedded whitespace and comments. These
        constructs may span multiple lines. If this flag is set, then '''
        will be interpreted as a quoted apostrophe and  " " " will be
        interpreted as a quoted double quote mark.
        CBFlib defers reading binary sections as long as possible. In the
        current version of CBFlib, this means that:
        1. The file must be a random-access file opened in binary mode (fopen
        ( ,
    
        """
>       return _pycbf.cbf_handle_struct_read_widefile(self, filename, headers)
E       TypeError: in method 'cbf_handle_struct_read_widefile', argument 2 of type 'char *'

../pycbf.py:3285: TypeError

I can solve this issue by recompiling the cbflib on Debian with this parameter but since this is not the default of cbflib, I think that this is wrong. All other projects depending on pycbf would be affected by this.

From what I understand, bytes should be used when a string is expected to represent a filepath.

So I am wondering if a way to solve this issue would to use the cbflib upsteam default, and convert the string to bytes when calling pycbf methods which expect a filepath.

I would like your opinion about this.

thanks

Frederic

May 12 '20 13:05 picca

Hi, without reading too carefully the specific issue here, I can point to the commit where I added this flag:

https://github.com/cctbx/dxtbx/commit/44b6d7240ff21e07f06e1972dc471362e8ba57e2

That commit has a lot of explanation that might be helpful.

May 13 '20 18:05 phyy-nx

Ok, if I understand this you changed the string behaviour for this call

image_string = cbf.get_realarray_as_string()

right ?

May 13 '20 18:05 picca

Right, that call gave mangled results on Py3, if I recall correctly.

May 13 '20 19:05 phyy-nx

In that case, maybe the best solution would be to ask pycbf upstream to provide as_bytes instead of as_string for Python3. this is a real array of bytes after all. do you know the upstream of pycbf ?

May 14 '20 07:05 picca

My suspicion is that PyString is used where PyBytes should be used, and simply replacing the former with the latter may resolve the issue. Whether the function is called as_bytes or as_string then doesn't really matter (although I would rename it to as_bytes and leave as_string as a deprecated alias). I have no idea of swig (and modifying a generated file is pointless), so I'll leave the experimentation to someone else.

May 14 '20 07:05 Anthchirp

Indeed it would be necessary to regenerate the swig files after adding a new function, not something I know how to do. Generally @yayahjb is open to pull requests though.

For what it's worth, I don't know what the official pycbf build is you are referring to. Are you saying on Debian one can install pycbf on the command line? What does that look like?

According to the documentation for swig, the default is always to encode strings as UTF-8 on python 3, which cannot be useful for pycbf which reads data files. It seems to me that anyone using pycbf on py3 would have hit this problem and would have presumably solved it the same way we did, by recompiling with SWIG_PYTHON_STRICT_BYTE_CHAR. It's possible the right solution is to add this flag to the cbflib Makefile.

May 14 '20 15:05 phyy-nx

Yes you can in Debian unstable[1] you can do this

apt install pyton3-pycbf

I agreed that the purpose of pycbf is to read images, so not to return utf8 encoded strings :). This is just that sometime string are real string. I do not know in the pycbf API is there is real strings.

and I discoverd recently that on linux file names are in fact bytestring [2].

So it would be great if the pycbf default would suite our use case. I agreed also that there is problem with pycbf ans Python3 due to this string conversion. so the as_string method should return bytes and not string.

I know nothing about swing in order to solve this issue, but it seems to ma that the real fix should be a little bit more specific that using thi big hamer SWIG_PYTHON_STRICT_BYTE_CHAR :).

[1] https://tracker.debian.org/pkg/cbflib [2] https://unix.stackexchange.com/questions/2089/what-charset-encoding-is-used-for-filenames-and-paths-on-linux

May 14 '20 16:05 picca

Interesting that debian has ported this. They have an interesting stack of patches.

In any case, they seem to have definitely taken the "don't use SWIG_PYTHON_STRICT_BYTE_CHAR" approach, which makes the basic string-calls work without .encode but means that any actual binary data you request gets returned as a mangled unicode string.

It's worth noting that the debian-unstable build is still somewhat broken, and running the pycbf_test4.py test actually causes python to SEGV - you can get the mangled string, but passing it back into cbf is a bit more dicey. But this test is currently broken in the same way on normal builds, so it's no more broken than others ways of building.

This behaviour made perfect sense before python 3 because "string" was the de-facto binary representation, without getting into more complicated representations like memory buffers.

Fixing this would require fixing cbflib (e.g. https://github.com/yayahjb/cbflib/pull/19 is one proposed way, there are others) AND fixing dxtbx e.g. ndevenish/dxtbx:string_cbf, which now relies on this behaviour.

Nov 25 '20 17:11 ndevenish

cctbx_project
cctbx_project copied to clipboard

embeded: pycbf and SWIG_PYTHON_STRICT_BYTE_CHAR

cctbx_project cctbx_project copied to clipboard

embeded: pycbf and SWIG_PYTHON_STRICT_BYTE_CHAR

cctbx_project
cctbx_project copied to clipboard