reciprocalspaceship
Add Support for Reading DIALS `refl` Files
As we discussed extensively on the DIALS Slack channel, it is now relatively easy to parse DIALS `.refl` files without cctbx/DIALS. Newer versions of DIALS encode reflection tables using `msgpack`, which seems a relatively innocuous dependency to add.
To this end, @ndevenish has built a parser that decodes refl tables using `numpy`. It's nearly complete but may be missing column types. We can find a full list of types in this block. It should be easy to build this into the `rs.io` submodule, as I've done here for example.
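As a rough illustration of the numpy-based decoding idea, here is a minimal sketch that turns one column's raw bytes into an array. The type names, the `COLUMN_TYPES` table, and the `decode_column` helper are assumptions made for illustration; they are not taken from @ndevenish's parser, and the real DIALS layout may differ.

```python
import numpy as np

# Hypothetical mapping from a column type name to (dtype, per-row shape).
# The names are illustrative; the actual DIALS type list may differ.
COLUMN_TYPES = {
    "double": (np.float64, ()),
    "int": (np.int32, ()),
    "bool": (np.bool_, ()),
    "vec3<double>": (np.float64, (3,)),
}


def decode_column(type_name, n_rows, raw):
    """Interpret a raw byte buffer as an (n_rows, ...) numpy array."""
    dtype, row_shape = COLUMN_TYPES[type_name]
    flat = np.frombuffer(raw, dtype=dtype)  # zero-copy, read-only view
    return flat.reshape((n_rows,) + row_shape)


# Two rows of a three-component vector column.
raw = np.arange(6, dtype=np.float64).tobytes()
xyz = decode_column("vec3<double>", 2, raw)
```

Because `np.frombuffer` returns a read-only view of the buffer, a loader built this way would probably want to copy before handing arrays to users.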
There remains the issue of DIALS reflection tables potentially containing some fairly exotic objects (shoeboxes, vectors, matrices). The safest (sadly slowest) thing to do for a first pass is to just default them to objects. We can think about clever solutions later.
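A minimal sketch of the "default them to objects" fallback, assuming each exotic value has already been decoded into some Python object. The helper name `as_object_column` is hypothetical.

```python
import numpy as np


def as_object_column(values):
    """Pack arbitrary per-row values into a 1-D object array."""
    # Pre-allocate and fill element by element: np.array(values) would
    # instead try to stack equally shaped arrays into one N-D array.
    out = np.empty(len(values), dtype=object)
    for i, value in enumerate(values):
        out[i] = value
    return out


# Two 3x3 matrices stored as one object per row.
matrices = as_object_column([np.eye(3), np.zeros((3, 3))])
```

This keeps one entry per reflection regardless of what the entry is, at the cost of losing vectorized operations on that column.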
Parsing legacy `pickle`-based reflection tables is an open question. For the time being, I think we just can't support them. @ndevenish suggests looking here for clues, though.
@JBGreisman, let's chat about this early next week and get it up and running. I think this is already mostly there!
@PrinceWalnut, you may want to help here too.
I've updated it to handle the missing types and added a basic pytest test (you can run `refl_loader.py --write-test` inside a cctbx environment to write its test file). `std::string` is also supposed to be handled by the msgpack writer but is rather broken (https://github.com/dials/dials/issues/1858), so I'm pretty sure that's not "in the wild" anywhere.
Wacky. We'll certainly be skipping any string columns, for now at least...
This looks like a pretty solid start. There are a few columns that can be stored in a `DataFrame`/`DataSet` but cannot be written to an MTZ file. I think a decision we will have to make is whether those columns should be skipped, or whether they should be parsed and included in the `DataSet`.
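One way to frame that decision is to parse everything and separate the columns afterwards. The sketch below uses plain pandas and treats "numeric dtype" as a stand-in for "could be written to MTZ", which is an assumption rather than the actual rule; `split_mtz_compatible` is a hypothetical helper, not part of `rs`.

```python
import numpy as np
import pandas as pd


def split_mtz_compatible(df):
    """Split a frame into numeric columns and everything else."""
    numeric = df.select_dtypes(include="number")
    other = df.drop(columns=numeric.columns)
    return numeric, other


table = pd.DataFrame(
    {
        "intensity": [10.5, 3.2],
        "shoebox": [np.zeros((2, 2)), np.ones((2, 2))],  # object column
    }
)
writable, extra = split_mtz_compatible(table)
```

With this split, a writer could emit only `writable` while the full table stays available in memory.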
It is always possible to add new dtypes if that would improve the behavior of anything (for strings, pandas already has us covered: `StringDtype`). It is also possible to add multidimensional numpy arrays as columns, but they seem to really end up more as "lists of arrays".
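A small pandas sketch of both points, using made-up values: a string column backed by `StringDtype`, and a vector column that ends up as an object column holding one array per row.

```python
import numpy as np
import pandas as pd

# String column using the pandas extension dtype.
labels = pd.Series(["a", "b"], dtype="string")

# A (2, 3) array split into rows: pandas stores one ndarray per element.
vectors = pd.Series(list(np.zeros((2, 3))))

print(labels.dtype)
print(vectors.dtype)
```

The second series has `object` dtype, so numpy operations have to be applied row by row or after stacking the rows back together with `np.stack`.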