pyreadstat Add support for reading Multiple Response from sav (WIP, DO NOT MERGE)

This PR implements reading multiple response metadata from sav format. It's intended to be based on a change to the readstat library. The code is included here for convenience, but should eventually be rebased, once the readstat lib is updated.

I'll keep posting here regularly, as updates happen on the readstat side.

Here's also an example on how to use the new functionality:

[ins] In [10]: import pyreadstat
[ins] In [11]: _, metadata = pyreadstat.read_sav('./simple_alltypes.sav')
[ins] In [12]: metadata.mr_sets

which gives the following output:

{
  'categorical_array': {
    'type': 'C',
    'is_dichotomy': False,
    'counted_value': None,
    'label': '',
    'variable_list': ['ca_subvar_1', 'ca_subvar_2', 'ca_subvar_3']
 },
 'mymrset': {
    'type': 'D',
    'is_dichotomy': True,
    'counted_value': 1,
    'label': 'My multiple response set',
    'variable_list': ['bool1', 'bool2', 'bool3']
  }
}

you can test it with the following (very basic) file: simple_alltypes.sav.zip but you have to unzip it (GitHub won't allow direct upload of the *.sav extension).

Apr 24 '24 07:04 slobodan-ilic

Hey @slobodan-ilic, tried to install your version with poetry and got these two errors:

  ./src/spss/readstat_sav_read.c:1878:40: error: call to undeclared library function 'toupper' with type 'int (int)'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
                      sv_name_upper[c] = toupper((unsigned char) mr.subvariables[j][c]);

  ./src/spss/readstat_sav_read.c:176:51: error: call to undeclared library function 'isdigit' with type 'int (int)'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
              for (int i = 0; i < internal_count && isdigit(*next_part); i++) {

The exact command: poetry add git+ssh://[email protected]:slobodan-ilic/pyreadstat.git#ISS-25-support-mr-in-sav

May 02 '24 10:05 joaquimadraz

@joaquimadraz fixed this, and also some other issues... Will report here when I update the readstat as well...

May 07 '24 17:05 slobodan-ilic

hi there, thanks a lot for this PR, highly appreciated! I am not familiar with this SPSS feature. It would be super nice if at some point (before final merge) you could provide a SPSS file with a toy example of this Multiple response, and even better a test to be added to the unit test file.

PS: I see that in the corresponding issue there is a sample file, so I guess we can use that one.

May 21 '24 15:05 ofajardo

Hi @ofajardo , thanks for your comment! This work is likely to go through at least 1 or 2 more changes, where I'd need to make it more in-line with the rest of the design of readstat library. I will add the test file here, but you can use the one from the issue in the meantime, to test. You can also do some comparison with pyspssio library, they should provide very similar results with regards to the MR data.

May 21 '24 15:05 slobodan-ilic

Hi @ofajardo! 👋

We've done some work on this, fixed bugs and tested on bunch of real-life survey data. We also have our pr on readstat ready. Waiting for Evan to provide another round of 👀 on that.

Would you be able to provide some more info about the process your releases take? Do you always need to be in sync with readstat? How are the files copied over to pyreadstat, is it manually or a submodule or something else? Any info is more than welcome, so we can plan our roadmap accordingly (whether we fork or wait basically).

All the best!

@joaquimadraz fyi ☝️

Jun 11 '24 16:06 slobodan-ilic

hi, thanks a lot for working on this!

What I would like to do is the following:

Merging your PR into a new branch on the repo.

I would take a look at your code and test what you did (sorry, have not had time for that yet).
Then I would like to get it through the CD/CI pipeline to make sure that the code runs smoothly on all supported platforms, wheels are built correctly etc.

Once Evan merges into Readstat (meaning the code is final), I would be happy to merge the branch into the main pyreadstat branch and make a release, no need to wait for him to release on Readstat first. Later I would do another release once he releases, if he has added more features.
These are the two basic things that have to happen. I would be fine with whatever order they happen, i.e. if you prefer to wait until Evan finishes reviewing and merges, and then we start here, or if you prefer maybe certain things can already be initiated, like testing the code on the CI/CD.
I copy Readstat files manually to the src folder in pyreadstat, this is to make possible that people can do a pip install from the github repo or from the source package in pypi, a submodule approach would make that difficult (or not possible).

Hope that makes it clearer. Let me know if you have further questions!

Jun 12 '24 07:06 ofajardo

@ofajardo Hi, thanks for the quick response. I'd definitely like to go forward with CI path first, as we might discover new issues more quickly this way. I guess the communication on the readstat is going to happen with higher latency than here.

Thanks for the explanation about copying files, that's what I had supposed.

Btw, do you know how to run the fuzzy tests locally? I had tried doing that (on readstat lib) but didn't have success, even after following instructions from Evan's repo. I see that the CI there relies on VS17, which seems old (and can't run it since I'm on mac).

Any advice is really welcome! Thanks!

Jun 12 '24 14:06 slobodan-ilic

@ofajardo Hi, thanks for the quick response. I'd definitely like to go forward with CI path first, as we might discover new issues more quickly this way. I guess the communication on the readstat is going to happen with higher latency than here.

Thanks for the explanation about copying files, that's what I had supposed.

Btw, do you know how to run the fuzzy tests locally? I had tried doing that (on readstat lib) but didn't have success, even after following instructions from Evan's repo. I see that the CI there relies on VS17, which seems old (and can't run it since I'm on mac).

Any advice is really welcome! Thanks!

Ok, through combination of perseverance and GPT, I was able to build and run fuzzers locally, even though the instructions on readstat don't seem to be applicable to (at least my own) setup. Found another bug and fixed it (since it was immediately clear what it was). Now it's in better shape then it was :)

Jun 12 '24 16:06 slobodan-ilic

Short update: After fixing all CI issues in readstat, Evan did a thorough review. Other than a couple minor malloc changes, it seemed as if it's gonna be finished quick, until I saw the proposition to completely rewrite the parsing part in ragel. I've been on it ever since, should be done soon-ish.

Jun 20 '24 08:06 slobodan-ilic

hi

would you mind adding a few short lines in this file describing the new field in the metadata object? also subfields if applicable.

Regarding trying the pyreadstat CI/CD ... is this something you want to do on your side, or you prefer me doing it? If me, I would need to merge this PR to a dedicated new branch to launch the CI/CD pointing there. We can wait a bit also until everything is ready on your side.

Jul 02 '24 09:07 ofajardo

Hi @ofajardo , we've finally managed to get the PR merged on the ReadStat side: https://github.com/WizardMac/ReadStat/pull/313 ! Lmk how we can proceed about merging the functionality into pyreadstat as well... Any outstanding issues that need to be addressed? I haven't worked on this in a while :)

Sep 22 '24 13:09 slobodan-ilic

Hi @slobodan-ilic , congratulations!

I have pushed a couple of minor changes into the dev branch. Please merge those into your PR to make sure there is no conflict. Please add a few short lines in this file describing the new field in the metadata object? also subfields if applicable. I will rerender the documentation later. Once these two minor points are solved, I will merge into a new branch created from dev. I will test the CI/CD pipeline. If it fails we have to work on that. If success, I will merge into dev and then into master and do the release. Thanks!

Sep 22 '24 14:09 ofajardo

Thanks @ofajardo ! I've rebased my branch on dev branch (with your latest changes). Also added the explanation about the new metadata field. Pls lmk if anything needs my attention!

Sep 23 '24 09:09 slobodan-ilic

Thanks! I have merged into the branch dev_mrset. I will go now to test the CI/CD pipeline!

Sep 23 '24 09:09 ofajardo

Thanks @ofajardo !

Sep 23 '24 11:09 slobodan-ilic

good news! the CI/CD was successful for all the 25 runs! I have now merged into the dev branch. I think everything is ready for a release, except that right now the conda packages are failing in the current version 1.2.7 because of some migration they are having. I am trying to get that solved first, once it is solved (they told me it will take some days), I will do the release of the new version 1.2.8 that includes this new feature.

Sep 23 '24 13:09 ofajardo