ReadStat icon indicating copy to clipboard operation
ReadStat copied to clipboard

Add functionality for MR metadata reading from SAV

Open slobodan-ilic opened this issue 1 year ago • 14 comments

This PR adds functionality for reading multiple response metadata from sav files. It's been tested on a simple file that we use for PoC in Crunch.io. It's a work in progress, I'm available for any updates and changes that need to be done to it.

UPDATE: Running this function with a test file succeeds a small portion of the tires. However it throws segfaults on most tries, I can't really track it down, so any guidance on that more than welcome as well.

Here's the example on how I tried testing it:

include <stdlib.h>
#include "readstat.h"

typedef struct
{
    const mr_set_t *sets;
    int count;
} mr_sets_context_t;

int handle_metadata(readstat_metadata_t *metadata, void *ctx)
{
    mr_sets_context_t *mr_ctx = (mr_sets_context_t *)ctx; // Cast to non-const
    mr_ctx->count = readstat_get_multiple_response_sets_length(metadata);
    mr_ctx->sets = readstat_get_mr_sets(metadata);
    return READSTAT_HANDLER_OK;
}

int main(int argc, char *argv[])
{
    if (argc != 2)
    {
        printf("Usage: %s <filename>\n", argv[0]);
        return 1;
    }
    readstat_error_t error = READSTAT_OK;
    readstat_parser_t *parser = readstat_parser_init();
    readstat_set_metadata_handler(parser, &handle_metadata);

    // Processing
    mr_sets_context_t *mr_ctx = malloc(sizeof(mr_sets_context_t));
    error = readstat_parse_sav(parser, argv[1], mr_ctx);
    printf("Found %d records\n", mr_ctx->count);
    for (int i = 0; i < mr_ctx->count; i++)
    {
        printf("MR set %d name: %s\n", i + 1, mr_ctx->sets[i].name);
        printf("type: %c\n", mr_ctx->sets[i].type);
        printf("is dichotomy: %d\n", mr_ctx->sets[i].is_dichotomy);
    }

    // Cleanup
    readstat_parser_free(parser);
    if (error != READSTAT_OK)
    {
        printf("Error processing %s: %d\n", argv[1], error);
        return 1;
    }
    return 0;
}

And here's the example file: simple_alltypes.sav.zip

This is the output from when it succeeds:

➜  ReadStat git:(ISS-229-add-mr-metadata-support-for-sav) ✗ DYLD_LIBRARY_PATH=./.libs ./read_mr_metadata ./simple_alltypes.sav
count: 0
label: 

Final counted value is: 1
count: 24
label: My multiple response set
Found 2 records
MR set 1 name: categorical_array
type: C
is dichotomy: 0
MR set 2 name: mymrset
type: D
is dichotomy: 1

and this one is when it fails (which happens more often):

➜  ReadStat git:(ISS-229-add-mr-metadata-support-for-sav) ✗ DYLD_LIBRARY_PATH=./.libs ./read_mr_metadata ./simple_alltypes.sav
count: 0
label: 

Final counted value is: 1
count: 24
label: My multiple response set
[1]    86961 segmentation fault  DYLD_LIBRARY_PATH=./.libs ./read_mr_metadata ./simple_alltypes.sav

slobodan-ilic avatar Apr 24 '24 09:04 slobodan-ilic

See failing builds also

src/spss/readstat_sav_read.c: In function ‘parse_mr_line’:
src/spss/readstat_sav_read.c:176:51: error: implicit declaration of function ‘isdigit’ [-Wimplicit-function-declaration]
  176 |             for (int i = 0; i < internal_count && isdigit(*next_part); i++) {
      |                                                   ^~~~~~~
src/spss/readstat_sav_read.c:26:1: note: include ‘<ctype.h>’ or provide a declaration of ‘isdigit’
   25 | #include "readstat_zsav_read.h"
  +++ |+#include <ctype.h>
   26 | #endif
src/spss/readstat_sav_read.c: In function ‘readstat_parse_sav’:
src/spss/readstat_sav_read.c:1882:40: error: implicit declaration of function ‘toupper’ [-Wimplicit-function-declaration]
 1882 |                     sv_name_upper[c] = toupper((unsigned char) mr.subvariables[j][c]);
      |                                        ^~~~~~~
src/spss/readstat_sav_read.c:1882:40: note: include ‘<ctype.h>’ or provide a declaration of ‘toupper’
make: *** [Makefile:2419: src/spss/libreadstat_la-readstat_sav_read.lo] Error 1

evanmiller avatar May 04 '24 12:05 evanmiller

@evanmiller I think it's fixed now.

slobodan-ilic avatar May 05 '24 15:05 slobodan-ilic

@slobodan-ilic Thanks for addressing the build issues. However, it looks like CI Fuzzer uncovered a segfault. From a cursory read of the code, it appears that strtol performs an unprotected memory read. There is also a Windows build issue that will need to be addressed.

evanmiller avatar May 05 '24 16:05 evanmiller

Hi @evanmiller, thanks for all the input. We've done a couple of iterations and a bunch of testing on real-life survey data. All of the small bugs are taken care of, no more nasty segfaults, etc. Are you available to do one more round of review and provide some guidance?

slobodan-ilic avatar Jun 11 '24 15:06 slobodan-ilic

Another short update: managed to run fuzzers locally (even though the documentation didn't work in a straightforward path). After managing to produce a crash locally - identified and fixed the bug (which seemed obvious once discovered). Should be in much better shape now.

slobodan-ilic avatar Jun 12 '24 16:06 slobodan-ilic

Hi, CI is still producing a fuzz failure. Also please see the Windows build failure (looks simple).

evanmiller avatar Jun 13 '24 11:06 evanmiller

Hi, CI is still producing a fuzz failure. Also please see the Windows build failure (looks simple).

I've just detected the other fuzz failure as you were writing... Should be fixed now. About the windows tho, are you referring to the errors with readstat_sav_date, the ones that mostly have VS17 in the paths of the files? If so, I've tried reverting my PR back to dev branch, but these errors are still present in the CI. I thought I'd avoid trying to get VS up and running, since I'm on mac.

Maybe I should try opening the PR against master?

update: Well I just tried with master too (as a separate commit which I later deleted). Was the same error about sav date, all red in the appveyor run, but it said the tests passed. Unknown land to me :)

slobodan-ilic avatar Jun 13 '24 17:06 slobodan-ilic

Found another issue with fuzzer, this time it's an OOM. On it, will ping when done.

slobodan-ilic avatar Jun 13 '24 19:06 slobodan-ilic

For Windows I am referring to the failed CI

In file included from src/spss/readstat_sav_read.c:11:
src/spss/readstat_sav_read.c: In function 'parse_mr_counted_value':
src/spss/readstat_sav_read.c:167:55: error: array subscript has type 'char' [-Werror=char-subscripts]
  167 |         for (int i = 0; i < internal_count && isdigit(*(*next_part)); i++) {
      |                                                       ^~~~~~~~~~~~~
src/spss/readstat_sav_read.c: In function 'parse_mr_line':
src/spss/readstat_sav_read.c:210:20: error: array subscript has type 'char' [-Werror=char-subscripts]
  210 |     while (isdigit(*next_part)) {
      |                    ^~~~~~~~~~

evanmiller avatar Jun 14 '24 11:06 evanmiller

I think the spawnv/_spawnv Windows build issue is a problem in master so you don't need to worry about it!

evanmiller avatar Jun 16 '24 10:06 evanmiller

I think the spawnv/_spawnv Windows build issue is a problem in master so you don't need to worry about it!

I was just exploring that failure. Thanks for the message. I just pushed a version that previously had successful cfuzz build. I had a rogue commit somewhere in the middle, and due to the long-lived nature of the PR I had failed to notice it. Anyways, all should be good now, with both the builds and the functionality. I'll need to run a squash, so commits look nicer, and remove obvious commented out code. I can do that today, but the structure shouldn't change. So if you want to go over it and provide further guidance, I'm all ears.

P.S. I implemented some tests in the pyreadstat version of this work, but was not able to implement them here. As far as I can see, the elaborate structure of the tests, filling the buffers and preparing parsers, mostly focuses on data (and not metadata which this PR focuses on). If you have a clear path forward for me to implement this, pls let me know which steps I should take.

slobodan-ilic avatar Jun 16 '24 10:06 slobodan-ilic

I've made the changes with Ragel @evanmiller

I'm not sure how the CI is invoked, but sometimes it starts immediately after a commit, sometimes after an hour, and sometimes doesn't start. So getting the builds right is difficult. But I think I got it in the last commit. Had to add some vcproj files etc. I wasn't able to replace strtok though, I only found strtok_r and strtok_s but the build was messed up, and didn't want to get too deep into that before checking up with you.

Anyways, if you can take a look at my latest changes, and provide feedback (at least if this is moving in the right direction) that would be really awesome.

Learning Ragel was fun :)

P.S. I'd love to add some tests, so any advice on how to move there would also be gr8.

slobodan-ilic avatar Jun 21 '24 16:06 slobodan-ilic

Looking much better and more maintainable with the Ragel code!

As for tests, we currently use a table-driven suite that roundtrips files (writes then reads and checks for expected error values). This approach would require added a multiple-response write API in addition to the read API. Generally writing is easier than reading, at least for purposes of testing, though often there are snares around getting written files to open properly in SPSS.

As for strtok problems I'd need to see the errors. Going full Ragel is one solution, but it's a matter of how you want to spend your time.

evanmiller avatar Jun 22 '24 11:06 evanmiller

@evanmiller I believe I've addressed all your concerns, with the latest one being the renaming of the function for fetching metadata for multiple responses, and the one before that the elimination of strtok (by using pure ragel in parsing).

What's the next step for being able to merge this (i.e. what's the absolute minimum that I have to do on top of this)?

We already started using this in our development, from a forked repo, but would need to use the official repo, the sooner the better.

Thanks!

slobodan-ilic avatar Aug 16 '24 09:08 slobodan-ilic

Thanks. It looks like the CIFuzz GitHub action is failing because one of the dependencies needs to be updated:

Error: This request has been automatically failed because it uses a deprecated version of `actions/upload-artifact: v1`.
Learn more: https://github.blog/changelog/2024-02-13-deprecation-notice-v1-and-v2-of-the-artifact-actions/

I will try to get to this later today (but then dev will need to be merged to your branch)

evanmiller avatar Sep 11 '24 11:09 evanmiller

Please merge dev and we'll see if CI Fuzzer works now

evanmiller avatar Sep 11 '24 17:09 evanmiller

It looks like the fuzzer detected a memory leak

https://github.com/WizardMac/ReadStat/actions/runs/10946324658/job/30395085028?pr=313

evanmiller avatar Sep 19 '24 20:09 evanmiller

Thx, will check over the weekend...

slobodan-ilic avatar Sep 20 '24 12:09 slobodan-ilic

Merged!

evanmiller avatar Sep 22 '24 11:09 evanmiller

wow!!! Thanks @evanmiller ❤️

slobodan-ilic avatar Sep 22 '24 11:09 slobodan-ilic