Add functionality for MR metadata reading from SAV
This PR adds functionality for reading multiple response metadata from sav files. It's been tested on a simple file that we use for PoC in Crunch.io. It's a work in progress, I'm available for any updates and changes that need to be done to it.
UPDATE: Running this function with a test file succeeds a small portion of the tires. However it throws segfaults on most tries, I can't really track it down, so any guidance on that more than welcome as well.
Here's the example on how I tried testing it:
include <stdlib.h>
#include "readstat.h"
typedef struct
{
const mr_set_t *sets;
int count;
} mr_sets_context_t;
int handle_metadata(readstat_metadata_t *metadata, void *ctx)
{
mr_sets_context_t *mr_ctx = (mr_sets_context_t *)ctx; // Cast to non-const
mr_ctx->count = readstat_get_multiple_response_sets_length(metadata);
mr_ctx->sets = readstat_get_mr_sets(metadata);
return READSTAT_HANDLER_OK;
}
int main(int argc, char *argv[])
{
if (argc != 2)
{
printf("Usage: %s <filename>\n", argv[0]);
return 1;
}
readstat_error_t error = READSTAT_OK;
readstat_parser_t *parser = readstat_parser_init();
readstat_set_metadata_handler(parser, &handle_metadata);
// Processing
mr_sets_context_t *mr_ctx = malloc(sizeof(mr_sets_context_t));
error = readstat_parse_sav(parser, argv[1], mr_ctx);
printf("Found %d records\n", mr_ctx->count);
for (int i = 0; i < mr_ctx->count; i++)
{
printf("MR set %d name: %s\n", i + 1, mr_ctx->sets[i].name);
printf("type: %c\n", mr_ctx->sets[i].type);
printf("is dichotomy: %d\n", mr_ctx->sets[i].is_dichotomy);
}
// Cleanup
readstat_parser_free(parser);
if (error != READSTAT_OK)
{
printf("Error processing %s: %d\n", argv[1], error);
return 1;
}
return 0;
}
And here's the example file: simple_alltypes.sav.zip
This is the output from when it succeeds:
➜ ReadStat git:(ISS-229-add-mr-metadata-support-for-sav) ✗ DYLD_LIBRARY_PATH=./.libs ./read_mr_metadata ./simple_alltypes.sav
count: 0
label:
Final counted value is: 1
count: 24
label: My multiple response set
Found 2 records
MR set 1 name: categorical_array
type: C
is dichotomy: 0
MR set 2 name: mymrset
type: D
is dichotomy: 1
and this one is when it fails (which happens more often):
➜ ReadStat git:(ISS-229-add-mr-metadata-support-for-sav) ✗ DYLD_LIBRARY_PATH=./.libs ./read_mr_metadata ./simple_alltypes.sav
count: 0
label:
Final counted value is: 1
count: 24
label: My multiple response set
[1] 86961 segmentation fault DYLD_LIBRARY_PATH=./.libs ./read_mr_metadata ./simple_alltypes.sav
See failing builds also
src/spss/readstat_sav_read.c: In function ‘parse_mr_line’:
src/spss/readstat_sav_read.c:176:51: error: implicit declaration of function ‘isdigit’ [-Wimplicit-function-declaration]
176 | for (int i = 0; i < internal_count && isdigit(*next_part); i++) {
| ^~~~~~~
src/spss/readstat_sav_read.c:26:1: note: include ‘<ctype.h>’ or provide a declaration of ‘isdigit’
25 | #include "readstat_zsav_read.h"
+++ |+#include <ctype.h>
26 | #endif
src/spss/readstat_sav_read.c: In function ‘readstat_parse_sav’:
src/spss/readstat_sav_read.c:1882:40: error: implicit declaration of function ‘toupper’ [-Wimplicit-function-declaration]
1882 | sv_name_upper[c] = toupper((unsigned char) mr.subvariables[j][c]);
| ^~~~~~~
src/spss/readstat_sav_read.c:1882:40: note: include ‘<ctype.h>’ or provide a declaration of ‘toupper’
make: *** [Makefile:2419: src/spss/libreadstat_la-readstat_sav_read.lo] Error 1
@evanmiller I think it's fixed now.
@slobodan-ilic Thanks for addressing the build issues. However, it looks like CI Fuzzer uncovered a segfault. From a cursory read of the code, it appears that strtol performs an unprotected memory read. There is also a Windows build issue that will need to be addressed.
Hi @evanmiller, thanks for all the input. We've done a couple of iterations and a bunch of testing on real-life survey data. All of the small bugs are taken care of, no more nasty segfaults, etc. Are you available to do one more round of review and provide some guidance?
Another short update: managed to run fuzzers locally (even though the documentation didn't work in a straightforward path). After managing to produce a crash locally - identified and fixed the bug (which seemed obvious once discovered). Should be in much better shape now.
Hi, CI is still producing a fuzz failure. Also please see the Windows build failure (looks simple).
Hi, CI is still producing a fuzz failure. Also please see the Windows build failure (looks simple).
I've just detected the other fuzz failure as you were writing... Should be fixed now. About the windows tho, are you referring to the errors with readstat_sav_date, the ones that mostly have VS17 in the paths of the files? If so, I've tried reverting my PR back to dev branch, but these errors are still present in the CI. I thought I'd avoid trying to get VS up and running, since I'm on mac.
Maybe I should try opening the PR against master?
update: Well I just tried with master too (as a separate commit which I later deleted). Was the same error about sav date, all red in the appveyor run, but it said the tests passed. Unknown land to me :)
Found another issue with fuzzer, this time it's an OOM. On it, will ping when done.
For Windows I am referring to the failed CI
In file included from src/spss/readstat_sav_read.c:11:
src/spss/readstat_sav_read.c: In function 'parse_mr_counted_value':
src/spss/readstat_sav_read.c:167:55: error: array subscript has type 'char' [-Werror=char-subscripts]
167 | for (int i = 0; i < internal_count && isdigit(*(*next_part)); i++) {
| ^~~~~~~~~~~~~
src/spss/readstat_sav_read.c: In function 'parse_mr_line':
src/spss/readstat_sav_read.c:210:20: error: array subscript has type 'char' [-Werror=char-subscripts]
210 | while (isdigit(*next_part)) {
| ^~~~~~~~~~
I think the spawnv/_spawnv Windows build issue is a problem in master so you don't need to worry about it!
I think the spawnv/_spawnv Windows build issue is a problem in master so you don't need to worry about it!
I was just exploring that failure. Thanks for the message. I just pushed a version that previously had successful cfuzz build. I had a rogue commit somewhere in the middle, and due to the long-lived nature of the PR I had failed to notice it. Anyways, all should be good now, with both the builds and the functionality. I'll need to run a squash, so commits look nicer, and remove obvious commented out code. I can do that today, but the structure shouldn't change. So if you want to go over it and provide further guidance, I'm all ears.
P.S. I implemented some tests in the pyreadstat version of this work, but was not able to implement them here. As far as I can see, the elaborate structure of the tests, filling the buffers and preparing parsers, mostly focuses on data (and not metadata which this PR focuses on). If you have a clear path forward for me to implement this, pls let me know which steps I should take.
I've made the changes with Ragel @evanmiller
I'm not sure how the CI is invoked, but sometimes it starts immediately after a commit, sometimes after an hour, and sometimes doesn't start. So getting the builds right is difficult. But I think I got it in the last commit. Had to add some vcproj files etc. I wasn't able to replace strtok though, I only found strtok_r and strtok_s but the build was messed up, and didn't want to get too deep into that before checking up with you.
Anyways, if you can take a look at my latest changes, and provide feedback (at least if this is moving in the right direction) that would be really awesome.
Learning Ragel was fun :)
P.S. I'd love to add some tests, so any advice on how to move there would also be gr8.
Looking much better and more maintainable with the Ragel code!
As for tests, we currently use a table-driven suite that roundtrips files (writes then reads and checks for expected error values). This approach would require added a multiple-response write API in addition to the read API. Generally writing is easier than reading, at least for purposes of testing, though often there are snares around getting written files to open properly in SPSS.
As for strtok problems I'd need to see the errors. Going full Ragel is one solution, but it's a matter of how you want to spend your time.
@evanmiller I believe I've addressed all your concerns, with the latest one being the renaming of the function for fetching metadata for multiple responses, and the one before that the elimination of strtok (by using pure ragel in parsing).
What's the next step for being able to merge this (i.e. what's the absolute minimum that I have to do on top of this)?
We already started using this in our development, from a forked repo, but would need to use the official repo, the sooner the better.
Thanks!
Thanks. It looks like the CIFuzz GitHub action is failing because one of the dependencies needs to be updated:
Error: This request has been automatically failed because it uses a deprecated version of `actions/upload-artifact: v1`.
Learn more: https://github.blog/changelog/2024-02-13-deprecation-notice-v1-and-v2-of-the-artifact-actions/
I will try to get to this later today (but then dev will need to be merged to your branch)
Please merge dev and we'll see if CI Fuzzer works now
It looks like the fuzzer detected a memory leak
https://github.com/WizardMac/ReadStat/actions/runs/10946324658/job/30395085028?pr=313
Thx, will check over the weekend...
Merged!
wow!!! Thanks @evanmiller ❤️