ReadStat extract_metadata error: not a number: d

I've attached a valid Stata file that fails in both extract_metadata and WizardMac. Unfortunately, I can't give you much more information, as the error is rather opaque.

extract_metadata output

Tested with both 1.1.4 and 1.1.5.

parsing dta file
pass 1 done
parsing dta file
extracting value labels for IDNUM
extracting value labels for PARTICIP
src/bin/extract_metadata.c:59 not a number: d

WizardMac error

Test files

files.zip

Note that this file has been converted to SPSS and it works.

Dec 14 '20 10:12 basgys

Thanks, I'll look at this later today.

Dec 14 '20 13:12 evanmiller

These look like two separate errors. I realize the error messages aren't especially helpful.

The first error is a shortcoming in extract_metadata, which only recognizes one Stata date format (%td). As you may have gathered by now, it's not a particularly robust tool. I can probably add a couple of workarounds to get you going in the right direction.

The second error (in Wizard) is an encoding error of some kind processing the DTA file. The problem also shows up, without an error message, in the SPSS file. Here's a screenshot of the SPSS file, column a1_foccu if you're following at home:

My guess here is that this file uses a system encoding that is not reported in the DTA file, probably Latin-1. Whatever program converted it to SPSS happily crunched through the bad encoding, but ReadStat will refuse. By the way, the file encoding can be supplied manually to the ReadStat library, but the CLI tools do not support this option.

Dec 14 '20 14:12 evanmiller

Hi Evan!

Thanks for spending time looking into this issue.

Yes, extract_metadata is indeed not that robust. However, both readstat and extract_metadata are saving us a lot of time in our project. Our other option is to use cgo to import the readstat package as a library in our Go project, but it would be a significant increase in development time.

Is there a chance to include more flags to the readstat CLI to support encoding? Because if readstat cannot open certain files, I am not exactly sure what use cases it covers.

It would be good for our project to know a bit more about the readstat roadmap, so we can make a sound decision on how we want to continue the development.

Thanks again for being available to discuss those issues.

Cheers

Dec 15 '20 19:12 basgys

Is there a chance to include more flags to the readstat CLI to support encoding? Because if readstat cannot open certain files, I am not exactly sure what use cases it covers.

It would be "only" a few lines of code (plus documentation, ...). The underlying machinery is all there, but the CLI tool needs to call the readstat_set_file_character_encoding C function before processing. DTA is weird because the pre-118 formats don't indicate anywhere in the file what the encoding is; the user is simply expected to gather it from the data owner. (The SPSS and SAS file types all indicate their encoding, and are handled by readstat very well.)

Poking around your file, I believe the encoding is MacRoman. I can point you to a couple places in the source where you can hardcode the encoding if it gets you moving again. I can also lay out the general strategy for adding a CLI flag if someone on your team wants to tackle it.

ReadStat's GitHub issues are as good a roadmap as any. I actively fix bugs reported in the C library and am happy to accept patches to the CLI tools. Other than that I sit around and wait for Stata and SPSS to add new formats, data types, compression schemes, etc in new versions of their software.

Dec 15 '20 19:12 evanmiller

Thanks for being responsive!

Currently, I'm the only developer working on this project, so I would be happy to try implementing this flag on readstat if you point me where to start.

Also, would it be possible to document the different encodings available? Or do you already have a list somewhere? Because I haven't seen it in the documentation.

ReadStat's GitHub issues are as good a roadmap as any. I actively fix bugs reported in the C library and am happy to accept patches to the CLI tools. Other than that I sit around and wait for Stata and SPSS to add new formats, data types, compression schemes, etc in new versions of their software.

Thanks for clarifying where you stand.

Dec 15 '20 20:12 basgys

Currently, I'm the only developer working on this project, so I would be happy to try implementing this flag on readstat if you point me where to start.

Sure, the main function where flag processing happens is here:

https://github.com/WizardMac/ReadStat/blob/master/src/bin/readstat.c#L451

Probably this should be converted to use getopt, but you should be able to figure something out. You'll want to call this function on the parser object:

https://github.com/WizardMac/ReadStat/blob/02562413ded25e920b96bbbfc4d87ed062aacec8/src/readstat.h#L368-L371
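
As a rough illustration of the getopt suggestion, here is a minimal sketch of how an encoding flag could be parsed. The flag names (`--encoding` / `-e`), the `cli_options_t` struct, and `parse_cli` are all hypothetical and not part of ReadStat; the actual call into the library is only indicated in a comment.

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>

/* Hypothetical options struct for illustration; not part of ReadStat. */
typedef struct {
    const char *encoding;  /* NULL means "let the library decide" */
    const char *input;     /* first positional argument, or NULL */
    const char *output;    /* second positional argument, or NULL */
} cli_options_t;

/* Parse argv with getopt_long. Returns 0 on success, -1 on a usage error. */
static int parse_cli(int argc, char **argv, cli_options_t *opts) {
    static const struct option long_opts[] = {
        {"encoding", required_argument, NULL, 'e'},
        {NULL, 0, NULL, 0}
    };
    int c;

    assert(argv != NULL && opts != NULL);
    opts->encoding = opts->input = opts->output = NULL;
    optind = 1;  /* restart scanning so the function can be called more than once */
    opterr = 0;  /* the caller reports usage errors, not getopt */

    while ((c = getopt_long(argc, argv, "e:", long_opts, NULL)) != -1) {
        if (c == 'e') {
            opts->encoding = optarg;
        } else {
            return -1;  /* unknown flag or missing argument */
        }
    }
    if (optind < argc) opts->input = argv[optind++];
    if (optind < argc) opts->output = argv[optind++];

    /* If opts->encoding is set, this is where the CLI would pass it to the
     * parser object using the setter declared in readstat.h (linked above). */
    return opts->input ? 0 : -1;
}
```

With this shape, an invocation such as `readstat --encoding <name> in.dta out.csv` would leave the encoding name in `opts.encoding`, and omitting the flag keeps the current behaviour.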

Also, would it be possible to document the different encodings available?

As the above code snippet indicates, the available encodings are whatever is supported by iconv on your system. The command iconv -l will give you a complete list.
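
To make the iconv dependency concrete, here is a small sketch of converting bytes from a named encoding to UTF-8 with the standard POSIX `iconv_open` / `iconv` / `iconv_close` calls. The helper name `to_utf8` is made up for this example and is not how ReadStat structures its own conversion; which encoding names are accepted depends on the iconv implementation on your system.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Convert `in` (in_len bytes, encoded as `from`) to NUL-terminated UTF-8 in `out`.
 * Returns 0 on success, -1 if the encoding name is unknown, the input is not
 * valid in that encoding, or the output buffer is too small. */
static int to_utf8(const char *from, const char *in, size_t in_len,
                   char *out, size_t out_cap) {
    iconv_t cd;
    char *src = (char *)in;  /* iconv takes char **, but does not modify the input */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left;
    size_t rc;

    assert(from != NULL && in != NULL && out != NULL && out_cap > 0);
    dst_left = out_cap - 1;  /* reserve one byte for the terminator */

    cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1) return -1;  /* encoding name not supported here */

    rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1) return -1;   /* invalid input or output buffer full */

    *dst = '\0';
    return 0;
}
```

Calling `to_utf8("ISO-8859-1", "caf\xE9", 4, buf, sizeof buf)` should yield the UTF-8 bytes for "café". Declaring the same bytes as UTF-8 would typically be rejected, which is the kind of failure that surfaces as "invalid byte sequence".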

Dec 15 '20 21:12 evanmiller

Related to the first error, I had a similar error with Stata dates that use the older "%d" format, which Stata treats as equivalent to "%td" (date-times use "%tc"). A quick fix would be:

modified   src/bin/extract_metadata.c
@@ -119,7 +119,7 @@ static int handle_variable_dta(int index, readstat_variable_t *variable, const c
     if (readstat_variable_get_type_class(variable) == READSTAT_TYPE_CLASS_STRING) {
         type = "STRING";
     } else if (readstat_variable_get_type_class(variable) == READSTAT_TYPE_CLASS_NUMERIC) {
-        if (format && strcmp(format, "%td") == 0) {
+        if (format && (strcmp(format, "%td") == 0 || strcmp(format, "%d") == 0)) {
             type = "DATE";
         } else {
             type = "NUMERIC";

However, this is less than optimal, as there should be separate types (or formats) for dates and date-times. Possibly of less relevance, I note from https://json-schema.org/understanding-json-schema/reference/string.html#built-in-formats that JSON Schema represents dates and date-times as "type" : "string", with "format" : "date" and "format" : "date-time", respectively (I am not certain whether the metadata should follow json-schema).
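
One way the date versus date-time split could look is sketched below. It assumes Stata's documented format prefixes (`%td` and legacy `%d` for daily dates, `%tc` / `%tC` for date-times); the enum, the function names, and the "DATETIME" label are hypothetical and not part of extract_metadata.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical classification for illustration; not part of extract_metadata. */
typedef enum { KIND_NUMERIC, KIND_DATE, KIND_DATETIME } value_kind_t;

/* Classify a Stata display format by its prefix, so that formats carrying a
 * detail suffix (e.g. "%tdD_m_Y") are still recognised. */
static value_kind_t classify_stata_format(const char *format) {
    const char *p;
    if (format == NULL || format[0] != '%') return KIND_NUMERIC;
    p = format + 1;
    if (*p == '-') p++;  /* optional left-alignment flag */
    if (strncmp(p, "tc", 2) == 0 || strncmp(p, "tC", 2) == 0) return KIND_DATETIME;
    if (strncmp(p, "td", 2) == 0 || *p == 'd') return KIND_DATE;
    return KIND_NUMERIC;
}

/* Label to emit in the metadata; "DATETIME" is an assumed name. */
static const char *kind_name(value_kind_t kind) {
    switch (kind) {
        case KIND_DATE:     return "DATE";
        case KIND_DATETIME: return "DATETIME";
        case KIND_NUMERIC:  return "NUMERIC";
    }
    assert(!"unhandled value_kind_t");
    return "NUMERIC";
}
```

Other Stata period formats (weekly, monthly, and so on) fall through to NUMERIC here; whether they deserve their own labels is a separate design question.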

-- Mark

Dec 18 '20 08:12 mclements

@evanmiller

As a side note related to my point on the PR.

// All-or-nothing is probably not the best strategy for data type extraction.
// When SPSS/STATA introduces new types, metadata extraction could fail.
// It would be wiser to simply label the field as "UNKNOWN".
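
A minimal sketch of that fallback, using a stand-in enum rather than the real readstat type classes (all names here are hypothetical): an unrecognised type is labelled "UNKNOWN" and reported to the caller instead of aborting the whole extraction.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the library's type classes; any other value models a type
 * introduced by a future file format. */
typedef enum { TYPE_CLASS_STRING = 0, TYPE_CLASS_NUMERIC = 1 } type_class_t;

/* Map a type class to a metadata label, falling back to "UNKNOWN". */
static const char *type_label(int type_class, const char *format) {
    switch (type_class) {
        case TYPE_CLASS_STRING:
            return "STRING";
        case TYPE_CLASS_NUMERIC:
            return (format && strcmp(format, "%td") == 0) ? "DATE" : "NUMERIC";
        default:
            return "UNKNOWN";
    }
}

/* Write "name:LABEL" into out. Returns 1 if the type was recognised and 0
 * otherwise, so the caller can warn and carry on with the next variable. */
static int describe_variable(const char *name, int type_class, const char *format,
                             char *out, size_t out_cap) {
    const char *label = type_label(type_class, format);
    assert(name != NULL && out != NULL && out_cap > 0);
    snprintf(out, out_cap, "%s:%s", name, label);
    return strcmp(label, "UNKNOWN") != 0;
}
```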

I know that this all-or-nothing strategy brings people here to open an issue, but when I see errors thrown to end-users such as "invalid byte sequence", I believe it is a tradeoff of that design.

Failing at a low level does not help produce meaningful error messages for the end-user. I discussed the encoding issue with the project manager, and he told me they sometimes have encoding issues when they open files in STATA/SPSS and quickly notice weird characters. I think errors like an unknown field or unknown encoding should not result in a low-level error such as "invalid byte sequence".

Most of the time, I think we can let the user decide whether a bad encoding or an unknown data type field is an issue. Sometimes they open files with hundreds of columns, and only a few are interesting for them. So, is it worth it to crash for that?

However, sometimes there is an error that prevents the code from continuing, so in this case, the closer we are (in the level of abstraction) to the end-user, the better the error message.

Dec 18 '20 15:12 basgys

@basgys I don't disagree with you – the current "strategy" is a thin veil over laziness. Designing good error messages and recovery options for cases like this is a hard problem. From the point of view of library design, it would be a good idea to indicate an encoding error to the client application (i.e. return a special value for each erroneous row/column) so that the application could decide whether to bail out, present the error to the user, etc. The current error messages are designed for developers, rather than end users, but we all know how that story ends.

Without better recovery mechanisms in place, my preference has been to err on the side of "ReadStat can't open certain files" rather than "ReadStat sometimes produces incorrect data". But it's just that, a preference.
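
The per-value status idea could be sketched roughly as follows. This is not ReadStat's API: `cell_status_t`, `is_valid_utf8`, and `check_cells` are hypothetical names, and a plain UTF-8 validity check stands in for whatever conversion the library performs. The point is only that each cell gets a status and the application decides what to do with the bad ones.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { CELL_OK, CELL_BAD_ENCODING } cell_status_t;

/* UTF-8 check: rejects stray continuation bytes, truncated sequences,
 * overlong forms, surrogates, and code points above U+10FFFF. */
static int is_valid_utf8(const unsigned char *s, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t n;           /* continuation bytes expected after the lead byte */
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest code point allowed for this length */
        if (b < 0x80)                { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;                  /* not a valid lead byte */
        if (len - i - 1 < n) return 0;  /* sequence is truncated */
        for (size_t k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
        i += n + 1;
    }
    return 1;
}

/* Record a status for every cell instead of aborting on the first bad one.
 * Returns the number of cells flagged as CELL_BAD_ENCODING. */
static size_t check_cells(const char *const *cells, size_t count,
                          cell_status_t *status) {
    size_t bad = 0;
    assert(cells != NULL && status != NULL);
    for (size_t i = 0; i < count; i++) {
        const unsigned char *bytes = (const unsigned char *)cells[i];
        int ok = is_valid_utf8(bytes, strlen(cells[i]));
        status[i] = ok ? CELL_OK : CELL_BAD_ENCODING;
        if (!ok) bad++;
    }
    return bad;
}
```

An application that only needs a few columns could ignore flagged cells elsewhere, while a stricter one could treat any nonzero return value as fatal.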

Dec 18 '20 15:12 evanmiller

I definitely understand why/how this can happen and I don't try to minimise the problem or your efforts. Since you mentioned on the PR that you prefer this all-or-nothing design to bring people here, I just wanted to point out the side effects.

"ReadStat can't open certain files" rather than "ReadStat sometimes produces incorrect data"

I understand this point, but that is assuming the whole content is useful for the end-user. Oftentimes, only some parts are needed.

I think deferring the decision to fail further up in the stack with a flag on a row to say that it is invalid or unsupported is a good start. Quietly failing would bring other problems. It's always a tradeoff.

Thanks for being open to discussing it. I know it can be annoying to have a random developer stopping by to pontificate about how to "correctly" design software.

Dec 18 '20 15:12 basgys

Apologies for the noise: I did not see the pull request. -- Mark

Dec 20 '20 09:12 mclements
