liac-arff

categorical values represented as '?' are saved to arff file as missing values

Open: sebhrusen opened this issue 4 years ago • 1 comment

Hi, I understand that this may look like expected behaviour, but it can lead to unexpected results in the following scenario:

  • An arff file uses quoted question marks as categorical values, both in the attribute declaration and in the data: e.g. @attribute feat1 {'?', 'A', 'B', 'C'}
  • arff.load() reads those '?' as strings.
  • arff.dump() (for example after sampling the original data) then writes the '?' from the loaded data without quotes: @attribute feat1 {?, A, B, C}
  • arff.load() on that last file interprets ? as a missing value (None).
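The round trip above can be sketched with a toy reader and writer. This is a simplified model for illustration only, not liac-arff code: parse_value and format_value_naive are hypothetical helpers, and the quoting rule in the writer is an assumption chosen to mimic a writer that does not treat ? as a character requiring quotes.

```python
def parse_value(token):
    """Toy reader: a bare ? is missing, a quoted token is a plain string."""
    token = token.strip()
    if token == "?":
        return None
    if len(token) >= 2 and token[0] == token[-1] and token[0] in "'\"":
        return token[1:-1]
    return token


def format_value_naive(value):
    """Toy writer: quotes only values containing spaces, commas or quotes."""
    if value is None:
        return "?"
    if any(ch in value for ch in " ,'\""):
        return "'" + value.replace("'", "\\'") + "'"
    return value


first_load = parse_value("'?'")            # read as the string "?"
written = format_value_naive(first_load)   # quotes are dropped on write
second_load = parse_value(written)         # bare ? now reads as missing

print(repr(first_load), repr(written), repr(second_load))
```

In this model the category value survives the first load as a string, but after one write and reload it has become None.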

See https://github.com/openml/automlbenchmark/pull/209 for a hack implemented locally to prevent this. However, this hack also means that it is no longer possible to represent missing values as ? in arff files saved with the library.

I suggest adding a parameter to the arff.dump signature, for example:

def dump(obj, fp, missing_values=(None, '?')):
    # Proposed signature only; a tuple default avoids a shared mutable default.
    pass

This would allow the user to call arff.dump(o, f, missing_values=[None]) when ? should not be treated as a missing value and should therefore be quoted.
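A minimal sketch of how such a parameter might behave at the level of a single value. The encode_value helper and its quoting rule are hypothetical and do not reflect the library's actual implementation; only the missing_values parameter name comes from the proposal above.

```python
def encode_value(value, missing_values=(None, "?")):
    """Hypothetical encoder: anything in missing_values is written as a bare ?."""
    if value in missing_values:
        return "?"
    text = str(value)
    # A literal question mark that is not "missing" has to be quoted,
    # otherwise a reader would take it for a missing value.
    if text == "?" or any(ch in text for ch in " ,'\""):
        return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"
    return text


# Default: the string "?" is treated as missing and written bare.
print(encode_value("?"))
# Opt out: only None is missing, so the string "?" is written quoted.
print(encode_value("?", missing_values=(None,)))
# None is still written as a bare ? in both cases.
print(encode_value(None, missing_values=(None,)))
```

With the default, existing behaviour is unchanged; passing only None keeps a quoted '?' category distinct from a missing value on reload.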

sebhrusen, Dec 04 '20 17:12

I don't see what's wrong with your fix by changing _RE_QUOTE_CHARS. I don't think it should stop missing values working, but I've not tested it.
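To illustrate the point, here is a rough sketch of the regex-based approach. It assumes that the writer checks for None before it applies the quoting regex, which, as noted, is untested. The two patterns are simplified stand-ins, not the library's actual _RE_QUOTE_CHARS, and encode_cell is a hypothetical helper.

```python
import re

# Simplified stand-ins, not the library's actual patterns.
QUOTE_CHARS_ORIGINAL = re.compile(r"[\s,'\"%\\]")
QUOTE_CHARS_PATCHED = re.compile(r"[?\s,'\"%\\]")  # same class plus ?


def encode_cell(value, quote_chars):
    # Assumed order: the missing-value check runs first,
    # so None never reaches the quoting regex.
    if value is None:
        return "?"
    text = str(value)
    if quote_chars.search(text):
        return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"
    return text


print(encode_cell("?", QUOTE_CHARS_ORIGINAL))  # string "?" written bare
print(encode_cell("?", QUOTE_CHARS_PATCHED))   # string "?" written quoted
print(encode_cell(None, QUOTE_CHARS_PATCHED))  # None still written as bare ?
```

Under that assumption, None keeps producing an unquoted ?, while the string '?' can no longer be used as a way of writing a missing value.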

jnothman, Feb 22 '21 11:02