liac-arff
categorical values represented as '?' are saved to arff file as missing values
Hi, I understand that this may look like an expected behaviour, but it can lead to unexpected results in the following scenario:
- arff file with quoted question marks as categorical values and data, e.g. @attribute feat1 {'?', 'A', 'B', 'C'}
- arff.load() reads those '?' as strings.
- arff.write() (for example after sampling the original data) then writes the '?' from the loaded data without quotes: @attribute feat1 {?, A, B, C}
- arff.load() on that last file interprets ? as a missing value (None).
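The round trip above can be modelled with a toy encoder and decoder. This is only a sketch and not liac-arff's actual code: toy_encode and toy_decode are hypothetical helpers, and they assume that a value is quoted only when it contains whitespace or a comma, and that a bare ? is read back as a missing value.

```python
import re

# Assumption for illustration: '?' is NOT among the characters that
# trigger quoting. Escaping of embedded quotes is omitted for brevity.
_NEEDS_QUOTES = re.compile(r"[\s,]")


def toy_encode(value):
    """Encode one cell: None becomes a bare ?, strings are quoted only if needed."""
    if value is None:
        return "?"
    if _NEEDS_QUOTES.search(value):
        return "'" + value + "'"
    return value


def toy_decode(token):
    """Decode one cell: a bare ? is missing, a quoted token is a string."""
    if token == "?":
        return None
    if len(token) >= 2 and token[0] == "'" and token[-1] == "'":
        return token[1:-1]
    return token


original = ["?", "A", None]  # the first "?" is a real category, not a missing value
written = [toy_encode(v) for v in original]
reloaded = [toy_decode(t) for t in written]

print(written)
print(reloaded)
```

Under these assumptions the category '?' and the missing value None are written identically, so the second load cannot tell them apart.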
See https://github.com/openml/automlbenchmark/pull/209 for a hack implemented locally to prevent this, but this hack also means that it would no longer be possible to represent missing values as ? in arff files saved with the library.
Suggesting to add a param to the arff.dump signature, for example:
def dump(obj, fp, missing_values=[None, '?']):
    pass
allowing the user to call arff.dump(o, f, missing_values=[None]) when ? should not be considered a missing value, and should therefore be quoted.
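One way the proposed parameter could behave is sketched below. It is an illustration under assumptions, not liac-arff's implementation: encode_value is a hypothetical helper, and missing_values is the parameter suggested in this issue rather than an existing option. A value listed in missing_values is written as a bare ?, while a literal '?' string that is not listed gets quoted so it survives a reload.

```python
def encode_value(value, missing_values=(None, "?")):
    """Hypothetical cell encoder honouring a user-supplied list of missing markers."""
    if value in missing_values:
        return "?"  # written bare, so a loader treats it as missing
    text = str(value)
    # Quote a literal question mark, plus anything containing separators.
    if text == "?" or any(c in text for c in " ,\t"):
        return "'" + text + "'"
    return text


row = ["?", "A", None]

# Default: both None and the string "?" count as missing.
default_row = [encode_value(v) for v in row]

# Only None is missing, so the string "?" is kept as a quoted category.
strict_row = [encode_value(v, missing_values=[None]) for v in row]

print(default_row)
print(strict_row)
```

Keeping the default at both None and '?' would preserve the current output for existing callers, while missing_values=[None] opts in to the lossless behaviour.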
I don't see what's wrong with your fix by changing _RE_QUOTE_CHARS. I don't think it should stop missing values working, but I've not tested it.