datatable fread erroneously guesses sep=' '

https://archive.ics.uci.edu/ml/machine-learning-databases/badges/badges.data

Seems like a reasonable data set that needs better whitespace detection. Similar to the datetime case, here each record is first name, initial, last name, and dt gets confused when the name format changes slightly. As you said, if a human can parse it readily, it should be possible for dt to parse it too with a few extra rules.

tmp/seeds/72552/test_random_1508903108.2884815.stderr:RuntimeError: Line 2 from sampling jump 0 starting "+ David W." has more than the expected 3 fields. Separator 3 occurs at position 11 which is character 3 of the last field: "W. Aha". Consider setting 'comment.char=' if there is a trailing comment to be ignored.

Oct 25 '17 06:10 pseudotensor

The file looks as follows:

+ Naoki Abe
- Myriam Abramson
+ David W. Aha
+ Kamal M. Ali
- Eric Allender
+ Dana Angluin
- Chidanand Apte
+ Minoru Asada
+ Lars Asker
+ Javed Aslam
+ Haralabos Athanassiou
+ Jose L. Balcazar
+ Timothy P. Barber
+ Michael W. Barley
- Cristina Baroglio
+ Peter Bartlett
- Eric Baum
.
.
.

A human would probably identify this as a 2-column file, with the first column containing +/- and the second column containing a person's name. This implies sep=" ", but spaces are also used in the second column, which is unquoted. The file may also be viewed as fixed-width, with the first field having 1 character and the second holding a variable-width name. However, such treatment is problematic because we want to see a gap at least 2 spaces wide between the columns (especially if they are both of "string" type). Lastly, we can interpret this file as having a single string column.

Currently we detect this file as having sep=' '. The logic can be improved by saying that a single space must not separate 2 unquoted string fields.
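A minimal sketch of that rule in plain Python, for illustration only. It is not fread's actual code: the helper `split_line` and the crude `NUMERIC` pattern are assumptions made up for this example, and "string field" is taken to mean "anything that does not look like a plain number".

```python
import re

# Hypothetical, deliberately crude notion of a "numeric" token.
NUMERIC = re.compile(r"[+-]?\d+(\.\d+)?$")

def split_line(line):
    """Split on spaces, except that a single space between two
    non-numeric (string) tokens is kept as part of the field."""
    fields = []
    prev_token = None
    # Each match is (gap before the token, token).
    for gap, token in re.findall(r"( *)([^ ]+)", line.strip(" \n")):
        both_strings = (prev_token is not None
                        and not NUMERIC.match(prev_token)
                        and not NUMERIC.match(token))
        if both_strings and len(gap) == 1:
            fields[-1] += " " + token   # single space inside a string field
        else:
            fields.append(token)        # wider gap or numeric neighbour
        prev_token = token
    return fields

print(split_line("+ David W. Aha"))
print(split_line("1 2 3"))
```

Under these assumptions every badges.data line collapses into one field, while purely numeric rows still split on single spaces.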

Nov 07 '17 20:11 st-pasha

"The logic can be improved by saying that a single space must not separate 2 unquoted string fields."

By that logic, wouldn't the following with a single space between the string fields end up getting parsed as a 2*2 table while it should be a 2*4 table?

+ ab bc ca
- de ef fd

Oct 27 '20 10:10 pradkrish

Yes, the idea is that a single space can be used as a word separator in a sentence, whereas double-space in a sentence is an anomaly. Similar to your example above, consider:

* no to 45
* he is mad

This pretty clearly should be parsed as a 2x1 table rather than 2x4. What makes it clear to us is that there is a sentence structure here, but the parser can see only the punctuation structure. So for the parser, my example is indistinguishable from yours.

Oct 27 '20 16:10 st-pasha

Yeah, I agree. So the logic you proposed, "a single space must not separate 2 unquoted string fields", may solve badges.data, but it will incorrectly parse other datasets like the example I gave. Correct me if I am wrong ...

Oct 27 '20 20:10 pradkrish

Well, we are firmly in the grey area where it isn't at all clear what the "correct parsing of a dataset" is. But yes, the example you gave 3 comments above will be parsed as a 2x1 dataset rather than 2x4. Which is fine as long as the user has the ability to override the default behavior.

Oct 27 '20 20:10 st-pasha

Okay, sorry for reiterating. In the example you gave,

* no to 45
* he is mad

Letting fread autodetect sep on this one will return a 2x4 table. This behaviour can be overridden by providing sep='\n', which results in a 2x1 table. So that flexibility is already baked in. What do you think we need to change within the scope of this task? Thanks.

Oct 28 '20 15:10 pradkrish

So, the issue in this task is that the file "badges.data" cannot be parsed by fread at all. Specifically, when fread tries to scan the file, it detects sep=' ', and then throws a RuntimeError when it sees that the number of fields is different from row to row. A better guess in this case would have been sep='' (i.e. single-column mode), but I'm not sure this possibility is even considered during the sep detection phase.
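To illustrate the failure mode, here is a rough sketch of a separator guesser that does consider single-column mode. It is not how fread works internally; the function `guess_sep`, its candidate list, and the "same field count on every line" criterion are all assumptions for this example.

```python
def guess_sep(lines, candidates=(",", "\t", ";", "|", " ")):
    """Return the first candidate separator that yields the same
    number of fields (more than one) on every non-empty line,
    or None to signal single-column mode."""
    for sep in candidates:
        counts = {len(line.split(sep)) for line in lines if line}
        if len(counts) == 1 and counts != {1}:
            return sep
    return None  # no consistent separator: treat as a single column

badges = [
    "+ Naoki Abe",
    "- Myriam Abramson",
    "+ David W. Aha",
    "+ Kamal M. Ali",
]
print(guess_sep(badges))
```

With the badges.data fragment, a space gives 3 fields on some lines and 4 on others, so the sketch falls back to single-column mode instead of raising. A consistency check alone is not enough, though: a sample where every name has exactly two words would still be guessed as sep=' '.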

Thus, the scope of this issue is to make changes to the sep detection logic such that "badges.data" would be parsed automatically (i.e. without user having to specify sep explicitly).

However, exactly which changes are needed to achieve that goal is a task of its own. In this case it might be good to put together a collection of examples (which we'll turn into tests later on) showing how different fragments need to be parsed under different settings.

For example, the original file

+ Naoki Abe
- Myriam Abramson
+ David W. Aha
+ Kamal M. Ali

should most certainly be detected as a single-column under the default settings.

But what if the file didn't contain people with middle names?

+ Naoki Abe
- Myriam Abramson
+ David Copperfield
+ Kamala Harris

I would prefer that this also be detected as single-column; however, the reasoning in this case might hinge upon the single-space / multi-space distinction.

But what about this:

a b c d
e f g h
i j k l

This looks like a 4-column sample, but only because the individual tokens have the same width and do not make up a sentence. If you replace these letters with small words that do form a sentence (as in the "no to 45" example I posted above), then it suddenly starts looking like a single column again.

An important use case is the following:

2020-10-28 10:22:13
2020-10-29 08:17:59

This clearly should be a single-column file, but it is only clear to us because we recognize these records as date-times. The sep detector, however, should not peek into the content of individual fields. So the logic that would detect this as a single-column file must rely on something else, such as the single- vs double-space distinction.
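A purely whitespace-based version of that distinction could look like the sketch below. The helper `split_on_wide_gaps` is hypothetical and is not part of datatable; it never inspects field content and treats only runs of two or more spaces as column gaps.

```python
import re

def split_on_wide_gaps(text):
    """Split each non-blank line on runs of 2+ spaces only;
    single spaces stay inside the field."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(re.split(r" {2,}", line.strip()))
    return rows

print(split_on_wide_gaps("2020-10-28 10:22:13\n2020-10-29 08:17:59\n"))
print(split_on_wide_gaps("a b c d\ne f g h\n"))
```

The date-time sample comes out as one column per row, as desired, but so does the "a b c d" sample, which shows the trade-off of relying on gap width alone.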

It is also important to consider how the detection logic would change in the presence of parameter fill=True (i.e. allow uneven number of fields per line).

Oct 28 '20 19:10 st-pasha

Relying on the single- vs double-space distinction will solve the examples you gave above, but I am afraid it will end up breaking a lot of other examples that we currently handle. I tried a quick and dirty implementation based on single vs double spaces, and these are some of the failures I saw in our test suite:

src = "ls -l"
src = "a b 2\nc d 3\n\ne f 4\n"
src = "A B C \n10 11 12 \n"
src = "A B\n1 2\n3 4 \n5 6\n"

# Loading any of the above
df = dt.fread(src)

Clearly, at least to me, none of these src values is single-column data. I would love to hear your thoughts on it.

Oct 30 '20 18:10 pradkrish
