datatable
datatable copied to clipboard
fread should distinguish between fill set explicitly or not set
Currently the default value for fill is False. However when sep = ' ' we change fill to True. This shouldn't be the case if the user asked fill=False explicitly.
This FR proposes that default value for fill was None, and then fread can auto-detect the most appropriate value for this parameter.
See also test_fread_issues.py::test_issue_R2535x.
So what should be the expected output for
src = "a b 2\nc d 3\n\ne f 4\n"
dt.fread(src, fill=False)
Since skip_blank_lines=False, parsing src should produce an error.
So, according to the test_issue_R2535x, running with skip_blank_lines=False and fill=False should give
| C0 C1 C2
-- + -- -- --
0 | a b 2
1 | c d 3
How Is that the expected output? What happened to the row e f 4\n? :thinking:
Yeah, I would say the test is incorrect. Reading that input with skip_blank_lines=False ought to raise an error
well, it does raise an error because running it on the main branch gives
| C0 C1 C2
-- + -- -- --
0 | a b 2
1 | c d 3
2 | NA NA NA
3 | e f 4
and it fails at assert d3.to_list() == [list("ac"), list("bd"), [2, 3]]. The expected value doesn't seem to be correct, perhaps you can create a PR for it?
One thing that is not clear to me from this issue is what should happen when src = "a b 2\nc d 3\n\ne f 4\n" and we want to read it with skip_blank_lines=False and fill=False. Clearly, we are setting fill to False and we don't want to skip blank lines either, should the output be something like this?
| C0 C1 C2
-- + -- -- --
0 | a b 2
1 | c d 3
2 |
3 | e f 4
Under skip_blank_lines=True any empty line (which i think includes lines containing whitespace only) should be simply skipped. Conversely, when skip_blank_lines=False, an empty line must not be ignored. Its interpretation then depends on other parse parameters. For example, if fill=True, then an empty line can be interpreted as a line of NAs. Or if the file has only one column, then a blank line is just an empty-string value ''. Similarly, when the separator is ' ' (space), and the fill parameter is not specified, then we automatically set fill=True based on the idea that when space is a separator then any NA values at the end of a line are visually indistinguishable from the line being empty (and in fact a text editor might trim those whitespaces when saving the file).
So, in your example
fread("a b 2\nc d 3\n\ne f 4\n", skip_blank_lines=False, fill=False)
should produce an IOError because there is no way to interpret that blank line as valid input.
By "valid input", I mean a data frame, together with some rules for serializing that data into csv, that would ultimately produce the src string.