datatable icon indicating copy to clipboard operation
datatable copied to clipboard

fread should distinguish between fill set explicitly or not set

Open st-pasha opened this issue 7 years ago • 7 comments

Currently the default value for fill is False. However when sep = ' ' we change fill to True. This shouldn't be the case if the user asked fill=False explicitly.

This FR proposes that default value for fill was None, and then fread can auto-detect the most appropriate value for this parameter.

See also test_fread_issues.py::test_issue_R2535x.

st-pasha avatar Mar 02 '18 21:03 st-pasha

So what should be the expected output for

src = "a b 2\nc d 3\n\ne f 4\n"
dt.fread(src, fill=False)

pradkrish avatar Oct 05 '20 15:10 pradkrish

Since skip_blank_lines=False, parsing src should produce an error.

st-pasha avatar Oct 05 '20 16:10 st-pasha

So, according to the test_issue_R2535x, running with skip_blank_lines=False and fill=False should give

   | C0  C1  C2
-- + --  --  --
 0 | a   b    2
 1 | c   d    3

How Is that the expected output? What happened to the row e f 4\n? :thinking:

pradkrish avatar Nov 05 '20 19:11 pradkrish

Yeah, I would say the test is incorrect. Reading that input with skip_blank_lines=False ought to raise an error

st-pasha avatar Nov 05 '20 20:11 st-pasha

well, it does raise an error because running it on the main branch gives

  | C0  C1  C2
-- + --  --  --
 0 | a   b    2
 1 | c   d    3
 2 | NA  NA  NA
 3 | e   f    4

and it fails at assert d3.to_list() == [list("ac"), list("bd"), [2, 3]]. The expected value doesn't seem to be correct, perhaps you can create a PR for it?

pradkrish avatar Nov 05 '20 20:11 pradkrish

One thing that is not clear to me from this issue is what should happen when src = "a b 2\nc d 3\n\ne f 4\n" and we want to read it with skip_blank_lines=False and fill=False. Clearly, we are setting fill to False and we don't want to skip blank lines either, should the output be something like this?

   | C0  C1  C2
-- + --  --  --
 0 | a   b    2
 1 | c   d    3
 2 | 
 3 | e   f    4

pradkrish avatar Nov 06 '20 10:11 pradkrish

Under skip_blank_lines=True any empty line (which i think includes lines containing whitespace only) should be simply skipped. Conversely, when skip_blank_lines=False, an empty line must not be ignored. Its interpretation then depends on other parse parameters. For example, if fill=True, then an empty line can be interpreted as a line of NAs. Or if the file has only one column, then a blank line is just an empty-string value ''. Similarly, when the separator is ' ' (space), and the fill parameter is not specified, then we automatically set fill=True based on the idea that when space is a separator then any NA values at the end of a line are visually indistinguishable from the line being empty (and in fact a text editor might trim those whitespaces when saving the file).

So, in your example

fread("a b 2\nc d 3\n\ne f 4\n", skip_blank_lines=False, fill=False)

should produce an IOError because there is no way to interpret that blank line as valid input.

By "valid input", I mean a data frame, together with some rules for serializing that data into csv, that would ultimately produce the src string.

st-pasha avatar Nov 09 '20 21:11 st-pasha