datatable
datatable copied to clipboard
Add parameter `na_string=` to .to_csv()
I encountered an unexpected behavior of to_csv() and fread() regarding its handling of 'NA' string.
When I ran the following code,
import datatable as dt
data = dt.Frame(['a', 'NA'])
print(data[1,0], type(data[1,0]))
data.to_csv("NA-test.csv")
data = dt.fread("NA-test.csv")
print(data[1,0], type(data[1,0]))
I had expected that I would get the same string 'NA <class str>' and 'NA <class str>' from both print statements, because, after all, all I did was just to serialize data to a file and read it back. Unfortunately, the actual result is 'NA <class str>' and 'None <class NoneType>'.
The current behavior is "understandable" given that 'NA' is one of the default strings to represent an NA value. But the combined default behaviors of to_csv() and fread() led to a surprise that I did not expect at all.
I can work around this problem by adding the parameter na_strings=[""] to fread() like the following:
import datatable as dt
data = dt.Frame(['a', 'NA'])
print(data[1,0], type(data[1,0]))
data.to_csv("NA-test.csv")
data = dt.fread("NA-test.csv", na_strings=[""])
print(data[1,0], type(data[1,0]))
But this workaround won't work if I have a true NA value in another column.
Will it be possible to "escape" the output of the string 'NA' in to_csv() by default somehow, so that when it is read back, it remains to be the string 'NA', not the NA value?
The limitation of the CSV format is that it is not type-preserving. For example, if you have frame dt.Frame(["1.5", "2.0"]) of type str32, then serializing it to CSV and reading back will produce a column of type float64. There are even no NAs involved here.
The problem is that there is no notion of "type" in a CSV file; all fields are essentially strings. However, this is not very practical from user's perspective -- which is why we try to guess column types based on their content. The guess is usually good enough, but cannot be always perfect.
And even though I agree that having a round-trip datatable->CSV->datatable produce the same result as the original is a very desirable property to have, I do not think it is possible in practice.
There may be a solution in writing to a modified CSV format -- say, to ARFF or CSVY, or even JSON -- but we don't have those writers implemented yet (however, it's probably not that hard).
@junghoocho saving your dataset to CSV is a must or you can also consider saving it to a binary format? In the latter case you won't have any effects like you describe, because it will save an internal representation of a frame.
For my current need, na_strings = [""], is a good enough workaround because I don't have any NA value in any column. I was quite surprised with this issue when I was writing some data manipulation code, so I just wanted to report what I observed.
With my current needs aside, the main limitation that I perceive is that currently to_csv() does not allow me to specify the string representation of an NA value. If to_csv() takes an optional parameter like na_string="#NAVALUE", I expect it will handle most people's need, since we can pick a particular string that does not appear in our data when we call to_csv() and fread().
Adding the na_string=... parameter into .to_csv() sounds like a good idea.