krangl

Consider adding option to specify custom NA values while reading a CSV file by passing a list/array of strings

Open harshit3610 opened this issue 4 years ago • 4 comments

As per the API docs, the CSVFormat class from Apache Commons CSV is used to set a custom null value when reading files. I am new to krangl, so I may not know of a simple workaround. Would it be possible to add an option similar to pandas' na_values to make the read operation a little simpler?

harshit3610 • Apr 12 '21 19:04

What about

DataFrame.readCSV(File("foo.csv"), CSVFormat.DEFAULT.withNullString("MISSING"))

?

Technically a dedicated argument could be added, but I'm not sure if this would bloat the method signatures in the long run.

A known limitation of the underlying apache commons API is that you can only provide a single null string and not a collection.
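One way around the single-null-string limit is to normalise the file before it is parsed for real: rewrite every cell that matches any of several NA markers to one sentinel, then pass that sentinel to withNullString. The following is only a sketch under assumptions. The helper name normalizeNaValues and its parameters are hypothetical and not part of krangl or Apache Commons CSV. It assumes comma-separated input that fits in memory and uses only the commons-csv classes that krangl already depends on.

```kotlin
import org.apache.commons.csv.CSVFormat
import org.apache.commons.csv.CSVParser
import org.apache.commons.csv.CSVPrinter
import java.io.StringWriter

// Hypothetical helper (not a krangl API): replaces every cell whose text is
// in naValues with a single sentinel string and returns the rewritten CSV.
fun normalizeNaValues(
    csvText: String,
    naValues: Set<String>,
    sentinel: String = "NA"
): String {
    val out = StringWriter()
    CSVParser.parse(csvText, CSVFormat.DEFAULT).use { parser ->
        CSVPrinter(out, CSVFormat.DEFAULT).use { printer ->
            for (record in parser) {
                // CSVRecord is an Iterable<String>, so each cell can be mapped directly.
                printer.printRecord(record.map { cell -> if (cell in naValues) sentinel else cell })
            }
        }
    }
    return out.toString()
}

fun main() {
    val raw = "id,score\n1,na\n2,Not available\n3,-\n4,7\n"
    // All three markers collapse to "NA"; the header and the "7" are left alone.
    print(normalizeNaValues(raw, setOf("na", "Not available", "-")))
}
```

The normalised text could then be written to a temporary file and read with the readCSV call suggested above, using withNullString("NA"). The cost is a second pass over the data, and the header row is checked like any other row, so a column that happens to be named like an NA marker would be renamed.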

holgerbrandl • Apr 14 '21 18:04

Is there a way to remove Apache Commons as a requirement? Could we provide a mechanism to replace all occurrences of a custom null string with an "NA" value while reading the file? Accepting only a single null string will be a serious limitation in the long run and may affect adoption of the library by enthusiasts. Are there any technical reasons why Apache Commons must be used?

harshit3610 • Apr 15 '21 09:04

Why would we want to replace apache-commons-csv? What would be a better alternative?

I chose apache-commons-csv initially because I could not find a better alternative.

I see the point that having just a single NA string is limiting, but I don't think it's a major problem.

holgerbrandl • Apr 21 '21 20:04

In the pandas API, read_csv accepts multiple custom NA values like this:

pd.read_csv("data.txt", na_values=["na", "Not available", "", "-"])

Many data sets are messy and use several different strings to mark missing data. I was hoping an argument like this (na_values) could be added to krangl's read functions, accepting an array or a list of strings, much as pandas does. Adding Apache Commons as a dependency only to get a single NA string option is too much effort, in my opinion.

harshit3610 • Apr 22 '21 14:04