
readCSV fails for *.zip

Open koperagen opened this issue 2 years ago • 3 comments

This file extension is treated in a special way: there's an isCompressed method, and depending on its result readCSV wraps the InputStream. But this doesn't work for *.zip, because the InputStream is always wrapped in a GZIPInputStream. Apparently it's also not enough to just wrap the InputStream in a different class, because ZIP has a more complex structure and you need to call methods of ZipInputStream:

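import java.io.File
import java.util.zip.ZipInputStream
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.io.readCSV

val zipInputStream = ZipInputStream(
    File("data.csv.zip").inputStream(),
    Charsets.UTF_8
)
// Position the stream at the first entry of the archive before reading
zipInputStream.nextEntry
val df1 = DataFrame.readCSV(zipInputStream)
zipInputStream.closeEntry()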

koperagen avatar Oct 14 '23 12:10 koperagen

Another issue is that a file ending with *.gz can actually be a *.tar.gz, and we cannot read it properly without some special handling. So I suggest either supporting it or at least throwing an exception whose message says the file should be a plain gzip archive and not a *.tar.
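
A rough sketch of how the *.tar.gz case could be detected, so that a clear exception can be thrown. It assumes the tar inside uses the POSIX "ustar" format, which stores the ASCII magic "ustar" at byte offset 257 of the first 512-byte header block; older tar variants without that magic would not be caught. The function name looksLikeTarGz is made up for illustration and is not part of DataFrame. Note that it consumes the stream it is given, so the caller would need to reopen the file afterwards.

```kotlin
import java.io.InputStream
import java.util.zip.GZIPInputStream

// Hypothetical helper, not part of the DataFrame API.
// Decompresses the first 512 bytes of a gzip stream and checks for the
// "ustar" magic that POSIX tar headers store at offset 257.
fun looksLikeTarGz(gzipped: InputStream): Boolean {
    val block = ByteArray(512)
    var read = 0
    GZIPInputStream(gzipped).use { gz ->
        // read() may return fewer bytes than requested, so loop until
        // the block is full or the stream ends
        while (read < block.size) {
            val n = gz.read(block, read, block.size - read)
            if (n < 0) break
            read += n
        }
    }
    // The magic occupies bytes 257..261; shorter payloads cannot be tar
    if (read < 262) return false
    return String(block, 257, 5, Charsets.US_ASCII) == "ustar"
}
```

A caller could run this check on a freshly opened stream before parsing and fail with a message saying that tar archives are not supported.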

koperagen avatar Oct 14 '23 12:10 koperagen

After the fix, this needs to be mentioned in the docs.

koperagen avatar Oct 14 '23 12:10 koperagen

There are actually a lot of places where DataFrame assumes a type based on the file extension, but we should avoid that, as a file's extension can be changed while its contents stay the same.
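
As a minimal sketch of what content-based detection could look like: gzip streams start with the bytes 0x1f 0x8b, and ZIP archives with at least one entry start with "PK" followed by 0x03 0x04. The enum and function names below are made up for illustration and are not the DataFrame API; an empty ZIP archive (which starts with a different signature) is not handled here.

```kotlin
import java.io.BufferedInputStream

// Hypothetical names for illustration; not part of the DataFrame API.
enum class DetectedCompression { GZIP, ZIP, NONE }

// Peeks at the first four bytes and rewinds, so the stream can still be
// read from the start afterwards.
fun detectCompression(stream: BufferedInputStream): DetectedCompression {
    val header = ByteArray(4)
    stream.mark(header.size)
    var read = 0
    while (read < header.size) {
        val n = stream.read(header, read, header.size - read)
        if (n < 0) break
        read += n
    }
    stream.reset()

    // gzip magic number: 0x1f 0x8b
    val isGzip = read >= 2 &&
        header[0] == 0x1f.toByte() && header[1] == 0x8b.toByte()
    // ZIP local file header signature: 'P' 'K' 0x03 0x04
    val isZip = read >= 4 &&
        header[0] == 'P'.code.toByte() && header[1] == 'K'.code.toByte() &&
        header[2] == 0x03.toByte() && header[3] == 0x04.toByte()

    return when {
        isGzip -> DetectedCompression.GZIP
        isZip -> DetectedCompression.ZIP
        else -> DetectedCompression.NONE
    }
}
```

Based on the result, the reader could then wrap the stream in a GZIPInputStream, a ZipInputStream (advancing to the first entry), or nothing at all, regardless of what the file is called.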

Jolanrensen avatar Oct 16 '23 09:10 Jolanrensen

Will be solved in the new CSV implementation: "dataframe-csv". I will probably also migrate its new Compression class to the :core module in the future to solve reading zips from other read functions too.

Jolanrensen avatar Nov 25 '24 12:11 Jolanrensen