Enhancement: Support new-line delimited JSON format/JSON lines

Open JohannesOehm opened this issue 2 years ago • 5 comments

Thank you for the great project. It looks very promising to me.

I currently use val df = DataFrame.readJsonStr(File("foo.ndjson").readLines().joinToString(",", "[", "]")) to read newline-delimited JSON files, which works quite well. However, it would be much more convenient if the API offered such a function directly. It would also be nice if it worked directly on InputStreams, because readLines() already reads the entire file into memory.
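To make the request concrete, here is a minimal sketch of the workaround adapted to an InputStream. It uses only the Kotlin standard library; the helper name ndjsonToJsonArray is hypothetical and not part of the DataFrame API. It skips blank lines and reads the stream lazily, but it still assembles the whole JSON array as one string, so it is a convenience wrapper rather than a true streaming reader.

```kotlin
import java.io.ByteArrayInputStream
import java.io.InputStream

/**
 * Hypothetical helper (not part of the DataFrame API): turns newline-delimited
 * JSON read from [input] into a single JSON array string.
 * Blank lines are skipped. The stream is closed when reading finishes.
 */
fun ndjsonToJsonArray(input: InputStream): String =
    input.bufferedReader(Charsets.UTF_8).useLines { lines ->
        lines
            .map { it.trim() }
            .filter { it.isNotEmpty() }
            .joinToString(separator = ",", prefix = "[", postfix = "]")
    }

fun main() {
    // Two records separated by an empty line, as can occur in real files.
    val sample = "{\"a\":1,\"b\":\"x\"}\n\n{\"a\":2,\"b\":\"y\"}\n"
    val json = ndjsonToJsonArray(ByteArrayInputStream(sample.toByteArray()))
    println(json)
    // With the DataFrame library on the classpath, the result could then be
    // passed to the call used in the workaround above:
    // val df = DataFrame.readJsonStr(json)
}
```

For a file, the same helper could be called as ndjsonToJsonArray(File("foo.ndjson").inputStream()).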

JohannesOehm, Jul 08 '22 07:07

Hi! Can you provide a small sample of such JSON?

koperagen, Jul 08 '22 11:07

Sure: foo.ndjson.txt

(I had to change the file extension because of GitHub's restrictions on attachment extensions.)

JohannesOehm, Jul 08 '22 13:07

https://codebeautify.org/json-decode-online

The file does not conform to the JSON spec. Does JSON have a "new-line delimited" variant in its spec? You are reading the whole file and converting it in memory, which is very large and slow. A JSON parser reads a file in many small parts (one buffer at a time).

slavonnet, Nov 27 '22 13:11

Yes, I'm aware; that is true. My file is not valid JSON as a whole; however, this format is commonly used in big data environments. The specification is available here: http://ndjson.org/
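Because each line of such a file is a complete JSON value on its own, the memory concern raised above can be reduced by processing the file in fixed-size batches of lines. The following is only a sketch under that assumption: mapNdjsonBatches and its parameters are hypothetical names, not part of the DataFrame API, and the transform could be something like DataFrame.readJsonStr if the library is available.

```kotlin
import java.io.ByteArrayInputStream
import java.io.InputStream

/**
 * Hypothetical helper: reads newline-delimited JSON from [input] lazily,
 * groups the non-blank lines into batches of at most [batchSize] records,
 * wraps each batch into a JSON array string and applies [transform] to it.
 * Only one batch of lines is held as text at a time.
 */
fun <R> mapNdjsonBatches(
    input: InputStream,
    batchSize: Int,
    transform: (String) -> R
): List<R> =
    input.bufferedReader(Charsets.UTF_8).useLines { lines ->
        lines
            .filter { it.isNotBlank() }
            .chunked(batchSize)
            .map { batch -> transform(batch.joinToString(",", "[", "]")) }
            .toList() // force evaluation before the reader is closed
    }

fun main() {
    val sample = "{\"id\":1}\n{\"id\":2}\n{\"id\":3}\n"
    // The identity transform keeps the example free of external dependencies.
    val batches = mapNdjsonBatches(ByteArrayInputStream(sample.toByteArray()), batchSize = 2) { it }
    batches.forEach(::println)
}
```

With a batch size of 2, the three sample records are split into one array of two records and one array of one record.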

JohannesOehm, Nov 27 '22 14:11