cyavro

cannot read avro files without data

Open • ghukill opened this issue 7 years ago • 1 comment

I just started using cyavro today, and it's wonderful so far. It precisely fills a need to parse a directory of avro files -- quickly -- into a pandas DataFrame.

However, I'm running into a problem with directories that contain avro files without any rows.

The avro files I'm attempting to read by path are generated by Spark. Whether the total number of rows written is 100, 1k, or 100k, Spark splits them into a handful of files. I won't pretend to know exactly why or how, but I fairly commonly see 4 avro files in a given directory.

The python spark code that writes these avro files looks somewhat like this:

.write.format("com.databricks.spark.avro").save('/path/to/avros')

The result is a structure like this:

drwxr-xr-x  12  408B Sep 13 15:22 .
drwxr-xr-x   3  102B Sep 13 15:21 ..
-rw-r--r--   1    8B Sep 13 15:22 ._SUCCESS.crc
-rw-r--r--   1  1.3K Sep 13 15:22 .part-r-00000-a705a70a-5107-4508-a5f4-a4dd309c0c03.avro.crc
-rw-r--r--   1   20B Sep 13 15:22 .part-r-00001-a705a70a-5107-4508-a5f4-a4dd309c0c03.avro.crc
-rw-r--r--   1   20B Sep 13 15:22 .part-r-00002-a705a70a-5107-4508-a5f4-a4dd309c0c03.avro.crc
-rw-r--r--   1   28B Sep 13 15:22 .part-r-00003-a705a70a-5107-4508-a5f4-a4dd309c0c03.avro.crc
-rw-r--r--   1    0B Sep 13 15:22 _SUCCESS
-rw-r--r--   1  164K Sep 13 15:22 part-r-00000-a705a70a-5107-4508-a5f4-a4dd309c0c03.avro
-rw-r--r--   1  1.3K Sep 13 15:22 part-r-00001-a705a70a-5107-4508-a5f4-a4dd309c0c03.avro
-rw-r--r--   1  1.3K Sep 13 15:22 part-r-00002-a705a70a-5107-4508-a5f4-a4dd309c0c03.avro
-rw-r--r--   1  2.2K Sep 13 15:22 part-r-00003-a705a70a-5107-4508-a5f4-a4dd309c0c03.avro

As you can see, one of the avro files, part-r-00000, is 164K and contains the majority (if not all) of the rows. It loads quickly and without issue on its own. But attempting to parse the entire directory with .read_avro_path_as_dataframe fails with the error:

Exception: Can't read file : Cannot read 1 bytes from file

I suspected these "empty" avro files were the cause, and confirmed that attempting to read part-r-00001 or part-r-00002 individually results in the same error. This makes sense if .read_avro_path_as_dataframe is really just opening each file individually and then concatenating the results.
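As a stopgap, the per-file behaviour described above suggests a workaround: read each file separately and skip the ones that raise. The following is only a rough sketch, not anything cyavro provides. The helper name `read_avro_dir_skipping_failures` is hypothetical, and the reader is passed in as a callable so the sketch does not depend on any particular library signature.

```python
import os


def read_avro_dir_skipping_failures(path, read_one, suffix=".avro"):
    """Call read_one(file_path) for every Avro file directly inside `path`.

    Hypothetical helper for illustration. Files whose read raises are
    collected in `skipped` instead of aborting the whole directory.
    Returns (results, skipped), where skipped is a list of
    (file_path, error_message) tuples.
    """
    results, skipped = [], []
    for name in sorted(os.listdir(path)):
        # Ignore Spark/Hadoop bookkeeping files such as _SUCCESS and .*.crc
        if name.startswith((".", "_")) or not name.endswith(suffix):
            continue
        full_path = os.path.join(path, name)
        try:
            results.append(read_one(full_path))
        except Exception as exc:  # the error in this issue is a bare Exception
            skipped.append((full_path, str(exc)))
    return results, skipped
```

With cyavro, `read_one` would presumably be `cyavro.read_avro_file_as_dataframe` (assuming it accepts a single file path), and the returned list could then be combined with `pandas.concat(results, ignore_index=True)`. Catching a bare `Exception` is deliberately broad here and would also hide genuinely corrupt files, so the `skipped` list is worth logging.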

For what it's worth, I have parsed these "empty" avro files successfully with the python avro library, where iterating over the reader just results in nothing.

As mentioned, cyavro looks like a really great solution to our need of quickly parsing a path of avro files into a dataframe, but I'm afraid we can't avoid having these "empty" avro files present as well. Any thoughts would be much appreciated.

OS: Mac OS (will eventually build in Ubuntu 16.04)
Build: conda build, then local install to conda environment

ghukill • Sep 13 '17 19:09

If helpful, here are the bytes of a problematic avro file (I believe it is compressed with the snappy codec):

Obj\x01\x04\x16avro.schema\xd2\x14{"type":"record","name":"topLevelRecord","fields":[{"name":"set","type":[{"type":"record","name":"set","fields":[{"name":"id","type":["string","null"]},{"name":"document","type":["string","null"]},{"name":"setSource","type":[{"type":"record","name":"setSource","fields":[{"name":"queryParams","type":[{"type":"map","values":["string","null"]},"null"]},{"name":"url","type":["string","null"]},{"name":"text","type":["string","null"]}]},"null"]}]},"null"]},{"name":"record","type":[{"type":"record","name":"record","fields":[{"name":"id","type":["string","null"]},{"name":"document","type":["string","null"]},{"name":"setIds","type":[{"type":"array","items":["string","null"]},"null"]},{"name":"recordSource","type":[{"type":"record","name":"recordSource","fields":[{"name":"queryParams","type":[{"type":"map","values":["string","null"]},"null"]},{"name":"url","type":["string","null"]},{"name":"text","type":["string","null"]}]},"null"]}]},"null"]},{"name":"error","type":[{"type":"record","name":"error","fields":[{"name":"message","type":["string","null"]},{"name":"errorSource","type":[{"type":"record","name":"errorSource","fields":[{"name":"queryParams","type":[{"type":"map","values":["string","null"]},"null"]},{"name":"url","type":["string","null"]},{"name":"text","type":["string","null"]}]},"null"]}]},"null"]}]}\x14avro.codec\x0csnappy\x00%\xae\xecs\xfb\xbc`\xf4F\xc7\xf5\x9cL\xf5\x92\xb0

... and base64 encoded:

T2JqAQQWYXZyby5zY2hlbWHSFHsidHlwZSI6InJlY29yZCIsIm5hbWUiOiJ0b3BMZXZlbFJlY29yZCIsImZpZWxkcyI6W3sibmFtZSI6InNldCIsInR5cGUiOlt7InR5cGUiOiJyZWNvcmQiLCJuYW1lIjoic2V0IiwiZmllbGRzIjpbeyJuYW1lIjoiaWQiLCJ0eXBlIjpbInN0cmluZyIsIm51bGwiXX0seyJuYW1lIjoiZG9jdW1lbnQiLCJ0eXBlIjpbInN0cmluZyIsIm51bGwiXX0seyJuYW1lIjoic2V0U291cmNlIiwidHlwZSI6W3sidHlwZSI6InJlY29yZCIsIm5hbWUiOiJzZXRTb3VyY2UiLCJmaWVsZHMiOlt7Im5hbWUiOiJxdWVyeVBhcmFtcyIsInR5cGUiOlt7InR5cGUiOiJtYXAiLCJ2YWx1ZXMiOlsic3RyaW5nIiwibnVsbCJdfSwibnVsbCJdfSx7Im5hbWUiOiJ1cmwiLCJ0eXBlIjpbInN0cmluZyIsIm51bGwiXX0seyJuYW1lIjoidGV4dCIsInR5cGUiOlsic3RyaW5nIiwibnVsbCJdfV19LCJudWxsIl19XX0sIm51bGwiXX0seyJuYW1lIjoicmVjb3JkIiwidHlwZSI6W3sidHlwZSI6InJlY29yZCIsIm5hbWUiOiJyZWNvcmQiLCJmaWVsZHMiOlt7Im5hbWUiOiJpZCIsInR5cGUiOlsic3RyaW5nIiwibnVsbCJdfSx7Im5hbWUiOiJkb2N1bWVudCIsInR5cGUiOlsic3RyaW5nIiwibnVsbCJdfSx7Im5hbWUiOiJzZXRJZHMiLCJ0eXBlIjpbeyJ0eXBlIjoiYXJyYXkiLCJpdGVtcyI6WyJzdHJpbmciLCJudWxsIl19LCJudWxsIl19LHsibmFtZSI6InJlY29yZFNvdXJjZSIsInR5cGUiOlt7InR5cGUiOiJyZWNvcmQiLCJuYW1lIjoicmVjb3JkU291cmNlIiwiZmllbGRzIjpbeyJuYW1lIjoicXVlcnlQYXJhbXMiLCJ0eXBlIjpbeyJ0eXBlIjoibWFwIiwidmFsdWVzIjpbInN0cmluZyIsIm51bGwiXX0sIm51bGwiXX0seyJuYW1lIjoidXJsIiwidHlwZSI6WyJzdHJpbmciLCJudWxsIl19LHsibmFtZSI6InRleHQiLCJ0eXBlIjpbInN0cmluZyIsIm51bGwiXX1dfSwibnVsbCJdfV19LCJudWxsIl19LHsibmFtZSI6ImVycm9yIiwidHlwZSI6W3sidHlwZSI6InJlY29yZCIsIm5hbWUiOiJlcnJvciIsImZpZWxkcyI6W3sibmFtZSI6Im1lc3NhZ2UiLCJ0eXBlIjpbInN0cmluZyIsIm51bGwiXX0seyJuYW1lIjoiZXJyb3JTb3VyY2UiLCJ0eXBlIjpbeyJ0eXBlIjoicmVjb3JkIiwibmFtZSI6ImVycm9yU291cmNlIiwiZmllbGRzIjpbeyJuYW1lIjoicXVlcnlQYXJhbXMiLCJ0eXBlIjpbeyJ0eXBlIjoibWFwIiwidmFsdWVzIjpbInN0cmluZyIsIm51bGwiXX0sIm51bGwiXX0seyJuYW1lIjoidXJsIiwidHlwZSI6WyJzdHJpbmciLCJudWxsIl19LHsibmFtZSI6InRleHQiLCJ0eXBlIjpbInN0cmluZyIsIm51bGwiXX1dfSwibnVsbCJdfV19LCJudWxsIl19XX0UYXZyby5jb2RlYwxzbmFwcHkAJa7sc/u8YPRGx/WcTPWSsA==
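For anyone who wants to check such a file without an Avro library, here is a minimal standard-library sketch that walks the Avro object container header (the 4-byte magic `Obj\x01`, a metadata map, then a 16-byte sync marker, per the Avro specification) and reports how many bytes follow it. The function names are hypothetical helpers, not part of cyavro or the avro package.

```python
import io

MAGIC = b"Obj\x01"   # Avro object container file magic
SYNC_SIZE = 16       # length of the sync marker that ends the header


def read_long(buf):
    """Decode one zigzag-encoded variable-length long from a binary stream."""
    shift = 0
    accum = 0
    while True:
        chunk = buf.read(1)
        if not chunk:
            raise EOFError("unexpected end of data while reading a long")
        byte = chunk[0]
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)


def read_bytes(buf):
    """Read a length-prefixed byte string (used for map keys and values)."""
    length = read_long(buf)
    data = buf.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of data while reading bytes")
    return data


def parse_header(raw):
    """Return (metadata, sync_marker, bytes_after_header) for a container file."""
    buf = io.BytesIO(raw)
    if buf.read(len(MAGIC)) != MAGIC:
        raise ValueError("not an Avro object container file")
    metadata = {}
    while True:
        count = read_long(buf)
        if count == 0:        # a zero count terminates the map
            break
        if count < 0:         # negative count: a block byte size follows
            read_long(buf)
            count = -count
        for _ in range(count):
            key = read_bytes(buf).decode("utf-8")
            metadata[key] = read_bytes(buf)
    sync = buf.read(SYNC_SIZE)
    if len(sync) != SYNC_SIZE:
        raise EOFError("file ends before the sync marker")
    return metadata, sync, len(raw) - buf.tell()
```

Passing `base64.b64decode(...)` of the string above to `parse_header` would be expected to return the `avro.schema` and `avro.codec` entries and, if the dump is complete, zero bytes after the header. That would mean the file is a valid container with no data blocks, which is consistent with the python avro reader yielding nothing.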

ghukill • Sep 18 '17 16:09