parquet-python
Two different errors when reading two different files
I'm using parquet on Windows 10, and I have two different parquet files for testing: one snappy-compressed, one uncompressed.
Simple test code for reading:
import parquet

with open(filename, 'r') as f:
    for row in parquet.reader(f):
        print row
The uncompressed file throws this error:
File "E:/PythonDir/Diverses/DataTest.py", line 23, in <module>
for row in parquet.reader(f):
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 426, in reader
dict_items)
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 275, in read_data_page
raw_bytes = _read_page(fo, page_header, column_metadata)
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 244, in _read_page
page_header.uncompressed_page_size)
AssertionError: found 87 raw bytes (expected 367)
Reading the compressed file like that gives:
File "E:/PythonDir/Diverses/DataTest.py", line 23, in <module>
for row in parquet.reader(f):
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 393, in reader
footer = _read_footer(fo)
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 71, in _read_footer
footer_size = _get_footer_size(fo)
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 64, in _get_footer_size
tup = struct.unpack("<i", fo.read(4))
error: unpack requires a string argument of length 4
I can open both files with fastparquet 0.0.5 just fine, so there's nothing wrong with the files themselves.
What am I doing wrong? Do I have to explicitly decompress the data with snappy, or does parquet do that by itself? And could you generally provide some more documentation on basic usage?
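For context: parquet-python is expected to handle SNAPPY pages itself by delegating to the python-snappy package, so no manual decompression should be needed by the caller. A minimal check that the codec it relies on is actually importable and working (an illustrative snippet, not from this thread):

    # Round-trip through python-snappy; parquet calls this codec
    # internally when it encounters SNAPPY-compressed pages.
    import snappy

    assert snappy.decompress(snappy.compress(b"hello")) == b"hello"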
Hi @Khris777 - what version of Python are you on?
I'm using Python 2.7.
@Khris777 can you try opening the files in binary mode? I.e., with open(filename, 'rb')?
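For background, both failures above are consistent with text-mode I/O on Windows, where reading with 'r' translates line endings and treats the 0x1A byte as end-of-file, so binary data comes back shorter and mangled. A minimal sketch of the effect (illustrative, using a hypothetical scratch file):

    # Write 400 bytes containing CR/LF pairs and Ctrl-Z (0x1a) bytes.
    with open("scratch.bin", "wb") as f:
        f.write(b"\x00\x1a\r\n" * 100)

    # Text mode on Windows collapses b"\r\n" to "\n" and stops at 0x1a,
    # so far fewer than 400 bytes come back -- compare the
    # "found 87 raw bytes (expected 367)" assertion above.
    with open("scratch.bin", "r") as f:
        print(len(f.read()))

    # Binary mode returns all 400 bytes unchanged.
    with open("scratch.bin", "rb") as f:
        print(len(f.read()))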
Using binary mode leads to the script not finishing at all.
It doesn't lock up, it just runs on and on. The two files are both less than 1 MB, so this is odd.
When I kill the process after several minutes, it throws the usual KeyboardInterrupt and prints the line it was on; the output varies. Some examples:
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 426, in reader
dict_items)
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 344, in read_data_page
dict_values_io_obj, bit_width, len(dict_values_bytes))
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\encoding.py", line 213, in read_rle_bit_packed_hybrid
debug_logging = logger.isEnabledFor(logging.DEBUG)
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\logging\__init__.py", line 1366, in isEnabledFor
return level >= self.getEffectiveLevel()
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\logging\__init__.py", line 1355, in getEffectiveLevel
if logger.level:
===============================
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 426, in reader
dict_items)
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 343, in read_data_page
values = encoding.read_rle_bit_packed_hybrid(
===============================
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 426, in reader
dict_items)
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 344, in read_data_page
dict_values_io_obj, bit_width, len(dict_values_bytes))
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\encoding.py", line 222, in read_rle_bit_packed_hybrid
while io_obj.tell() < length:
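As an aside, instead of interrupting by hand, the hot loop can be sampled in-process; a sketch using only the standard library (illustrative, with a hypothetical dump interval):

    import sys
    import threading
    import traceback

    def dump_stacks(interval=10.0):
        # Print the current stack of every thread, then schedule the next dump.
        for thread_id, frame in sys._current_frames().items():
            print("Thread %d:" % thread_id)
            traceback.print_stack(frame)
        timer = threading.Timer(interval, dump_stacks, args=(interval,))
        timer.daemon = True  # don't keep the process alive just for sampling
        timer.start()

Calling dump_stacks() before the read loop would print samples like the ones above every few seconds.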
Any chance you could try to reproduce this on the 1.2 release that was just published?
I will once I figure out why the latest python-snappy version fails to install.
Okay, here's where things stand now.
I installed parquet and snappy into my Python 3.6 environment, and there parquet works flawlessly: I can read everything just as I can with fastparquet. It was a fresh install, fetching a precompiled snappy wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/ and the latest parquet with pip.
On Python 2.7, however, it still doesn't work. I updated the parquet package normally with pip after also installing the precompiled snappy wheel for 2.7.
I have the same data in three different formats: uncompressed, snappy-compressed, and gzip-compressed. All three always throw the same error, so it doesn't seem to be a compression problem.
My testing code:
import parquet

r1 = []
filename = "E:\\Temp\\uncompressedParquetFile.parquet"
with open(filename, 'rb') as f:
    for row in parquet.reader(f):
        r1.append(row)
throws this error:
Traceback (most recent call last):
  File "<ipython-input-9-bb9230901f59>", line 1, in <module>
    runfile('E:/PythonDir/Diverses/parquetTest.py', wdir='E:/PythonDir/Diverses')
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)
  File "E:/PythonDir/Diverses/parquetTest.py", line 22, in <module>
    for row in parquet.reader(f):
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 426, in reader
    dict_items)
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 344, in read_data_page
    dict_values_io_obj, bit_width, len(dict_values_bytes))
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\encoding.py", line 227, in read_rle_bit_packed_hybrid
    res += read_bitpacked(io_obj, header, width, debug_logging)
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\encoding.py", line 146, in read_bitpacked
    b = raw_bytes[current_byte]
IndexError: list index out of range
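For context, read_bitpacked walks a byte buffer while unpacking fixed-width integers, so an overstated header count or a too-short buffer runs exactly into this IndexError. A simplified sketch of that decoding loop (not parquet-python's actual code):

    def read_bitpacked(raw_bytes, count, width):
        # Unpack `count` integers of `width` bits each, least significant
        # bits first, from a list of byte values.
        values = []
        buf, bits, i = 0, 0, 0
        mask = (1 << width) - 1
        while len(values) < count:
            while bits < width:
                buf |= raw_bytes[i] << bits  # IndexError once `i` runs past
                i += 1                       # the end of a short buffer
                bits += 8
            values.append(buf & mask)
            buf >>= width
            bits -= width
        return values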
Without binary mode (with open(filename, 'r') as f:) it's this error:
Traceback (most recent call last):
  File "<ipython-input-10-bb9230901f59>", line 1, in <module>
    runfile('E:/PythonDir/Diverses/parquetTest.py', wdir='E:/PythonDir/Diverses')
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)
  File "E:/PythonDir/Diverses/parquetTest.py", line 22, in <module>
    for row in parquet.reader(f):
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 393, in reader
    footer = _read_footer(fo)
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 78, in _read_footer
    fmd.read(pin)
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\thrift.py", line 112, in read
    iprot.read_struct(self)
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 267, in read_struct
    val = self.read_val(ftype, fspec)
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 299, in read_val
    result.append(self.read_val(v_type, v_spec))
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 335, in read_val
    self.read_struct(obj)
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 267, in read_struct
    val = self.read_val(ftype, fspec)
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 299, in read_val
    result.append(self.read_val(v_type, v_spec))
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 335, in read_val
    self.read_struct(obj)
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 267, in read_struct
    val = self.read_val(ftype, fspec)
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 335, in read_val
    self.read_struct(obj)
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 267, in read_struct
    val = self.read_val(ftype, fspec)
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 335, in read_val
    self.read_struct(obj)
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 250, in read_struct
    fname, ftype, fid = self.read_field_begin()
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 181, in read_field_begin
    return None, self._get_ttype(type), fid
  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 134, in _get_ttype
    return TTYPES[byte & 0x0f]
KeyError: 14
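The KeyError here comes from thriftpy reading a type nibble (14) that doesn't exist in the Thrift compact protocol, i.e. the footer bytes it was handed are already corrupt, which again points at text-mode reads. A quick way to sanity-check a file's footer (an illustrative helper, not part of parquet-python):

    import struct

    def footer_info(path):
        # A Parquet file ends with a 4-byte little-endian footer length
        # followed by the magic bytes b"PAR1".
        with open(path, "rb") as f:
            f.seek(-8, 2)  # last 8 bytes of the file
            size, magic = struct.unpack("<i4s", f.read(8))
        return size, magic  # magic should be b"PAR1"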
Oh interesting. I'd love to try to recreate this issue. How are you generating the parquet file that it fails on?
The files are generated on a Cloudera Hadoop cluster (version 5.4.4) in Java by a colleague. I asked him for some code, and he gave me the parts that write the parquet file; it's part of a larger file, though:
import java.io.File;
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;
import org.apache.avro.reflect.ReflectData;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.fs.Path;
import parquet.avro.AvroSchemaConverter;
import parquet.avro.AvroWriteSupport;
import parquet.hadoop.ParquetWriter;
import parquet.hadoop.metadata.CompressionCodecName;
import parquet.schema.MessageType;

public static final WriterVersion DEFAULT_WRITER_VERSION = WriterVersion.PARQUET_1_0;

// Convert the Avro schema to a Parquet schema and build the write support.
Schema avroSchema = new Schema.Parser().parse(avroSchemaFile);
MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);
AvroWriteSupport writeSupport = new AvroWriteSupport(parquetSchema, avroSchema);

File parquetFile = new File("parquetFile.parquet");
Path parquetFilePath = new Path(parquetFile.toURI());

// Write each record SNAPPY-compressed with the default block and page sizes.
try (ParquetWriter<IndexedRecord> parquetFileWriter =
        new ParquetWriter<IndexedRecord>(parquetFilePath, writeSupport,
            CompressionCodecName.SNAPPY, ParquetWriter.DEFAULT_BLOCK_SIZE,
            ParquetWriter.DEFAULT_PAGE_SIZE))
{
    for (UploadedXmlDTO uploadedXML : uploadedXMLs)
    {
        GenericRecord record = new GenericData.Record(avroSchema);
        record.put("date", uploadedXML.getDate());
        record.put("xml", ByteBuffer.wrap(uploadedXML.getXml()));
        parquetFileWriter.write(record);
    }
}
Maybe this helps a little; I can't provide the files themselves because of company policy.
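To attempt a reproduction without the cluster, a file with the same two-column shape could presumably be written with fastparquet, which the reporter already has working. A hypothetical sketch (column names mirror the Avro schema above; a recent fastparquet and pandas are assumed):

    import pandas as pd
    import fastparquet

    # Two columns mirroring the Java writer: a date string and raw XML bytes.
    df = pd.DataFrame({
        "date": ["2017-01-01", "2017-01-02"],
        "xml": [b"<a/>", b"<b/>"],
    })
    fastparquet.write("parquetFile.parquet", df, compression="SNAPPY")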