parquetjs
Wrong value from a DOUBLE column with repeating values
I have a Parquet file with 10k records. It has 7 string columns and 1 DOUBLE column. When I read this file and converted the records to a SQL query (batch insert), I realized that somewhere in the file the reader starts returning a different value for the DOUBLE column. My iteration code is very simple:
while (record = await cursor.next()) {
  count++;
  if (queryData) {
    queryData += ',';
  }
  queryData += `("${record.someId}","${record.someId2}","${record.someId3}","${record.someId4}","${record.readDate}",${record.readValue},"${record.unit}")`;
}
record.readValue is the DOUBLE column. The Parquet file was written with parquet-mr version 1.10.1. I couldn't find a clear pattern in which values come out wrong. Here is a screenshot of a diff between the output of the same Parquet file read with a different reader and read with the parquetjs-lite reader.
[image: screenshot of the diff between the two readers' output]
When the same value starts repeating in the actual data, the parquetjs-lite reader starts returning a different value (1542.3070...) than the correct one. That value is not actually "random": it is one of the values from the document, but from another index (somewhere in the previous rows).
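To pin down where the two readers diverge, a comparison along these lines might help. This is only a sketch: `findMismatches` is a hypothetical helper, not part of parquetjs-lite, and it assumes the DOUBLE column has already been read into two plain arrays (one from the reader believed to be correct, one from parquetjs-lite). The sample numbers are made up for illustration and are not taken from the real file.

```javascript
// Hypothetical helper: compares one column as read by two different readers.
// `expected` comes from the reader believed to be correct,
// `actual` comes from parquetjs-lite.
function findMismatches(expected, actual) {
  const mismatches = [];
  const length = Math.min(expected.length, actual.length);
  for (let i = 0; i < length; i++) {
    if (expected[i] !== actual[i]) {
      mismatches.push({
        index: i,
        expected: expected[i],
        actual: actual[i],
        // Earliest row in the reference data that holds the wrong value,
        // or -1 if the value does not occur in the reference data at all.
        seenAtIndex: expected.indexOf(actual[i]),
      });
    }
  }
  return mismatches;
}

// Made-up sample data that mimics the symptom described above.
const reference = [1542.307, 10.5, 20.25, 20.25, 20.25];
const fromParquetjs = [1542.307, 10.5, 20.25, 1542.307, 1542.307];

const mismatches = findMismatches(reference, fromParquetjs);
console.log(mismatches);
```

If `seenAtIndex` consistently points at an earlier row, that would support the observation that the wrong value is taken from another index rather than being corrupted data.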
I hope I have explained the issue clearly. I have spent the last 12 hours debugging this problem but have not found a clear cause yet. My hunch is that it is related to repetition levels, but I cannot confirm that. This is currently affecting our production system; I have even started rewriting this function in Python because of it. I hope this can be addressed properly so that I can return to JavaScript.
Also interested in whether this can be avoided via passed-in options or such. Including an aab file in an archive and then unzipping results in the aab archive (essentially a nested zip file) also being unzipped in the extracted directory location.