fruits.parquet generated by test/integration.js is unreadable by Hadoop parquet-tools 1.9.0

Open drauschenbach opened this issue 8 years ago • 5 comments

Build parquet-mr/parquet-tools per these instructions.

Then run its cat command to dump the fruits.parquet file that is generated:

$ java -jar target/parquet-tools-1.9.0.jar cat parquetjs/fruits.parquet 

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/Users/davidr/workspaces/parquet-mr/parquet-tools/target/parquet-tools-1.9.0.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Could not read footer: java.io.IOException: Could not read footer for file DeprecatedRawLocalFileStatus{path=file:/Users/davidr/workspaces/parquetjs/fruits.parquet; isDirectory=false; length=1411554; replication=1; blocksize=33554432; modification_time=1512831680000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}

Using parquetjs v0.8.0.
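
A quick sanity check that may help narrow this down: in the Parquet format the file starts with the 4-byte magic PAR1 and ends with the serialized file metadata, a 4-byte little-endian metadata length, and PAR1 again. The sketch below only checks that framing. `inspectParquetTail` is a hypothetical helper written for illustration, not part of parquetjs or parquet-tools, and a file that passes it can still have metadata that parquet-tools rejects.

```javascript
// Hypothetical helper (not a parquetjs API): checks the outer framing of a
// Parquet file held in a Buffer.
// Layout assumed: "PAR1" ... <file metadata> <4-byte LE metadata length> "PAR1"
function inspectParquetTail(buf) {
  const MAGIC = 'PAR1';
  // 4 (leading magic) + 4 (metadata length) + 4 (trailing magic)
  if (buf.length < 12) {
    return { ok: false, reason: 'file too short' };
  }
  const head = buf.toString('latin1', 0, 4);
  const tail = buf.toString('latin1', buf.length - 4);
  const footerLength = buf.readUInt32LE(buf.length - 8);
  // The metadata must fit between the leading magic and the length field.
  const ok = head === MAGIC && tail === MAGIC && footerLength <= buf.length - 12;
  return { ok, head, tail, footerLength };
}

// Synthetic example: 5 bytes standing in for the metadata.
const sample = Buffer.concat([
  Buffer.from('PAR1'),
  Buffer.alloc(5),
  Buffer.from([5, 0, 0, 0]),
  Buffer.from('PAR1'),
]);
console.log(inspectParquetTail(sample));

// On a real file this would be something like:
//   inspectParquetTail(require('fs').readFileSync('fruits.parquet'))
```

If the framing looks right, the problem is more likely inside the metadata or the encoded pages than in how the file was closed.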

drauschenbach, Dec 09 '17 15:12

I'm getting "TypeError: Cannot read property 'num_values' of null" when trying to read fruits.parquet with the module's own reader.

eliezershindler, Dec 27 '17 11:12

I can read it fine when using @drauschenbach's tool above.

eliezershindler, Dec 27 '17 14:12

I get the num_values error when writing fields with null values. Not writing those fields when their value is null avoided the issue.
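
A minimal sketch of that workaround, assuming rows are plain objects: drop the null (and undefined) fields before handing the row to the writer. `omitNullFields` is a hypothetical helper, not part of parquetjs, and the sample row is made up for illustration.

```javascript
// Hypothetical helper (not a parquetjs API): returns a copy of `row`
// without the keys whose value is null or undefined.
function omitNullFields(row) {
  return Object.fromEntries(
    Object.entries(row).filter(([, value]) => value !== null && value !== undefined)
  );
}

// Illustrative row; the field names are assumptions, not taken from the test file.
const row = { name: 'apples', quantity: null, price: 2.5 };
console.log(omitNullFields(row)); // { name: 'apples', price: 2.5 }
```

With parquetjs the cleaned row would then be passed to the writer, for example `await writer.appendRow(omitNullFields(row))`. This presumably only works for columns declared with `optional: true` in the schema, since a required column still needs a value.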

sfescape, May 01 '18 20:05

You might want to check out this PR here https://github.com/ironSource/parquetjs/pull/56 which has some fixes to RLE encoding and does verification of the generated files with parquet-mr.

I think you should be able to install this branch simply by running:

npm install zjonsson/parquetjs#0c7948d4fa64acf76e481256422c6f4a6ba56815

ZJONSSON, May 01 '18 22:05

Also, if you want to avoid the headache of building and configuring parquet-tools, you can simply add this to your .bashrc (or paste it into your console) and let Docker take care of everything.

parquet-tools() { docker run -w /home -v "${PWD}:/home" nathanhowell/parquet-tools "$@"; }

You have to be in the same directory as the parquet file you want to inspect (since the current directory will be mounted into the container as /home). You can then use the tools directly on any parquet file, e.g.:

parquet-tools dump fruits.parquet

ZJONSSON, May 02 '18 00:05