
Add support to read parquet file metadata through deephaven

Open malhotrashivam opened this issue 1 year ago • 2 comments

This will help with remote debugging and with understanding Parquet file structure. We can follow an API spec similar to DuckDB's: https://duckdb.org/docs/data/parquet/overview

  • read_parquet
  • parquet_file_metadata
  • parquet_kv_metadata
  • parquet_schema

malhotrashivam, Sep 25 '24 15:09

One approach that @rcaudy suggested in the meantime:

If you have a raw source table in Groovy, you should be able to:

  1. Call .initialize() on it.
  2. Get its columnSourceManager (CSM) field.
  3. Get the Table returned by the CSM’s locationTable().
  4. Get the K-V metadata for each file by applying update("KV = ((io.deephaven.parquet.table.location.ParquetTableLocation) _TableLocation).getParquetKey().getMetadata().getFileMetaData().getKeyValueMetaData()")

malhotrashivam, Sep 25 '24 15:09

It may be useful to write a little standalone utility to print out the FileMetaData as JSON; I've found this little script helpful:

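        // Assumes fileMetaData is a Thrift-generated Parquet FileMetaData, and that TMemoryBuffer,
        // TSimpleJSONProtocol, and TException come from Apache Thrift (possibly under a shaded package).
        try (final TMemoryBuffer buffer = new TMemoryBuffer(128)) {
            // Serialize the Thrift struct as human-readable JSON into the in-memory buffer.
            fileMetaData.write(new TSimpleJSONProtocol(buffer));
            buffer.flush();
            System.out.println(buffer.toString(StandardCharsets.UTF_8));
        } catch (TException e) {
            // Best-effort debugging aid: serialization failures are deliberately ignored.
        }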
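
For a standalone utility that starts from a file rather than an already-loaded FileMetaData, the footer has to be located first. Below is a minimal sketch using only the JDK, assuming the standard unencrypted Parquet layout (Thrift-encoded footer, then a 4-byte little-endian footer length, then the trailing `PAR1` magic). The class name `ParquetFooterLocator` and the method `readFooterBytes` are hypothetical names for illustration and are not part of Deephaven or Parquet.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: extracts the raw Thrift-encoded footer of a Parquet file.
final class ParquetFooterLocator {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    static byte[] readFooterBytes(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long fileLength = file.length();
            // Smallest possible file: leading magic (4) + footer length (4) + trailing magic (4).
            if (fileLength < 12) {
                throw new IOException("File too small to be Parquet: " + path);
            }
            // The last 8 bytes hold the footer length followed by the magic.
            byte[] tail = new byte[8];
            file.seek(fileLength - 8);
            file.readFully(tail);
            if (!Arrays.equals(Arrays.copyOfRange(tail, 4, 8), MAGIC)) {
                throw new IOException("Missing trailing PAR1 magic: " + path);
            }
            int footerLength = ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
            long footerStart = fileLength - 8 - footerLength;
            // The footer cannot overlap the 4-byte leading magic.
            if (footerLength < 0 || footerStart < 4) {
                throw new IOException("Corrupt footer length: " + footerLength);
            }
            byte[] footer = new byte[footerLength];
            file.seek(footerStart);
            file.readFully(footer);
            return footer;
        }
    }
}
```

The returned bytes are still Thrift-encoded; they would need to be deserialized into a FileMetaData with the Parquet Thrift classes before the JSON dump above can be applied.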

devinrsmith, Sep 25 '24 19:09