deephaven-core
Add support to read parquet file metadata through deephaven
This would help with remote debugging and with understanding the structure of a Parquet file. We could follow an API similar to DuckDB's: https://duckdb.org/docs/data/parquet/overview
- read_parquet
- parquet_file_metadata
- parquet_kv_metadata
- parquet_schema
In the meantime, @rcaudy suggested one approach:
If you have a raw source table in Groovy, you should be able to:
- Call .initialize() on it.
- Get its columnSourceManager field.
- Get the Table result of the CSM’s locationTable()
- Get the K-V metadata for each file by applying an update("KV = ((io.deephaven.parquet.table.location.ParquetTableLocation) _TableLocation).getParquetKey().getMetadata().getFileMetaData().getKeyValueMetaData()")
It may be useful to write a little standalone utility to print out the FileMetaData as JSON; I've found this little script helpful:
    try (final TMemoryBuffer buffer = new TMemoryBuffer(128)) {
        // Serialize the Thrift FileMetaData struct as human-readable JSON
        fileMetaData.write(new TSimpleJSONProtocol(buffer));
        buffer.flush();
        System.out.println(buffer.toString(StandardCharsets.UTF_8));
    } catch (TException e) {
        // ignore
    }
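For a standalone utility that needs neither Deephaven nor Thrift on the classpath, a first step is locating the footer that holds the serialized FileMetaData. Below is a minimal sketch, assuming an unencrypted file that follows the standard Parquet layout (the file ends with the Thrift-encoded footer, then a 4-byte little-endian footer length, then the magic bytes PAR1). The class and method names (ParquetFooterLocator, footerLength, footerOffset) are hypothetical and not part of any Deephaven or Parquet API:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: finds where the Thrift-encoded footer sits in a Parquet file.
public class ParquetFooterLocator {
    // Trailing magic of an unencrypted Parquet file.
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /** Parses the last 8 bytes of a file: 4-byte little-endian footer length, then "PAR1". */
    public static int footerLength(byte[] tail) {
        if (tail.length != 8) {
            throw new IllegalArgumentException("expected exactly 8 trailing bytes");
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (tail[4 + i] != MAGIC[i]) {
                throw new IllegalArgumentException("missing PAR1 trailing magic");
            }
        }
        int length = ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        if (length < 0) {
            throw new IllegalArgumentException("negative footer length");
        }
        return length;
    }

    /** Returns the byte offset at which the serialized footer starts. */
    public static long footerOffset(long fileSize, int footerLength) {
        long offset = fileSize - 8 - footerLength;
        // The first 4 bytes of the file hold the leading magic, so the footer cannot start earlier.
        if (offset < 4) {
            throw new IllegalArgumentException("footer length exceeds file size");
        }
        return offset;
    }

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(args[0], "r")) {
            if (file.length() < 12) {
                throw new IOException("file too small to be a Parquet file");
            }
            byte[] tail = new byte[8];
            file.seek(file.length() - 8);
            file.readFully(tail);
            int length = footerLength(tail);
            System.out.println("footer length = " + length
                    + ", footer offset = " + footerOffset(file.length(), length));
        }
    }
}
```

On Java 11 or later this can be run directly as `java ParquetFooterLocator.java /path/to/file.parquet`. The bytes at the reported offset are what a Thrift reader would deserialize into the FileMetaData that the script above prints as JSON.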