
Please improve parquet connector to not require hive

Open mariusa opened this issue 6 years ago • 14 comments

Please improve parquet connector to not require hive, ideally also not require hdfs

Working directly with parquet files would make Presto much easier to get adopted by data scientists.

Related: https://github.com/prestodb/presto/issues/11955

mariusa avatar Jun 16 '19 16:06 mariusa

@mariusa Marius, is this something you'd like to work on?

mbasmanova avatar Jun 18 '19 14:06 mbasmanova

I don't have the skills :(

We're looking for a way to query large CSV or Parquet files, and Presto seems like a great solution. However, it's currently too complex to get started with Parquet because of all the Hadoop/Hive dependencies and setup steps, compared with being able to connect to Parquet directly: https://scientific-software.netlify.com/howto/how-to-query-big-csv-files

Thanks, Maria!

mariusa avatar Jun 18 '19 14:06 mariusa

@mariusa you don't need Hadoop/HDFS. All you need is S3 and Hive (Hive can run without HDFS, the NameNode, etc.)

tooptoop4 avatar Jun 23 '19 14:06 tooptoop4

Even the standalone Hive metastore needs Hadoop-specific environment variables/jars. Maybe implement the Presto Thrift connector to communicate with S3? I will give it a go :-)

DennisRutjes avatar Oct 18 '19 21:10 DennisRutjes

I would be very interested in this connector too. I believe (though admittedly I haven't tried) that it would be significantly more efficient to query large Parquet files directly, without requiring the data to pass between Hive and Presto over Thrift.

jeromeof avatar Jan 23 '20 08:01 jeromeof

@jeromeof the data does not come from Hive; only the metadata does.

tooptoop4 avatar May 23 '20 23:05 tooptoop4

I haven't tried it, but I've heard there is an experimental setting (hive.metastore=file) that could support reading Parquet directly: https://github.com/prestodb/presto/tree/master/presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/file
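For reference, a minimal catalog file using this setting might look like the sketch below. This is an assumption based on the linked source tree, not a tested config; the catalog directory path is illustrative.

```properties
# etc/catalog/hive.properties -- hedged sketch of the file-metastore setup
connector.name=hive-hadoop2
hive.metastore=file
# Directory where the file metastore keeps schema/table metadata (illustrative path)
hive.metastore.catalog.dir=file:///data/presto-metastore
```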

tooptoop4 avatar May 24 '20 00:05 tooptoop4

You can indeed [ab]use hive.metastore=file to avoid setting up a real Hive metastore. You can then run MinIO to serve your local files over "S3".
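For anyone reproducing this by hand, the S3 side of such a catalog might look like the following. The endpoint and credentials are illustrative MinIO defaults, drawn from the standard hive.s3.* options, and are not necessarily the container's exact configuration.

```properties
# Hedged sketch: point the Hive connector at a local MinIO instead of real S3
hive.s3.endpoint=http://localhost:9000
hive.s3.path-style-access=true
hive.s3.aws-access-key=minioadmin
hive.s3.aws-secret-key=minioadmin
```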

I put together a Docker container with the above setup; it seems to work well for ad-hoc analysis of local files, though I haven't used it extensively yet, so YMMV.

docker run -it --mount source=/<data-dir>/,destination=/parquet,type=bind floatingwindow/presto-local-parquet

The above will start Presto and the necessary bits to read local files, then give you a Presto shell.

More info: https://github.com/floating-window/presto-local-parquet

At the moment you still need to define the schemas manually. I'm partway through writing a script to scan a set of Parquet files and automagically set up the corresponding schemas and file mappings in Presto; I'll update the above container when it's finished.
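A sketch of what such a script could do, using pyarrow to read each file's schema and emit a Presto CREATE TABLE statement. The helper names and the (deliberately partial) type mapping are my own illustration, not part of the linked container:

```python
# Hedged sketch: derive Presto DDL from a Parquet file's schema.
# The Arrow-to-Presto type map below is partial; extend it as needed.

_ARROW_TO_PRESTO = {
    "int32": "INTEGER",
    "int64": "BIGINT",
    "float": "REAL",
    "double": "DOUBLE",
    "string": "VARCHAR",
    "bool": "BOOLEAN",
}

def presto_type(arrow_type: str) -> str:
    """Map an Arrow type name (e.g. 'int64') to a Presto SQL type."""
    if arrow_type.startswith("timestamp"):
        return "TIMESTAMP"
    return _ARROW_TO_PRESTO.get(arrow_type, "VARCHAR")

def ddl_from_fields(table: str, fields, location: str) -> str:
    """Build a CREATE TABLE statement from (column_name, arrow_type) pairs."""
    cols = ",\n  ".join(f"{name} {presto_type(t)}" for name, t in fields)
    return (f"CREATE TABLE {table} (\n  {cols}\n) "
            f"WITH (external_location = '{location}', format = 'PARQUET')")

def fields_from_parquet(path: str):
    """Read (name, type) pairs from a Parquet file's footer."""
    import pyarrow.parquet as pq  # imported lazily; only needed for real files
    schema = pq.read_schema(path)
    return [(f.name, str(f.type)) for f in schema]
```

For example, `ddl_from_fields("events", [("id", "int64"), ("name", "string")], "s3a://bucket/events/")` produces a BIGINT/VARCHAR table definition pointing at that external location.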

louiseightsix avatar Sep 17 '20 05:09 louiseightsix

This would be really helpful when using Superset to do analysis

johnnytshi avatar Nov 25 '21 19:11 johnnytshi

CC: @majetideepak

mbasmanova avatar Nov 29 '21 14:11 mbasmanova

Actually, I was wondering about not using HDInsight or Spark, and for HDFS just using Azure Data Lake Gen2: insert data into the storage with a small C# app and query it with Presto. The strangest part is the requirement to update the partitions each time I create a new partition.

MartinKosicky avatar Aug 20 '24 22:08 MartinKosicky

@MartinKosicky Are you able to share the mini app?

majetideepak avatar Aug 22 '24 14:08 majetideepak

Actually no, because I couldn't find a way to do it in the documentation. I wanted to try an architecture on Azure that manually ingests Parquet files without Hadoop (no Spark or Flink, just a plain C# app) into Azure Data Lake. Since Azure decouples compute and storage, I thought that with Presto for querying I wouldn't need Spark, Hive, or other technologies at all. But then there is the metastore complication: I'd have to deploy some fake Hive metastore. I would rather create a static configuration file that describes the schema than run a service.

MartinKosicky avatar Aug 23 '24 06:08 MartinKosicky

Presto has unofficial support for a file-based metastore: https://github.com/prestodb/presto/issues/19112 I use it this way for development. For writing data, I recently wrote a small Velox app that writes a single Parquet file to storage: https://github.com/majetideepak/velox/commit/8ef1bf35de7c7c71c3481d8f6b2817d865ea8a3e It does not support partitions yet, but it could be a starting point.

majetideepak avatar Aug 23 '24 14:08 majetideepak