External tables partitioning support for hdfs / s3 - Ingestion guidelines for datalake

Open fire opened this issue 7 years ago • 1 comments

The goal is to ingest a firehose stream of data from the web (let's assume it's JSON) and have it be queryable from SnappyData without manually merging tables. What do you recommend?

See a similar feature in hive: https://resources.zaloni.com/blog/partitioning-in-hive

A possible workflow is to use a collector that writes to S3/HDFS at some fixed interval (every hour, or maybe every 5 minutes), and then have SnappyData read the accumulated files as a single unit. The opposite direction is also useful: having SnappyData write to S3/HDFS through the JDBC driver. What do you think?

Since HDFS external table support has been deprecated and is slated for removal, what alternatives are there for reading multiple prefix-partitioned files as one table?

fire, Apr 30 '18 15:04

@fire Do you have a Hive metastore that you want to point SnappyData (SD) to, or do you want to connect directly? Does the Spark way of accessing Hive work for you, or are you expecting something more (https://spark.apache.org/docs/2.1.1/sql-programming-guide.html#hive-tables)? If you want to use the Hive metastore directly, then use a SparkSession with hive-site.xml to read from it, then load the result into Snappy like:

val result = spark.sql("select * from hiveTable where timestamp > ...")
snappy.createDataFrame(result.rdd, result.schema).write.format("column").insertInto(...)

A more efficient option is to convert to RDD[InternalRow] using "result.queryExecution.toRdd" and then call internalCreateDataFrame, which avoids the conversions between Row and InternalRow. That method has package-level access, so you would probably need a small helper function in that package to reach it.
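As a rough sketch of such a helper, assuming a Spark 2.1.x-based build where SparkSession.internalCreateDataFrame(rdd, schema) is private[sql] and where the Snappy session is a SparkSession sharing the same SparkContext as the Hive-reading session. The object name InternalRowBridge and the method name copyInto are made up for illustration and are not part of Spark or SnappyData:

```scala
// Must live in this package to see the private[sql] method.
package org.apache.spark.sql

// Hypothetical helper; the names here are illustrative only.
object InternalRowBridge {

  /**
   * Re-wraps `source` as a DataFrame owned by `target` without
   * converting InternalRow -> Row -> InternalRow along the way.
   */
  def copyInto(target: SparkSession, source: DataFrame): DataFrame = {
    // RDD[InternalRow] produced directly by the physical plan
    val rows = source.queryExecution.toRdd
    // Package-private constructor path; reuses the source schema as-is
    target.internalCreateDataFrame(rows, source.schema)
  }
}
```

With that in place, the load step above would become something like InternalRowBridge.copyInto(snappy, result).write.format("column").insertInto(...), with the target table left as in the original snippet.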

However, in terms of the overall workflow, what we have been driving towards is a common layer like Kafka: publish these events on Kafka and have both Hive and SnappyData consume them from there. We will also be adding common event classes for updates/deletes shortly, but until then you can use your own event representation for those types.
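A custom event representation for updates/deletes could look roughly like the following. This is only a minimal sketch in plain Scala: the names ChangeEvent, Upsert, Delete and ChangeEvents are hypothetical and are not the SnappyData event classes mentioned above, and the string payload stands in for whatever serialized form is actually published on Kafka.

```scala
// Hypothetical event model; none of these names come from SnappyData or Kafka.
sealed trait ChangeEvent { def key: String }

// Insert-or-update of the record identified by `key`
final case class Upsert(key: String, payload: String) extends ChangeEvent

// Removal of the record identified by `key`
final case class Delete(key: String) extends ChangeEvent

object ChangeEvents {
  /** Applies events in arrival order to an in-memory view keyed by `key`. */
  def applyAll(state: Map[String, String],
               events: Seq[ChangeEvent]): Map[String, String] =
    events.foldLeft(state) {
      case (acc, Upsert(k, p)) => acc.updated(k, p) // later upserts win
      case (acc, Delete(k))    => acc - k           // no-op if key is absent
    }
}
```

Each consumer (Hive or SnappyData) would then interpret the same event stream in its own way; the fold here only illustrates the intended semantics of the two event types.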

sumwale, Sep 03 '18 06:09