External tables partitioning support for hdfs / s3 - Ingestion guidelines for datalake
The goal is to have a firehose stream of data from the web. Let's assume it's JSON, and we want it to be queryable from SnappyData without manually merging tables. What do you recommend?
See a similar feature in hive: https://resources.zaloni.com/blog/partitioning-in-hive
A possible workflow is a collector that writes to S3/HDFS on some interval (every hour, maybe every 5 minutes?) and then having SnappyData read it as a unit. The opposite direction is also useful: having SnappyData write to S3/HDFS through the JDBC driver. What do you think?
Since HDFS external table support was deprecated and is slated to be removed, what alternatives are there to read multiple prefix-partitioned files as one table?
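Something like the following generic Spark read would cover the "many prefixes as one table" case; this is only a sketch, assuming the collector lays files out under a partitioned prefix (the bucket, paths, and partition names here are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-partitioned-json").getOrCreate()

// Partition discovery turns the dt=/hour= directories into columns, so all
// prefixes under the base path are exposed as a single queryable view.
val events = spark.read
  .option("basePath", "s3a://my-bucket/events")
  .json("s3a://my-bucket/events/dt=*/hour=*")
events.createOrReplaceTempView("raw_events")
spark.sql("select count(*) from raw_events where dt = '2019-01-01'").show()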
@fire Do you have a Hive metastore that you want to point SnappyData to, or do you want to connect directly? Does the Spark way of Hive access work, or are you expecting something more (https://spark.apache.org/docs/2.1.1/sql-programming-guide.html#hive-tables)? If you want to use the Hive metastore directly, then use a SparkSession with hive-site.xml to read it, then load into Snappy like:
val result = spark.sql("select * from hiveTable where timestamp > ...")
snappy.createDataFrame(result.rdd, result.schema).write.format("column").insertInto(...)
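For completeness, a minimal sketch of the sessions those two lines assume: a Hive-enabled SparkSession (hive-site.xml on the classpath / in conf/) plus a SnappySession on the same SparkContext. The table names and filter below are illustrative:

import org.apache.spark.sql.{SnappySession, SparkSession}

// Hive-enabled SparkSession; hive-site.xml on the classpath points it at the
// existing Hive metastore.
val spark = SparkSession.builder()
  .appName("hive-to-snappy")
  .enableHiveSupport()
  .getOrCreate()

// SnappySession sharing the same SparkContext handles the write side.
val snappy = new SnappySession(spark.sparkContext)

val result = spark.sql("select * from hiveTable where ts > '2019-01-01'")
snappy.createDataFrame(result.rdd, result.schema)
  .write.format("column").insertInto("snappy_target")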
A more efficient option is to convert to RDD[InternalRow] using "result.queryExecution.toRdd" together with internalCreateDataFrame, which has package-level access (so perhaps a helper function in that package is needed to reach it); this avoids the conversions between Row and InternalRow.
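A sketch of what such a helper might look like; the InternalRowHelper object and its placement are hypothetical, only the use of the package-private internalCreateDataFrame and result.queryExecution.toRdd come from the suggestion above:

// Hypothetical helper, compiled into the org.apache.spark.sql package so it
// can reach the package-private internalCreateDataFrame.
package org.apache.spark.sql

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

object InternalRowHelper {
  // Wraps an RDD[InternalRow] in a DataFrame on the given session without any
  // Row <-> InternalRow conversion.
  def toDataFrame(session: SparkSession, rows: RDD[InternalRow], schema: StructType): DataFrame =
    session.internalCreateDataFrame(rows, schema)
}

// Usage (names illustrative):
//   val result = spark.sql("select * from hiveTable where ts > '2019-01-01'")
//   InternalRowHelper.toDataFrame(snappy, result.queryExecution.toRdd, result.schema)
//     .write.format("column").insertInto("snappy_target")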
However, in terms of overall workflow, what we have been driving towards is using a common layer like Kafka: publish these events on Kafka and have both Hive and SnappyData consume them from it. We will also be adding common event classes for updates/deletes shortly, but until then you can use your own event representation for those types.
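A minimal sketch of that Kafka path, assuming Spark Structured Streaming with foreachBatch (Spark 2.4+), a pre-created SnappyData column table named "events", and an illustrative topic name, broker address, JSON schema, and checkpoint path:

import org.apache.spark.sql.{DataFrame, SnappySession, SparkSession}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("kafka-to-snappy").getOrCreate()
val snappy = new SnappySession(spark.sparkContext)
import spark.implicits._

// Illustrative schema for the JSON events published by the collector.
val eventSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("ts", TimestampType),
  StructField("payload", StringType)))

// Parse the Kafka value bytes as JSON into typed columns.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "web-events")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json($"json", eventSchema).as("e"))
  .select("e.*")

// Append each micro-batch to the SnappyData column table, mirroring the
// batch-load pattern shown earlier in the thread.
val writeBatch: (DataFrame, Long) => Unit = (batch, _) =>
  snappy.createDataFrame(batch.rdd, batch.schema)
    .write.format("column").insertInto("events")

events.writeStream
  .foreachBatch(writeBatch)
  .option("checkpointLocation", "/tmp/checkpoints/web-events")
  .start()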