# Dione - a Spark and HDFS indexing library
We can implement a simple, Java-native, standalone HTTP server that accepts requests for one or more keys and returns the payload (as JSON, etc.). Users can then implement their own web server to...
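A minimal sketch of such a server using the JDK's built-in `com.sun.net.httpserver` package. The in-memory `Map` stands in for the real index lookup, and the `/fetch` route, query format, and JSON shape are all assumptions for illustration, not the library's API:

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class IndexLookupServer {

    // Hypothetical lookup: a real server would delegate to the index reader
    // instead of an in-memory map.
    static String lookupJson(Map<String, String> index, String key) {
        String payload = index.get(key);
        if (payload == null) {
            return "{\"key\":\"" + key + "\",\"found\":false}";
        }
        return "{\"key\":\"" + key + "\",\"payload\":\"" + payload + "\"}";
    }

    // Starts a server answering e.g. GET /fetch?user-1 with a JSON body.
    static HttpServer start(Map<String, String> index, int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/fetch", (HttpExchange ex) -> {
            // The key is passed as the raw query string (an assumption).
            String key = ex.getRequestURI().getQuery();
            byte[] body = lookupJson(index, key).getBytes(StandardCharsets.UTF_8);
            ex.getResponseHeaders().set("Content-Type", "application/json");
            ex.sendResponseHeaders(200, body.length);
            try (OutputStream os = ex.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

Keeping the lookup logic in a plain function like `lookupJson` is what makes it easy for users to swap in their own web framework around it.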
## Summary
Currently we don't support file splits in all formats. Relates to #10.
## Summary
Does this library support Delta files (https://github.com/delta-io/delta/)?

Links to #PR/#Issue

## Detailed Description
What is the problem? How can we solve it?
### Summary
Adding the option to have a Parquet index instead of an Avro B-tree. This is for batch-only use cases, where fetches are rare or not used at all. In such...
# Summary
Currently we hard-code parts of the DDL in `createIndex`, for example the `partitioned by` clause when we create the index table.
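One way to remove the hard-coding is to build the `CREATE TABLE` statement from a partition spec passed in by the caller. A rough sketch; the column names (`key`, `file`, `offset`) and the method name are illustrative, not Dione's actual schema or API:

```java
import java.util.List;

public class IndexDdl {

    // Builds the index-table DDL with a caller-supplied partition spec
    // instead of a hard-coded `partitioned by` clause.
    static String createIndexDdl(String indexTable, List<String> partitionCols) {
        StringBuilder sb = new StringBuilder(
            "CREATE TABLE " + indexTable + " (key STRING, file STRING, offset BIGINT)");
        if (!partitionCols.isEmpty()) {
            // e.g. ["dt", "hr"] -> PARTITIONED BY (dt STRING, hr STRING)
            sb.append(" PARTITIONED BY (")
              .append(String.join(" STRING, ", partitionCols))
              .append(" STRING)");
        }
        return sb.toString();
    }
}
```

An empty list simply yields an unpartitioned index table, so existing call sites keep working.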
Add an IT test script that generates some big table, indexes it, and asserts the basic functionality. Such a Spark script should work smoothly on any Hadoop cluster (with HDFS).
Try to recognize common path prefixes at runtime and trim them. For example, files in a standard table might look like:
```
hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0000.parquet
hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0001.parquet
...
```
On read, before the...
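The prefix-trimming step above can be sketched as follows, assuming plain string paths; the class and method names are illustrative. The common prefix is cut back to the last `/` so a path component is never split in half:

```java
import java.util.List;
import java.util.stream.Collectors;

public class PathPrefixTrimmer {

    // Longest common prefix of all paths, truncated at the last '/'
    // so we only ever trim whole directory components.
    static String commonPrefix(List<String> paths) {
        String prefix = paths.get(0);
        for (String p : paths) {
            int i = 0;
            int max = Math.min(prefix.length(), p.length());
            while (i < max && prefix.charAt(i) == p.charAt(i)) {
                i++;
            }
            prefix = prefix.substring(0, i);
        }
        int cut = prefix.lastIndexOf('/');
        return cut < 0 ? "" : prefix.substring(0, cut + 1);
    }

    // Store only the suffixes; the shared prefix would be kept once
    // (e.g. in the index metadata) and re-attached on read.
    static List<String> trim(List<String> paths) {
        String prefix = commonPrefix(paths);
        return paths.stream()
                .map(p -> p.substring(prefix.length()))
                .collect(Collectors.toList());
    }
}
```

For the example table above this would store only `part-0000.parquet`, `part-0001.parquet`, etc., with the `hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/` prefix kept once.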
To support ignoring "old" partitions. Alternatively, we can add some partition "blacklist" or a minimum value to the index metadata.
The Spark 2.4+ DataSource v2 API is much more powerful than in Spark 2.3. The main issue in Spark 2.3 is that you basically need to implement everything yourself, and it is a...
The current `filesDF` is both ugly and inefficient in terms of data locality. We should try to switch to something like `HadoopRDD`/`NewHadoopRDD`, or something more natural, to leverage the preferred locations...