
Dione - a Spark and HDFS indexing library

24 dione issues

We can implement a simple, Java-native, standalone HTTP server that receives requests for one or more keys and returns the payload (as JSON, etc.). Then, users can implement their own web server to...
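One possible sketch of such a server, using only the JDK's built-in `com.sun.net.httpserver` package. The in-memory `index` map, the `lookup` method, and the `/get` route are illustrative stand-ins, not Dione's actual API:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class KeyLookupServer {
    // Illustrative stand-in for the real index lookup.
    static final Map<String, String> index = Map.of("k1", "{\"payload\":\"v1\"}");

    static String lookup(String key) {
        return index.getOrDefault(key, "{}");
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/get", exchange -> {
            // The query string is treated as the key, e.g. GET /get?k1
            byte[] body = lookup(exchange.getRequestURI().getQuery())
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```

A user wanting a different transport (gRPC, a different JSON shape, auth) would keep the `lookup` core and swap the handler.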

## Summary Currently we don't support file splits in all formats. Relates to #10

## Summary Does this library support Delta files (https://github.com/delta-io/delta/)?

### Summary Adding the option to use a Parquet index instead of the Avro B-tree. This is for batch-only use cases, where fetches are rare or not used at all. In such...

# Summary Currently we hard-code parts of `createIndex`, for example the `partitioned by` clause when we create the index table
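A sketch of what a configurable variant could look like: the partition spec becomes a caller-supplied argument instead of a hard-coded clause. The DDL text and column names here are illustrative, not Dione's actual index schema:

```java
public class IndexDdl {
    /**
     * Builds index-table DDL with a caller-supplied partition spec
     * instead of a hard-coded `partitioned by` clause.
     * Column names are illustrative only.
     */
    static String createIndexDdl(String table, String partitionSpec) {
        String base = "CREATE TABLE " + table
                + " (key STRING, data_filename STRING, data_offset BIGINT)";
        return (partitionSpec == null || partitionSpec.isEmpty())
                ? base
                : base + " PARTITIONED BY (" + partitionSpec + ")";
    }
}
```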

Add an IT (integration test) script that generates a big table, indexes it, and asserts the basic functionality. Such a Spark script should run smoothly on any Hadoop cluster (with HDFS).

Try to recognize common path prefixes at runtime and trim them. For example, files in a standard table might look like:

```
hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0000.parquet
hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0001.parquet
...
```

On read, before the...
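The prefix detection itself is plain string handling; a minimal sketch (the method name is made up, and trimming at a `/` boundary keeps the remainder a valid relative path):

```java
import java.util.List;

public class PrefixTrimmer {
    /** Returns the longest common prefix of all paths, cut back to a '/' boundary. */
    static String commonDirPrefix(List<String> paths) {
        if (paths.isEmpty()) return "";
        String prefix = paths.get(0);
        for (String p : paths) {
            // Shrink the candidate until it prefixes the current path.
            while (!p.startsWith(prefix)) {
                prefix = prefix.substring(0, prefix.length() - 1);
            }
        }
        int slash = prefix.lastIndexOf('/');
        return slash >= 0 ? prefix.substring(0, slash + 1) : "";
    }
}
```

The index would then store only the suffix after this prefix, re-attaching it on fetch.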

To support ignoring "old" partitions. Alternatively, we can add a partition "blacklist" or a minimum value to the index metadata
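The minimum-value variant could be as simple as a lexicographic filter over partition values, which works for `dt=YYYY-MM-DD`-style partitions. A sketch under that assumption (method and names are hypothetical):

```java
import java.util.List;
import java.util.stream.Collectors;

public class PartitionFilter {
    /**
     * Keeps only partitions whose value is >= the configured minimum.
     * Lexicographic comparison is assumed to match the partition ordering,
     * which holds for zero-padded date strings like "2020-01-01".
     */
    static List<String> dropOldPartitions(List<String> partitionValues, String minValue) {
        return partitionValues.stream()
                .filter(v -> v.compareTo(minValue) >= 0)
                .collect(Collectors.toList());
    }
}
```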

Spark 2.4+ DataSource v2 is much more powerful than in Spark 2.3. The main issue in Spark 2.3 is that you basically need to implement everything yourself, and it is a...

The current filesDF is both ugly and inefficient in terms of data locality. We should try to switch to something like HadoopRDD/NewHadoopRDD, or something more natural, to leverage the preferred locations...