rubix
Macro based whitelisting of locations allowed to cache
This is particularly useful for partitioned tables. Today, the whitelist is a regex. For example, if a user wants to whitelist two tables, reviews and bookings, that live under the same S3 prefix, such as s3://mybuckets/tables, they could add this to the config:
hadoop.cache.data.location.whitelist=.*mybuckets/tables/(reviews|bookings).*
The problem is that if, say, bookings is partitioned by month and has data for many months while the user only wants to cache the last two months, the user has to update this config every time the month changes. To solve that, we should allow macros in this config value. E.g. if reviews is partitioned yearly and bookings monthly, and the user wants to enable caching for only the last 5 years of reviews and the last 2 months of bookings, this should be possible:
hadoop.cache.data.location.whitelist=.*mybuckets/tables/(reviews/year=$lastFiveYears$|bookings/month=$lastTwoMonthsNames$).*
Rubix should evaluate the macros $lastFiveYears$ and $lastTwoMonthsNames$ at runtime and come up with the whitelisting config as:
hadoop.cache.data.location.whitelist=.*mybuckets/tables/(reviews/year=(2016|2015|2014|2013|2012)|bookings/month=(October|September)).*
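A minimal sketch of how such an expansion could work, not existing Rubix code. The class name WhitelistMacroExpander and its methods are hypothetical. It also assumes that "last N" includes the current year or month, which is what the example above suggests if it was written in October 2016.

```java
import java.time.LocalDate;
import java.time.format.TextStyle;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Rubix: expands $macro$ tokens in a whitelist regex.
public class WhitelistMacroExpander {
    // Matches $name$ tokens; names may contain dots so that class names fit too.
    private static final Pattern MACRO = Pattern.compile("\\$([A-Za-z][A-Za-z0-9_.]*)\\$");

    // Builds an alternation such as (2016|2015|...) starting at the current year (assumption).
    static String lastYears(LocalDate today, int n) {
        List<String> years = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            years.add(String.valueOf(today.getYear() - i));
        }
        return "(" + String.join("|", years) + ")";
    }

    // Builds an alternation such as (October|September) starting at the current month (assumption).
    static String lastMonthNames(LocalDate today, int n) {
        List<String> months = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            months.add(today.minusMonths(i).getMonth()
                    .getDisplayName(TextStyle.FULL, Locale.ENGLISH));
        }
        return "(" + String.join("|", months) + ")";
    }

    // Replaces each known macro with its expansion; unknown macros are left untouched.
    static String expand(String whitelist, LocalDate today) {
        Matcher m = MACRO.matcher(whitelist);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement;
            switch (m.group(1)) {
                case "lastFiveYears":
                    replacement = lastYears(today, 5);
                    break;
                case "lastTwoMonthsNames":
                    replacement = lastMonthNames(today, 2);
                    break;
                default:
                    replacement = m.group();
            }
            // quoteReplacement keeps '$' in the replacement from being read as a group reference.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling expand with the macro-based value above and LocalDate.of(2016, 10, 15) yields the expanded whitelist shown above. Passing the date in explicitly keeps the expansion testable; a real implementation would also need to decide how often to re-evaluate the macros.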
Rubix should provide some common macros out of the box, and the system should be extensible with user-defined macros. E.g. if a user has data partitioned by store location as s3://mybuckets/tables/stores/location=xyz and wants to cache data only for stores in Bangalore and Pune, they should be able to write a custom function for it, add that jar, and use it in the whitelist as:
hadoop.cache.data.location.whitelist=.*mybuckets/tables/stores/location=$com.myCompany.rubix.myCustomStoreSelector$.*
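One way the user-defined part could look, again only as a sketch. The WhitelistMacro interface, the resolve method and the nested MyCustomStoreSelector class are all hypothetical; Rubix does not define them today. The idea is that a macro name containing dots is treated as a class name, loaded from the user's jar by reflection, and asked for its regex fragment.

```java
import java.util.Arrays;
import java.util.List;

public class CustomMacroDemo {
    // Hypothetical extension point; not an existing Rubix interface.
    public interface WhitelistMacro {
        // Returns the regex fragment that replaces the macro token.
        String evaluate();
    }

    // Example user-defined macro: selects the store locations to cache.
    public static class MyCustomStoreSelector implements WhitelistMacro {
        private static final List<String> CITIES = Arrays.asList("Bangalore", "Pune");

        @Override
        public String evaluate() {
            return "(" + String.join("|", CITIES) + ")";
        }
    }

    // Loads the named class, instantiates it via its no-arg constructor and evaluates it.
    static String resolve(String className) {
        try {
            Class<?> cls = Class.forName(className);
            WhitelistMacro macro = (WhitelistMacro) cls.getDeclaredConstructor().newInstance();
            return macro.evaluate();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot evaluate macro " + className, e);
        }
    }
}
```

With this selector, the location macro would expand to (Bangalore|Pune). Failing loudly on a class that is missing or does not implement the interface seems safer than silently caching nothing, but that is a design choice left open here.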
Are there any updates on this feature?
Is there any plan to implement this feature in Rubix?