datawave
Feature/dynamic shards
Add the ability to define custom ShardIdGenerators that can be registered with ShardIdFactory and used to override shard ID generation for any records the generators consider applicable.
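To illustrate the idea, here is a minimal, self-contained sketch of a factory that consults registered generators before falling back to its default. It is not the code in this PR: `ShardIdFactorySketch`, `Event`, `FixedPartitionGenerator`, and the `isApplicable`/`generateShardId` method names are all hypothetical, and the default `date_partition` scheme is a simplification assumed for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; these types are NOT DataWave's actual API.
class ShardIdFactorySketch {

    /** Minimal stand-in for an ingest record. */
    static final class Event {
        final String dataType;
        final String date; // yyyyMMdd
        final String uid;

        Event(String dataType, String date, String uid) {
            this.dataType = dataType;
            this.date = date;
            this.uid = uid;
        }
    }

    /** Assumed contract: decide applicability, then produce a shard ID. */
    interface ShardIdGenerator {
        boolean isApplicable(Event event);

        String generateShardId(Event event);
    }

    /** Example generator that pins one datatype to a fixed partition. */
    static final class FixedPartitionGenerator implements ShardIdGenerator {
        private final String dataType;
        private final int partition;

        FixedPartitionGenerator(String dataType, int partition) {
            this.dataType = dataType;
            this.partition = partition;
        }

        @Override
        public boolean isApplicable(Event event) {
            return dataType.equals(event.dataType);
        }

        @Override
        public String generateShardId(Event event) {
            return event.date + "_" + partition;
        }
    }

    private final List<ShardIdGenerator> generators = new ArrayList<>();
    private final int numShards;

    ShardIdFactorySketch(int numShards) {
        this.numShards = numShards;
    }

    void addGenerator(ShardIdGenerator generator) {
        generators.add(generator);
    }

    /** The first applicable generator wins; otherwise use the default scheme. */
    String getShardId(Event event) {
        for (ShardIdGenerator generator : generators) {
            if (generator.isApplicable(event)) {
                return generator.generateShardId(event);
            }
        }
        // Assumed default: event date plus a hash-based partition number.
        return event.date + "_" + Math.floorMod(event.uid.hashCode(), numShards);
    }

    public static void main(String[] args) {
        ShardIdFactorySketch factory = new ShardIdFactorySketch(1);
        factory.addGenerator(new FixedPartitionGenerator("myjson", 1));
        System.out.println(factory.getShardId(new Event("myjson", "20070924", "uid-1")));
    }
}
```

In this sketch, records that no generator claims keep the default shard ID, so registering a generator only affects the records it declares applicable.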
Closes https://github.com/NationalSecurityAgency/datawave/issues/2632
Tested this via the quickstart by adding the following to warehouse/ingest-configuration/src/main/resources/config/shard-ingest-config.xml:
<property>
<name>shardIdFactory.generator.1</name>
<value>datawave.ingest.mapreduce.handler.shard.ShiftOnDay</value>
</property>
<property>
<name>shardIdFactory.generator.1.datatypes</name>
<value>myjson</value>
</property>
<property>
<name>shardIdFactory.generator.1.begin</name>
<value>20060101</value>
</property>
<property>
<name>shardIdFactory.generator.1.end</name>
<value>20130331</value>
</property>
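As a rough illustration of how properties like these could scope a generator to certain datatypes and a date range, the sketch below reads the three keys for one generator index and answers whether a record falls in scope. The class `GeneratorScopeSketch` is hypothetical, a plain `Map` stands in for the real ingest configuration object, and treating `begin` and `end` as inclusive `yyyyMMdd` bounds is an assumption rather than something confirmed by this PR.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch only; not the parsing code used by this PR.
class GeneratorScopeSketch {

    // BASIC_ISO_DATE parses dates written as yyyyMMdd, e.g. 20060101.
    private static final DateTimeFormatter YYYYMMDD = DateTimeFormatter.BASIC_ISO_DATE;

    private final Set<String> dataTypes;
    private final LocalDate begin;
    private final LocalDate end;

    GeneratorScopeSketch(Map<String, String> conf, int index) {
        String prefix = "shardIdFactory.generator." + index;
        // Assumption: datatypes is a comma-separated list.
        this.dataTypes = Arrays.stream(conf.getOrDefault(prefix + ".datatypes", "").split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toSet());
        this.begin = LocalDate.parse(conf.get(prefix + ".begin"), YYYYMMDD);
        this.end = LocalDate.parse(conf.get(prefix + ".end"), YYYYMMDD);
    }

    /** Assumption: both date bounds are inclusive. */
    boolean applies(String dataType, LocalDate eventDate) {
        return dataTypes.contains(dataType)
                && !eventDate.isBefore(begin)
                && !eventDate.isAfter(end);
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of(
                "shardIdFactory.generator.1.datatypes", "myjson",
                "shardIdFactory.generator.1.begin", "20060101",
                "shardIdFactory.generator.1.end", "20130331");
        GeneratorScopeSketch scope = new GeneratorScopeSketch(conf, 1);
        System.out.println(scope.applies("myjson", LocalDate.of(2007, 9, 24)));
    }
}
```

Under these assumptions, a `myjson` record dated 2007-09-24 is in scope for generator 1, while a record dated after 2013-03-31 or belonging to another datatype is not.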
This results in the following shards in the shard table after installing and initializing datawave:
20070924_1 tvmaze\x00-r5sogy.30uc2y.a9zr4g:ORIG_FILE\x00tvmaze-api.json|5|0 [PRIVATE|(BAR&FOO)]
20070924_1 tvmaze\x00-r5sogy.30uc2y.a9zr4g:PREMIERED\x002007-09-24 [PRIVATE|(BAR&FOO)]
20070924_1 tvmaze\x00-r5sogy.30uc2y.a9zr4g:RATING_AVERAGE.RATING_0.AVERAGE_0\x008.2 [PRIVATE|(BAR&FOO)]
20070924_1 tvmaze\x00-r5sogy.30uc2y.a9zr4g:RUNTIME\x0030 [PRIVATE|(BAR&FOO)]
20070924_1 tvmaze\x00-r5sogy.30uc2y.a9zr4g:SCHEDULE_DAYS.SCHEDULE_0.DAYS_0\x00Thursday [PRIVATE|(BAR&FOO)]
20070924_1 tvmaze\x00-r5sogy.30uc2y.a9zr4g:SCHEDULE_TIME.SCHEDULE_0.TIME_0\x0020:00 [PRIVATE|(BAR&FOO)]
Compared to the original:
20070924_0 tvmaze\x00-r5sogy.30uc2y.a9zr4g:ORIG_FILE\x00tvmaze-api.json|5|0 [PRIVATE|(BAR&FOO)]
20070924_0 tvmaze\x00-r5sogy.30uc2y.a9zr4g:PREMIERED\x002007-09-24 [PRIVATE|(BAR&FOO)]
20070924_0 tvmaze\x00-r5sogy.30uc2y.a9zr4g:RATING_AVERAGE.RATING_0.AVERAGE_0\x008.2 [PRIVATE|(BAR&FOO)]
20070924_0 tvmaze\x00-r5sogy.30uc2y.a9zr4g:RUNTIME\x0030 [PRIVATE|(BAR&FOO)]
20070924_0 tvmaze\x00-r5sogy.30uc2y.a9zr4g:SCHEDULE_DAYS.SCHEDULE_0.DAYS_0\x00Thursday [PRIVATE|(BAR&FOO)]
20070924_0 tvmaze\x00-r5sogy.30uc2y.a9zr4g:SCHEDULE_TIME.SCHEDULE_0.TIME_0\x0020:00 [PRIVATE|(BAR&FOO)]
Note that num.shards was set to 1 for this test, which is why the original shard IDs all end in _0.