Dmitry Goldenberg
This ticket would provide a way to place the results of processing on a Kafka topic. Since placing large documents on a Kafka topic is an anti-pattern, the recommended...
This ticket would add a scanner implementation that reads documents from a Kafka topic as a consumer. When documents are large, it would be expected that the item read is...
We see two types of use-cases out there. A CSV splitter is a common one: split CSV such that: - the first line is (optionally) the header - any empty...
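A minimal sketch of the CSV splitting described above, assuming one output document per data line. The class name `CsvSplitSketch`, the `hasHeader` flag, and the `colN` fallback keys are hypothetical. The rule for empty input is cut off in the ticket, so skipping blank lines here is only a guess. Quoted fields and embedded commas are not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the ETL framework.
final class CsvSplitSketch {

    // Splits CSV text into one map per data line.
    // If hasHeader is true, the first non-blank line supplies the keys;
    // otherwise keys are col0, col1, ...
    static List<Map<String, String>> split(String csv, boolean hasHeader) {
        List<Map<String, String>> docs = new ArrayList<>();
        String[] header = null;
        for (String line : csv.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue; // assumption: blank lines are skipped
            }
            String[] cells = line.split(",", -1); // naive split, no quoting support
            if (hasHeader && header == null) {
                header = cells;
                continue;
            }
            Map<String, String> doc = new LinkedHashMap<>();
            for (int i = 0; i < cells.length; i++) {
                String key = (header != null && i < header.length)
                        ? header[i].trim()
                        : "col" + i;
                doc.put(key, cells[i].trim());
            }
            docs.add(doc);
        }
        return docs;
    }
}
```

A production splitter would more likely delegate parsing to an established CSV library so that quoting and escaping are handled correctly.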
We see both types of use-cases out there. A JSON splitter is a common one: split JSON using a given JsonPath which identifies the start of a "document" within the JSON....
You'll want a command-line interface for the ETL framework in order to make it usable by DevOps and for ease of integration and testing in general, as an alternative to...
This can be seen in e.g. JdbcScannerTester (to be checked in). In the output below, notice how the same document data is processed by multiple threads. Due to scan frequency...
The common functionality to refactor into the interface could be - the 'build' method - maybe handle getObj / setObj in a unified way - a 'validate' method to validate...
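One possible shape for such an interface, as a sketch only. The name `Buildable`, the generic parameter, the signatures, and the choice to have 'build' call 'validate' first are all assumptions. The ticket names only 'build', getObj / setObj, and 'validate'.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface; names and signatures are illustrative.
interface Buildable<T> {
    T getObj();

    void setObj(T obj);

    // Returns a list of problems; an empty list means the object is valid.
    List<String> validate();

    // Assumed behavior: refuse to build while validation problems remain.
    default T build() {
        List<String> problems = validate();
        if (!problems.isEmpty()) {
            throw new IllegalStateException("Invalid object: " + problems);
        }
        return getObj();
    }
}

// Toy implementation used only to exercise the interface.
final class NameBuilder implements Buildable<String> {
    private String name;

    @Override
    public String getObj() {
        return name;
    }

    @Override
    public void setObj(String obj) {
        this.name = obj;
    }

    @Override
    public List<String> validate() {
        List<String> problems = new ArrayList<>();
        if (name == null || name.isEmpty()) {
            problems.add("name must not be empty");
        }
        return problems;
    }
}
```

Returning a list of problems from 'validate' rather than throwing on the first one is a design choice made here for illustration; it lets a caller report everything that is wrong in one pass.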
Rather than sprinkling the code with string literals, it's best to have a class or interface with the common field names. This has the advantages of avoiding misspellings...
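A sketch of what such a constants holder might look like. The class name `FieldNames` and the `ID` constant are hypothetical. Only "original_content_size" is a field name that appears in these tickets.

```java
// Hypothetical constants holder; not the framework's actual class.
final class FieldNames {
    // Field name taken from the content-size ticket.
    static final String ORIGINAL_CONTENT_SIZE = "original_content_size";

    // Illustrative only; this field name is an assumption.
    static final String ID = "id";

    private FieldNames() {
        // Not instantiable: this class only holds constants.
    }
}
```

Callers would then write `doc.put(FieldNames.ORIGINAL_CONTENT_SIZE, size)` instead of repeating the literal, so a misspelled name fails at compile time.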
A few thoughts on this: - Terminology: "data provenance" rather than "FTI / fault tolerant indexing" - Use a domain-driven development methodology and push 3rd-party dependencies to the edges...
We've agreed that we'll need two distinct "system level" fields to maintain the content size: - the "original_content_size" - the size of an input file as it got pushed into...