Dmitry Goldenberg issues

Results: 10 issues of Dmitry Goldenberg

Implement Kafka Sender

This ticket would provide a way to place the results of processing on a Kafka topic. Noting that placing large documents in a Kafka topic is an anti-pattern, the recommended...

enhancement

Implement Kafka scanner

This ticket would add a scanner implementation that reads documents from a Kafka topic as a consumer. When documents are large, it would be expected that the item read is...

enhancement

Implement CSV splitter, CSV scanner

We see two types of use cases out there. A CSV splitter is a common one: split a CSV such that: - the first line is (optionally) the header - any empty...

enhancement

Implement JSON splitter, JSON scanner

We see both types of use cases out there. A JSON splitter is a common one: split JSON using a given JsonPath which identifies the start of a "document" within the JSON....

enhancement

Command-line interface

You'll want a command-line interface for the ETL framework in order to make it usable by DevOps and for ease of integration and testing in general, as an alternative to...

enhancement

Different worker threads work on the same data in a scanner

This can be seen in e.g. JdbcScannerTester (to be checked in). In the output below, notice how the same document data is processed by multiple threads. Due to scan frequency...

Needs Clarification

Interface for the processor Builder classes, input validation

The common functionality to refactor into the interface could be - the 'build' method - maybe handle getObj / setObj in a unified way - a 'validate' method to validate...

enhancement
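A minimal sketch of what the shared builder interface from the issue above might look like, assuming Java. The names `ProcessorBuilder`, `validateAndBuild`, and `PrefixProcessorBuilder` are hypothetical and not taken from the project. Only the 'build' and 'validate' methods are mentioned in the issue text; what 'validate' checks is an assumption here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical common interface for processor builders (not the project's API).
interface ProcessorBuilder<T> {
    // Returns a list of problems; an empty list means the input is valid.
    List<String> validate();

    // Creates the processor from the current builder state.
    T build();

    // Convenience: refuse to build when validation reports problems.
    default T validateAndBuild() {
        List<String> errors = validate();
        if (!errors.isEmpty()) {
            throw new IllegalStateException("Invalid builder input: " + errors);
        }
        return build();
    }
}

// Made-up example builder whose "processor" prepends a prefix to a string.
final class PrefixProcessorBuilder implements ProcessorBuilder<UnaryOperator<String>> {
    private String prefix;

    PrefixProcessorBuilder prefix(String value) {
        this.prefix = value;
        return this;
    }

    @Override
    public List<String> validate() {
        List<String> errors = new ArrayList<>();
        if (prefix == null || prefix.isEmpty()) {
            errors.add("prefix must be set");
        }
        return errors;
    }

    @Override
    public UnaryOperator<String> build() {
        final String p = prefix; // capture the current value for the lambda
        return s -> p + s;
    }
}
```

Returning a list of problems from `validate` rather than throwing on the first one is a design choice of this sketch; it lets a caller report every invalid input at once.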

Add class or interface with all common field names

Rather than sprinkling the code with string literals, it's best to have a class or interface with the common field names. This has the advantages of avoiding misspellings...

enhancement
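The idea in the issue above could be sketched as follows, assuming Java. The class name `FieldNames`, the constant name, and the `DocumentExample` helper are hypothetical; the only field name taken from this page is `original_content_size`, which appears in the "file size" issue.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for shared field names (not the project's actual class).
final class FieldNames {
    // Field name quoted in the "file size" issue on this page.
    static final String ORIGINAL_CONTENT_SIZE = "original_content_size";

    private FieldNames() {
        // Constants only; no instances.
    }
}

// Illustrative caller: refers to the constant instead of repeating the literal.
final class DocumentExample {
    static Map<String, Object> withOriginalSize(long sizeInBytes) {
        Map<String, Object> doc = new HashMap<>();
        doc.put(FieldNames.ORIGINAL_CONTENT_SIZE, sizeInBytes);
        return doc;
    }
}
```

With this layout a misspelled field name becomes a compile error rather than a silent runtime mismatch.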

Thoughts on data provenance

A few thoughts on this: - Terminology: "data provenance" rather than "FTI / fault tolerant indexing" - Use domain-driven development methodology and push third-party dependencies to the edges...

enhancement

help wanted

discussion

Reconcile handling of the "file size"

We've agreed that we'll need two distinct "system level" fields to maintain the content size: - the "original_content_size" - the size of an input file as it got pushed into...

enhancement

discussion