kafka-connect-spooldir

enhancement: Make InputFileDequeue overridable.

Open crichton opened this issue 5 years ago • 6 comments

Trying to leverage some of the nice features related to file management by customizing this connector. The capability to override the file selection process via the InputFileDequeue is one way to do it, but some of the methods and fields are private or package-scoped, making this difficult.

crichton avatar Sep 27 '19 19:09 crichton

@crichton What problems are you running into? Changing the visibility of some of this could be a possibility. Can you outline what you are trying to accomplish? Also, please make sure that you are targeting the 2.0 branch. This is the latest version, which will be shipped soon.

jcustenborder avatar Sep 27 '19 19:09 jcustenborder

Hi JC. I'm trying to adapt the connector to read a Redshift extract. So I'll create a new connector by creating the required class triad extending AbstractSourceTask etc. Since the layout of a Redshift extract is more complex than what spool-dir can handle out of the box (it has subdirectories and file naming conventions based on a schema), I need a more functional way to control how files are located and queued.

2 ideas :

  1. Override/replace InputFileDequeue with a custom InputFileDequeue that can read the more complex file layout. To do this cleanly, the instantiation of AbstractSourceTask.inputFileDequeue should be easier to override. inputFileDequeue is at package scope; a protected access specifier would be better, or better yet a public/protected getInputFileDequeue method would be preferred.

  2. Add a callback mechanism within InputFileDequeue itself that optionally asks an external implementation for the list of files. Maybe this could be implemented as a lambda or a plain interface.
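To make idea 2 concrete, here's a rough sketch. The `FileSupplier` interface, its `files()` method, and `poll()` are all hypothetical names I'm making up for illustration, not existing connector APIs:

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

public class Example {
    // Hypothetical callback: something an InputFileDequeue could optionally
    // delegate to when building its list of candidate files, instead of
    // scanning a single flat directory itself.
    interface FileSupplier {
        List<File> files();
    }

    // Stand-in for the dequeue's poll step: just asks the supplier.
    static List<File> poll(FileSupplier supplier) {
        return supplier.files();
    }

    public static void main(String[] args) {
        // A lambda works since FileSupplier has a single abstract method;
        // a custom implementation could walk subdirectories, filter by
        // naming convention, etc.
        FileSupplier supplier = () -> Arrays.asList(
                new File("input/orders/orders-0001.tsv"),
                new File("input/users/users-0001.tsv"));
        System.out.println(poll(supplier).size()); // prints 2
    }
}
```

The point is just that the file-discovery policy becomes pluggable while the rest of the dequeue (ordering, processing/error moves, etc.) stays as-is.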

Does that make sense?

crichton avatar Sep 28 '19 01:09 crichton

That context makes things a bit better. Is this an in-house process or a standard process for Redshift? If it's standard, can you point me to some docs so I can read up on it?

jcustenborder avatar Sep 28 '19 18:09 jcustenborder

I'll see what I can find. It's basically a database schema export, with the schema placed into a JSON file containing table objects with column sub-objects, each having a data type. These describe the headers for the tab-delimited table data files. Each table data file is in a subdirectory named after its table. That is why I'm trying to commandeer the spool-dir csv connector.
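To illustrate, the layout looks roughly like this (the table and file names here are made up):

```
extract/
  schema.json            <- table objects with column sub-objects (name, data type)
  orders/                <- one subdirectory per table
    orders-0001.tsv      <- tab-delimited data; headers described by schema.json
  users/
    users-0001.tsv
```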

crichton avatar Oct 03 '19 19:10 crichton

That would be good to understand. I'd love to see what we can do to assist you on this one. Off the top of my head I would want to use the json file as a control file but from there we'd need to do something around the other files. If spooldir isn't the right path, I'm happy to give you guidance on where to go next.

jcustenborder avatar Oct 03 '19 20:10 jcustenborder

Yes. I'm halfway through coding this using the schema file. It's a learning management system called lmsCanvas, so the documentation is fairly fragmented, but here and here are 2 spots with info. They seem to provide tools to import to Redshift, so it's not exactly a 1-to-1 Redshift import. I think Redshift imports by table, not by schema.

So net-net, I think it still would be cool to make spool-dir more extendable, if possible, by refactoring a few things, unless there is a connector that is a better fit?

Also, will the latest spool-dir build get put into Maven Central or some other public repo that mvn/gradle can read? Thks. -Tito

crichton avatar Oct 03 '19 22:10 crichton