
Add TCP source and TCP sink to simplify steps to test simple topology

Open HeartSaVioR opened this issue 6 years ago • 2 comments

Any testing case (outside of test mode) requires setting up external storage (for the source/sink) and also the Schema Registry. Whenever I spin up a cluster for testing, I need to…

  1. find the server where SR is installed
  2. log in remotely via SSH
  3. cd to the example directory in SR
  4. modify properties if needed (the default property values don’t conform to HDF)
  5. figure out how to run the example jar, and run it multiple times
  6. repeat steps 4–5 per topic (source/sink)

It doesn’t matter which test I would like to do: any small test or verification requires the steps above. We can’t test/verify an issue in test mode either, since I couldn’t even compose the topology app properly before setting up the environment. For test mode there’s a workaround (import an existing topology to skip filling in the source/sink information), but it requires the target topology to have been composed and exported previously.

Even when we do a manual test in a local environment with a manual cluster, my local machine has to run at least SR and Kafka as well as Storm, and go through similar setup steps.

We could add a TCP source as well as a TCP sink to make these steps fairly simple.

TCP Source:

  • connect to a TCP server and read from the socket (this should be handy with the “nc” command).
  • the schema can be defined in SR, but only take the schema information from SR and don’t require the data format to be Avro.
  • It would be ideal if we could avoid requiring SR at all in this specific case and define the schema in some handy way.
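To make the idea concrete, here is a rough sketch in Python of the behavior such a TCP source could have (purely illustrative; the real implementation would live in SAM/Storm, and all names here are hypothetical): connect to a TCP server, read newline-delimited JSON events, and check each one against a locally defined schema so that SR is not required. The `serve_json_lines` helper is a stand-in for running `nc -l` on the other end.

```python
import json
import socket
import threading

def serve_json_lines(payload: bytes, host="127.0.0.1"):
    """Tiny stand-in for `nc -l`: serve one client the given bytes,
    then close. Returns the port to connect to."""
    srv = socket.socket()
    srv.bind((host, 0))
    srv.listen(1)

    def run():
        conn, _ = srv.accept()
        conn.sendall(payload)
        conn.close()
        srv.close()

    threading.Thread(target=run, daemon=True).start()
    return srv.getsockname()[1]

def tcp_source(host, port, schema):
    """Hypothetical TCP source: read newline-delimited JSON events from
    a socket and check each against a locally defined schema
    (field name -> expected Python type), so no SR is needed."""
    with socket.create_connection((host, port)) as sock:
        for line in sock.makefile("r"):
            event = json.loads(line)
            for field, ftype in schema.items():
                if not isinstance(event.get(field), ftype):
                    raise ValueError(f"field {field!r} does not match schema")
            yield event
```

In a real cluster test you would start `nc -l <port>` and paste JSON lines into it; the source would emit each line as a structured event.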

TCP Sink:

  • don’t require schema information: just write the event to the TCP server.
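The sink side is even simpler; a minimal Python sketch of the intended behavior (hypothetical names, not SAM code): serialize each event to JSON and write it as one line over the socket, with no schema involved. The `collect_lines` helper plays the role of `nc -l` on the receiving end.

```python
import json
import socket
import threading

class TcpSink:
    """Hypothetical TCP sink: no schema information needed; each event
    is just serialized to JSON and written as one line to the server."""
    def __init__(self, host, port):
        self.sock = socket.create_connection((host, port))

    def write(self, event):
        self.sock.sendall((json.dumps(event) + "\n").encode())

    def close(self):
        self.sock.close()

def collect_lines(out, host="127.0.0.1"):
    """Stand-in for `nc -l` on the receiving side: accept one client and
    append each received line to `out`. Returns (port, thread)."""
    srv = socket.socket()
    srv.bind((host, 0))
    srv.listen(1)

    def run():
        conn, _ = srv.accept()
        for line in conn.makefile("r"):
            out.append(line.rstrip("\n"))
        srv.close()

    t = threading.Thread(target=run)
    t.start()
    return srv.getsockname()[1], t
```

On a real cluster you would just run `nc -l <port>` and watch the emitted events scroll by.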

After adding the source and sink, we would just need to save pre-defined events in JSON format (as well as the schema JSON for SR, if necessary) to a file, and setup is done. There’s even a combination that completely eliminates the need to push events to the source or register a schema in SR: if we can define the events and schema in the source itself, and exporting/importing a topology retains that information.

HeartSaVioR avatar Feb 28 '18 05:02 HeartSaVioR

We could add a TCP source as well as a TCP sink to make these steps fairly simple.

I propose we define a source and sink API in SAM itself and build the TCP source and sink on top of that. Right now there are multiple steps required to add even a simple source like adding the corresponding spout, defining the flux translation, UI component definition etc.

TCP Source:

This would be more useful for testing the real flow in the cluster than in test mode. For test mode we could inject the data in JSON and test the flow. Once we decouple the environment from test mode, it would become even simpler.

connect to a TCP server and read from the socket (this should be handy with the “nc” command). the schema can be defined in SR, but only take the schema information from SR and don’t require the data format to be Avro.

I think we should decouple the parsing step from the source itself; otherwise the scope of the source becomes narrow (e.g. it could process only Avro, CSV, JSON, etc.). We could even have some generic parsers (Avro, CSV, JSON, etc.) that could be attached to any source, instead of adding the parsing logic to each source.
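To illustrate the decoupling, here is a small Python sketch of what such an attachable parser interface might look like (hypothetical names, not SAM's actual API): the source only delivers raw bytes, and any parser can be plugged in to turn them into structured events.

```python
import csv
import io
import json

class JsonLinesParser:
    """Generic parser: newline-delimited JSON bytes -> list of dict events."""
    def parse(self, raw: bytes):
        return [json.loads(line) for line in raw.decode().splitlines() if line]

class CsvParser:
    """Generic parser: CSV rows -> dict events keyed by the given field names."""
    def __init__(self, fields):
        self.fields = fields

    def parse(self, raw: bytes):
        rows = csv.reader(io.StringIO(raw.decode()))
        return [dict(zip(self.fields, row)) for row in rows]

def run_source(raw_chunks, parser):
    """Any source yields raw byte chunks; the attached parser turns
    them into events, keeping parsing out of the source itself."""
    events = []
    for chunk in raw_chunks:
        events.extend(parser.parse(chunk))
    return events
```

With this shape, the same TCP source could emit JSON, CSV, or Avro events just by swapping the parser it is paired with.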

It would be ideal if we could avoid requiring SR at all in this specific case and define the schema in some handy way.

Yes, we should provide the flexibility to either manage the schema via the registry or have users define the output fields in the component (e.g. like we allow in the custom processor).

arunmahadevan avatar Mar 01 '18 18:03 arunmahadevan

Intentionally adjusting the sequence of the quotes:

About TCP source / sink:

This would be more useful for testing the real flow in the cluster than in test mode. For test mode we could inject the data in JSON and test the flow. Once we decouple the environment from test mode, it would become even simpler.

Another intention of the TCP source/sink is that they don't require any external service, so an end-to-end topology can be composed in a test environment. Currently a source and sink can't be defined without coupling to an external service, so even with a test environment we are forced to import or clone an existing app.

About supporting source and sink API:

I propose we define a source and sink API in SAM itself and build the TCP source and sink on top of that. Right now there are multiple steps required to add even a simple source like adding the corresponding spout, defining the flux translation, UI component definition etc.

I think we should decouple the parsing step from the source itself; otherwise the scope of the source becomes narrow (e.g. it could process only Avro, CSV, JSON, etc.). We could even have some generic parsers (Avro, CSV, JSON, etc.) that could be attached to any source, instead of adding the parsing logic to each source.

Yes, strongly agreed. Ideally we should provide a set of public APIs in the SDK for custom sources and custom sinks. The thing is, unlike the custom processor, the implementation of a source and a sink heavily depends on the underlying streaming engine. (This may not be the case for sinks, but it is for sources.) We can't abstract them the way we do the custom processor.

Btw, please note the rationale of this issue. This issue intends to address the current lack of a testing environment, so its requirements should stay simple so we can land it sooner rather than later. A source and sink API is more like a new feature and will need much more effort; I also described the hard part of defining a public API for sources and sinks above.

I agree the steps to add them will be annoying: if we see the benefit of providing the TCP source and sink to end users, those steps should not be forced on end users. If we don't want to expose them by default, we could just create a script that registers only the TCP source and sink, and execute it to register them when we need to.

HeartSaVioR avatar Mar 01 '18 21:03 HeartSaVioR