
task: [processor] create cmd/processor to collect from collectors

Open lumjjb opened this issue 3 years ago • 1 comments

Collectors that obtain documents need somewhere to emit them to. The processor, which is the next part of the pipeline, needs to gather the documents and process them.

There are a few natural options:

  1. Processor runs as a gRPC server
  2. Processor obtains documents from a Pub/Sub queue (e.g. kafka, nats.io, etc.)
  3. Processor ingests from STDIN or file
  4. Processor and Collector are part of the same process.
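To make the comparison concrete, here is a minimal sketch of how the processor could hide the transport behind one interface, so the choice among the options above stays swappable. All names (`Document`, `DocumentSource`, `readerSource`, `chanSource`, `drain`) are hypothetical and not existing GUAC code. Only option 3 (a reader such as STDIN or a file, assuming one document per line) and option 4 (an in-process channel) are shown; a gRPC or Pub/Sub source would implement the same interface.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// Document is a hypothetical stand-in for whatever a collector emits.
type Document struct {
	Blob []byte
}

// DocumentSource is a hypothetical abstraction over the transport options.
type DocumentSource interface {
	// Next returns the next document, or io.EOF once the source is exhausted.
	Next() (Document, error)
}

// readerSource sketches option 3: documents read from STDIN or a file,
// assuming (for illustration only) one document per line.
type readerSource struct {
	sc *bufio.Scanner
}

func newReaderSource(r io.Reader) *readerSource {
	return &readerSource{sc: bufio.NewScanner(r)}
}

func (s *readerSource) Next() (Document, error) {
	if s.sc.Scan() {
		// Copy the bytes because the scanner reuses its internal buffer.
		return Document{Blob: append([]byte(nil), s.sc.Bytes()...)}, nil
	}
	if err := s.sc.Err(); err != nil {
		return Document{}, err
	}
	return Document{}, io.EOF
}

// chanSource sketches option 4: a collector in the same process
// hands documents over on a channel and closes it when done.
type chanSource struct {
	ch <-chan Document
}

func (s chanSource) Next() (Document, error) {
	d, ok := <-s.ch
	if !ok {
		return Document{}, io.EOF
	}
	return d, nil
}

// drain pulls documents from any source until it is exhausted.
func drain(src DocumentSource) ([]Document, error) {
	var docs []Document
	for {
		d, err := src.Next()
		if err == io.EOF {
			return docs, nil
		}
		if err != nil {
			return docs, err
		}
		docs = append(docs, d)
	}
}

func main() {
	// A string reader stands in for os.Stdin so the example is deterministic.
	docs, err := drain(newReaderSource(strings.NewReader("doc-1\ndoc-2\n")))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("documents received:", len(docs))
}
```

With this shape, the processing logic only depends on `DocumentSource`, so the decision between a server, a queue, or a local input can be deferred or made per deployment.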

This boils down to how we want collectors and processors to be run in the architecture. The ingestor will most likely be tied to the assembler.

Deliberation:

  • Will all the collectors be run in a single executable? The processor will cache duplicate documents, so it is beneficial to have an n:m relationship (where n>m) between collectors and executables. If the answer is no, this excludes options 3 and 4.
    • I think the answer is likely no: collectors need credentials, and no single account or team would hold all of them.
  • Options 1 and 2 are similar, with a trade-off between simplicity and scale.
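The duplicate caching mentioned above could look roughly like the sketch below, which is why several collectors feeding one processor is attractive: the same document arriving from two collectors is processed once. Keying the cache on a SHA-256 digest of the document bytes is an assumption made for illustration, and `dedupCache` and `FirstSeen` are hypothetical names, not existing GUAC code.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// dedupCache is a hypothetical in-memory record of documents already seen,
// keyed by the SHA-256 digest of their contents (an assumption of this sketch).
type dedupCache struct {
	mu   sync.Mutex
	seen map[[sha256.Size]byte]struct{}
}

func newDedupCache() *dedupCache {
	return &dedupCache{seen: make(map[[sha256.Size]byte]struct{})}
}

// FirstSeen records blob and reports whether this is the first time
// identical content has been observed. It is safe for concurrent use,
// since several collectors may feed one processor.
func (c *dedupCache) FirstSeen(blob []byte) bool {
	sum := sha256.Sum256(blob)
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.seen[sum]; ok {
		return false
	}
	c.seen[sum] = struct{}{}
	return true
}

func main() {
	cache := newDedupCache()
	// The third document repeats the first, as if two collectors found it.
	for _, doc := range []string{"sbom-a", "sbom-b", "sbom-a"} {
		fmt.Printf("%s first seen: %v\n", doc, cache.FirstSeen([]byte(doc)))
	}
}
```

A cache like this only helps when the duplicates reach the same processor instance, which is the motivation for having more collectors than processors.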

lumjjb avatar Aug 25 '22 12:08 lumjjb

@trmiller this may be interesting to you

lumjjb avatar Aug 25 '22 16:08 lumjjb