
[DOC] Clarification regarding Data Prepper sinks.

nateynateynate opened this issue on Jul 17 '24 · 4 comments

What do you want to do?

  • [X] Request a change to existing documentation
  • [X] Add new documentation
  • [ ] Report a technical problem with the documentation
  • [ ] Other

Tell us about your request. Provide a summary of the request.

Someone was asking whether Data Prepper can "handle" Apache Avro data and found that the documentation wasn't entirely clear. Avro is listed as a codec for Data Prepper, but the docs say it is "most efficiently used" in an S3 sink. Could we add a paragraph or so about how it can be used outside of an S3 sink?

Also, the page has some formatting oddities that make it a little hard to skim. See screenshots.

Version: List the OpenSearch version to which this issue applies, e.g. 2.14, 2.12-2.14, or all.

2.15

What other resources are available? Provide links to related issues, POCs, steps for testing, etc.

[screenshots attached]

— nateynateynate, Jul 17 '24

@dlvenable - Can you please comment on this? Here is the link: https://opensearch.org/docs/latest/data-prepper/common-use-cases/codec-processor-combinations/#avro

— hdhalter, Jul 17 '24

Regarding the original question, Data Prepper can read Avro from S3 and write Avro to S3.
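
For reference, here is a minimal pipeline sketch showing the avro codec on both the S3 source and the S3 sink. The queue URL, bucket, region, and schema are placeholders, and option names may differ slightly between Data Prepper versions, so verify against the S3 source and sink docs:

```yaml
avro-pipeline:
  source:
    s3:
      notification_type: sqs
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/my-avro-queue"  # placeholder
      codec:
        avro:                      # decode incoming S3 objects as Avro
      aws:
        region: "us-east-1"        # placeholder
  sink:
    - s3:
        bucket: "my-output-bucket" # placeholder
        codec:
          avro:
            # a writer schema is needed when producing Avro; this one is illustrative
            schema: >
              {"type": "record", "name": "Event",
               "fields": [{"name": "message", "type": "string"}]}
        threshold:
          event_count: 1000        # flush an object after 1000 events
        aws:
          region: "us-east-1"      # placeholder
```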

Regarding the documentation, we should revisit this page. The original intention was to clarify when a user should use a codec versus a processor for parsing input data.
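
To illustrate that distinction, a rough sketch using JSON data (sqs/aws settings omitted; check option names against your version): with a codec, the source parses objects as it reads them, while with a processor, the pipeline reads raw lines and parses a field afterward.

```yaml
# Option A: parse at read time with a codec on the source
codec-pipeline:
  source:
    s3:
      codec:
        json:                 # source emits already-parsed events
      # ... sqs/aws settings omitted ...
  sink:
    - stdout:

# Option B: read raw lines, then parse with a processor
processor-pipeline:
  source:
    s3:
      codec:
        newline:              # each line becomes an event with a "message" field
      # ... sqs/aws settings omitted ...
  processor:
    - parse_json:
        source: "message"     # parse the raw line into structured fields
  sink:
    - stdout:
```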

I might reword this as:

Apache Avro is an open-source serialization format for record data. When reading Avro data, you should use the avro codec.

— dlvenable, Jul 18 '24

I also noticed some questionable statements about Parquet in the documentation.

Apache Parquet is a columnar storage format built for Hadoop. It is most efficient without the use of a codec. Positive results, however, can be achieved when it’s configured with S3 Select.

Perhaps this should say:

Apache Parquet is a columnar storage format built for Hadoop. Pipeline authors can use the parquet codec to read Parquet data directly from the S3 object. This retrieves all data from the Parquet file. An alternative is to use S3 Select instead of the codec. In that case, S3 Select parses the Parquet file directly (additional S3 charges apply), which can be more efficient if you are filtering or loading only a subset of the data.
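
To make the two approaches concrete, a rough sketch (the SQL expression and omitted settings are placeholders; verify the s3_select option names against the S3 source documentation):

```yaml
# Approach 1: the parquet codec retrieves and decodes the entire object
parquet-codec-pipeline:
  source:
    s3:
      codec:
        parquet:
      # ... sqs/aws settings omitted ...
  sink:
    - stdout:

# Approach 2: S3 Select parses the Parquet file on the S3 side and
# returns only matching rows (additional S3 charges apply)
s3-select-pipeline:
  source:
    s3:
      s3_select:
        expression: "SELECT s.status FROM S3Object s WHERE s.status = 'error'"  # placeholder query
        input_serialization: parquet
      # ... sqs/aws settings omitted ...
  sink:
    - stdout:
```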

— dlvenable, Jul 18 '24

@nateynateynate - Do you want to take a stab at pushing up the changes?

— hdhalter, Jul 19 '24