opentelemetry-collector-contrib

[exporter/file] Add possibility to write telemetry in Parquet or Delta format

Open marcinsiennicki95 opened this issue 1 year ago • 2 comments

Component(s)

exporter/file

Is your feature request related to a problem? Please describe.

Parquet Format: Parquet is a columnar storage file format optimized for big data processing frameworks. It provides efficient data compression and encoding schemes, enhancing performance and reducing storage costs. Telemetry data written in Parquet format is stored in columns, making it faster to read and query specific fields.

Delta Format: Delta Lake is an open-source storage layer that brings ACID transactions to big data workloads. Delta format combines the reliability of data lakes with the performance of data warehouses. Writing telemetry data in Delta format allows for scalable and reliable data processing, supporting complex data pipelines and real-time analytics.

Describe the solution you'd like

Ability to write in Parquet or Delta format

Describe alternatives you've considered

No response

Additional context

No response

marcinsiennicki95 avatar Jun 28 '24 13:06 marcinsiennicki95

Pinging code owners:

  • exporter/file: @atingchen

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] avatar Jun 28 '24 13:06 github-actions[bot]

@jmacd Is this possible with the current state of Arrow? I ask because I found the following in the documentation:

https://github.com/open-telemetry/otel-arrow 4. Output OpenTelemetry data to the Parquet file format, part of the Apache Arrow ecosystem

marcinsiennicki95 avatar Jun 28 '24 16:06 marcinsiennicki95

@marcinsiennicki95 there is a connection between Arrow and Parquet, but it is not an automatic translation. The way we have structured the OTel-Arrow data stream, there are multiple logical tables being exchanged within an Arrow IPC payload, both because of varying schemas within the telemetry and because of shared data references. These multiple logical tables would naturally translate into multiple Parquet files.

When writing tables of shared data across an OTel-Arrow stream, the OTel-Arrow components will repeat shared data once per stream, whereas in a database system it would be possible to refer to past data already in the system. The tradeoffs involved between writing across the network and constructing a database are large, so to make progress on this issue we would have to settle on what the Parquet schema looks like.

cc/ @lquerel

jmacd avatar Jul 08 '24 18:07 jmacd
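
To illustrate the point above about one telemetry stream mapping to several logical tables, here is a minimal sketch in plain Python. It is not the OTel-Arrow schema and does not write Parquet; the table names (`resources`, `spans`, `span_attrs`), the field names, and the sample records are all hypothetical. It only shows how shared data (the resource) and varying data (per-span attributes) end up in separate fixed-schema tables, each of which would be a candidate for its own Parquet file.

```python
from collections import defaultdict

# Hypothetical flattened span records; field names are illustrative only.
spans = [
    {"service": "checkout", "host": "node-1", "name": "GET /cart",
     "duration_ms": 12, "attrs": {"http.status_code": 200}},
    {"service": "checkout", "host": "node-1", "name": "POST /pay",
     "duration_ms": 48, "attrs": {"http.status_code": 500, "error": True}},
    {"service": "search", "host": "node-2", "name": "GET /q",
     "duration_ms": 7, "attrs": {}},
]


def split_into_tables(records):
    """Normalize records into three logical tables linked by integer ids."""
    resource_ids = {}  # (service, host) -> resource_id, used to deduplicate
    tables = defaultdict(list)
    for span_id, rec in enumerate(records):
        key = (rec["service"], rec["host"])
        if key not in resource_ids:
            # Shared data is stored once and referenced by id afterwards.
            resource_ids[key] = len(resource_ids)
            tables["resources"].append(
                {"resource_id": resource_ids[key],
                 "service": key[0], "host": key[1]})
        tables["spans"].append(
            {"span_id": span_id, "resource_id": resource_ids[key],
             "name": rec["name"], "duration_ms": rec["duration_ms"]})
        # Attribute sets vary per span, so they go into a long-format table.
        for attr_key, attr_value in rec["attrs"].items():
            tables["span_attrs"].append(
                {"span_id": span_id, "key": attr_key, "value": str(attr_value)})
    return dict(tables)


tables = split_into_tables(spans)
for table_name, rows in tables.items():
    print(table_name, len(rows))
```

In a stream, the `resources` rows would have to be resent for each new stream, while a store on disk could keep them once and let later files refer back to them, which is the tradeoff described in the comment above.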

(Teaser: I've been playing around with a Parquet-first telemetry data store, and it has helped me come to concrete opinions about this problem. https://github.com/jmacd/duckpond)

jmacd avatar Jul 08 '24 18:07 jmacd

Thanks for the answer. I had a conversation on the OpenTelemetry Slack channel and found out that @atoulme was working on the Parquet format.

marcinsiennicki95 avatar Jul 09 '24 12:07 marcinsiennicki95

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

github-actions[bot] avatar Sep 09 '24 03:09 github-actions[bot]

Not anymore. As noted, the parquetexporter was not adopted, and we are working on Apache Arrow instead.

atoulme avatar Oct 02 '24 06:10 atoulme

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

github-actions[bot] avatar Dec 03 '24 03:12 github-actions[bot]

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

github-actions[bot] avatar Mar 24 '25 03:03 github-actions[bot]

@jmacd is there any news on this front since last July?

punya avatar Apr 24 '25 19:04 punya