
Redash ingestion logs overwhelming on large Redash deployments

Open atjones0011 opened this issue 1 year ago • 3 comments

Describe the issue

The RedashSourceReport used to report the progress of a Redash ingestion job can produce very large reports on large Redash deployments that contain many items to scan, or that filter out a large number of items. These oversized reports stem from the data structures used for the filtered and timing fields. With the default configuration skipping draft items, and Redash saving a draft for every query run, the number of filtered items grows over time, and the List data structure prints each filtered item on its own line. With my team's ingestion job filtering hundreds of thousands of queries and this report being printed frequently, the logging noticeably degrades the performance of our ingestion job.

Expected behavior

As has been done in other ingestion sources, I propose we switch filtered to a LossyList and timing to a LossyDict. This would let users see how many queries have been filtered and where the ingestion job is in its processing, without printing thousands of extra lines.
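For readers unfamiliar with the lossy collections mentioned above, here is a minimal, self-contained sketch of the idea. This is an illustration only, not DataHub's actual LossyList implementation (which lives in the datahub package and whose API may differ): the collection keeps only the first few elements, counts everything else, and renders a short summary instead of one line per item.

```python
# Minimal sketch of a "lossy" list: it caps how many elements are
# retained while still counting every element ever appended.
# Illustrative only -- not DataHub's actual LossyList.
class LossyList:
    def __init__(self, max_elements: int = 10) -> None:
        self.max_elements = max_elements
        self.items: list = []   # the retained sample
        self.total = 0          # how many elements were ever appended

    def append(self, item) -> None:
        self.total += 1
        if len(self.items) < self.max_elements:
            self.items.append(item)

    def __repr__(self) -> str:
        dropped = self.total - len(self.items)
        if dropped <= 0:
            return repr(self.items)
        return f"{self.items!r} ... and {dropped} more (total {self.total})"


# Filtering 100,000 draft queries now yields one summary line
# instead of 100,000 report lines.
filtered = LossyList(max_elements=3)
for i in range(100_000):
    filtered.append(f"draft-query-{i}")
print(filtered)
```

A LossyDict applies the same principle to key/value pairs (for example, the per-item timing field), bounding the number of keys it retains while tracking how many were dropped.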

atjones0011 avatar Jan 05 '24 18:01 atjones0011

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] avatar Feb 05 '24 01:02 github-actions[bot]

I propose we switch filtered to a LossyList and timing to a LossyDict

I agree, this seems like a pretty reasonable way to solve it.

Would you be willing to open a PR?

hsheth2 avatar Feb 12 '24 20:02 hsheth2

PR #9873 has been opened to resolve this issue.

atjones0011 avatar Feb 16 '24 17:02 atjones0011