data_length makes less sense when data is a nested dictionary rather than a JSON string
### Issue Summary
Before https://github.com/getredash/redash/pull/6687, the data returned by query runners was a JSON string, so the data_length calculated by len(data) made sense:
https://github.com/getredash/redash/blob/60a12e906efb8f7948fdbe5e013249b8b0c0089a/redash/tasks/queries/execution.py#L194-L200
But after https://github.com/getredash/redash/pull/6687, data is a nested dictionary, and len(data) only gives the number of top-level keys. In most cases there are only two, "columns" and "rows", so data_length no longer conveys useful information.
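To make that concrete, here is a minimal sketch; the column and row values are invented, and only the "columns"/"rows" shape comes from the PR:

```python
data = {
    "columns": [{"name": "id", "type": "integer"}],
    "rows": [{"id": n} for n in range(10_000)],
}

# len() counts top-level keys, not the payload size:
print(len(data))  # 2, no matter how many rows were returned
```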
### Steps to Reproduce
Search for data_length= in your logs.
### Technical details:
- Redash Version: 24.06.0-dev
I replaced len(data) with the following:

```python
import sys
from collections import deque


def _get_size_iterative(dict_obj):
    """Iteratively finds the size of an object graph in bytes."""
    seen = set()
    size = 0
    objects = deque([dict_obj])
    while objects:
        current = objects.popleft()
        # Skip objects already counted; this handles shared references and cycles.
        if id(current) in seen:
            continue
        seen.add(id(current))
        size += sys.getsizeof(current)
        if isinstance(current, dict):
            # Count both the keys and the values of dictionaries.
            objects.extend(current.keys())
            objects.extend(current.values())
        elif hasattr(current, '__dict__'):
            # Descend into the attribute dictionary of custom objects.
            objects.append(current.__dict__)
        elif hasattr(current, '__iter__') and not isinstance(current, (str, bytes, bytearray)):
            # Descend into other iterables, but treat strings as atomic.
            objects.extend(current)
    return size
```
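For example, calling it on a hypothetical result dictionary with the same columns/rows shape (the payload itself is made up):

```python
result = {
    "columns": [
        {"name": "id", "type": "integer"},
        {"name": "name", "type": "string"},
    ],
    "rows": [{"id": n, "name": f"user_{n}"} for n in range(1000)],
}

data_length = _get_size_iterative(result)
print(f"data_length={data_length}")  # approximate in-memory size in bytes
```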
It works fine. The in-memory dictionary size is usually much larger than the on-disk size of the same data (e.g. as a CSV file) because of Python's per-object storage overhead, but it at least gives a meaningful relative value. That is especially informative for me because I use data_length in a DataDog dashboard to monitor users' query result sizes.
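To illustrate that overhead, here is a rough comparison against the serialized size, continuing the hypothetical snippet above (the exact ratio will vary with the data):

```python
import json

in_memory = _get_size_iterative(result)   # bytes of Python objects in memory
serialized = len(json.dumps(result))      # bytes when stored as a JSON string

# The in-memory figure is typically several times larger because every dict,
# list, int, and str carries per-object overhead; the two still move together,
# which is what matters for relative monitoring.
print(f"in_memory={in_memory} serialized={serialized}")
```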