Better support for large query results

Open arikfr opened this issue 10 years ago • 39 comments

If a query returns a large result (~50K rows), the UI gets stuck. We need to detect large result sets and handle them differently (server-side pagination?).

arikfr avatar Feb 05 '14 13:02 arikfr

No pagination + Indicator + CSV download should be a good start :)

natict avatar Mar 10 '14 16:03 natict

yep, another optimization is to mark big data sets when we store the query result object.

arikfr avatar Mar 10 '14 16:03 arikfr

Relevant discussion: https://groups.google.com/forum/#!topic/redash-users/UbwvXewsJrQ

arikfr avatar Jul 26 '15 18:07 arikfr

Anybody heard anything about any work on this front? Running into this now.

ChrisLocus avatar Aug 26 '16 14:08 ChrisLocus

@arikfr Do you have any updates on this one? For example, when can we expect a feature release?

eshubhamgarg avatar Jan 30 '17 10:01 eshubhamgarg

It's very low priority compared to other stuff, as usually you don't need large result sets in Redash. So far no work was done on this one.

arikfr avatar Jan 31 '17 09:01 arikfr

Hi, Can you explain what the bottleneck is? Thanks

adfel70 avatar Apr 26 '17 12:04 adfel70

@arikfr how open would you be to a pull request in this area? I have little knowledge of Redash internals, however, we would like to solve the issue and may be able to throw some resources at it if we can work to get any changes incorporated into the project.

Do you have a ballpark estimate on the amount of effort it would require to detect a large result set, and offer a download?

bboe avatar May 30 '17 23:05 bboe

@bboe how open? very much :) this is low priority for me, but I definitely want to better handle this.

It's hard to give an estimate without looking into this in more detail & understanding what kind of solution you want to achieve. Shoot me an email and let's talk further (arik at redash io).

arikfr avatar Jun 01 '17 10:06 arikfr

@arikfr @bboe was a pull request ever made regarding this?

jesse-osiecki avatar Aug 23 '17 18:08 jesse-osiecki

Not from my end. Development time for value ended up not being worth it.

bboe avatar Aug 23 '17 20:08 bboe

Value is to use redash to export/browse large sets of data. Currently this is only suitable for statistics generation.

A quick workaround would be to add an option to truncate data on the backend (after 1000 entries), so users can still hit the "export" button without a laggy UI caused by massive JSON being parsed.
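A minimal sketch of that truncation idea, assuming a result shaped as a dict with `columns` and `rows` keys. The function name `truncate_for_ui`, the `truncated` and `total_rows` fields, and the 1000-row threshold are all hypothetical and not part of Redash:

```python
MAX_UI_ROWS = 1000  # hypothetical threshold, taken from the comment above


def truncate_for_ui(result, max_rows=MAX_UI_ROWS):
    """Return a copy of `result` with at most `max_rows` rows plus truncation metadata."""
    rows = result["rows"]
    return {
        "columns": result["columns"],
        "rows": rows[:max_rows],            # only this slice is sent to the browser
        "truncated": len(rows) > max_rows,  # lets the UI show an indicator
        "total_rows": len(rows),
    }


full = {"columns": [{"name": "id"}], "rows": [{"id": i} for i in range(5000)]}
preview = truncate_for_ui(full)
```

The full result would stay on the server for export, while the browser only parses the preview.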

antwan avatar Sep 20 '17 18:09 antwan

This has been merged! ✨

jezdez avatar Aug 16 '18 15:08 jezdez

@jezdez this issue is about large query results and not a long list of queries :)

arikfr avatar Aug 27 '18 15:08 arikfr

Ugh, being able to read would clearly be an advantage 😬

jezdez avatar Aug 28 '18 17:08 jezdez

Is this issue solved? I have the same situation when the query returns more than 50K rows.

changchichung avatar Aug 30 '18 02:08 changchichung

@changchichung unfortunately not yet. Although if you don't have much more than 50K, maybe just giving more memory to Redash will resolve your issue.

arikfr avatar Sep 07 '18 14:09 arikfr

+1. Version 5.0.1+b4851 on EC2 t2.small: the Redash server cannot respond while it is processing.

koooge avatar Oct 19 '18 07:10 koooge

Version 5.0.1+b4851 on EC2 t2.small, EC2 m3.large: getting "redash Worker exited prematurely: signal 9 (SIGKILL)." Our main requirement is the ability to download large datasets.

ismailsimsek avatar Dec 13 '18 16:12 ismailsimsek

@ismailsimsek try using a larger instance (depends on the dataset size you're trying to download).

arikfr avatar Dec 15 '18 17:12 arikfr

@arikfr what do you think about adding pagination to query_runner, using a server-side cursor so that the database handles the large result set? The client application can then process the result in batches. This probably requires a message in the UI when the full result set is not passed to the UI.
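A rough sketch of batch processing over a DB-API cursor. `sqlite3` is used here only so the example is self-contained; it does not have true server-side cursors. The helper `iter_batches` is hypothetical and not part of Redash's query runners:

```python
import sqlite3


def iter_batches(cursor, batch_size=1000):
    """Yield lists of rows from a DB-API cursor, at most `batch_size` rows at a time."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


# In-memory stand-in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?)", [(i,) for i in range(2500)])

cursor = conn.execute("SELECT id FROM events ORDER BY id")
batch_sizes = [len(batch) for batch in iter_batches(cursor)]
```

With a driver that supports server-side cursors, the same loop would keep only one batch in worker memory at a time.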

Thanks for the great software btw.

ismailsimsek avatar Jan 17 '19 16:01 ismailsimsek

@ismailsimsek pagination/server side cursors won't help without changing how we store the data, because we can't stream the data into Postgres (where we currently store results cache). Also it won't help with serving the results to the browser, because we serve the results to the browser from the cache.

It will help once we change how we store the results and will significantly reduce the memory footprint of the workers.

arikfr avatar Jan 20 '19 09:01 arikfr

@arikfr I've been following this issue and I'm keen to contribute back if possible. We've had to deal with bad queries locking up our whole Redash service and would like a way to limit the maximum response sizes that are returned, (either response size in memory or row count).

Could a minimal solution to this simply be adding a configuration option to set a maximum query result size, and failing safely if it is exceeded? Some use cases that have been mentioned include paging the query results into the database, and I'm interested to hear how these might be made available for download, e.g. as CSV.
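One way such a guard could look, as a sketch only. `ResultTooLargeError`, `collect_rows`, and the `max_rows` setting are hypothetical names, and the limit is on row count rather than memory:

```python
class ResultTooLargeError(Exception):
    """Raised when a query returns more rows than the configured maximum."""


def collect_rows(row_iter, max_rows):
    """Collect rows from an iterator, aborting as soon as `max_rows` is exceeded."""
    rows = []
    for row in row_iter:
        if len(rows) >= max_rows:
            # Fail early instead of letting the worker run out of memory.
            raise ResultTooLargeError(f"query returned more than {max_rows} rows")
        rows.append(row)
    return rows


small = collect_rows(iter(range(5)), max_rows=10)

try:
    collect_rows(iter(range(100)), max_rows=10)
    error = None
except ResultTooLargeError as exc:
    error = str(exc)
```

Because the check runs while rows are being read, the worker never holds more than `max_rows` rows for an oversized query.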

harveyrendell avatar Aug 15 '19 23:08 harveyrendell

Can we set up a config for enforcing limit clause automatically? Many SQL IDEs do this by default to provide better user experience and prevent users from shooting themselves in the foot.

A default limit of 10k is a reasonable threshold. Nobody actually pages through 10k results line by line anyway, and the UI would stutter.
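A deliberately naive sketch of appending a default limit to a query. `apply_default_limit` is a hypothetical helper. The regex also matches `LIMIT` inside subqueries or string literals, and not every data source uses `LIMIT` syntax (SQL Server, for instance, uses `TOP`), so a real implementation would need per-source handling:

```python
import re

# Matches "LIMIT <number>" anywhere in the statement (naive on purpose).
LIMIT_RE = re.compile(r"\blimit\s+\d+", re.IGNORECASE)


def apply_default_limit(sql, default_limit=10000):
    """Append LIMIT to a SELECT statement that does not already contain one."""
    stripped = sql.strip().rstrip(";").rstrip()
    if not stripped.lower().startswith("select"):
        return sql  # leave non-SELECT statements untouched
    if LIMIT_RE.search(stripped):
        return sql  # respect a limit the user already wrote
    return f"{stripped} LIMIT {default_limit}"


limited = apply_default_limit("SELECT * FROM events;")
unchanged = apply_default_limit("select * from events limit 5")
```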

diwu1989 avatar Jun 30 '20 05:06 diwu1989

> Can we set up a config for enforcing limit clause automatically?

Yes.. just need to find a way to do it in a "scalable" way for all the data sources (not all of them have to support it though).

arikfr avatar Jul 01 '20 08:07 arikfr

> > Can we set up a config for enforcing limit clause automatically?
>
> Yes.. just need to find a way to do it in a "scalable" way for all the data sources (not all of them have to support it though).

@arikfr

Can we make the result payload ‘paginated’ and have a default page size of 1000 rows?

Major ‘big data’ query engines seem to have such a knob to control this; can we borrow the idea here? The API may look something like the following:

GET queries/{queryID}?page=N&pageSize=1000

the above api will make the backend execute the corresponding SQL statement on top of the cache
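A small sketch of what serving one page from the cached result could involve. It slices a deserialised list rather than running SQL on the cache, and it assumes 1-indexed pages and a JSON payload with `columns` and `rows` keys. `get_page` and the response fields are hypothetical:

```python
import json


def get_page(cached_json, page, page_size=1000):
    """Return one 1-indexed page of rows from a cached, JSON-serialised result."""
    data = json.loads(cached_json)  # parses the whole payload on every request
    rows = data["rows"]
    start = (page - 1) * page_size
    total_pages = (len(rows) + page_size - 1) // page_size  # ceiling division
    return {
        "columns": data["columns"],
        "rows": rows[start:start + page_size],
        "page": page,
        "page_size": page_size,
        "total_pages": total_pages,
    }


cached = json.dumps({"columns": ["id"], "rows": [[i] for i in range(2500)]})
page3 = get_page(cached, page=3)
```

The browser then receives at most `page_size` rows per request, although the server still pays the cost of parsing the full cached payload each time.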

Isn’t this a ‘scalable’ solution? If it is, I’d be happy to see how I could help (I am simply a user running into this situation now).

syang61-dev avatar Jun 04 '21 00:06 syang61-dev

I agree pagination makes sense. But we store cached results as serialised JSON today. So even if we fetch 1000 records at a time, each request would deserialise the entire result before plucking some rows and returning them.

This is fine for result sets <50k rows. But if a user runs a query with 1m rows the serialisation overhead would balloon 🤔

susodapop avatar Jul 23 '21 16:07 susodapop

> I agree pagination makes sense. But we store cached results as serialised JSON today. So even if we fetch 1000 records at a time, each request would deserialise the entire result before plucking some rows and returning them.
>
> This is fine for result sets <50k rows. But if a user runs a query with 1m rows the serialisation overhead would balloon 🤔

For results with 1m rows, maybe we could have the result (BI result cache) chunked and store those chunks. If the community is serious about working out a solution, please let me know and I'd like to see how I could help.
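A sketch of the chunking idea, assuming rows are split into fixed-size chunks that are serialised independently, so that reading one page only deserialises one chunk. `store_chunks`, `load_chunk`, and the in-memory list standing in for the result cache are hypothetical:

```python
import json


def store_chunks(rows, chunk_size=1000):
    """Serialise rows into independent JSON chunks of at most `chunk_size` rows each."""
    return [
        json.dumps(rows[i:i + chunk_size])
        for i in range(0, len(rows), chunk_size)
    ]


def load_chunk(chunks, index):
    """Deserialise only the requested chunk, leaving the others untouched."""
    return json.loads(chunks[index])


rows = [{"id": i} for i in range(2500)]
chunks = store_chunks(rows)    # stand-in for rows in a chunk table
third = load_chunk(chunks, 2)  # parses 500 rows, not 2500
```

With this layout the deserialisation cost per request depends on the chunk size rather than on the total result size.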

syang61-dev avatar Jul 23 '21 17:07 syang61-dev

We're not going to work on this until at least after the V10 release later this summer. Later this year we'll introduce some processes for improving work planning with the OSS community as we don't want to see this work stagnate. I'll ping this issue once that channel is available.

susodapop avatar Jul 23 '21 18:07 susodapop

Hi there, is there any update on this issue? I still hit this issue on v10.1. I just expect my users to be able to run a query and download the big result set, and there is no need to cache it in PostgreSQL.

williswyf avatar Jul 04 '22 08:07 williswyf

No update to share at this time. But we have not forgotten about this use case.

> and no need to cache it in PostgreSQL.

The results are always cached, because running the query and downloading the results are distinct tasks. Postgres is where Redash saves the state (the query result) between these tasks. We can't skip the cache without a significant redesign.

susodapop avatar Jul 04 '22 11:07 susodapop