Influx getting gradually slower at every consecutive query

Open rbdm-qnt opened this issue 5 years ago • 4 comments

If I query the SAME DAY of data over and over again, my query times look something like this: 1m:18s, 1m:28s, 1m:50s, 2m:11s

and so on. This is an old date, so no new data gets added in that time period. What could this be? Should I stop and restart InfluxDB after every query to minimize this?

Influx 1.7.9, Influx python client 5.2.3, Python 3.7, MacOs 10.12.6

rbdm-qnt avatar Apr 02 '20 12:04 rbdm-qnt

@rbdm-qnt thanks for this. There are many things that could impact this, but to answer your specific question, no, you should not stop and restart influxdb after each query.

If you provide some additional information about the query, the database, the environment, or anything you can think of that might be impacting performance, we can start narrowing it down.
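One way to collect that kind of information is to time the identical query several times in a row and post the durations together with the query text. The following is only a minimal sketch: `time_repeated` is a hypothetical helper, and the database name, measurement name, and dates in the commented usage are placeholders, not details taken from this issue.

```python
import time
from typing import Callable, List


def time_repeated(run_query: Callable[[], object], repeats: int = 5) -> List[float]:
    """Run the same query `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # the result is discarded; only the elapsed time is kept
        durations.append(time.perf_counter() - start)
    return durations


# Hypothetical usage with influxdb-python (all names below are placeholders):
# from influxdb import InfluxDBClient
# client = InfluxDBClient(host="localhost", port=8086, database="mydb")
# q = ("SELECT * FROM trades "
#      "WHERE time >= '2019-01-02T00:00:00Z' AND time < '2019-01-03T00:00:00Z'")
# print(time_repeated(lambda: client.query(q)))
```

If the durations keep growing while the client process stays alive, comparing them against a fresh Python process running the same query may help separate client-side effects (for example, results accumulating in memory) from server-side ones.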

russorat avatar Apr 13 '20 21:04 russorat

Hi @russorat, sorry for torturing you with so many questions about Influx's speed recently, I appreciate your patience. The application is the same as in all my other issues: financial data, where I have to query entire rows, 7 fields per row. The environment is: Influx 1.7.9, Influx python client 5.2.3, Python 3.7, MacOs 10.12.6, 16GB RAM.

In the last week I've switched to an Apache Parquet database and Dask. I have to say it's faster, but not by a huge margin; it does, however, handle RAM far more efficiently. I've realized in these last few weeks that no file system is really designed to do what I do, which is query entire rows. They are all optimized to query columns and aggregate data inside the query itself, and none of them is optimized for my brute-force style of query. Unfortunately, I can't do otherwise. I'm also not on particularly powerful hardware to begin with. I'd probably need a cluster to do what I'm doing faster.

rbdm-qnt avatar Apr 15 '20 01:04 rbdm-qnt

@rbdm-qnt not a problem! Sounds like processing this in the cloud might be your best option. Pretty easy to horizontally spin up a large number of executors to process your data if you need it in a specific timeframe.
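As a rough illustration of the idea, the work can be split into independent per-day chunks and mapped over a pool of workers. This is a minimal local sketch using the standard library's thread pool as a stand-in for remote executors; `daily_chunks`, `process_day`, and `run_parallel` are hypothetical names, and `process_day` is a placeholder for whatever per-day query and computation is actually needed.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def daily_chunks(start: date, end: date):
    """Return one (day_start, day_end) pair for each day in [start, end)."""
    days = (end - start).days
    return [
        (start + timedelta(days=i), start + timedelta(days=i + 1))
        for i in range(days)
    ]


def process_day(chunk):
    """Placeholder worker: a real one would query and process one day of data."""
    day_start, _day_end = chunk
    return day_start.isoformat()


def run_parallel(start: date, end: date, worker=process_day, max_workers: int = 4):
    """Apply `worker` to every daily chunk concurrently, keeping chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, daily_chunks(start, end)))


if __name__ == "__main__":
    print(run_parallel(date(2020, 1, 1), date(2020, 1, 4)))
```

Because each day is processed independently, the same chunking should carry over to a distributed scheduler (Dask, mentioned above, is one option) by swapping the local pool for a cluster-backed executor.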

russorat avatar Apr 15 '20 16:04 russorat

How can I do this? I figured I'd need some sort of cluster or at least a dedicated VPS, right? Any links would be really useful.

rbdm-qnt avatar Apr 15 '20 17:04 rbdm-qnt