influxdb-python
Only 1k rows/second query speed
I'm running an InfluxDB instance to store financial tick data. Every row has 7 fields. I then query this data and aggregate it in Python. These were my query times:
| Rows queried | Query time (h:mm:ss) |
| --- | --- |
| 9,480,222 | 2:07:46 |
| 12,839,124 | 3:06:02 |
| 17,256,737 | 4:19:54 |
| 13,716,707 | 3:28:37 |
| 12,671,435 | 2:35:27 |
| 11,112,483 | 2:15:53 |
| 17,055,181 | 3:34:21 |
| 21,232,810 | 6:29:42 |
| 16,935,780 | 4:47:56 |
Those numbers seem a bit off. The average is around 60-70k rows per minute, roughly 1k rows per second. Since this is my first experience with time-series databases and with InfluxDB, would you consider this performance normal? I'm running InfluxDB 1.7.9, Influx Python client 5.2.3, and Python 3.7 from PyCharm, on a MacBook Pro with 16 GB of RAM.
Any tip is welcome!
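As a rough sanity check, the rate implied by the first row works out like this:

```python
# Throughput implied by the first measurement: 9,480,222 rows in 2:07:46.
rows = 9_480_222
elapsed_s = 2 * 3600 + 7 * 60 + 46   # 7,666 seconds

print(rows / elapsed_s)        # ~1,237 rows per second
print(rows / elapsed_s * 60)   # ~74,000 rows per minute
```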
@rbdm-qnt thanks! Am I understanding correctly that you are returning between 10 and 20 million rows of data with each query? Is the performance from querying Influx, or from the Python code processing the rows?
Yes, that is correct. The performance is from querying Influx! Processing time is around 30 to 60 minutes per query. The total query time for about 1 billion rows (a 28 GB database) will be around 8 days; it should finish tomorrow.
@rbdm-qnt thanks! If it's related to InfluxDB itself, it probably makes sense to open it in the https://github.com/influxdata/influxdb repo. But there are tons of things that impact db performance, so it's tough to know where to start. Make sure you have followed all the guidance in the concepts section of the docs: https://docs.influxdata.com/influxdb/v1.7/concepts/
Upon closer inspection in Activity Monitor, it looks like Influx is only using 500 MB of my RAM, while the Python program running the Influx client is consuming 13 GB, of which up to 11 GB gets allocated automatically to swap. I think this could be part of the issue, though I'm not sure it's the only reason the query is so slow. Also, I'm subdividing my 28 GB dataset into 51 chunks, so each query should cover around 600 MB; I make sure to empty the query result list after every chunk. Why would it use 13 GB?
I've already posted this issue in several GitHub repos and forums, including the official InfluxDB repo and their website, but received no answers. I'll also read that document, thank you for the link.
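For what it's worth, influxdb-python can also stream query results in server-side chunks instead of buffering the whole response, which should keep the client's memory footprint bounded. A minimal sketch, with a hypothetical `trades` measurement standing in for the real schema:

```python
from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='ticks')

# chunked=True makes query() return a generator of ResultSets, so only
# about chunk_size rows are buffered at a time instead of the entire
# multi-million-row response.
results = client.query(
    'SELECT * FROM "trades" WHERE time >= now() - 1d',
    chunked=True,
    chunk_size=10000,
)

for chunk in results:
    for point in chunk.get_points():
        pass  # hand each row to the aggregation code here
```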
Are you just dumping the raw data that landed in Influx back out to Python for aggregation? If so, why don't you push the aggregation into the initial query itself...that's what the query engine is for.
maybe asked another way, what is the initial query you are running to extract the data?
If it's something like `SELECT * FROM ...` with a time range, this is not going to provide the fast results you are looking for.
there are a ton of mathematical functions for working with tick data in the query language itself and allowing the query engine to operate on the compressed data to return a result is going to provide the best performance.
https://docs.influxdata.com/influxdb/v1.7/query_language/functions/#technical-analysis
If you can make the query engine do the bulk of the work, you can still do additional post-processing in Python, but the best practice is to push as much of the heavy lifting as possible into the DB.
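To make that concrete, here is a sketch of the difference from Python (the `trades` measurement and `price` field are hypothetical):

```python
from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='ticks')

# Slow path: every raw row crosses the wire to Python.
raw = client.query('SELECT * FROM "trades" WHERE time >= now() - 1d')

# Fast path: the engine aggregates over the compressed data and returns
# only one summary row per minute; the technical-analysis functions
# linked above work the same way.
agg = client.query(
    'SELECT MEAN("price"), STDDEV("price") '
    'FROM "trades" WHERE time >= now() - 1d '
    'GROUP BY time(1m)'
)
```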
Thank you for the reply! So, I'm storing financial tick data, which is 7 columns per row, so yes, the only way I can query it is `SELECT *` over a time range, since I need all the columns. I've read the Influx docs, but unfortunately no function suits my needs. I need to process this data in different and more complex ways, so I need it raw; I can't work with averaged or already-aggregated data. I'm starting to think that a simple folder of parquet files could suit my needs better than any database, since I don't need most of those functionalities, I just need speed for big queries of raw data.
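For the record, a minimal sketch of that parquet idea, assuming pandas with pyarrow and synthetic data standing in for real ticks:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one chunk of tick data: 7 fields per row.
n = 1_000_000
ticks = pd.DataFrame(
    {f'field_{i}': np.random.rand(n) for i in range(7)},
    index=pd.date_range('2019-12-01', periods=n, freq='ms', name='time'),
)

# Columnar, compressed, one file per chunk.
ticks.to_parquet('ticks-2019-12-01.parquet', engine='pyarrow')

# Reading a chunk back is a single sequential scan of raw columns.
df = pd.read_parquet('ticks-2019-12-01.parquet')
```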
ok. I think you are right, this isn't a good use for the db.
But if you want to share a handful of those different and more complex ways, we would be interested in looking at whether these functions could exist natively (like the technical analysis functions). Our new query language, Flux, allows for a much wider variety of mathematical processing and additional functions, letting you operate on the data without "moving it" around. The idea is to provide as many of the primitive functions as possible and then continue to build up more sophisticated ones (again, like the technical analysis functions). You can see some of the newer functions we added to the 2.0 Beta and get a feel for the language here: https://v2.docs.influxdata.com/v2.0/reference/flux/stdlib/built-in/transformations/aggregates/kaufmansama/
let us know...
Thank you for the doc, I'm reading it! You have a great variety of functions. My main problem is that I need to apply many different functions to the same raw data, in the same processing step, and then save the processed data all together. So it's not that I can't do one type of processing with the Influx query language; it's that I need to do many different ones and then aggregate the processed values in certain ways. And since I already wrote all of this code and use it with live data, I don't want to rewrite it (even if that were possible) in the Influx query language; I want the database to return data in exactly the same format as the live data.
Would you share your email with me? [email protected] Maybe I have a solution for you.