
INTPYTHON-807: reading large amounts of data is rather slow (due to single-threaded BSON decoding)

Open sibbiii opened this issue 1 month ago • 1 comments

Hi,

Over the years, we have become quite happy with mongo-arrow and have also contributed some bug fixes. However, there is still one topic we would be happy to see solved, and that is speed, since mongo-arrow claims to be fast.

We use MongoDB to store large amounts of data and, as a consequence, often query large amounts of data as well. The MongoDB server handles the load pretty well, but when it comes to fetching the result with mongo-arrow in Python there is one big bottleneck, and it seems to be BSON decoding, specifically this line here.

Image

When I comment it out and just print the len() of the batch, the speed is what I expect. Obviously, the BSON has to be decoded, and this takes CPU time. Unfortunately, it seems that only one CPU core is used to decode BSON, which is quite frustrating given that modern libraries such as polars run their calculations on multiple cores. Sure, we found our way around this limitation, but the implementations are not clean.

Is there any chance to make the BSON decoding itself multi-core, or at least the batch processing, which looks like the perfect candidate for multi-core decoding?
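To make the batch-level idea concrete, here is a minimal sketch of decoding independent raw batches on several cores. It is not how mongo-arrow works internally. The names `decode_json_batch` and `decode_batches_parallel` are hypothetical, and the decoder is a JSON stand-in so the sketch runs without a MongoDB server. With real data, a BSON decoder such as `bson.decode_all` from PyMongo would presumably take its place.

```python
import json
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Any, Callable, Iterable, List, Optional


def decode_json_batch(raw: bytes) -> List[dict]:
    """Stand-in decoder for this sketch: one JSON document per line.

    Assumption: with real BSON batches, a decoder such as bson.decode_all
    from PyMongo would be passed in instead. That is not exercised here.
    """
    return [json.loads(line) for line in raw.splitlines() if line]


def decode_batches_parallel(
    raw_batches: Iterable[bytes],
    decode: Callable[[bytes], List[Any]] = decode_json_batch,
    executor: Optional[Executor] = None,
    max_workers: Optional[int] = None,
) -> List[Any]:
    """Decode independent raw batches concurrently and keep batch order."""
    own_executor = executor is None
    if own_executor:
        # Separate processes sidestep the GIL for CPU-bound decoding.
        # The decode callable must be picklable (a top-level function).
        executor = ProcessPoolExecutor(max_workers=max_workers)
    try:
        documents: List[Any] = []
        # Executor.map yields results in input order, so the order of
        # documents across batches is preserved.
        for decoded in executor.map(decode, raw_batches):
            documents.extend(decoded)
        return documents
    finally:
        if own_executor:
            executor.shutdown()
```

The catch with a process pool is that decoded documents have to be pickled back to the parent process, which can eat much of the gain. That is one reason our own workarounds are not clean, and why doing this inside the library would be preferable.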

PS: I've also spent quite some time googling for a solution, but other than frustrated users who say MongoDB is slow (which is not true in my opinion), I have not found any. To give you some numbers: I can easily fetch a large dataset in a few seconds on a super fast connection to the server, while it takes a minute to decode. I fully understand the technology difference between an object store and a column-based relational database, and I accept somewhat lower performance on bulk reads in exchange for gains elsewhere, but there is no sound reason for it to be this slow.

Thanks, Sebastian

sibbiii, Oct 24 '25 07:10

Great find @sibbiii

On our end, we've opened a ticket that you can track: INTPYTHON-807. Check back on this ticket to follow the progress of this work.

Jibola, Oct 24 '25 19:10