dynamodb-parallel-scan
This library vs paginateScan
I seem to be having an issue with paginateScan in the v3 SDK being slow to retrieve 5,000 items from a 16 MB table. It takes about 7 seconds on cold boot, with TotalSegments set to the table size in MB and a pageSize of 200 (I still don't properly understand the role of pageSize).
Do you think this library will perform better?
Yes, paginateScan is sequential; this library works by making requests in parallel.
You can check the aws-sdk source code to see that paginateScan makes one request at a time.
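For comparison, parallel usage with this library looks roughly like its README shows (the table name and concurrency value here are placeholders):

const {parallelScan} = require('@shelf/dynamodb-parallel-scan');

async function fetchAll() {
  // The library splits the table into segments and scans them concurrently,
  // each worker following its own LastEvaluatedKey until its segment is done
  return parallelScan({TableName: 'my-table'}, {concurrency: 250});
}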

I did implement a parallel version of it by running multiple scans in a Promise.all, but I suspect my code isn't the most efficient.
Will definitely try this library out soon
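For reference, a manual Promise.all version along those lines can use the Scan API's native Segment/TotalSegments parameters; a minimal sketch (table name and segment count are assumptions) follows. Note that pageSize in the pagination config simply becomes the Limit on each underlying Scan request, which is why a small value adds round trips:

const {DynamoDBClient, paginateScan} = require('@aws-sdk/client-dynamodb');

const client = new DynamoDBClient({});
const TOTAL_SEGMENTS = 16; // heuristic from above: ~1 segment per MB

// Drains one segment, following LastEvaluatedKey page by page
async function scanSegment(segment) {
  const items = [];
  const paginator = paginateScan(
    {client},
    {TableName: 'my-table', Segment: segment, TotalSegments: TOTAL_SEGMENTS}
  );
  for await (const page of paginator) {
    items.push(...(page.Items ?? []));
  }
  return items;
}

async function scanAllSegments() {
  // All segments run concurrently; DynamoDB partitions the key space for us
  const perSegment = await Promise.all(
    Array.from({length: TOTAL_SEGMENTS}, (_, i) => scanSegment(i))
  );
  return perSegment.flat();
}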
So I have replaced my scan with this library, and I'm still getting seemingly slow speeds.
At ~9,500 records now, with a table size of 39 MB, it takes 10 s to retrieve on cold boot. Does that seem right? Looks awfully slow to me :/
I'm trying the stream version, but I'm not fully understanding how concurrency and chunkSize work, and I'm struggling to figure out what the optimal values should be. After a few tries with different values, the quickest I'm seeing is 7 s on cold boot, which still seems slow.
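For anyone else tuning this, the stream variant looks roughly like the following (values are illustrative; as I understand the options, concurrency is how many segments are scanned in parallel, chunkSize is how many items are buffered before a chunk is emitted, and highWaterMark controls backpressure):

const {parallelScanAsStream} = require('@shelf/dynamodb-parallel-scan');

async function run() {
  const stream = await parallelScanAsStream(
    {TableName: 'my-table'},
    {concurrency: 250, chunkSize: 10000, highWaterMark: 10000}
  );

  for await (const items of stream) {
    console.log(items.length); // up to `chunkSize` items per iteration
  }
}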
https://medium.com/shelf-io-engineering/how-to-scan-a-23-gb-dynamodb-table-in-1-minute-110730879e2b

Might this be a problem? Check your hash key uniqueness.
Our hash keys (partition keys) are all UUIDs, so each one is unique.
@vladgolubev I noticed in your article you show this code:
for await (const items of stream) {
  console.log(items); // 10k items here
  // do some async processing to offload data from memory
  // and move on to consuming the next chunk
  // scanning will be paused for that time
}
Could you explain more about the comments here about offloading the data? I'm wondering if this could help me.
Scanning can be slow if you take one page of data (~1 s) and then spend 10 s processing it before requesting the next page. It's faster to scan all the data at once, collect it into one array, and then spend time on the more costly processing.
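In code, the difference between the two patterns looks something like this (`stream` is the iterable returned by parallelScanAsStream above; processSlowly is a hypothetical stand-in for your own logic):

// Pattern 1: process inside the loop — backpressure pauses scanning
// while each chunk is handled, so slow processing stalls the scan
for await (const items of stream) {
  await processSlowly(items);
}

// Pattern 2: drain the scan first, then process everything once
const all = [];
for await (const items of stream) {
  all.push(...items);
}
await processSlowly(all);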
Hmm, I'm not doing much in terms of processing the data, other than sorting and splicing the array of all items.
Do you run the code in Lambda or locally? If Lambda, how much RAM does it have?
I run it in Lambda with 1024 MB of memory, but I have been experimenting with 2048 MB as well.
Is your table using ON_DEMAND or PROVISIONED capacity mode?
On Demand
At this point, I think AWS Support will be better placed to assist. I ran out of ideas ¯\_(ツ)_/¯
No worries, thanks for your help.
Ultimately, I think DynamoDB just isn't cut out for this kind of operation.
It's just a shame DAX isn't compatible with the v3 JS SDK, as I reckon that may have helped?
Hey 👋, not related to this topic, but how can we pass credentials to the parallelScanAsStream function? Any examples?
Thanks :)
@bilegjargal-jargalsaikhan try the new version, v3.5.3.
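For anyone landing here later: even without a library-specific option, the underlying v3 DynamoDB client resolves credentials from the standard provider chain, so setting AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN in the environment works. If v3.5.3 does accept credentials directly, a sketch might look like the following — this assumes a `credentials` option in the standard AWS SDK v3 shape, so check the README for the exact signature:

const {parallelScanAsStream} = require('@shelf/dynamodb-parallel-scan');

async function scanWithCredentials() {
  // Hypothetical: assumes v3.5.3+ forwards a `credentials` object
  // to the underlying DynamoDB client
  return parallelScanAsStream(
    {TableName: 'my-table'},
    {
      concurrency: 250,
      credentials: {
        accessKeyId: process.env.AWS_ACCESS_KEY_ID,
        secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
        sessionToken: process.env.AWS_SESSION_TOKEN,
      },
    }
  );
}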