dynamodb-parallel-scan

This library vs paginateScan

Open paul-uz opened this issue 2 years ago • 16 comments
I seem to be having an issue with paginateScan in the v3 SDK being slow to retrieve 5000 items from a 16 MB table. It takes about 7 seconds on cold boot, with the total segments set to the table size in MB and a pageSize of 200 (I still don't properly understand the role of pageSize).

Do you think this library will perform better?

paul-uz avatar Feb 22 '23 23:02 paul-uz

Yes, paginateScan is sequential. This library works by making requests in parallel.

You can check the source code of aws-sdk to see that paginateScan is making 1 request at a time.
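
To illustrate the difference, here is a minimal sketch of a segmented scan: each segment is paginated sequentially, the way a paginator would do it, but all segments run concurrently through `Promise.all`. The `scanSegments` name and the injected `send` function are hypothetical and are not part of this library or the AWS SDK; `Segment`, `TotalSegments`, `ExclusiveStartKey`, and `LastEvaluatedKey` are the standard DynamoDB Scan fields.

```javascript
// Hypothetical helper: scans `totalSegments` segments concurrently.
// `send(params)` is assumed to perform one Scan request and resolve to
// {Items, LastEvaluatedKey}. With the v3 SDK it could be wired up as:
//   params => client.send(new ScanCommand(params))
async function scanSegments(send, baseParams, totalSegments) {
  // Paginate one segment sequentially until no LastEvaluatedKey comes back.
  async function scanOneSegment(segment) {
    const items = [];
    let ExclusiveStartKey;
    do {
      const page = await send({
        ...baseParams,
        Segment: segment,
        TotalSegments: totalSegments,
        ExclusiveStartKey,
      });
      for (const item of page.Items ?? []) items.push(item);
      ExclusiveStartKey = page.LastEvaluatedKey;
    } while (ExclusiveStartKey);
    return items;
  }

  // Start every segment at once so the requests overlap instead of queueing.
  const perSegment = await Promise.all(
    Array.from({length: totalSegments}, (_, segment) => scanOneSegment(segment))
  );
  return perSegment.flat();
}
```

With a single segment this degenerates to the sequential behaviour; any speed-up comes only from the overlap between segments.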

vladholubiev avatar Feb 23 '23 12:02 vladholubiev

I did implement a parallel version of it, by running multiple scans in a Promise.all, but I suspect my code isn't the most efficient.

Will definitely try this library out soon

paul-uz avatar Feb 23 '23 12:02 paul-uz

So I have replaced my scan with this library, and I'm still getting seemingly slow speeds.

At ~9500 records now, with a table size of 39 MB, it takes 10s to retrieve on cold boot. Does that seem right? Looks awfully slow to me :/

I'm trying the stream version, but I don't fully understand how concurrency and chunkSize work, and I'm struggling to figure out what the optimal values should be. After trying a few different values, the quickest I'm seeing is 7s on cold boot, which still seems slow.

paul-uz avatar Feb 23 '23 14:02 paul-uz

https://medium.com/shelf-io-engineering/how-to-scan-a-23-gb-dynamodb-table-in-1-minute-110730879e2b

Could this be the problem? Check your hash key uniqueness.

vladholubiev avatar Feb 23 '23 17:02 vladholubiev

Our hash keys (partition keys) are all UUIDs, so each one is unique.

paul-uz avatar Feb 23 '23 17:02 paul-uz

@vladgolubev I noticed that in your article you show this code:

for await (const items of stream) {
  console.log(items); // 10k items here
  // do some async processing to offload data from memory
  // and move on to consuming the next chunk
  // scanning will be paused for that time
}

Could you explain more about the comments here about offloading the data? I'm wondering if this could help me.

paul-uz avatar Feb 24 '23 14:02 paul-uz

Scanning might be slow if you take 1 page of data, which takes ~1s, and then spend 10s processing the scanned data before asking for the next page. It's faster to scan all the data at once, collect it into 1 array, and only then spend time on the more costly data processing.
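
A minimal sketch of the two consumption patterns described here, assuming `stream` is any async iterable that yields arrays of items, as in the `for await` loop quoted earlier in the thread. `collectAll` and `processPerChunk` are hypothetical names used only for illustration.

```javascript
// Pattern A: drain the stream first, then do the expensive work once.
// The consumer asks for the next chunk immediately, so the scan is never
// kept waiting, but every item is held in memory at the same time.
async function collectAll(stream) {
  const all = [];
  for await (const items of stream) {
    for (const item of items) all.push(item);
  }
  return all;
}

// Pattern B: process each chunk before requesting the next one.
// Memory stays bounded, but the next chunk is not requested until
// `handleChunk` has resolved.
async function processPerChunk(stream, handleChunk) {
  let processed = 0;
  for await (const items of stream) {
    await handleChunk(items);
    processed += items.length;
  }
  return processed;
}
```

For a few thousand items that only need sorting and slicing, Pattern A is likely the simpler choice; Pattern B matters mainly when the whole result set does not fit comfortably in memory.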

vladholubiev avatar Feb 27 '23 11:02 vladholubiev

Hmm, I'm not doing much in terms of processing the data, other than sorting and splicing the array of all items.

paul-uz avatar Feb 27 '23 13:02 paul-uz

Do you run code in lambda or locally? If lambda, how much RAM does it have?

vladholubiev avatar Feb 27 '23 13:02 vladholubiev

I run it in Lambda, with 1024 MB of memory, but have been experimenting with 2048 MB as well.

paul-uz avatar Feb 27 '23 13:02 paul-uz

Is your table using ON_DEMAND or PROVISIONED capacity mode?

vladholubiev avatar Feb 28 '23 11:02 vladholubiev

On Demand

paul-uz avatar Feb 28 '23 12:02 paul-uz

At this point, I think AWS Support will be better placed to assist. I've run out of ideas ¯_(ツ)_/¯

vladholubiev avatar Feb 28 '23 12:02 vladholubiev

No worries, thanks for your help.

I think ultimately, DynamoDB clearly isn't cut out for this kind of operation.

It's just a shame DAX isn't compatible with the v3 JS SDK, as I reckon that may have helped?

paul-uz avatar Feb 28 '23 12:02 paul-uz

Hey 👋, not related to this topic, but how can we pass credentials to the parallelScanAsStream function? Any examples? Thanks :)

@bilegjargal-jargalsaikhan try the new version, v3.5.3

harazdovskiy avatar Jul 12 '23 19:07 harazdovskiy