dynamodb-geo.js

Q: performance issue

Open mattmeye opened this issue 8 years ago • 17 comments

Hi rh389, thanks for your work!

I tried your example, increased the search radius to 500 km, and had to wait 16 s. Read capacity was 500. I'm trying to figure out how fast DynamoDB can be; I think I'm doing something wrong.

Is the geo search faster in smaller areas (with more items)?

My key scenario is a 50 km radius search over a 350,000 km² area with 100k points of interest. Do you have any recommendations?

mattmeye avatar May 31 '17 07:05 mattmeye

I tried this out with queries greater than 100 km, and the number of queries increases exponentially. I think there is a bug in the way the query works. If you log the number of promises generated, you will see hundreds of queries for a single request, which is hopelessly inefficient.

Also, this library doesn't appear to use BatchGetItem (unless I am mistaken) for all the queries, which causes more network overhead.

Great library, but I hope these issues can be fixed! :)

bnolan avatar Jun 01 '17 00:06 bnolan

Thanks for the reports - that's definitely far more queries than it should be generating, and very likely the cause of the poor performance. We rely on nodes2ts to generate an S2 covering, plus some internal logic to split queries on partition keys; the issue may be there. I'll have a look into it today.

robhogan avatar Jun 01 '17 08:06 robhogan

Having looked into this I think the issue is just the length of the partition key.

In DynamoDB you can only query within a single partition key (of a table, LSI or GSI), so this lib takes the ~eight S2 cells which (more than) cover the region you're interested in, then converts each one into a set of queries according to how many different partition keys the cell spans.

It seems to me that the default of using the first 6 digits of a cell ID as the partition key is unnecessarily granular unless your dataset spans a very small area. For most cases I think you'd want 2-4 digits.

I'd suggest trying a hashKeyLength of 3, which works well for a 200km radius query. Ideally you want the hash to be as long as possible so that AWS can partition your data efficiently, but short enough that you're rarely having to query multiple partitions for a single cell.

const config = new GeoDataManagerConfiguration(ddb, 'myTable');
config.hashKeyLength = 3;
const manager = new GeoDataManager(config);

Unfortunately, that means recreating your data to truncate your partition keys. Obviously this needs to be documented better - apologies.
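To illustrate the trade-off, here's a simplified model (a sketch only - the library actually derives the hash from the numeric S2 cell ID, but prefix truncation shows the same effect): the number of distinct truncated prefixes across the covering cells is the number of separate Query calls you pay for.

```javascript
// Hypothetical helper: count the distinct partition keys (and hence
// separate DynamoDB Queries) implied by an S2 covering at a given
// hashKeyLength. Shorter prefixes collapse more cells into one query.
function countHashKeys(cellIds, hashKeyLength) {
  const prefixes = new Set();
  for (const id of cellIds) {
    prefixes.add(String(id).slice(0, hashKeyLength));
  }
  return prefixes.size;
}

// Example covering cell IDs (made up for illustration)
const cover = ['5221366123', '5221370455', '5222310001', '5222470999'];
console.log(countHashKeys(cover, 6)); // 4 partitions -> 4 queries
console.log(countHashKeys(cover, 3)); // 1 partition  -> 1 query
```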

robhogan avatar Jun 01 '17 11:06 robhogan

Also, this library doesn't appear to use BatchGetItem (unless I am mistaken) for all the queries, which causes more network overhead.

This lib uses Query for radius/box searches, and there's no way to batch-execute queries in a single request, AFAIK. BatchGetItem doesn't support filter expressions, so it doesn't help us. We do execute the queries in parallel with Promise.all; I think that's the best we can do.
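The parallel fan-out described here can be sketched as follows, where `queryHashKey` is a hypothetical stand-in for the library's per-partition Query call:

```javascript
// Sketch of the fan-out/merge: one Query per partition key, run
// concurrently, with the result pages merged into a single item list.
async function queryAllPartitions(hashKeys, queryHashKey) {
  const pages = await Promise.all(hashKeys.map((key) => queryHashKey(key)));
  return pages.flat(); // merge the items from every partition
}
```

Promise.all preserves input order, so results come back grouped by partition regardless of which query finishes first.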

Another performance/resource tip - make sure you're querying with consistent read disabled if you can.
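For reference, consistent reads are controlled by the standard ConsistentRead flag on DynamoDB's QueryInput, and an eventually consistent read costs half the read units of a strongly consistent one. A minimal fragment (table name is an assumption):

```javascript
// Standard DynamoDB QueryInput fields; only the ones relevant here shown.
const queryInput = {
  TableName: 'myGeoTable', // assumed table name
  ConsistentRead: false,   // eventually consistent: half the RCU cost
  // ...KeyConditionExpression, IndexName, etc.
};
```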

robhogan avatar Jun 02 '17 16:06 robhogan

Version 0.3.0 introduces a new default hashKeyLength of 2. By my reckoning that's large enough unless a large proportion of a >10 GB dataset is concentrated in one small area (like a city), or you need >3000 RCUs across a small area.

Some recommended maximum hashKeyLengths can be seen here. As you can see, the old default of 6 starts to fall apart for any radius query larger than about 10 km (the equivalent bounding-rectangle query would have exactly the same problem).

robhogan avatar Jun 03 '17 18:06 robhogan

Thanks for these recent updates! I just updated to the latest code and am going to test it. Do I need to rebuild/update my DB to use the default hashKeyLength of 2, or is it only used at query time? I'm using AWS Lambda and regularly run 150 km radius queries over a dataset of about 800 records, with mixed results on query time.

BenHavilland avatar Jun 04 '17 03:06 BenHavilland

Large radius queries seem to be returning much faster with this update. 👍

BenHavilland avatar Jun 04 '17 04:06 BenHavilland

You will need to rebuild your data - the hashKeyLength is also used to generate the hash key on insertion.

I've also changed the longitudeFirst default to true for GeoJSON compatibility, in a break from the Java version of this lib, so that's another reason to rebuild unless you're setting it explicitly.
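For anyone rebuilding, both settings can be pinned explicitly on the config object. A sketch, assuming the dynamodb-geo package and an existing DynamoDB client `ddb`:

```javascript
const ddbGeo = require('dynamodb-geo');

const config = new ddbGeo.GeoDataManagerConfiguration(ddb, 'myTable');
config.hashKeyLength = 2;     // the 0.3.0 default, pinned explicitly
config.longitudeFirst = true; // GeoJSON coordinate order: [longitude, latitude]
const manager = new ddbGeo.GeoDataManager(config);
```

Pinning them means a future change of defaults can't silently make your stored hash keys unreadable.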

robhogan avatar Jun 04 '17 08:06 robhogan

I've just learnt that a small hashKeyLength may have RCU implications - see the new docs.

robhogan avatar Jun 04 '17 17:06 robhogan

I've done some testing with a larger dataset in a similar sort of scenario (hundreds to thousands of objects over a 0-10 km range, with 5 km queries taking approx. 52 s). Depending on the number of objects returned, the filterByRadius method can actually take longer to compute than the DynamoDB query. I'm working with custom objects and don't have point data readily available, so I convert from S2CellIds to LatLng points and then run the same distance check. I'm not sure about the overhead of that conversion, but could the distance check itself have significant overhead?

malz avatar Jun 05 '17 06:06 malz

That's surprising - calculating distances between lat/lon pairs is computationally cheap (with Haversine). It may be that we're converting between coordinate systems too often; there's definitely scope for optimising that if it's a drain. I won't have a chance this week, though, unfortunately.
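For reference, the Haversine distance mentioned here is only a few trig calls per pair. A self-contained sketch (not the library's actual implementation):

```javascript
// Haversine great-circle distance in metres on a spherical Earth.
const EARTH_RADIUS_M = 6371008.8; // mean Earth radius

function haversineMetres(lat1, lon1, lat2, lon2) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
}

// London to Paris - roughly 343-344 km
console.log(haversineMetres(51.5074, -0.1278, 48.8566, 2.3522));
```

At this cost per point, post-filtering hundreds of results should be milliseconds, which supports the suspicion that the overhead is in the coordinate conversions rather than the distance maths.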

robhogan avatar Jun 05 '17 18:06 robhogan

When I re-created my dataset after pulling in the most recent changes, queries were much faster, although they started breaking (not returning results for some items) as I approached a larger dataset.

I ended up going back to hashKeyLength 6.

I also moved my radius query call from Node 4 to Node 6, which gave a noticeable performance increase and reduced memory consumption.

BenHavilland avatar Jun 05 '17 20:06 BenHavilland

Just something to enrich this discussion and help you save some money while also boosting performance. Remember two things:

  1. Disable consistent reads if you can. This can significantly decrease your costs in the scenarios we're discussing.

  2. Don't forget how DynamoDB paginates your data. Optimising your indexes by projecting only the NECESSARY fields can greatly reduce the number of paginated requests your queries generate, because each request will carry much more data.
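As an illustration of point 2, a slimmed-down GSI projection might look like this. The attribute and index names follow this library's conventions but are assumptions here, not copied from its code:

```javascript
// Project only the attributes the radius query actually needs; fewer bytes
// per item means more items per 1 MB response page and fewer round trips.
const geoIndex = {
  IndexName: 'geohash-index',
  KeySchema: [
    { AttributeName: 'hashKey', KeyType: 'HASH' },
    { AttributeName: 'geohash', KeyType: 'RANGE' },
  ],
  Projection: {
    ProjectionType: 'INCLUDE',
    NonKeyAttributes: ['geoJson', 'name'], // only what the query needs
  },
};
```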

I did a small example so you can see the impact (CloudWatch screenshot, 2018-09-04):

1st spike: original index with consistent reads (~200 read units)
2nd spike: optimised index, no consistent reads (~3.5 read units)
3rd spike: unoptimised index, no consistent reads (~50 read units)

I hope it helps.

igorescobar avatar Sep 04 '18 17:09 igorescobar

In my case, I was using the library from a Lambda function with the default 128 MB of memory allocated; increasing the Lambda's memory allocation drastically improved performance.

robertolosanno-e2x avatar Mar 20 '20 13:03 robertolosanno-e2x

In my case, I was using the library from a Lambda function with the default 128 MB of memory allocated; increasing the Lambda's memory allocation drastically improved performance.

@robertolosanno-e2x what allocated memory size did you settle on?

S1rFrancis avatar Sep 18 '20 09:09 S1rFrancis

I'm sorry, I can't remember, and I don't have the project to hand at the moment.

robertolosanno-e2x avatar Sep 18 '20 14:09 robertolosanno-e2x

No problem - for now 512 MB is giving me decent performance.

S1rFrancis avatar Sep 22 '20 11:09 S1rFrancis