go-cloud icon indicating copy to clipboard operation
go-cloud copied to clipboard

docstore/awsdynamodb panics on iter.Next()

Open thisisaaronland opened this issue 11 months ago • 16 comments

Describe the bug

Calling iter.Next on a docstore/awsdynamodb derived Query instance panics with an "index out of range" error along the lines of:

                panic: runtime error: index out of range [0] with length 0

                goroutine 98 [running]:
                gocloud.dev/docstore/awsdynamodb.(*documentIterator).Next(0x14000056050, {0x101ba36b0?, 0x10258bb40?}, {{0x101b386e0, 0x1400019bea0}, 0x0, {0x101b454e0, 0x1400019bea0, 0x199},\
 {0x140004fc808, ...}})
                        /usr/local/sfomuseum/go-activitypub/vendor/gocloud.dev/docstore/awsdynamodb/query.go:492 +0x260

The Query itself has two "Where" statements bracketing a date range:

https://github.com/sfomuseum/go-activitypub/blob/main/accounts_database_docstore.go#L45-L69

Curiously, this problem only seems to affect one table (accounts). The same code (essentially) run against a different table (deliveries) does not panic:

https://github.com/sfomuseum/go-activitypub/blob/main/deliveries_database_docstore.go#L46-L58

To Reproduce

The code in question is here:

https://github.com/sfomuseum/go-activitypub/blob/main/cmd/counts-for-date/main.go

Specifically:

https://github.com/sfomuseum/go-activitypub/blob/main/stats/counts.go#L44

Which is run like this:

go run cmd/counts-for-date/main.go \
   -accounts-database-uri 'awsdynamodb://accounts?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -blocks-database-uri 'awsdynamodb://blocks?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -boosts-database-uri 'awsdynamodb://boosts?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -deliveries-database-uri 'awsdynamodb://deliveries?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -followers-database-uri 'awsdynamodb://followers?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -following-database-uri 'awsdynamodb://following?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -likes-database-uri 'awsdynamodb://likes?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -messages-database-uri 'awsdynamodb://messages?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -notes-database-uri 'awsdynamodb://notes?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -posts-database-uri 'awsdynamodb://ap_posts?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -date {YYYY-MM-DD} \
   -verbose

If you need/want to set up all those tables, the code to do so is here:

https://github.com/sfomuseum/go-activitypub/blob/main/cmd/create-dynamodb-tables/main.go

And:

https://github.com/sfomuseum/go-activitypub/tree/main/app/dynamodb/tables/create

For example:

go run -mod vendor cmd/create-dynamodb-tables/main.go \
	-dynamodb-client-uri 'awsdynamodb://?region={REGION}&credentials={CREDENTIALS}'

Note that all of these URIs are using a custom awsdynamodb:// scheme. Under the hood all of this code is using the aaronland/go-aws-dynamodb and aaronland/go-aws-session packages in to support a wider range of declarative credentialing options than those offered by the default AWS libraries:

  • https://github.com/aaronland/go-aws-dynamodb
  • https://github.com/aaronland/go-aws-session
  • https://github.com/aaronland/go-aws-session?tab=readme-ov-file#credentials

There are also Makefile targets for running everything locally against a containerizer DynamoDB instance:

  • https://github.com/sfomuseum/go-activitypub/tree/main?tab=readme-ov-file#example

Expected behavior

An iterator whose Next method triggers an io.EOF error.

Version

0.37.0

Additional context

Documentation (albeit incomplete) for the application's database design and specifics about the use of gocloud.dev/docstore are here:

  • https://github.com/sfomuseum/go-activitypub?tab=readme-ov-file#databases
  • https://github.com/sfomuseum/go-activitypub?tab=readme-ov-file#goclouddevdocstore

The really long think-y think-y version is here:

  • https://millsfield.sfomuseum.org/blog/2024/03/12/activitypub/

thisisaaronland avatar Mar 18 '24 18:03 thisisaaronland

This is hard to debug from here, can you use gohack and add some Printfs in the relevant code to see what's happening?

In here: https://github.com/google/go-cloud/blob/master/docstore/awsdynamodb/query.go#L472

Specifically it would be good to know what it.offset, it.limit, it.curr, it.last, it.count, it.limit, and len(it.items) is when this happens.

I think that this line: https://github.com/google/go-cloud/blob/master/docstore/awsdynamodb/query.go#L480 is supposed to handle this, but it's not.

It's possible that this is a regression in https://github.com/google/go-cloud/pull/3385.

@bartventer FYI

vangent avatar Mar 21 '24 00:03 vangent

I've been digging into this issue and although I was unable to reproduce the panic, I added a preventive check for awsdynamodb here. This check returns io.EOF if there are no more items, which should prevent an index out of range panic when Next is called after it has already returned EOF once.

I also added a test case to verify this behavior. The test repeatedly calls Next until it gets io.EOF, then calls Next again to ensure it still returns io.EOF.

I've pushed these changes in a PR. Please have a look when you get a chance and let me know if there's anything that needs adjusting.

bartventer avatar Mar 21 '24 12:03 bartventer

Thanks for the fix. I can confirm that I am no longer seeing panics calling iter.Next – 

https://github.com/sfomuseum/go-activitypub/blob/iter-fix/go.mod#L5

Question: In the process of investigating the panics I often noticed that the docstore.Query iterators would return/report no results. For example if I call a simple shell script wrapper around the "counts" code like this:

> sh ./work/counts.sh 2024-03-18
{"time":"2024-03-21T16:16:56.081668-07:00","level":"DEBUG","msg":"Verbose logging enabled"}
{"time":"2024-03-21T16:16:57.021555-07:00","level":"DEBUG","msg":"Get counts for date","date":"2024-03-18","location":"America/Los_Angeles","start":1710720000,"end":1710806400}
{"date":"2024-03-18","location":"America/Los_Angeles","accounts":0,"blocks":0,"boosts":0,"deliveries":0,"followers":0,"following":0,"likes":0,"messages":0,"notes":0,"posts":0}

All the query iterators return nothing. But then if I call the same code immediate afterwards the queries return results:

> sh ./work/counts.sh 2024-03-18
{"time":"2024-03-21T16:17:03.270628-07:00","level":"DEBUG","msg":"Verbose logging enabled"}
{"time":"2024-03-21T16:17:04.100218-07:00","level":"DEBUG","msg":"Get counts for date","date":"2024-03-18","location":"America/Los_Angeles","start":1710720000,"end":1710806400}
{"date":"2024-03-18","location":"America/Los_Angeles","accounts":0,"blocks":0,"boosts":0,"deliveries":9045,"followers":1,"following":0,"likes":0,"messages":0,"notes":0,"posts":13}

The code in question is:

  • https://github.com/sfomuseum/go-activitypub/blob/main/stats/counts.go
  • https://github.com/sfomuseum/go-activitypub/blob/main/deliveries_database_docstore.go#L44

I am uncertain whether this is related to the iter.Next problem or something entirely different. If it's more helpful to create a separate issue I am happy to do that.

thisisaaronland avatar Mar 21 '24 23:03 thisisaaronland

Are you using Offset at all?

vangent avatar Mar 21 '24 23:03 vangent

and/or Limit?

vangent avatar Mar 21 '24 23:03 vangent

@bartventer I think there's a bug in Offset that may be causing both of these issues.

Example:

  • Let's say Offset is 10.
  • We do the initial it.qr.run in RunGetQuery, let's say it returns 5 items.
  • When Next is called, it will recurse 10 times to move past the offset.
  • On the next Next call, we'll do a new it.qr.run ("Make a new query request at the end of this page"), and then set curr = 0, but we may have just gotten items 6 through 8 (for example).

vangent avatar Mar 21 '24 23:03 vangent

I am not currently using either Offet or Limit.

thisisaaronland avatar Mar 21 '24 23:03 thisisaaronland

Hrm that damages my theory, but I still think there's a bug in those.

@thisisaaronland it would be really helpful if you could provide some debugging info as requested above, as it seems like you can reproduce this easily and we can't. Maybe even comment out the new lines that fixed the panic so we can understand what's going on there too.

vangent avatar Mar 22 '24 00:03 vangent

Sure thing. I may not be able to do that until tomorrow but I will get to it as soon as I can. Thanks for all the help so far.

thisisaaronland avatar Mar 22 '24 00:03 thisisaaronland

Per AWS, "A Query operation can return an empty result set and a LastEvaluatedKey if all the items read for the page of results are filtered out."

I think we're not handling that correctly, which causes the empty results sometimes, as well as the panic.

@bartventer can you take a look at https://github.com/google/go-cloud/pull/3408 ?

@thisisaaronland can you try it out and see if that looks better?

vangent avatar Mar 22 '24 00:03 vangent

I couldn't figure out what the replace directive for the changes in #3408 should be so I just brute-force cloned the query.go file in my vendor folder. With only limited testing it doesn't seem to have helped and may, in fact, have made things worse. Specifically:

  • Running the code (mentioned above) for the first time the Query iterator continues to return no results.
  • Running the code (mentioned above) a second time simply hung and never returned any results.

I will carve out some time in the morning to add the debugging hooks you suggested (above) to try and get more concrete details to work with.

If it's at all helpful what I have observed is that the Query iterator seems to return no results the first time it is invoked but then subsequently it starts to return results, even though the queries themselves are happening in entirely new processes. So maybe there is something happening at the AWS/Dynamo layer where the iterator returns prematurely on the first invocation and that, by extension, Dynamo queries (in AWS) exist in some kind of "window of time" between invocations ?

thisisaaronland avatar Mar 22 '24 01:03 thisisaaronland

Can you try again, I made some changes. If it doesn't work let me know, printfs...

On Thu, Mar 21, 2024, 6:16 PM Aaron Straup Cope @.***> wrote:

I couldn't figure out what the replace directive for the changes in #3408 https://github.com/google/go-cloud/pull/3408 should be so I just brute-force cloned the query.go file in my vendor folder. With only limited testing it doesn't seem to have helped and may, in fact, have made things worse. Specifically:

  • Running the code (mentioned above) for the first time the Query iterator continues to return no results.
  • Running the code (mentioned above) a second time simply hung and never returned any results.

I will carve out some time in the morning to add the debugging hooks you suggested (above) to try and get more concrete details to work with.

If it's at all helpful what I have observed is that the Query iterator seems to return no results the first time it is invoked but then subsequently it starts to return results, even though the queries themselves are happening in entirely new processes. So maybe there is something happening at the AWS/Dynamo layer where the iterator returns prematurely on the first invocation and that, by extension, Dynamo queries exist in some kind of "window of time" between invocations ?

— Reply to this email directly, view it on GitHub https://github.com/google/go-cloud/issues/3405#issuecomment-2014139958, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFMCKTSPUUQCJOERCSE74TLYZOA55AVCNFSM6AAAAABE4FMXTGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJUGEZTSOJVHA . You are receiving this because you modified the open/close state.Message ID: @.***>

vangent avatar Mar 22 '24 01:03 vangent

Per AWS, "A Query operation can return an empty result set and a LastEvaluatedKey if all the items read for the page of results are filtered out."

I think we're not handling that correctly, which causes the empty results sometimes, as well as the panic.

@bartventer can you take a look at #3408 ?

@thisisaaronland can you try it out and see if that looks better?

I've had a look at PR #3408 and I agree with the logic there. I've created a new PR (#3409) that builds on that work. @vangent, could you take a look when you get a chance?

bartventer avatar Mar 22 '24 09:03 bartventer

Just a quick note that I applied github.com/bartventer/go-cloud v0.0.0-20240322082227-9d139c92796d and I am not seeing time-outs but I am still seeing inconsistent behaviour:

https://github.com/sfomuseum/go-activitypub/blob/iter-fix/go.mod#L7

Specifically: The first time the query is run it will sometimes return 0 results but then if the query is run again inside a short window of time (but in an entirely new process) the query will return results. That's all I've been able to do so far but will try to carve out time to get more concrete debugging information soon.

thisisaaronland avatar Mar 22 '24 18:03 thisisaaronland

Ping @thisisaaronland

vangent avatar Apr 01 '24 17:04 vangent

Any updates here @thisisaaronland ?

vangent avatar Sep 09 '24 16:09 vangent