google-cloud-go icon indicating copy to clipboard operation
google-cloud-go copied to clipboard

firestore: Transaction "context canceled" error leads to write in firestore

Open compilersh opened this issue 2 years ago • 4 comments

Live firestore client version: v1.6.1 Tested locally on: v1.6.1 > v1.13.0

We noticed weird behavior in our live environment. Writes and updates that were in firestore that wasn't supposed to be there since they were handled in a transaction which errored out. I threw together a small test locally using the firestore emulator and can reproduce the error, might need some tinkering with timings to get it to work on other machines however.

package main

func main() {
        // ... client setup

	wg := sync.WaitGroup{}
	nCalls := 10
	startChan := make(chan struct{})
	now := time.Now()

	docID := fmt.Sprintf("testdoc_%d", time.Now().Unix())
	for i := 0; i < nCalls; i++ {
		wg.Add(1)
		go func(i int) {
			<-startChan
			err := client.RunTransaction(ctx, func(ctx context.Context, tx *firestore.Transaction) error {

				docRef := client.Collection("testbubbel").Doc(docID)

				_, err := tx.Get(docRef)
				if err != nil {
					if status.Code(err) != codes.NotFound {
						return err
					}
				}
				// fmt.Printf("worker: %d: data: %s\n", i, doc.Data())
				if err := tx.Set(docRef, map[string]string{"foo": fmt.Sprintf("bar_%d", i)}); err != nil {
					return err
				}

				// fmt.Printf("returning %d\n", i)
				return nil
			})
			fmt.Printf("err from worker %d: %+v\n", i, err)
			wg.Done()
		}(i)
	}
	close(startChan)
	time.Sleep(2300 * time.Millisecond)
	cancel()
	wg.Wait()

	fmt.Printf("time: %s\n", time.Since(now))
}

Output:

err from worker 1: rpc error: code = Canceled desc = context canceled
err from worker 2: rpc error: code = Canceled desc = context canceled
err from worker 6: rpc error: code = Canceled desc = context canceled
err from worker 8: rpc error: code = Canceled desc = context canceled
err from worker 0: rpc error: code = Canceled desc = context canceled
err from worker 7: rpc error: code = Canceled desc = context canceled
err from worker 9: rpc error: code = Canceled desc = context canceled
err from worker 5: rpc error: code = Canceled desc = context canceled
err from worker 3: rpc error: code = Canceled desc = context canceled
err from worker 4: rpc error: code = Canceled desc = context canceled
time: 2.301270541s

Expected behavior

  1. Since all transactions resulted in an error, I expect there to be no update
  2. The call that leads to context canceled error should return no error if it has reached a point of no return

Actual behavior

Some transactions are pushed through even if there is an error (intermittently).

Additional context

I am aware that the emulator doesn't behave exactly as the live environment but since we saw the issue first in our live env and I could get the behavior to reproduce in the emulator I'm quite convinced this issue lies in the client.

Note that this is an intermittent error, not all calls that return a context canceled error leads to a write so we can't trust that a write has happened if that is the error

Update

I tried setting the nCount to 1 and could reproduce it by setting the cancel to 50 milliseconds, so this is not something that needs parallel calls to happen

compilersh avatar Oct 18 '23 06:10 compilersh

Triaged & tagged.

noahdietz avatar Nov 07 '23 00:11 noahdietz

This is by design. Transactions are not rollbacked on context cancelled error.

bhshkh avatar Jan 27 '24 01:01 bhshkh

@bhshkh How can intermittent behavior be expected? how am I supposed to know if a dataset was updated or not when the error provides no additional context other than a “context cancelled” error?

the workaround is to create a new context for all transactions since we can’t rely on your library’s handling of cancellations

compilersh avatar Jan 27 '24 08:01 compilersh

Reducing the priority s there is a workaround available. Will look into the issue soon

bhshkh avatar Apr 15 '24 18:04 bhshkh