
Unexplained blocking behavior with DynamoDB client

Open QnJ1c2kNCg opened this issue 2 years ago • 14 comments

Describe the bug

I've encountered a problem that I cannot explain: the program deadlocks. See the reproduction steps for more details.

An interesting note: if I create a new DynamoDB client instead of cloning the existing one, I do not see the issue.

Expected Behavior

I would have expected to see something like:

Hello, world!
Sync background scan size: 7528
Async background scan size: 7528
Sync background scan size: 7528
Sync background scan size: 7528
Sync background scan size: 7528
Sync background scan size: 7528
Sync background scan size: 7528
Sync background scan size: 7528
...

Current Behavior

I'm seeing a deadlock:

Hello, world!
Sync background scan size: 7528
Async background scan size: 7528

^C

Reproduction Steps

async fn async_background_scan(client: aws_sdk_dynamodb::Client) {
    let scan = client.scan().table_name("Buckets").send().await.unwrap();
    println!("Async background scan size: {}", scan.count());
    // Busy-loop that never yields back to the tokio executor.
    loop {}
}

fn blocking_background_scan(client: aws_sdk_dynamodb::Client) {
    loop {
        // Drive the SDK future to completion on this blocking thread using
        // the futures crate's executor rather than tokio's.
        let scan = futures::executor::block_on(client.scan().table_name("Buckets").send()).unwrap();
        println!("Sync background scan size: {}", scan.count());
    }
}

#[tokio::main]
async fn main() {
    let config = aws_config::load_from_env().await;
    let ddb_client = aws_sdk_dynamodb::Client::new(&config);

    // The blocking task receives a clone of the same client, so both tasks
    // share one underlying connection pool.
    tokio::spawn(async_background_scan(ddb_client.clone()));
    tokio::task::spawn_blocking(move || blocking_background_scan(ddb_client));

    println!("Hello, world!");
    loop {}
}

Possible Solution

No response

Additional Information/Context

No response

Version

├── aws-config v0.47.0
│   ├── aws-http v0.47.0
│   │   ├── aws-smithy-http v0.47.0
│   │   │   ├── aws-smithy-types v0.47.0
│   │   ├── aws-smithy-types v0.47.0 (*)
│   │   ├── aws-types v0.47.0
│   │   │   ├── aws-smithy-async v0.47.0
│   │   │   ├── aws-smithy-client v0.47.0
│   │   │   │   ├── aws-smithy-async v0.47.0 (*)
│   │   │   │   ├── aws-smithy-http v0.47.0 (*)
│   │   │   │   ├── aws-smithy-http-tower v0.47.0
│   │   │   │   │   ├── aws-smithy-http v0.47.0 (*)
│   │   │   │   ├── aws-smithy-types v0.47.0 (*)
│   │   │   ├── aws-smithy-http v0.47.0 (*)
│   │   │   ├── aws-smithy-types v0.47.0 (*)
│   ├── aws-sdk-sso v0.17.0
│   │   ├── aws-endpoint v0.47.0
│   │   │   ├── aws-smithy-http v0.47.0 (*)
│   │   │   ├── aws-types v0.47.0 (*)
│   │   ├── aws-http v0.47.0 (*)
│   │   ├── aws-sig-auth v0.47.0
│   │   │   ├── aws-sigv4 v0.47.0
│   │   │   │   ├── aws-smithy-http v0.47.0 (*)
│   │   │   ├── aws-smithy-http v0.47.0 (*)
│   │   │   ├── aws-types v0.47.0 (*)
│   │   ├── aws-smithy-async v0.47.0 (*)
│   │   ├── aws-smithy-client v0.47.0 (*)
│   │   ├── aws-smithy-http v0.47.0 (*)
│   │   ├── aws-smithy-http-tower v0.47.0 (*)
│   │   ├── aws-smithy-json v0.47.0
│   │   │   └── aws-smithy-types v0.47.0 (*)
│   │   ├── aws-smithy-types v0.47.0 (*)
│   │   ├── aws-types v0.47.0 (*)
│   ├── aws-sdk-sts v0.17.0
│   │   ├── aws-endpoint v0.47.0 (*)
│   │   ├── aws-http v0.47.0 (*)
│   │   ├── aws-sig-auth v0.47.0 (*)
│   │   ├── aws-smithy-async v0.47.0 (*)
│   │   ├── aws-smithy-client v0.47.0 (*)
│   │   ├── aws-smithy-http v0.47.0 (*)
│   │   ├── aws-smithy-http-tower v0.47.0 (*)
│   │   ├── aws-smithy-query v0.47.0
│   │   │   ├── aws-smithy-types v0.47.0 (*)
│   │   ├── aws-smithy-types v0.47.0 (*)
│   │   ├── aws-smithy-xml v0.47.0
│   │   ├── aws-types v0.47.0 (*)
│   ├── aws-smithy-async v0.47.0 (*)
│   ├── aws-smithy-client v0.47.0 (*)
│   ├── aws-smithy-http v0.47.0 (*)
│   ├── aws-smithy-http-tower v0.47.0 (*)
│   ├── aws-smithy-json v0.47.0 (*)
│   ├── aws-smithy-types v0.47.0 (*)
│   ├── aws-types v0.47.0 (*)
├── aws-sdk-dynamodb v0.17.0
│   ├── aws-endpoint v0.47.0 (*)
│   ├── aws-http v0.47.0 (*)
│   ├── aws-sig-auth v0.47.0 (*)
│   ├── aws-smithy-async v0.47.0 (*)
│   ├── aws-smithy-client v0.47.0 (*)
│   ├── aws-smithy-http v0.47.0 (*)
│   ├── aws-smithy-http-tower v0.47.0 (*)
│   ├── aws-smithy-json v0.47.0 (*)
│   ├── aws-smithy-types v0.47.0 (*)
│   ├── aws-types v0.47.0 (*)


Environment details (OS name and version, etc.)

x86_64 GNU/Linux

Logs

No response

QnJ1c2kNCg avatar Aug 24 '22 18:08 QnJ1c2kNCg

Related: https://github.com/tokio-rs/tokio/issues/4730

I'll take a look at this when I'm able (maybe next week, no promises 😄). In the meantime, you may have some success if you use tokio-console to take a look and see which task is blocking everything.

(I'm leaving the needs-triage label on this so it doesn't get swept under the rug.)

Velfi avatar Aug 25 '22 16:08 Velfi

I've been rather busy but I'm going to take a look at this issue this week.

Velfi avatar Sep 06 '22 19:09 Velfi

I've had a bit of trouble reliably reproducing this from your example code. Sometimes it runs fine and other times it gets stopped up after the first iteration of the sync loop, just as you're experiencing.

(Screenshot: tokio-console output from running the repro, Sep 7, 2022)

Based on tokio-console, it really does look to be that "one bad task can hold up the executor" issue that I posted in my earlier comment.

I'm not sure what we can do about this other than wait for the tokio team to address the issue.

For anyone else who runs into an issue like this, try adding console_subscriber::init(); and running tokio-console to see if you're getting similar results.
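A minimal setup looks something like this (a sketch; it assumes the console-subscriber crate is a dependency, and tokio-console requires building with RUSTFLAGS="--cfg tokio_unstable"):

#[tokio::main]
async fn main() {
    // Register the tracing layer that the tokio-console TUI attaches to.
    console_subscriber::init();

    // ... the rest of the reproduction from above
}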

I think it's important to track this issue but I've removed the bug tag since there's really nothing that we can do about this short of contributing a fix to tokio or changing the default executor in the SDK.

Velfi avatar Sep 07 '22 17:09 Velfi

Firstly, thanks a ton for looking into this!

Sometimes it runs fine and other times it gets stopped up after the first iteration of the sync loop

Same here, the repro isn't 100% consistent.

This is very interesting, thanks for the tokio issue link.

Any intuition as to why creating a new client instead of cloning seems to help? I was unable to reproduce while using two distinct clients. Using two clients is a decent enough mitigation for now.

I'll keep an eye on both these issues, thanks again.

QnJ1c2kNCg avatar Sep 07 '22 18:09 QnJ1c2kNCg

Any intuition as to why creating a new client instead of cloning seems to help? I was unable to reproduce while using two distinct clients. Using two clients is a decent enough mitigation for now.

The client is just using an Arc under the hood, so the clone ends up referencing the same hyper client and connection pool. Creating a new client will result in that client having its own hyper client. How that plays into this particular scenario, I don't know enough to answer.
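For illustration, the two-distinct-clients mitigation mentioned above would look something like this (a sketch based on the repro code):

#[tokio::main]
async fn main() {
    let config = aws_config::load_from_env().await;

    // Build two separate clients instead of cloning one: each Client::new
    // call gets its own hyper client, so the async and blocking paths no
    // longer share a connection pool.
    let async_client = aws_sdk_dynamodb::Client::new(&config);
    let blocking_client = aws_sdk_dynamodb::Client::new(&config);

    tokio::spawn(async_background_scan(async_client));
    tokio::task::spawn_blocking(move || blocking_background_scan(blocking_client));

    println!("Hello, world!");
    loop {}
}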

jdisanti avatar Sep 07 '22 18:09 jdisanti

Any chance someone can run this under valgrind with --tool=helgrind? It should print out the deadlock if it's a traditional "incorrect mutex usage" deadlock.
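For anyone trying this, the invocation would be something along the lines of (the binary path is a placeholder for wherever your repro builds to):

valgrind --tool=helgrind ./target/debug/repro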

JonathanHenson avatar Feb 07 '23 19:02 JonathanHenson

I've been seeing this in ec2::DescribeVolumes calls...

gyuho avatar Mar 03 '23 03:03 gyuho

Also seeing this in s3::PutObject

gyuho avatar Mar 13 '23 05:03 gyuho

@jdisanti I can sort of reliably reproduce this (happens 1 out of 10 runs). Any pointer to debug this further?

(also hangs on ec2::DescribeInstances)

gyuho avatar Mar 15 '23 02:03 gyuho

Setting operation timeouts (https://github.com/gyuho/aws-manager/commit/e4672f0f246b49569547793989bf9730460cc2f0) seems to mask this issue... haven't encountered it for two days so far 👍🏼
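For reference, that kind of mitigation looks roughly like this (a sketch against a 2023-era SDK; the durations are placeholders, not the values from the linked commit):

use std::time::Duration;
use aws_smithy_types::timeout::TimeoutConfig;

// Cap the total operation time (including retries) and each individual
// attempt, so a wedged connection surfaces as a timeout error instead
// of hanging forever.
let timeout_config = TimeoutConfig::builder()
    .operation_timeout(Duration::from_secs(30))
    .operation_attempt_timeout(Duration::from_secs(10))
    .build();

let config = aws_config::from_env()
    .timeout_config(timeout_config)
    .load()
    .await;
let client = aws_sdk_dynamodb::Client::new(&config);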

gyuho avatar Mar 16 '23 14:03 gyuho

I didn't see this above in the thread: does this occur on both multi-threaded and single-threaded runtimes? @Velfi do you have a full workspace I can go investigate?

rcoh avatar Mar 21 '23 13:03 rcoh

Given that it happens about one in ten runs and that setting a low timeout makes it go away, a suspiciously round number like that suggests the problem is likely a broken connection in the pool being picked up by round-robin/random selection.

stevedw avatar Apr 12 '23 13:04 stevedw

I didn't see this above in the thread: does this occur on both multi-threaded and single-threaded runtimes? @Velfi do you have a full workspace I can go investigate?

@rcoh I do not. I just took some test data and was working with that. I can't recall how single vs. multi-threading affected this.

Velfi avatar Apr 12 '23 14:04 Velfi