Getting too many failures in processingJob

Open grossamit opened this issue 3 years ago • 18 comments

ClientError: Failed to download data. ListObjectsV2 failed for s3://.... nextToken:[null]: Unable to execute request to S3

The thing is that sometimes it succeeds and sometimes it doesn't. I've also added code to wait 10 seconds after the notebook upload and then verify with ListObjectsV2 that the file exists.

grossamit avatar Oct 18 '21 08:10 grossamit

@grossamit What are you running when you see this error? Where is the error coming from? Do you see this error all the time or just intermittently?

tomfaulhaber avatar Dec 07 '21 18:12 tomfaulhaber

@tomfaulhaber It happens intermittently. Please note that I'm specifying VPC and subnet lists for my run using VpcConfig and NetworkConfig. I get these errors a lot. The weird thing is that if you wait approximately 40 minutes it recovers, until it happens again. Full error: Failed (ClientError: Failed to download data. ListObjectsV2 failed for s3://aws-emr-resources-406095609952-us-east-1/dataAccess/[email protected]/papermill_input/searchVariables_MP_extractMaping_prod.ipynb-2021-12-31-09-26-33.ipynb, nextToken:[null]: Unable to execute request to S3)

I'm also running the notebook with parameters and an instance type.

grossamit avatar Dec 31 '21 09:12 grossamit

@grossamit My guess is that this is an issue with the way you're routing connections from your SageMaker Processing node to your VPC. One thing to check is that your subnet definitions are right, that your security groups don't rely on fixed IPs, and whether anything else could break depending on which IP address the SageMaker Processing instance is given.

tomfaulhaber avatar Jan 05 '22 01:01 tomfaulhaber

@tomfaulhaber Thanks for your reply! I believe that if that were the case, it would fail every time rather than intermittently. Currently I have ~30% success :-( I do not have fixed IPs. I'll play a little with the subnets.

grossamit avatar Jan 05 '22 07:01 grossamit

@grossamit I would expect exactly this behavior if, for example, you had a VPC with multiple subnets but only enabled the S3 endpoint for a single subnet.

tomfaulhaber avatar Jan 05 '22 17:01 tomfaulhaber
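
One way to test the "endpoint enabled for only some subnets" hypothesis is to compare each subnet's effective route table against the S3 gateway endpoint. The following is a minimal sketch, not part of sagemaker-run-notebook: the function name and all IDs are hypothetical, and it assumes `route_tables` has the shape of the `RouteTables` list returned by EC2 DescribeRouteTables for a single VPC. It also assumes a gateway-type endpoint, which appears as a route whose `GatewayId` is the endpoint ID.

```python
from typing import Dict, List, Optional


def subnets_without_endpoint_route(
    route_tables: List[dict], subnet_ids: List[str], endpoint_id: str
) -> List[str]:
    """Return the subnets whose effective route table has no route to endpoint_id."""

    def has_endpoint_route(table: dict) -> bool:
        # A gateway endpoint shows up as a route whose GatewayId is the endpoint ID.
        return any(r.get("GatewayId") == endpoint_id for r in table.get("Routes", []))

    explicit: Dict[str, dict] = {}      # subnet ID -> explicitly associated table
    main_table: Optional[dict] = None   # fallback for subnets with no association
    for table in route_tables:
        for assoc in table.get("Associations", []):
            if assoc.get("Main"):
                main_table = table
            if "SubnetId" in assoc:
                explicit[assoc["SubnetId"]] = table

    missing = []
    for subnet_id in subnet_ids:
        table = explicit.get(subnet_id, main_table)
        if table is None or not has_endpoint_route(table):
            missing.append(subnet_id)
    return missing
```

With boto3, the input could be fetched with `ec2.describe_route_tables(Filters=[{"Name": "vpc-id", "Values": [vpc_id]}])["RouteTables"]`. Any subnet returned by the function is a candidate for the intermittent failures, since a job placed there would have no route to S3 through the endpoint.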

Playing with this today, we realized there's an interaction between processing jobs and VPCs that's working differently than I understood. I think we can come up with a workaround.

tomfaulhaber avatar Jan 07 '22 01:01 tomfaulhaber

Hi @tomfaulhaber, could you share your findings? We face the same issue with a Processing job. We run the job in private subnets with this NetworkConfig:

 "NetworkConfig": {
        "EnableInterContainerTrafficEncryption": false,
        "EnableNetworkIsolation": false/true, # (we tried both)
        "VpcConfig": {
            "SecurityGroupIds": [
                "sg-xxx" 
            ],
            "Subnets": [
                "subnet-xxx",
                "subnet-xxx",
                "subnet-xxx"
            ]
        }
	}

But it can't access the bucket with the input data:

sagemaker.exceptions.UnexpectedStatusException: Error for Processing job my-processing-job: Failed. 
Reason: ClientError: Failed to download data. 
ListObjectsV2 failed for s3://my-bucket/input-data/, nextToken:[null]: 
Unable to execute request to S3

alena-m avatar Jan 13 '22 09:01 alena-m

Hi @tomfaulhaber, any progress with this? It really makes the solution unreliable. Is there anything I can help with?

grossamit avatar Mar 08 '22 10:03 grossamit

Any updates on this? Exact same issue

papierGaylard avatar May 24 '22 13:05 papierGaylard

For me, I need to run my SageMaker processing job within a VPC and a subnet. I'm specifying the subnet and VPC like so:

--extra '{ "NetworkConfig": { "EnableInterContainerTrafficEncryption": false, "EnableNetworkIsolation": false, "VpcConfig": { "SecurityGroupIds": [ "sg-xxxxxx" ], "Subnets": [ "subnet-xxxxxx" ] } } }'

However, I get an s3.listobject failure as soon as I use it. I need to operate within a VPC/subnet with an IP range to connect to another service as well.

papierGaylard avatar May 24 '22 14:05 papierGaylard
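
Hand-writing the JSON for `--extra` is easy to get wrong when quoting it for the shell. The sketch below builds the same NetworkConfig payload shown in the comment above and quotes it safely. The helper name `build_extra_arg` is hypothetical; only the `--extra` flag and the payload keys come from the thread.

```python
import json
import shlex
from typing import Iterable


def build_extra_arg(
    security_group_ids: Iterable[str],
    subnets: Iterable[str],
    network_isolation: bool = False,
) -> str:
    """Return an `--extra '<json>'` argument carrying a NetworkConfig block."""
    payload = {
        "NetworkConfig": {
            "EnableInterContainerTrafficEncryption": False,
            "EnableNetworkIsolation": network_isolation,
            "VpcConfig": {
                "SecurityGroupIds": list(security_group_ids),
                "Subnets": list(subnets),
            },
        }
    }
    # shlex.quote wraps the JSON in single quotes so the shell passes it verbatim.
    return "--extra " + shlex.quote(json.dumps(payload))
```

Calling `build_extra_arg(["sg-xxxxxx"], ["subnet-xxxxxx"])` yields a string that can be appended to the command line as-is.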

> Hi @tomfaulhaber, any progress with this? It really makes the solution unreliable. Is there anything I can help with?

So I think you need to create a VPC endpoint. For some reason processing jobs don't have access to AWS internal services such as S3, despite being inside your VPC/subnet and having an ARN and a role. A VPC endpoint is kind of like a pipe that gives SageMaker processing jobs direct access to specific internal services from inside the VPC.

Would probably be a good thing to add to the script, hah.

papierGaylard avatar May 26 '22 21:05 papierGaylard
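
The VPC endpoint suggestion above could be sketched with boto3 roughly as follows. This is an illustration under assumptions, not something the thread or the tool provides: it assumes a gateway-type S3 endpoint, the function names are hypothetical, and the VPC and route table IDs are placeholders to be replaced. The route tables passed in should cover every subnet listed in the job's VpcConfig.

```python
from typing import Iterable


def s3_gateway_endpoint_params(vpc_id: str, region: str, route_table_ids: Iterable[str]) -> dict:
    """Build the arguments for EC2 CreateVpcEndpoint for an S3 gateway endpoint."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        # Routes to S3 are added to each of these route tables.
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(vpc_id: str, region: str, route_table_ids: Iterable[str]) -> str:
    """Create the endpoint and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the parameter builder works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.create_vpc_endpoint(
        **s3_gateway_endpoint_params(vpc_id, region, route_table_ids)
    )
    return response["VpcEndpoint"]["VpcEndpointId"]
```

Keeping the parameter builder separate from the API call makes it possible to review the request before anything is created in the account.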

Also experiencing this. Any updates?

mattiasliljenzin avatar Jun 03 '22 10:06 mattiasliljenzin

I ended up switching back to no VPC after a few tries and realized that my IAM role's policy was slightly off. I only had the bucket ARN with /** after it, when I also needed the bare bucket ARN with no /** after it, as follows:

        {
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::bucket", <----- WAS MISSING THIS
                "arn:aws:s3:::bucket/**"
            ]
        },

I'll update it if I get it working with the VPC

schematical avatar Apr 05 '23 17:04 schematical
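
The IAM fix above works because `s3:ListBucket` (the permission behind ListObjectsV2) is evaluated against the bucket ARN itself, while `s3:GetObject` and `s3:PutObject` are evaluated against object ARNs. A minimal sketch that generates such a statement for a given bucket name; the helper name `s3_access_statement` is hypothetical.

```python
import json


def s3_access_statement(bucket: str) -> dict:
    """Return an IAM policy statement covering both the bucket and its objects."""
    return {
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
        "Resource": [
            f"arn:aws:s3:::{bucket}",    # the bucket itself: matched by s3:ListBucket
            f"arn:aws:s3:::{bucket}/*",  # objects in the bucket: matched by Get/PutObject
        ],
    }


def s3_access_policy(bucket: str) -> str:
    """Wrap the statement in a complete policy document and serialize it."""
    policy = {"Version": "2012-10-17", "Statement": [s3_access_statement(bucket)]}
    return json.dumps(policy, indent=4)
```

If the bucket-level ARN is missing from `Resource`, listing fails even though reads and writes of individual objects are allowed.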

> ClientError: Failed to download data. ListObjectsV2 failed for s3://.... nextToken:[null]: Unable to execute request to S3
>
> The thing is that sometimes it succeeds and sometimes it doesn't. I've also added code to wait 10 seconds after the notebook upload and then verify with ListObjectsV2 that the file exists.

I got the same error, but then I removed all the network config from my processing job, and it works!

ConstantSun avatar May 30 '23 17:05 ConstantSun

Any update here? I created an S3 VPC endpoint but it's still giving me that error. I'm using training jobs in an isolated subnet.

gabriel-loka avatar Jun 08 '23 18:06 gabriel-loka

@gabriel-loka I had the same problem. I solved it by allowing port 443 in the security group used for the connection.

takeru1205 avatar Jun 29 '23 02:06 takeru1205
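
The comment above does not say which security group or direction was changed. As one possible reading, the sketch below checks whether a job's security group allows outbound HTTPS, which S3 requests need. It assumes `security_group` has the shape of one entry in the `SecurityGroups` list returned by EC2 DescribeSecurityGroups; the function name is hypothetical, and the check deliberately ignores rule destinations.

```python
def allows_https_egress(security_group: dict) -> bool:
    """Return True if any egress rule permits TCP traffic on port 443.

    Destinations (CIDR ranges, prefix lists, other groups) are not inspected,
    so a True result means only that port 443 is open to at least one target.
    """
    for rule in security_group.get("IpPermissionsEgress", []):
        protocol = rule.get("IpProtocol")
        if protocol == "-1":  # "-1" means all protocols and all ports
            return True
        if protocol == "tcp" and rule.get("FromPort", 0) <= 443 <= rule.get("ToPort", 65535):
            return True
    return False
```

If an interface endpoint is used rather than a gateway endpoint, the endpoint's own security group would additionally need an inbound rule for port 443 from the job's subnets or security group.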

Facing this problem. Any update?

telmen87 avatar Oct 27 '23 15:10 telmen87

I had the same error. In my case I preferred not to have a NAT GW, so I used the public access option when I configured the domain in SageMaker. Following the suggestion to create a VPC endpoint for S3 solved this problem for me.

Thanks @papierGaylard ;)

michelpf avatar Jan 06 '24 13:01 michelpf