sagemaker-python-sdk

An error occurred (ValidationError) when calling the PutRecord operation: Resource Not Found: Amazon SageMaker can't find a FeatureGroup

Open oonisim opened this issue 2 years ago • 1 comments

Describe the bug The Feature Group ingest method sporadically fails to insert some records and throws the exception "Failed to ingest row 0: An error occurred (ValidationError) when calling the PutRecord operation: Resource Not Found: Amazon SageMaker can't find a FeatureGroup with name", although the FeatureGroup exists. Sometimes one record fails, sometimes several, and there is no specific pattern.

Querying from the Feature Group returns records, so the FeatureGroup is there and records have been inserted.

feature_store_query.run(
    query_string=query_string,
    output_location=feature_group_query_uri,
)

feature_store_query.wait()
feature_store_query.as_dataframe().head()
-----
review_id | star_rating | review_date
-- | -- | --
RM4XTAWT3FV8S | 1 | 2015-08-03T00:00:00Z
RM4XTAWT3FV8S | 1 | 2015-08-03T00:00:00Z
RM4XTAWT3FV8S | 1 | 2015-08-03T00:00:00Z
R195KUJIQS3UR7 | 5 | 2015-08-03T00:00:00Z
R195KUJIQS3UR7 | 5 | 2015-08-03T00:00:00Z

To reproduce

  1. Open SageMaker Studio in us-east-1 in a non-VPC deployment.
  2. Run the code below in SageMaker Studio. The call to feature_group.ingest(...) raises the exception.
import time
import json
import multiprocessing

import sagemaker
from sagemaker.session import Session
from sagemaker import get_execution_role

NUM_CPUS = multiprocessing.cpu_count()
role = get_execution_role()
session = sagemaker.Session()
region = session.boto_region_name
bucket = session.default_bucket()


from sagemaker.feature_store.feature_definition import (
    FeatureDefinition,
    FeatureTypeEnum,
)

feature_definitions = [
    FeatureDefinition(feature_name="review_id", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="review_date", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="star_rating", feature_type=FeatureTypeEnum.INTEGRAL),
]


feature_group_prefix = "sagemaker-feature-group"
feature_group_name = "amazon-product-review"
feature_group_offline_uri = f"s3://{bucket}/{feature_group_prefix}/{feature_group_name}/features"
feature_group_query_uri = f"s3://{bucket}/{feature_group_prefix}/{feature_group_name}/queries"

record_identifier_feature_name = "review_id"
event_time_feature_name = "review_date"

from sagemaker.feature_store.feature_group import FeatureGroup

feature_group = FeatureGroup(
    name=feature_group_name, 
    feature_definitions=feature_definitions, 
    sagemaker_session=session
)

def wait_for_feature_group_creation_complete(feature_group):
    status = feature_group.describe().get("FeatureGroupStatus")
    print("Waiting for Feature Group Creation")
    print("Feature Group status: {}".format(status))
    while status == "Creating":
        time.sleep(5)
        status = feature_group.describe().get("FeatureGroupStatus")
        print("Feature Group status: {}".format(status))
        
    if status != "Created":
        print("Feature Group creation failed. Status: {}".format(status))
        raise RuntimeError(f"Failed to create feature group {feature_group.name}")
    else:
        print(f"FeatureGroup {feature_group.name} successfully created.")


try:
    print("Creating Feature Group with role {}...".format(role))
    response = feature_group.create(
        s3_uri=feature_group_offline_uri,
        record_identifier_name=record_identifier_feature_name,
        event_time_feature_name=event_time_feature_name,
        role_arn=role,
        enable_online_store=True,
    )

    print("Waiting for new Feature Group to become available...")
    wait_for_feature_group_creation_complete(feature_group)
    feature_group.describe()

    print("Creating Feature Group. Completed.")

except Exception as e:
    raise RuntimeError("Feature Group creation failed: {}".format(e)) from e

client = session.boto_session.client(
    "sagemaker", region_name=region
)
client.list_feature_groups()
feature_group.describe()

import pandas as pd
import s3fs

amazon_product_review_bucket = "amazon-reviews-pds"
generator = pd.read_csv(
    f"s3://{amazon_product_review_bucket}/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz",
    header=0,
    usecols=["review_id", "star_rating", "review_date"],
    parse_dates=["review_date"],
    sep='\t',
    compression="gzip",
    chunksize=1024 * (NUM_CPUS -1) * 3
)

df = next(generator)
df.dropna(inplace=True)
df['review_date'] = df['review_date'].dt.strftime('%Y-%m-%dT%H:%M:%SZ')
df.head()

feature_group.describe()
feature_group.ingest(             # <---------- Causes the error
    data_frame = df,
    max_processes=1,
    max_workers=3,
    wait=True
)

feature_store_query = feature_group.athena_query()
feature_store_table = feature_store_query.table_name

query_string = """
SELECT review_id, star_rating, review_date FROM "{}" LIMIT 10
""".format(
    feature_store_table
)

print("Running " + query_string)

feature_store_query.run(
    query_string=query_string,
    output_location=feature_group_query_uri,
)

feature_store_query.wait()
feature_store_query.as_dataframe()

Expected behavior All the records get inserted successfully.

Screenshots or logs

Failed to ingest row 0: An error occurred (ValidationError) when calling the PutRecord operation: Resource Not Found: Amazon SageMaker can't find a FeatureGroup with name [amazon-product-review].
Failed to ingest row 0 to 1024
---------------------------------------------------------------------------
IngestionError                            Traceback (most recent call last)
<ipython-input-40-28d1e702daaf> in <module>
      4     max_workers=3,
      5 #    timeout=3,
----> 6     wait=True
      7 )

/opt/conda/lib/python3.7/site-packages/sagemaker/feature_store/feature_group.py in ingest(self, data_frame, max_workers, max_processes, wait, timeout)
    596         )
    597 
--> 598         manager.run(data_frame=data_frame, wait=wait, timeout=timeout)
    599 
    600         return manager

/opt/conda/lib/python3.7/site-packages/sagemaker/feature_store/feature_group.py in run(self, data_frame, wait, timeout)
    347                 if timeout is reached.
    348         """
--> 349         self._run_multi_process(data_frame=data_frame, wait=wait, timeout=timeout)
    350 
    351 

/opt/conda/lib/python3.7/site-packages/sagemaker/feature_store/feature_group.py in _run_multi_process(self, data_frame, wait, timeout)
    290 
    291         if wait:
--> 292             self.wait(timeout=timeout)
    293 
    294     def _run_multi_threaded(self, data_frame: DataFrame, row_offset=0, timeout=None) -> List[int]:

/opt/conda/lib/python3.7/site-packages/sagemaker/feature_store/feature_group.py in wait(self, timeout)
    259             raise IngestionError(
    260                 self._failed_indices,
--> 261                 f"Failed to ingest some data into FeatureGroup {self.feature_group_name}",
    262             )
    263 

IngestionError: [0] -> Failed to ingest some data into FeatureGroup amazon-product-review
Failed to ingest row 0: An error occurred (ValidationError) when calling the PutRecord operation: Resource Not Found: Amazon SageMaker can't find a FeatureGroup with name [amazon-product-review].
Failed to ingest row 0 to 1024
---------------------------------------------------------------------------
IngestionError                            Traceback (most recent call last)
<ipython-input-44-48dcfcf21274> in <module>
      4     max_workers=3,
      5     timeout=30,
----> 6     wait=True
      7 )

/opt/conda/lib/python3.7/site-packages/sagemaker/feature_store/feature_group.py in ingest(self, data_frame, max_workers, max_processes, wait, timeout)
    596         )
    597 
--> 598         manager.run(data_frame=data_frame, wait=wait, timeout=timeout)
    599 
    600         return manager

/opt/conda/lib/python3.7/site-packages/sagemaker/feature_store/feature_group.py in run(self, data_frame, wait, timeout)
    347                 if timeout is reached.
    348         """
--> 349         self._run_multi_process(data_frame=data_frame, wait=wait, timeout=timeout)
    350 
    351 

/opt/conda/lib/python3.7/site-packages/sagemaker/feature_store/feature_group.py in _run_multi_process(self, data_frame, wait, timeout)
    290 
    291         if wait:
--> 292             self.wait(timeout=timeout)
    293 
    294     def _run_multi_threaded(self, data_frame: DataFrame, row_offset=0, timeout=None) -> List[int]:

/opt/conda/lib/python3.7/site-packages/sagemaker/feature_store/feature_group.py in wait(self, timeout)
    259             raise IngestionError(
    260                 self._failed_indices,
--> 261                 f"Failed to ingest some data into FeatureGroup {self.feature_group_name}",
    262             )
    263 

IngestionError: [0] -> Failed to ingest some data into FeatureGroup amazon-product-review
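
The IngestionError in the tracebacks above carries the indices of the rows that failed ([0] here). Since the failures are reported as sporadic, one possible stopgap is to retry only those rows. The sketch below is not a confirmed fix from this thread. It assumes the exception exposes a failed_rows attribute holding DataFrame index labels, and ingest_with_retry and ingest_fn are hypothetical names introduced for illustration.

```python
import time


def ingest_with_retry(ingest_fn, data_frame, max_attempts=3, base_delay=1.0):
    """Retry ingestion for only the rows reported as failed.

    Returns the index labels that still failed after the last attempt
    (an empty list when every row was ingested).
    """
    pending = data_frame
    failed = []
    for attempt in range(max_attempts):
        try:
            ingest_fn(pending)
            return []
        except Exception as exc:
            # Assumption: the exception exposes `failed_rows` (index labels),
            # as the IngestionError in the traceback appears to.
            failed = getattr(exc, "failed_rows", None)
            if failed is None:
                raise  # unrelated error: do not hide it
            failed = list(failed)
        if attempt < max_attempts - 1:
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
            pending = data_frame.loc[failed]  # keep only the failed rows
    return failed


# Hypothetical usage with the objects from the reproduction script:
# leftover = ingest_with_retry(
#     lambda frame: feature_group.ingest(data_frame=frame, max_workers=3, wait=True),
#     df,
# )
```

If the returned list is non-empty after all attempts, the failure is probably not transient and the request details are worth collecting for AWS support.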

System information

  • SageMaker Python SDK version: '2.49.1'
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): NA
  • Framework version:
  • Python version: 3.7.10 (default, Jun 4 2021, 14:48:32) [GCC 7.5.0]
  • CPU or GPU: CPU (SageMaker Studio DataScience kernel)
  • Custom Docker image (Y/N): N

oonisim avatar Aug 30 '21 04:08 oonisim

I got the same issue. Has it been solved?

liyunrui avatar May 28 '22 05:05 liyunrui

I am facing the same issue. Is there any lead on a solution? Can someone help me with another way of ingesting data into a feature group?
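
For reference, one other way to write rows is to call PutRecord directly through the boto3 "sagemaker-featurestore-runtime" client, one row at a time. This is only a sketch: it uses the same PutRecord operation named in the error, so it may fail in the same way, but it does show exactly which row failed and with what exception. The helper names row_to_record and put_rows are hypothetical.

```python
def row_to_record(row):
    """Convert {feature_name: value} into the Record list PutRecord expects."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in row.items()
        if value is not None  # PutRecord has no representation for missing values
    ]


def put_rows(runtime_client, feature_group_name, rows):
    """Write rows one at a time; return (row, error) pairs for the failures."""
    failures = []
    for row in rows:
        try:
            runtime_client.put_record(
                FeatureGroupName=feature_group_name,
                Record=row_to_record(row),
            )
        except Exception as exc:  # collect instead of aborting the whole batch
            failures.append((row, exc))
    return failures


# Hypothetical usage with the DataFrame from the reproduction script:
# import boto3
# runtime = boto3.client("sagemaker-featurestore-runtime", region_name="us-east-1")
# failures = put_rows(runtime, "amazon-product-review", df.to_dict(orient="records"))
```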

Pooja-Karangale avatar Jan 27 '23 09:01 Pooja-Karangale

This looks like a service issue, not an SDK issue.

@Pooja-Karangale would you have some IDs of failed requests, along with the time and region, that can help us debug?

EDIT: Looks like boto3 won't log request IDs even for exceptional cases unless verbose logging is enabled. If you have request IDs, please share them here. Otherwise, open a case with AWS support where you can share more sensitive details like Account ID that'll help with the investigation.
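
A minimal sketch of two ways to get at request IDs, assuming the standard botocore behavior: debug-level logging on the "botocore" logger should print the response headers, and a botocore ClientError carries a response dict with ResponseMetadata.RequestId. The second helper only applies when PutRecord is called directly, because ingest() logs the error message rather than re-raising the original exception. Both function names are hypothetical.

```python
import logging


def enable_botocore_debug_logging():
    """Turn on verbose botocore logging so response headers appear in the logs."""
    logging.basicConfig(level=logging.INFO)  # installs a root handler if none exists
    logging.getLogger("botocore").setLevel(logging.DEBUG)


def request_id_from_error(exc):
    """Return the AWS request ID attached to an exception, or None if absent."""
    response = getattr(exc, "response", None) or {}
    return response.get("ResponseMetadata", {}).get("RequestId")


# Hypothetical usage around a direct PutRecord call:
# try:
#     runtime_client.put_record(FeatureGroupName=name, Record=record)
# except Exception as exc:
#     print("request id:", request_id_from_error(exc))
```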

psnilesh avatar Jan 30 '23 05:01 psnilesh

Does anyone know if this was fixed or if there is a workaround? I have the same problem.

OmarDispatch avatar Apr 04 '23 01:04 OmarDispatch

If you see this issue consistently, please share the request IDs, the region, and the timeframe in which you encountered it. Alternatively, you can open a case with AWS support to provide more sensitive details.

jiapinw avatar May 30 '23 17:05 jiapinw

Thank you for opening this issue. Closing this issue as per the above comment. Please feel free to reopen if you continue to see this issue with the latest sagemaker version.

mufaddal-rohawala avatar Dec 21 '23 21:12 mufaddal-rohawala