Missing json error when trying to compile a semantic segmentation model (builtin algorithm) with Neo

Open MCE-KobyBo opened this issue 3 years ago • 12 comments

Describe the bug Not sure if this is a bug or an unsupported feature. We trained a semantic segmentation model using the built-in SageMaker semantic segmentation algorithm (FCN with ResNet-50) and were able to deploy it successfully. We then wanted to compile it with Neo to improve inference performance and to be able to deploy it to an inf1 instance. When I try to compile the model (based on examples in the sample notebooks), I receive the following error:

ClientError: InputConfiguration: No valid Mxnet model file -symbol.json found

The model.tar.gz for semantic segmentation models contains hyperparams.json, model_algo-1, and model_best.params. According to the docs, model_algo-1 is the serialized MXNet model. Aren't Gluon models supported by Neo? If not, can I manually use Gluon/MXNet to save the required symbol JSON from the serialized model in order to use Neo? Thanks!

To reproduce Train a semantic segmentation model using the SageMaker built-in algorithm with FCN and ResNet-50, then call the estimator's compile_model.

Expected behavior Neo should successfully compile the model.

Screenshots or logs If applicable, add screenshots or logs to help explain your problem.

System information A description of your system. Please provide:

  • SageMaker Python SDK version:
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): Semantic segmentation (builtin)
  • Framework version:
  • Python version: 3.6
  • CPU or GPU: Running in a SageMaker Jupyter notebook, hosted on a CPU instance
  • Custom Docker image (Y/N):

Additional context Add any other context about the problem here.

MCE-KobyBo avatar Dec 30 '20 20:12 MCE-KobyBo

Hi @MCE-KobyBo

Sagemaker training stores the actual Mxnet model (*-symbol.json and *.params) in a non-standard way by zipping them into the model_algo-1 file - which is actually a zip file with no extension.

You can work around this by unzipping the model_algo-1 file and creating a new .tar.gz with the *-symbol.json and *.params files, which can be submitted to Neo for compilation.
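The repacking step described above can be sketched with the Python standard library alone. This is a minimal sketch, not an official procedure: the function name `repack_for_neo` is hypothetical, and it assumes that `model_algo-1` really is a zip archive containing `*-symbol.json` and `*.params` files, which later comments in this thread suggest is not true for every built-in algorithm.

```python
import io
import tarfile
import zipfile
from pathlib import Path


def repack_for_neo(model_tar, out_tar, inner_name="model_algo-1"):
    """Copy *-symbol.json and *.params from the inner zip into a new .tar.gz.

    Hypothetical helper: assumes `inner_name` is a zip file stored inside
    the training job's model.tar.gz. Returns the names that were packed.
    """
    with tarfile.open(model_tar, "r:gz") as tar:
        # Match on the base name so entries like "./model_algo-1" also work.
        inner = next(
            (m for m in tar.getmembers() if Path(m.name).name == inner_name),
            None,
        )
        if inner is None:
            raise FileNotFoundError(f"{inner_name} not found in {model_tar}")
        inner_bytes = tar.extractfile(inner).read()

    if not zipfile.is_zipfile(io.BytesIO(inner_bytes)):
        raise ValueError(f"{inner_name} is not a zip archive")

    packed = []
    with zipfile.ZipFile(io.BytesIO(inner_bytes)) as zf, \
            tarfile.open(out_tar, "w:gz") as out:
        for name in zf.namelist():
            if name.endswith("-symbol.json") or name.endswith(".params"):
                data = zf.read(name)
                # Store files at the top level of the new archive.
                info = tarfile.TarInfo(name=Path(name).name)
                info.size = len(data)
                out.addfile(info, io.BytesIO(data))
                packed.append(info.name)
    return packed
```

Calling `repack_for_neo("model.tar.gz", "model-neo.tar.gz")` would write the new archive, which would then need to be uploaded to S3 before being submitted to Neo. If the function raises `ValueError`, the inner file uses a different format and this workaround does not apply.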

trevor-m avatar Jan 12 '21 18:01 trevor-m

@trevor-m Thanks for your reply. I read about that somewhere, but it doesn't seem to be the case here. The model.tar.gz file produced by the training job contains 3 files: hyperparams.json, model_algo-1 and model_best.params. model_algo-1 doesn't seem to be a zip file in this case, it has the same size and format as model_best.params.
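One way to check the observation above is to list each file in model.tar.gz along with its size and whether it is a zip archive. The sketch below uses only the standard library; the helper name `describe_model_archive` is hypothetical, and it reads each member fully into memory, which is acceptable for a quick check but may be slow for very large models.

```python
import io
import tarfile
import zipfile


def describe_model_archive(model_tar):
    """Return {member name: (size in bytes, is_zip)} for each regular file.

    Hypothetical diagnostic helper for a SageMaker model.tar.gz.
    """
    report = {}
    with tarfile.open(model_tar, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories and links
            data = tar.extractfile(member).read()
            is_zip = zipfile.is_zipfile(io.BytesIO(data))
            report[member.name] = (member.size, is_zip)
    return report
```

If `model_algo-1` is reported as not being a zip and has the same size as `model_best.params`, that is consistent with it being a plain parameter file rather than a zipped symbol/params pair.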

They both seem to be exported GluonCV model parameters. To test that, I followed the sample code in this Stack Overflow thread (but using FCN), and it worked: I was able to load the file directly with load_params. This means that for Neo, I may have to export the model manually.

MCE-KobyBo avatar Jan 13 '21 07:01 MCE-KobyBo

Thanks for the response! It appears that SageMaker built-in models do not have a consistent format. While unzipping the model_algo-1 file worked for other SageMaker built-in models such as LinearLearner, the file appears to use a different internal format in this case. I would advise asking the SageMaker built-in segmentation model team how to extract the underlying model, or avoiding SageMaker built-in models altogether.

trevor-m avatar Feb 02 '21 22:02 trevor-m

@MCE-KobyBo Did you manage to solve this manually? I'm having similar issues.

taroko-mooncake avatar Sep 10 '21 08:09 taroko-mooncake

@taroko-mooncake Unfortunately no. We didn't have time to keep trying, so we decided not to use Neo for now.

MCE-KobyBo avatar Sep 14 '21 15:09 MCE-KobyBo

@MCE-KobyBo I solved it. If you post the question on Stack Overflow, I can send you the answer.

taroko-mooncake avatar Sep 16 '21 10:09 taroko-mooncake

Hey @taroko-mooncake, I'm ready to post it on Stack Overflow in order to get a solution.

korimarik avatar Oct 05 '21 07:10 korimarik

Sure - send me the link to the question.

taroko-mooncake avatar Oct 07 '21 15:10 taroko-mooncake

@taroko-mooncake @korimarik Did you end up solving this? Was there a Stack Overflow Q/A posted? I'm having the same problem using the built-in Semantic Segmentation model and SageMaker Neo.

dcoder4 avatar Mar 11 '22 03:03 dcoder4

Yes, I did solve it. If you post a Stack Overflow Q/A, I can send you the answer.

taroko-mooncake avatar Mar 22 '22 09:03 taroko-mooncake

Thanks @taroko-mooncake appreciate the help. I have posted a StackOverflow Question here: [https://stackoverflow.com/questions/71579883/missing-symbol-json-error-when-trying-to-compile-a-sagemaker-semantic-segmentat]

dcoder4 avatar Mar 22 '22 22:03 dcoder4

I have the same issue with Linear Learner. The generated model.tar.gz has a file model.algo-1, which is a zip file. I was able to compile the model only after I unpacked the file and created a separate tar.gz file with only one symbol.json file and one params file.

vveselov avatar Sep 15 '22 15:09 vveselov