André

19 comments by André

Hi @harthur, Thanks for using Amazon SageMaker! There are two SageMaker clients: the `AmazonSageMaker` client, which is used to create and manage Training Jobs, Endpoints, and such, and the `AmazonSageMakerRuntime`...
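For context, a minimal sketch of the two client types using the AWS SDK for Java v1 builders (which the Spark SDK uses under the hood); the endpoint name and payload here are made up for illustration:

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

import com.amazonaws.services.sagemaker.AmazonSageMakerClientBuilder
import com.amazonaws.services.sagemaker.model.ListTrainingJobsRequest
import com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClientBuilder
import com.amazonaws.services.sagemakerruntime.model.InvokeEndpointRequest

// Control-plane client: creates and manages Training Jobs, Endpoints, and so on.
val sagemaker = AmazonSageMakerClientBuilder.defaultClient()
val trainingJobs = sagemaker.listTrainingJobs(new ListTrainingJobsRequest())

// Runtime client: only talks to already-deployed Endpoints.
val runtime = AmazonSageMakerRuntimeClientBuilder.defaultClient()
val response = runtime.invokeEndpoint(
  new InvokeEndpointRequest()
    .withEndpointName("my-endpoint") // hypothetical endpoint name
    .withContentType("text/csv")
    .withBody(ByteBuffer.wrap("1.0,2.0,3.0".getBytes(StandardCharsets.UTF_8))))
```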

Hey @harthur, I'm not sure exactly when the client is instantiated. It's possible we should make that a `lazy val` or otherwise delay instantiation. Do you have a stack trace...
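The `lazy val` idea in a nutshell; the class and field names below are hypothetical, not the SDK's actual code:

```scala
import com.amazonaws.services.sagemakerruntime.{AmazonSageMakerRuntime, AmazonSageMakerRuntimeClientBuilder}

class EndpointInvoker {
  // With a plain `val`, the client is built as soon as the enclosing object is
  // constructed. With `lazy val`, construction is deferred until the first use,
  // so it happens on whichever JVM actually calls it.
  lazy val runtimeClient: AmazonSageMakerRuntime =
    AmazonSageMakerRuntimeClientBuilder.defaultClient()
}
```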

Hi @harthur, Thanks for the stack trace! Just FYI: I haven't gotten a chance to reproduce this yet, but this definitely seems like a bug. I suppose that workers are...

Hey @harthur, Ah, interesting, thanks for the update! Glad to hear you got it working, but you're right, we should let users build their own client. I've put a...

@haowang-ms89 All SageMakerEstimators rely on Spark's DataFrame writers. The `XGBoostSageMakerEstimator` defaults to writing data in `libsvm` format. Can you try passing in `"csv"` to `trainingSparkDataFormat` (or `"com.databricks.spark.csv"` if you're using...

@haowang-ms89 Sure! I just commented on that issue. You will also have to pass in `Some("csv")` for the `trainingContentType`, or XGBoost will think you're trying to give it LibSVM data....
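Putting this and the previous comment's suggestions together, a rough sketch of the estimator configuration in the Scala SDK; the role ARN and instance types are placeholders, and the exact constructor defaults may differ by version:

```scala
import com.amazonaws.services.sagemaker.sparksdk.IAMRole
import com.amazonaws.services.sagemaker.sparksdk.algorithms.XGBoostSageMakerEstimator

val estimator = new XGBoostSageMakerEstimator(
  sagemakerRole = IAMRole("arn:aws:iam::123456789012:role/SageMakerRole"), // placeholder ARN
  trainingInstanceType = "ml.m4.xlarge",
  trainingInstanceCount = 1,
  endpointInstanceType = "ml.m4.xlarge",
  endpointInitialInstanceCount = 1,
  trainingSparkDataFormat = "csv",   // or "com.databricks.spark.csv" on older Spark versions
  trainingContentType = Some("csv")) // tell XGBoost the training channel is CSV, not LibSVM

// num_round is a required XGBoost hyperparameter; setter name per the SDK's Param-based API.
estimator.setNumRound(25)
```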

`transform()` is trying to convert your DataFrame to LibSVM for inference because the `requestRowSerializer` is set to be `LibSVMRequestRowSerializer`: https://github.com/aws/sagemaker-spark/blob/81ac05625e86db577124d7c49d4cea7ec25d181f/sagemaker-spark-sdk/src/main/scala/com/amazonaws/services/sagemaker/sparksdk/algorithms/XGBoostSageMakerEstimator.scala#L479-L480 If you want to send CSV, you should use this...
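In the Scala SDK, that means constructing the CSV serializer instead of the default LibSVM one, roughly like this (the details of the default column handling are my assumption):

```scala
import com.amazonaws.services.sagemaker.sparksdk.transformation.serializers.UnlabeledCSVRequestRowSerializer

// Serializes each row's features vector as a comma-separated line (text/csv)
// instead of the LibSVM text the default serializer produces.
val csvSerializer = new UnlabeledCSVRequestRowSerializer()

// Hand it to the estimator via its requestRowSerializer constructor parameter
// so the SageMakerModel produced by fit() sends CSV to the endpoint.
```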

@haowang-ms89 For PySpark, it's here: https://github.com/aws/sagemaker-spark/blob/81ac05625e86db577124d7c49d4cea7ec25d181f/sagemaker-pyspark-sdk/src/sagemaker_pyspark/transformation/serializers/serializers.py#L31-L40

@haowang-ms89 That's normal. The VectorAssembler encodes vectors sparsely to save memory when there are lots of zeros in the data. The rows with 27 are SparseVectors. The 27 is the...
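A quick illustration of how Spark prints a sparse vector, assuming 27 columns as in the data above; the leading number is the vector's size, not a feature value:

```scala
import org.apache.spark.ml.linalg.Vectors

// A 27-dimensional vector with non-zero values only at indices 0 and 3:
// size first, then the non-zero indices, then their values.
val sparse = Vectors.sparse(27, Array(0, 3), Array(1.0, 5.0))
println(sparse) // (27,[0,3],[1.0,5.0])

// The same data densely encoded stores all 27 values, zeros included.
val dense = Vectors.dense(1.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
                          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
                          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
println(dense)  // prints all 27 values: [1.0,0.0,0.0,5.0,0.0,...]
```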

That looks like it's still using the LibSVM serializer, not the `UnlabeledCSVRequestRowSerializer`. The LibSVM serializer validates the schema like this: https://github.com/aws/sagemaker-spark/blob/81ac05625e86db577124d7c49d4cea7ec25d181f/sagemaker-spark-sdk/src/main/scala/com/amazonaws/services/sagemaker/sparksdk/transformation/serializers/SchemaValidators.scala#L28-L30 Did you set `xgboost_model.requestRowSerializer = UnlabeledCSVRequestRowSerializer()` before transforming?