amazon-sagemaker-examples
LDA batch transform fails at scale (even on original training data)
I have successfully trained an LDA model on a corpus of ~100k documents. As a next step, I wish to predict the topic distribution associated with each document using batch transform. The document-term matrix is in application/x-recordio-protobuf format.
My code is as follows:
# Create a batch transformer from the trained LDA estimator.
transformer = lda.transformer(
    instance_count=1,
    instance_type='ml.c5.4xlarge',
    max_payload=1,
    output_path=batch_output,
    accept='application/x-recordio-protobuf',
)

# Run the transform over the recordio-protobuf corpus in S3 and block until done.
transformer.transform(
    data=batch_input,
    data_type='S3Prefix',
    content_type='application/x-recordio-protobuf',
    split_type='RecordIO',
)
transformer.wait()
Using the original corpus as input results in an InternalServerError.
Using a subset of the original corpus works, as long as the subset is sufficiently small.
I have tried up to an ml.c4.18xlarge instance, as well as up to 5 instances of ml.c5.4xlarge. I have tried restricting max_payload and/or max_concurrent_transforms to 1. None of these help.
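For reference, the throttled variant looked roughly like this (a sketch; exact values varied between attempts):

# One of the attempted variants: larger instance, concurrency and payload throttled to 1.
transformer = lda.transformer(
    instance_count=1,
    instance_type='ml.c4.18xlarge',
    max_concurrent_transforms=1,
    max_payload=1,
    output_path=batch_output,
    accept='application/x-recordio-protobuf',
)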
Here is the full error message:
ValueError: Error for Transform job lda-2019-06-04-22-56-24-198-2019-06-04-22-56-24-517: Failed Reason: InternalServerError: We encountered an internal error. Please try again.
Digging into the logs, I see the data log contains only the following:
MaxConcurrentTransforms=1, MaxPayloadInMB=1, BatchStrategy=MULTI_RECORD
temp-lda-results/lda-d100-predict.data: Unable to get response from algorithm
The main log contains, among many other messages, five instances of the following two errors:
[CRITICAL] WORKER TIMEOUT (pid:89)
terminate called after throwing an instance of 'std::system_error'. what(): No such process
Again, it runs fine with a smaller set of documents: e.g., 2,000 documents are processed without errors, but 20,000 will throw this error.
UPDATE: I recently changed the LDA alpha0 parameter from 1 to 0.1. Now the error occurs when batch transforming as few as 500 documents.
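For context, the retraining change was only the alpha0 value (a sketch, assuming the first-party sagemaker.LDA estimator; the uppercase names are placeholders from my setup):

from sagemaker import LDA

# Retrained with alpha0 lowered from 1.0 to 0.1; everything else as before.
lda = LDA(
    role=ROLE,                             # placeholder IAM role
    train_instance_type='ml.c5.4xlarge',
    num_topics=NUM_TOPICS,                 # placeholder, unchanged
    alpha0=0.1,                            # was 1.0
)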
Any assistance would be much appreciated.
It works only if the batch strategy is set to 'SingleRecord'. If you try 'MultiRecord', it doesn't work and gives the same error you mentioned. The job took a bit longer to complete, but that's the only way I could figure it out.
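If it helps, the only change needed is passing strategy='SingleRecord' when creating the transformer (a sketch against the snippet above; everything else stays the same):

transformer = lda.transformer(
    instance_count=1,
    instance_type='ml.c5.4xlarge',
    strategy='SingleRecord',  # one record per request instead of MULTI_RECORD
    output_path=batch_output,
    accept='application/x-recordio-protobuf',
)
transformer.transform(
    data=batch_input,
    data_type='S3Prefix',
    content_type='application/x-recordio-protobuf',
    split_type='RecordIO',
)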
Thanks for looking into this, but it doesn't seem to work for me with 'SingleRecord' either. With 'SingleRecord', I get the following error in the data log:
[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=6, BatchStrategy=SINGLE_RECORD
[sagemaker logs]: temp-lda-results/document_term_matrices/lda-d200000-sampling5-stemFalse-minwc150-purged04-predict.data: Too much data for max payload size
Currently, the only way I've found around this is to manually break the dataset into smaller files of 100 records each (see the sketch below). But if a single record can be processed successfully when drawn from a 100-record file, I can't imagine why it would be "too much data" when the same record is read from a larger file...
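For anyone working around it the same way, here is roughly how I split the file (a sketch, assuming sagemaker.amazon.common.read_recordio is available to iterate raw record payloads; the write helper mirrors the RecordIO framing of magic number, length, and 4-byte padding):

import struct

from sagemaker.amazon.common import read_recordio  # yields raw record payloads

_MAGIC = 0xCED7230A  # RecordIO magic number used by recordio-protobuf files

def _write_record(f, payload):
    # Frame one payload: magic, length, data, zero-padding to a 4-byte boundary.
    f.write(struct.pack('<II', _MAGIC, len(payload)))
    f.write(payload)
    f.write(b'\x00' * ((4 - len(payload) % 4) % 4))

def split_recordio(src_path, records_per_file=100):
    # Split one large recordio-protobuf file into parts of records_per_file records.
    part, count, out = 0, 0, None
    with open(src_path, 'rb') as src:
        for payload in read_recordio(src):
            if out is None:
                out = open('{}.part{:04d}'.format(src_path, part), 'wb')
            _write_record(out, payload)
            count += 1
            if count == records_per_file:
                out.close()
                out, count, part = None, 0, part + 1
    if out is not None:
        out.close()

Uploading the resulting part files under a single S3 prefix and pointing batch_input at that prefix lets the transform job pick them all up.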
Thanks, the 'SingleRecord' strategy worked perfectly for me!