
LDA batch transform fails at scale (even on original training data)

Open · jonlehrer opened this issue 6 years ago · 3 comments

I have successfully trained an LDA model on a corpus of ~100k documents. As a next step, I want to predict the topic distribution for each document using batch transform. The document-term matrix is in application/x-recordio-protobuf format.

My code is as follows:

transformer = lda.transformer(
    instance_count=1,
    instance_type='ml.c5.4xlarge',
    max_payload=1,
    output_path=batch_output,
    accept='application/x-recordio-protobuf',
)
transformer.transform(
    data=batch_input,
    data_type='S3Prefix',
    content_type='application/x-recordio-protobuf',
    split_type='RecordIO',
)
transformer.wait()

Using the original corpus as input results in an InternalServerError.

Using a subset of the original corpus works so long as it is sufficiently small.

I have tried up to an ml.c4.18xlarge instance, as well as up to 5 instances of ml.c5.4xlarge. I have tried restricting max_payload and/or max_concurrent_transforms to 1. None of these help.

Here is the full error message:

ValueError: Error for Transform job lda-2019-06-04-22-56-24-198-2019-06-04-22-56-24-517: Failed Reason: InternalServerError: We encountered an internal error.  Please try again.

Digging into logs:

The data log only contains the following:

MaxConcurrentTransforms=1, MaxPayloadInMB=1, BatchStrategy=MULTI_RECORD
temp-lda-results/lda-d100-predict.data: Unable to get response from algorithm

The main log contains, amongst many messages, 5 instances of the following two errors:

[CRITICAL] WORKER TIMEOUT (pid:89)
terminate called after throwing an instance of 'std::system_error'. what(): No such process

Again, it runs fine with a smaller set of documents: e.g. 2,000 documents are processed without errors, but 20,000 will throw this error.

UPDATE: I recently changed the LDA alpha0 parameter from 1 to 0.1. Now the error happens with as few as 500 documents being batch transformed.

Any assistance would be much appreciated.

jonlehrer commented Jun 04 '19

It only works if the batch strategy is set to 'SingleRecord'. If you try 'MultiRecord', it doesn't work and gives the same error you mentioned. The job took a bit longer to complete, but that's the only workaround I could figure out.
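
For reference, a rough sketch of what that looks like with the SageMaker Python SDK, reusing the estimator (lda) and the S3 paths (batch_input, batch_output) from the original snippet; the instance type is an assumption, and only the strategy argument changes:

transformer = lda.transformer(
    instance_count=1,
    instance_type='ml.c5.4xlarge',
    strategy='SingleRecord',  # one record per request instead of MULTI_RECORD batching
    output_path=batch_output,
    accept='application/x-recordio-protobuf',
)
transformer.transform(
    data=batch_input,
    data_type='S3Prefix',
    content_type='application/x-recordio-protobuf',
    split_type='RecordIO',
)
transformer.wait()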

ghimanshu1273 commented Jul 25 '19

Thanks for looking into this, but it doesn't seem to work for me with 'SingleRecord' either. With 'SingleRecord', I get the following error in the data log:

[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=6, BatchStrategy=SINGLE_RECORD
[sagemaker logs]: temp-lda-results/document_term_matrices/lda-d200000-sampling5-stemFalse-minwc150-purged04-predict.data: Too much data for max payload size

Currently, the only way I've found around this is to manually break the dataset into smaller files of 100 records each. But if a single record can be processed successfully when drawn from a 100-record file, I can't imagine why it would be "too much data" when the same record is read from a larger file...
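
In case it helps anyone else, here is a rough sketch of that splitting step, assuming the document-term matrix is available locally as a scipy.sparse matrix (dtm) and that the bucket and prefix names below are placeholders:

import io
import boto3
from sagemaker.amazon.common import write_spmatrix_to_sparse_tensor

bucket = 'my-bucket'                # placeholder bucket name
prefix = 'lda-batch-input/chunks'   # placeholder S3 prefix
chunk_size = 100                    # records per output file

s3 = boto3.client('s3')
for start in range(0, dtm.shape[0], chunk_size):
    chunk = dtm[start:start + chunk_size]        # slice of the document-term matrix
    buf = io.BytesIO()
    write_spmatrix_to_sparse_tensor(buf, chunk)  # serialize chunk to RecordIO-protobuf
    buf.seek(0)
    key = '{}/part-{:05d}.data'.format(prefix, start // chunk_size)
    s3.upload_fileobj(buf, bucket, key)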

jonlehrer commented Aug 15 '19

It only works if the batch strategy is set to 'SingleRecord'. If you try 'MultiRecord', it doesn't work and gives the same error you mentioned. The job took a bit longer to complete, but that's the only workaround I could figure out.

Thanks, the 'SingleRecord' strategy worked perfectly for me!

JacoMoolman commented Aug 23 '22