Do not allow connections when the model is loaded

Open hauck-jvsh opened this issue 1 year ago • 7 comments

Some of our transcription workers were not able to load the model. I think that with this kind of exception it would be better to close the application; that way it is easier to see that something is wrong with that worker. What do you think, @lfcnassif?

image

hauck-jvsh, Aug 03 '23 21:08

You mean the java process, right? It's fine with me, but we need to keep logging the error before closing, otherwise non-Kubernetes users will be left in the dark in this situation...

lfcnassif, Aug 03 '23 22:08

I agree with you, the message should be logged before closing the application. I think this behavior is better even for those who are not using Kubernetes, as files will not be sent to the problematic node.

hauck-jvsh, Aug 04 '23 20:08

Are problematic nodes that didn't load the model correctly accepting transcription requests? If yes, this seems to be a bug. I thought they would try to load the model indefinitely and not accept requests, but I'm away from my computer right now, so I can't check.

lfcnassif, Aug 04 '23 20:08

Yes, they accept connections and send the following error to the IPED client:

2023-08-04 16:50:04 [WARN] [task.transcript.RemoteWav2Vec2TranscriptTask] Fail to transcribe on server: 10.61.86.41:8000 audio: base-audios-teste.zip>>base-audios-teste/20/PTT-20211103-WA0034.opus error: Exception while transcribing: java.lang.RuntimeException: iped.exception.IPEDException: Error loading '/home/transcript/.cache/huggingface/hub/models--jonatasgrosman--wav2vec2-xls-r-1b-portuguese/snapshots/8926743abe7e95bb81b64305cb3c5fa85173f6b0' transcription model..

hauck-jvsh, Aug 04 '23 21:08

Was this caused by a GPU driver update on the host machine? AFAIK the java process starts to listen for requests only after all transcription python processes, which load the model, have started correctly. If the python processes eventually crash after the initial startup and after transcribing some audios, e.g. because the host system inadvertently updated a driver, that may cause this situation, since IPED would try to restart the python processes...

So I suggest exiting the java process only if this specific model loading error occurs. The transcription process may crash for other reasons, like corrupted audios or rare bugs in the transcription library, and we should retry in those situations.
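
A minimal sketch of that distinction in Java, assuming the model loading failure can be recognized by the message fragments seen in the log line quoted above ("Error loading" and "transcription model"). The class and method names are hypothetical and are not taken from the IPED code base; the real server may have a more reliable signal than message matching.

```java
/** Hypothetical helper, not part of IPED: classifies transcription failures. */
final class ModelLoadErrorCheck {

    // Assumption: fragments copied from the error message quoted in this thread.
    private static final String PREFIX = "Error loading";
    private static final String SUFFIX = "transcription model";

    // Upper bound on the cause-chain walk, as a guard against cyclic causes.
    private static final int MAX_DEPTH = 32;

    private ModelLoadErrorCheck() {
    }

    /** Returns true if the throwable or any of its causes looks like a model loading failure. */
    static boolean isModelLoadError(Throwable t) {
        int depth = 0;
        for (Throwable c = t; c != null && depth < MAX_DEPTH; c = c.getCause(), depth++) {
            String msg = c.getMessage();
            if (msg != null && msg.contains(PREFIX) && msg.contains(SUFFIX)) {
                return true;
            }
        }
        return false;
    }
}
```

A caller could log the exception first and then exit the java process only when `isModelLoadError` returns true, retrying on every other failure.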

lfcnassif, Aug 04 '23 21:08

I'm aware that other issues may occur, so I think that this is the approach. Maybe we can try to detect when several audios in sequence are failing and abort the process. Do you think that this can happen with a recoverable error?

hauck-jvsh, Aug 04 '23 21:08

> Maybe we can try to detect when several audios in sequence are failing and abort the process. Do you think that this can happen with a recoverable error?

That is a good idea. Many failures in sequence on different audios probably mean something is very wrong with the service.
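
A rough sketch of such a check in Java, assuming the server reports the outcome of each transcription to a shared counter and that each call refers to a different audio. `ConsecutiveFailureGuard` and its threshold are hypothetical names for illustration, not existing IPED code.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper, not part of IPED: tracks a streak of failed transcriptions. */
final class ConsecutiveFailureGuard {

    private final int maxConsecutiveFailures;
    private final AtomicInteger streak = new AtomicInteger();

    ConsecutiveFailureGuard(int maxConsecutiveFailures) {
        if (maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /** A successful transcription ends the current failure streak. */
    void recordSuccess() {
        streak.set(0);
    }

    /** Records one failure; returns true once the streak reaches the threshold. */
    boolean recordFailure() {
        return streak.incrementAndGet() >= maxConsecutiveFailures;
    }
}
```

When `recordFailure()` returns true, the caller could log the last error and then abort the process. A single success resets the streak, so isolated failures such as a corrupted audio would still be retried as usual.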

lfcnassif, Aug 04 '23 21:08

@hauck-jvsh do you remember if we implemented this?

lfcnassif, Mar 12 '24 11:03

> I'm aware that other issues may occur, so I think that this is the approach. Maybe we can try to detect when several audios in sequence are failing and abort the process. Do you think that this can happen with a recoverable error?

This wasn't implemented, but the part of not allowing audios when the model is being loaded is already running.

hauck-jvsh, Mar 12 '24 13:03

> but the part of not allowing audios when the model is being loaded is already running.

Ok, it seems it was implemented in https://github.com/sepinf-inc/IPED/pull/1944, so I'm closing this as completed.

lfcnassif, Mar 12 '24 17:03