IPED
Do not allow connections when the model is not loaded
Some of our transcription workers were not able to load the model. I think that with this kind of exception it would be better to close the application. That way it is easier to see that something is wrong with that worker. What do you think, @lfcnassif?
You mean the java process, right? It's fine with me, but we need to keep logging the error before closing, otherwise non-Kubernetes users will be left in the dark in this situation...
I agree with you, the message should be logged before closing the application. I think this behavior is better even for those who are not using Kubernetes, since files will no longer be sent to the problematic node.
Are problematic nodes that didn't load the model correctly accepting transcription requests? If so, this seems to be a bug. I thought they would keep trying to load the model indefinitely and not accept requests, but I'm away from my computer right now and can't check.
Yes, they accept connections and send the following error to the IPED client:
2023-08-04 16:50:04 [WARN] [task.transcript.RemoteWav2Vec2TranscriptTask] Fail to transcribe on server: 10.61.86.41:8000 audio: base-audios-teste.zip>>base-audios-teste/20/PTT-20211103-WA0034.opus error: Exception while transcribing: java.lang.RuntimeException: iped.exception.IPEDException: Error loading '/home/transcript/.cache/huggingface/hub/models--jonatasgrosman--wav2vec2-xls-r-1b-portuguese/snapshots/8926743abe7e95bb81b64305cb3c5fa85173f6b0' transcription model..
Was this caused by a GPU driver update on the host machine? AFAIK the java process starts listening for requests only after all the transcription python processes, which load the model, have started correctly. If the python processes crash later, after the initial start-up and after transcribing some audios (e.g. because the host system inadvertently updated a driver), that could cause this situation, since IPED would try to restart the python processes...
So I suggest exiting the java process only if this specific model loading error occurs. The transcription process may crash for other reasons, like corrupted audios or rare bugs in the transcription library, and we should retry in those situations.
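A minimal sketch of that policy, not the actual IPED code. The class name `ModelLoadFailurePolicy`, the `Action` enum and the idea of recognising the error by its message text are all assumptions made for illustration; the two message fragments are taken from the log line quoted above.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper: decides whether a transcription failure is fatal or can be retried. */
final class ModelLoadFailurePolicy {

    private static final Logger LOG = Logger.getLogger(ModelLoadFailurePolicy.class.getName());

    enum Action { EXIT, RETRY }

    /**
     * Walks the cause chain looking for the model loading message.
     * Assumption: the error is identified by its text, as seen in the log above.
     */
    static Action classify(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            String msg = t.getMessage();
            if (msg != null && msg.contains("Error loading") && msg.contains("transcription model")) {
                return Action.EXIT;
            }
        }
        // Corrupted audios, rare library bugs, etc. are treated as recoverable.
        return Action.RETRY;
    }

    /** Logs first, then exits only for the model loading error. */
    static void handle(Throwable error) {
        if (classify(error) == Action.EXIT) {
            LOG.log(Level.SEVERE, "Transcription model could not be loaded, shutting down", error);
            System.exit(1);
        }
        LOG.log(Level.WARNING, "Transcription failed, the worker will be retried", error);
    }
}
```

Logging before `System.exit(1)` keeps the error visible to users who are not running under an orchestrator, while the non-zero exit code lets Kubernetes notice the broken worker.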
I'm aware that other issues may occur, so I think this is the right approach. Maybe we could also detect when several audios in a row fail and abort the process. Do you think that could happen with a recoverable error?
> Maybe we could also detect when several audios in a row fail and abort the process. Do you think that could happen with a recoverable error?
That is a good idea. Many failures in a row on different audios probably mean something is very wrong with the service.
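For illustration only, the detection could look roughly like the sketch below. `ConsecutiveFailureTracker`, its threshold and the use of an audio identifier string are hypothetical; nothing like this is confirmed to exist in IPED.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical tracker: reports when enough different audios have failed in a row. */
final class ConsecutiveFailureTracker {

    private final int threshold;
    // Distinct audios that failed since the last successful transcription.
    private final Set<String> failedAudios = new HashSet<>();

    ConsecutiveFailureTracker(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        this.threshold = threshold;
    }

    /** Any success breaks the streak. */
    synchronized void recordSuccess() {
        failedAudios.clear();
    }

    /**
     * Records a failure and returns true once `threshold` different audios
     * have failed without a success in between. Retries of the same audio
     * count only once, so a single corrupted file cannot trigger an abort.
     */
    synchronized boolean recordFailure(String audioId) {
        failedAudios.add(audioId);
        return failedAudios.size() >= threshold;
    }
}
```

Counting distinct audios rather than raw failures is the design choice that separates "one bad file retried many times" from "the service is broken for everything".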
@hauck-jvsh do you remember if we implemented this?
> I'm aware that other issues may occur, so I think this is the right approach. Maybe we could also detect when several audios in a row fail and abort the process. Do you think that could happen with a recoverable error?
This wasn't implemented, but the part about not accepting audios while the model is being loaded is already running.
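As a rough sketch of what such a gate can look like (this is not the code from the IPED repository; `ModelReadinessGate` and its method names are made up for the example), the server only accepts requests while every worker process has its model loaded:

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical gate: requests are accepted only while all workers have loaded the model. */
final class ModelReadinessGate {

    private final int totalWorkers;
    private final AtomicInteger readyWorkers = new AtomicInteger();

    ModelReadinessGate(int totalWorkers) {
        if (totalWorkers < 1) {
            throw new IllegalArgumentException("totalWorkers must be at least 1");
        }
        this.totalWorkers = totalWorkers;
    }

    /** Called when a worker process reports that its model finished loading. */
    void workerReady() {
        readyWorkers.updateAndGet(n -> Math.min(totalWorkers, n + 1));
    }

    /** Called when a worker process crashes or starts reloading its model. */
    void workerLost() {
        readyWorkers.updateAndGet(n -> Math.max(0, n - 1));
    }

    /** The request handler checks this before taking a new audio. */
    boolean acceptsRequests() {
        return readyWorkers.get() >= totalWorkers;
    }
}
```

A request handler would call `acceptsRequests()` first and reject or close the connection when it returns false, so clients move on to another node instead of receiving a transcription error.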
> but the part about not accepting audios while the model is being loaded is already running.
OK, it seems this was implemented in https://github.com/sepinf-inc/IPED/pull/1944, so I'm closing this as completed.