Bug with "waiting to finish uploads"
Hello! When running multiple experiments one by one, at some point I hit an infinite "waiting to finish uploads". Can you please add a parameter to abort the task on task close after some timeout?
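For reference, a hedged user-side workaround sketch of such a timeout (the close_with_timeout helper below is hypothetical, not a ClearML API): run task.close() in a daemon thread and stop waiting after a deadline.

import threading

from clearml import Task


def close_with_timeout(task, timeout_sec=300.0):
    """Return True if task.close() finished within timeout_sec, False otherwise."""
    closer = threading.Thread(target=task.close, daemon=True)
    closer.start()
    closer.join(timeout=timeout_sec)
    return not closer.is_alive()


if __name__ == '__main__':
    task = Task.init(project_name='Upload test', task_name='timeout close demo')
    if not close_with_timeout(task, timeout_sec=300.0):
        print('task.close() did not finish in time; uploads may be incomplete')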
Hi @RedMoon32 ,
The close process has some timeouts for some of the steps (detecting repository for example).
The "waiting to finish uploads" step won't be infinite, but it can take some time, depending on what you are uploading. Do you have a way I can reproduce it? Do you see any progress in the app (UI)?
No, I don't see any progress in the UI, and I don't work with big repositories, just with a Jupyter notebook. I will think about how you can reproduce this problem.
@RedMoon32 which clearml version do you use?
latest, 1.0.4, but I am not sure about the version on the local server
Hello, I would suggest using
sudo clearml-data close --verbose
The verbose output may give you a better hint as to why it is taking so long.
@ErenBalatkan I think this issue is for the "atexit" callback, when leaving the process, so no actual argument can be passed ...
@RedMoon32 or is it task.close() you are calling ?
Is there a way to reproduce it ?
What do you mean by "multiple experiments one by one" ?
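For context, a minimal sketch of the two shutdown paths being discussed here (project/task names are just placeholders):

from clearml import Task

task = Task.init(project_name='Upload test', task_name='shutdown paths')

# Path 1: do nothing at the end of the script; ClearML's atexit hook flushes
# and closes the task when the interpreter exits (no argument can be passed there).

# Path 2: close explicitly at the end of the script; this call is the only place
# a per-call timeout option could be accepted.
task.close()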
Getting the same problem: "Waiting to finish uploads" hung for 2+ hours. With the same data it can sometimes finish quickly (within 3 min), which suggests that it is a hung job.
Hi @angusfong , are you using the latest ClearML SDK version?
Hi, I'm also getting stuck on "Waiting to finish uploads". I'm on version 1.3.2 and trying to upload a simple Python dict. The artifact does get uploaded and is visible in the UI, but the Python script does not end due to the error and I need to abort the task manually.
@derEitel Does this reproduce all the time with the same code?
@jkhenning yes it does
@jkhenning Running the following script in Jupyter Lab leads to a cell that never finishes executing, although it does upload the artifact:
from clearml import Task
import numpy as np
task = Task.init(project_name='Upload test', task_name='uploading dict')
my_dict = {'apples': 2,
           'oranges': [1, 3, 5, 6],
           'vegetables': {'potatoes': 3,
                          'aubergine': np.array([3, 5, 6])}}
task.upload_artifact(name='result_dict', artifact_object=my_dict)
task.close()
Edit: actually it does not properly upload the artifact. The preview is there, but the link to the pickle file returns a 404.
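For anyone debugging this, a hedged variant of the script above, assuming wait_on_upload and flush(wait_for_uploads=True) are available in your SDK version, makes the upload blocking, which helps tell a slow upload apart from a hang on close:

import numpy as np
from clearml import Task

task = Task.init(project_name='Upload test', task_name='uploading dict (blocking)')
my_dict = {'apples': 2,
           'oranges': [1, 3, 5, 6],
           'vegetables': {'potatoes': 3,
                          'aubergine': np.array([3, 5, 6])}}

# Block until the artifact is actually stored instead of deferring the upload.
task.upload_artifact(name='result_dict', artifact_object=my_dict, wait_on_upload=True)

# Force any remaining uploads to finish before closing.
task.flush(wait_for_uploads=True)
task.close()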
Hi @derEitel,
We're taking a look, we'll update here once we have a fix
Hi @derEitel ,
We've tried your code and it works well on our side. Can you help us with some more info about your environment, such as which operating system you tried it on, your Python and installed module versions, and which cloud you're uploading to?
Also, we fixed some issues that might cause this. Can you try the latest version of ClearML and update us about the issue?
Hello @derEitel ,
Have you tried our latest release? Is this still reproducing?
I'm not sure if it is the same issue, but I'm getting the following message (every 30s) on most files:
2023-10-16 06:30:56,491 - clearml.Task - INFO - Waiting for previous model to upload (2 pending, .../optimizer.pt)
For each file it takes ~40 minutes to upload, even 13 KB files. My pipeline suddenly takes more than 6 hours instead of 20 minutes. After a day of debugging I decided to delete the dataset, the project, and the pipeline, after which it started working again. While this works for now, I'm not sure how it got fixed, and I would not like this issue to happen again in production. What can cause the artifact upload to wait?
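One thing that can queue many model uploads is framework auto-logging (e.g. PyTorch checkpoints such as optimizer.pt). A hedged sketch of limiting that, using the standard Task.init options and made-up project/task names:

from clearml import Task

task = Task.init(
    project_name='my_project',     # placeholder names, for illustration only
    task_name='pipeline_step',
    # Skip automatic upload of PyTorch checkpoints (e.g. optimizer.pt) if they
    # are not needed as output models; they can still be uploaded manually.
    auto_connect_frameworks={'pytorch': False},
)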
@janwytze are you using a self-hosted server? What SDK version are you using?
@jkhenning Yes it is a self-hosted server. I use clearml==1.12.2.