
Bug with "waiting to finish uploads"

Open RedMoon32 opened this issue 4 years ago • 18 comments

Hello! When running multiple experiments one after another, at some point the process hangs indefinitely on "waiting to finish uploads". Could you please add a parameter to abort the task on task close after some timeout?

RedMoon32 avatar Jul 12 '21 09:07 RedMoon32

Hi @RedMoon32 ,

The close process has timeouts for some of its steps (detecting the repository, for example). The "waiting to finish uploads" step won't run forever but can take some time, depending on what you are uploading. Do you have a way I can reproduce it? Do you see any progress in the app (UI)?

JDennisJ avatar Jul 12 '21 10:07 JDennisJ

No, I don't see any progress in the UI, and I don't work with big repositories, just with a Jupyter notebook. I will think about how you can reproduce this problem.

RedMoon32 avatar Jul 13 '21 08:07 RedMoon32

@RedMoon32 which clearml version do you use?

shomratalon avatar Jul 13 '21 09:07 shomratalon

latest, 1.0.4, but I am not sure about the version on the local server

RedMoon32 avatar Jul 13 '21 10:07 RedMoon32

Hello, I would suggest using

sudo clearml-data close --verbose

Verbose output may give you a better hint as to why it is taking so long.

ErenBalatkan avatar Jul 14 '21 14:07 ErenBalatkan

@ErenBalatkan I think this issue concerns the "atexit" callback that runs when the process exits, so no argument can actually be passed. @RedMoon32, or is it task.close() you are calling? Is there a way to reproduce it? What do you mean by "multiple experiments one by one"?

bmartinn avatar Jul 15 '21 22:07 bmartinn
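
The timeout requested in the original post can be approximated on the caller's side when the hang happens in an explicit task.close() call. The following is a minimal sketch, not part of the ClearML API: close_with_timeout is a hypothetical helper that runs any blocking callable in a daemon thread and stops waiting after a deadline. It does not cancel the pending upload, and it presumably would not help if the wait happens in the "atexit" callback instead.

```python
import threading


def close_with_timeout(close_fn, timeout_sec):
    """Hypothetical helper: call close_fn() but wait at most timeout_sec.

    Returns True if close_fn finished in time, False if the wait timed out.
    An exception raised by close_fn is re-raised in the caller.
    """
    done = threading.Event()
    errors = []

    def _runner():
        try:
            close_fn()
        except Exception as exc:  # keep the error so the caller can see it
            errors.append(exc)
        finally:
            done.set()

    # A daemon thread does not keep the interpreter alive on exit.
    worker = threading.Thread(target=_runner, daemon=True)
    worker.start()

    finished = done.wait(timeout_sec)
    if errors:
        raise errors[0]
    return finished


# Assumed usage with an existing ClearML task object:
# if not close_with_timeout(task.close, timeout_sec=300):
#     print("task.close() did not return within 5 minutes")
```

In a Jupyter kernel the worker thread keeps running in the background after a timeout, so this only unblocks the cell; the stuck upload itself still needs to be diagnosed.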

I'm getting the same problem: "Waiting to finish uploads" hung for 2+ hours. With the same data it sometimes finishes almost immediately (within 3 min), which suggests that it is a hung job.

angusfong avatar Jan 24 '22 11:01 angusfong

Hi @angusfong , are you using the latest ClearML SDK version?

jkhenning avatar Jan 24 '22 21:01 jkhenning

Hi, I'm also getting stuck on "Waiting to finish uploads". I'm on version 1.3.2 and trying to upload a simple Python dict. The artifact does get uploaded and is visible in the UI, but the Python script does not end due to the error and I need to abort the task manually.

derEitel avatar Apr 27 '22 12:04 derEitel

@derEitel Does this reproduce all the time with the same code?

jkhenning avatar Apr 27 '22 12:04 jkhenning

@jkhenning yes it does

derEitel avatar Apr 27 '22 12:04 derEitel

@jkhenning Running the following script in Jupyter Lab leads to a cell that never finishes executing, although the artifact is uploaded:

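from clearml import Task
import numpy as np

# Create (or reuse) a task to attach the artifact to
task = Task.init(project_name='Upload test', task_name='uploading dict')

# Nested dict mixing plain Python values and a NumPy array
my_dict = {'apples': 2,
           'oranges': [1, 3, 5, 6],
           'vegetables': {'potatoes': 3,
                          'aubergine': np.array([3, 5, 6])}
           }

# Upload the dict as an artifact, then close the task (this is where the cell hangs)
task.upload_artifact(name='result_dict', artifact_object=my_dict)
task.close()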

Edit: actually it does not properly upload the document. I can find a previewer but the link to the pickle file returns a 404.

derEitel avatar Apr 27 '22 12:04 derEitel

Hi @derEitel,

We're taking a look and will update here once we have a fix.

erezalg avatar May 02 '22 12:05 erezalg

Hi @derEitel ,

We've tried your code and it works well. Can you help us with some more info about your environment, such as which operating system you tried it on, the Python and installed module versions, and which cloud you're uploading to?

Also, we fixed some stuff that might cause this issue. So, can you try the latest version of ClearML and update us about the issue?

Rizwan-Hasan avatar May 16 '22 21:05 Rizwan-Hasan
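
The environment details requested above (operating system, Python version, installed module versions) can be gathered with the standard library alone. This is a rough sketch; collect_env_info is a hypothetical helper name, and the default package list is an assumption based on the imports in the reproduction script.

```python
import platform
import sys
from importlib import metadata


def collect_env_info(packages=("clearml", "numpy")):
    """Return a dict with OS, Python version and selected package versions."""
    info = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in the current environment
            info[name] = "not installed"
    return info


# Assumed usage: print one "key: value" line per entry to paste into the issue
# for key, value in collect_env_info().items():
#     print(f"{key}: {value}")
```

The storage destination (which cloud or file server the artifacts go to) is not covered here and still has to be reported separately.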

Hello @derEitel ,

Have you tried our latest release? Is this still reproducing?

Rizwan-Hasan avatar May 26 '22 07:05 Rizwan-Hasan

I'm not sure if it is the same issue, but I'm getting the following message (every 30s) on most files:

2023-10-16 06:30:56,491 - clearml.Task - INFO - Waiting for previous model to upload (2 pending, .../optimizer.pt)

For each file it takes ~40 minutes to upload, even for 13KB files. My pipeline suddenly takes more than 6 hours instead of 20 minutes. After a day of debugging I decided to delete the dataset, the project and the pipeline, after which it started working again. While this works for now, I'm not sure how it got fixed, and I would not like this issue to happen again in production. What can cause the artifact upload to wait?

janwytze avatar Oct 17 '23 08:10 janwytze

@janwytze are you using a self-hosted server? What SDK version are you using?

jkhenning avatar Oct 24 '23 15:10 jkhenning

@jkhenning Yes it is a self-hosted server. I use clearml==1.12.2.

janwytze avatar Oct 27 '23 14:10 janwytze