Sample images stop reporting with no errors
I use report_image to show some of my validation image samples, but at some point it breaks with no errors:
Logger.current_logger().report_image(
    "Autotrain",
    image_title,
    iteration=self.current_epoch,
    image=versus_image,
)
The problem: up to a certain point (which is the same each time, around 100 reported images), something breaks and the latest reported images don't show in the ClearML web interface (even though they are valid images that I can save and open). Here are my latest logs from when it breaks:
...
clearml-apiserver | [2022-05-11 17:22:17,718] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 2ms
clearml-apiserver | [2022-05-11 17:22:17,750] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 21ms
clearml-apiserver | [2022-05-11 17:22:17,780] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 21ms
clearml-apiserver | [2022-05-11 17:22:47,915] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 2ms
clearml-apiserver | [2022-05-11 17:22:47,953] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 25ms
clearml-apiserver | [2022-05-11 17:22:52,989] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 22ms
clearml-apiserver | [2022-05-11 17:23:01,970] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 3ms
clearml-apiserver | [2022-05-11 17:23:01,992] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 2ms
clearml-apiserver | [2022-05-11 17:23:02,228] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 17ms
clearml-apiserver | [2022-05-11 17:23:03,037] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 15ms
clearml-apiserver | [2022-05-11 17:23:06,373] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all_ex in 10ms
clearml-apiserver | [2022-05-11 17:23:06,552] [9] [INFO] [clearml.service_repo] Returned 200 for events.debug_images in 184ms
clearml-apiserver | [2022-05-11 17:23:06,560] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_by_id_ex in 4ms
clearml-apiserver | [2022-05-11 17:23:06,561] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_by_id_ex in 8ms
clearml-apiserver | [2022-05-11 17:23:06,695] [9] [INFO] [clearml.service_repo] Returned 200 for events.get_task_metrics in 4ms
clearml-apiserver | [2022-05-11 17:23:08,586] [9] [INFO] [clearml.service_repo] Returned 200 for events.debug_images in 133ms
clearml-apiserver | [2022-05-11 17:23:16,350] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all_ex in 8ms
clearml-apiserver | [2022-05-11 17:23:18,047] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 2ms
clearml-apiserver | [2022-05-11 17:23:18,082] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 24ms
clearml-apiserver | [2022-05-11 17:23:18,111] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 20ms
My environment:
docker-compose version 1.24.1, build 4667896b
docker-py version: 3.7.3
CPython version: 3.6.8
OpenSSL version: OpenSSL 1.1.0j 20 Nov 2018
How could I debug the issue? Thanks!
Hi @aurelien-m,
something breaks and the latest reported images don't show in the ClearML web interface
What do you mean exactly? Do you see images with broken links? Do you only see 100 images and never more?
Hey, sorry I wasn't very clear. I can see precisely my first 116 reported images, but the next reported images don't appear in the interface. It just shows nothing more, even though I can see my code reporting them (and optionally saving them, so I can check that they're valid images).
So yes, the first 116 images, and then nothing more.
@jkhenning, I looked more into it, and all of my reported images do exist in the ClearML fileserver (at /opt/clearml/fileserver/project/.../all-of-my-images-are-here!). So I assume they have been sent and received, but for some reason I don't see them all in the web interface.
Can you add a screenshot of how the debug images look in the WebApp?
Yes sure, here is what the interface looks like:
Notice that it stops at epoch_002_batch_001_item_009_versus, but in the ClearML folder I have:
All of the rest of my images! Am I using something incorrectly? Is it normal that each image has its own folder?
Can you show me how you are reporting these images?
@aurelien-m can you also include the version numbers for the ClearML server and python package? I assume you're using a self-hosted open source ClearML server?
Yes sure: clearml python package: Version: 1.0.3
I'm indeed self hosted, how do I check the version of the ClearML server?
As to how I report the images, I just report them in the validation_step of my pytorch_lightning.LightningModule:
def validation_step(
    self, batch: List[...], batch_idx: int
) -> Tuple[List[Dict], List[...]]:
    ...
    images = [...]
    predictions = self.model(images)
    predictions = [
        {k: v.detach().cpu() for (k, v) in prediction.items()}
        for prediction in predictions
    ]
    targets = [
        {k: v.detach().cpu() for (k, v) in batch_item.items()}
        for batch_item in batch
    ]
    if self.versus_image_batch_count == -1 or batch_idx < self.versus_image_batch_count:
        for img_index, (image, prediction, target) in enumerate(
            zip(images, predictions, targets)
        ):
            image_title = (
                f"epoch_{self.current_epoch:03d}_batch_{batch_idx:03d}_item_{img_index:03d}_versus"
            )
            ...
            prediction_annotated_image = ...
            target_annotated_image = ...
            versus_image = np.hstack((prediction_annotated_image, target_annotated_image))
            image = (versus_image * 255).astype(np.uint8)
            versus_image = Image.fromarray(np.rollaxis(image, 0, 3))
            Logger.current_logger().report_image(
                "Autotrain",
                image_title,
                iteration=self.current_epoch,
                image=versus_image,
            )
    return (predictions, targets)
And this is done for each batch of each of my validation steps. Not sure what's going wrong :/ This is also company code, so I can't include everything.
I'll look more into it tonight, I'll make a simple example that can be reproduced and see if it works.
Hi @aurelien-m ,
The ClearML Server version should appear in the bottom right part of the screen after you go into the "Settings" page (available from the Profile drop-down menu on the top-right corner of the UI)
Is this a server you deployed yourself?
Hello, yes it's a server I deployed myself on Linux following these steps. My ClearML server version is WebApp: 1.4.0-175 • Server: 1.4.0-175 • API: 2.18.
I haven't really looked more into it yet though; I can try to see what's happening in the report_image function when my images don't show.
Hi, @aurelien-m! Thank you for reporting this. I think the SDK does not wait properly for the upload events to finish. Take this code, for instance:
import time

from clearml import Task, Logger
from PIL import Image, ImageFont, ImageDraw

def gen_img(txt):
    # draw the given text on a small black image
    width = 128
    height = 128
    font = ImageFont.truetype("arial.ttf", size=20)
    img = Image.new("RGB", (width, height), color="black")
    img_draw = ImageDraw.Draw(img)
    img_draw.text((10, 10), str(txt), font=font, fill=(255, 255, 0))
    return img

task = Task.init(project_name="example", task_name="report_img", reuse_last_task_id=False)
for i in range(100):
    Logger.current_logger().report_image(
        "image", f"image float{i}", iteration=1, image=gen_img(i)
    )
# time.sleep(10)
Try running this one once without the sleep and once with the sleep. In my case, I see only 10 images in the UI when I don't sleep, and all of them when I do sleep.
I think the issue is here: https://github.com/allegroai/clearml/blob/master/clearml/backend_interface/metrics/reporter.py#L92-95
if self._is_subprocess_mode_and_not_parent_process():
    while self._queue and not self._queue.empty():
        sleep(0.1)
    return
After removing that return statement, the issue seems to be fixed.
Can you actually just comment out or remove that return yourself and check if the issue is solved for you as well? The actual fix for this might be different, but it would be nice to know if this is the root of your problem.
To do this:
- find where clearml is installed:
  - you can do it with pip show clearml, or in a python shell: import clearml; print(clearml.__file__)
- open clearml/backend_interface/metrics/reporter.py in a text editor
- delete line 95
- run the program you have issues with
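(For convenience, a one-liner that combines the steps above and prints the exact file to edit; just a sketch:)

python -c "import clearml, os; print(os.path.join(os.path.dirname(clearml.__file__), 'backend_interface', 'metrics', 'reporter.py'))"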
Hey, thanks for the help. However, it did not fix my issue. It looks like BackgroundReportService.wait_for_events is never called; I tried putting a breakpoint there but it's never triggered. I tried adding Logger.current_logger().flush() but it also didn't help. :/
I updated the clearml python package to 1.4.0.
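(As a side note, a blocking flush can also be requested on the task object; a minimal sketch, assuming a clearml version where Task.flush supports the wait_for_uploads argument:)

from clearml import Task

# block until queued report events and file uploads have been processed
Task.current_task().flush(wait_for_uploads=True)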
@aurelien-m I believe that most debuggers will not trigger a breakpoint in BackgroundReportService.wait_for_events because it is called on exit.
Anyhow, could you please upgrade to 1.4.1rc0, add a time.sleep(60) at the end of your program, and tell us if you see more images in the UI (just to rule out the possibility that the program exits before the images are properly reported)?
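For clarity, a minimal end-to-end sketch of that suggestion (the project/task names and the generated images are placeholders):

import time

import numpy as np
from clearml import Task, Logger

task = Task.init(project_name="example", task_name="sleep_before_exit")
for i in range(200):
    # report a synthetic image; in the real code this would be the validation sample
    img = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)
    Logger.current_logger().report_image("image", f"image {i}", iteration=1, image=img)

# give the background reporter time to finish uploading before the process exits
time.sleep(60)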
I'm unfortunately still having this issue. I thought I fixed it but no.
I have tried removing the line you have specified, but it doesn't seem to help.
I tried setting a limit on reported images at the end of each epoch, but the issue is still happening. The view also goes completely blank after the issue occurs: there used to be some images shown, but now there's nothing after I ended the task:
My versions now:
$ pip show clearml
Name: clearml
Version: 1.5.0
...
Server:
Hi @aurelien-m! We're sorry to hear you're still experiencing this issue. We will investigate and try to come up with a fix.
I am facing the same issue when trying to report_image() for more than 100 images. There is no exception on either the server or "uploader" side. For reference, I am running python 3.6 with the latest ClearML version. It works perfectly when uploading fewer than 100 images. I have tried slowing the rate of upload by putting an iterative 0.1s sleep after each upload. I have also tried your method of commenting out line 95 of reporter.py, but to no avail.
The code I used:
def __upload_clearml_image(self, image, image_name):
    ### Upload images to clearml_logger ###
    self.clearml_logger.report_image(self.data_filename, image_name, image=image,
                                     max_image_history=-1, iteration=1)
##### main code continuation .... #####
import base64
from io import BytesIO

from PIL import Image

### Init Clearml ###
with open("path/val_list.txt", "r") as val_list_file:
    lines = val_list_file.readlines()

im_list = []
for line in lines:
    image_name = line.split("/")[-1].split(".")[0]  # path/test.jpg -> test
    im = Image.open(line.strip())  # PIL image
    # im = cv2.imread(line.strip())[..., ::-1]  # OpenCV image
    result = model(im)  # includes NMS

    ### Method: BytesIO
    res_render = result.render()
    result_base64_str = None
    for idx, img in enumerate(res_render):
        buffered = BytesIO()
        img_base64 = Image.fromarray(img)
        img_base64.save(buffered, format="JPEG")
        result_base64_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
    res_im = Image.open(BytesIO(base64.b64decode(result_base64_str)))
    try:
        self.__upload_clearml_image(res_im, image_name)
    except Exception as e:
        print(e)
### Exit Clearml ###
Any update on this issue will be greatly appreciated!
Hi @Berttwm,
Can you try to modify the max_image_history argument in Logger.report_image() and see if it solves it for you? If it does, you can increase the maximum number of images for all your code by setting file_history_size: 200 in the clearml.conf file.
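For reference, that setting lives under the sdk.metrics section of clearml.conf on the machine running the experiment; a sketch (the nesting follows the default configuration file and may differ between versions):

sdk {
    metrics {
        # history size for debug files per metric/variant combination
        file_history_size: 200
    }
}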
I forgot to mention, that was one of the first things I tried: setting the value to 1000, to no avail.
Hi @Berttwm,
I tried this simple code:
from clearml import Task, Logger

t = Task.init(project_name='test', task_name='test_report_image')
l = t.get_logger()
for i in range(0, 200):
    l.report_image(title='test', series='123', iteration=i,
                   local_path='PATH_{}.png'.format(i), max_image_history=10)
For me, when I change max_image_history to 200, it can show 200 images. Also, I saw that you set it to -1. I have also changed file_history_size to 200 and saw that I can see up to 200 images. Can you verify my findings? If it works for you, let's try to figure out where the difference in the code is.
Hello @erezalg, thanks for reaching out.
The upload clearml code I am using now, with max_image_history=1000 set.
We have exactly 841 images to upload (with gaps between each indexed image), however my output caps at image 244, which is exactly the 100th indexed image in our list. This is after I have tried changing the max_image_history argument to many different numbers.
Clearml page output
My own log shows that images up to image 2045 (index 841) have been processed, yet no image after image 244 (index 100) appears. It is interesting to note that each image enters the report_image function and is processed in the same observable relative time, yet no new image is uploaded after image 244 (index 100). Setting max_image_history=30 still output the same 100 images.
I have replicated this process with a different clearml.Task, to the same effect of a 100-image cap. I have also tried it with fewer than 100 images (for example, 10 images), and it outputs all 10 perfectly fine. There is an arbitrary cap of 100 on my images and I do not know how to bypass it.
FURTHER EDIT:
I have also changed file_history_size: 1000 in the clearml.conf on my server's side and restarted the docker service (assuming this applies the clearml.conf changes).
Hi @Berttwm, I suspect that we are hitting an internal limit here on the maximum number of unique metrics and variants. Currently, due to Elasticsearch limitations, unique_metrics_count * unique_variants_count should not exceed 10000 per task. When you report images like this:
self.clearml_logger.report_image(self.data_filename, image_name, image=image, max_image_history=-1, iteration=1)
The first parameter (title) effectively becomes the metric name and the second one (series) the variant name. This leads to too many unique variants and metrics. Can you please try setting one of them to a constant? The other one should still be unique so that the images do not overlap.
self.clearml_logger.report_image('test', image_name, image=image, max_image_history=-1, iteration=1)
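For illustration, a minimal self-contained version of that pattern (the project/task names and the generated images are placeholders, and Task.flush(wait_for_uploads=True) is assumed to be available in the installed version):

import numpy as np
from clearml import Task

task = Task.init(project_name="example", task_name="constant_title_report")
logger = task.get_logger()

for i in range(200):
    # one constant metric ("test") with a unique variant per image keeps the
    # metric x variant count low while every image is still stored
    img = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)
    logger.report_image("test", "sample_{:04d}".format(i), iteration=1,
                        image=img, max_image_history=-1)

# make sure all uploads finish before the script exits
task.flush(wait_for_uploads=True)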