
Sample images stop reporting with no errors

Open montmejat opened this issue 2 years ago • 20 comments

I use report_image to show some of my validation image samples, but at some point it stops working, with no errors:

Logger.current_logger().report_image(
    "Autotrain", 
    image_title, 
    iteration=self.current_epoch, 
    image=versus_image
)

The problem: up to a certain point (which is the same each time, around 100 reported images), something breaks and the latest reported images don't show in the ClearML web interface, even though they are valid images that I can save and open. Here are my latest logs from where it breaks:

...
clearml-apiserver | [2022-05-11 17:22:17,718] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 2ms
clearml-apiserver | [2022-05-11 17:22:17,750] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 21ms
clearml-apiserver | [2022-05-11 17:22:17,780] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 21ms
clearml-apiserver | [2022-05-11 17:22:47,915] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 2ms
clearml-apiserver | [2022-05-11 17:22:47,953] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 25ms
clearml-apiserver | [2022-05-11 17:22:52,989] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 22ms
clearml-apiserver | [2022-05-11 17:23:01,970] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 3ms
clearml-apiserver | [2022-05-11 17:23:01,992] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 2ms
clearml-apiserver | [2022-05-11 17:23:02,228] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 17ms
clearml-apiserver | [2022-05-11 17:23:03,037] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 15ms
clearml-apiserver | [2022-05-11 17:23:06,373] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all_ex in 10ms
clearml-apiserver | [2022-05-11 17:23:06,552] [9] [INFO] [clearml.service_repo] Returned 200 for events.debug_images in 184ms
clearml-apiserver | [2022-05-11 17:23:06,560] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_by_id_ex in 4ms
clearml-apiserver | [2022-05-11 17:23:06,561] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_by_id_ex in 8ms
clearml-apiserver | [2022-05-11 17:23:06,695] [9] [INFO] [clearml.service_repo] Returned 200 for events.get_task_metrics in 4ms
clearml-apiserver | [2022-05-11 17:23:08,586] [9] [INFO] [clearml.service_repo] Returned 200 for events.debug_images in 133ms
clearml-apiserver | [2022-05-11 17:23:16,350] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all_ex in 8ms
clearml-apiserver | [2022-05-11 17:23:18,047] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 2ms
clearml-apiserver | [2022-05-11 17:23:18,082] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 24ms
clearml-apiserver | [2022-05-11 17:23:18,111] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 20ms

My environment:

docker-compose version 1.24.1, build 4667896b
docker-py version: 3.7.3
CPython version: 3.6.8
OpenSSL version: OpenSSL 1.1.0j  20 Nov 2018

How could I debug the issue? Thanks!

montmejat avatar May 11 '22 17:05 montmejat

Hi @aurelien-m,

something breaks and the latest reported images don't show in the ClearML web interface

What do you mean exactly? Do you see images with broken links? Do you only see 100 images and never more?

jkhenning avatar May 12 '22 07:05 jkhenning

Hey, sorry I wasn't very clear. I can see exactly my first 116 reported images, but the images reported after that don't appear in the interface. It just shows nothing more, even though I can see my code reporting them (and optionally saving them, so I can check that they're valid images).

So yes, the first 116 images, and then nothing more.

montmejat avatar May 12 '22 08:05 montmejat

@jkhenning, I looked more into it, and all of my reported images do exist on the ClearML fileserver (at /opt/clearml/fileserver/project/.../all-of-my-images-are-here!). So I assume they have been sent and received, but for some reason I don't see them all in the web interface.

montmejat avatar May 12 '22 08:05 montmejat

Can you add a screenshot of how the debug images look in the WebApp?

jkhenning avatar May 12 '22 09:05 jkhenning

Yes sure, here is what the interface looks like:

[Screenshot from 2022-05-12 11-38-21]

Notice that it stops at epoch_002_batch_001_item_009_versus, but in the ClearML folder I have:

[Screenshot from 2022-05-12 11-41-10]

All of the rest of my images! Am I using something incorrectly? Is it normal that each image has its own folder?

montmejat avatar May 12 '22 09:05 montmejat

Can you show me how you are reporting these images?

jkhenning avatar May 12 '22 11:05 jkhenning

@aurelien-m can you also include the version numbers for the ClearML server and Python package? I assume you're using a self-hosted open source ClearML server?

jkhenning avatar May 12 '22 11:05 jkhenning

Yes, sure: the clearml Python package is version 1.0.3. I'm indeed self-hosted; how do I check the version of the ClearML server?

As to how I report the images, I just report them in the validation_step of my pytorch_lightning.LightningModule:

    def validation_step(
        self, batch: List[...], batch_idx: int
    ) -> Tuple[List[Dict], List[...]]:

        ...

        images = [ ... ]
        predictions = self.model(images)
        predictions = [
            {k: v.detach().cpu() for (k, v) in prediction.items()} for prediction in predictions
        ]
        targets = [{k: v.detach().cpu() for (k, v) in batch_item.items()} for batch_item in batch]

        if self.versus_image_batch_count == -1 or batch_idx < self.versus_image_batch_count:
            for img_index, (image, prediction, target) in enumerate(
                zip(images, predictions, targets)
            ):
                image_title = (
                    f"epoch_{self.current_epoch:03d}_batch_{batch_idx:03d}_item_{img_index:03d}_versus"
                )
                
                ...

                prediction_annotated_image = ...
                target_annotated_image = ...
                versus_image = np.hstack((prediction_annotated_image, target_annotated_image))

                image = (versus_image * 255).astype(np.uint8)
                versus_image = Image.fromarray(np.rollaxis(image, 0, 3))

                Logger.current_logger().report_image(
                    "Autotrain", 
                    image_title, 
                    iteration=self.current_epoch, 
                    image=versus_image
                )

        return (predictions, targets)

And this is done for each batch of each of my validation steps. Not sure what's going wrong :/ This is also company code, so I can't share everything.

I'll look more into it tonight; I'll make a simple reproducible example and see if it works.

montmejat avatar May 12 '22 13:05 montmejat

Hi @aurelien-m,

The ClearML Server version should appear at the bottom right of the screen after you go to the "Settings" page (available from the profile drop-down menu in the top-right corner of the UI).

Is this a server you deployed yourself?

jkhenning avatar May 15 '22 19:05 jkhenning

Hello, yes it's a server I deployed myself on Linux following these steps. My ClearML server version is WebApp: 1.4.0-175 • Server: 1.4.0-175 • API: 2.18.

I haven't really looked into it much more yet, but I can try to see what's happening in the report_image function when my images don't show.

montmejat avatar May 16 '22 08:05 montmejat

Hi, @aurelien-m! Thank you for reporting this. I think that the SDK does not wait properly for the upload events to finish. Take this code, for instance:

import time

from clearml import Task, Logger
from PIL import Image, ImageFont, ImageDraw


def gen_img(txt):
    # Render the given text onto a small black image
    width = 128
    height = 128
    font = ImageFont.truetype("arial.ttf", size=20)
    img = Image.new("RGB", (width, height), color="black")
    img_draw = ImageDraw.Draw(img)
    img_draw.text((10, 10), str(txt), font=font, fill=(255, 255, 0))
    return img


task = Task.init(project_name="example", task_name="report_img", reuse_last_task_id=False)
for i in range(100):
    Logger.current_logger().report_image(
        "image", f"image float{i}", iteration=1, image=gen_img(i)
    )
# time.sleep(10)

Try running it once without the sleep and once with it. In my case, I see only 10 images in the UI when I don't sleep, and all of them when I do.

I think the issue is here: https://github.com/allegroai/clearml/blob/master/clearml/backend_interface/metrics/reporter.py#L92-95

if self._is_subprocess_mode_and_not_parent_process():
    while self._queue and not self._queue.empty():
        sleep(0.1)
    return  # <- this early return appears to skip waiting for the upload events

After removing that return statement, the issue seems to be fixed. Can you comment out or remove that return yourself and check whether it solves the issue for you as well? The actual fix might end up being different, but it would be good to know if this is the root of your problem. To do this:

  • find where clearml is installed:
    • use pip show clearml, or in a Python shell: import clearml; print(clearml.__file__) (see the snippet below)
  • open clearml/backend_interface/metrics/reporter.py in a text editor
  • delete line 95 (the return statement shown above)
  • run the program you have issues with
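
For reference, a quick way to locate the installed package (a minimal sketch; the exact path depends on your environment):

import os
import clearml

# Print the package directory; reporter.py lives under
# backend_interface/metrics/ relative to this path.
print(os.path.dirname(clearml.__file__))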

Hey, thanks for the help. However, it did not fix my issue. It looks like BackgroundReportService.wait_for_events is never called: I tried putting a breakpoint there, but it's never triggered.

I tried adding Logger.current_logger().flush() but it also didn't help. :/

I updated the clearml python package to 1.4.0.

montmejat avatar May 16 '22 15:05 montmejat

@aurelien-m I believe most debuggers will not trigger a breakpoint in BackgroundReportService.wait_for_events because it is called on exit. Anyhow, could you please upgrade to 1.4.1rc0, add a time.sleep(60) at the end of your program, and tell us if you see more images in the UI (just to rule out the possibility that the program exits before the images are fully reported)?
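
Something along these lines (a minimal sketch; the 60-second value is arbitrary and only meant to rule out an early exit):

import time
from clearml import Task, Logger

task = Task.init(project_name="example", task_name="report_img")
# ... training / validation loop that calls Logger.current_logger().report_image() ...

# Give the background reporting threads time to drain before the process exits.
time.sleep(60)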

I'm unfortunately still having this issue. I thought I had fixed it, but no.

I have tried removing the line you specified, but it doesn't seem to help.

I tried setting a limit on the number of reported images at the end of each epoch, but the issue still happens. The panel also goes completely blank after the issue occurs: there used to be some images shown, but now there's nothing after I end the task:

[screenshot]

My versions now:

$ pip show clearml
Name: clearml
Version: 1.5.0
...

Server: [screenshot]

montmejat avatar Jun 28 '22 08:06 montmejat

Hi @aurelien-m! We're sorry to hear you're still experiencing this issue. We will investigate and try to come up with a fix.

I am facing the same issue when trying to report_image() for more than 100 images. There is no exception on either the server or the "uploader" side. For reference, I am running Python 3.6 with the latest ClearML version. It works perfectly when uploading fewer than 100 images. I have tried slowing the upload rate by adding a 0.1s sleep after each upload. I have also tried your method of commenting out line 95 of reporter.py, but to no avail.

The code I used:

import base64
from io import BytesIO

from PIL import Image


def __upload_clearml_image(self, image, image_name):
    ### Upload images to clearml_logger ###
    self.clearml_logger.report_image(
        self.data_filename, image_name, image=image, max_image_history=-1, iteration=1
    )

##### main code continuation .... #####
### Init ClearML ###
with open("path/val_list.txt", "r") as val_list_file:
    lines = val_list_file.readlines()
    im_list = []
    for line in lines:
        image_name = line.split("/")[-1].split(".")[0]  # path/test.jpg -> `test`
        im = Image.open(line.strip())  # PIL image
        # im = cv2.imread(line.strip())[..., ::-1]  # OpenCV image
        result = model(im)  # includes NMS
        ### Method: BytesIO ###
        res_render = result.render()
        result_base64_str = None
        for idx, img in enumerate(res_render):
            buffered = BytesIO()
            img_base64 = Image.fromarray(img)
            img_base64.save(buffered, format="JPEG")
            result_base64_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
        res_im = Image.open(BytesIO(base64.b64decode(result_base64_str)))
        try:
            self.__upload_clearml_image(res_im, image_name)
        except Exception as e:
            print(e)
### Exit ClearML ###

Any update on this issue will be greatly appreciated!

Berttwm avatar Aug 31 '22 09:08 Berttwm

Hi @Berttwm,

Can you try modifying the max_image_history argument in Logger.report_image() and see if that solves it for you? If it does, you can increase the maximum number of images for all code by changing file_history_size: 200 in the clearml.conf file.
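
For example, the setting would go roughly here (a sketch, assuming the standard sdk.metrics section of the client-side clearml.conf):

sdk {
    metrics {
        # History size for debug files per metric/variant combination;
        # each combination keeps this many images by default.
        file_history_size: 200
    }
}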

erezalg avatar Sep 13 '22 17:09 erezalg

Hi @Berttwm,

Can you try modifying the max_image_history argument in Logger.report_image() and see if that solves it for you? If it does, you can increase the maximum number of images for all code by changing file_history_size: 200 in the clearml.conf file.

I forgot to mention that this was one of the first things I tried. I set the value to 1000, to no avail.

Berttwm avatar Sep 15 '22 07:09 Berttwm

Hi @Berttwm,

I tried this simple code:

from clearml import Task

t = Task.init(project_name='test', task_name='test_report_image')
l = t.get_logger()
for i in range(200):
    l.report_image(title='test', series='123', iteration=i,
                   local_path='PATH{}.png'.format(i), max_image_history=10)

For me, when I change max_image_history to 200, it shows 200 images. Also, I saw that you set it to -1. I have also changed file_history_size to 200 and confirmed that I can see up to 200 images. Can you verify my findings? If it works for you, let's try to figure out where the difference in the code is.

erezalg avatar Sep 19 '22 07:09 erezalg

Hello @erezalg, thanks for reaching out.

This is the ClearML upload code I am using now, with max_image_history=1000: [screenshot]

We have exactly 841 images to upload (with gaps between the indexed images), but my output caps at image 244, which is exactly the 100th indexed image in our list. This is after trying many different values for the max_image_history argument.

ClearML page output: [screenshot]

My own log shows that images up through image 2045 (index 841) have been processed, yet no image after image 244 (index 100) shows up. It is interesting to note that every image enters the report_image function and is processed in about the same relative time, yet no new image is uploaded after image 244 (index 100). Setting max_image_history=30 still output the same 100 images. [screenshot]

I have replicated this process with a different clearml.Task, with the same 100-image cap. I have also tried it with fewer than 100 images (for example, 10 images), and it outputs all 10 perfectly fine. There is an arbitrary 100-image cap and I do not know how to bypass it.

FURTHER EDIT: I have also changed file_history_size: 1000 in the clearml.conf on the server's side and restarted the Docker service (assuming this applies the clearml.conf changes). [screenshot]

Berttwm avatar Sep 20 '22 09:09 Berttwm

Hi @Berttwm, I suspect that we are hitting the internal limit on the maximum number of unique metrics and variants. Currently, due to Elasticsearch limitations, unique_metrics_count * unique_variants_count should not exceed 10000 per task.

When you report images like this:

self.clearml_logger.report_image(self.data_filename, image_name, image=image, max_image_history=-1, iteration=1)

the first parameter (title) effectively becomes a metric name, and the second one (series) becomes the variant name. This leads to too many unique metrics and variants. Can you please try setting one of them to a constant? The other one should still be unique so that the images do not overlap:

self.clearml_logger.report_image('test', image_name, image=image, max_image_history=-1, iteration=1)
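
For illustration, a minimal sketch of that pattern (the project/task names and placeholder images here are hypothetical):

from clearml import Task, Logger
from PIL import Image

task = Task.init(project_name="example", task_name="constant_title_report")

# Hypothetical samples: a few generated placeholder images
named_images = [(f"sample_{i:04d}", Image.new("RGB", (64, 64), color="black")) for i in range(5)]

logger = Logger.current_logger()
for name, img in named_images:
    logger.report_image(
        "validation",   # constant title -> a single metric
        name,           # unique series per image -> one variant each
        iteration=1,
        image=img,
        max_image_history=-1,
    )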

evg-allegro avatar Sep 29 '22 08:09 evg-allegro