aim
Performance issues in Aim UI when exploring images on Brave and Chrome
🐛 Bug
I am tracking images with Aim on a remote server. To explore runs, I created an SSH tunnel forwarding Aim's remote port to a local port. I can connect to Aim UI just fine. However, when I try to explore ~400 tracked images, the UI sometimes gets stuck and freezes. It happens a lot when using the "Group by" tab. The browser shows a warning pop-up saying that the page crashed. I tried with Chrome and Brave (ad-blocker and other add-ons disabled), and the issue persists in both browsers. With Firefox, I have no problem at all.
To reproduce
- Track images through Aim Python API on a remote machine.
- Start up Aim UI.
- Create an SSH tunnel forwarding the remote port where Aim UI is listening to a local port.
- Browse to the local port using Chrome or Brave and explore images.
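The tunnel step above can be sketched as follows. This is only an illustration, not the command actually used in this report: the helper `build_tunnel_command` and the host `user@remote-host` are hypothetical, and port 43800 is an assumption about where Aim UI is listening (the report does not say which ports were used).

```python
import shlex


def build_tunnel_command(remote_host, remote_port=43800, local_port=43800):
    """Return the argv for an SSH local port forward to a remote Aim UI.

    Hypothetical helper: the host and ports are placeholders, not values
    taken from the report.
    """
    return [
        "ssh",
        "-N",  # do not run a remote command, only forward the port
        "-L", f"{local_port}:localhost:{remote_port}",
        remote_host,
    ]


if __name__ == "__main__":
    cmd = build_tunnel_command("user@remote-host")
    # Print a copy-pasteable command line for the local machine.
    print(shlex.join(cmd))
```

With the tunnel open, the last step amounts to browsing to `http://localhost:<local_port>` in Chrome or Brave.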
Environment
- Aim versions: tested with 3.11.2 and 3.13.1
- Python versions: tested with 3.10.7 and 3.8.13
- pip version: 22.2.2
- OS: Linux (remote), Windows (local)
@Aonnghus thanks for reporting! @roubkar could you please look into this?
Looking into it!
Hey @Aonnghus, I tested the Images explorer but unfortunately couldn't reproduce your issue. Could you share more info about the images?
- type
- size
Ideally, if possible, could you share the tracking scripts and the logs of the specific run that caused the issue? Just go to the .aim folder and zip the corresponding run folder.
I would only need it for debugging purposes. Feel free to ping me on the community Slack if the script and logs cannot be shared here on GitHub :raised_hands:
Hello @VkoHov. The images I'm tracking are directly created with aim.Image from torch tensors, with options format="png" and optimize=True. Their size is 256x256.
I cannot send you the real data for confidentiality reasons, but I created a dummy run with the same image parameters, and it also triggers the bug.
I'll send it to you through Slack
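For anyone else trying to reproduce this, a dummy run along those lines might look like the sketch below. It is not the script that was shared on Slack. It assumes Aim and Pillow are installed, uses random-noise PIL images instead of torch tensors to keep the dependencies small, and the helper names, the repo path, the sequence name `images`, and the count of 400 (taken from the ~400 images mentioned in the report) are all illustrative choices.

```python
import random


def random_rgb_bytes(size, rng):
    """Return size*size*3 random bytes, i.e. raw RGB data for a size x size image."""
    n = size * size * 3
    return rng.getrandbits(n * 8).to_bytes(n, "little")


def track_dummy_images(repo_path, num_images=400, size=256, seed=0):
    """Track random 256x256 PNG images in a new Aim run (hypothetical helper)."""
    # Imported here so the module still loads where Aim/Pillow are not installed.
    from aim import Image, Run
    from PIL import Image as PILImage

    rng = random.Random(seed)
    run = Run(repo=repo_path)
    for step in range(num_images):
        pil_img = PILImage.frombytes("RGB", (size, size), random_rgb_bytes(size, rng))
        # Same image options as described in the comment above.
        run.track(Image(pil_img, format="png", optimize=True), name="images", step=step)
    run.close()


if __name__ == "__main__":
    track_dummy_images(".")  # writes to the .aim repo in the current directory
```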
Hey @Aonnghus. Thanks for sharing the logs. Let me check it out and get back to you soon.
I experience the same issue on Chrome. The view crashes with 4 runs, probably because too many images are logged. I couldn't even reset the query, as it crashes the second you click on the image explorer.
I was able to select 2 runs in the run explorer and use the compare option to open the image viewer with just those 2 runs, and it worked fine. For my experiments, I can log at much lower quality, so I will do that.
Something that catches this and shows that the group-by is not possible due to too much data would be helpful.
@bsridatta could you please share how many images you track per run? Do you track images for each batch? Are those images large in size? Any detail would be very helpful to debug and resolve the issue. We have tested it for different scenarios and unfortunately haven't been able to reproduce the issue.
Sure @gorarakelyan, it is hard to tell when it breaks. It went from working fine to freezing (I guess at "searching over all runs") to working fine again.
I log 10 images every validation epoch (top 10 by loss) for 30 epochs: 300 images, around 200 KB, per run.
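As a rough back-of-the-envelope check of how much data the explorer has to handle, the numbers above can be plugged into a small calculation. This sketch assumes the ~200 KB figure is per image (the comment could also be read as per run) and uses 1 MB = 1000 KB; the function name is made up for illustration.

```python
def estimate_run_payload(images_per_epoch, epochs, kb_per_image):
    """Return (image_count, total_megabytes) for one run, using 1 MB = 1000 KB."""
    count = images_per_epoch * epochs
    return count, count * kb_per_image / 1000


# Assumption: ~200 KB is the size of a single image, not of the whole run.
count, megabytes = estimate_run_payload(images_per_epoch=10, epochs=30, kb_per_image=200)
print(f"{count} images, about {megabytes:.0f} MB per run")
# The crash was reported with 4 runs selected.
print(f"about {4 * megabytes:.0f} MB for 4 runs")
```

Under that reading, a query over all runs would pull a few hundred megabytes of image data into a single browser tab.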
As I mentioned, the image explorer worked fine when I entered it through the compare feature. I deleted a few runs but had the same problem: it freezes and I have to reopen the tab. The query for the image explorer was set to something like context, subset: val; I couldn't reset the query as the UI freezes.
I initially logged with options from an example: jpeg, optimize=True, quality=50. I then changed to png with 30. I have runs with both formats now.
And now the weird part is that everything works smoothly, as it should, with runs using both formats. I can't say what changed; perhaps some query got reset somehow. I know this is not very helpful, but I will share if I observe it again.
Edit: There isn't any output in the terminal that runs aim up.
Hey @bsridatta. Thanks for sharing the details. Will debug further and try to reproduce the issue. Will get back to you soon.
That's great @VkoHov, thank you! My main intention with the comment was to highlight the workaround of using the compare feature instead of opening the image explorer directly. I found it by chance, so I'm just sharing :)
I have the same (or a similar) problem. Safari has the same problem, so it doesn't seem like a browser-related issue.
This happens when many images are stored (for example, when training for 30000 steps and saving 10-20 image results every 1000 steps). When I open the images page of the run, the images at the top load and look fine: the images for steps 30000, 29000, and 28000 are visible on the page at once, and those for steps 27000 and 26000 are not yet visible but are loaded in the same batch, so they appear right away when I scroll down. If I scroll further, though, the images from steps 25000 and 24000 onward take a very long time to load.
In this state, the server does not respond to any further requests, and I cannot connect even from another browser. The requests seem to pile up in a queue: reading the images is very slow, but once it finishes, the requests that were delayed in the meantime are processed all at once.
The following is the log that appears in the situation described above when the server is started in the debug logging state.
Cannot index Run 8c924d585f164800b74169e7. Index is locked.
Cannot index Run 8c924d585f164800b74169e7. Index is locked.
Cannot index Run 32dff8c6567b4eb782501460. Index is locked.
Cannot index Run 32dff8c6567b4eb782501460. Index is locked.
Cannot index Run 32dff8c6567b4eb782501460. Index is locked.
Cannot index Run 2a38ac7512a54484b028fab0. Index is locked.
Cannot index Run 32dff8c6567b4eb782501460. Index is locked.
INFO: [IP]:58789 - "POST /runs/images/get-batch HTTP/1.1" 200 OK
Cannot index Run 32dff8c6567b4eb782501460. Index is locked.
Cannot index Run 32dff8c6567b4eb782501460. Index is locked.
Cannot index Run 8c1e7409ee414dda8f195e1e. Index is locked.
Cannot index Run efe94f87aa774888b4854a33. Index is locked.
Cannot index Run efe94f87aa774888b4854a33. Index is locked.
Cannot index Run 8c1e7409ee414dda8f195e1e. Index is locked.
Cannot index Run 8c1e7409ee414dda8f195e1e. Index is locked.
Cannot index Run 8c1e7409ee414dda8f195e1e. Index is locked.
INFO: [IP]:58567 - "POST /runs/images/get-batch HTTP/1.1" 200 OK
Cannot index Run 5150187ef9534463aeb0b17b. Index is locked.
Cannot index Run 8c1e7409ee414dda8f195e1e. Index is locked.
Cannot index Run 8c1e7409ee414dda8f195e1e. Index is locked.
Cannot index Run 8c1e7409ee414dda8f195e1e. Index is locked.
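To see which runs the "Index is locked" messages concentrate on, a log like the one above can be tallied per run hash. The sketch below is a generic text-processing helper, not part of Aim. The regular expression is inferred from the message format in this log, and `sample` is just the first few entries of it.

```python
import re
from collections import Counter

# Pattern inferred from the log lines above; not an official Aim log format.
LOCKED_PATTERN = re.compile(r"Cannot index Run ([0-9a-f]+)\. Index is locked\.")


def count_locked_runs(log_text):
    """Count "Index is locked" messages per run hash in a block of log text."""
    return Counter(LOCKED_PATTERN.findall(log_text))


sample = """
Cannot index Run 8c924d585f164800b74169e7. Index is locked.
Cannot index Run 8c924d585f164800b74169e7. Index is locked.
Cannot index Run 32dff8c6567b4eb782501460. Index is locked.
Cannot index Run 32dff8c6567b4eb782501460. Index is locked.
Cannot index Run 32dff8c6567b4eb782501460. Index is locked.
Cannot index Run 2a38ac7512a54484b028fab0. Index is locked.
Cannot index Run 32dff8c6567b4eb782501460. Index is locked.
INFO: [IP]:58789 - "POST /runs/images/get-batch HTTP/1.1" 200 OK
"""

# Print run hashes from most to least frequently locked.
for run_hash, n in count_locked_runs(sample).most_common():
    print(run_hash, n)
```

Because the pattern does not depend on line breaks, the same helper also works on log text that was pasted as a single line.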
Environment: Amazon Linux 2, AWS ARM CPU (c7g), Python 3.10, remote server, aim up and server with workers set to 30-100, Aim 3.15.0.
If I find out anything more, I'll add more.