
idr-testing May 2024

Open will-moore opened this issue 1 year ago • 51 comments

Steps needed on idr-next for NGFF upgrade. NB: current checklist is for actions on idr-testing (newly redeployed on 21st May 2024)

Detailed workflow is at https://github.com/IDR/mkngff_upgrade_scripts but this is an outline, also includes study-specific jobs:

Manual Software updates (should be part of the original deployment for idr-next):

  • [x] Update ZarrReader if needed to include recent work
  • [x] Install mkngff in venv, including recent PR branches (if not yet merged)
  • [x] Install latest iviewer 0.14.0

NGFF and other updates:

  • [x] Add bfoptions file to skip_wells for all idr0009 Plates
  • [x] idr0070: remove annotations, delete svs images, re-import and re-annotate as described
  • [x] Create an OME-NGFF Tag and add the Screens and Projects
  • [x] As omero-server, clone https://github.com/IDR/mkngff_upgrade_scripts
  • [x] idr0015 - delete one duplicate plate named TARA_HCS1_H5_G100004727_G100004940--2013_12_08_21_26_28_chamber--U00--V01
  • [x] Get $SECRET, update all sql commands with it and run them (see https://github.com/IDR/mkngff_upgrade_scripts)
  • [x] Run mkngff symlink on all studies, including bfoptions creation
  • [x] Update bfoptions files for idr0004 plates P115 and P124
  • ~Regenerate thumbnails for idr0015 plate https://idr.openmicroscopy.org/webclient/?show=plate-4653~
  • [ ] Generate thumbnails for idr0012 Plate 130-16, by going to an Image (e.g. http://localhost:1080/webclient/?show=image-3063425) and clicking "Save To All"
  • [ ] Bio-Formats cache regeneration
  • [ ] Validation? E.g. https://github.com/IDR/idr-utils/pull/55 on some data?
  • [x] Move idr00XX.csv and sql files from https://github.com/IDR/idr-utils/pull/56 into separate repos, then review/merge the PR.

will-moore avatar May 21 '24 14:05 will-moore

Using today's build (23rd May) of OMEZarrReader: As omero-server... Ran on all 5 servers:

bash-5.1$ wget https://merge-ci.openmicroscopy.org/jenkins/job/BIOFORMATS-build/76/label=testintegration/artifact/bio-formats-build/ZarrReader/target/OMEZarrReader-0.4.2-SNAPSHOT-jar-with-dependencies.jar

bash-5.1$ mv OMEZarrReader-0.4.2-SNAPSHOT-jar-with-dependencies.jar OMEZarrReader-b76.jar
bash-5.1$ rm OMERO.server/lib/client/OMEZarrReader.jar && rm OMERO.server/lib/server/OMEZarrReader.jar 
bash-5.1$ cp OMEZarrReader-b76.jar OMERO.server/lib/client/ && cp OMEZarrReader-b76.jar OMERO.server/lib/server/
exit
[wmoore@test122-omeroreadwrite ~]$ sudo service omero-server restart
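
For reference, the same jar swap could be scripted from the proxy node using the ssh loop pattern used later in this thread. A rough sketch only, assuming the renamed jar has already been downloaded to /opt/omero/server on each host and that sudo is available:

for server in omeroreadwrite omeroreadonly-1 omeroreadonly-2 omeroreadonly-3 omeroreadonly-4; do
  ssh $server "sudo -u omero-server cp /opt/omero/server/OMEZarrReader-b76.jar /opt/omero/server/OMERO.server/lib/client/ \
    && sudo -u omero-server cp /opt/omero/server/OMEZarrReader-b76.jar /opt/omero/server/OMERO.server/lib/server/ \
    && sudo service omero-server restart"
done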

Install mkngff on omeroreadwrite...

sudo -u root -s
source /opt/omero/server/venv3/bin/activate
pip install 'omero-mkngff @ git+https://github.com/IDR/omero-mkngff@main'
...
Resolved https://github.com/IDR/omero-mkngff to commit d3f95c90b379a8f61a866539cdb0e1e490fad84b
Successfully installed omero-mkngff-0.1.0.dev0
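
A quick sanity check that the CLI plugin registered (a sketch; it should list the setup and symlink subcommands used below):

omero mkngff -h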

will-moore avatar May 23 '24 13:05 will-moore

$ sudo -u omero-server -s
bash-5.1$ cd
bash-5.1$ git clone https://github.com/IDR/mkngff_upgrade_scripts.git

Setup screen and delete duplicate plate for idr0015

screen -S mkngff
source /opt/omero/server/venv3/bin/activate
export OMERODIR=/opt/omero/server/OMERO.server
omero config get omero.db.host
export DBHOST=192.168.10.231
omero config get omero.db.pass
export PGPASSWORD=[********]

omero login...
omero delete Plate:4801 --report > /tmp/delete_idr0015.log

mkngff...

omero mkngff setup > setup.sql
psql -U omero -d idr -h $DBHOST -f setup.sql
CREATE FUNCTION

cd mkngff_upgrade_scripts/ngff_filesets/idr0004
for i in $(ls); do sed -i 's/SECRETUUID/42650434-6eaa-45e5-9542-58247a45d8bc/g' $i; done
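
The same substitution parameterised on the session UUID (a sketch; $SECRET is obtained as described in the mkngff_upgrade_scripts README linked above):

export SECRET=42650434-6eaa-45e5-9542-58247a45d8bc
for i in $(ls); do sed -i "s/SECRETUUID/$SECRET/g" $i; done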

omero login

cd ngff_filesets
export IDRID=idr0004
for r in $(cat $IDRID.csv); do
  biapath=$(echo $r | cut -d',' -f2)
  uuid=$(echo $biapath | cut -d'/' -f2)
  fsid=$(echo $r | cut -d',' -f3 | tr -d '[:space:]')
  psql -U omero -d idr -h $DBHOST -f "$IDRID/$fsid.sql"
  omero mkngff symlink /data/OMERO/ManagedRepository $fsid "/bia-integrator-data/$biapath/$uuid.zarr" --bfoptions --clientpath="https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/$biapath/$uuid.zarr"
done

Fix idr0004 bfoptions (remove quick_read)

vi /data/OMERO/ManagedRepository/demo_2/2015-10/01/07-46-42.965_mkngff/35cfc0db-7795-497c-aed5-1ae591b2d9f1.zarr.bfoptions
vi /data/OMERO/ManagedRepository/demo_2/2015-10/01/07-57-40.271_mkngff/ee8872c8-e4b1-41fa-aa4f-a9e3e200c540.zarr.bfoptions
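
The same edit non-interactively (a sketch, assuming the unwanted option appears as a line containing quick_read):

for f in /data/OMERO/ManagedRepository/demo_2/2015-10/01/07-46-42.965_mkngff/35cfc0db-7795-497c-aed5-1ae591b2d9f1.zarr.bfoptions \
         /data/OMERO/ManagedRepository/demo_2/2015-10/01/07-57-40.271_mkngff/ee8872c8-e4b1-41fa-aa4f-a9e3e200c540.zarr.bfoptions; do
  sed -i '/quick_read/d' $f
done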

Repeat for all other studies...

will-moore avatar May 23 '24 14:05 will-moore

idr0010 (15:43...) idr0011 (17:32...) idr0012 (17:43...) idr0013 (17:57...) idr0016 (22:10...) idr0015 (06:41...) idr0025 8:07 idr0026 8:08 idr0033 9:25 idr0035 9:45 idr0036 9:49 idr0051 10:01 idr0054 10:02 idr0064 10:05 idr0090 10:07 idr0091 10:12

will-moore avatar May 23 '24 16:05 will-moore

BF cache memo generation...

$ ssh -A idr-testing.openmicroscopy.org -L 1080:omeroreadwrite:80
[wmoore@test122-proxy ~]$ grep -oE 'omero[^ ]+$' /etc/hosts > nodes
[wmoore@test122-proxy ~]$ cat nodes
omeroreadonly-1
omeroreadonly-2
omeroreadonly-3
omeroreadonly-4
omeroreadwrite

Delete (move) cache

ssh omeroreadwrite
sudo -u omero-server -s
cd /data/OMERO/BioFormatsCache/
mv data data_to_delete

Generate target IDs...

ssh omeroreadwrite
/opt/omero/server/OMERO.server/bin/omero login

/opt/omero/server/OMERO.server/bin/omero hql --limit -1 --ids-only --style csv 'SELECT MIN(field.image.id) FROM Screen as s JOIN s.plateLinks as plateLinks JOIN s.annotationLinks as annLinks join annLinks.child as ann join plateLinks.child as p join p.wells as w join w.wellSamples as field where  ann.id=38304992 GROUP BY field.well.plate' > ngff_plates.txt

/opt/omero/server/OMERO.server/bin/omero hql --limit -1 --ids-only --style csv 'SELECT d.id FROM Project as p JOIN p.datasetLinks as datasetLinks JOIN p.annotationLinks as annLinks join annLinks.child as ann join datasetLinks.child as d where  ann.id=38304992' > ngff_datasets.txt

vi ngff_plates.txt     # remove first row
vi ngff_datasets.txt     # remove first row
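
Equivalently (a sketch), the header row could be dropped non-interactively:

sed -i '1d' ngff_plates.txt
sed -i '1d' ngff_datasets.txt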

Back on proxy server... [wmoore@test122-proxy ~]

$ rsync -rvP omeroreadwrite:/home/wmoore/ngff_plates.txt ./
$ rsync -rvP omeroreadwrite:/home/wmoore/ngff_datasets.txt ./

$ cut -d ',' -f2 ngff_plates.txt | sed -e 's/^/Image:/' > ngff_ids.txt
$ cut -d ',' -f2 ngff_datasets.txt | sed -e 's/^/Dataset:/' >> ngff_ids.txt 
$ wc ngff_ids.txt 
 1643  1643 22938 ngff_ids.txt

$ screen -dmS cache parallel --eta --sshloginfile nodes -a ngff_ids.txt --results /tmp/ngff_cache_20240524/ -j10 '/opt/omero/server/OMERO.server/bin/omero render -s localhost -u public -w public test --force'

will-moore avatar May 24 '24 10:05 will-moore

Unfortunately it seems that the cache generation is silently failing. Nothing is produced under /tmp/ngff_cache_20240524/, and immediately running screen -r shows no screens.

Checked with "which parallel": it isn't installed. Installed it:

sudo -u root -s
dnf install parallel

Ran again..

screen -dmS cache parallel --eta --sshloginfile nodes -a ngff_ids.txt --results /tmp/ngff_cache_20240524/ -j10 '/opt/omero/server/OMERO.server/bin/omero render -s localhost -u public -w public test --force'

EDIT (27th May): After running over the weekend, we didn't get many "ok" results:

[wmoore@test122-proxy 1]$ grep ok /tmp/ngff_cache_20240524/1/**/* | wc
     59     236    5927
[wmoore@test122-proxy ~]$ ls /tmp/ngff_cache_20240524/1/ | wc
   1643    1643   22938

Run again...

screen -dmS cache parallel --eta --sshloginfile nodes -a ngff_ids.txt --results /tmp/ngff_cache_20240527/ -j10 '/opt/omero/server/OMERO.server/bin/omero render -s localhost -u public -w public test --force'

28th May: after a day, screen -r reports no screen. Exactly three times as many "ok" results this time (59 -> 177).

[wmoore@test122-proxy ~]$ grep ok /tmp/ngff_cache_20240527/1/**/* | wc
    177     795   19336

Check a random output...

[wmoore@test122-proxy ~]$ cat /tmp/ngff_cache_20240527/1/Image:9822101/stderr
wmoore@omeroreadonly-3: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

Not sure what's going on here. I checked that I can ssh omeroreadonly-3 etc. to reach all the servers. Running again...

On completion (after OME 2024 meeting): We are getting a few more "ok" results, but still way below success:

[wmoore@test122-proxy ~]$ grep ok /tmp/ngff_cache_20240528/1/**/* | wc
    230     946   23708

$ grep denied /tmp/ngff_cache_20240528/1/**/* 
/tmp/ngff_cache_20240528/1/Dataset:11901/stderr:wmoore@omeroreadwrite: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
/tmp/ngff_cache_20240528/1/Dataset:11902/stderr:wmoore@omeroreadonly-2: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
/tmp/ngff_cache_20240528/1/Dataset:11903/stderr:wmoore@omeroreadwrite: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
/tmp/ngff_cache_20240528/1/Dataset:11904/stderr:wmoore@omeroreadonly-2: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
/tmp/ngff_cache_20240528/1/Dataset:11905/stderr:wmoore@omeroreadwrite: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
/tmp/ngff_cache_20240528/1/Dataset:11906/stderr:wmoore@omeroreadonly-2: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
...

$ grep denied /tmp/ngff_cache_20240528/1/**/* | wc
   1436    5744  190228

Only see "Permission denied" with omeroreadwrite and omeroreadonly-2, not the others:

[wmoore@test122-proxy ~]$ grep denied /tmp/ngff_cache_20240528/1/**/* | grep "only-2" | wc
    742    2968   98652
[wmoore@test122-proxy ~]$ grep denied /tmp/ngff_cache_20240528/1/**/* | grep "readwrite" | wc
    694    2776   91576

On the 27th (previous run) we see "Permission denied" for all servers:

[wmoore@test122-proxy ~]$ grep denied /tmp/ngff_cache_20240527/1/**/* | grep "readwrite" | wc
    294    1176   38795
[wmoore@test122-proxy ~]$ grep denied /tmp/ngff_cache_20240527/1/**/* | grep "readonly-1" | wc
    163     652   21675
[wmoore@test122-proxy ~]$ grep denied /tmp/ngff_cache_20240527/1/**/* | grep "readonly-2" | wc
    409    1636   54377
[wmoore@test122-proxy ~]$ grep denied /tmp/ngff_cache_20240527/1/**/* | grep "readonly-3" | wc
    295    1180   39219
[wmoore@test122-proxy ~]$ grep denied /tmp/ngff_cache_20240527/1/**/* | grep "readonly-4" | wc
    324    1296   43080

will-moore avatar May 24 '24 13:05 will-moore

@sbesson Any ideas what's causing those Permission denied errors? They seem to only come from 2 servers last time I ran this (omeroreadwrite and omeroreadonly-2):

$ grep denied /tmp/ngff_cache_20240528/1/**/* 
/tmp/ngff_cache_20240528/1/Dataset:11901/stderr:wmoore@omeroreadwrite: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
/tmp/ngff_cache_20240528/1/Dataset:11902/stderr:wmoore@omeroreadonly-2: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
/tmp/ngff_cache_20240528/1/Dataset:11903/stderr:wmoore@omeroreadwrite: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
/tmp/ngff_cache_20240528/1/Dataset:11904/stderr:wmoore@omeroreadonly-2: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
/tmp/ngff_cache_20240528/1/Dataset:11905/stderr:wmoore@omeroreadwrite: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
/tmp/ngff_cache_20240528/1/Dataset:11906/stderr:wmoore@omeroreadonly-2: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

Running again now...

$ ssh -A idr-testing.openmicroscopy.org
Last login: Mon Jun  3 08:47:51 2024 from 134.36.66.49
[wmoore@test122-proxy ~]$ screen -dmS cache parallel --eta --sshloginfile nodes -a ngff_ids.txt --results /tmp/ngff_cache_20240603/ -j10 '/opt/omero/server/OMERO.server/bin/omero render -s localhost -u public -w public test --force'

EDIT: Didn't see any "Permission denied" errors this time, but plenty of others!

[wmoore@test122-proxy ~]$ grep rror /tmp/ngff_cache_20240603/1/**/* | wc
   1704   10341  231921
[wmoore@test122-proxy ~]$ grep ResourceError /tmp/ngff_cache_20240603/1/**/* | wc
   1669   10014  225473
[wmoore@test122-proxy ~]$ grep FileNotFoundError /tmp/ngff_cache_20240603/1/**/* | wc
     29     261    5626

But checking these in the webclient, they either viewed fine or showed a spinner (no memo file)?

Except for idr0064... Still haven't updated https://github.com/IDR/mkngff_upgrade_scripts/blob/main/ngff_filesets/idr0064.csv with original scripts for updating vanilla Filesets to data at BioImage Archive. See https://github.com/IDR/idr-metadata/issues/682#issuecomment-1895755941

will-moore avatar Jun 03 '24 10:06 will-moore

Another attempt (5th)...

$ ssh -A idr-testing.openmicroscopy.org -L 1080:omeroreadwrite:80

[wmoore@test122-proxy ~]$ screen -dmS cache parallel --eta --sshloginfile nodes -a ngff_ids.txt --results /tmp/ngff_cache_20240604/ -j10 '/opt/omero/server/OMERO.server/bin/omero render -s localhost -u public -w public test --force'

EDIT: checking back a few days later, this looks better. NB: grep for "ok:" to avoid matching "invoke(":

[wmoore@test122-proxy ~]$ grep "ok:" /tmp/ngff_cache_20240604/1/**/* | wc
   1829    7316  188495

But actually, most of these are Dataset Images. Excluding those, we find that only 239 out of 1599 Plate Images are ok:

[wmoore@test122-proxy ~]$ grep "ok:" /tmp/ngff_cache_20240604/1/**/* | grep -v Dataset | wc
    239     956   24012
[wmoore@test122-proxy ~]$ grep Image ngff_ids.txt | wc
   1599    1599   22362

Also lots of ResourceErrors! 1346 (out of 1599 Plates), and 461 for Dataset Images

[wmoore@test122-proxy ~]$ grep "ResourceError" /tmp/ngff_cache_20240604/1/**/* | wc
   1807   10842  245921

[wmoore@test122-proxy ~]$ grep "ResourceError" /tmp/ngff_cache_20240604/1/**/* | grep Dataset | wc
    461    2766   63114
[wmoore@test122-proxy ~]$ grep "ResourceError" /tmp/ngff_cache_20240604/1/**/* | grep -v Dataset | wc
   1346    8076  182807

Many images with ResourceError are viewable in webclient. Picking an Image that isn't (idr0004, plate 171, http://localhost:1080/webclient/?show=image-698831), look for fileset name in logs on all servers...

for server in omeroreadwrite omeroreadonly-1 omeroreadonly-2 omeroreadonly-3 omeroreadonly-4; do echo $server && ssh $server "grep e3283a6a-d25b-41e1-8ab7-1837b89e3a6e /opt/omero/server/OMERO.server/var/log/Blitz-0.log"; done

Finds entries in logs on omeroreadwrite, including

2024-06-11 07:43:30,171 DEBUG [                   loci.formats.Memoizer] (.Server-10) saved memo file: /data/OMERO/BioFormatsCache/data/OMERO/ManagedRepository/demo_2/2015-10/01/08-51-39.709_mkngff/e3283a6a-d25b-41e1-8ab7-1837b89e3a6e.zarr/..zattrs.bfmemo (128925 bytes)

But no errors with this search.

Also found 14 session errors:

[wmoore@test122-proxy ~]$ grep "rror" /tmp/ngff_cache_20240604/1/**/* | grep -v ResourceError | grep sessions
/tmp/ngff_cache_20240604/1/Image:12550005/stderr:FileNotFoundError: [Errno 2] No such file or directory: path('/home/wmoore/omero/sessions/localhost/public/2f9ce65b-a180-4828-a64b-f2c08a8f6743')
/tmp/ngff_cache_20240604/1/Image:1486790/stderr:FileNotFoundError: [Errno 2] No such file or directory: path('/home/wmoore/omero/sessions/localhost/public/2f4eff31-ac3c-4f91-bc7b-e8464463f3a7')
...

Found 708 errors like:

/tmp/ngff_cache_20240604/1/Image:9821901/stderr:!! 06/04/24 09:49:07.402 error: 9 communicators not destroyed during global destruction.

will-moore avatar Jun 04 '24 09:06 will-moore

@sbesson - updated the previous comment above with more details of errors on the last attempt at memo file generation for NGFF data (1599 Plate Images and 44 Datasets).

Summary:

  • 1807 ResourceErrors but don't know what caused them. Only 239 out of 1599 NGFF Plates are "ok" in memo generation logs, but data seems to be OK when viewed in webclient.
  • 14 No such file or directory: path('/home/wmoore/omero/sessions/localhost/public/...
  • 708 communicators not destroyed

will-moore avatar Jun 11 '24 08:06 will-moore

From Seb: on omeroreadonly-4 I am seeing lots of java.io.FileNotFoundException...(Transport endpoint is not connected) in the logs. And:

[sbesson@test122-omeroreadonly-4 ~]$ ls /bia-integrator-data 
ls: cannot access '/bia-integrator-data': Transport endpoint is not connected

Fixed by unmounting and re-mounting goofys:

[wmoore@test122-omeroreadonly-4 ~]$ sudo umount /bia-integrator-data
[wmoore@test122-omeroreadonly-4 ~]$ sudo goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other bia-integrator-data /bia-integrator-data
[wmoore@test122-omeroreadonly-4 ~]$ ls /bia-integrator-data/S-BIAD815/51afff7c-eed4-44b4-95c7-1437d8807b97/51afff7c-eed4-44b4-95c7-1437d8807b97.zarr
0  OME

Also check this is mounted on all servers:

[wmoore@test122-proxy ~]$ for n in $(cat nodes); do echo $n && ssh $n "ls /bia-integrator-data/S-BIAD815/51afff7c-eed4-44b4-95c7-1437d8807b97/51afff7c-eed4-44b4-95c7-1437d8807b97.zarr"; done
omeroreadonly-1
0
OME
omeroreadonly-2
0
OME
omeroreadonly-3
0
OME
omeroreadonly-4
0
OME
omeroreadwrite
0
OME
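
A sketch for automating that check (and remount) across all nodes from the proxy, assuming the same goofys options as above and that sudo is available on each host:

for n in $(cat nodes); do
  echo $n
  ssh $n 'ls /bia-integrator-data > /dev/null 2>&1 || (sudo umount -l /bia-integrator-data; sudo goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other bia-integrator-data /bia-integrator-data)'
done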

will-moore avatar Jun 11 '24 08:06 will-moore

6th attempt:

screen -dmS cache parallel --eta --sshloginfile nodes -a ngff_ids.txt --results /tmp/ngff_cache_20240611/ -j10 '/opt/omero/server/OMERO.server/bin/omero render -s localhost -u public -w public test --force'

Immediately there are a bunch of Image:XX dirs:

[wmoore@test122-proxy ~]$ ls /tmp/ngff_cache_20240611/1 | wc
    324     324    4535

A bunch of sessions errors e.g.

/tmp/ngff_cache_20240611/1/Image:3415781/stderr:FileNotFoundError: [Errno 2] No such file or directory: path('/home/wmoore/omero/sessions/localhost/public/f11b34f2-a63c-4926-a207-ec423209f5cf')

$ grep "FileNotFoundError" /tmp/ngff_cache_20240611/1/**/* | wc
     27     243    5239

And some "communicators not destroyed"

/tmp/ngff_cache_20240611/1/Image:1484119/stderr:!! 06/11/24 21:16:15.526 error: 9 communicators not destroyed during global destruction.

$ grep "rror" /tmp/ngff_cache_20240611/1/**/* | grep communicators | wc
     17     187    2328

No other errors initially... Overnight, 2 other ResourceErrors:

[wmoore@test122-proxy ~]$ grep "rror" /tmp/ngff_cache_20240611/1/**/* | grep -v communicators | grep -v FileNotFoundError
/tmp/ngff_cache_20240611/1/Image:1556033/stdout:fail: Pixels:1556033 Image:1556033 11470.436125516891 exception ::omero::ResourceError
/tmp/ngff_cache_20240611/1/Image:1573071/stdout:fail: Pixels:1573071 Image:1573071 10738.076765537262 exception ::omero::ResourceError

Both of these are from idr0013 and are NOT NGFF data! Fileset:18568 and Fileset:18655 are the 2nd and 3rd from last rows in idr0013.csv. The last row is 18728, which is also not NGFF converted. But checking earlier rows in idr0013.csv (e.g. 4th from the end and earlier) finds no unconverted Filesets, so it looks like just the last 3 failed?

Checking progress (12th June)... It looks like the process is not running. Only 311 "ok" and still no more errors.

[wmoore@test122-proxy ~]$ screen -r
There is no screen to be resumed.

[wmoore@test122-proxy ~]$ ls /tmp/ngff_cache_20240611/1/ | wc
   1643    1643   22938

[wmoore@test122-proxy ~]$ grep "ok:" /tmp/ngff_cache_20240611/1/**/* | wc
    311    1244   31235

# /tmp/ngff_cache_20240611/1/Image:696324/stdout:Previous session expired for public on localhost:4064
[wmoore@test122-proxy ~]$ grep "session expired" /tmp/ngff_cache_20240611/1/**/* | wc
     70     490    7142

Previous errors in stdout:

[wmoore@test122-proxy ~]$ cat /tmp/ngff_cache_20240611/1/**/stdout | grep -v "ok:" | grep -v "session expired"
fail: Pixels:1556033 Image:1556033 11470.436125516891 exception ::omero::ResourceError
fail: Pixels:1573071 Image:1573071 10738.076765537262 exception ::omero::ResourceError

But there are lots more errors in stderr (counts below):

cat /tmp/ngff_cache_20240611/1/**/stderr
...
wmoore@omeroreadonly-3: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).


[wmoore@test122-proxy ~]$ grep "wmoore@omeroreadonly-3: Permission denied (publickey,gssapi-keyex,gssapi-with-mic)" /tmp/ngff_cache_20240611/1/**/stderr | wc
    492    1968   65421
[wmoore@test122-proxy ~]$ grep "wmoore@omeroreadwrite: Permission denied (publickey,gssapi-keyex,gssapi-with-mic)" /tmp/ngff_cache_20240611/1/**/stderr | wc
    495    1980   65317
[wmoore@test122-proxy ~]$ grep "Password check failed" /tmp/ngff_cache_20240611/1/**/stderr | wc
     27     243    3511

will-moore avatar Jun 11 '24 21:06 will-moore

Session timeout

Let's try to log in and set the session timeout before running parallel:

[wmoore@test122-proxy ~]$ for server in $(cat nodes); do ssh $server "/opt/omero/server/OMERO.server/bin/omero login -s localhost -u public -w public && /opt/omero/server/OMERO.server/bin/omero sessions timeout 3600"; done
Created session for public@localhost:4064. Idle timeout: 10 min. Current group: Public
Previous session expired for public on localhost:4064
Using session for public@localhost:4064. Idle timeout: 10 min. Current group: Public
cannot update timeout for 81a63a9a-1513-4a3a-95b0-0599b6c46d1a
Created session for public@localhost:4064. Idle timeout: 10 min. Current group: Public
Previous session expired for public on localhost:4064
Using session for public@localhost:4064. Idle timeout: 10 min. Current group: Public
cannot update timeout for bda1ab04-ea1c-4f39-838b-00bffe7795e8
Created session for public@localhost:4064. Idle timeout: 10 min. Current group: Public
Previous session expired for public on localhost:4064
Using session for public@localhost:4064. Idle timeout: 10 min. Current group: Public
cannot update timeout for 53ec8eb4-e947-49a2-b684-5b89ee8b02e3
Created session for public@localhost:4064. Idle timeout: 10 min. Current group: Public
Previous session expired for public on localhost:4064
Using session for public@localhost:4064. Idle timeout: 10 min. Current group: Public
cannot update timeout for 2480e613-d6da-4fbd-93d7-aa1974979cbf
Created session for public@localhost:4064. Idle timeout: 10 min. Current group: Public
Previous session expired for public on localhost:4064
Using session for public@localhost:4064. Idle timeout: 10 min. Current group: Public

Hmmm - oh well....

Give another try anyway...

screen -dmS cache parallel --eta --sshloginfile nodes -a ngff_ids.txt --results /tmp/ngff_cache_20240613/ -j10 '/opt/omero/server/OMERO.server/bin/omero render -s localhost -u public -w public test --force'

After a few hours overnight: no errors except the known 2 above; no FileNotFound or "communicators not destroyed".

[wmoore@test122-proxy ~]$ grep "rror" /tmp/ngff_cache_20240613/1/**/*
/tmp/ngff_cache_20240613/1/Image:1556033/stdout:fail: Pixels:1556033 Image:1556033 1933.3922345638275 exception ::omero::ResourceError
/tmp/ngff_cache_20240613/1/Image:1573071/stdout:fail: Pixels:1573071 Image:1573071 1944.575980424881 exception ::omero::ResourceError

Some timeouts have changed to 60 min, approx 1/5th (none were 60 mins before):

$ grep timeout /tmp/ngff_cache_20240613/1/**/*
...
/tmp/ngff_cache_20240613/1/Image:692360/stderr:Using session for public@localhost:4064. Idle timeout: 10 min. Current group: Public
/tmp/ngff_cache_20240613/1/Image:693716/stderr:Reconnected to session for public@localhost:4064. Idle timeout: 10 min. Current group: Public
/tmp/ngff_cache_20240613/1/Image:696271/stderr:Using session for public@localhost:4064. Idle timeout: 60 min. Current group: Public
/tmp/ngff_cache_20240613/1/Image:696324/stderr:Using session for public@localhost:4064. Idle timeout: 10 min. Current group: Public
/tmp/ngff_cache_20240613/1/Image:696797/stderr:Using session for public@localhost:4064. Idle timeout: 60 min. Current group: Public
/tmp/ngff_cache_20240613/1/Image:9822101/stderr:Using session for public@localhost:4064. Idle timeout: 10 min. Current group: Public
[wmoore@test122-proxy ~]$ grep timeout /tmp/ngff_cache_20240613/1/**/* | wc
    326    3612   43591
[wmoore@test122-proxy ~]$ grep "timeout: 60" /tmp/ngff_cache_20240613/1/**/* | wc
     72     792    9573

Some "ok:", similar number as before (and NONE of these are Dataset Images). Still small portion of 1599 total Plates!

[wmoore@test122-proxy ~]$ grep "ok:" /tmp/ngff_cache_20240613/1/**/* | wc
    324    1296   32546

Lots of these - equally distributed between all 5 servers:

/tmp/ngff_cache_20240613/1/Image:9821951/stderr:wmoore@omeroreadonly-3: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
/tmp/ngff_cache_20240613/1/Image:9822001/stderr:wmoore@omeroreadonly-1: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
/tmp/ngff_cache_20240613/1/Image:9822051/stderr:wmoore@omeroreadonly-3: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

[wmoore@test122-proxy ~]$ grep "Permission denied" /tmp/ngff_cache_20240613/1/**/* | wc
   1269    5076  168461
[wmoore@test122-proxy ~]$ grep "Permission denied" /tmp/ngff_cache_20240613/1/**/* | grep "readwrite" | wc
    254    1016   33512
[wmoore@test122-proxy ~]$ grep "Permission denied" /tmp/ngff_cache_20240613/1/**/* | grep "readonly-1" | wc
    254    1016   33774
[wmoore@test122-proxy ~]$ grep "Permission denied" /tmp/ngff_cache_20240613/1/**/* | grep "readonly-2" | wc
    254    1016   33767
[wmoore@test122-proxy ~]$ grep "Permission denied" /tmp/ngff_cache_20240613/1/**/* | grep "readonly-3" | wc
    253    1012   33638
[wmoore@test122-proxy ~]$ grep "Permission denied" /tmp/ngff_cache_20240613/1/**/* | grep "readonly-4" | wc
    254    1016   33770

will-moore avatar Jun 13 '24 22:06 will-moore

Looking at the logs for the source of these Permission denied issues, I think the relevant events are

[sbesson@test122-omeroreadwrite ~]$ sudo grep -B 10 -A 10 fatal /var/log/secure-20240616 
Jun 11 21:30:56 test122-omeroreadwrite sshd[1335505]: Accepted publickey for wmoore from 192.168.2.190 port 32888 ssh2: RSA SHA256:Nurjq5im8jid017paB3gXF7nz31bCaoY0W9UImvhwX8
Jun 11 21:30:56 test122-omeroreadwrite sshd[1335505]: pam_unix(sshd:session): session opened for user wmoore(uid=5098) by wmoore(uid=0)
Jun 11 21:46:35 test122-omeroreadwrite sshd[1335130]: Received disconnect from 192.168.2.190 port 55788:11: disconnected by user
Jun 11 21:46:35 test122-omeroreadwrite sshd[1335130]: Disconnected from user wmoore 192.168.2.190 port 55788
Jun 11 21:46:35 test122-omeroreadwrite sshd[1335127]: pam_unix(sshd:session): session closed for user wmoore
Jun 11 21:46:36 test122-omeroreadwrite sshd[1336220]: Accepted publickey for wmoore from 192.168.2.190 port 35256 ssh2: RSA SHA256:Nurjq5im8jid017paB3gXF7nz31bCaoY0W9UImvhwX8
Jun 11 21:46:36 test122-omeroreadwrite sshd[1336220]: pam_unix(sshd:session): session opened for user wmoore(uid=5098) by wmoore(uid=0)
Jun 11 22:15:16 test122-omeroreadwrite sshd[1334064]: Received disconnect from 192.168.2.190 port 37106:11: disconnected by user
Jun 11 22:15:16 test122-omeroreadwrite sshd[1334064]: Disconnected from user wmoore 192.168.2.190 port 37106
Jun 11 22:15:16 test122-omeroreadwrite sshd[1334018]: pam_unix(sshd:session): session closed for user wmoore
Jun 11 22:17:16 test122-omeroreadwrite sshd[1337471]: fatal: Timeout before authentication for 192.168.2.190 port 41974
Jun 11 22:21:59 test122-omeroreadwrite sshd[1337726]: Connection closed by authenticating user wmoore 192.168.2.190 port 52128 [preauth]
Jun 11 22:21:59 test122-omeroreadwrite sshd[1337728]: Connection closed by authenticating user wmoore 192.168.2.190 port 52134 [preauth]
Jun 11 22:21:59 test122-omeroreadwrite sshd[1337730]: Connection closed by authenticating user wmoore 192.168.2.190 port 52146 [preauth]
Jun 11 22:21:59 test122-omeroreadwrite sshd[1337732]: Connection closed by authenticating user wmoore 192.168.2.190 port 52160 [preauth]
Jun 11 22:21:59 test122-omeroreadwrite sshd[1337734]: Connection closed by authenticating user wmoore 192.168.2.190 port 52166 [preauth]
Jun 11 22:22:00 test122-omeroreadwrite sshd[1337736]: Connection closed by authenticating user wmoore 192.168.2.190 port 52174 [preauth]
Jun 11 22:22:00 test122-omeroreadwrite sshd[1337738]: Connection closed by authenticating user wmoore 192.168.2.190 port 53496 [preauth]
Jun 11 22:22:01 test122-omeroreadwrite sshd[1337740]: Connection closed by authenticating user wmoore 192.168.2.190 port 53502 [preauth]
Jun 11 22:22:01 test122-omeroreadwrite sshd[1337742]: Connection closed by authenticating user wmoore 192.168.2.190 port 53514 [preauth]
Jun 11 22:22:01 test122-omeroreadwrite sshd[1337744]: Connection closed by authenticating user wmoore 192.168.2.190 port 53518 [preauth]
--
Jun 13 22:14:54 test122-omeroreadwrite sshd[1453987]: pam_unix(sshd:session): session closed for user wmoore
Jun 13 22:14:55 test122-omeroreadwrite sshd[1454271]: Received disconnect from 192.168.2.190 port 34842:11: disconnected by user
Jun 13 22:14:55 test122-omeroreadwrite sshd[1454271]: Disconnected from user wmoore 192.168.2.190 port 34842
Jun 13 22:14:55 test122-omeroreadwrite sshd[1454200]: pam_unix(sshd:session): session closed for user wmoore
Jun 13 22:14:55 test122-omeroreadwrite sshd[1454219]: Received disconnect from 192.168.2.190 port 34816:11: disconnected by user
Jun 13 22:14:55 test122-omeroreadwrite sshd[1454219]: Disconnected from user wmoore 192.168.2.190 port 34816
Jun 13 22:14:55 test122-omeroreadwrite sshd[1454197]: pam_unix(sshd:session): session closed for user wmoore
Jun 13 22:14:55 test122-omeroreadwrite sshd[1454244]: Received disconnect from 192.168.2.190 port 34832:11: disconnected by user
Jun 13 22:14:55 test122-omeroreadwrite sshd[1454244]: Disconnected from user wmoore 192.168.2.190 port 34832
Jun 13 22:14:55 test122-omeroreadwrite sshd[1454199]: pam_unix(sshd:session): session closed for user wmoore
Jun 13 22:16:52 test122-omeroreadwrite sshd[1454272]: fatal: Timeout before authentication for 192.168.2.190 port 34850
Jun 13 22:16:52 test122-omeroreadwrite sshd[1454298]: fatal: Timeout before authentication for 192.168.2.190 port 34854
Jun 13 22:16:53 test122-omeroreadwrite sshd[1454383]: fatal: Timeout before authentication for 192.168.2.190 port 34868
Jun 13 22:16:54 test122-omeroreadwrite sshd[1454386]: fatal: Timeout before authentication for 192.168.2.190 port 34872
Jun 13 22:16:54 test122-omeroreadwrite sshd[1454388]: fatal: Timeout before authentication for 192.168.2.190 port 34884
Jun 13 22:16:54 test122-omeroreadwrite sshd[1454390]: fatal: Timeout before authentication for 192.168.2.190 port 34886
Jun 13 22:16:54 test122-omeroreadwrite sshd[1454392]: fatal: Timeout before authentication for 192.168.2.190 port 34888
Jun 13 22:16:55 test122-omeroreadwrite sshd[1454394]: fatal: Timeout before authentication for 192.168.2.190 port 34892
Jun 13 22:16:55 test122-omeroreadwrite sshd[1454397]: fatal: Timeout before authentication for 192.168.2.190 port 34896
Jun 13 22:16:55 test122-omeroreadwrite sshd[1454399]: fatal: Timeout before authentication for 192.168.2.190 port 34902
Jun 13 22:19:22 test122-omeroreadwrite sshd[1454576]: Connection closed by authenticating user wmoore 192.168.2.190 port 43110 [preauth]
Jun 13 22:19:22 test122-omeroreadwrite sshd[1454578]: Connection closed by authenticating user wmoore 192.168.2.190 port 43120 [preauth]
Jun 13 22:19:22 test122-omeroreadwrite sshd[1454580]: Connection closed by authenticating user wmoore 192.168.2.190 port 43130 [preauth]
Jun 13 22:19:22 test122-omeroreadwrite sshd[1454582]: Connection closed by authenticating user wmoore 192.168.2.190 port 43144 [preauth]
Jun 13 22:19:22 test122-omeroreadwrite sshd[1454584]: Connection closed by authenticating user wmoore 192.168.2.190 port 43146 [preauth]
Jun 13 22:19:22 test122-omeroreadwrite sshd[1454586]: Connection closed by authenticating user wmoore 192.168.2.190 port 43148 [preauth]
Jun 13 22:19:22 test122-omeroreadwrite sshd[1454588]: Connection closed by authenticating user wmoore 192.168.2.190 port 43152 [preauth]
Jun 13 22:19:22 test122-omeroreadwrite sshd[1454590]: Connection closed by authenticating user wmoore 192.168.2.190 port 43160 [preauth]
Jun 13 22:19:22 test122-omeroreadwrite sshd[1454591]: Connection closed by authenticating user wmoore 192.168.2.190 port 43172 [preauth]
Jun 13 22:19:22 test122-omeroreadwrite sshd[1454593]: Connection closed by authenticating user wmoore 192.168.2.190 port 43178 [preauth]

But I haven't yet found a good explanation for what could be causing these timeouts that break the SSH connections.

sbesson avatar Jun 17 '24 08:06 sbesson

As omero-server, updated OMEZarrReader on all 5 idr-testing servers:

wget https://artifacts.openmicroscopy.org/artifactory/ome.releases/ome/OMEZarrReader/0.5.1/OMEZarrReader-0.5.1.jar
rm OMERO.server/lib/client/OMEZarrReader-b76.jar && rm OMERO.server/lib/server/OMEZarrReader-b76.jar 
cp OMEZarrReader-0.5.1.jar OMERO.server/lib/client/ && cp OMEZarrReader-0.5.1.jar OMERO.server/lib/server/
exit
$ sudo service omero-server restart

However, trying to view images in web (connected to omeroreadwrite:80) gave errors suggesting we're missing dependencies:

    serverExceptionClass = ome.conditions.InternalException
    message =  Wrapped Exception: (java.lang.NoClassDefFoundError):
com/amazonaws/services/s3/model/S3ObjectInputStream

Reverting back to latest daily build...

wget https://merge-ci.openmicroscopy.org/jenkins/job/BIOFORMATS-build/101/label=testintegration/artifact/bio-formats-build/ZarrReader/target/OMEZarrReader-0.5.2-SNAPSHOT-jar-with-dependencies.jar
mv OMEZarrReader-0.5.2-SNAPSHOT-jar-with-dependencies.jar OMEZarrReader-0.5.2_b101.jar
rm OMERO.server/lib/client/OMEZarrReader-0.5.1.jar && rm OMERO.server/lib/server/OMEZarrReader-0.5.1.jar
cp OMEZarrReader-0.5.2_b101.jar OMERO.server/lib/client/ && cp OMEZarrReader-0.5.2_b101.jar OMERO.server/lib/server/

will-moore avatar Jun 17 '24 10:06 will-moore

Yesterday: To avoid sshloginfile issues seen above, I manually split ngff_ids.txt (1643 rows) into blocks of 350 rows, one block per omeroreadonly server (omeroreadwrite I just started running with the existing ngff_plates.txt to cover the first 350+ rows):

  • omeroreadonly-1: rows 351-700
  • omeroreadonly-2: rows 701-1050
  • omeroreadonly-3: rows 1051-1400
  • omeroreadonly-4: rows 1401-1643
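
One way the blocks could have been generated and pushed out (a sketch; assumes each block is saved as ngff_ids.txt in the home directory on the target server, matching the command below):

sed -n '351,700p' ngff_ids.txt > block1.txt && rsync -vP block1.txt omeroreadonly-1:ngff_ids.txt
sed -n '701,1050p' ngff_ids.txt > block2.txt && rsync -vP block2.txt omeroreadonly-2:ngff_ids.txt
sed -n '1051,1400p' ngff_ids.txt > block3.txt && rsync -vP block3.txt omeroreadonly-3:ngff_ids.txt
sed -n '1401,1643p' ngff_ids.txt > block4.txt && rsync -vP block4.txt omeroreadonly-4:ngff_ids.txt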

I then ssh'd to each server in turn and ran:

screen -dmS cache parallel --eta -a ngff_ids.txt --results /tmp/ngff_cache_20240617/ -j10 '/opt/omero/server/OMERO.server/bin/omero render -s localhost -u public -w public test --force'

Later yesterday:

[wmoore@test122-proxy ~]$ for n in $(cat nodes);   do echo $n && ssh $n "grep 'ok:' /tmp/ngff_cache_20240617/1/**/* | wc"; done
omeroreadonly-1
     19      76    1908
omeroreadonly-2
     20      80    2005
omeroreadonly-3
     20      80    2000
omeroreadonly-4
     21      84    2104
omeroreadwrite
    362    1448   36445

Today:

[wmoore@test122-proxy ~]$ for n in $(cat nodes);   do echo $n && ssh $n "grep 'ok:' /tmp/ngff_cache_20240617/1/**/* | wc"; done
omeroreadonly-1
     43     172    4322
omeroreadonly-2
     39     156    3910
omeroreadonly-3
     50     200    5014
omeroreadonly-4
     41     164    4105
omeroreadwrite
    389    1556   39158

will-moore avatar Jun 18 '24 09:06 will-moore

Found ResourceError for idr0015 plates: Fileset:21118 (not NGFF converted). Corresponds to 3 plates missing from idr0015.csv: https://github.com/IDR/idr-metadata/issues/645#issuecomment-1849888173 Need to run these 3 as described there and add them to idr0015.csv.

Created idr0015a on idr-testing as described there, then... etc

cd idr0015
for i in $(ls); do sed -i 's/42650434-6eaa-45e5-9542-58247a45d8bc/7eb6e556-9b77-45b6-b82b-c913cb85904e/g' $i; done
cd ../

(venv3) bash-5.1$ for r in $(cat idr0015a.csv); do
>   biapath=$(echo $r | cut -d',' -f2)
>   uuid=$(echo $biapath | cut -d'/' -f2)
>   fsid=$(echo $r | cut -d',' -f3 | tr -d '[:space:]')
>   psql -U omero -d idr -h $DBHOST -f "$IDRID/$fsid.sql"
>   omero mkngff symlink /data/OMERO/ManagedRepository $fsid "/bia-integrator-data/$biapath/$uuid.zarr" --bfoptions --clientpath="https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/$biapath/$uuid.zarr"
> done
UPDATE 396
BEGIN
 mkngff_fileset 
----------------
        6321119
(1 row)

COMMIT
Using session for demo@localhost:4064. Idle timeout: 10 min. Current group: Public
Checking for prefix_dir /data/OMERO/ManagedRepository/demo_2/2016-06/06/21-26-25.533
Creating dir at /data/OMERO/ManagedRepository/demo_2/2016-06/06/21-26-25.533_mkngff
Creating symlink /data/OMERO/ManagedRepository/demo_2/2016-06/06/21-26-25.533_mkngff/d69df538-4684-4b32-8ded-d2f2af43af9f.zarr -> /bia-integrator-data/S-BIAD861/d69df538-4684-4b32-8ded-d2f2af43af9f/d69df538-4684-4b32-8ded-d2f2af43af9f.zarr
Checking for prefix_dir /data/OMERO/ManagedRepository/demo_2/2016-06/06/21-26-25.533
write bfoptions to: /data/OMERO/ManagedRepository/demo_2/2016-06/06/21-26-25.533_mkngff/d69df538-4684-4b32-8ded-d2f2af43af9f.zarr.bfoptions
UPDATE 396
BEGIN
 mkngff_fileset 
----------------
        6321120
(1 row)

COMMIT
Using session for demo@localhost:4064. Idle timeout: 10 min. Current group: Public
Checking for prefix_dir /data/OMERO/ManagedRepository/demo_2/2016-06/10/13-37-45.953
Creating dir at /data/OMERO/ManagedRepository/demo_2/2016-06/10/13-37-45.953_mkngff
Creating symlink /data/OMERO/ManagedRepository/demo_2/2016-06/10/13-37-45.953_mkngff/0cc5dbe3-444a-4ea2-a335-b51cf89c1c53.zarr -> /bia-integrator-data/S-BIAD861/0cc5dbe3-444a-4ea2-a335-b51cf89c1c53/0cc5dbe3-444a-4ea2-a335-b51cf89c1c53.zarr
Checking for prefix_dir /data/OMERO/ManagedRepository/demo_2/2016-06/10/13-37-45.953
write bfoptions to: /data/OMERO/ManagedRepository/demo_2/2016-06/10/13-37-45.953_mkngff/0cc5dbe3-444a-4ea2-a335-b51cf89c1c53.zarr.bfoptions
UPDATE 396
BEGIN
 mkngff_fileset 
----------------
        6321121
(1 row)

COMMIT
Using session for demo@localhost:4064. Idle timeout: 10 min. Current group: Public
Checking for prefix_dir /data/OMERO/ManagedRepository/demo_2/2016-06/06/00-58-20.828
Creating dir at /data/OMERO/ManagedRepository/demo_2/2016-06/06/00-58-20.828_mkngff
Creating symlink /data/OMERO/ManagedRepository/demo_2/2016-06/06/00-58-20.828_mkngff/1a29207c-d50b-48b7-a7c0-54c6252bfd9c.zarr -> /bia-integrator-data/S-BIAD861/1a29207c-d50b-48b7-a7c0-54c6252bfd9c/1a29207c-d50b-48b7-a7c0-54c6252bfd9c.zarr
Checking for prefix_dir /data/OMERO/ManagedRepository/demo_2/2016-06/06/00-58-20.828
write bfoptions to: /data/OMERO/ManagedRepository/demo_2/2016-06/06/00-58-20.828_mkngff/1a29207c-d50b-48b7-a7c0-54c6252bfd9c.zarr.bfoptions

will-moore avatar Jun 18 '24 10:06 will-moore

Also did the same for idr0013: the last 3 plates of idr0013.csv were not converted (see https://github.com/IDR/idr-metadata/issues/696#issuecomment-2161654156 above). These initially failed to run as the SESSION wasn't in the original sql scripts (so it didn't get correctly updated prior to running). Fixed in https://github.com/IDR/mkngff_upgrade_scripts/commit/be59fe98b57229f7f314162cd707be2551e1b483

[wmoore@test122-proxy ~]$ for n in $(cat nodes);   do echo $n && ssh $n "grep 'rror' /tmp/ngff_cache_20240617/1/**/*"; done
omeroreadonly-1
/tmp/ngff_cache_20240617/1/Image:1964831/stdout:fail: Pixels:1964831 Image:1964831 1.9138693809509277 exception ::omero::ResourceError
/tmp/ngff_cache_20240617/1/Image:3103076/stdout:fail: Pixels:3103076 Image:3103076 16301.51702260971 exception ::omero::ResourceError
omeroreadonly-2
/tmp/ngff_cache_20240617/1/Image:3073892/stdout:fill: Pixels:3073892 Image:3073892 17484.479704618454 exception ::omero::ResourceError
omeroreadonly-3
/tmp/ngff_cache_20240617/1/Image:1962455/stdout:fail: Pixels:1962455 Image:1962455 2.024498224258423 exception ::omero::ResourceError
/tmp/ngff_cache_20240617/1/Image:2850468/stderr:!! 06/17/24 12:46:26.823 error: 4 communicators not destroyed during global destruction.
/tmp/ngff_cache_20240617/1/Image:2857814/stderr:!! 06/17/24 12:41:54.395 error: 2 communicators not destroyed during global destruction.
/tmp/ngff_cache_20240617/1/Image:3367911/stderr:!! 06/17/24 21:46:36.942 error: 2 communicators not destroyed during global destruction.
omeroreadonly-4
/tmp/ngff_cache_20240617/1/Image:1640587/stderr:!! 06/17/24 14:15:58.051 error: 6 communicators not destroyed during global destruction.
/tmp/ngff_cache_20240617/1/Image:2857533/stderr:!! 06/17/24 12:49:00.746 error: 6 communicators not destroyed during global destruction.
omeroreadwrite
/tmp/ngff_cache_20240617/1/Image:12546037/stderr:!! 06/17/24 12:17:22.525 error: 9 communicators not destroyed during global destruction.
/tmp/ngff_cache_20240617/1/Image:12550677/stderr:!! 06/17/24 12:17:53.251 error: 9 communicators not destroyed during global destruction.
...
/tmp/ngff_cache_20240617/1/Image:1556033/stdout:fail: Pixels:1556033 Image:1556033 12244.128482341766 exception ::omero::ResourceError
/tmp/ngff_cache_20240617/1/Image:1573071/stdout:fail: Pixels:1573071 Image:1573071 10876.942811727524 exception ::omero::ResourceError

Checked all these Images: all are viewable in the webclient except for the idr0015 and idr0013 Plates fixed above, which were viewed to trigger memo file creation... omeroreadwrite has lots of "communicators not destroyed"; I just checked the ResourceErrors...

will-moore avatar Jun 18 '24 11:06 will-moore

Progress: current state...

[wmoore@test122-proxy ~]$ for n in $(cat nodes);   do echo $n && ssh $n "grep 'ok:' /tmp/ngff_cache_20240617/1/**/* | wc"; done
omeroreadonly-1
     89     356    8945
omeroreadonly-2
     83     332    8328
omeroreadonly-3
     92     368    9232
omeroreadonly-4
     89     356    8919
omeroreadwrite
    437    1748   43980

Checking progress in webclient:

  • idr0004 - all done
  • idr0010 - done up to plate 49-06
  • idr0011 - ScreenA
  • ...
  • idr0064 - all done

will-moore avatar Jun 19 '24 06:06 will-moore

Need a better way to monitor completed memo filesets.

e.g.

(venv3) (base) [wmoore@pilot-idr0138-omeroreadwrite scripts]$ /opt/omero/server/OMERO.server/bin/omero hql --limit -1 --ids-only --style csv 'select MIN(field.image.id) FROM WellSample AS field where field.well.plate.id in (select link.child.id from ScreenPlateLink as link where link.parent.id=2851) GROUP BY field.well.plate' > idr0090_plates.txt

cut -d ',' -f2 idr0090_plates.txt | sed -e 's/^/Image:/' > idr0090_ids.txt

On proxy server, make a list of all "ok" logs:

[wmoore@test122-proxy ~]$ rsync -rvP omeroreadwrite:/home/wmoore/idr0010_ids.txt ./

[wmoore@test122-proxy ~]$ for n in $(cat nodes);   do echo $n && ssh $n "grep 'ok:' /tmp/ngff_cache_20240617/1/**/*" >> ok_ids.txt; done
[wmoore@test122-proxy ~]$ wc ok_ids.txt 
  842  3368 84631 ok_ids.txt

E.g. for idr0010, check which Images are in the ok logs: 80/148 are ok:

$ for i in $(cat idr0010_ids.txt); do echo $i && grep $i ok_ids.txt >> idr0010_ok.txt; done
$ wc idr0010_ok.txt 
  80  320 8036 idr0010_ok.txt
$ wc idr0010_ids.txt 
 148  148 2072 idr0010_ids.txt
for idr in idr0011a idr0011b idr0011c idr0011d idr0012 idr0013a idr0013b idr0015 idr0016 idr0025 idr0033 idr0035 idr0036 idr0064; do
echo $(echo $idr)_plates.txt && cut -d ',' -f2 $(echo $idr)_plates.txt | sed -e 's/^/Image:/' > $(echo $idr)_ids.txt;
done
for idr in idr0011a idr0011b idr0011c idr0011d idr0012 idr0013a idr0013b idr0015 idr0016 idr0025 idr0033 idr0035 idr0036 idr0064; do
echo $(echo $idr)_ids.txt && rsync -rvP omeroreadwrite:/home/wmoore/$(echo $idr)_ids.txt ./
done
  • idr0010 80/148
  • idr0011a 81/129
  • idr0011b 18/40
  • idr0011c 3/4
  • idr0011d 4/8
  • idr0012 38/68
  • idr0013a 283/510
  • idr0013b 9/38
  • idr0015 36/83
  • idr0016 112/413
  • idr0025 0/3
  • idr0033 5/12
  • idr0035 35/55
  • idr0064 5/9
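
The per-study counts above could also be produced in one pass (a sketch; assumes the *_ids.txt files and ok_ids.txt are in the current directory on the proxy server):

for idr in idr0010 idr0011a idr0011b idr0011c idr0011d idr0012 idr0013a idr0013b idr0015 idr0016 idr0025 idr0033 idr0035 idr0036 idr0064; do
  ok=$(grep -cFf ${idr}_ids.txt ok_ids.txt)
  total=$(wc -l < ${idr}_ids.txt)
  echo "$idr $ok/$total"
done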

will-moore avatar Jun 19 '24 12:06 will-moore

Updating iviewer to 0.14.0:

for server in omeroreadwrite omeroreadonly-1 omeroreadonly-2 omeroreadonly-3 omeroreadonly-4; do ssh $server "sudo /opt/omero/web/venv3/bin/pip uninstall -y omero-iviewer && sudo /opt/omero/web/venv3/bin/pip install omero-iviewer==0.14.0 && sudo service omero-web restart"; done

will-moore avatar Jun 19 '24 13:06 will-moore

Team testing on idr-testing with microservices today was "slower than expected" with a few odd image rendering failures. I didn't stop the memo generation before testing, which probably would have helped. Also tricky to test with incomplete memo file generation.

Now, stopped memo generation and will target the completion of individual studies...

Running individual Screens on different servers like this:

[wmoore@test122-omeroreadwrite ~]$ screen -dmS cache parallel --eta -a idr0004_ids.txt --results /tmp/ngff_cache_20240621_idr0004/ -j10 '/opt/omero/server/OMERO.server/bin/omero render -s localhost -u public -w public test --force'
  • omeroreadonly-1 do idr0010 (148)
  • omeroreadonly-2 do idr0011a (129)
  • omeroreadonly-3 do idr0012 (68)
  • omeroreadonly-4 do idr0015 (83)
  • omeroreadwrite do idr0004: (46)

After a couple of minutes we have:

[wmoore@test122-proxy ~]$ for n in $(cat nodes);   do echo $n && ssh $n "grep 'ok:' /tmp/ngff_cache_20240621_*/1/**/* | wc"; done
omeroreadonly-1
     29     116    3141
omeroreadonly-2
     36     144    3941
omeroreadonly-3
     21      84    2280
omeroreadonly-4
     23      92    2493
omeroreadwrite
     10      40    1053

But also lots of

stderr:FileNotFoundError: [Errno 2] No such file or directory: path('/home/wmoore/omero/sessions/localhost/public/e3906e3b-0282-482e-bd2c-c363ae0f1f79')

On omeroreadwrite, I cancelled the screen, deleted log dir and re-ran, hoping that existing session would avoid those errors - which seems to work. Got 19 initial "ok" in first minute and no FileNotFoundError errors.

Error counts are now:

[wmoore@test122-proxy ~]$ for n in $(cat nodes);   do echo $n && ssh $n "grep 'rror' /tmp/ngff_cache_20240621_*/1/**/* | wc"; done
omeroreadonly-1
      9      81    1818
omeroreadonly-2
      9      81    1827
omeroreadonly-3
     11      97    2067
omeroreadonly-4
      9      81    1818
omeroreadwrite
      0       0       0

9 fewer errors for omeroreadwrite and 9 more oks!

[wmoore@test122-proxy ~]$ for n in $(cat nodes);   do echo $n && ssh $n "grep 'ok:' /tmp/ngff_cache_20240621_*/1/**/* | wc"; done
omeroreadonly-1
     29     116    3141
omeroreadonly-2
     65     260    7119
omeroreadonly-3
     22      88    2386
omeroreadonly-4
     23      92    2493
omeroreadwrite
     19      76    1998

Also cancelled, deleted logs and restarted the other 4 readonly servers too... (12:35)

will-moore avatar Jun 21 '24 10:06 will-moore

[wmoore@test122-proxy ~]$ for n in $(cat nodes);   do echo $n && ssh $n "grep 'ok:' /tmp/ngff_cache_20240621_*/1/**/* | wc"; done
omeroreadonly-1
     33     132    3576
omeroreadonly-2
    123     492   13464
omeroreadonly-3
     26     104    2819
omeroreadonly-4
     23      92    2494
omeroreadwrite
     42     168    4429

15:39...

[wmoore@test122-proxy ~]$ for n in $(cat nodes);   do echo $n && ssh $n "grep 'ok:' /tmp/ngff_cache_20240621_*/1/**/* | wc"; done
omeroreadonly-1
     53     212    5744.    (out of 148)
omeroreadonly-2
    127     508   13902.   (out of 129 - idr0011a - DONE, see below)
omeroreadonly-3
     26     104    2819.     (out of 68 -  idr0012)
omeroreadonly-4
     30     120    3252.       (out of 38)
omeroreadwrite
     46     184    4849.    (all DONE 46 - idr0004)

On omeroreadwrite, rename the previous log dir so it doesn't show up in grep, start idr0125... DONE, then idr0033 (12 plates)...

For the 2 images from idr0011a that were not ok, both had DatabaseBusy exceptions at 10:40:

[wmoore@test122-omeroreadonly-2 ~]$ cat /tmp/ngff_cache_20240621_idr0011a/1/Image:2852482/stderr
[wmoore@test122-omeroreadonly-2 ~]$ cat /tmp/ngff_cache_20240621_idr0011a/1/Image:2852626/stderr

Both viewed OK in webclient - idr0011A DONE. Start idr0011b (40 plates)...

[wmoore@test122-proxy ~]$ for n in $(cat nodes);   do echo $n && ssh $n "grep 'ok:' /tmp/ngff_cache_20240621_*/1/**/* | wc"; done
omeroreadonly-1
     64     256    6940      (out of 148 - idr0010)
omeroreadonly-2
     19      76    2080.      (out of 40 - idr0011b)
omeroreadonly-3
     26     104    2819.     (out of 68 - idr0012) - no progress - lots of DatabaseBusyExceptions - restarted 16:50...
omeroreadonly-4
     30     120    3252        (out of 83 - idr0015) - no progress
omeroreadwrite
     10      40    1083.      (out of 12 - idr0033)

will-moore avatar Jun 21 '24 12:06 will-moore

Overnight...

[wmoore@test122-proxy ~]$ for n in $(cat nodes);   do echo $n && ssh $n "grep 'ok:' /tmp/ngff_cache_20240621_*/1/**/* | wc"; done
omeroreadonly-1
    148     592   16041.     (out of 148 - idr0010. - DONE) - start with idr0035
omeroreadonly-2
     40     160    4379.      (out of 40 - idr0011b - DONE) - start with idr0036
omeroreadonly-3
     55     220    5958.    (out of 68 - idr0012)
omeroreadonly-4
     57     228    6184.    (out of 83 - idr0015)
omeroreadwrite
     12      48    1301     (out of 12 - idr0033 - DONE) - start with idr0090

After few mins...

[wmoore@test122-proxy ~]$ for n in $(cat nodes);   do echo $n && ssh $n "grep 'ok:' /tmp/ngff_cache_20240621_*/1/**/* | wc"; done
omeroreadonly-1
     55     220    5968   (out of 55 - idr0035. - DONE)
omeroreadonly-2
     17      68    1846.    (out of 20 - idr0036)
omeroreadonly-3
     55     220    5958.   (out of 68 - idr0012)
omeroreadonly-4
     57     228    6184.   (out of 83 - idr0015)
omeroreadwrite
     16      64    1779.    (out of 20 - idr0090)

will-moore avatar Jun 22 '24 05:06 will-moore

[wmoore@test122-proxy ~]$ for n in $(cat nodes);   do echo $n && ssh $n "grep 'ok:' /tmp/ngff_cache_20240621_*/1/**/* | wc"; done
omeroreadonly-1
     55     220    5968.    (out of 55 - idr0035. - DONE)
omeroreadonly-2
     20      80    2169.     (out of 20 - idr0036 - DONE)
omeroreadonly-3
     55     220    5958.    (out of 68 - idr0012) -  TODO: investigate!!
omeroreadonly-4
     57     228    6184       (out of 83 - idr0015) -  TODO: investigate!!
omeroreadwrite
     20      80    2225.     (out of 20 - idr0090 - DONE) - restart ngff_datasets...
screen -dmS cache parallel --eta -a ngff_dataset_ids.txt --results /tmp/ngff_cache_20240623_datasets/ -j10 '/opt/omero/server/OMERO.server/bin/omero render -s localhost -u public -w public test --force'
...

[wmoore@test122-omeroreadwrite ~]$ grep "ok:" /tmp/ngff_cache_20240623_datasets/1/*/** | wc
   1580    6320  176656

Start ALL plates memo file generation...

[wmoore@test122-proxy ~]$ wc all_plate_ids.txt
 6762  6762 95866 all_plate_ids.txt

Start all fresh... in batches of 1500 rows with:

screen -dmS cache parallel --eta -a all_plate_ids.txt --results /tmp/ngff_cache_20240623_allplates/ -j10 '/opt/omero/server/OMERO.server/bin/omero render -s localhost -u public -w public test --force'

# remove first 1500 rows
sed '1,1500 d' all_plate_ids.txt > all_plate_ids_1500.txt
  • omeroreadonly-1 - all_plate_ids.txt
  • omeroreadonly-2 - all_plate_ids.txt (first 1500 rows removed)
  • omeroreadonly-3 - all_plate_ids.txt (first 3000 rows removed)
  • omeroreadonly-4 - all_plate_ids.txt (first 4500 rows removed)
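
The remaining batches named in the list above could be produced the same way (a sketch):

sed '1,3000 d' all_plate_ids.txt > all_plate_ids_3000.txt   # for omeroreadonly-3
sed '1,4500 d' all_plate_ids.txt > all_plate_ids_4500.txt   # for omeroreadonly-4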

will-moore avatar Jun 23 '24 22:06 will-moore

Cancel all memo generation (terminate screens) in prep for testing later this morning... NB: for omeroreadonly-3 and omeroreadonly-4, screens were still active for idr0012 and idr0015 above.

Checking idr0012 logs on omeroreadonly-3.. Find the 13 images missing from the logs with:

[wmoore@test122-omeroreadonly-3 ~]$ for i in $(cat idr0012_ids.txt); do echo $i && grep $i /tmp/ngff_cache_20240621_idr0012/1/*/**; done

Manually checked each in webclient. All viewed OK except for Image:1819431.

Checking idr0015 logs on omeroreadonly-4..

for i in $(cat idr0015_ids.txt); do echo $i && grep $i /tmp/ngff_cache_20240621_idr0015/1/*/**; done

Took the Image:IDs not found in logs and put them in a file to run render test...

[wmoore@test122-omeroreadonly-4 ~]$ vi idr0015_retest.txt
[wmoore@test122-omeroreadonly-4 ~]$ for i in $(cat idr0015_retest.txt); do echo $i && /opt/omero/server/OMERO.server/bin/omero render -s localhost -u public -w public test --force $i; done
Image:1971645
Using session for public@localhost:4064. Idle timeout: 10 min. Current group: Public
ok: Pixels:1971645 Image:1971645 2.5951030254364014 
...

Stuck on Image:1957701... And the next image also lacks memo file. (update: http://localhost:1080/webclient/?show=image-1957701 is good now) Run the remaining via usual route...

[wmoore@test122-omeroreadonly-4 ~]$ screen -dmS cache parallel --eta -a idr0015_retest.txt --results /tmp/ngff_cache_20240624_idr0015 -j10 '/opt/omero/server/OMERO.server/bin/omero render -s localhost -u public -w public test --force'

Cancelled after ~ an hour (8:09) in prep for testing...

will-moore avatar Jun 24 '24 05:06 will-moore

Current status Summary: Memo files all complete for:

  • idr0004
  • idr0010
  • idr0011a, idr0011b
  • idr0012
  • idr0033
  • idr0035
  • idr0036
  • idr0064
  • idr0090

will-moore avatar Jun 24 '24 08:06 will-moore

Testing delayed until 1:30 today, so let's restart idr0015 last few on omeroreadonly-4...

[wmoore@test122-omeroreadonly-4 ~]$ screen -dmS cache parallel --eta -a idr0015_retest.txt --results /tmp/ngff_cache_20240624_idr0015 -j10 '/opt/omero/server/OMERO.server/bin/omero render -s localhost -u public -w public test --force'

Also run idr0011C and idr0011D on omeroreadonly-2 (4 plates, 8 plates).

screen -dmS cache parallel --eta -a idr0011c_ids.txt --results /tmp/ngff_cache_20240624_idr0011c -j10 '/opt/omero/server/OMERO.server/bin/omero render -s localhost -u public -w public test --force'

All completed with "ok" (and checked in webclient).

will-moore avatar Jun 24 '24 10:06 will-moore

Restarting with e.g.:

[wmoore@test122-omeroreadwrite ~]$ screen -dmS cache parallel --eta -a idr0013a_ids.txt --results /tmp/ngff_cache_20240624_idr0013a -j10 '/opt/omero/server/OMERO.server/bin/omero render -s localhost -u public -w public test --force'
  • omeroreadonly-1 idr0013b (28 plates)
  • omeroreadonly-2 idr0016 (413 plates)
  • omeroreadwrite idr0013a (510 plates)
[wmoore@test122-proxy ~]$ for n in $(cat nodes);   do echo $n && ssh $n "grep 'ok:' /tmp/ngff_cache_20240624_*/1/**/* | wc"; done
omeroreadonly-1
     28     112    3067.        (idr0013B - DONE)
omeroreadonly-2
     36     144    4022.      (idr0016 - 413 plates)
omeroreadwrite
    234     936   25623.     (idr0013a - 510 plates)

will-moore avatar Jun 24 '24 11:06 will-moore

idr-testing omeroreadonly-3.

Testing started at 13:30 (12:30 GMT on logs): Checking...

[wmoore@test122-omeroreadonly-4 ~]$ less /opt/omero/server/OMERO.server/var/log/Blitz-0.log.1

First ERROR after that (using grep)...

2024-06-24 12:32:12,119 ERROR [        ome.services.util.ServiceHandler] (.Server-32) Method interface ome.api.IQuery.findAllByQuery invocation took 28799

with no other errors at that point.

Then found a bunch of similar errors due to database query times (statement timeouts), starting at:

2024-06-24 12:33:27,596 INFO  [        ome.services.util.ServiceHandler] (.Server-22)  Excp:    org.springframework.dao.DataAccessResourceFailureException: Hibernate operation: could not execute query; SQL [select image0_.id as id60_, image
2024-06-24 12:34:23,490 INFO  [        ome.services.util.ServiceHandler] (.Server-67)  Excp:    org.springframework.dao.DataAccessResourceFailureException: Hibernate operation: could not execute query; SQL [select image0_.id as id60_, image
0_.acquisitionDate as acquisit2_60_, image0_.archived as archived60_, image0_.description as descript4_60_, image0_.creation_id as creation10_60_, image0_.external_id as external11_60_, image0_.group_id as group12_60_, image0_.owner_id as o
wner13_60_, image0_.permissions as permissi5_60_, image0_.update_id as update14_60_, image0_.experiment as experiment60_, image0_.fileset as fileset60_, image0_.format as format60_, image0_.imagingEnvironment as imaging18_60_, image0_.instr
ument as instrument60_, image0_.name as name60_, image0_.objectiveSettings as objecti20_60_, image0_.partial as partial60_, image0_.series as series60_, image0_.stageLabel as stageLabel60_, image0_.version as version60_ from image image0_ l
eft outer join wellsample wellsample1_ on image0_.id=wellsample1_.image and 
( 
  1 = ? OR 
  1 = ? OR 
  (wellsample1_.group_id in (?)) OR 
  (wellsample1_.owner_id = ? AND (select (__g.permissions & 1024) = 1024 from experimentergroup __g where __g.id = wellsample1_.group_id)) OR 
  (wellsample1_.group_id in (?,?) AND (select (__g.permissions & 64) = 64 from experimentergroup __g where __g.id = wellsample1_.group_id)) OR 
  ((select (__g.permissions & 4) = 4 from experimentergroup __g where __g.id = wellsample1_.group_id)) 
) inner join well well2_ on wellsample1_.well=well2_.id inner join plate plate3_ on well2_.plate=plate3_.id left outer join screenplatelink screenlink4_ on plate3_.id=screenlink4_.child and 
( 
  1 = ? OR 
  1 = ? OR 
  (screenlink4_.group_id in (?)) OR 
  (screenlink4_.owner_id = ? AND (select (__g.permissions & 1024) = 1024 from experimentergroup __g where __g.id = screenlink4_.group_id)) OR 
  (screenlink4_.group_id in (?,?) AND (select (__g.permissions & 64) = 64 from experimentergroup __g where __g.id = screenlink4_.group_id)) OR 
  ((select (__g.permissions & 4) = 4 from experimentergroup __g where __g.id = screenlink4_.group_id)) 
) inner join screen screen5_ on screenlink4_.parent=screen5_.id left outer join imageannotationlink annotation6_ on image0_.id=annotation6_.parent and 
( 
  1 = ? OR 
  1 = ? OR 
  (annotation6_.group_id in (?)) OR 
  (annotation6_.owner_id = ? AND (select (__g.permissions & 1024) = 1024 from experimentergroup __g where __g.id = annotation6_.group_id)) OR 
  (annotation6_.group_id in (?,?) AND (select (__g.permissions & 64) = 64 from experimentergroup __g where __g.id = annotation6_.group_id)) OR 
  ((select (__g.permissions & 4) = 4 from experimentergroup __g where __g.id = annotation6_.group_id)) 
) inner join annotation annotation7_ on annotation6_.child=annotation7_.id where 
( 
  1 = ? OR 
  1 = ? OR 
  (image0_.group_id in (?)) OR 
  (image0_.owner_id = ? AND (select (__g.permissions & 1024) = 1024 from experimentergroup __g where __g.id = image0_.group_id)) OR 
  (image0_.group_id in (?,?) AND (select (__g.permissions & 64) = 64 from experimentergroup __g where __g.id = image0_.group_id)) OR 
  ((select (__g.permissions & 4) = 4 from experimentergroup __g where __g.id = image0_.group_id)) 
) and screen5_.id=? and annotation7_.textValue=? order by well2_."column", well2_."row"]; ERROR: canceling statement due to statement timeout; nested exception is org.postgresql.util.PSQLException: ERROR: canceling statement due to statemen
t timeout
2024-06-24 12:34:23,490 ERROR [        ome.services.util.ServiceHandler] (.Server-67) Method interface ome.api.IQuery.findAllByQuery invocation took 60052

2024-06-24 12:34:37,857 ERROR [        ome.services.util.ServiceHandler] (.Server-40) Method interface ome.api.IQuery.findAllByQuery invocation took 60046
2024-06-24 12:34:44,946 INFO  [                 org.perf4j.TimingLogger] (.Server-24) start[1719232424846] time[60100] tag[omero.call.exception]
2024-06-24 12:34:44,947 WARN  [        ome.services.util.ServiceHandler] (.Server-24) Unknown exception thrown.

org.springframework.dao.DataAccessResourceFailureException: Hibernate operation: could not execute query; SQL [select image0_.id as id60_, image0_.acquisitionDate as acquisit2_60_, image0_.archived as archived60_, image0_.description as des
cript4_60_, image0_.creation_id as creation10_60_, image0_.external_id as external11_60_, image0_.group_id as group12_60_, image0_.owner_id as owner13_60_, image0_.permissions as permissi5_60_, image0_.update_id as update14_60_, image0_.exp
eriment as experiment60_, image0_.fileset as fileset60_, image0_.format as format60_, image0_.imagingEnvironment as imaging18_60_, image0_.instrument as instrument60_, image0_.name as name60_, image0_.objectiveSettings as objecti20_60_, ima
ge0_.partial as partial60_, image0_.series as series60_, image0_.stageLabel as stageLabel60_, image0_.version as version60_ from image image0_ left outer join wellsample wellsample1_ on image0_.id=wellsample1_.image and 
( 
  1 = ? OR 
  1 = ? OR 
  (wellsample1_.group_id in (?)) OR 
  (wellsample1_.owner_id = ? AND (select (__g.permissions & 1024) = 1024 from experimentergroup __g where __g.id = wellsample1_.group_id)) OR 
  (wellsample1_.group_id in (?,?) AND (select (__g.permissions & 64) = 64 from experimentergroup __g where __g.id = wellsample1_.group_id)) OR 
  ((select (__g.permissions & 4) = 4 from experimentergroup __g where __g.id = wellsample1_.group_id)) 
) inner join well well2_ on wellsample1_.well=well2_.id inner join plate plate3_ on well2_.plate=plate3_.id left outer join screenplatelink screenlink4_ on plate3_.id=screenlink4_.child and 
( 
  1 = ? OR 
  1 = ? OR 
  (screenlink4_.group_id in (?)) OR 
  (screenlink4_.owner_id = ? AND (select (__g.permissions & 1024) = 1024 from experimentergroup __g where __g.id = screenlink4_.group_id)) OR 
  (screenlink4_.group_id in (?,?) AND (select (__g.permissions & 64) = 64 from experimentergroup __g where __g.id = screenlink4_.group_id)) OR 
  ((select (__g.permissions & 4) = 4 from experimentergroup __g where __g.id = screenlink4_.group_id)) 
) inner join screen screen5_ on screenlink4_.parent=screen5_.id left outer join imageannotationlink annotation6_ on image0_.id=annotation6_.parent and 
( 
  1 = ? OR 
  1 = ? OR 
  (annotation6_.group_id in (?)) OR 
  (annotation6_.owner_id = ? AND (select (__g.permissions & 1024) = 1024 from experimentergroup __g where __g.id = annotation6_.group_id)) OR 
  (annotation6_.group_id in (?,?) AND (select (__g.permissions & 64) = 64 from experimentergroup __g where __g.id = annotation6_.group_id)) OR 
  ((select (__g.permissions & 4) = 4 from experimentergroup __g where __g.id = annotation6_.group_id)) 
) inner join annotation annotation7_ on annotation6_.child=annotation7_.id where 
( 
  1 = ? OR 
  1 = ? OR 
  (image0_.group_id in (?)) OR 
  (image0_.owner_id = ? AND (select (__g.permissions & 1024) = 1024 from experimentergroup __g where __g.id = image0_.group_id)) OR 
  (image0_.group_id in (?,?) AND (select (__g.permissions & 64) = 64 from experimentergroup __g where __g.id = image0_.group_id)) OR 
  ((select (__g.permissions & 4) = 4 from experimentergroup __g where __g.id = image0_.group_id)) 
) and screen5_.id=? and annotation7_.textValue=? order by well2_."column", well2_."row"]; ERROR: canceling statement due to statement timeout; nested exception is org.postgresql.util.PSQLException: ERROR: canceling statement due to statemen
t timeout
        at org.springframework.jdbc.support.SQLStateSQLExceptionTranslator.doTranslate(SQLStateSQLExceptionTranslator.java:105)
        at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:73)
        at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:82)
        at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:82)
        at org.springframework.orm.hibernate3.HibernateAccessor.convertJdbcAccessException(HibernateAccessor.java:428)
2024-06-24 12:35:47,050 INFO  [                 org.perf4j.TimingLogger] (.Server-56) start[1719232486582] time[60468] tag[omero.call.exception]
2024-06-24 12:35:47,051 WARN  [        ome.services.util.ServiceHandler] (.Server-56) Unknown exception thrown.

[same DataAccessResourceFailureException / statement-timeout query and stack trace as above]
2024-06-24 12:39:04,561 INFO  [                 org.perf4j.TimingLogger] (.Server-61) start[1719232684542] time[60019] tag[omero.call.exception]
2024-06-24 12:39:04,562 WARN  [        ome.services.util.ServiceHandler] (.Server-61) Unknown exception thrown.

[same DataAccessResourceFailureException / statement-timeout query and stack trace as above]

cc @khaledk2 @sbesson
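
These all fail at almost exactly 60 s (time[60100], time[60468], time[60019]), which looks like a 60-second PostgreSQL statement_timeout being hit while the memo-file jobs are loading the server. A minimal sketch of how to confirm where that limit is set, assuming it is configured at the database/role level (the server side may also impose its own query timeout) and reusing $DBHOST/$PGPASSWORD from earlier:

# check the effective timeout when connecting as the omero role to the idr database
psql -U omero -d idr -h $DBHOST -c "SHOW statement_timeout;"
# see whether it is pinned per-role
psql -U omero -d idr -h $DBHOST -c "SELECT rolname, rolconfig FROM pg_roles WHERE rolname = 'omero';"
# if it needed raising temporarily (an assumption, not something decided in this thread):
#   ALTER ROLE omero SET statement_timeout = '300s';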

will-moore avatar Jun 24 '24 14:06 will-moore

Restarted ALL memo file generation (including idr0013a and idr0016, which had not completed), with e.g.:

screen -dmS cache parallel --eta -a all_plate_ids.txt --results /tmp/ngff_cache_20240624_allplates/ -j10 '/opt/omero/server/OMERO.server/bin/omero render -s localhost -u public -w public test --force'
  • omeroreadonly-1 - all_plate_ids.txt
  • omeroreadonly-2 - all_plate_ids.txt (first 1500 rows removed)
  • omeroreadonly-3 - all_plate_ids.txt (first 3000 rows removed)
  • omeroreadonly-4 - all_plate_ids.txt (first 4500 rows removed)
  • omeroreadwrite - idr0013a (510 plates)

EDIT (2 days later) - I realise I should have used e.g. all_plate_ids_1500.txt rather than all_plate_ids.txt on readonly-2, -3 & -4, since the original all_plate_ids.txt wasn't modified. So all servers ran the same IDs!
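
For the record, a minimal sketch of how those offset lists could be produced so that each read-only server starts at a different point in the full list. The file names follow the EDIT note above but are otherwise assumptions:

# tail -n +K prints from line K onwards, so skipping 1500 rows means starting at 1501
tail -n +1501 all_plate_ids.txt > all_plate_ids_1500.txt   # omeroreadonly-2
tail -n +3001 all_plate_ids.txt > all_plate_ids_3000.txt   # omeroreadonly-3
tail -n +4501 all_plate_ids.txt > all_plate_ids_4500.txt   # omeroreadonly-4
wc -l all_plate_ids*.txt   # quick sanity check that the offsets look right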

will-moore avatar Jun 24 '24 15:06 will-moore

[wmoore@test122-proxy ~]$ for n in $(cat nodes);   do echo $n && ssh $n "grep 'ok:' /tmp/ngff_cache_20240624_*/1/**/* | wc"; done
omeroreadonly-1
    454    1816   48862
omeroreadonly-2
    464    1856   50313
omeroreadonly-3
    426    1704   46100
omeroreadonly-4
Connection to idr-testing.openmicroscopy.org closed by remote host.

Later: still can't ssh to omeroreadonly-4.

[wmoore@test122-proxy ~]$ for n in $(cat nodes);   do echo $n && ssh $n "grep 'ok:' /tmp/ngff_cache_20240624_*/1/**/* | wc"; done
omeroreadonly-1
   1247    4988  134637
omeroreadonly-2
   1257    5028  136078
omeroreadonly-3
   1219    4876  131853
...
[wmoore@test122-omeroreadwrite ~]$ grep 'ok:' /tmp/ngff_cache_20240624_*/1/**/* | wc
    504    2016   55238
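
Counting 'ok:' lines shows progress but not failures. A sketch of how failed plates might be spotted in the parallel --results output, assuming the same results layout as the grep above and that a non-empty stderr or a missing 'ok:' line indicates a problem rather than just a slow job:

shopt -s globstar   # needed for ** in bash
# result dirs whose stderr is non-empty - worth a closer look
find /tmp/ngff_cache_20240624_*/1 -name stderr -size +0c | head
# stdout files with no 'ok:' line at all
grep -L 'ok:' /tmp/ngff_cache_20240624_*/1/**/stdout | head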

will-moore avatar Jun 24 '24 16:06 will-moore