
Enable multiple cache, job_work dirs per object-store

Open bgruening opened this issue 7 months ago • 4 comments

We would like to have the possibility to specify multiple cache dirs and job_work dirs per object store. With support for weights we could do round-robin over those dirs and suspend/replace job_work dirs more easily. Having multiple cache dirs also has a few advantages imho.

The downside, I think, is that Galaxy needs to "search" for the correct dir for each job, so the worst case is probably that Galaxy needs to do an os.path.exists() per cache/job_work dir per job. But maybe this could be cached in Python?

type: generic_s3
auth:
  access_key: ...
  secret_key: ...
bucket:
  name: unique_bucket_name_all_lowercase
  use_reduced_redundancy: false
  max_chunk_size: 250
connection:
  host: swift.example.org
  port: 6000
  conn_path: /
  multipart: true
cache:
  - path: database/object_store_cache_01
    size: 1000
    cache_updated_data: true
    weight: 2
  - path: database/object_store_cache_02
    size: 1000
    cache_updated_data: true
    weight: 1
extra_dirs:
  - type: job_work
    path: database/job_working_directory_01
    weight: 2
  - type: job_work
    path: database/job_working_directory_02
    weight: 1
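
A minimal sketch of the two pieces described above, assuming a weighted random pick (as a stand-in for round-robin) for new data and a memoized lookup to find which dir already holds a job's data; the dir list, function names and in-process lru_cache memoization are assumptions for illustration, not Galaxy's actual code:

# Sketch only: weighted choice of a dir for new data, and a cached
# existence check to locate data that already lives in one of the dirs.
import os
import random
from functools import lru_cache

# (path, weight) pairs mirroring the example config above
CACHE_DIRS = [
    ("database/object_store_cache_01", 2),
    ("database/object_store_cache_02", 1),
]

def pick_dir(dirs):
    """Pick a dir for new data, weighted by the configured weights."""
    paths = [p for p, _ in dirs]
    weights = [w for _, w in dirs]
    return random.choices(paths, weights=weights, k=1)[0]

@lru_cache(maxsize=None)
def find_existing_dir(rel_path):
    """Return the dir that already holds rel_path, or None.

    Worst case is one os.path.exists() per configured dir, but the result
    is memoized so each path is only probed once per process.
    """
    for path, _ in CACHE_DIRS:
        if os.path.exists(os.path.join(path, rel_path)):
            return path
    return None

if __name__ == "__main__":
    print(pick_dir(CACHE_DIRS))
    print(find_existing_dir("000/dataset_1.dat"))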

bgruening · Apr 21 '25 19:04

We store object_store_id on the job table for the distributed object store, so if the extra dirs were implemented as distributed backends with their own IDs then it wouldn't need to search. And there should be no performance penalty for the default case anyway since there would be no need to search when there is only one dir of that type.
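
A minimal sketch of this idea, assuming each job_work dir gets an id that is recorded with the job (the ids, dir mapping, and helper below are hypothetical, not Galaxy's implementation): resolution becomes a dict lookup, with no filesystem search, and the single-dir default case just always resolves to the only entry.

# Sketch only: resolving a job's working dir by a stored id instead of
# probing the filesystem.
JOB_WORK_DIRS = {
    "jwd01": "database/job_working_directory_01",
    "jwd02": "database/job_working_directory_02",
}

def resolve_job_work_dir(stored_id, default_id="jwd01"):
    """Return the dir recorded for this job; fall back to the default
    when nothing was stored (e.g. only one dir is configured)."""
    return JOB_WORK_DIRS[stored_id or default_id]

print(resolve_job_work_dir("jwd02"))  # database/job_working_directory_02
print(resolve_job_work_dir(None))     # database/job_working_directory_01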

natefoo · Apr 21 '25 19:04

I assume this is triggered by some limitation you've run into; can you tell us what that is?

mvdbeek · Apr 22 '25 13:04

There are multiple reasons: space, performance, and flexibility.

In our setup we run most of the jobs on one system. We had, and still have, a few boxes serving as JWDs, and now we also have boxes serving as the cache for the S3. So space is one concern: we would like to distribute jobs over multiple boxes to increase the available space. This is true for the JWDs, and we assume it will also be true for the S3 cache.

Performance is another problem. We have a few nasty tools, and in combination with many nodes we have had good experiences with separating the load onto multiple shares. We have multiple physical boxes with separate network cards etc., and we used to do round-robin via multiple backends (using weights for the same physical hard drive). However, that configuration was a bit hacky, and it would be easier if we could do the configuration on the jwd/cache setting rather than on entire object stores. That setup also let us retire/maintain JWDs by simply changing the weights, which gave us some nice extra flexibility.

In addition, I don't see how I can configure an S3 cache that spans multiple mounts.

bgruening · Apr 22 '25 13:04

Galaxy Australia also has a use case for this. It would be useful to put the Galaxy job working directories for jobs that run on Pulsar in a different location from those for jobs that run locally, without using separate object store backends. I have tried overriding job_working_directory for individual jobs using TPV, but this doesn't work because the JWD from the object store backend is always used.

cat-bro · May 25 '25 13:05