
feat: implement project locks containing multiple job ids

Open ajeans opened this issue 4 months ago • 14 comments

Replace old format Map<Long, Long> with new format Map<Long, Set<Long>> to support configurable concurrent jobs per project.

A one-time Redis migration runs automatically on startup and converts existing lock entries to the new format.
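A minimal sketch of that conversion (illustrative only; the function name is assumed, and we assume a non-positive job id marked "unlocked" in the legacy format, which the new format represents as an empty set):

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Illustrative sketch, not the actual Tolgee code: convert legacy
// single-job entries (Map<Long, Long>) to the new set-based format
// (Map<Long, Set<Long>>).
fun migrateLegacyLocks(legacy: Map<Long, Long>): ConcurrentHashMap<Long, Set<Long>> {
    val migrated = ConcurrentHashMap<Long, Set<Long>>()
    legacy.forEach { (projectId, jobId) ->
        // Assumption: non-positive id meant "no job holds the lock"
        migrated[projectId] = if (jobId <= 0L) emptySet() else setOf(jobId)
    }
    return migrated
}
```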

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Configure how many batch jobs can run in parallel per project.
    • Project locks can track multiple concurrent jobs and return detailed info for each.
    • Lock status now treats an empty set as unlocked, and existing locks are migrated automatically.
  • Bug Fixes

    • A warning is now logged when info for a locked job is unavailable.
  • Tests

    • Tests converted to parameterized concurrency checks and updated to verify collection-based lock state.

ajeans avatar Oct 11 '25 08:10 ajeans

Walkthrough

Project batch locking changed from a single-job lock per project to a set of job IDs with configurable per-project concurrency. Public API models, locking manager, startup migration, and tests were updated to use Set<Long> for locks, fetch multiple JobInfo entries, and migrate legacy Redis entries.

Changes

  • API: Project lock model and status
    backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/administration/ProjectBatchLockController.kt
    ProjectLockModel now exposes lockedJobIds: Set<Long> and jobInfos: List<JobInfo>; lock status is derived from set emptiness; JobInfo is fetched for all locked IDs, with warnings logged for missing jobs.
  • Locking core & migration
    backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt
    Lock map changed to ConcurrentMap<Long, Set<Long>>; methods updated to work with sets (getMap, getLockedForProject, getLockedJobIds, unlockJobForProject); added InitializingBean.afterPropertiesSet migration from legacy Long? entries; added REDIS_PROJECT_BATCH_JOB_LOCKS_KEY and logging updates; constructor now accepts BatchProperties.
  • Config: concurrency property
    backend/data/src/main/kotlin/io/tolgee/configuration/tolgee/BatchProperties.kt
    Added projectConcurrency: Int property controlling parallel jobs per project across instances.
  • Tests & utilities: adapt and parameterize
    backend/app/src/test/kotlin/io/tolgee/batch/BatchJobTestUtil.kt
    backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/BatchJobManagementControllerCancellationTest.kt
    backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/batch/BatchJobManagementControllerTest.kt
    backend/app/src/test/kotlin/io/tolgee/util/BatchDumper.kt
    Tests updated to expect Set<Long> (use .isEmpty()), added injections for BatchProperties and BatchJobProjectLockingManager, parameterized tests over projectConcurrency (values [1,2]), added a cancellation test for concurrency=2, and added finallyDumpAll/getAllJobs utilities.
  • Docs/comment
    backend/data/src/main/kotlin/io/tolgee/batch/BatchJobConcurrentLauncher.kt
    Comment updated to describe configurable per-project concurrency; no functional change.
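The set-based locking described above can be sketched as an atomic check against the configured capacity. This is a simplified, in-memory illustration, not the actual BatchJobProjectLockingManager; the Redis-backed variant would need an equivalent atomic update:

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Simplified sketch of per-project locking with configurable concurrency.
class ProjectLockSketch(private val projectConcurrency: Int) {
    private val locks = ConcurrentHashMap<Long, Set<Long>>()

    // Atomically add jobId if the project is below capacity.
    fun tryLock(projectId: Long, jobId: Long): Boolean {
        var acquired = false
        locks.compute(projectId) { _, current ->
            val jobs = current ?: emptySet()
            if (jobId in jobs || jobs.size < projectConcurrency) {
                acquired = true
                jobs + jobId
            } else {
                jobs // at capacity: leave the set unchanged
            }
        }
        return acquired
    }

    // Remove jobId; an empty set means "unlocked", matching the new semantics.
    fun unlock(projectId: Long, jobId: Long) {
        locks.computeIfPresent(projectId) { _, jobs -> jobs - jobId }
    }

    fun lockedJobIds(projectId: Long): Set<Long> = locks[projectId] ?: emptySet()
}
```

With projectConcurrency = 1 this degenerates to the previous one-job-per-project behaviour, which is why a default of 1 leaves existing deployments unchanged.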

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant Controller as ProjectBatchLockController
  participant Manager as BatchJobProjectLockingManager
  participant Store as Redis/Local Map

  rect rgba(230,245,255,0.6)
  note over Manager: Startup migration (afterPropertiesSet)
  Manager->>Store: Read legacy Long? entries
  alt legacy format present
    Manager->>Store: Write converted Set<Long> entries
  else
    note over Manager: Already set-based
  end
  end

  Client->>Controller: GET /locks
  Controller->>Manager: getMap()/getLockedForProject()
  Manager->>Store: Read Set<Long> per project
  Store-->>Manager: Set<Long>
  Manager-->>Controller: Set<Long> per project
  Controller->>Manager: fetch JobInfo for each jobId
  Manager-->>Controller: List<JobInfo> (filtered)
  Controller-->>Client: ProjectLockModel{lockedJobIds, jobInfos, status}

  rect rgba(235,255,235,0.6)
  note over Client,Manager: Lock attempt
  Client->>Manager: lock(projectId, jobId)
  Manager->>Store: Atomic compute: read Set<Long>
  alt size < projectConcurrency
    Store-->>Manager: Updated Set + jobId
    Manager-->>Client: Locked
  else
    Manager-->>Client: Rejected (at capacity)
  end
  end

  rect rgba(255,240,240,0.6)
  note over Client,Manager: Unlock
  Client->>Manager: unlock(projectId, jobId)
  Manager->>Store: Atomic compute: remove jobId from Set<Long>
  Store-->>Manager: Updated Set<Long>
  Manager-->>Client: Unlocked/No-op
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • tolgee/tolgee-platform#3265 — touches BatchJobProjectLockingManager and initial job handling; likely overlaps with the set-based lock conversion and migration.

Suggested labels

enhancement

Suggested reviewers

  • JanCizmar

Poem

I hop through locks now counted by twos,
From single hops to a flock of queued views.
Carrots for concurrency, neat rows to tend,
Old burrows migrated, new paths to mend.
A rabbit cheers each job that can run—hooray! 🥕🐇

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The PR title "feat: implement project locks containing multiple job ids" directly and accurately reflects the core functionality being introduced in this changeset. The primary change across all modified files is the replacement of single job ID locks (lockedJobId: Long?) with multiple job ID locks (lockedJobIds: Set<Long>), along with supporting infrastructure such as the new projectConcurrency configuration property. The title is concise, specific, and clearly communicates the main objective without vague terms or unnecessary detail. A teammate scanning the pull request history would immediately understand that this PR introduces multi-job locking capability at the project level.
✨ Finishing touches
  • [ ] 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • [ ] Create PR with unit tests
  • [ ] Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot] avatar Oct 11 '25 08:10 coderabbitai[bot]

@bdshadow I took care of 3 comments and replied to the big comment regarding the Redis data migration.

ajeans avatar Oct 15 '25 07:10 ajeans

@ajeans one more comment for discussion here. Jan has already written these concerns about multiple jobs per project:

> When implementing this I wanted to prevent some unexpected situations. For example, a user could start different jobs with the same keys, and then 2 parallel jobs might modify data of the same keys. E.g., one job would translate the keys while another job would delete them. I also wanted to prevent one project from taking over all the workers while other projects would have to wait.

How do you think it should be handled? Did you try to run 2 jobs with the same keys? Or to do a batch delete while a batch translation is running? Actually, some tests for these cases would be good. They would show what the expected behaviour is in these cases.

bdshadow avatar Oct 16 '25 21:10 bdshadow

> @ajeans one more comment for discussion here. Jan has already written these concerns about multiple jobs per project:
>
> > When implementing this I wanted to prevent some unexpected situations. For example, a user could start different jobs with the same keys, and then 2 parallel jobs might modify data of the same keys. E.g., one job would translate the keys while another job would delete them. I also wanted to prevent one project from taking over all the workers while other projects would have to wait.
>
> How do you think it should be handled? Did you try to run 2 jobs with the same keys? Or to do a batch delete while a batch translation is running? Actually, some tests for these cases would be good. They would show what the expected behaviour is in these cases.

@bdshadow unfortunately, that is beyond my area of expertise.

We've been running with a project concurrency of 10 and have seen no significant problems in production. I admit we have only one LLM configured to asynchronously modify translations, not several providers at once.

My assumption was that you would keep the project concurrency at 1. We could even add a warning about this, saying anything >1 is experimental.

Regards

ajeans avatar Oct 17 '25 15:10 ajeans

@ajeans thank you for the clarification. As far as I remember, the main problem you were trying to solve is that the batch translation of a huge number of keys took you too much time. Did you try to increase the BatchProperties.concurrency after the first fix was merged? I saw your email, where you said that you experienced problems with it, but I believe that was before the first PR was merged.

bdshadow avatar Oct 17 '25 16:10 bdshadow

> @ajeans thank you for the clarification. As far as I remember, the main problem you were trying to solve is that the batch translation of a huge number of keys took you too much time. Did you try to increase the BatchProperties.concurrency after the first fix was merged? I saw your email, where you said that you experienced problems with it, but I believe that was before the first PR was merged.

@bdshadow

What we are currently running in production is

  • the branch in the original PR: https://github.com/tolgee/tolgee-platform/pull/3262
  • 2 servers that serve HTTP traffic running with TOLGEE_BATCH_CONCURRENCY=10 and TOLGEE_BATCH_PROJECT_CONCURRENCY=1
  • 2 servers that don't serve HTTP traffic running with TOLGEE_BATCH_CONCURRENCY=10 and TOLGEE_BATCH_PROJECT_CONCURRENCY=5

In terms of code, the running code is very close to https://github.com/tolgee/tolgee-platform/pull/3265 plus https://github.com/tolgee/tolgee-platform/pull/3271. Main differences are tests (much better in the upstream version) and some code style improvements that didn't change the logic.

Now if I explain the history on our production systems.

  • version v3.123.13 failed to process batch jobs; we would handle at most 10 a day
  • version v3.123.13 + https://github.com/tolgee/tolgee-platform/pull/3265 processed between 300 and 1,000 batch jobs a day
  • version v3.123.13 + https://github.com/tolgee/tolgee-platform/pull/3265 + https://github.com/tolgee/tolgee-platform/pull/3271, with the setup described above, processes 10,000 to 20,000 batch jobs a day

As we have 1 project in production, obviously the per project lock is a major bottleneck.

We are well on our way to get the batch jobs backlog down to 0 in a few days with this version.

My current plan, once this PR is merged and the backlog is gone in production, is to move to the latest vanilla version and do the regression testing in QA and the battle testing in production.

Hope it makes sense. 🙏

ajeans avatar Oct 18 '25 16:10 ajeans

@ajeans thank you very much for a detailed explanation.

I understand your motivation behind these changes. However, it will affect lots of other batch processes, which can lead to unpredictable results when they are launched simultaneously. I also understand that in your specific case it works, but I'm trying to find a solution which is more general.

Did you try to just increase the TOLGEE_BATCH_CONCURRENCY? I mean, run it with just the first fix, and for servers that don't serve HTTP traffic just increase the TOLGEE_BATCH_CONCURRENCY to, for example, 20, or even more. Maybe in this case the BatchProperties.chunkQueuePopulationSize can be increased too.

bdshadow avatar Oct 20 '25 21:10 bdshadow

> @ajeans thank you very much for a detailed explanation.

Hi @bdshadow

> I understand your motivation behind these changes. However, it will affect lots of other batch processes, which can lead to unpredictable results when they are launched simultaneously. I also understand that in your specific case it works, but I'm trying to find a solution which is more general.

Feel free to find other ways to address the scalability issue with big projects with a lot of translations. I'm eager to have Tolgee servers that can address my traffic in production, but I have no emotional attachment to this PR.

I only went this route and submitted this PR because @JanCizmar suggested it was a way to address this. And if you make "1" the default, there is absolutely no downside to it. I don't understand the reluctance, but that is obviously your call.

Regardless of this PR or any other solution, do you think Tolgee will support my performance requirements in the near future? I'd rather know now if this will not happen, please. Production traffic will keep rising in the coming months, and I'd rather plan the next steps ahead.

> Did you try to just increase the TOLGEE_BATCH_CONCURRENCY? I mean, run it with just the first fix, and for servers that don't serve HTTP traffic just increase the TOLGEE_BATCH_CONCURRENCY to, for example, 20, or even more. Maybe in this case the BatchProperties.chunkQueuePopulationSize can be increased too.

AFAIU, current main is identical to the version that managed 300 to 1000 batches a day. (see https://github.com/tolgee/tolgee-platform/pull/3271#issuecomment-3418621310). There were no changes in the batches in the last 3 months, and my patches applied cleanly.

I don't believe that playing with chunks will be a 10x improvement to the workload, especially as we are mostly bound to the active number of threads calling OpenAI translations.

I'm on a vacation next week. If you have more questions I will answer them when I'm back.

@JanCizmar by then, can you weigh in on whether you see Tolgee handling our performance requirements?

ajeans avatar Oct 24 '25 12:10 ajeans

Hello @ajeans !

Thanks for raising this pull request and for the detailed context 😊

@bdshadow is part of the Tolgee team and he’s currently focused on improving batch operations performance across both self‑hosted and Tolgee Cloud instances. We absolutely agree that the current batch throughput is a bottleneck.

At the same time, we’re cautious about making the project lock limit configurable. Allowing multiple concurrent jobs per project could lead to race conditions or data inconsistencies when two jobs operate on the same keys. That’s why we originally kept the per‑project limit at one. If you need more throughput, increasing the global TOLGEE_BATCH_CONCURRENCY should have the same effect. If it doesn’t, it suggests another bottleneck that we need to fix, and we’ll look into that.

Our goal is to find a solution that keeps project locking safe while giving a significant performance boost. Dmitrii is working hard on this now, and we hope to have a robust fix soon. We should also improve our internal monitoring so we can see potential bottlenecks in Grafana.

Thanks again for tackling this and providing so much context; it really helps us prioritise and improve Tolgee 🚀

JanCizmar avatar Oct 24 '25 13:10 JanCizmar

@JanCizmar

Thank you for the detailed response. I'd like to provide more context on why increasing global TOLGEE_BATCH_CONCURRENCY alone doesn't solve the scalability issue for our production workload. I also want to share the performance requirements that we have and that would be expected from the upstream version of tolgee

Production Timeline

(graph: tolgee-production-timeline)

This graph shows daily successful batch job completions from September through today:

  1. Baseline (Sept 1-22): ~10-100 jobs/day with various TOLGEE_BATCH_CONCURRENCY settings (1-20). The system was severely bottlenecked.

  2. Mass Cancellation (Sept 22): We manually cancelled ~110,000 stuck jobs to try to recover.

  3. PR #3265 Applied (~Sept 29): The phantom lock fixes brought us to ~1,000 jobs/day - a significant improvement, but still insufficient.

  4. PR #3265 + #3271 Applied (~Oct 6): Adding multi-job-per-project capability with projectConcurrency=5 jumped throughput to 15,000-21,000 jobs/day - a 15x improvement over #3265 alone.

Our Production Architecture

  • 4 Tolgee servers in a Redis-coordinated cluster

    • 2 servers handle HTTP traffic: TOLGEE_BATCH_CONCURRENCY=10, TOLGEE_BATCH_PROJECT_CONCURRENCY=1

    • 2 dedicated batch workers: TOLGEE_BATCH_CONCURRENCY=10, TOLGEE_BATCH_PROJECT_CONCURRENCY=5

  • 1 large project with >300,000 translation keys

  • Continuous batch translation workload (LLM-based)

Why Global Concurrency Alone Doesn't Work

The issue is the per-project lock bottleneck:

4 servers × 10 concurrent workers = 40 threads trying to process jobs
  ↓
All jobs belong to 1 project
  ↓
Only 1 job can hold the project lock at a time
  ↓
Only chunks from that 1 job can be processed across all servers
  ↓
Other threads spin waiting for chunks from the locked job

With the current single-job-per-project locking:

  • Only 1 job across all 4 servers can hold the project lock

  • Threads can process multiple chunks from that one job concurrently (good!)

  • BUT when chunks from the locked job aren't readily available in the local queue, threads must wait

  • Meanwhile, chunks from other jobs (which could be processed) are blocked by the project lock

  • The system becomes starved: threads are idle waiting for the "right" chunks while work sits in queue

  • Result: ~1,000 jobs/day maximum throughput despite 40 available threads

With projectConcurrency=5:

  • 5 different jobs can hold the project lock simultaneously

  • This means chunks from 5 different jobs are eligible for processing at any time

  • Threads have a much higher probability of finding processable chunks in their queue

  • Better work distribution across the 40 available threads

  • Result: ~15,000 jobs/day throughput

The bottleneck isn't CPU or I/O - it's work starvation caused by the single-job lock limiting which chunks are eligible for processing.

Addressing the Safety Concerns

I understand the concern about concurrent modifications to the same keys. However:

  1. Chunk-level isolation already exists: Each BatchJobChunkExecution is an atomic unit. Two chunks never operate on the same keys simultaneously - that's guaranteed by the chunking logic itself.

  2. Job-level conflicts are rare in practice: In our production workload, users don't typically launch conflicting job types (translate + delete) on the same keys simultaneously. When they do create a new batch job, it's usually after the previous one completes.

  3. Default value keeps existing behavior: With projectConcurrency defaulting to 1, existing deployments get zero behavior change. The safety model remains identical.

  4. Opt-in for advanced users: Deployments like ours with a single large project and controlled batch workflows can opt into higher concurrency after understanding the trade-offs.

Path Forward

I respectfully suggest:

  • Keep projectConcurrency=1 as the default (no behavior change for existing users)

  • Add a configuration warning in the docs: "Values >1 are experimental and should only be used in controlled environments where batch job conflicts are managed externally"

This unblocks high-throughput single-project deployments like ours and gives the Tolgee team time to explore a more comprehensive solution for multi-project safety, while Rakuten France can safely operate during the shopping season.

The alternative is that deployments with large projects cannot scale beyond ~1,000 jobs/day on the current architecture, even with unlimited hardware resources (Tolgee instances or Postgres databases).

Thanks for your consideration!

ajeans avatar Oct 25 '25 10:10 ajeans

@ajeans thank you very much for such a detailed report. It is really helpful.

What I don't like is the part "With the current single-job-per-project locking" about the processing of chunks by threads. Especially the following:

> BUT when chunks from the locked job aren't readily available in the local queue, threads must wait

And when I say I don't like it, I mean the way Tolgee does it.

I think I found the root of the problem with populating the in-memory queue. Please check the first pull request: https://github.com/tolgee/tolgee-platform/pull/3287. It has a detailed explanation, which is also duplicated in the commit messages.

In short: locking the DB rows didn't work when selecting chunks for processing.

But as described in the commit messages, it still doesn't fully fix the problem. Imagine we have 2 pods running, each trying to populate its in-memory queue. The first one locks the rows in the chunks table and sends them to processing. The transaction finishes and the rows in the chunks table are unlocked. Now the second pod comes along before the first pod has actually started processing the chunks (i.e., before the status update from PENDING to RUNNING), selects exactly the same chunks, and sends them to its own queue. Now these pods are conflicting over processing the same chunks.

I see that it can be fixed by introducing a new status for chunks: NEW. When BatchJobChunkExecutionQueue.populateQueue runs with the DB rows locked, it will also update the status from NEW to PENDING. This way, another pod won't be able to take them, and chunks will be uniquely distributed among all k8s pods.
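That claim step could look roughly like this. It is a toy in-memory model of the idea only (the ChunkStatus names and the store are assumptions); the real fix would do the NEW to PENDING flip in the same transaction that locks the DB rows:

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Toy model of the proposed fix: chunks start as NEW, and populateQueue
// atomically claims them (NEW -> PENDING), so two pods polling the same
// store can never enqueue the same chunk twice.
enum class ChunkStatus { NEW, PENDING, RUNNING }

class ChunkStore {
    private val statuses = ConcurrentHashMap<Long, ChunkStatus>()

    fun add(chunkId: Long) { statuses[chunkId] = ChunkStatus.NEW }

    fun claimNewChunks(limit: Int): List<Long> {
        val claimed = mutableListOf<Long>()
        for (id in statuses.keys) {
            if (claimed.size == limit) break
            // Compare-and-set: only one caller wins the NEW -> PENDING flip
            if (statuses.replace(id, ChunkStatus.NEW, ChunkStatus.PENDING)) {
                claimed += id
            }
        }
        return claimed
    }
}
```

Two pods calling claimNewChunks concurrently would then receive disjoint chunk sets, which is exactly the unique distribution described above.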

I hope that with these changes there won't be a need to allow multiple batch jobs per project.

bdshadow avatar Oct 26 '25 21:10 bdshadow

I've made a draft of the refactored batch job processing. Although I like it more :), the processing performance was only a little better (and I think it's too big a refactoring for something that already works; read further :))

I've started testing the main branch locally. I ran 2 instances of the app with Redis and tried batch machine translation of 10k keys:

| BatchProperties.concurrency | BatchProperties.chunkQueuePopulationSize | Time |
| --- | --- | --- |
| 1 | 1000 | 3:30 |
| 2 | 1000 | 2:03 |
| 3 | 1000 | 1:41 |
| 4 | 1000 | 1:42 |

So I bumped into the same problem: increasing the number of threads doesn't improve the performance. The reason is:

> BUT when chunks from the locked job aren't readily available in the local queue, threads must wait

BUT! The solution is simple: increase BatchProperties.chunkQueuePopulationSize:

With concurrency 4 and chunkQueuePopulationSize 1200, the time drops to 1:20.

Now the queue is populated quicker and is never empty, threads don't wait, and performance improves.

@ajeans could you please try the latest version from main and increase the BatchProperties.chunkQueuePopulationSize as well as the BatchProperties.concurrency?

Alternatively, we can add an additional trigger to fetch pending chunks from the DB when the queue is empty. For example, somewhere in BatchJobConcurrentLauncher.getSleepTime, like this:

  private fun getSleepTime(
    startTime: Long,
    somethingHandled: Boolean,
  ): Long {
    if (!batchJobChunkExecutionQueue.isEmpty() && jobsToLaunch > 0 && somethingHandled) {
      return 0
    }
    // Proposed addition: refill the queue before sleeping so idle threads find work sooner
    batchJobChunkExecutionQueue.populateQueue()
    return MIN_TIME_BETWEEN_OPERATIONS - (System.currentTimeMillis() - startTime)
  }

bdshadow avatar Nov 04 '25 20:11 bdshadow

> I've made a draft of the refactored batch job processing. Although I like it more :), the processing performance was only a little better (and I think it's too big a refactoring for something that already works; read further :))
>
> I've started testing the main branch locally. I ran 2 instances of the app with Redis and tried batch machine translation of 10k keys:
>
> | BatchProperties.concurrency | BatchProperties.chunkQueuePopulationSize | Time |
> | --- | --- | --- |
> | 1 | 1000 | 3:30 |
> | 2 | 1000 | 2:03 |
> | 3 | 1000 | 1:41 |
> | 4 | 1000 | 1:42 |
>
> So I bumped into the same problem: increasing the number of threads doesn't improve the performance. The reason is:
>
> > BUT when chunks from the locked job aren't readily available in the local queue, threads must wait
>
> BUT! The solution is simple: increase BatchProperties.chunkQueuePopulationSize:
>
> With concurrency 4 and chunkQueuePopulationSize 1200, the time drops to 1:20.
>
> Now the queue is populated quicker and is never empty, threads don't wait, and performance improves.
>
> @ajeans could you please try the latest version from main and increase the BatchProperties.chunkQueuePopulationSize as well as the BatchProperties.concurrency?

Hello @bdshadow

Very interesting.

Unfortunately, I cannot test this in production; maybe early next year, after the shopping season. Besides, with the current production setup my backlog problem has been solved: we translate asynchronously in near real time.

Several questions:

  • what happens if you create 100k keys? Does the time increase linearly, or is the degradation worse? (In case you have an easy way to test this, maybe even test scripts you can share.)
  • just to make sure: in your tests, do all keys belong to the same project, and do all threads synchronize on the Redis project lock?
  • in your tests, I assume you mocked the translations. Do your mocks return instantly, or do they actually pause for a few seconds to simulate a real API call? I believe the mean translation time of 30 seconds hurts my use case a lot.

> Alternatively, we can add an additional trigger to fetch pending chunks from the DB when the queue is empty. For example, somewhere in BatchJobConcurrentLauncher.getSleepTime, like this:

  private fun getSleepTime(
    startTime: Long,
    somethingHandled: Boolean,
  ): Long {
    if (!batchJobChunkExecutionQueue.isEmpty() && jobsToLaunch > 0 && somethingHandled) {
      return 0
    }
    // Proposed addition: refill the queue before sleeping so idle threads find work sooner
    batchJobChunkExecutionQueue.populateQueue()
    return MIN_TIME_BETWEEN_OPERATIONS - (System.currentTimeMillis() - startTime)
  }

Oh, you want to populate the queue faster. The problems I saw at the beginning were with overly large queues blocked on the lock, but since this would focus on pending chunks, it might be a good idea.

ajeans avatar Nov 07 '25 14:11 ajeans

This PR is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar Dec 08 '25 02:12 github-actions[bot]