Fix postgres queue issue 369
[X ] Bugfix Feature Refactoring (no functional changes, no api changes) Build related changes WHOSUSING.md Other (please describe): NOTE: Please remember to run ./gradlew spotlessApply to fix any format violations.
Changes in this PR change the query to always get the limit, and throw exception if return messages is more than the requested count
Describe the new behavior from this PR, and why it's needed Issue https://github.com/conductor-oss/conductor/issues/369 new query is honoring the limit in the query, it is also limiting the popped messages twice so even if the select part is returning more than the requested count, the query will update only the count out of it. it is also throwing if in any reason it will return more than the requested count - so that the clients won't fail
Thank you for addressing the PostgreSQL storage issue with this pull request. I have some concerns about the introduction of a new exception:
Necessity of the Exception: If the database query functions correctly with this PR, is the new exception necessary? Could it be removed to avoid potential disruptions?
Handling Raised Exceptions: How will the system handle scenarios where this exception is raised? Could it prevent workflows from being processed? How will operators be informed, and what steps should they take to resolve such issues without modifying the Conductor code?
Thank you for addressing the PostgreSQL storage issue with this pull request. I have some concerns about the introduction of a new exception:
Necessity of the Exception: If the database query functions correctly with this PR, is the new exception necessary? Could it be removed to avoid potential disruptions? It is working correctly with this PR, however, returning more tasks than clients asked can result in client crash or hang, and leaving popped tasks that will not be handled until timeout, the exception will revert the transaction and return the messages to the queue, I tested the exception with the previous query and when it was thrown the next poll was trying to handle the messages ( when it faced the wrong result from the query it did tend to give wrong result for few seconds, but it finally succeeded).
Handling Raised Exceptions: How will the system handle scenarios where this exception is raised? Could it prevent workflows from being processed? How will operators be informed, and what steps should they take to resolve such issues without modifying the Conductor code? no need to change the code - the exception is already handled - and the next poll will try to get messages again. workflows continue to be processed.
@v1r3n @jmigueprieto Please review
My concern regarding the introduction of a new exception in this pull request remains:
Query Accuracy: If the revised SQL query still necessitates exception handling due to returning multiple rows for the same message_id and queue_name, it suggests that the query may not be functioning correctly. The table is designed with a composite primary key on these columns, preventing duplicate messages in the same queue. Therefore, the query should be adjusted to respect this constraint and return only the intended results.
Potential Workflow Disruption: Introducing an exception in this context could lead to significant issues. If the query continues to return too many results, the exception would be repeatedly triggered, causing messages to remain unprocessed. This scenario represents a major regression, as workflows and events would cease to function without clear indications of the underlying problem.
Recommendation: It's crucial to ensure that the SQL query is correctly formulated to prevent the retrieval of excess tasks. By doing so, we can maintain data integrity and avoid potential regressions that could adversely affect other users. Additionally, introducing exceptions where they are not necessary can lead to code obfuscation and maintenance challenges. Therefore, it would be prudent to remove the unnecessary exception.
@Robban1980 maybe I wasn't clear before, the new query is accurate and returns the right amount of messages, it was tested with a scenario that was constantly showing the issue with the previous query, and the problem didn't reproduce with the new query.
regarding the exception, exceptions are intended to protect from unexpected results, in this case - a change in postgres DB that might break this query (even if its unlikely to happen), as I mentioned returning more messages than requested can result in clients hang or crash, hence this additional protection layer was added. removing the exception part from this PR - will not change the fact that it fix the issue (#369 ), so I can remove it if more reviewer think I should.
This is being closed due to: Orkes changes being backported here with a CTE - there was issues with this implementation in the PR when actually running some as noted above.