Test server timeouts improperly set to 10 years when not user set
Expected Behavior
The real server doesn't set WorkflowExecutionStartedEventAttributes.workflow_execution_timeout or PollActivityTaskQueueResponse.schedule_to_close_timeout if the user didn't set them.
Actual Behavior
The test server sets these to 10 years when the user didn't set them. While this may make sense internally to bound the timer, APIs such as https://www.javadoc.io/static/io.temporal/temporal-sdk/1.16.0/io/temporal/activity/ActivityInfo.html#getScheduleToCloseTimeout() should still be accurate.
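For illustration, a minimal sketch of how this surfaces inside an activity, assuming the Go SDK's activity.GetInfo API (Info.Deadline is derived from these timeouts); against the test server it reports a deadline roughly 10 years out even though the workflow never set a schedule-to-close timeout:

import (
	"context"
	"log"
	"time"

	"go.temporal.io/sdk/activity"
)

// InspectDeadlineActivity logs the deadline the server reported for this
// activity. With no ScheduleToCloseTimeout set by the workflow, the test
// server yields a deadline ~10 years in the future rather than "unset".
func InspectDeadlineActivity(ctx context.Context) error {
	info := activity.GetInfo(ctx)
	log.Printf("deadline: %v (in %v)", info.Deadline, time.Until(info.Deadline))
	return nil
}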
This is a "feature". Instead of just hanging the unit test, we force the workflow to time out in the test environment if nothing would happen with it and it will hang. But I see how this feature may be confusing. This needs a broader discussion.
I can understand that it makes sense to internally bound the timer, but we should make the API responses accurate if possible.
Hello. This issue just took down all of our production workflows because we made the incorrect assumption that our Workflow Tests would have caught this issue.
Are there any updates on this? It feels odd to call this a feature since the test environment should mimic the production environment for proper testing. Is there a workaround where, instead of actually setting these values on the server, we could just validate that the config is correct?
I am sorry to hear this caused an issue for you in production. There is no update here, but I am curious: what was the exact issue you hit?
I would think of the test server as a good tool to unit test workflows, but it is no replacement for testing your workflows against a real Temporal cluster with the same settings as your production cluster before going to production.
It feels odd to call this a feature since the test environment should mimic the production environment for proper testing.
The overwhelming feedback we have received from users is that the test server should optimize for the test experience over mimicking exact server behaviour.
setting these values on the server, we could just validate that the config is correct?
I'm not sure exactly which settings you are referring to, but in general the Temporal server is the authority on which settings are valid; for some things, the allowed values depend on the server's dynamic configs, so the SDK cannot know ahead of time.
Thanks for the quick response @Quinn-With-Two-Ns!
Our issue was the need to set either StartToCloseTimeout or ScheduleToCloseTimeout for every activity invocation:
BadScheduleActivityAttributes: A valid StartToClose or ScheduleToCloseTimeout is not set on ScheduleActivityTaskCommand.
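For reference, a minimal Go sketch of activity options that satisfy this check (MyActivity and MyWorkflow are placeholder names): a real cluster requires at least one of the two timeouts on every activity invocation.

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// MyActivity is a placeholder activity for illustration.
func MyActivity(ctx context.Context) error { return nil }

func MyWorkflow(ctx workflow.Context) error {
	ao := workflow.ActivityOptions{
		// A real server rejects the ScheduleActivityTask command with
		// BadScheduleActivityAttributes unless at least one of these is set.
		StartToCloseTimeout:    time.Minute,
		ScheduleToCloseTimeout: 10 * time.Minute,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	return workflow.ExecuteActivity(ctx, MyActivity).Get(ctx, nil)
}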
I would think of the test server as a good tool to unit test workflows, but it is no replacement for testing your workflows against a real Temporal cluster with the same settings as your production cluster before going to production.
tbh, this just feels like a LOT of work to test workflows if this is the stance you are taking.
I'm not sure exactly which settings you are referring to
I might have used the wrong verbiage. I was referring to the ActivityOptions timeout settings, StartToCloseTimeout and ScheduleToCloseTimeout. My suggestion was that there could just be validation in the test server that these are set prior to activity execution:
// Pseudo-code
func ExecuteActivity(...) error {
	if err := workflowCtx.ValidateTimeouts(); err != nil {
		return err // fail fast instead of silently applying a default timeout
	}
	// Execute the activity as normal
}
I don't know enough about SDK internals to know if this is feasible; just throwing ideas out! In the interim, we are adding this validation on our side, which feels like something we shouldn't have to worry about.
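For illustration, a minimal sketch of the kind of interim guard we mean, assuming the Go SDK; requireTimeouts is a hypothetical helper called before each activity invocation, not SDK API:

import (
	"errors"

	"go.temporal.io/sdk/workflow"
)

// requireTimeouts is a hypothetical guard mirroring the real server's
// check: at least one of the two timeouts must be positive.
func requireTimeouts(ao workflow.ActivityOptions) error {
	if ao.StartToCloseTimeout <= 0 && ao.ScheduleToCloseTimeout <= 0 {
		return errors.New("ActivityOptions: set StartToCloseTimeout or ScheduleToCloseTimeout")
	}
	return nil
}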
Our issue was the need to set either StartToCloseTimeout or ScheduleToCloseTimeout for every activity invocation:
Hmm, well this is an option that the SDK can validate SDK-side. The Java SDK does validate that one of these options is set https://github.com/temporalio/sdk-java/blob/c8a27ce9073164141fc1f08ac4c80456f48b3c1d/temporal-sdk/src/test/java/io/temporal/activity/ActivityOptionsTest.java#L65, and the test server has a similar check as well. Can you share a self-contained reproduction that passes against the test server but fails against a real server?