
CTR app props not resolving when using shell entry point

Open cppwfs opened this issue 3 years ago • 1 comments

When Spring Cloud Data Flow launches a CTR with deployment properties for its child tasks, it base64-encodes the portion of the key that contains the prefix and name of the child deployment property.
For example: when launching a CTR with a child task named task-a that has a deployment property of deployer.task-a.kubernetes.environment-variables=foo=bar using the default entrypoint, the command line looks like this: --composed-task-app-properties.base64_ZGVwbG95ZXIudGFzay1hLmt1YmVybmV0ZXMuZW52aXJvbm1lbnQtdmFyaWFibGVz=foo=bar This implementation works.

The problem occurs when a user specifies the shell entrypoint. In this case the deployer transforms the properties so that they can be used as environment variables, which includes upper-casing each property key. The side effect is that the base64-encoded portion is upper-cased as well, so when CTR decodes the base64 portion of the key, the result is incorrect. The example below shows how the deployer property is translated for the environment (the base64 portion is all caps, whereas above it is mixed case): COMPOSED-TASK-APP-PROPERTIES_BASE64_ZGVWBG95ZXIUDGLTZXN0YW1WLMT1YMVYBMV0ZXMUZW52AXJVBM1LBNQTDMFYAWFIBGVZ: foo=bar
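To make the corruption concrete, here is a minimal sketch using only the JDK (our own illustration, not SCDF code): base64 is case-sensitive, and upper-case letters are still valid base64 characters, so the upper-cased key decodes without error into garbage instead of the original property key.

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64CaseDemo {
    public static void main(String[] args) {
        String key = "deployer.task-a.kubernetes.environment-variables";

        // Produces the token SCDF appends after "base64_":
        // ZGVwbG95ZXIudGFzay1hLmt1YmVybmV0ZXMuZW52aXJvbm1lbnQtdmFyaWFibGVz
        String encoded = Base64.getEncoder().encodeToString(key.getBytes(StandardCharsets.UTF_8));
        System.out.println(encoded);

        // Round trip works as long as the case is preserved.
        System.out.println(new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8));

        // The shell entrypoint upper-cases the whole environment variable name,
        // including the encoded portion. Decoding still succeeds, but the
        // result is no longer the original property key.
        byte[] garbled = Base64.getDecoder().decode(encoded.toUpperCase());
        System.out.println(new String(garbled, StandardCharsets.UTF_8));
    }
}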

Users have reported a workaround: add the composed task runner name to the key, for example deployer.my-composed-task-name.task-a.deployer.timestamp.kubernetes.environment-variables=foo=bar. CTR then assumes the property is meant for it and sets the child deployer property as follows: COMPOSED-TASK-PROPERTIES: deployer.task-a.deployer.timestamp.kubernetes.environment-variables=foo=bar.
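For illustration, assuming the SCDF shell, the workaround maps to a launch property like the following (task and property names taken from the example above):

task launch my-composed-task-name --properties "deployer.my-composed-task-name.task-a.deployer.timestamp.kubernetes.environment-variables=foo=bar"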

cppwfs avatar Aug 30 '22 17:08 cppwfs

Thanks a lot - also for the initial investigation of the issue! Good that you spotted it from this side hint. 👍

klopfdreh avatar Aug 30 '22 19:08 klopfdreh

Not sure of the best way to resolve this. This is the reason that base64 encoding was added:

https://github.com/markpollack/spring-cloud-dataflow/commit/ce2601f316fca391911dac170b463b42e1d3e74c

https://github.com/spring-cloud/spring-cloud-dataflow/issues/4566 and the comments in

https://github.com/jvalkeal/spring-cloud-dataflow/commit/f19391307fc0b2bbd096b103785829bed6135fea

are interesting to refresh ourselves on.

- The property key part is now base64-encoded with the prefix base64_ because
  Boot's binding will silently remove special characters from
  map keys, and we could not use wildcards like deployer.*.local.working-directories-root
- Encoded values are then decoded on the CTR side.
- One annoying thing is that these base64-encoded values are visible in a manifest but
  are never directly interacted with by a user.

Maybe even for the shell entrypoint option we should pass the properties in a single environment variable, just as is done for the boot entrypoint with SPRING_APPLICATION_JSON. We could create a new entrypoint style, shell-json, to reflect this change.
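For context, SPRING_APPLICATION_JSON works because the variable name is already upper-case, so the shell entrypoint's upper-casing cannot corrupt it, and the case-sensitive base64 keys travel safely inside the JSON value. A hypothetical shell-json launch might therefore hand over the same property like this (illustrative only - no such entrypoint style exists today):

SPRING_APPLICATION_JSON={"composed-task-app-properties":{"base64_ZGVwbG95ZXIudGFzay1hLmt1YmVybmV0ZXMuZW52aXJvbm1lbnQtdmFyaWFibGVz":"foo=bar"}}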

markpollack avatar Sep 22 '22 14:09 markpollack

After researching this issue, the correct solution is to use the boot entrypoint. We need to update the documentation so that it says: "If you are using CTR on Kubernetes and you want to pass values via environment variables, you should use the boot entrypoint."
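The entrypoint style is selected per app via a Kubernetes deployer property, so the documented recommendation would translate to something like the following (app name reused from the original example):

deployer.task-a.kubernetes.entryPointStyle=boot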

cppwfs avatar Oct 24 '22 16:10 cppwfs

Hey @cppwfs - I am going to check if this is a possible solution for us, as we decided to use the shell entrypoint for several reasons. I will give an update here soon.

klopfdreh avatar Oct 24 '22 16:10 klopfdreh

@klopfdreh Looking forward to your response.

cppwfs avatar Oct 24 '22 17:10 cppwfs

Hey @cppwfs,

we have tested two scenarios.

  1. With entryPointStyle shell and with deployer.test-compose-task.s3-uploader-app.kubernetes.environment-variables=... This one runs without any issue: the properties with * are stored in the compose task, and the properties with test-compose-app are stored in test-compose-task-s3-uploader-app-task. [screenshot]

  2. With entryPointStyle boot and with deployer.s3-uploader-app.kubernetes.environment-variables=... This one caused big issues: no properties were stored at all, and the task execution shows this. Note: the compose task has neither stopped nor executed both of the tasks mentioned in the compose task. Only the first one ran, and then it hangs. [screenshot] The properties are named like this now: [screenshot]

One thing to mention is that we use Spring Cloud Config Server for our task applications. So our task applications are configured via bootstrap.yml:

spring:
  cloud:
    config:
      enabled: true
      uri: ...
      username: ...
      password: ...
      fail-fast: true
      name: ...
      label: ...

klopfdreh avatar Oct 25 '22 10:10 klopfdreh

Using the boot entrypoint without Config Server resolves the original issue. Now we need to discern whether Config Server plus the boot entrypoint have an issue, so can you provide some details so we can replicate it?

  1. Can you provide the logs from the child app as well as the composed task runner?
  2. Which version of SCDF are you using?
  3. Are you using Config Server on the Composed Task Runner and its child apps, or just the child apps?

cppwfs avatar Oct 26 '22 18:10 cppwfs

Hey @cppwfs - we are working on this, but it will take some time, because we have to ensure that there is no security-relevant data in the logs.

klopfdreh avatar Oct 27 '22 06:10 klopfdreh

Hello @klopfdreh, thanks for looking into it. Would you be able to answer the other questions?

cppwfs avatar Oct 27 '22 12:10 cppwfs

Hey @cppwfs,

of course.

  1. We are going to prepare the logs as mentioned
  2. We are using the following versions
{
  "versions": {
    "implementation": {
      "name": "spring-cloud-dataflow-server",
      "version": "2.10.0-M2"
    },
    "core": {
      "name": "Spring Cloud Data Flow Core",
      "version": "2.10.0-M2"
    },
    "dashboard": {
      "name": "Spring Cloud Dataflow UI",
      "version": "3.3.0-M2"
    },
    "shell": {
      "name": "Spring Cloud Data Flow Shell",
      "version": "2.10.0-M2",
      "url": "https://repo1.maven.org/maven2/org/springframework/cloud/spring-cloud-dataflow-shell/2.10.0-M2/spring-cloud-dataflow-shell-2.10.0-M2.jar"
    }
  },
  "features": {
    "streams": true,
    "tasks": true,
    "schedules": true,
    "monitoringDashboardType": "NONE"
  },
  "runtimeEnvironment": {
    "appDeployer": {
      "deployerImplementationVersion": "2.9.0-M2",
      "deployerName": "Spring Cloud Skipper Server",
      "deployerSpiVersion": "2.9.0-M2",
      "javaVersion": "11.0.16.1",
      "platformApiVersion": "",
      "platformClientVersion": "",
      "platformHostVersion": "",
      "platformSpecificInfo": {},
      "platformType": "Skipper Managed",
      "springBootVersion": "2.7.3",
      "springVersion": "5.3.22"
    },
    "taskLaunchers": [
      {
        "deployerImplementationVersion": "2.8.0-M2",
        "deployerName": "KubernetesTaskLauncher",
        "deployerSpiVersion": "2.8.0-M2",
        "javaVersion": "11.0.16.1",
        "platformApiVersion": "v1",
        "platformClientVersion": "unknown",
        "platformHostVersion": "unknown",
        "platformSpecificInfo": {
          "namespace": ".......",
          "master-url": "https://...../"
        },
        "platformType": "Kubernetes",
        "springBootVersion": "2.7.3",
        "springVersion": "5.3.22"
      }
    ]
  },
  "monitoringDashboardInfo": {
    "url": "",
    "source": "default-scdf-source",
    "refreshInterval": 15
  },
  "security": {
    "isAuthentication": true,
    "isAuthenticated": true,
    "username": "........",
    "roles": [
      "ROLE_CREATE",
      "ROLE_DEPLOY",
      "ROLE_DESTROY",
      "ROLE_MANAGE",
      "ROLE_MODIFY",
      "ROLE_SCHEDULE",
      "ROLE_VIEW"
    ]
  }
}
  3. We are using Spring Cloud Config Server only on the child apps, but not on the CTR. The CTR is configured with the following environment variables: SPRING_CLOUD_TASK_BATCH_FAIL_ON_JOB_FAILURE (true) / OAUTH2_CLIENT_CREDENTIALS_SCOPES / OAUTH2_CLIENT_CREDENTIALS_TOKEN_URI / OAUTH2_CLIENT_CREDENTIALS_CLIENT_ID / OAUTH2_CLIENT_CREDENTIALS_CLIENT_SECRET / OAUTH2_CLIENT_CREDENTIALS_CLIENT_AUTHENTICATION_METHOD (CLIENT_SECRET_POST) / TRANSACTION_ISOLATION_LEVEL (ISOLATION_READ_COMMITTED)

klopfdreh avatar Oct 27 '22 12:10 klopfdreh

Hi! We have prepared the (anonymized) logs for you to have a look at: ctr_runner_boot_entrypoint.log ctr_runner_child_task1_boot_entrypoint.log

Notably, we now also use Config Server for the CTR, though we noticed no difference in behavior regarding this issue compared to not using it.

einmartwa avatar Nov 15 '22 13:11 einmartwa

So, with the sample logs above, we see that CTR launched and started child_task1.

Did child_task1 finish successfully or did it hang?

cppwfs avatar Nov 15 '22 20:11 cppwfs

The child_task1 finished successfully, but the CTR task did hang.

einmartwa avatar Nov 16 '22 06:11 einmartwa

The second task, child_task2, did not start. Instead, the CTR task hangs indefinitely.

einmartwa avatar Nov 16 '22 14:11 einmartwa

Sorry for the noise on the last comment; I deleted it immediately after I posted it (not fast enough though :-) ). I remembered from the log that CTR did not launch the second app.
I have seen this before, but the circumstance was that the child app and CTR were pointing to two different databases. CTR launched the task via SCDF and monitored the task_execution table in the database Data Flow was associated with, while the child app updated the task_execution table in another database. Thus CTR appeared to be hung, because the task_execution entry was never updated in its DB.

I've not been able to replicate the behavior you have reported yet.
Is there another instance of your DB that the child app or CTR could be connecting to?

cppwfs avatar Nov 16 '22 14:11 cppwfs

The tasks we use here all use a single database connection, dictated by an env var passed by the SCDF server itself, which should make it the same everywhere. That can't be it in our case.

einmartwa avatar Nov 17 '22 10:11 einmartwa

We did some further investigation into this issue but couldn't solve it yet. We are using only one database for the CTR and the child task applications.

Still, when we set the entryPointStyle to boot, the SPRING_APPLICATION_JSON is correctly handed over to the CTR, but after the first child task completes, the CTR gets stuck and does not execute the second child task.

Here is the JSON of the CTR:

{
  "composed-task-app-properties.base64_ZGVwbG95ZXIuczMtdXBsb2FkZXItYXBwLmt1YmVybmV0ZXMuZW52aXJvbm1lbnRWYXJpYWJsZXM": "APPLICATION_NAME='s3-uploader-app',ENV='komo',RT_ENV='komo',DISABLE_SAFE_DELETE='true',TRANSACTION_LOG_MOUNT_PATH='/persistent/lf-batch-xa-transaction-logs',TRANSACTION_LOG_BASE_NAME='test-compose-task_20230315_111846-1',OTEL_SERVICE_NAME='s3-uploader-task',AWS_ACCOUNT_ENDPOINT='[https://xxxxxxx',AWS_BUCKET_NAME='xxxxxxx',DRY_RUN='true',S3_UPLOADER_RESOURCES_TO_UPLOAD_0_INPUT_RESOURCES_PATTERN_URI='file:/deployments/*.jar',S3_UPLOADER_RESOURCES_TO_UPLOAD_0_OUTPUT_FOLDER_RESOURCE_URI='s3://xxxxxxx/',SPRING_CLOUD_CONFIG_NAME_VALUE='s3-uploader-task',SPRING_CLOUD_CONFIG_LABEL_VALUE='komo',RUN_ID='test-compose-task_20230315_111846-1',TASK_IDENTIFIER='test-compose-task_20230315_111846'](https://xxxxxxx',aws_bucket_name='xxxxxxx',dry_run='true',s3_uploader_resources_to_upload_0_input_resources_pattern_uri='file/deployments/*.jar',S3_UPLOADER_RESOURCES_TO_UPLOAD_0_OUTPUT_FOLDER_RESOURCE_URI='s3://xxxxxxx/',SPRING_CLOUD_CONFIG_NAME_VALUE='s3-uploader-task',SPRING_CLOUD_CONFIG_LABEL_VALUE='komo',RUN_ID='test-compose-task_20230315_111846-1',TASK_IDENTIFIER='test-compose-task_20230315_111846')",
  "management.metrics.export.prometheus.enabled": "true",
  "spring.datasource.driverClassName": "oracle.jdbc.OracleDriver",
  "dataflowServerUri": https://xxxxxxx:8443/lf-batch-spring-cloud-data-flow-server/,
  "spring.cloud.task.name": "test-compose-task",
  "composed-task-app-properties.base64_ZGVwbG95ZXIuY2xlYW5lci1hcHAua3ViZXJuZXRlcy52b2x1bWVNb3VudHM": "[{name: 'lf-efs-pvc-komo', mountPath: '/persistent'}]",
  "management.metrics.tags.service": "task-application",
  "composed-task-app-properties.base64_ZGVwbG95ZXIuY2xlYW5lci1hcHAua3ViZXJuZXRlcy5lbnZpcm9ubWVudFZhcmlhYmxlcw": "APPLICATION_NAME='cleaner-app',ENV='komo',RT_ENV='komo',TRANSACTION_LOG_MOUNT_PATH='/persistent/lf-batch-xa-transaction-logs',TRANSACTION_LOG_BASE_NAME='test-compose-task_20230315_111846-2',OTEL_SERVICE_NAME='cleaner-task',DRY_RUN='true',RESOURCE_URIS='file:/deployments/*',SPRING_CLOUD_CONFIG_NAME_VALUE='cleaner-task',SPRING_CLOUD_CONFIG_LABEL_VALUE='komo',DISABLE_SAFE_DELETE='false',RUN_ID='test-compose-task_20230315_111846-2',TASK_IDENTIFIER='test-compose-task_20230315_111846'",
  "composed-task-app-properties.base64_ZGVwbG95ZXIuY2xlYW5lci1hcHAua3ViZXJuZXRlcy52b2x1bWVz": "[{name: 'lf-efs-pvc-komo', persistentVolumeClaim: { claimName: 'lf-efs-pvc-komo', readOnly: 'false' }}]",
  "management.metrics.export.prometheus.rsocket.host": "xxxxxxx",
  "composed-task-app-properties.base64_ZGVwbG95ZXIuczMtdXBsb2FkZXItYXBwLmt1YmVybmV0ZXMuZGVwbG95bWVudExhYmVscw": "xxxxxxx",
  "composed-task-app-properties.base64_ZGVwbG95ZXIuczMtdXBsb2FkZXItYXBwLmt1YmVybmV0ZXMuc2VjcmV0S2V5UmVmcw": "[{envVarName: 'AWS_IAM_USER_ACCESS_KEY_ID', secretName: 'aws-s3-credentials-komo', dataKey: 'AWS_IAM_USER_ACCESS_KEY_ID'},{envVarName: 'AWS_IAM_USER_SECRET_KEY', secretName: 'aws-s3-credentials-komo', dataKey: 'AWS_IAM_USER_SECRET_KEY'}]",
  "composed-task-app-properties.base64_ZGVwbG95ZXIuczMtdXBsb2FkZXItYXBwLmt1YmVybmV0ZXMudm9sdW1lcw": "[{name: 'lf-efs-pvc-komo', persistentVolumeClaim: { claimName: 'lf-efs-pvc-komo', readOnly: 'false' }}]",
  "management.metrics.tags.application": "${spring.cloud.task.name:unknown}-${spring.cloud.task.executionid:unknown}",
  "platform-name": "default",
  "graph": "test-compose-task-s3-uploader-app && test-compose-task-cleaner-app",
  "composed-task-app-properties.base64_ZGVwbG95ZXIuY2xlYW5lci1hcHAua3ViZXJuZXRlcy5kZXBsb3ltZW50TGFiZWxz": "xxxxxxx",
  "spring.datasource.url": "jdbc:oracle:xxxxxxx",
  "composed-task-app-properties.base64_ZGVwbG95ZXIuczMtdXBsb2FkZXItYXBwLmt1YmVybmV0ZXMudm9sdW1lTW91bnRz": "[{name: 'lf-efs-pvc-komo', mountPath: '/persistent'}]",
  "composed-task-app-properties.base64_ZGVwbG95ZXIuczMtdXBsb2FkZXItYXBwLmt1YmVybmV0ZXMubGltaXRzLmNwdQ": "375m",
  "composed-task-app-properties.base64_ZGVwbG95ZXIuczMtdXBsb2FkZXItYXBwLmt1YmVybmV0ZXMubGltaXRzLm1lbW9yeQ": "1000Mi",
  "management.metrics.export.prometheus.rsocket.port": "7001",
  "management.metrics.export.prometheus.rsocket.enabled": "true",
  "composed-task-app-properties.base64_ZGVwbG95ZXIuY2xlYW5lci1hcHAua3ViZXJuZXRlcy5saW1pdHMubWVtb3J5": "1000Mi",
  "composed-task-app-properties.base64_ZGVwbG95ZXIuY2xlYW5lci1hcHAua3ViZXJuZXRlcy5saW1pdHMuY3B1": "375m"
}

klopfdreh avatar Mar 15 '23 10:03 klopfdreh

We noticed that there are two entries in the database for the first child task that was started.

So what happens is that the first one, 2340, is started and checked all the time, but the end time is written to 2341. Because the CTR only reads the task with execution id 2340, it gets stuck - the end time is never written there.

[screenshot]

Because we are using Oracle, we had to change the TRANSACTION_ISOLATION_LEVEL to ISOLATION_READ_COMMITTED - can this be the issue?

klopfdreh avatar Mar 15 '23 14:03 klopfdreh

That is possible: ISOLATION_READ_COMMITTED permits non-repeatable reads, meaning data read twice inside the same transaction is not guaranteed to contain the same value.

cppwfs avatar Mar 15 '23 15:03 cppwfs

Hey @cppwfs - because we use Oracle, and it does not support ISOLATION_REPEATABLE_READ, what should I use as the isolation level? Any advice here?

Edit: The same CTR is not going to run more than once at a time.

klopfdreh avatar Mar 15 '23 16:03 klopfdreh

Before we go down that path, is the spring.cloud.task.executionid set for the child application?

cppwfs avatar Mar 15 '23 21:03 cppwfs

Yes, that is what I meant with 2340 and 2341 in this comment: https://github.com/spring-cloud/spring-cloud-dataflow/issues/5070#issuecomment-1470108752

You can also see it in the screenshot in that comment.

klopfdreh avatar Mar 15 '23 22:03 klopfdreh

Hi @klopfdreh, let me ask a different way. SCDF creates an empty entry in the task_execution table when it launches a task, much like what we see for 2340. Then it passes spring.cloud.task.executionid=2340 to the task, so when the task launches it will use the task execution id 2340. But if the task does not receive spring.cloud.task.executionid=2340, it will create its own task_execution entry, something like what we see in 2341.
When CTR requests the launch of the task, it gets back 2340 and uses that to determine whether the task is still running.
I'm curious whether the child task is getting spring.cloud.task.executionid=2340, as illustrated below.
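For illustration, with the execution ids from the screenshots above, a correctly wired child task launch would include this argument (or the equivalent property) on start-up:

--spring.cloud.task.executionid=2340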

cppwfs avatar Mar 15 '23 23:03 cppwfs

I had a look at the CTR when it launches the task and remote debugged it: https://github.com/spring-cloud/spring-cloud-dataflow/blob/main/spring-cloud-dataflow-composed-task-runner/src/main/java/org/springframework/cloud/dataflow/composedtaskrunner/TaskLauncherTasklet.java#L177

There is no --spring.cloud.task.executionid, but only a --spring.cloud.task.parent-execution-id, which is handed over to the launch call via the args parameter. This one is obtained beforehand from the SCDF server with the call result = this.taskExplorer.getTaskExecutionIdByJobExecutionId(stepContribution.getStepExecution().getJobExecutionId()); in public String getParentTaskExecutionId(StepContribution stepContribution).

When the launch call is made, the CTR receives the child executionId from the SCDF server, because it is a REST operation. The resulting executionId is put into the stepExecutionContext, see https://github.com/spring-cloud/spring-cloud-dataflow/blob/main/spring-cloud-dataflow-composed-task-runner/src/main/java/org/springframework/cloud/dataflow/composedtaskrunner/TaskLauncherTasklet.java#L184

The TaskLauncherTasklet also keeps the executionId in a field and checks whether it is set. If it is already set, the child task has been started, and Thread.sleep(this.composedTaskProperties.getIntervalTimeBetweenChecks()); is called.

So the CTR is not using a given child executionId but obtains it by itself; a simplified sketch of this polling follows below.
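Here is a simplified sketch of that polling behavior (our own illustration, not the actual TaskLauncherTasklet code; the interval is hard-coded for brevity): the tasklet keeps re-reading the same execution id, so if the child records its end time under a different id (2341 instead of 2340), the loop never terminates.

import org.springframework.cloud.task.repository.TaskExecution;
import org.springframework.cloud.task.repository.TaskExplorer;

public class ExecutionPoller {

    // Blocks until the given task execution has an end time recorded.
    static void waitForCompletion(TaskExplorer taskExplorer, long executionId)
            throws InterruptedException {
        TaskExecution execution = taskExplorer.getTaskExecution(executionId);
        while (execution == null || execution.getEndTime() == null) {
            Thread.sleep(10_000L); // stands in for intervalTimeBetweenChecks
            // Re-read the same row; an end time written to a different
            // execution id (e.g. 2341) is never seen here.
            execution = taskExplorer.getTaskExecution(executionId);
        }
    }
}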

[screenshot]

So my question still is: why are two entries created? By the way, I changed the isolation level to ISOLATION_SERIALIZABLE, which also doesn't fix the issue.

klopfdreh avatar Mar 16 '23 06:03 klopfdreh

I had a look at the database again, and the compose task is also written twice into the database when started.

This does not happen when the entryPointStyle is shell:

[screenshot]

klopfdreh avatar Mar 16 '23 13:03 klopfdreh

In comparison to the previous post, here are the TASK_EXECUTION table entries with entryPointStyle shell and the deprecated way of applying the child task properties: deployer.test-compose-task.s3-uploader-app.kubernetes... and deployer.test-compose-task.cleaner-app.kubernetes...

[screenshot]

As you can see: no duplicated entries, and the START_TIME and END_TIME are applied correctly - also for the compose task.

klopfdreh avatar Mar 16 '23 13:03 klopfdreh

Cool, this helps - we'll take a look.

cppwfs avatar Mar 16 '23 14:03 cppwfs

May I ask if there are any updates yet?

klopfdreh avatar Apr 11 '23 12:04 klopfdreh

Sorry for the delay in response.
If you create a new definition with a new name does this still occur?

cppwfs avatar Apr 24 '23 21:04 cppwfs

Yes. When we switched the entry point, we first cleared the database and re-registered all task applications / CTR task definitions, and the issue still occurred. Later on, when we noticed that there was no difference, we continued testing without recreating the database.

klopfdreh avatar Apr 25 '23 04:04 klopfdreh