[Bug] [seatunnel-engine-server] When a slot request fails for lack of resources, the slots already allocated are not released
Search before asking
- [X] I had searched in the issues and found no similar issues.
What happened
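With seatunnel.engine.slot-service.slot-num=5, submit a job that needs 6 slots. The first 5 slots are allocated successfully, and the last request throws NoEnoughResourceException because no resources are left. The job then fails and ends, but the 5 slots that were allocated are never released.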
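The scenario can be modeled with a minimal sketch. This is a toy stand-in, not SeaTunnel's actual resource manager: the class `SlotLeakRepro`, its methods, and the use of `IllegalStateException` in place of NoEnoughResourceException are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a fixed-size slot pool (hypothetical, not SeaTunnel code). */
class SlotLeakRepro {
    private final int slotNum;                               // mirrors slot-num
    private final List<Integer> assignedSlots = new ArrayList<>();

    SlotLeakRepro(int slotNum) {
        this.slotNum = slotNum;
    }

    /** Grants one slot, or fails when the pool is exhausted. */
    int applySlot() {
        if (assignedSlots.size() >= slotNum) {
            // Stand-in for NoEnoughResourceException.
            throw new IllegalStateException("not enough resources");
        }
        int slotId = assignedSlots.size();
        assignedSlots.add(slotId);
        return slotId;
    }

    /** Requests slots one by one, with no cleanup if a later request fails. */
    void submitJob(int neededSlots) {
        for (int i = 0; i < neededSlots; i++) {
            applySlot();
        }
    }

    int assignedCount() {
        return assignedSlots.size();
    }
}
```

Calling `submitJob(6)` on a pool created with `new SlotLeakRepro(5)` throws on the sixth request and leaves all five slots marked as assigned, which is the leak described above.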
SeaTunnel Version
2.3.4
SeaTunnel Config
seatunnel:
  engine:
    classloader-cache-mode: true
    backup-count: 1
    print-execution-info-interval: 120
    print-job-metrics-info-interval: 10
    queue-type: blockingqueue
    slot-service:
      dynamic-slot: false
      slot-num: 5
    checkpoint:
      interval: 30000
      timeout: 21474836460
      max-concurrent: 10
      tolerable-failure: 2
Running Command
java -Dseatunnel.config=/alidata1/za-seatunnel/apache-seatunnel-2.3.4-SNAPSHOT/config/seatunnel.yaml -Dhazelcast.config=/alidata1/za-seatunnel/apache-seatunnel-2.3.4-SNAPSHOT/config/hazelcast.yaml -Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector -Dlog4j2.configurationFile=/alidata1/za-seatunnel/apache-seatunnel-2.3.4-SNAPSHOT/config/log4j2.properties -Dseatunnel.logs.path=/alidata1/za-seatunnel/apache-seatunnel-2.3.4-SNAPSHOT/logs -Dseatunnel.logs.file_name=seatunnel-engine-server -Xrunjdwp:server=y,transport=dt_socket,address=5001,suspend=y -Xms3g -Xmx3g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/seatunnel/dump/zeta-server -XX:MaxMetaspaceSize=1g -XX:+UseG1GC -XX:+PrintGCDetails -Xloggc:/alidata1/za-seatunnel/logs/gc.log -XX:+PrintGCDateStamps -XX:MaxGCPauseMillis=3000 -cp /alidata1/za-seatunnel/apache-seatunnel-2.3.4-SNAPSHOT/lib/*:/alidata1/za-seatunnel/apache-seatunnel-2.3.4-SNAPSHOT/starter/seatunnel-starter.jar org.apache.seatunnel.core.starter.seatunnel.SeaTunnelServer -d
Error Exception
When NoEnoughResourceException is thrown, the resources that were already allocated successfully are not released.
Zeta or Flink or Spark Version
No response
Java or Scala Version
No response
Screenshots
No response
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
@hailin0 Hailin, could you please take a look?
cc @Hisoka-X
Please assign this to me for now, thanks.
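After the CANCELING logic in SubPlan has run, assignedSlots still exist in registerWorker of the jobMaster's resourceManager.
Tracing the cancel logic in PhysicalVertex, I found that checkTaskGroupIsExecuting returns false, so the state is changed directly to CANCELED and the CancelTaskOperation logic is never executed.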
Update with some findings:
The releasePipelineResource method cannot get the slot profile, so the resource is not released.
This IMap entry is only put when the state becomes SCHEDULED, after the resource request has succeeded. In this case the resource request failed, so the put method was never called.
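One possible direction, sketched under assumptions rather than taken from the SeaTunnel code base: record each slot against its pipeline as soon as it is granted, so the release path can find it even when a later request fails. In the sketch below, `PipelineSlotLedger`, `applyResources`, and the `ownedSlots` map (a plain `HashMap` standing in for the IMap) are hypothetical; only the name releasePipelineResource is borrowed from the discussion above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical ledger that tracks slot ownership per pipeline. */
class PipelineSlotLedger {
    private final int slotNum;
    private int inUse = 0;
    private int nextSlotId = 0;
    // Stand-in for the IMap: pipeline id -> slots granted so far.
    private final Map<String, List<Integer>> ownedSlots = new HashMap<>();

    PipelineSlotLedger(int slotNum) {
        this.slotNum = slotNum;
    }

    /** Requests slots one by one and records each grant immediately. */
    synchronized void applyResources(String pipelineId, int neededSlots) {
        for (int i = 0; i < neededSlots; i++) {
            if (inUse >= slotNum) {
                // Stand-in for NoEnoughResourceException.
                throw new IllegalStateException("not enough slots for " + pipelineId);
            }
            inUse++;
            ownedSlots.computeIfAbsent(pipelineId, k -> new ArrayList<>()).add(nextSlotId++);
        }
    }

    /** Releases whatever the pipeline holds, including partial allocations. */
    synchronized void releasePipelineResource(String pipelineId) {
        List<Integer> slots = ownedSlots.remove(pipelineId);
        if (slots != null) {
            inUse -= slots.size();
        }
    }

    synchronized int freeSlots() {
        return slotNum - inUse;
    }
}
```

Because ownership is written per grant rather than once on success, a failed `applyResources("p1", 6)` against 5 slots still leaves an entry for `p1`, and a later `releasePipelineResource("p1")` returns all 5 slots to the pool.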
