[Bug] [seatunnel-engine-server] When a slot request fails for lack of resources, the slots already acquired are not released

Open liangcw1111 opened this issue 1 year ago • 4 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

What happened

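With seatunnel.engine.slot-service.slot-num=5, submit a job that needs 6 slots. The first 5 slot requests succeed, and the last one throws NoEnoughResourceException because no resources are left. The job then fails and ends, but the 5 slots that were acquired successfully are not released.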
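The expected behaviour can be illustrated with a minimal, self-contained sketch of all-or-nothing slot acquisition. It is not SeaTunnel code: `SlotRollbackSketch`, `SlotPool`, and `acquireAll` are hypothetical names, and the nested `NoEnoughResourceException` is only a stand-in for the exception named in this report. The only assumption is a fixed pool of 5 slots, as in the config below.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch; none of these types are SeaTunnel's real classes. */
final class SlotRollbackSketch {

    /** Stand-in for the NoEnoughResourceException named in the report. */
    static final class NoEnoughResourceException extends RuntimeException {
        NoEnoughResourceException(String message) {
            super(message);
        }
    }

    /** Fixed-size pool, comparable to slot-num: 5 with dynamic-slot: false. */
    static final class SlotPool {
        private final Deque<Integer> free = new ArrayDeque<>();

        SlotPool(int capacity) {
            for (int id = 0; id < capacity; id++) {
                free.add(id);
            }
        }

        synchronized int acquire() {
            Integer id = free.poll();
            if (id == null) {
                throw new NoEnoughResourceException("no free slot left");
            }
            return id;
        }

        synchronized void release(int id) {
            free.add(id);
        }

        synchronized int available() {
            return free.size();
        }
    }

    /** Acquires {@code needed} slots, or none at all if the pool runs out. */
    static List<Integer> acquireAll(SlotPool pool, int needed) {
        List<Integer> acquired = new ArrayList<>();
        try {
            for (int i = 0; i < needed; i++) {
                acquired.add(pool.acquire());
            }
            return acquired;
        } catch (NoEnoughResourceException e) {
            // Roll back the partial allocation before reporting the failure.
            for (int id : acquired) {
                pool.release(id);
            }
            throw e;
        }
    }
}
```

In this sketch, asking for 6 slots from a pool of 5 still throws, but the pool is back to 5 free slots afterwards, which is the outcome the report expects from the engine.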

SeaTunnel Version

2.3.4

SeaTunnel Config

seatunnel:
  engine:
    classloader-cache-mode: true
    backup-count: 1
    print-execution-info-interval: 120
    print-job-metrics-info-interval: 10
    queue-type: blockingqueue
    slot-service:
      dynamic-slot: false
      slot-num: 5
    checkpoint:
      interval: 30000
      timeout: 21474836460
      max-concurrent: 10
      tolerable-failure: 2

Running Command

java -Dseatunnel.config=/alidata1/za-seatunnel/apache-seatunnel-2.3.4-SNAPSHOT/config/seatunnel.yaml -Dhazelcast.config=/alidata1/za-seatunnel/apache-seatunnel-2.3.4-SNAPSHOT/config/hazelcast.yaml -Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector -Dlog4j2.configurationFile=/alidata1/za-seatunnel/apache-seatunnel-2.3.4-SNAPSHOT/config/log4j2.properties -Dseatunnel.logs.path=/alidata1/za-seatunnel/apache-seatunnel-2.3.4-SNAPSHOT/logs -Dseatunnel.logs.file_name=seatunnel-engine-server -Xrunjdwp:server=y,transport=dt_socket,address=5001,suspend=y -Xms3g -Xmx3g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/seatunnel/dump/zeta-server -XX:MaxMetaspaceSize=1g -XX:+UseG1GC -XX:+PrintGCDetails -Xloggc:/alidata1/za-seatunnel/logs/gc.log -XX:+PrintGCDateStamps -XX:MaxGCPauseMillis=3000 -cp /alidata1/za-seatunnel/apache-seatunnel-2.3.4-SNAPSHOT/lib/*:/alidata1/za-seatunnel/apache-seatunnel-2.3.4-SNAPSHOT/starter/seatunnel-starter.jar org.apache.seatunnel.core.starter.seatunnel.SeaTunnelServer -d

Error Exception

When NoEnoughResourceException is thrown, the resources that were already acquired are not released.

Zeta or Flink or Spark Version

No response

Java or Scala Version

No response

Screenshots

No response

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

liangcw1111 Apr 26 '24 09:04

@hailin0 Hailin, could you please take a look?

VincentSleepless Apr 26 '24 10:04

cc @Hisoka-X

hailin0 Apr 26 '24 10:04

Please assign this to me for now, thanks.

liunaijie Apr 26 '24 10:04

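image After the CANCELING logic in SubPlan has run, assignedSlots still exist in the registerWorker of the jobMaster's resourceManger. Tracing the cancel logic of PhysicalVertex shows that checkTaskGroupIsExecuting returns false, so the state is changed directly to CANCELED and the CancelTaskOperation logic is never executed.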

liangcw1111 Apr 26 '24 10:04

An update with some findings: the method releasePipelineResource cannot get the slot profile, so the resource is not released. image

liunaijie Apr 27 '24 02:04

An update with some findings: the method releasePipelineResource cannot get the slot profile, so the resource is not released. image

This IMap entry is put at the SCHEDULED status, once the resource request has succeeded. In this case the resource request fails, so the put method is never called.

image

liunaijie Apr 27 '24 02:04
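The ordering problem described in the last two comments can be reproduced in a small standalone model. This is only a sketch under assumptions: `OwnedSlotTrackingSketch` and its fields are hypothetical, a plain `HashMap` stands in for the IMap, and the body of `releasePipelineResource` here is illustrative rather than the engine's implementation. It contrasts recording slot ownership only after the whole request succeeds with recording each slot as soon as it is granted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of the thread's findings; not SeaTunnel's real classes. */
final class OwnedSlotTrackingSketch {
    /** Stand-in for the IMap that maps a pipeline to its slot profiles. */
    private final Map<String, List<Integer>> ownedSlots = new HashMap<>();
    /** Stand-in for the assignedSlots kept on the worker side. */
    private final Set<Integer> assignedSlots = new HashSet<>();
    private final int capacity;

    OwnedSlotTrackingSketch(int capacity) {
        this.capacity = capacity;
    }

    /** Reported ordering: ownership is recorded only after every slot is granted. */
    boolean applyRecordingAtEnd(String pipeline, int needed) {
        List<Integer> granted = new ArrayList<>();
        for (int i = 0; i < needed; i++) {
            Integer slot = grant();
            if (slot == null) {
                return false; // the slots in 'granted' are never recorded
            }
            granted.add(slot);
        }
        ownedSlots.put(pipeline, granted);
        return true;
    }

    /** Alternative ordering: each slot is recorded as soon as it is granted. */
    boolean applyRecordingEagerly(String pipeline, int needed) {
        for (int i = 0; i < needed; i++) {
            Integer slot = grant();
            if (slot == null) {
                return false; // earlier slots are already recorded
            }
            ownedSlots.computeIfAbsent(pipeline, k -> new ArrayList<>()).add(slot);
        }
        return true;
    }

    /** Releases whatever is recorded for the pipeline; a missing record releases nothing. */
    void releasePipelineResource(String pipeline) {
        List<Integer> slots = ownedSlots.remove(pipeline);
        if (slots == null) {
            return;
        }
        assignedSlots.removeAll(slots);
    }

    int assignedCount() {
        return assignedSlots.size();
    }

    /** Grants the lowest free slot id, or null when the pool is exhausted. */
    private Integer grant() {
        for (int id = 0; id < capacity; id++) {
            if (assignedSlots.add(id)) {
                return id;
            }
        }
        return null;
    }
}
```

With a capacity of 5 and a request for 6, the first ordering leaves 5 slots assigned after `releasePipelineResource` runs, while the second leaves none. Whether the actual fix should record eagerly or roll back inside the request path is a design decision for the maintainers; the sketch only shows why the lookup comes back empty.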