[LIVY-866] Optimizing Yarn GetApplications Query to prevent additional load on Yarn and Livy
What changes were proposed in this pull request?
Currently Livy queries Yarn applications by applicationType "SPARK" only. This puts a heavy load on Yarn clusters when there are thousands or more Spark applications across all states (running, finished, failed, queued, etc.). A better approach is to query applications by tag in addition to application type, since Livy only needs to track applications carrying specific application tags. However, YarnClient does not expose an API to query applications by tag.
This change extends YarnClientImpl with a getApplications method that accepts a GetApplicationsRequest as a parameter. Instead of querying all SPARK applications, Livy now queries only SPARK applications carrying the required tags, which avoids unnecessary load on both the Yarn and Livy servers.
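The extension described above could be sketched roughly as follows. This is an illustrative outline, not the exact code in the PR: the class name YarnClientExt comes from the logs below, and the sketch assumes the protected rmClient field (an ApplicationClientProtocol) that YarnClientImpl exposes to subclasses.

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.yarn.api.protocolrecords.GetApplicationsRequest;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.impl.YarnClientImpl;
import org.apache.hadoop.yarn.exceptions.YarnException;

// Hypothetical sketch: extend YarnClientImpl to accept a full
// GetApplicationsRequest, so filters such as applicationTags reach the
// ResourceManager instead of being dropped by the stock YarnClient API.
public class YarnClientExt extends YarnClientImpl {

  public List<ApplicationReport> getApplications(GetApplicationsRequest request)
      throws YarnException, IOException {
    // Delegate to the protected rmClient proxy that YarnClientImpl
    // already maintains, passing the caller's request through unchanged.
    return rmClient.getApplications(request).getApplicationList();
  }
}
```

Because the request object is forwarded as-is, any filter supported by GetApplicationsRequest (application types, tags, states, queues) is applied server-side by the ResourceManager rather than client-side after fetching all applications.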
JIRA: https://issues.apache.org/jira/browse/LIVY-866
How was this patch tested?
Verified in a local Yarn cluster. Confirmed in the trace logs that the request is sent with the applicationTags filter and that the response returns the matching application report. Please see the logs below.
Verified that other calls to the Yarn client, such as getApplicationAttemptReport and getContainerReport, still succeed. Updated the existing tests to use the new YarnClientExt.
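A tagged query like the one visible in the trace below (application_types: "SPARK", applicationTags: "livy-batch-5-osefkl7m") could be built along these lines; this is a hedged sketch, and the client variable stands for an already-started instance of the extended client:

```java
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.yarn.api.protocolrecords.GetApplicationsRequest;
import org.apache.hadoop.yarn.api.records.ApplicationReport;

// Build a request that filters by application type AND tag, so the
// ResourceManager only returns the applications Livy actually tracks.
GetApplicationsRequest request =
    GetApplicationsRequest.newInstance(Collections.singleton("SPARK"));
request.setApplicationTags(Collections.singleton("livy-batch-5-osefkl7m"));

// 'client' is assumed to be a started YarnClientExt instance.
List<ApplicationReport> reports = client.getApplications(request);
```

Filtering by tag at the ResourceManager means the response contains only the handful of applications Livy launched, rather than every SPARK application on the cluster.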
21/09/07 15:38:50 TRACE YarnClientExt: getApplications called in YarnClientExt with GetApplicationsRequest, calling rmClient to get Applications
21/09/07 15:38:50 TRACE ProtobufRpcEngine: 75: Call -> 0.0.0.0/0.0.0.0:8032: getApplications {application_types: "SPARK" applicationTags: "livy-batch-5-osefkl7m"}
21/09/07 15:38:50 DEBUG Client: IPC Client (72154307) connection to 0.0.0.0/0.0.0.0:8032 from Administrator sending #28
21/09/07 15:38:50 DEBUG Client: IPC Client (72154307) connection to 0.0.0.0/0.0.0.0:8032 from Administrator got value #28
21/09/07 15:38:50 DEBUG ProtobufRpcEngine: Call: getApplications took 8ms
21/09/07 15:38:50 TRACE ProtobufRpcEngine: 75: Response <- 0.0.0.0/0.0.0.0:8032: getApplications {applications { applicationId { id: 2 cluster_timestamp: 1631009244715 } user: "Administrator" queue: "default" name: "SparkBatchJobTest-8" host: "N/A" rpc_port: -1 yarn_application_state: ACCEPTED trackingUrl: "http://MININT-AHVKP1D:8088/proxy/application_1631009244715_0002/" diagnostics: "[Tue Sep 07 15:38:50 +0530 2021] Application is Activated, waiting for resources to be assigned for AM. Details : AM Partition = <DEFAULT_PARTITION> ; Partition Resource = <memory:14336, vCores:8> ; Queue\'s Absolute capacity = 100.0 % ; Queue\'s Absolute used capacity = 0.0 % ; Queue\'s Absolute max capacity = 100.0 % ; " startTime: 1631009330477 finishTime: 0 final_application_status: APP_UNDEFINED app_resource_Usage { num_used_containers: 0 num_reserved_containers: 0 used_resources { memory: 0 virtual_cores: 0 3: "\n\tmemory-mb\020\000\032\002Mi \000" 3: "\n\006vcores\020\000\032\000 \000" } reserved_resources { memory: 0 virtual_cores: 0 3: "\n\tmemory-mb\020\000\032\002Mi \000" 3: "\n\006vcores\020\000\032\000 \000" } needed_resources { memory: 0 virtual_cores: 0 3: "\n\tmemory-mb\020\000\032\002Mi \000" 3: "\n\006vcores\020\000\032\000 \000" } memory_seconds: 0 vcore_seconds: 0 8: 0x00000000 9: 0x00000000 10: 0 11: 0 12: "\n\tmemory-mb\020\000" 12: "\n\006vcores\020\000" 13: "\n\tmemory-mb\020\000" 13: "\n\006vcores\020\000" } originalTrackingUrl: "N/A" currentApplicationAttemptId { application_id { id: 2 cluster_timestamp: 1631009244715 } attemptId: 1 } progress: 0.0 applicationType: "SPARK" applicationTags: "livy-batch-5-osefkl7m" 21: 1 22: 0 23: "\b\000" 24: "<Not set>" 25: "<DEFAULT_PARTITION>" 26: "\b\001\022\030\b\001\022\tUNLIMITED\030\377\377\377\377\377\377\377\377\377\001" 27: 0 }}
@jerryshao @zjffdu @alex-the-man Could you kindly help in reviewing this PR? Thanks.