Add configuration of loggers for PySpark
Describe the feature you'd like
Currently, everything for PySpark (Processor, training, etc.) is logged at INFO level, including the basic cluster setup, for example:
<!-- Site specific YARN configuration properties -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>10.0.215.164</value>
    <description>The hostname of the RM.</description>
  </property>
  <property>
    <name>yarn.nodemanager.hostname</name>
    <value>algo-1</value>
    <description>The hostname of the NM.</description>
  </property>
  <property>
    <name>yarn.nodemanager.webapp.address</name>
    <value>algo-1:8042</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>5</value>
    <description>Ratio between virtual memory to physical memory.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.am.max-attempts</name>
    <value>1</value>
    <description>The maximum number of application attempts.</description>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,YARN_HOME,AWS_CONTAINER_CREDENTIALS_RELATIVE_URI,AWS_REGION</value>
    <description>Environment variable whitelist</description>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>32768</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>8</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>32768</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>
This results in a huge volume of CloudWatch logs, with major downsides:
- most of the logs are useless, printing internal details of the infrastructure
- searching the logs for anything actually important is very hard, which is a serious problem when training and monitoring mission-critical models
- for short jobs it is quite costly, with CloudWatch costs high relative to the actual compute costs
Adding options to configure the log4j logger, or at least some way to limit this (e.g. a minimum logging level), would get rid of the noise. It should also be very simple to implement.
How would this feature be used? Please describe.
Additional argument(s) passed to e.g. PySparkProcessor.
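For illustration only, a sketch of how such an argument might look; `spark_log_level` is a hypothetical parameter, not an existing API, and the role, version and instance settings below are placeholders:

    from sagemaker.spark.processing import PySparkProcessor

    # Hypothetical sketch: `spark_log_level` does NOT exist today; it only
    # illustrates how the requested option could be exposed on the processor.
    processor = PySparkProcessor(
        base_job_name="example-pyspark",
        framework_version="3.1",  # assumed available Spark container version
        role="arn:aws:iam::123456789012:role/ExampleSageMakerRole",  # placeholder
        instance_count=2,
        instance_type="ml.m5.xlarge",
        spark_log_level="WARN",  # proposed: minimum level for Spark/YARN loggers
    )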
Describe alternatives you've considered
Full customizability is not necessarily required, but being able to set a minimum log level is very important.
I've had some success passing a configuration similar to the one below. You might need to adjust it, but after a bit of experimentation it does the job. It would be good if this were in the docs, or possibly made the default.
configuration = [
    {
        "Classification": "spark-log4j",
        "Properties": {
            "log4j.rootCategory": "WARN",
            "log4j.logger.org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver": "WARN",
            "log4j.logger.org.sparkproject.jetty": "WARN",
            "log4j.logger.org.sparkproject.jetty.util.component.AbstractLifeCycle": "ERROR",
            "log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper": "INFO",
            "log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter": "INFO",
            "log4j.logger.org.apache.parquet": "ERROR",
            "log4j.logger.parquet": "ERROR",
        },
    },
    {
        "Classification": "hadoop-log4j",
        "Properties": {
            "log4j.rootCategory": "WARN",
            "log4j.threshold": "WARN",
            "hadoop.root.logger": "WARN,console",
            "hadoop.security.logger": "WARN,NullAppender",
            "hdfs.audit.logger": "WARN,NullAppender",
            "namenode.metrics.logger": "WARN,NullAppender",
            "datanode.metrics.logger": "WARN,NullAppender",
            "rm.audit.logger": "WARN,NullAppender",
        },
    },
    {
        "Classification": "hadoop-env",
        "Properties": {},
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {
                    "HADOOP_ROOT_LOGGER": "WARN,console",
                },
                "Configurations": [],
            }
        ],
    },
]
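For completeness, a sketch of how this might be wired up, assuming the `configuration` parameter of PySparkProcessor.run accepts this EMR-style list; the role, script name, S3 path and instance settings are placeholders:

    from sagemaker.spark.processing import PySparkProcessor

    # Placeholder setup; adjust role, versions and instance settings to your account.
    processor = PySparkProcessor(
        base_job_name="quiet-pyspark",
        framework_version="3.1",
        role="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
        instance_count=2,
        instance_type="ml.m5.xlarge",
    )

    processor.run(
        submit_app="preprocess.py",  # placeholder PySpark script
        arguments=["--input", "s3://example-bucket/input"],
        configuration=configuration,  # the spark-log4j / hadoop-log4j list defined above
    )

With this in place, the Spark and Hadoop loggers start at WARN, so the CloudWatch stream is limited to warnings, errors, and your own application output.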