airflow
airflow copied to clipboard
Difference between logs which stores in files and logs in UI
Apache Airflow version
2.9.1
If "Other Airflow 2 version" selected, which one?
No response
What happened?
We print to log some information and saw that logs has difference between logs which stores in files and logs in UI (also downloaded logs). Difference is in "empty lines" (we need them in some cases).
stored in file -- real log
downloaded from UI
Web UI:
What you think should happen instead?
I think that they should be equal in information that was loggen and saved.
How to reproduce
from airflow import DAG
from airflow.decorators import task
import pendulum
with DAG(
dag_id="test_logs",
catchup=False,
start_date=pendulum.datetime(2024, 5, 17, tz="UTC"),
schedule=None,
tags=["logs"],
) as dag1:
@task()
def print_log1():
print("1\n2\n\n3\r\n4\n\r5\n \n6")
import logging
logger = logging.getLogger(__name__)
logger.info("1\n2\n\n3\r\n4\n\r5\n \n6")
print_log1()
Operating System
debian / docker
Versions of Apache Airflow Providers
No response
Deployment
Docker-Compose
Deployment details
standard apache/airflow:2.9.1
Anything else?
No response
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
This happen due to log deduplication, which might happen when logs streaming from remote logging
https://github.com/apache/airflow/blob/3938f71dfae21c84a3518625543a28ad02edf641/airflow/utils/log/file_task_handler.py#L130-L140
As I understood, the main problem is in log.splitlines(), which split log-string by simple lines and not by log-messages. Then function analyzes line by line and deduplicates lines, but we need to analyze and deduplicate log-messages. As Airflow can be configured with a custom log-format, then we need to store the pattern in config (custom patterns) to split the whole log into log-messages.
If you have a suggestion how improve logging feel free to raise a PR which will work with any type of existed loggers without breaking changes.
I had the same multiple times, for example using DockerOperator which logs all stdout of the container upon failure. Also the logs are messed up not only because of the split lines but also because the file log handler per default tries to sort messages. This not only causes a lot of overhead on the server, it also changes the order any makes a confusion.
Looking forward that somebody raises a PR allowing log sorting and merging to be turned off :-)