airflow icon indicating copy to clipboard operation
airflow copied to clipboard

Difference between logs which stores in files and logs in UI

Open Zoynels opened this issue 1 year ago • 5 comments
trafficstars

Apache Airflow version

2.9.1

If "Other Airflow 2 version" selected, which one?

No response

What happened?

We print to log some information and saw that logs has difference between logs which stores in files and logs in UI (also downloaded logs). Difference is in "empty lines" (we need them in some cases).

stored in file -- real log downloaded from UI Web UI: image

What you think should happen instead?

I think that they should be equal in information that was loggen and saved.

How to reproduce

from airflow import DAG
from airflow.decorators import task
import pendulum

with DAG(
    dag_id="test_logs",
    catchup=False,
    start_date=pendulum.datetime(2024, 5, 17, tz="UTC"),
    schedule=None,
    tags=["logs"],
) as dag1:

    @task()
    def print_log1():
        print("1\n2\n\n3\r\n4\n\r5\n \n6")
        import logging        
        logger = logging.getLogger(__name__)
        logger.info("1\n2\n\n3\r\n4\n\r5\n \n6")
    print_log1()

Operating System

debian / docker

Versions of Apache Airflow Providers

No response

Deployment

Docker-Compose

Deployment details

standard apache/airflow:2.9.1

Anything else?

No response

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

Zoynels avatar May 17 '24 14:05 Zoynels

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

boring-cyborg[bot] avatar May 17 '24 14:05 boring-cyborg[bot]

This happen due to log deduplication, which might happen when logs streaming from remote logging

https://github.com/apache/airflow/blob/3938f71dfae21c84a3518625543a28ad02edf641/airflow/utils/log/file_task_handler.py#L130-L140

Taragolis avatar May 17 '24 14:05 Taragolis

As I understood, the main problem is in log.splitlines(), which split log-string by simple lines and not by log-messages. Then function analyzes line by line and deduplicates lines, but we need to analyze and deduplicate log-messages. As Airflow can be configured with a custom log-format, then we need to store the pattern in config (custom patterns) to split the whole log into log-messages.

Zoynels avatar May 17 '24 17:05 Zoynels

If you have a suggestion how improve logging feel free to raise a PR which will work with any type of existed loggers without breaking changes.

Taragolis avatar May 17 '24 17:05 Taragolis

I had the same multiple times, for example using DockerOperator which logs all stdout of the container upon failure. Also the logs are messed up not only because of the split lines but also because the file log handler per default tries to sort messages. This not only causes a lot of overhead on the server, it also changes the order any makes a confusion.

Looking forward that somebody raises a PR allowing log sorting and merging to be turned off :-)

jscheffl avatar May 19 '24 15:05 jscheffl