osbuild-composer icon indicating copy to clipboard operation
osbuild-composer copied to clipboard

main: journald support with external correlation

Open lzap opened this issue 11 months ago • 15 comments

This patch adds native journald support through logrus which enables to work with structured logs. It uses a dependency that is already in the project (coreos/systemd-go) which is a pure Go implementation. New log type is introduced named "journal" and it is now the default with fallback to "text" if journald is not present on the system.

Structured logging enabled me to work with logrus fields that can help with correlating log records on standalone or hosted deployments. Already existing "operation_id" field (random string) is now fully propagated via echo context logger and new logrus hook extracts it from Go standard context.

A new field called "external_id" is extracted from HTTP header X-External-Id when present. The idea is to set this to X-Edge-Request-Id on hosted image builder so logs can be easily correlated across both services.

Additionally, each cloud API request is now logged using the built-in echo logging middleware which creates entries like Processed request GET /path. Together with journald this can be tested end-to-end:

# curl -H "X-External-Id: XXXX" --unix-socket /run/cloudapi/api.socket -XGET http://localhost/api/image-builder-composer/v2/openapi

This will create the following entries:

# journalctl -e -o json-pretty --output-fields=OPERATION_ID,MESSAGE,EXTERNAL_ID
{
        "OPERATION_ID" : "2eJwe1mdLa9s2rB3PVZbVAttQi1",
        "EXTERNAL_ID" : "XXXX",
        "MESSAGE" : "Returning OpenAPI document"
}
{
        "MESSAGE" : "Processed request GET /api/image-builder-composer/v2/openapi",
        "EXTERNAL_ID" : "XXXX",
        "OPERATION_ID" : "2eJwe1mdLa9s2rB3PVZbVAttQi1",
}

Users can easily dig logs per request or operation:

# journalctl -u osbuild-composer OPERATION_ID=2eJwe1mdLa9s2rB3PVZbVAttQi1 -o cat
Returning OpenAPI document
Processed request GET /api/image-builder-composer/v2/openapi

TODO

  • [ ] adequate testing for the new functionality or fixed issue
  • [ ] adequate documentation informing people about the change such as
    • [ ] submit a PR for the READMEs listed here
    • [ ] submit a PR for the osbuild.org website repository if this PR changed any behavior not covered by the automatically updated READMEs

lzap avatar Mar 28 '24 14:03 lzap

Moved all comments into the codebase, broken into three commits:

  • Documentation and gitignore update
  • Adding journald native support
  • Propagating operation and external id through context to logs (this one is hard to break up into two as I had to do updates to existing operation id in order to efficiently implement external id)

lzap avatar Apr 04 '24 07:04 lzap

Damn that was a misclick, amended a rebase.

lzap avatar Apr 09 '24 10:04 lzap

+ sudo composer-cli blueprints depsolve ostree
ERROR: Depsolve Error: Get "http://localhost/api/v1/blueprints/depsolve/ostree": EOF

and then an AWS test which I don’t understand why fails

lzap avatar Apr 09 '24 12:04 lzap

and then an AWS test which I don’t understand why fails

iot-bootable-container failed with

dial tcp: lookup cdn03.quay.io: no such host

which looks like a plain old network hiccup. Annoying.

achilleas-k avatar Apr 09 '24 13:04 achilleas-k

@ezr-ondrej please do not rebase, @achilleas-k plans to only restart failed tests. is that right? thanks.

lzap avatar Apr 12 '24 10:04 lzap

@ezr-ondrej please do not rebase, @achilleas-k plans to only restart failed tests. is that right? thanks.

It's a tricky decision. It's good to rebase when there's a lot of changes in main and there's a chance the rebase might break things, but OTOH the CI runtimes are long and annoying.

A merge queue would be great but that doesn't work with flaky testing.

achilleas-k avatar Apr 12 '24 11:04 achilleas-k

Yeah, sorry! I've got new buttons in Gitlab now, I'll push those instead xD

ezr-ondrej avatar Apr 12 '24 14:04 ezr-ondrej

Only schulzbot is red, seems only 8.4 is red, likely not caused by this PR.

lzap avatar Apr 15 '24 07:04 lzap

likely not caused by this PR.

Yeah, it's a timout, very unlikely to be caused by this PR, but it means the tests on 8.4 didn't run on this PR :(

ezr-ondrej avatar Apr 15 '24 10:04 ezr-ondrej

There's a known issue with that test but I thought we fixed it already.

achilleas-k avatar Apr 15 '24 14:04 achilleas-k

Rebased.

lzap avatar Apr 17 '24 12:04 lzap

The main pipeline is green!! Woooot!

ezr-ondrej avatar Apr 22 '24 13:04 ezr-ondrej

Well bunch of tests failed, someone help me to understand if these are relevant. These logs are hard to read.

lzap avatar Apr 22 '24 15:04 lzap

Since all the edge tests are failing but the main gitlab CI pipeline succeeded, it might something with the edge tests themselves or something in the PR changed something the tests rely on.

I'm seeing at least one instance of the issue described in #4094

achilleas-k avatar Apr 22 '24 16:04 achilleas-k

Oh, we're hitting the old:

error: Postprocessing and committing: Finalizing rootfs: Hardlinking rpmdb to base location: Hardlinking /usr/share/rpm to /usr/lib/sysimage/rpm-ostree-base-db: Analyzing /usr/share/rpm/ content: File exists (os error 17)\
Traceback (most recent call last):\
  File \\\"/run/osbuild/bin/org.osbuild.ostree.commit\\\", line 68, in <module>\
    r = main(args[\\\"inputs\\\"],\
  File \\\"/run/osbuild/bin/org.osbuild.ostree.commit\\\", line 55, in main\
    subprocess.run(argv,\
  File \\\"/usr/lib64/python3.9/subprocess.py\\\", line 528, in run\
    raise CalledProcessError(retcode, process.args,\
subprocess.CalledProcessError: Command '['rpm-ostree', 'compose', 'commit', '--repo=/run/osbuild/tree/repo', '--add-metadata-string=version=9-stream', '--add-metadata-string=rpmostree.inputhash=1e1524b04aaad2bbdc26c3d2a1d43e1bf5c95ae327bf399af3815389b277f64d', '--write-composejson-to=/run/osbuild/tree/compose.json', '/tmp/tmp1f6ir9sg.json', '/run/osbuild/tree/tmp22cqxixc']' returned non-zero exit status 1.\
\", \"error\": null}}}]" jobId=b11a4709-a108-40d8-8f81-69a07e79ac94

Commit 6f69cacdc5963aa772e62a37c00fa27157a5b7f3 updated the sources which introduces a version of rpm-ostree that causes an issue we're aware of. The issue has been fixed but we've been waiting for it to reach Fedora 39 and CS9 repos before updating sources. But since edge tests aren't required for merging and we needed newer sources for other reasons, I suppose all our edge tests are failing on F39 and CS9 now.

achilleas-k avatar Apr 22 '24 16:04 achilleas-k

This PR is stale because it has been open 30 days with no activity. Remove "Stale" label or comment or this will be closed in 7 days.

github-actions[bot] avatar May 23 '24 04:05 github-actions[bot]

This PR is stale because it has been open 30 days with no activity. Remove "Stale" label or comment or this will be closed in 7 days.

github-actions[bot] avatar Jun 24 '24 04:06 github-actions[bot]

@achilleas-k WDYT about this one? it'd be great to get it in :)

ezr-ondrej avatar Jun 24 '24 10:06 ezr-ondrej

@achilleas-k WDYT about this one? it'd be great to get it in :)

Yes, we should get this merged. It seems it got caught in CI flakiness purgatory. Let's do a rebase and full run to get it in.

achilleas-k avatar Jun 24 '24 11:06 achilleas-k

Just rpm-build:centos-stream-10-ppc64le failed

lzap avatar Jun 25 '24 06:06 lzap

Thank you @achilleas-k and @lzap ! :orange_heart:

ezr-ondrej avatar Jun 25 '24 17:06 ezr-ondrej