redlib 🐛 Bug Report: Blank site, unhealthy container

Describe the bug

My instance, l.opnxng.com, has been experiencing frequent downtime recently. The site is blank when visited, and Firefox reports NS_BINDING_ABORTED.

There is nothing in the Docker container logs, but the container is unhealthy.
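When the container logs are empty but Docker still marks the container unhealthy, the health check's own recorded output can sometimes say why. Below is a minimal Python sketch that reads the health state exposed by `docker inspect`. It assumes the image defines a health check, and the container name `redlib` is only a placeholder.

```python
import json
import subprocess


def summarize_health(health_json: str) -> list[str]:
    """Turn the JSON from `docker inspect` (.State.Health) into readable lines."""
    health = json.loads(health_json)
    if not health:
        # `null` means the container has no health check configured.
        return ["no health check configured"]
    lines = [
        f"status={health.get('Status')} failing_streak={health.get('FailingStreak')}"
    ]
    # Each log entry records one run of the health check command.
    for entry in health.get("Log") or []:
        output = (entry.get("Output") or "").strip()
        lines.append(
            f"{entry.get('End')} exit={entry.get('ExitCode')} output={output!r}"
        )
    return lines


def inspect_health(container: str = "redlib") -> list[str]:
    # "redlib" is a hypothetical container name; substitute your own.
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{json .State.Health}}", container],
        capture_output=True,
        text=True,
        check=True,
    )
    return summarize_health(result.stdout)
```

Calling `print("\n".join(inspect_health("your-container")))` on the host would list the most recent health check runs with their exit codes and output.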

The same bug appears on both AMD and ARM platforms. I'm using the latest images from quay.io.

Steps to reproduce the bug

Not sure, but the bug appears on l.opnxng.com.

What's the expected behavior?

The site can be visited normally.

Additional context / screenshot

Please let me know if there are steps that I can take to give you more information, or if someone else is experiencing the same behavior.

Mar 31 '24 16:03 extremelyonline

Try adding RUST_LOG=redlib=trace (logging on the crate) or RUST_LOG=trace (logging on all dependencies as well) to your environment variables.
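For a Docker deployment, the variable can be passed with `-e`. Here is a small Python sketch that assembles such a `docker run` command. The image tag `quay.io/redlib/redlib:latest`, the container name, and the port binding are assumptions, so adjust them to match your setup.

```python
import shlex


def docker_run_command(
    log_filter: str = "redlib=trace",
    image: str = "quay.io/redlib/redlib:latest",  # assumed image tag
    name: str = "redlib",  # hypothetical container name
    port: int = 8080,  # assumed port, bound to localhost only
) -> list[str]:
    """Build a `docker run` invocation with RUST_LOG set."""
    return [
        "docker", "run", "--rm",
        "--name", name,
        "-e", f"RUST_LOG={log_filter}",
        "-p", f"127.0.0.1:{port}:{port}",
        image,
    ]


# Print the command so it can be reviewed before running it by hand.
print(shlex.join(docker_run_command()))
```

Passing `log_filter="trace"` instead would enable trace logging for all dependencies as well.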

Mar 31 '24 18:03 sigaloid

I think the images on quay.io are just broken, as they're also out of date. I'm running into the same issues, and I've already enabled debugging, but nothing out of the ordinary shows. Connecting directly to the container's port just gives me a blank page, and going through my reverse proxy gives me a 503.

Maybe someone needs to look into if something is up with the CI?
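To tell the two symptoms apart (an empty body versus a 5xx response), a small probe can be pointed at either the container port or the proxy. This is a rough Python sketch using only the standard library. The labels it returns are made up for illustration, and the URL is whatever address you are testing.

```python
import urllib.error
import urllib.request


def classify(status: int, body: bytes) -> str:
    """Label a response as a server error, a blank page, or ok."""
    if status >= 500:
        return "server error"
    if not body.strip():
        return "blank page"
    return "ok"


def probe(url: str, timeout: float = 10.0) -> str:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status, resp.read())
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError but still carry a body.
        return classify(err.code, err.read())
    except (urllib.error.URLError, OSError) as err:
        # Covers refused connections, timeouts, and DNS failures.
        return f"connection failed: {err}"
```

For example, `probe("http://127.0.0.1:8080/")` would check the container directly, assuming it is published on that port.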

Apr 07 '24 20:04 spectrapulse

:disappointed: sounds related to #74. Will prioritize

Apr 07 '24 20:04 sigaloid

#71 also looks to be related. I've taken a peek at the workflow files, and it seems like we currently have a lot of moving parts with many points of failure... Maybe it's a better idea to not tie everything so closely together?

Also, is there a reason we're using quay.io instead of ghcr.io? I haven't worked with quay before, but there currently doesn't seem to be a history of previous images that we can pull from, or I might have just missed it.

Apr 07 '24 20:04 spectrapulse

I already used quay for other things, so I was most familiar with it, though I'm open to adding ghcr. And yes, the recent PR was supposed to simplify the actions, but it may have broken them.

Though, we really do need a complicated CI, because we need to build images for a set of targets on release, but also run fmt/check/test on every commit, etc. I'm just not as familiar with GitHub's CI/CD yet and the fact that I can't test locally for releases is a bit frustrating.

Apr 07 '24 20:04 sigaloid

I mean, I understand the need for a complicated CI when we're dealing with testing and publishing binaries. But right now, if publishing anything fails (not just our tests), whether to crates.io or to GitHub's artifacts, everything else fails too. None of the container images get built or published, even when they may not be faulty. Also, GitHub Actions tends to be inconsistent and sometimes fails for no apparent reason, and all that's needed is a simple re-run of the job. So not relying on every stage at all times might make the CI less prone to errors, because we can't always control every point of failure.

Yeah, CI time would be a bit longer if we build the images from source, or build the binary in the same job and include it in the image using separate steps, but at least this way not everything fails. Most of the build-time issues can be resolved with caching, which, depending on how Rust handles its cache, might not be too hard to add.

Unfortunately, while I'm quite familiar with GitHub's CI (and CI in general), I'm not too familiar with packaging and publishing Rust projects. That makes it hard for me to follow everything going on within the CI and to set it up well enough to make a PR myself without doing a lot of research. I'm not exactly unwilling to do that, but I'm pretty sure there are others who know better approaches than I do, being someone reading the related docs for the first time or doing a few Google searches.

Apr 07 '24 22:04 spectrapulse

This is still an issue on the latest Quay.io release (from 7 days ago as of the time of reporting).

Jun 12 '24 07:06 Handrail9

I can't reproduce it. Can you try running it straight from the source code with the aforementioned logging variables set?

Jun 15 '24 16:06 sigaloid

I’ll try adding the logging variables since this happens to me too. I’ve been having to use de-unhealth because of it. The container gets restarted from unhealthy every 5-10 minutes. Will update if I can figure it out. Running on a headless Linux server with docker-compose, an nginx reverse proxy, ports 127.0.0.1:8080:8080, and Cloudflare proxying the site via wildcard.

edit: Since I posted this during some major oauth problems, I waited a couple weeks, with logs on, to make sure I could get consistent info about the issue. Only problem I’m facing is that the issue hasn’t happened since! Almost 2 whole weeks of uptime so far, thank you guys!

Jun 18 '24 16:06 LucifersCircle