Current nightly build of `shards` is broken

Open straight-shoota opened this issue 1 year ago • 5 comments

The shards binary in the current nightly build is broken.

$ wget https://output.circle-artifacts.com/output/job/249b749b-4d24-434a-8138-d0b3530b7bf7/artifacts/0/dist_packages/crystal-1.14.0-dev-1-linux-x86_64.tar.gz
$ tar -xzf crystal-1.14.0-dev-1-linux-x86_64.tar.gz
$ crystal-1.14.0-dev-1/bin/shards --version
bash: crystal-1.14.0-dev-1/bin/shards: cannot execute: required file not found
$ ls -lh crystal-1.14.0-dev-1/bin/shards
-rwxr-xr-x 1 root root 3.3M Sep  3 00:12 crystal-1.14.0-dev-1/bin/shards
$ type crystal-1.14.0-dev-1/bin/shards
crystal-1.14.0-dev-1/bin/shards is crystal-1.14.0-dev-1/bin/shards

Not sure what's going on. Might just be a fluke and it'll be fixed in the next build. I'm not aware we changed anything in the build process of shards.

Anyway, we appear to be missing a validation of the build product. A broken build should never be published.
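
A minimal sketch of what such a validation could look like, assuming it runs on the kind of system users install to (here a glibc system) rather than inside the build environment. The helper name `smoke_test` is hypothetical and not part of distribution-scripts.

```shell
# Hypothetical helper: returns 0 only if the binary can actually be executed.
# Assumes the binary supports a cheap, side-effect-free flag such as --version.
smoke_test() {
  bin=$1
  if [ ! -x "$bin" ]; then
    echo "FAIL: $bin is missing or not executable" >&2
    return 1
  fi
  # A binary whose ELF interpreter is absent still has the executable bit set,
  # so only actually running it reveals the problem.
  if ! "$bin" --version >/dev/null 2>&1; then
    echo "FAIL: $bin --version did not run successfully" >&2
    return 1
  fi
  echo "OK: $bin"
}

# Usage (path taken from the report above):
#   smoke_test crystal-1.14.0-dev-1/bin/shards || exit 1
```

Running the same check inside the build container would likely pass even for a broken artifact, because the interpreter the binary asks for may exist there.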

straight-shoota avatar Sep 03 '24 21:09 straight-shoota

In the current nightly build the shards executable does work again:

$ crystal-1.14.0-dev-1/bin/shards --version
Shards 0.18.0 [31b44d3] (2024-03-28)
$ ls -lh crystal-1.14.0-dev-1/bin/shards
-rwxr-xr-x 1 root root 5.6M Sep  4 00:12 crystal-1.14.0-dev-1/bin/shards

So this appears to have been a random failure.

That still means we need validation of the build artifacts.

straight-shoota avatar Sep 04 '24 07:09 straight-shoota

And today's nightly build is broken again. So apparently this wasn't a fluke.

$ wget https://output.circle-artifacts.com/output/job/fae3e672-872b-473b-a555-5234fe773654/artifacts/0/dist_packages/crystal-1.14.0-dev-1-linux-x86_64.tar.gz
$ tar -xzf crystal-1.14.0-dev-1-linux-x86_64.tar.gz
$ crystal-1.14.0-dev-1/bin/shards --version
zsh: no such file or directory: crystal-1.14.0-dev-1/bin/shards
$ ls -lh crystal-1.14.0-dev-1/bin/shards
-rwxr-xr-x 1 johannes johannes 3.3M Sep  5 02:13 crystal-1.14.0-dev-1/bin/shards

straight-shoota avatar Sep 05 '24 07:09 straight-shoota

Digging a bit more into it, it seems the shards binary is actually a dynamically linked executable linking against musl libc. The "no such file or directory" error comes from the fact that the interpreter /lib/ld-musl-x86_64.so.1 is missing on a glibc system.

$ readelf -l crystal-1.14.0-dev-1/bin/shards

Elf file type is DYN (Position-Independent Executable file)
Entry point 0xa6d0
There are 12 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x00000000000002a0 0x00000000000002a0  R      0x8
  INTERP         0x00000000000002e0 0x00000000000002e0 0x00000000000002e0
                 0x0000000000000019 0x0000000000000019  R      0x1
      [Requesting program interpreter: /lib/ld-musl-x86_64.so.1]

The weird part about this is that we actually do have a check to ensure the shards binary is statically linked:

https://github.com/crystal-lang/distribution-scripts/blob/1b7fb7ff2a2a9d535ec95dd3aedbf8e1fc627212/linux/Dockerfile#L74

So I'm not sure how it could pass that test 😕 It fails when I test locally.
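
For reference, a minimal sketch of a check based on the `readelf -l` output shown above. This is not the check used in the Dockerfile; the helper names `elf_interpreter` and `assert_no_interpreter` are hypothetical.

```shell
# Hypothetical helper: reads `readelf -l` output on stdin and prints the
# requested program interpreter, or nothing if there is no INTERP header.
elf_interpreter() {
  sed -n 's/.*Requesting program interpreter: \(.*\)].*/\1/p'
}

# Hypothetical helper: fails if the given binary requests an interpreter,
# i.e. if it is dynamically linked. Also fails if readelf cannot read it.
assert_no_interpreter() {
  headers=$(readelf -l "$1") || return 1
  interp=$(printf '%s\n' "$headers" | elf_interpreter)
  if [ -n "$interp" ]; then
    echo "FAIL: $1 requests interpreter $interp" >&2
    return 1
  fi
}

# Usage (path taken from the report above):
#   assert_no_interpreter crystal-1.14.0-dev-1/bin/shards || exit 1
```

A check like this only helps if it inspects the file that ends up in the package, after every step that could replace it.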

straight-shoota avatar Sep 05 '24 10:09 straight-shoota

Today's build is fine again 🤷

straight-shoota avatar Sep 06 '24 06:09 straight-shoota

And today's as well.

straight-shoota avatar Sep 07 '24 09:09 straight-shoota

And this is happening again.

https://github.com/athena-framework/demo/actions/runs/11063118636/job/30738844538

straight-shoota avatar Sep 27 '24 08:09 straight-shoota

I've encountered this in 2 of my projects recently:

  • https://github.com/devnote-dev/cling/actions/runs/11753866125/job/32747205715#step:4:88
  • https://github.com/devnote-dev/docr/actions/runs/11759771854/job/32759616408#step:3:90

devnote-dev avatar Nov 10 '24 15:11 devnote-dev

This issue has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/post-mortem-issues-in-the-crystal-1-14-1-release-process/7610/1

crysbot avatar Jan 14 '25 13:01 crysbot

@ggiraldez suggests the issue might be caused by `make -C shards install` rebuilding the executable. It's not clear why make would consider the build dependency out of date, though. My guess is that it's related to file timestamps in Docker. That would explain the observations, particularly their sporadic nature.

Unfortunately, the CI logs have the docker output truncated so I'm afraid we cannot retrace whether that happened on previous builds (https://github.com/crystal-lang/distribution-scripts/pull/346 is supposed to fix that). I haven't been able to reproduce and observe this locally yet.

straight-shoota avatar Jan 22 '25 19:01 straight-shoota

Happened again: https://github.com/athena-framework/athena/actions/runs/12941655549/job/36098066968.

Blacksmoke16 avatar Jan 24 '25 03:01 Blacksmoke16

I think it's happening continuously at the moment, and the CI of install-crystal is monitoring the situation quite well, I'd say. https://github.com/crystal-lang/install-crystal/actions

oprypin avatar Jan 26 '25 12:01 oprypin

Well, any CI workflow that regularly runs with crystal latest documents the effect.

We'd know more if we updated distribution-scripts to show full build logs: https://github.com/crystal-lang/crystal/pull/15368

straight-shoota avatar Jan 26 '25 12:01 straight-shoota

The logs from the latest nightly build confirm the suspicion: `make install` rebuilds shards, and that rebuild runs without the original configuration (e.g. `static=1`).
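
One way to catch this kind of silent rebuild is to compare the binary before and after the install step. A minimal sketch, not something distribution-scripts does; the helper names `checksum` and `unchanged_by` are hypothetical.

```shell
# Hypothetical helper: prints a checksum of the file's contents.
checksum() {
  cksum < "$1"
}

# Hypothetical helper: runs a command and fails if it changed the given file.
# Usage: unchanged_by FILE COMMAND [ARGS...]
unchanged_by() {
  file=$1
  shift
  before=$(checksum "$file")
  "$@" || return 1
  after=$(checksum "$file")
  if [ "$before" != "$after" ]; then
    echo "FAIL: $file was modified by: $*" >&2
    return 1
  fi
}

# Usage (the binary path is an assumption, the make invocation is from this thread):
#   unchanged_by shards/bin/shards make -C shards install || exit 1
```

A rebuild that happens to produce a byte-identical binary would go unnoticed, but in that case the artifact is also unaffected.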

https://app.circleci.com/pipelines/github/crystal-lang/crystal/17043/workflows/7577b52c-6964-41f5-9d63-c46c042d4e00/jobs/88511?invite=true#step-102-137607_132

straight-shoota avatar Jan 29 '25 13:01 straight-shoota

`make --trace` shows it thinks `shard.yml` is newer than `shard.lock`: `update target 'shard.lock' due to: shard.yml`

straight-shoota avatar Jan 29 '25 13:01 straight-shoota

Okay, so it seems to be the classic issue that timestamps might be slightly off after a git checkout. And there's an error in the Makefile: the `shard.lock` recipe doesn't touch the target if `SHARDS=false` and we're bootstrapping by downloading `lib/molinillo` with curl. Hence the original `make` build process doesn't mark `shard.lock` as fresh. This then triggers a rebuild on `make install`.
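
The effect can be illustrated without make. The sketch below is a simplified model of make's staleness rule, not the actual Makefile. The helper `is_stale` and the timestamps are made up for illustration, and `-nt` is a widely supported extension of `test` rather than strict POSIX.

```shell
# Simplified model of make's rule: a target is stale if it does not exist
# or if a prerequisite has a newer modification time.
is_stale() {
  [ ! -e "$1" ] || [ "$2" -nt "$1" ]
}

demo=$(mktemp -d)
# Simulate a checkout where shard.yml ends up one minute newer than shard.lock.
touch -t 202001010000 "$demo/shard.lock"
touch -t 202001010001 "$demo/shard.yml"

if is_stale "$demo/shard.lock" "$demo/shard.yml"; then before=stale; else before=fresh; fi

# What a recipe that skips regenerating the lock file would still need to do:
touch "$demo/shard.lock"

if is_stale "$demo/shard.lock" "$demo/shard.yml"; then after=stale; else after=fresh; fi

echo "before: $before, after: $after"
```

As long as the recipe leaves the target's timestamp alone, every later make invocation, including `make install`, sees `shard.lock` as out of date again.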

I am honestly surprised this issue has only started appearing quite recently.

straight-shoota avatar Jan 29 '25 14:01 straight-shoota