[Bug] Alpine-based images quit with fatal error on aarch64

Open ozangunalp opened this issue 1 year ago • 5 comments

Search before asking

  • [X] I searched in the issues and found nothing similar.

Read release policy

  • [X] I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

Official Pulsar images with 3.3.0 and 3.3.1

Minimal reproduce step

Run the Alpine-based container images on an aarch64 machine. We could reproduce it on RHEL 8 and on a Raspberry Pi, but not on M1.

What did you expect to see?

The Pulsar server continues to run.

What did you see instead?

Here is the complete log of the container: pulsar.txt

Last lines of log before fatal error :

2024-09-10T14:02:58,193+0000 [pulsar-io-18-4] INFO  org.apache.pulsar.broker.service.ServerCnx - [[id: 0xa1215d54, L:/127.0.0.1:6650 - R:/127.0.0.1:34536] [SR:127.0.0.1, state:Connected]] Subscribing on topic persistent://public/default/__change_events / reader-936c229a0f. consumerId: 0
2024-09-10T14:02:58,269+0000 [pulsar-io-18-4] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Opening managed ledger public/default/persistent/__change_events
2024-09-10T14:02:58,271+0000 [bookkeeper-ml-scheduler-OrderedScheduler-2-0] INFO  org.apache.bookkeeper.mledger.impl.MetaStoreImpl - Creating '/managed-ledgers/public/default/persistent/__change_events'
2024-09-10T14:02:58,340+0000 [bookkeeper-ml-scheduler-OrderedScheduler-2-0] INFO  org.apache.bookkeeper.client.LedgerCreateOp - Ensemble: [192.168.144.2:46605] for ledger: 1
2024-09-10T14:02:58,344+0000 [BookKeeperClientWorker-OrderedExecutor-18-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [public/default/persistent/__change_events] Created ledger 1 after closed null
2024-09-10T14:02:58,352+0000 [bookkeeper-ml-scheduler-OrderedScheduler-2-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - [public/default/persistent/__change_events] Successfully initialize managed ledger
2024-09-10T14:02:58,394+0000 [bookkeeper-ml-scheduler-OrderedScheduler-2-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://public/default/__change_events] Disabled replicated subscriptions controller
2024-09-10T14:02:58,428+0000 [broker-topic-workers-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [public/default/persistent/__change_events] Cursor __compaction recovered to position 1:-1
2024-09-10T14:02:58,444+0000 [bookkeeper-ml-scheduler-OrderedScheduler-2-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [public/default/persistent/__change_events] Opened new cursor: ManagedCursorImpl{ledger=public/default/persistent/__change_events, name=__compaction, ackPos=1:-1, readPos=1:0}
2024-09-10T14:02:58,455+0000 [bookkeeper-ml-scheduler-OrderedScheduler-2-0] INFO  org.apache.pulsar.broker.service.BrokerService - Created topic persistent://public/default/__change_events - dedup is disabled
2024-09-10T14:02:58,501+0000 [bookkeeper-ml-scheduler-OrderedScheduler-2-0] INFO  org.apache.bookkeeper.client.LedgerCreateOp - Ensemble: [192.168.144.2:46605] for ledger: 2
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000ffffa0b43e78, pid=10, tid=280
#
# JRE version: OpenJDK Runtime Environment Corretto-21.0.3.9.1 (21.0.3+9) (build 21.0.3+9-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-21.0.3.9.1 (21.0.3+9-LTS, mixed mode, tiered, compressed class ptrs, z gc, linux-aarch64)
# Problematic frame:
# 2024-09-10T14:03:28,153+0000 [pulsar-io-18-5] INFO  org.apache.pulsar.broker.service.ServerCnx - [/192.168.144.1:44996] Closing consumer: consumerId=0
2024-09-10T14:03:28,154+0000 [pulsar-io-18-5] INFO  org.apache.pulsar.broker.service.ServerCnx - [/192.168.144.1:44996] Closed consumer before its creation was completed. consumerId=0
2024-09-10T14:03:28,174+0000 [pulsar-io-18-5] INFO  org.apache.pulsar.broker.service.ServerCnx - Closed connection from /192.168.144.1:44996
2024-09-10T14:03:28,174+0000 [pulsar-io-18-1] INFO  org.apache.pulsar.broker.service.ServerCnx - Closed connection from /192.168.144.1:44986
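
For anyone triaging a similar log, here is a minimal sketch of how the crash banner could be pulled out of a captured log file such as the attached pulsar.txt. The function names are hypothetical, and `grep -A` is assumed to be available (it is in GNU, BSD, and BusyBox grep, but is not strictly POSIX).

```shell
#!/bin/sh
# Sketch: locate the JVM crash banner in a captured container log.
# Function names are illustrative only.

# Print the banner: the marker line plus the lines that follow it.
# $1 = log file, $2 = number of trailing lines to include (default 8).
jvm_crash_banner() {
    grep -F -A "${2:-8}" \
        'A fatal error has been detected by the Java Runtime Environment' "$1"
}

# Print only the signal name (for example SIGSEGV) from the banner, if any.
jvm_crash_signal() {
    sed -n 's/^#[[:space:]]*\(SIG[A-Z]*\)[[:space:]].*pc=.*/\1/p' "$1" | head -n 1
}
```

Usage would be something like `jvm_crash_banner pulsar.txt` to see the banner in context, or `jvm_crash_signal pulsar.txt` to get just the signal. Note that in the log above, broker output is interleaved with the banner, so the "Problematic frame" line may not be cleanly recoverable this way.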

Anything else?

Originally posted on https://github.com/quarkusio/quarkus/issues/43187

Are you willing to submit a PR?

  • [ ] I'm willing to submit a PR!

ozangunalp avatar Sep 13 '24 13:09 ozangunalp

Thanks for reporting this issue @ozangunalp. Most of the Pulsar developers use Macs with Apple Silicon so I guess that's why we haven't caught this issue earlier.

Run the Alpine-based container images on an aarch64 machine. We could reproduce it on RHEL 8 and on a Raspberry Pi, but not on M1.

Any hints on what would be a practical way to reproduce this? Using a cloud VM on aarch64? Any recommendations?

lhotari avatar Sep 16 '24 11:09 lhotari

Most of the Pulsar developers use Macs with Apple Silicon so I guess that's why we haven't caught this issue earlier.

Same for me. I was able to reproduce it on a Raspberry Pi running podman:

Raspberry Pi 5 Model B Rev 1.0
Linux raspberrypi 6.1.0-rpi7-rpi-2712 #1 SMP PREEMPT Debian 1:6.1.63-1+rpt1 (2023-11-24) aarch64 GNU/Linux

But yes, a cloud VM on aarch64 should work.

ozangunalp avatar Sep 17 '24 11:09 ozangunalp

I tried to reproduce on GCP t2a-standard-1 / Ampere Altra Arm64 with Debian Bookworm and docker installed with instructions from https://docs.docker.com/engine/install/debian/. I couldn't reproduce the issue.

lhotari avatar Oct 14 '24 09:10 lhotari

I tried to reproduce on GCP t2a-standard-1 / Ampere Altra Arm64 with Debian Bookworm and podman and couldn't reproduce the issue.

lhotari avatar Oct 14 '24 10:10 lhotari

It didn't reproduce with RHEL 9 on GCP t2a-standard-1 / Ampere Altra Arm64. GCP doesn't have a RHEL 8 image available for Arm64, so I used the RHEL 9 Arm64 image.

[lari_hotari@instance-20241014-100511 ~]$ uname -a
Linux instance-20241014-100511 5.14.0-427.37.1.el9_4.aarch64 #1 SMP PREEMPT_DYNAMIC Fri Sep 13 17:15:09 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

I used these commands:

yum install -y podman tmux
tmux
# in one tmux window
podman run --rm -it docker.io/apachepulsar/pulsar:3.3.1 bin/pulsar standalone
# in another tmux window (CTRL-B C)
podman exec -it pulsar bin/pulsar-perf produce test
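
The `exec` step above targets a container named `pulsar`, while the `run` step does not set a name. Below is a rough sketch of the same steps as reusable functions. It assumes `--name pulsar` is passed to `run`, and it leaves out `--rm` so the logs are still readable after a crash. The `RUNTIME`, `NAME` and `IMAGE` variables and the function names are illustrative, not part of the original commands.

```shell
#!/bin/sh
# Sketch of the reproduction steps. Variable and function names are
# hypothetical; override RUNTIME to use docker instead of podman.
RUNTIME="${RUNTIME:-podman}"
NAME="${NAME:-pulsar}"
IMAGE="${IMAGE:-docker.io/apachepulsar/pulsar:3.3.1}"

# Start standalone Pulsar detached under a fixed name, so that later
# "exec" and "logs" calls can find the container. No --rm here: a
# crashed container must stay around for its logs to be inspected.
start_standalone() {
    "$RUNTIME" run -d --name "$NAME" "$IMAGE" bin/pulsar standalone
}

# Generate load against the running container (runs until interrupted).
produce_load() {
    "$RUNTIME" exec "$NAME" bin/pulsar-perf produce test
}

# Succeed if the container output contains the JVM fatal error marker.
crashed() {
    "$RUNTIME" logs "$NAME" 2>&1 | grep -q -F 'A fatal error has been detected'
}

# Remove the container once done.
cleanup() {
    "$RUNTIME" rm -f "$NAME"
}
```

A run would then look roughly like: `start_standalone`, wait for the broker to come up, `produce_load` for a while, then `crashed && echo reproduced` and finally `cleanup`.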

@ozangunalp Do you have any suggestions for reproducing on a cloud VM? Which commands should I use?

lhotari avatar Oct 14 '24 10:10 lhotari

This is most likely resolved by #23762, which will be included in the Pulsar 3.3.4 and Pulsar 4.0.2 releases.

lhotari avatar Jan 05 '25 11:01 lhotari