routinator icon indicating copy to clipboard operation
routinator copied to clipboard

Fatal Error: failed to fill whole buffer

Open TheKigen opened this issue 2 years ago • 6 comments

Hi,

I've been running Routinator and have noticed lately a constant crash that is occurring across my edges.

Aug 17 15:53:14 edge-dal01 routinator[311637]: Failed to read stored publication point at /var/lib/routinator/rpki-cache/stored/rrdp/rpki-repo.registro.br/026b1847a231e62a6bcf46a3794665f4dcbe0e609f90c0e54d7062dbf72114d9/rsync/rpki-repo.registro.br/repo/7bT49NANhaBtN8pntdNKGXgkSmJu1Y2LtvQMooH1o5Rv/0/E9213DD4021CE06AB92F4683859FB848C0A265FD.mft: failed to fill whole buffer
Aug 17 15:53:14 edge-dal01 routinator[311637]: Fatal error. Exiting.
Aug 17 15:53:14 edge-dal01 systemd[1]: routinator.service: Main process exited, code=exited, status=1/FAILURE
Aug 17 15:53:14 edge-dal01 systemd[1]: routinator.service: Failed with result 'exit-code'.
Aug 17 15:53:14 edge-dal01 systemd[1]: routinator.service: Consumed 1min 10.636s CPU time.
Aug 17 15:53:15 edge-dal01 systemd[1]: routinator.service: Scheduled restart job, restart counter is at 11166.
Aug 17 15:53:15 edge-dal01 systemd[1]: Stopped Routinator 3000.
Aug 17 15:53:15 edge-dal01 systemd[1]: routinator.service: Consumed 1min 10.636s CPU time.
Aug 17 15:53:15 edge-dal01 systemd[1]: Starting Routinator 3000...
Aug 17 15:53:15 edge-dal01 systemd[1]: Started Routinator 3000.
Aug 18 14:29:38 edge-dal01 routinator[330802]: Failed to read stored publication point at /var/lib/routinator/rpki-cache/stored/rrdp/rpki-repo.registro.br/026b1847a231e62a6bcf46a3794665f4dcbe0e609f90c0e54d7062dbf72114d9/rsync/rpki-repo.registro.br/repo/7bT49NANhaBtN8pntdNKGXgkSmJu1Y2LtvQMooH1o5Rv/0/E9213DD4021CE06AB92F4683859FB848C0A265FD.mft: failed to fill whole buffer
Aug 18 14:29:39 edge-dal01 routinator[330802]: Fatal error. Exiting.
Aug 18 14:29:39 edge-dal01 systemd[1]: routinator.service: Main process exited, code=exited, status=1/FAILURE
Aug 18 14:29:39 edge-dal01 systemd[1]: routinator.service: Failed with result 'exit-code'.
Aug 18 14:29:39 edge-dal01 systemd[1]: routinator.service: Consumed 1min 20.631s CPU time.
Aug 18 14:29:39 edge-dal01 systemd[1]: routinator.service: Scheduled restart job, restart counter is at 11790.
Aug 18 14:29:39 edge-dal01 systemd[1]: Stopped Routinator 3000.
Aug 18 14:29:39 edge-dal01 systemd[1]: routinator.service: Consumed 1min 20.631s CPU time.
Aug 18 14:29:39 edge-dal01 systemd[1]: Starting Routinator 3000...
Aug 18 14:29:39 edge-dal01 systemd[1]: Started Routinator 3000.
root@edge-dal01:/var/log# routinator --version
Routinator 0.11.2
root@edge-dal01:/var/log# df -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           198M  1.2M  197M   1% /run
/dev/vda1        47G   15G   30G  33% /
tmpfs           989M  184K  989M   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           198M  4.0K  198M   1% /run/user/0
root@edge-dal01:/var/log# df --inodes
Filesystem       Inodes  IUsed    IFree IUse% Mounted on
tmpfs            253139    720   252419    1% /run
/dev/vda1      13004800 566939 12437861    5% /
tmpfs            253139      3   253136    1% /dev/shm
tmpfs            253139      3   253136    1% /run/lock
tmpfs             50627     26    50601    1% /run/user/0

The system is Ubuntu 22.04, installation is via the APT repo listed in the installation documentation.

TheKigen avatar Aug 18 '22 14:08 TheKigen

Hi and thanks for the report!

The error message means that file mentioned doesn’t contain all the data expected. Quickest way to fix this by deleting the file or – better yet – the entire store, i.e., /var/lib/routinator/rpki-cache/stored.

Before you do that, can you quickly check how big the file in question is?

partim avatar Aug 18 '22 14:08 partim

root@edge-dal01:~# ls -alh /var/lib/routinator/rpki-cache/stored/rrdp/rpki-repo.registro.br/026b1847a231e62a6bcf46a3794665f4dcbe0e609f90c0e54d7062dbf72114d9/rsync/rpki-repo.registro.br/repo/7bT49NANhaBtN8pntdNKGXgkSmJu1Y2LtvQMooH1o5Rv/0/
total 8.0K
drwxr-xr-x 2 routinator routinator 4.0K Aug  2 15:48 .
drwxr-xr-x 3 routinator routinator 4.0K Jul 13 13:38 ..
-rw-r--r-- 1 routinator routinator    0 Aug  2 15:48 E9213DD4021CE06AB92F4683859FB848C0A265FD.mft

TheKigen avatar Aug 18 '22 14:08 TheKigen

Right, so it is empty. I’ll double check if that can actually happen without Routinator also crashing.

In the mean time, just wiping the the cache should fix your issue.

partim avatar Aug 18 '22 15:08 partim

Yes, wiping the cache appears to have fixed it. Its no longer crashing and has been stable for the past 3 hours. Just curious how it got that 0 byte file, since I imagine it can happen again.

TheKigen avatar Aug 18 '22 17:08 TheKigen

So I figured out why this happened now that I looked further into the logs when this occurred. This appears to have been due to a unexpected shutdown of the host node that this VM was on.

Aug  2 15:48:23 edge-dal01 routinator[729]: rsync://rpki-repo.registro.br/repo/DhUX35ZMUCbRhYuECLQuQiyiqCX4z9LvRirJw3bBUwyX/0/8A2D68F59DA79C1167B96E60DB2A8AC7F897F91B.mft: failed to validate
Aug  2 15:48:23 edge-dal01 routinator[729]: rsync://rpki-repo.registro.br/repo/DhUX35ZMUCbRhYuECLQuQiyiqCX4z9LvRirJw3bBUwyX/0/8A2D68F59DA79C1167B96E60DB2A8AC7F897F91B.mft: No valid manifest found.
Aug  2 15:48:23 edge-dal01 routinator[729]: CA for rsync://rpki-repo.registro.br/repo/DhUX35ZMUCbRhYuECLQuQiyiqCX4z9LvRirJw3bBUwyX/0/ rejected, resources marked as unsafe:
Aug  2 15:48:23 edge-dal01 routinator[729]:    45.186.80.0/22
Aug  2 15:48:23 edge-dal01 routinator[729]:    2804:6440::/32
Aug  2 15:48:23 edge-dal01 routinator[729]:    AS269416
Aug  2 15:48:23 edge-dal01 routAug  2 18:46:03 edge-dal01 systemd-modules-load[407]: Inserted module 'msr'
root@edge-dal01:~# zcat /var/log/syslog.2.gz | grep "Aug  2 18:46"
Aug  2 15:48:23 edge-dal01 routAug  2 18:46:03 edge-dal01 systemd-modules-load[407]: Inserted module 'msr'
Aug  2 18:46:03 edge-dal01 systemd-modules-load[407]: Inserted module 'ipmi_devintf'
Aug  2 18:46:03 edge-dal01 systemd[1]: Finished Load Kernel Modules.
Aug  2 18:46:03 edge-dal01 systemd[1]: Finished Remount Root and Kernel File Systems.
Aug  2 18:46:03 edge-dal01 systemd[1]: Activating swap /swapfile...
Aug  2 18:46:03 edge-dal01 systemd[1]: Mounting FUSE Control File System...
Aug  2 18:46:03 edge-dal01 systemd[1]: Mounting Kernel Configuration File System...
Aug  2 18:46:03 edge-dal01 systemd[1]: Starting Device-Mapper Multipath Device Controller...
Aug  2 18:46:03 edge-dal01 systemd[1]: Starting Flush Journal to Persistent Storage...
Aug  2 18:46:03 edge-dal01 systemd[1]: Condition check resulted in Platform Persistent Storage Archival being skipped.

I guess the unexpected shutdown occurred right when Routinator was just opening that file for writing. But hopefully a code change can occur to dump such invalid files instead of crashing.

TheKigen avatar Aug 18 '22 18:08 TheKigen

That sounds like a reasonable change. I’ll add the issue to the milestone for the next release.

Again, thank you for reporting!

partim avatar Aug 19 '22 09:08 partim