
err="strconv.ParseInt: parsing \"\": invalid syntax" when scraping metrics

Open simondeziel opened this issue 2 years ago • 14 comments

The following error messages were noticed on different machines, all running Ubuntu 20.04 with LXD's snap. The last one occurred with snap version 4.23 rev 22652 and the following lxc info:

$ lxc info
config:
  core.https_address: 0.0.0.0:8443
  core.metrics_address: 0.0.0.0:9101
  storage.backups_volume: default/backups
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - 192.168.1.8:8443
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
[...]
    -----END CERTIFICATE-----
  certificate_fingerprint: 7be47923cf301ead3a3f0938530aade2486d5cecfd1052959faceb7addb74db2
  driver: lxc
  driver_version: 4.0.12
  firewall: xtables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.13.0-35-generic
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "20.04"
  project: default
  server: lxd
  server_clustered: false
  server_name: mars.enclume.ca
  server_pid: 3684702
  server_version: "4.23"
  storage: zfs
  storage_version: 2.0.6-1ubuntu2
  storage_supported_drivers:
  - name: ceph
    version: 15.2.14
    remote: true
  - name: btrfs
    version: 5.4.1
    remote: false
  - name: cephfs
    version: 15.2.14
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.07(2) (2019-11-30) / 1.02.167 (2019-11-30) / 4.45.0
    remote: false
  - name: zfs
    version: 2.0.6-1ubuntu2
    remote: false

The error messages:

Feb 14 15:11:38 c2d lxd.daemon[1356]: t=2022-02-14T15:11:38+0000 lvl=warn msg="Failed to get total number of processes" err="strconv.ParseInt: parsing \"\": invalid syntax"
Feb 14 21:14:12 ocelot lxd.daemon[1588]: t=2022-02-14T21:14:12+0000 lvl=warn msg="Failed to get swap usage" err="strconv.ParseInt: parsing \"\": invalid syntax"
Feb 22 22:04:52 xeon lxd.daemon[1818]: t=2022-02-22T22:04:52+0000 lvl=warn msg="Failed to get swap usage" err="strconv.ParseInt: parsing \"\": invalid syntax"
Mar 14 09:03:34 mars lxd.daemon[3684702]: t=2022-03-14T09:03:34-0400 lvl=warn msg="Failed to get memory usage" err="strconv.ParseInt: parsing \"\": invalid syntax" instance=vpn instanceType=container project=default
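
For reference, this is what Go's strconv returns when the value read from a single-value cgroup file (presumably memory.current, memory.swap.current or pids.current on cgroup2; the exact files are an assumption here) comes back empty. A minimal reproduction:

package main

import (
	"fmt"
	"strconv"
	"strings"
)

func main() {
	// Simulate an empty read from a cgroup value file, trimmed and
	// parsed the way a metrics collector would parse it.
	raw := ""
	_, err := strconv.ParseInt(strings.TrimSpace(raw), 10, 64)
	fmt.Println(err) // strconv.ParseInt: parsing "": invalid syntax
}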

Looking at the log around the last occurrence, there doesn't seem to be anything interesting near the time of the error:

root@mars:~# grep -5 -F 'strconv.ParseInt:' /var/snap/lxd/common/lxd/logs/lxd.log
t=2022-03-14T07:34:43-0400 lvl=info msg="Done pruning expired instance backups" 
t=2022-03-14T08:34:43-0400 lvl=info msg="Updating images" 
t=2022-03-14T08:34:43-0400 lvl=info msg="Pruning expired instance backups" 
t=2022-03-14T08:34:43-0400 lvl=info msg="Done updating images" 
t=2022-03-14T08:34:43-0400 lvl=info msg="Done pruning expired instance backups" 
t=2022-03-14T09:03:34-0400 lvl=warn msg="Failed to get memory usage" err="strconv.ParseInt: parsing \"\": invalid syntax" instance=vpn instanceType=container project=default
t=2022-03-14T09:34:43-0400 lvl=info msg="Pruning expired instance backups" 
t=2022-03-14T09:34:43-0400 lvl=info msg="Updating images" 
t=2022-03-14T09:34:43-0400 lvl=info msg="Done pruning expired instance backups" 
t=2022-03-14T09:34:43-0400 lvl=info msg="Done updating images" 
t=2022-03-14T09:44:45-0400 lvl=info msg="Creating scheduled container snapshots"

simondeziel avatar Mar 15 '22 14:03 simondeziel

@simondeziel once this is in the snap and you start seeing the new errors please can you update it here? Thanks

tomponline avatar Mar 18 '22 16:03 tomponline

@tomponline, I'm running logcheck, so I get to see all those weird and infrequent errors. Yes, I'll report back when I see them.

simondeziel avatar Mar 18 '22 16:03 simondeziel

@simondeziel do you see any more of these errors now? Thanks

tomponline avatar Apr 07 '22 08:04 tomponline

@tomponline no new occurrence since then. I'll close the issue and will reopen it if/when needed.

simondeziel avatar Apr 07 '22 12:04 simondeziel

@tomponline I just got this one:

Apr 26 22:05:04 mars lxd.daemon[339520]: time="2022-04-26T22:05:04-04:00" level=warning msg="Failed to get swap usage" err="Failed parsing \"\": strconv.ParseInt: parsing \"\": invalid syntax" instance=vpn instanceType=container project=default

The host in question runs with:

$ snap list lxd
Name  Version        Rev    Tracking    Publisher   Notes
lxd   5.0.0-b0287c1  22923  5.0/stable  canonical✓  -

simondeziel avatar Apr 27 '22 15:04 simondeziel

Is this occurring when the instance is just starting/stopping/restarting?

tomponline avatar Apr 28 '22 07:04 tomponline

No:

$ lxc exec mars:vpn -- uptime
 12:45:33 up 7 days,  8:12,  0 users,  load average: 0.13, 0.03, 0.01

It just seems random :/

simondeziel avatar Apr 28 '22 12:04 simondeziel

I just got another occurrence, but this time the container was being stopped:

May  4 17:04:36 jupiter lxd.daemon[192550]: time="2022-05-04T17:04:36-04:00" level=warning msg="Failed to get total number of processes" err="Failed parsing \"\": strconv.ParseInt: parsing \"\": invalid syntax" instance=ganymede instanceType=container project=default
May  4 17:04:39 jupiter kernel: [1185396.579522] audit: type=1400 audit(1651698279.469:119): apparmor="STATUS" operation="profile_remove" profile="unconfined" name="lxd-ganymede_</var/snap/lxd/common/lxd>" pid=2304952 comm="apparmor_parser"

The above was with:

$ snap list lxd
Name  Version        Rev    Tracking    Publisher   Notes
lxd   5.0.0-b0287c1  22923  5.0/stable  canonical✓  -

simondeziel avatar May 04 '22 21:05 simondeziel

I rebooted the hosts c2d and xeon yesterday (once each) and got those:

May  5 05:03:56 c2d lxd.daemon[1888]: time="2022-05-05T05:03:56Z" level=warning msg="Failed to get swap usage" err="Failed parsing \"\": strconv.ParseInt: parsing \"\": invalid syntax" instance=weechat instanceType=container project=default
May  5 09:04:22 xeon lxd.daemon[1867]: time="2022-05-05T09:04:22Z" level=warning msg="Failed to get swap usage" err="Failed parsing \"\": strconv.ParseInt: parsing \"\": invalid syntax" instance=log instanceType=container project=default
May  5 10:04:37 xeon lxd.daemon[1867]: time="2022-05-05T10:04:37Z" level=warning msg="Failed to get memory usage" err="Failed parsing \"\": strconv.ParseInt: parsing \"\": invalid syntax" instance=log instanceType=container project=default

Considering the ~1h difference between the 2 errors for the instance=log, I'm not sure there is a direct connection with the reboot.

simondeziel avatar May 05 '22 12:05 simondeziel

Do you see the issue only when the instance isn't running?

tomponline avatar Jul 19 '22 12:07 tomponline

No, those errors also happen in "steady state", distant from any lifecycle event. Here's another batch since last time:

May  6 09:00:37 xeon lxd.daemon[1867]: time="2022-05-06T09:00:37Z" level=warning msg="Failed to get swap usage" err="Failed parsing \"\": strconv.ParseInt: parsing \"\": invalid syntax" instance=log instanceType=container project=default
May 23 00:48:43 ocelot lxd.daemon[1487]: time="2022-05-23T00:48:43Z" level=warning msg="Failed to get swap usage" err="Failed parsing \"\": strconv.ParseInt: parsing \"\": invalid syntax" instance=pm instanceType=container project=default
May 25 03:00:07 xeon lxd.daemon[2034]: time="2022-05-25T03:00:07Z" level=warning msg="Failed to get memory usage" err="Failed parsing \"\": strconv.ParseInt: parsing \"\": invalid syntax" instance=log instanceType=container project=default
May 31 14:19:40 c2d lxd.daemon[1491]: time="2022-05-31T14:19:40Z" level=warning msg="Failed to get swap usage" err="Failed parsing \"\": strconv.ParseInt: parsing \"\": invalid syntax" instance=rproxy instanceType=container project=default
Jun 19 15:02:07 xeon lxd.daemon[1978]: time="2022-06-19T15:02:07Z" level=warning msg="Failed to get swap usage" err="Failed parsing \"\": strconv.ParseInt: parsing \"\": invalid syntax" instance=log instanceType=container project=default
Jul  3 23:56:40 c2d lxd.daemon[1271]: time="2022-07-03T23:56:40Z" level=warning msg="Failed to get total number of processes" err="Failed parsing \"\": strconv.ParseInt: parsing \"\": invalid syntax" instance=gw-home instanceType=container project=default

Since the beginning, it has always been a failure to get the swap usage, memory usage, or total process count.

simondeziel avatar Jul 19 '22 13:07 simondeziel

And to be clear, the instances are in stopped or started state?

tomponline avatar Jul 19 '22 13:07 tomponline

Do you see the issue only when the instance isn't running?

A stopped instance stops being reported in the metrics. And to be clear, those errors did not occur when the instances were stopping.

simondeziel avatar Jul 19 '22 13:07 simondeziel

They are always running, sometimes recently so (shortly after a host reboot) but often for a long while.

simondeziel avatar Jul 19 '22 13:07 simondeziel

I got a bunch of weirder errors:

Jan  9 13:04:23 xeon lxd.daemon[1307]: time="2023-01-09T13:04:23Z" level=warning msg="Failed to get disk stats" err="Failed parsing io.stat (\"8:16 8:0\"): input does not match format" instance=puppet instanceType=container project=default
Jan  9 13:04:23 xeon lxd.daemon[1307]: time="2023-01-09T13:04:23Z" level=warning msg="Failed to get disk stats" err="Failed parsing io.stat (\"8:0 8:16 rbytes=258048 wbytes=0 rios=9 wios=0 dbytes=0 dios=0\"): input does not match format" instance=squid instanceType=container project=default
Jan  9 13:04:23 xeon lxd.daemon[1307]: time="2023-01-09T13:04:23Z" level=warning msg="Failed to get disk stats" err="Failed parsing io.stat (\"8:16 7:2 rbytes=3072 wbytes=0 rios=1 wios=0 dbytes=0 dios=0\"): input does not match format" instance=apt instanceType=container project=default
root@xeon:~# grep . /sys/fs/cgroup/lxc.payload.{apt,puppet,squid}/io.stat 
/sys/fs/cgroup/lxc.payload.apt/io.stat:8:16 7:2 rbytes=3072 wbytes=0 rios=1 wios=0 dbytes=0 dios=0
/sys/fs/cgroup/lxc.payload.puppet/io.stat:8:16 8:0 
/sys/fs/cgroup/lxc.payload.squid/io.stat:8:0 8:16 rbytes=258048 wbytes=0 rios=9 wios=0 dbytes=0 dios=0
/sys/fs/cgroup/lxc.payload.squid/io.stat:7:1 rbytes=60416 wbytes=0 rios=2 wios=0 dbytes=0 dios=0

It feels like the cgroup data is plain broken and there's nothing LXD can do about it. Sounds like a kernel bug.

Stopping and starting those 3 containers makes their io.stat files empty, the same as other containers not showing any issue.

simondeziel avatar Jan 09 '23 18:01 simondeziel

Yeah, this looks weird. LXD expects a single MAJ:MIN. In theory, we could handle this by just using the last MAJ:MIN in the line. However, I don't know how reliable this would be.
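
Something like this, perhaps (a rough sketch of the idea, not LXD's actual parser; parseIOStatLine is a hypothetical helper):

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseIOStatLine tolerates io.stat lines carrying several MAJ:MIN
// tokens by keeping only the last one. A line with a device but no
// counters yields empty stats instead of an error.
func parseIOStatLine(line string) (string, map[string]uint64, error) {
	dev := ""
	stats := map[string]uint64{}
	for _, field := range strings.Fields(line) {
		k, v, found := strings.Cut(field, "=")
		if !found {
			dev = field // a MAJ:MIN token; remember the last one seen
			continue
		}
		n, err := strconv.ParseUint(v, 10, 64)
		if err != nil {
			return "", nil, fmt.Errorf("failed parsing io.stat %q: %w", line, err)
		}
		stats[k] = n
	}
	if dev == "" {
		return "", nil, fmt.Errorf("no device in io.stat line %q", line)
	}
	return dev, stats, nil
}

func main() {
	fmt.Println(parseIOStatLine("8:16 8:0"))
	fmt.Println(parseIOStatLine("8:0 8:16 rbytes=258048 wbytes=0 rios=9 wios=0 dbytes=0 dios=0"))
}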

I have the same on my machine:

$ cat /sys/fs/cgroup/io.stat
...
8:0 259:0 rbytes=11228777984 wbytes=59497067520 rios=357821 wios=2186721 dbytes=0 dios=0
...
$ lsblk
...
NAME                MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda                   8:0    1     0B  0 disk
nvme0n1             259:0    0 238.5G  0 disk
...

I believe they should just omit the 8:0 entirely. Or perhaps they just forgot to add a newline.

monstermunchkin avatar Jan 10 '23 07:01 monstermunchkin

Yeah, I also have weird stuff like multiple detached loop devices showing up on a single line in the host's io.stat:

sdeziel@xeon:~$ cat /sys/fs/cgroup/io.stat 
8:32 rbytes=999678391296 wbytes=198034948096 rios=3100795 wios=1051435 dbytes=0 dios=0
8:16 rbytes=9083418112 wbytes=72376709632 rios=388623 wios=11650553 dbytes=53315555840 dios=301674
8:0 rbytes=7480011776 wbytes=72376660480 rios=371580 wios=11547682 dbytes=53315555840 dios=301672
7:7 7:6 7:5 7:4 rbytes=28672 wbytes=0 rios=22 wios=0 dbytes=0 dios=0
7:3 rbytes=518539264 wbytes=0 rios=13405 wios=0 dbytes=0 dios=0
7:2 rbytes=1642633216 wbytes=0 rios=36241 wios=0 dbytes=0 dios=0
7:1 rbytes=158687232 wbytes=0 rios=4728 wios=0 dbytes=0 dios=0
7:0 rbytes=25744384 wbytes=0 rios=1091 wios=0 dbytes=0 dios=0
sdeziel@xeon:~$ lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0    7:0    0  49.8M  1 loop /snap/snapd/17950
loop1    7:1    0  63.3M  1 loop /snap/core20/1778
loop2    7:2    0   103M  1 loop /snap/lxd/23541
loop3    7:3    0  49.6M  1 loop 
sda      8:0    0 232.9G  0 disk 
├─sda1   8:1    0     1M  0 part 
├─sda2   8:2    0    24G  0 part 
├─sda3   8:3    0     2G  0 part [SWAP]
└─sda4   8:4    0   128G  0 part 
sdb      8:16   0 232.9G  0 disk 
├─sdb1   8:17   0     1M  0 part 
├─sdb2   8:18   0    24G  0 part /
├─sdb3   8:19   0     2G  0 part [SWAP]
└─sdb4   8:20   0   128G  0 part 
sdc      8:32   0   2.7T  0 disk 
├─sdc1   8:33   0   2.7T  0 part 
└─sdc9   8:41   0     8M  0 part 
sdeziel@xeon:~$ losetup -a
/dev/loop1: []: (/var/lib/snapd/snaps/core20_1778.snap)
/dev/loop2: []: (/var/lib/snapd/snaps/lxd_23541.snap)
/dev/loop0: []: (/var/lib/snapd/snaps/snapd_17950.snap)
/dev/loop3: []: (/var/lib/snapd/snaps/snapd_17883.snap (deleted))

Another weird thing: inside most of my containers, that io.stat file is completely empty, but not always. It even changes upon container restart.

@mihalicyn, is this an area of the kernel you know, by any chance?

simondeziel avatar Jan 10 '23 16:01 simondeziel

Hm, yep, it looks a little bit broken since this commit https://lore.kernel.org/all/[email protected]/

And users noticed this: https://lore.kernel.org/all/[email protected]/

From the kernel code, it follows that it's fully safe to just take the last device MAJ:MIN from the line.

It's already fixed by another patch https://github.com/torvalds/linux/commit/3607849df47822151b05df440759e2dc70160755

which allows output like this:

    253:10
    253:5 rbytes=0 wbytes=0 rios=0 wios=1 dbytes=0 dios=0

instead of

    253:10 253:5 rbytes=0 wbytes=0 rios=0 wios=1 dbytes=0 dios=0

I think we can try to handle all these options :-)

mihalicyn avatar Jan 10 '23 22:01 mihalicyn

@mihalicyn it really pleases me that you've found it to be fixed upstream, many thanks! I'll check if the Canonical kernels currently in -proposed have the patch; if not, I'll ask for a backport/inclusion.

Thanks for looking into this!!

simondeziel avatar Jan 10 '23 22:01 simondeziel

@mihalicyn it really pleases me that you've found it to be fixed upstream, many thanks! I'll check if the Canonical kernels currently in -proposed have the patch; if not, I'll ask for a backport/inclusion.

Thanks for looking into this!!

Always glad to help! ;-)

mihalicyn avatar Jan 11 '23 09:01 mihalicyn

@mihalicyn, https://github.com/torvalds/linux/commit/3607849df47822151b05df440759e2dc70160755 wasn't CC'ed to stable@vger.kernel.org and I couldn't find it in upstream's 5.15 changelogs, so apparently nobody picked it up. I think it'd be best to send it to stable@ for upstream inclusion rather than fixing it in Canonical kernels only. What do you think?

simondeziel avatar Jan 11 '23 16:01 simondeziel

cc'ing @Blub author of https://github.com/torvalds/linux/commit/3607849df47822151b05df440759e2dc70160755

Yep, I think it's worth adding to -stable kernels. But I'm afraid we'll need some workaround anyway, because the process of taking a patch to stable and then waiting for it to be picked up downstream is not fast. BTW, my patch for shifts still hasn't landed in Ubuntu kernels, and I did it almost 2 months ago :D

mihalicyn avatar Jan 11 '23 16:01 mihalicyn

Also see here https://discuss.linuxcontainers.org/t/lxc-query-not-showing-disk-stats-for-all-containers/16440

tomponline avatar Feb 27 '23 08:02 tomponline

@gabrielmougard I just checked my logs for the year and here's what I got:

root@log:~# grep -hF "lxd" /var/log/archives/2023/2023-*-syslog | grep -vF data/ | sed 's/.*level=//' | grep -F ' msg="Failed to get ' | sort | uniq -c | sort -nr
   5655 warning msg="Failed to get disk stats" err="Failed parsing io.stat (\"8:0\"): unexpected EOF" instance=apt instanceType=container project=default
   4186 warning msg="Failed to get disk stats" err="Failed parsing io.stat (\"8:16 7:1 rbytes=55296 wbytes=0 rios=1 wios=0 dbytes=0 dios=0\"): input does not match format" instance=log instanceType=container project=default
   3236 warning msg="Failed to get disk stats" err="Failed parsing io.stat (\"8:0 8:16 rbytes=258048 wbytes=0 rios=9 wios=0 dbytes=0 dios=0\"): input does not match format" instance=squid instanceType=container project=default
   3212 warning msg="Failed to get disk stats" err="Failed parsing io.stat (\"8:16\"): unexpected EOF" instance=metrics instanceType=container project=default
   1996 warning msg="Failed to get disk stats" err="Failed extracting io.stat \"\" (from \"8:0\")" instance=metrics instanceType=container project=default
   1889 warning msg="Failed to get disk stats" err="Failed parsing io.stat (\"8:16 8:0\"): input does not match format" instance=puppet instanceType=container project=default
   1759 warning msg="Failed to get disk stats" err="Failed parsing io.stat (\"8:16 7:2 rbytes=3072 wbytes=0 rios=1 wios=0 dbytes=0 dios=0\"): input does not match format" instance=apt instanceType=container project=default
   1467 warning msg="Failed to get disk stats" err="Failed parsing io.stat (\"8:16\"): unexpected EOF" instance=log instanceType=container project=default
   1439 warning msg="Failed to get disk stats" err="Failed parsing io.stat (\"8:0\"): unexpected EOF" instance=puppet instanceType=container project=default
   1431 warning msg="Failed to get disk stats" err="Failed parsing io.stat (\"8:0 8:16\"): input does not match format" instance=metrics instanceType=container project=default
    553 warning msg="Failed to get disk stats" err="Failed parsing io.stat (\"8:0 8:16 rbytes=8192 wbytes=0 rios=2 wios=0 dbytes=0 dios=0\"): input does not match format" instance=apt instanceType=container project=default
    482 warning msg="Failed to get disk stats" err="Failed extracting io.stat \"8:16\" (from \"8:0 8:16 rbytes=258048 wbytes=0 rios=9 wios=0 dbytes=0 dios=0\")" instance=metrics instanceType=container project=default
    269 warning msg="Failed to get disk stats" err="Failed parsing io.stat (\"8:0\"): unexpected EOF" instance=log instanceType=container project=default
      2 warning msg="Failed to get total number of processes" err="Failed parsing \"\": strconv.ParseInt: parsing \"\": invalid syntax" instance=gw-home instanceType=container project=default
      1 warning msg="Failed to get total number of processes" err="Failed parsing \"\": strconv.ParseInt: parsing \"\": invalid syntax" instance=git instanceType=container project=default
      1 warning msg="Failed to get memory usage" err="Failed parsing \"\": strconv.ParseInt: parsing \"\": invalid syntax" instance=squid instanceType=container project=default
      1 warning msg="Failed to get memory usage" err="Failed parsing \"\": strconv.ParseInt: parsing \"\": invalid syntax" instance=gw-home instanceType=container project=default

The good news is that those newer messages include the content of the file that couldn't be parsed successfully :)

So the bulk of it relates to the io.stat kernel issue you are trying to work around, but there are some other failures around process count and memory usage too.

My environment is 22.04 with the HWE kernel.

simondeziel avatar May 02 '23 07:05 simondeziel