netflow insight aggregator dies and cannot be restarted, reason: DatabaseError: database disk image is malformed

Open JLT032 opened this issue 3 weeks ago • 3 comments

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

  • [v] I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
  • [v] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue

Describe the bug

No 'breaking changes' are documented on the OPNsense release pages.

The Insight aggregator stops and cannot be started.

This has been going on for quite some time. The reported fix is to 'delete all netflow data', which does not sit well with me.

This looks like a recurring bug? Quite a few posts on this board indicate that this error has happened in the past.

flowd_aggregate.py: flowd aggregate died with message Traceback (most recent call last):
  File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 162, in run
    aggregate_flowd(self.config, do_vacuum)
  File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 80, in aggregate_flowd
    stream_agg_object.add(copy.copy(flow_record))
  File "/usr/local/opnsense/scripts/netflow/lib/aggregates/source.py", line 69, in add
    super(FlowSourceAddrTotals, self).add(flow)
  File "/usr/local/opnsense/scripts/netflow/lib/aggregates/__init__.py", line 187, in add
    self._update_cur.execute(self._insert_stmt, flow)
sqlite3.DatabaseError: database disk image is malformed

/var/netflow shows no broken files or lock files left behind
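A directory listing will not reveal internal damage, so each database can be asked to check itself. The following is a minimal sketch, not part of OPNsense: the function names are hypothetical and the `*.sqlite` file pattern is an assumption about how the files under /var/netflow are named. It opens each file read-only and runs SQLite's `PRAGMA integrity_check`.

```python
import sqlite3
from pathlib import Path


def check_sqlite_file(path):
    """Return the messages from PRAGMA integrity_check (["ok"] if healthy)."""
    # Open read-only so that the check itself cannot modify the file.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
        try:
            return [row[0] for row in conn.execute("PRAGMA integrity_check")]
        finally:
            conn.close()
    except sqlite3.DatabaseError as exc:
        # Badly damaged files can fail before the pragma returns any rows.
        return [f"error: {exc}"]


def check_directory(directory="/var/netflow", pattern="*.sqlite"):
    """Map each matching file (pattern is an assumption) to its check result."""
    files = sorted(Path(directory).glob(pattern))
    return {str(p): check_sqlite_file(p) for p in files}


if __name__ == "__main__":
    for name, result in check_directory().items():
        print(name, "->", "; ".join(result))
```

Any file whose result is something other than a single `ok` is a candidate for the damaged database. It is safest to run this while the aggregator is stopped, or against a copy of the directory.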

work-around with data loss

cd /usr/local/opnsense/scripts/netflow/
./flush_all.sh all

To Reproduce

Steps to reproduce the behavior:

  1. unknown, assumed to be something broken in the code base
  2. upgrade ?

Expected behavior

The 'insight aggregator' does not stop running. The sqlite3 files under /var/netflow are not reported as "DatabaseError: database disk image is malformed".

Describe alternatives you considered

Restarting the flowd_aggregator service from the command line (did not work).

Screenshots

n/a

Relevant log files

see the message extract above

Additional context

observed since upgrade to 25.10

Environment

Software version used and hardware type if relevant, e.g.:

OPNsense 25.10 (amd64). KVM virtual machine virtio network-card-driver

JLT032 (Dec 07 '25 07:12)

Maybe you have an idea how the database could have been corrupted? If an SQLite file gets corrupted, it is unlikely to recover correctly. This happens especially with UFS and/or power outages / unclean shutdowns.

Cheers, Franco

fichtner (Dec 07 '25 08:12)

It is not unlikely that an unclean shutdown happened: the Proxmox hypervisor keeps running into issues after a few weeks, and clean shutdowns do not always happen.

I would expect detection of a malformed SQLite database to trigger some kind of remediation procedure.
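To make the idea concrete, here is a minimal sketch of one possible remediation, assuming a "quarantine and start fresh" policy. It is not what flowd_aggregate.py does; the function name and the `.corrupt-<timestamp>` suffix are hypothetical. A file that fails `PRAGMA quick_check` is renamed instead of deleted, so the aggregator could recreate an empty database while the damaged one stays available for inspection.

```python
import sqlite3
import time
from pathlib import Path


def quarantine_if_malformed(db_path):
    """Move db_path aside if SQLite reports it as damaged.

    Returns the new path of the quarantined file, or None if the check passed.
    """
    db_path = Path(db_path)
    damaged = False
    try:
        conn = sqlite3.connect(db_path)
        try:
            row = conn.execute("PRAGMA quick_check").fetchone()
            damaged = row is None or row[0] != "ok"
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        # Raised for "database disk image is malformed" and similar errors.
        damaged = True
    if not damaged:
        return None

    # Hypothetical naming scheme: keep the original name, add a suffix.
    target = db_path.with_name(f"{db_path.name}.corrupt-{int(time.time())}")
    db_path.rename(target)
    # Keep any journal/WAL files next to the database they belong to.
    for suffix in ("-journal", "-wal", "-shm"):
        sidecar = db_path.with_name(db_path.name + suffix)
        if sidecar.exists():
            sidecar.rename(target.with_name(target.name + suffix))
    return target
```

Renaming rather than deleting trades disk space for the chance to salvage data later. Only the history in the affected file is set aside, not all netflow data.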

JLT032 (Dec 07 '25 10:12)

I did not try deleting the last record added; could that also fix this issue?

If there is a procedure for doing so that I could test, I'm open to trying it. I kept a copy of the data in /var/netflow and /var/log/netflow for that purpose.
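One thing that could be tested on the copy is a row-by-row salvage. The following is a rough sketch under stated assumptions, not a known OPNsense procedure: the function name is hypothetical, and it only works if the damaged file's schema table is still readable. It copies every row that can still be read into a new database and stops reading a table at the first error.

```python
import sqlite3


def salvage(damaged_path, fresh_path):
    """Copy every readable row from damaged_path into a new file at fresh_path.

    Returns {table_name: rows_copied}. Raises sqlite3.DatabaseError if even
    the schema cannot be read. Indexes are not recreated.
    """
    src = sqlite3.connect(damaged_path)
    dst = sqlite3.connect(fresh_path)  # expected to be a new, empty file
    copied = {}
    try:
        tables = src.execute(
            "SELECT name, sql FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        ).fetchall()
        for name, create_sql in tables:
            dst.execute(create_sql)
            count = 0
            try:
                for row in src.execute(f'SELECT * FROM "{name}"'):
                    marks = ", ".join("?" * len(row))
                    dst.execute(f'INSERT INTO "{name}" VALUES ({marks})', row)
                    count += 1
            except sqlite3.DatabaseError:
                # Rows stored after a damaged page are lost for this table.
                pass
            copied[name] = count
        dst.commit()
    finally:
        src.close()
        dst.close()
    return copied
```

Whether the result is usable depends on where the damage sits. If the malformed page holds the most recently written rows, most of the history may survive; if it is in the schema or an early page, little or nothing will. Newer versions of the sqlite3 command-line shell also offer a `.recover` command, which may be worth trying on the copy if it is available.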

JLT032 (Dec 07 '25 10:12)