[Bug]: Lost all historical data upon update

Open ced1check opened this issue 3 years ago • 25 comments

Bug description

2 servers updated to the 1.35.0-41 nightly and lost all historical data, while another remained on 1.35.0-29. This already happened when updating from 1.34.* to 1.35, and it happened again now. The graphs also look very odd compared to the old version!? netdata_graphs

Is there something I can do to prevent data loss in the future? I'm keeping 7 days of history and plan to keep an entire month, but updates are regularly losing all data.

Expected behavior

Data should not be lost upon updating netdata software

Steps to reproduce

  1. Update from 1.35.0-29 to 1.35.0-41

Screenshots

No response

Error Logs

No response

Desktop

OS: [e.g. iOS] Browser [e.g. chrome, safari] Browser Version [e.g. 22]

Additional context

No response

ced1check avatar Jun 17 '22 05:06 ced1check

FWIW, 2 servers upgraded to -41 automatically (not sure how) and this morning all data were gone. I then upgraded 2 other servers from -29 to -41 (using the Webmin UI) and one lost all data, the other did not!?

Is there a way to stay on stable versions and not constantly receive every nightly update!? I have a server still running 1.32.0-32 and I'm quite happy with it, and Webmin tells me there are no updates!? How come?

ced1check avatar Jun 17 '22 06:06 ced1check

@cpipilas @hugovalente-pm Do you guys think this is for the agent team or the visualizations team?

dimko avatar Jun 20 '22 09:06 dimko

Is there a way to stay on stable versions and not constantly receive every nightly update!? I have a server still running 1.32.0-32 and I'm quite happy with it, and Webmin tells me there are no updates!? How come?

@ced1check sorry for the issues caused, let's see if we can get input from the team to troubleshoot and understand what happened. Regarding your question above: when you install Netdata, you can pass the following parameter to ensure you don't get nightly updates.

image

If you want to change this later, please check https://learn.netdata.cloud/docs/agent/packaging/installer/update#control-automatic-updates

@dimko I would suggest starting the investigation on the Agent side, since this seems related to updating and losing history.

hugovalente-pm avatar Jun 20 '22 09:06 hugovalente-pm

The --stable-channel or --no-updates do not seem to do anything in my config!?

Here are the repositories that are set up on the servers that keep upgrading nightly:

impish/main | https://packagecloud.io/netdata/netdata-edge/ubuntu/
impish/main | https://packagecloud.io/netdata/netdata-repoconfig/ubuntu/

On the servers that stay on 1.32.1-32, it is the same URL but the hirsute/main repo.

Shouldn't I change one of those to stay on the stable channel? I don't want to have to disable the netdata repositories, but I'd like to avoid updating so frequently.

ced1check avatar Jun 21 '22 06:06 ced1check

thanks for sharing that, will also move this ticket to netdata/netdata repo since it is related to the Agent specifically and someone from the team can probably help on this

hugovalente-pm avatar Jun 21 '22 08:06 hugovalente-pm

this could be linked to [Bug]: general config is frequently lost upon updating #13178

got the buildinfo from that other ticket @ced1check please rectify anything if needed

Version: netdata v1.35.0-49-nightly
Configure options:  '--build=x86_64-linux-gnu' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--disable-option-checking' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libdir=/usr/lib' '--libexecdir=/usr/libexec' '--with-user=netdata' '--with-math' '--with-zlib' '--with-webdir=/var/lib/netdata/www' '--disable-dependency-tracking' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -ffile-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -ffile-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security'
Install type: binpkg-deb
    Binary architecture: x86_64
    Packaging distro:
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES
    ACLK Next Generation:       YES
    ACLK-NG New Cloud Protocol: YES
    ACLK Legacy:                NO
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         YES
Libraries:
    protobuf:                YES (system)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    YES
    EBPF:                    YES
    IPMI:                    YES
    NFACCT:                  YES
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: YES

Running Ubuntu Linux 21.10, Linux 5.13.0-51-generic on x86_64

hugovalente-pm avatar Jun 21 '22 08:06 hugovalente-pm

Most likely not related to https://github.com/netdata/netdata/issues/13178 (that particular issue should not be affecting anything but configuration, and it’s an external issue in webmin).

Regarding updates, stable versions, and such:

  • Given that you’re using our native packages, you can switch the system from nightlies to stable updates by manually running apt-get install netdata-repo (this should ask about uninstalling netdata-repo-edge), updating the repo metadata with apt-get update, and then upgrading the netdata package (for example with apt-get install netdata). Manually re-running the installer with --stable-channel should also switch installation channels, but does not appear to work reliably in all cases at the moment.
  • You mention running on Ubuntu 21.10. We will be ending official support for Ubuntu 21.10 concurrently with it going EOL upstream on 2022-07-31 (just over a month from now). Past that point in time, you’ll still be able to install, but won’t see any newer version of Netdata there than what was available on that day unless you convert it to using static builds (see https://learn.netdata.cloud/docs/agent/packaging/installer/reinstall#changing-the-install-type-of-an-existing-installation for instructions on how to change the installation type).
  • You also indirectly mention running an Ubuntu 21.04 system (the system running 1.32.1-32). We have not officially supported Ubuntu 21.04 since it went EOL upstream on 2022-01-20. If you want to use a newer version of Netdata on that system, you’ll need to convert it to using a static build. There is also an open bug regarding handling of such platforms on installation that may be of interest to you: https://github.com/netdata/netdata/issues/12931
  • On an installed system, toggling automatic updates can be done as outlined at https://learn.netdata.cloud/docs/agent/packaging/installer/update#control-automatic-updates. Note that what this changes is not in the Netdata configuration itself, but in whatever mechanism your system uses for running scheduled tasks (probably cron or some cron compatibility layer on top of systemd timers).
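
The channel switch described in the first bullet can be sketched as a small dry-run script. The package names netdata-repo and netdata-repo-edge come from the comment above. The DRY_RUN variable and the run and switch_to_stable helpers are hypothetical, and the final apt-get install netdata step is an assumption about the intended upgrade command.

```shell
#!/bin/sh
# Sketch only: switch a DEB-based Netdata install from nightly to stable packages.
# DRY_RUN, run and switch_to_stable are hypothetical names used for illustration.

DRY_RUN="${DRY_RUN:-1}"   # default: only print what would be run

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

switch_to_stable() {
    # Installing netdata-repo should ask about removing netdata-repo-edge.
    run sudo apt-get install netdata-repo
    # Refresh package metadata so the stable repository is picked up.
    run sudo apt-get update
    # Assumed final step: install/upgrade the agent from the configured repository.
    run sudo apt-get install netdata
}
```

Calling switch_to_stable with the default settings only prints the three commands, so they can be reviewed first. Setting DRY_RUN=0 before calling it would actually execute them.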

Ferroin avatar Jun 23 '22 11:06 Ferroin

Thanks for the info on updates, I suppose removing the edge repo should do it and keep me on stable versions, less frequent updates.

Yes I'm using Ubuntu 21.10, and 21.04 was upgraded to that version a while back, however the repo was not updated. I'll be upgrading the systems to 22.04 sooner or later so it should be no problem, right?

ced1check avatar Jun 23 '22 11:06 ced1check

Yep, Ubuntu 22.04 should be no problem at all.

It sounds like you may have to manually update the repos though (unfortunately, unlike on RPM systems, there’s no way to template the repo URLs based on the distro version on DEB systems, so we can’t have them trivially track releases like we do on Fedora or Alma). Manually updating the repository configuration with the correct codename should be sufficient here, but it may be more reliable to do a clean reinstall as outlined at https://learn.netdata.cloud/docs/agent/packaging/installer/reinstall#performing-a-clean-reinstall (note that if you want to take that route, you will need to manually preserve any config/data, check the section right below that on switching install types for a list of what you need to copy for that).
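
Manually updating the codename could look roughly like the sketch below. It assumes the repository file uses the one-line "deb URL suite component" format. The function name set_repo_codename is hypothetical, and the file location is deliberately left as a parameter because it differs between systems.

```shell
#!/bin/sh
# Sketch only: point an APT source file at a different Ubuntu codename.
# Usage: set_repo_codename FILE OLD_CODENAME NEW_CODENAME
set_repo_codename() {
    file="$1"; old="$2"; new="$3"
    # Keep a backup next to the original before changing anything.
    cp "$file" "$file.bak" || return 1
    # Replace the codename only where it stands alone between spaces
    # (the suite field), so URLs containing the same word are untouched.
    sed "s/ $old / $new /" "$file.bak" > "$file"
}
```

For a 21.10 to 22.04 upgrade this would be called as set_repo_codename /path/to/netdata-repo-file.list impish jammy (the path is a placeholder), followed by apt-get update. The edit usually needs root.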

Ferroin avatar Jun 23 '22 18:06 Ferroin

Quick update based on internal discussion:

My initial comment that this was probably unrelated to https://github.com/netdata/netdata/issues/13178 may be wrong. If non-default settings for the dbengine (or history, if you’re using something other than dbengine for storage) are in use, then the issue in #13178 would result in them being reset to the defaults, which in turn would wipe any excess history when it happens.
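
One way to check whether the configuration is being reset is to keep a copy of it before an update and compare afterwards. This is a minimal sketch; the function name config_drift is hypothetical, and /etc/netdata/netdata.conf is assumed as the config path based on the --sysconfdir=/etc build option shown earlier.

```shell
#!/bin/sh
# Sketch only: report whether a saved config copy differs from the current file.
# Usage: config_drift SAVED_COPY CURRENT_FILE
config_drift() {
    if cmp -s "$1" "$2"; then
        echo "unchanged"
    else
        # Also reached when either file is missing or unreadable.
        echo "changed"
        diff -u "$1" "$2"
        return 1
    fi
}
```

For example, copy /etc/netdata/netdata.conf to a safe location before the update, then run config_drift on the saved copy and the live file afterwards. A "changed" result with different retention settings would support the theory above.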

We’re going to look further into this internally and try to confirm one way or the other.

Ferroin avatar Jun 29 '22 09:06 Ferroin

When this happens I lose all historical data, not just the default. And it may happen without losing custom configuration.

ced1check avatar Jun 29 '22 09:06 ced1check

all historical data

Is it really all or MySQL only? Do you still have the issue @ced1check?

ilyam8 avatar Aug 16 '22 10:08 ilyam8

Yes, all data is lost. I recently had this issue upgrading from 1.35 to 1.36 (stable releases).

EDIT: It seems it affected mysql and apache (at least), however cpu/network/disk data were not lost!

ced1check avatar Aug 16 '22 11:08 ced1check

Is it on upgrade only or happens on restart too?

ilyam8 avatar Aug 16 '22 11:08 ilyam8

I've never seen this issue after a restart, and I did many. I restarted Netdata each time I had to restart MySQL after a sync. I've never seen this issue after a reboot either (maintenance every other week for upgraded kernels on 6 servers).

ced1check avatar Aug 16 '22 11:08 ced1check

I 🤷 then. My assumption was that, since we had both Python and Go versions of the MySQL collector, on restart (or upgrade) one of them (in random order) starts collecting data. We have removed the Python versions from the repo, but that only affects clean installs; you still have them.

If you hover on a chart date you can see what collector produces metrics, e.g. (go.d/mysql)

Screenshot 2022-08-16 at 14 32 50

Can you check your Apache and MySQL charts?

ilyam8 avatar Aug 16 '22 11:08 ilyam8

I tried hovering over the chart date, however nothing ever popped up and the date always shows 'latest: ...' like this: image

It would seem you access those graphs from somewhere else? Maybe locally, which I can't. Any other way I can give you that information? Maybe from the server, launching a debug session manually?

ced1check avatar Aug 16 '22 13:08 ced1check

Yes, I used the local dashboard. You can check it from the single node view (not overview) by clicking on the Info icon: Screenshot 2022-08-16 at 16 15 16

ilyam8 avatar Aug 16 '22 13:08 ilyam8

Here's apache's info: image

And MariaDB's: image

ced1check avatar Aug 17 '22 07:08 ced1check

Ok. If you lose metrics only for web log apache/MySQL:

sudo rm /usr/libexec/netdata/python.d/mysql.chart.py
sudo rm /usr/libexec/netdata/python.d/web_log.chart.py
sudo rm /usr/libexec/netdata/python.d/apache.chart.py

If my assumption is correct this should fix the problem (I think there is an open issue for that, I will try to find it). If not - we can't reproduce the problem and there are no similar reports.

ilyam8 avatar Aug 17 '22 12:08 ilyam8

Actually I don't have those files. Could they have been removed during 1.36 update?

ced1check avatar Aug 18 '22 09:08 ced1check

Update from 1.35.0-29 to 1.35.0-41

Ah, I see, somehow I missed that you installed v1.35.0+. We removed those collectors in v1.35.0.

ilyam8 avatar Aug 18 '22 10:08 ilyam8

Well, I couldn't find this ticket so I opened a new one :( https://github.com/netdata/netdata/issues/13548

Now I realize I was on netdata-cloud so I couldn't see this one on netdata!

The last update was from 1.35 stable to 1.36 and I lost everything on 6 servers. I suppose this was expected because of those files?

ced1check avatar Aug 18 '22 12:08 ced1check

No, it wasn't because of the files. It was just my assumption that proved to be wrong.


Maybe locally, which I can't.

Why? If you have the local dashboard disabled, can you enable it just for testing? What is your setup: parent/child or standalone instances? What configuration parameters did you change? Any non-default config parameters? What memory mode do you use (can be found in http://<IP>:19999/api/v1/info)?
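
The memory mode can also be pulled out of that endpoint from a shell. This is a rough sketch that assumes the JSON response contains a "memory-mode" string key; the helper name memory_mode is hypothetical.

```shell
#!/bin/sh
# Sketch only: print the value of the "memory-mode" key from JSON read on stdin.
# The key name is an assumption about the /api/v1/info response format.
memory_mode() {
    # Join the input into one line, then capture the quoted value after the key.
    tr -d '\n' | sed -n 's/.*"memory-mode"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

Typical use would be curl -s "http://<IP>:19999/api/v1/info" | memory_mode, which should print something like dbengine if the key is present and nothing otherwise.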

ilyam8 avatar Aug 18 '22 12:08 ilyam8

When this happens I lose all historical data, not just the default. And it may happen without losing custom configuration.

If we determine that you lose the data but your configuration stays the same, then the only other way this could happen is if the dbengine files are lost (are you using dbengine?).

Can you please run sudo ls -l /var/cache/netdata/dbengine on an affected server?
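
To see whether the dbengine files themselves disappear during an update, the directory can be summarised before and after and the two results compared. A minimal sketch follows; dir_summary is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: summarise a directory tree so two runs can be compared.
# Prints "<number of regular files> <total size in KiB>".
dir_summary() {
    dir="$1"
    count="$(find "$dir" -type f | wc -l | tr -d ' ')"
    size="$(du -sk "$dir" | cut -f1)"
    echo "$count $size"
}
```

Running dir_summary /var/cache/netdata/dbengine (probably as root) right before and right after an update gives two lines to compare. A sharp drop in file count or size would point at the files being removed rather than at a configuration reset.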

MrZammler avatar Aug 22 '22 12:08 MrZammler

Closing due to no feedback and no other similar reports.

ilyam8 avatar Dec 14 '22 10:12 ilyam8