libelektra
testscr_check_kdb_internal_suite: very slow at least on a7
The testscr_check_kdb_internal_suite takes more than 30 minutes on a7.
Start 181: testscr_check_kdb_internal_suite
218/262 Test #181: testscr_check_kdb_internal_suite .............. Passed 1804.16 sec
We need to investigate why this takes so long and if things can be improved.
Thank you for creating the issue! As said in #3510 we should probably reduce the number of tests for normal builds and only have them run for special builds (but then really all of them). Ideally these builds can also be executed from PRs (e.g. if someone modifies a storage plugin).
Maybe these special builds also include fuzzing methods #185, which is also only needed from time to time (or when relevant changes are done).
Just an idea: maybe libeatmydata speeds up these tests as Elektra uses fsync excessively.
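The idea can be sketched as follows. The `eatmydata` wrapper (shipped with libeatmydata) LD_PRELOADs a shim that turns fsync and friends into no-ops; the ctest invocation below is an assumption about how the suite is launched, and the sketch only prints the command instead of executing it:

```shell
# Hedged sketch: choose eatmydata as a runner prefix when available so
# fsync/fdatasync become no-ops for the suite. The ctest pattern is an
# assumption; the command is printed, not executed, in this sketch.
if command -v eatmydata >/dev/null 2>&1; then
    runner="eatmydata"
else
    runner=""   # shim not installed: tests would run with real fsync
fi
echo "${runner:+$runner }ctest -R testscr_check_kdb_internal_suite"
```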
Another idea: we could use ramdisk for parts or all of the tests, at least on slow machines.
Yes, this might be the easier solution!
We need to investigate why this takes so long
This is a result of #2595. Now we finally test all storage plugins. Maybe fsync is the root cause of this issue, but it might also be the case that some plugin is very slow.
Currently the tests are executed with these plugins:
dini
dump
ini
kconfig
ni
quickdump
simpleini
tcl
toml
xerces
xmltool
yajl
(based on the testdata in tests/shell/shell)
The tcl plugin only runs the basic tests, ini runs basic string umlauts, dump and quickdump run basic string umlauts binary naming, and everything else runs basic string.
The script also runs everything twice. Once for the user namespace and once for system.
On my machine the test takes only ~63s (Ryzen 3700X, 32GB RAM, NVMe SSD), but the distribution looks like this:
dump -> 26s
ini -> 9s
quickdump -> 26s
tcl doesn't work on my machine, and the remaining tests take less than 1s combined. The times were obtained by replacing "$KDB" test "$ROOT" $TESTS with time "$KDB" test "$ROOT" $TESTS in the test script.
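For reference, that measurement change can be reproduced like this (with `echo` standing in for the real kdb binary so the snippet is self-contained; the variable values are illustrative only):

```shell
# Sketch of the per-plugin timing change; `echo` stands in for the
# real kdb binary and the variable values are made up for illustration.
KDB=echo
ROOT="user/tests/storage"
TESTS="basic string"
# before: "$KDB" test "$ROOT" $TESTS
# after: the same call wrapped in `time`, printing wall-clock per run
time "$KDB" test "$ROOT" $TESTS
```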
The fact that dump and quickdump have nearly identical runtimes, combined with looking at the code for kdb test, indicates to me that the problem is kdb test itself. Specifically the umlauts, binary and naming tests. Each of these calls kdbGet and kdbSet more than 250 times, every time just with a single key. It also destroys all KeySets, Keys and KDB for each of these 250+ tests.
kdb test is just poorly written and will always take forever to run. I didn't profile it, but I highly doubt that fsync is the issue; I suspect it's simply the overhead of kdbGet and kdbSet themselves.
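The access pattern criticized above can be sketched as a loop. `kdb` is stubbed with a no-op function so the sketch runs anywhere; substituting the real CLI would pay a full open/get/set/close cycle, including any fsync, on every iteration:

```shell
# Illustrative only: one full get/set round trip per single-key KeySet,
# repeated 250+ times, as kdb test does. `kdb` is stubbed here so the
# sketch is self-contained; the key names are hypothetical.
kdb() { :; }   # stub standing in for the real kdb binary

count=0
for i in $(seq 1 256); do
    kdb set "user/tests/key$i" "value"   # separate kdbSet per key
    kdb get "user/tests/key$i"           # separate kdbGet per key
    count=$((count + 1))
done
echo "$count single-key round trips"
```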
PS: the meta test is even worse, which is why it is never used for any of the plugins.
Maybe disabling fsync or using a RAM disk will do something, but I don't think we can avoid changing kdb test itself.
The script also runs everything twice. Once for the user namespace and once for system.
Thanks for the hint, this is actually useless, it does not test the storage plugins. Furthermore, it would be a very easy fix.
Every time just with a single key.
This is on purpose to test such KeySets. Further tests with bigger KeySets can be added (contributions welcome!) but this should not replace the simple tests.
It also destroys all KeySets, Keys and KDB for each of these 250+ tests.
Reusing KDB might make the tests faster but then they would be different tests. Tests that test the behavior of sequences of get/set are in tests/kdb.
Btw. using tmpfs might be positive for all tests, testscr_check_kdb_internal_suite is not the only test that takes some time because of I/O load.
Maybe Robert will look at these things.
This is on purpose to test such KeySets.
What exactly is the difference between testing a KeySet with just a single Key a, then with Key b, etc., compared to testing with a KeySet containing a, b, c, etc. all at once? In the end, either the plugin could deal with all the key names and store everything properly, or it didn't.
I can't think of a reason why a plugin would be able to store a KeySet containing a and b, but not a KeySet with just a.
What exactly is the goal of kdb test anyway? It seems to be to verify that a mountpoint configuration works.
But the goal of testscr_check_kdb_internal_suite seems to be to check that a storage plugin can be used as such. This could also be achieved by calling the plugin directly and doing some checks around that. The tests would also be a lot faster, if they didn't use kdbGet/kdbSet but instead called the plugin directly.
In particular, I don't see the point of running the kdbGet/kdbSet sequence for all 256 possible 1 byte key base names for every plugin. It would be much quicker to validate kdbGet/kdbSet and the plugins separately instead of revalidating kdbGet/kdbSet for every plugin.
Reusing KDB might make the tests faster but then they would be different tests. Tests that test the behavior of sequences of get/set are in tests/kdb.
Yes, reusing the KDB handle would be a different test. But at least the KeySet and Key instances could be reused. I'm not sure this would change a lot in practice, but at least in theory it saves a lot of malloc and free.
Btw. using tmpfs might be positive for all tests, testscr_check_kdb_internal_suite is not the only test that takes some time because of I/O load.
That is certainly true. But it definitely needs further investigation. There are a few testmod_* tests for storage plugins that do file I/O and are still quite fast. I don't think any of them use fsync, but they also don't call kdbGet/kdbSet, and that also has a big influence.
Another thing I just noticed, umlauts may be a subset of naming. Therefore calling both for the same plugin is just a waste of time.
What exactly is the goal of kdb test anyway?
To test round-trip and other storage properties of a given mountpoint.
What exactly is the difference between testing a KeySet with just a single Key a then with Key b, etc. compared to testing with a KeySet containing a, b, c, etc. all at once?
The difference is for storage plugin developers: you get a very small example of when your plugin fails and not a huge test case where you need to find out yourself what actually the problem is.
It is possible that there is some overlap between the tests, but I would not waste my time (nor would I ask anyone to waste their time) trying to eliminate it.
But at least the KeySet and Key instances could be reused
As long as we do not know for sure that this is a bottleneck (and I doubt it), such optimizations would be a waste of development time.
The difference is for storage plugin developers: you get a very small example of when your plugin fails and not a huge test case where you need to find out yourself what actually the problem is.
Ok, but that should not be the goal of testscr_check_kdb_internal_suite. If that is the goal, then testscr_check_kdb_internal_suite should not be run all the time in our CI.
The tests we run in our CI are there to ensure PRs don't introduce regressions and adhere to some baseline requirements. The CI is not a debugging tool for developers.
We can provide a separate test suite that plugin developers can run themselves to get details about what is going wrong.
Exactly, hence the idea to extract the tests that can always run, and run the other tests only for releases (or during development). This was my first comment on this issue.
The tests are running much faster on a7 since the ext4 formatted SSD was added:
Start 178: testscr_check_kdb_internal_suite
215/258 Test #178: testscr_check_kdb_internal_suite .............. Passed 94.78 sec
I'll change this to low priority until somebody can fix the real underlying problem as @kodebach pointed out.
The testscr_check_kdb_internal_suite test alone now takes about 40 minutes on i7.
Using tmpfs for the config and cache directories (due to heavy I/O) reduces the test time from 40 minutes to about 6 seconds on i7.
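For anyone reproducing this, such a tmpfs mount can be set up roughly like so; the mount point path and size are placeholders, not Elektra's actual directories:

```
# /etc/fstab sketch: keep the test config/cache directory in RAM so
# fsync never waits on a physical disk (path and size are placeholders)
tmpfs  /path/to/build/tests/config  tmpfs  size=64m,mode=0755  0  0
```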
Thank you very much! Can we close this issue or is separation of some tests still useful?
I mark this issue stale as it did not have any activity for one year. I'll close it in two weeks if no further activity occurs. If you want it to be alive again, ping the issue by writing a message here or create a new issue with the remainder of this issue. Thank you for your contributions :sparkling_heart:
I closed this issue now because it has been inactive for more than one year. If I closed it by mistake, please do not hesitate to reopen it or create a new issue with the remainder of this issue. Thank you for your contributions :sparkling_heart: