NAS-130619 / 25.10 / Use truenas_pylibzfs in pool.dataset.query
This replaces zfs.dataset.query with truenas_pylibzfs (a new C module) in pool.dataset.query. It was done so that the new library can be used as a drop-in, without any major API changes. We will write a new query endpoint that is vastly simpler, even more efficient, and much more ergonomic to use, but this change addresses 95% of the problem that we currently have with our process pool. This is probably the most-called API endpoint, internally and externally, so the performance gains can't be overstated. I did a very synthetic comparison between the two endpoints, and the results are pretty substantial.
Old API Performance:
- Times: [0.228, 0.229, 0.216, 0.224, 0.211] seconds
- Average: 0.222 seconds
- Range: 0.211 - 0.229 seconds (0.018s spread)
- Standard Deviation: ~0.007 seconds
New API Performance:
- Times: [0.063, 0.062, 0.062, 0.062, 0.062] seconds
- Average: 0.062 seconds
- Range: 0.062 - 0.063 seconds (0.001s spread)
- Standard Deviation: ~0.0005 seconds
Speed Improvement:
- 3.58x faster (0.222s → 0.062s)
- 72.1% reduction in response time
Consistency Improvement:
- Much more consistent response times (±0.0005s vs ±0.007s)
- 14x more stable performance
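The summary figures above can be rechecked from the raw timings. This is a minimal sketch using only the Python standard library; the `summarize` helper is hypothetical, and it assumes the quoted standard deviation is a population standard deviation. Because it works from unrounded means, its output may differ slightly from figures derived from the rounded averages.

```python
from statistics import mean, pstdev

# Timings (seconds) copied from the results above.
old_times = [0.228, 0.229, 0.216, 0.224, 0.211]
new_times = [0.063, 0.062, 0.062, 0.062, 0.062]


def summarize(times):
    """Hypothetical helper: mean, min, max, and population std deviation."""
    return {
        "mean": mean(times),
        "min": min(times),
        "max": max(times),
        "stdev": pstdev(times),  # assumption: population, not sample, stdev
    }


old = summarize(old_times)
new = summarize(new_times)

# Ratio of average response times and the relative reduction in percent.
speedup = old["mean"] / new["mean"]
reduction_pct = (1 - new["mean"] / old["mean"]) * 100

print(f"old: mean={old['mean']:.3f}s stdev={old['stdev']:.4f}s")
print(f"new: mean={new['mean']:.3f}s stdev={new['stdev']:.4f}s")
print(f"speedup: {speedup:.2f}x, reduction: {reduction_pct:.1f}%")
```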
The performance characteristics are also confirmed by the full CI test run. Usually the full suite runs in about 3 hours. The run with these changes took a little over 2 hours.
Finally, there are very minor differences between the old and the new implementation that should be noted. I actually consider them "cosmetic improvements".
- Size/Storage Formatting Changes:
  - Old API: Uses shorter format like '7.16G', '140K', '0B'
  - New API: Uses more explicit format like '7.16 GiB', '140 KiB', '0 bytes'
- Affected Fields:
  - available: '7.16G' → '7.16 GiB'
  - used: '2.04G' → '2.04 GiB'
  - usedbychildren: '2.04G' → '2.04 GiB'
  - usedbydataset: '140K' → '140 KiB'
  - usedbysnapshots: '0B' → '0 bytes'
  - usedbyrefreservation: '0B' → '0 bytes'
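For tests or callers that still compare against the old strings, the formatting change can be illustrated with a small conversion. This is only a sketch: `old_to_new_size` is a hypothetical helper, not part of the middleware or truenas_pylibzfs, and the unit mapping is extrapolated from the three examples listed above ('G', 'K', 'B'). The other units, and how non-zero byte counts are rendered, are assumptions.

```python
import re

# Assumed mapping, extrapolated from the examples in this PR
# ('G' -> 'GiB', 'K' -> 'KiB', 'B' -> 'bytes'); other units are a guess.
_UNIT_MAP = {
    "B": "bytes",
    "K": "KiB",
    "M": "MiB",
    "G": "GiB",
    "T": "TiB",
    "P": "PiB",
}

# Old-style values: a number immediately followed by a one-letter unit.
_OLD_STYLE = re.compile(r"^(\d+(?:\.\d+)?)([BKMGTP])$")


def old_to_new_size(value: str) -> str:
    """Hypothetical helper: rewrite an old-style size string in the new style.

    Strings that do not look like the old format are returned unchanged.
    """
    match = _OLD_STYLE.match(value)
    if match is None:
        return value
    number, unit = match.groups()
    return f"{number} {_UNIT_MAP[unit]}"


for sample in ("7.16G", "140K", "0B", "7.16 GiB"):
    print(sample, "->", old_to_new_size(sample))
```

Code that parses these fields numerically would likely be better served by a raw byte value than by either display string, if the result exposes one.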
Jira URL: https://ixsystems.atlassian.net/browse/NAS-130619
This could make a large performance improvement to the Incus Storage Driver.
After the iscsi create/delete, pool.dataset.query is the next biggest bottleneck.
This PR has been merged and conversations have been locked. If you would like to discuss more about this issue please use our forums or raise a Jira ticket.