Optimize the nameMap of ZipFile
For most regular zip files, all of their entries will have different names.
The current implementation of ZipFile maps each name to a LinkedList in the nameMap in order to deal with the case of duplicate entry names. Normally (64-bit platform, compressed oops enabled), these LinkedLists occupy an additional 56 bytes of memory.
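For context, the existing scheme can be pictured like this (a simplified, stdlib-only sketch of the pattern described above, not the actual ZipFile code; the names and the use of String as a stand-in for entries are illustrative):

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

public class NameMapBefore {
    // Simplified stand-in for the current scheme: every distinct name gets
    // its own LinkedList, even though almost all names are unique.
    static final Map<String, LinkedList<String>> nameMap = new HashMap<>();

    static void put(String name, String entry) {
        // computeIfAbsent allocates a LinkedList (and each add() allocates a
        // LinkedList$Node), which is pure overhead when a name never repeats.
        nameMap.computeIfAbsent(name, k -> new LinkedList<>()).add(entry);
    }

    public static void main(String[] args) {
        put("a.txt", "entry-1");
        put("b.txt", "entry-2");
        put("a.txt", "entry-3"); // the rare duplicate-name case
        if (nameMap.get("a.txt").size() != 2) throw new AssertionError();
        if (nameMap.get("b.txt").size() != 1) throw new AssertionError();
        System.out.println("ok");
    }
}
```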
This PR optimizes the implementation of ZipFile so that:
- for most zip files, memory is saved;
- for most zip files, filling in the nameMap is faster when opening a ZipFile;
- `ZipFile::getEntry(String)` is faster.
For extremely rare situations, this PR may have slight negative effects, but I think this should not be a problem.
Codecov Report
Attention: 8 lines in your changes are missing coverage. Please review.
Comparison is base (fa1b16d) 80.55% compared to head (d0e5365) 80.53%. Report is 20 commits behind head on master.
| Files | Patch % | Lines |
|---|---|---|
| ...apache/commons/compress/archivers/zip/ZipFile.java | 65.21% | 4 Missing and 4 partials :warning: |
Additional details and impacted files
```diff
@@             Coverage Diff              @@
##             master     #378      +/-   ##
============================================
- Coverage     80.55%   80.53%    -0.03%
- Complexity     6764     6766        +2
============================================
  Files           342      342
  Lines         24860    24874       +14
  Branches       4017     4022        +5
============================================
+ Hits          20027    20033        +6
- Misses         3298     3302        +4
- Partials       1535     1539        +4
```
No tests. Code coverage is down. No proof of performance change one way or another. Non-trivial additional complexity. So -1 as it stands.
"For extremely rare situations, this PR may have slight negative effects,"
Which are?
"But I think this should not be a problem."
Because?
@garydgregory
No tests. Code coverage is down. No proof of performance change one way or another. Non-trivial additional complexity.
The core intention of this PR is to reduce memory allocation.
As mentioned earlier, we currently allocate a LinkedList for each entry in ZipFile, resulting in an additional 32 bytes (LinkedList) + 24 bytes (LinkedList$Node) of memory overhead per entry. This PR no longer allocates them.
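One common way to avoid the per-entry list is to store the entry itself as the map value and only promote the value to a list when a duplicate name actually appears. The following is a hedged, stdlib-only sketch of that general technique (not necessarily the exact code in this PR; names and the String stand-in for entries are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LazyNameMap {
    // The value is either a single entry (the common case, no extra
    // allocation) or a List of entries (only when a duplicate name appears).
    private final Map<String, Object> nameMap = new HashMap<>();

    @SuppressWarnings("unchecked")
    void put(String name, String entry) {
        final Object old = nameMap.get(name);
        if (old == null) {
            nameMap.put(name, entry);          // common case: no list at all
        } else if (old instanceof List) {
            ((List<String>) old).add(entry);   // third or later duplicate
        } else {
            final List<String> list = new ArrayList<>();
            list.add((String) old);            // promote on first duplicate
            list.add(entry);
            nameMap.put(name, list);
        }
    }

    @SuppressWarnings("unchecked")
    List<String> get(String name) {
        final Object value = nameMap.get(name);
        if (value == null) {
            return List.of();
        }
        if (value instanceof List) {
            return (List<String>) value;
        }
        return List.of((String) value);
    }

    public static void main(String[] args) {
        final LazyNameMap map = new LazyNameMap();
        map.put("a.txt", "e1");
        map.put("b.txt", "e2");
        map.put("a.txt", "e3");
        if (!map.get("a.txt").equals(List.of("e1", "e3"))) throw new AssertionError();
        if (!map.get("b.txt").equals(List.of("e2"))) throw new AssertionError();
        if (!map.get("missing").isEmpty()) throw new AssertionError();
        System.out.println("ok");
    }
}
```

The trade-off is an `instanceof` check on each lookup, which is cheap compared to the saved allocations when duplicates are rare.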
As for the side effect of a performance improvement, here are the results of a simple JMH benchmark:
Benchmark source code
```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Collections;

import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipFile;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class Compress {
    private static final Path jarPath = Path.of(/* path of commons-compress-1.23.0.jar */);
    private static final ZipFile globalFile;
    private static final String[] entryNames;

    static {
        try {
            globalFile = new ZipFile(jarPath);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        entryNames = Collections.list(globalFile.getEntries())
                .stream()
                .map(ZipArchiveEntry::getName)
                .toArray(String[]::new);
    }

    @Benchmark
    public void testOpen() throws IOException {
        try (ZipFile file = new ZipFile(jarPath)) {
        }
    }

    @Benchmark
    public void testGetEntry(Blackhole blackhole) throws IOException {
        for (String entryName : entryNames) {
            blackhole.consume(globalFile.getEntry(entryName));
        }
    }

    @Benchmark
    public void testGetEntries(Blackhole blackhole) throws IOException {
        for (String entryName : entryNames) {
            blackhole.consume(globalFile.getEntries(entryName));
        }
    }

    @Benchmark
    public void testGetEntriesInPhysicalOrder(Blackhole blackhole) throws IOException {
        for (String entryName : entryNames) {
            blackhole.consume(globalFile.getEntriesInPhysicalOrder(entryName));
        }
    }
}
```
commons-compress 1.23.0:
```
Benchmark                               Mode  Cnt       Score      Error  Units
Compress.testGetEntries                thrpt    3  657266.611 ± 1690.346  ops/s
Compress.testGetEntriesInPhysicalOrder thrpt    3  121514.308 ± 1001.179  ops/s
Compress.testGetEntry                  thrpt    3  504127.618 ± 1988.320  ops/s
Compress.testOpen                      thrpt    3     663.604 ±   24.088  ops/s
```
This PR:
```
Benchmark                               Mode  Cnt       Score       Error  Units
Compress.testGetEntries                thrpt    3  465237.072 ± 37283.323  ops/s
Compress.testGetEntriesInPhysicalOrder thrpt    3  459856.583 ± 13165.000  ops/s
Compress.testGetEntry                  thrpt    3  661743.860 ±  1055.489  ops/s
Compress.testOpen                      thrpt    3     665.040 ±    23.971  ops/s
```
For the same jar file (commons-compress-1.23.0.jar):
- Little change in the performance of opening zip files;
- `getEntry` is 30% faster;
- `getEntriesInPhysicalOrder(String)` is 280% faster.
The only regression is `getEntries(String)`, which is 30% slower because each call now needs to create a new singleton list. However, the old implementation was unsafe: it returned the internal mutable list, which the user could modify simply by casting, thus breaking the ZipFile. The implementation in this PR fixes that issue.
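That unsafety can be demonstrated with a small stdlib-only sketch (illustrative names, not the actual ZipFile code; the fix shown here uses `Collections.unmodifiableList` as one possible defensive wrapper):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LeakyListDemo {
    static final Map<String, ArrayList<String>> nameMap = new HashMap<>();

    // Old style: hands the caller the internal mutable list.
    static Iterable<String> getEntriesUnsafe(String name) {
        return nameMap.get(name);
    }

    // Defensive style: wrap before returning so callers cannot mutate internals.
    static List<String> getEntriesSafe(String name) {
        return Collections.unmodifiableList(nameMap.get(name));
    }

    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        nameMap.put("a.txt", new ArrayList<>(List.of("e1")));

        // A caller can cast the "read-only" Iterable back to List and
        // silently corrupt the map's internal state.
        ((List<String>) getEntriesUnsafe("a.txt")).clear();
        if (!nameMap.get("a.txt").isEmpty()) throw new AssertionError();

        nameMap.put("a.txt", new ArrayList<>(List.of("e1")));
        boolean rejected = false;
        try {
            getEntriesSafe("a.txt").clear();
        } catch (UnsupportedOperationException e) {
            rejected = true; // mutation is rejected; internal state is intact
        }
        if (!rejected || nameMap.get("a.txt").size() != 1) throw new AssertionError();
        System.out.println("ok");
    }
}
```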
"For extremely rare situations, this PR may have slight negative effects,"
Which are?
"But I think this should not be a problem."
Because?
Because this situation is extremely rare.
As far as I know, most tools (7zip, the jar command, etc.) don't create zip files containing entries with duplicate names. Such unusual zip files can generally only be created using some low-level API, so these files themselves are very rare; I can't even easily find examples in the real world.
And, even for this unusual file, it's debatable whether this PR has a positive or negative effect.
If the majority of entries do not share the same name, this PR still has a positive effect.
And if there are a large number of entries with the same name, the effect of this PR may differ under different conditions. But I don't think it's a case worth investigating, as I can't imagine a real-world use case for it.
If you know of a real use case for entries with duplicate names, I'll look into it further. But for now, I don't see much point in worrying about this situation, since regular zip files are much, much more numerous.
Can someone review this PR again?
@garydgregory Can you review this PR again?
@garydgregory Hey, can you take a look at this PR?
Looks OK to me, though claims of performance improvements really need measurements and benchmarks.
Looks OK to me, though claims of performance improvements really need measurements and benchmarks.
Here are some JMH benchmark results: https://github.com/apache/commons-compress/pull/378#issuecomment-1501863937
But where is the benchmark? You should make it part of this PR so that anyone can run it.
But where is the benchmark?
The source code is in the "Benchmark source code" folding block of that comment.
Hello all. I am -1 until tests are provided that show the new data structure is correctly populated. A package-private getter can be added and queried by new unit tests, for example. Otherwise, it's just too easy to undo this change in the future without visibility. If it were indeed undone later, forcing a maintainer to remove tests would make it obvious that this was an intended feature. Edge cases should IMO account for the no-duplicates, some-duplicates, and all-duplicates use cases.
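As a sketch of what those edge-case tests could look like (stdlib-only, using a trivial stand-in for the real data structure; actual tests would query ZipFile via the package-private getter suggested above, and all names here are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NameMapEdgeCases {
    // Trivial stand-in: group entries by name, preserving insertion order,
    // the behavior the real nameMap tests would need to verify.
    static Map<String, List<String>> fill(String... names) {
        final Map<String, List<String>> map = new HashMap<>();
        for (final String name : names) {
            map.computeIfAbsent(name, k -> new ArrayList<>()).add(name);
        }
        return map;
    }

    public static void main(String[] args) {
        // No duplicates: one entry per name.
        final Map<String, List<String>> none = fill("a", "b", "c");
        if (none.size() != 3 || none.get("a").size() != 1) throw new AssertionError();

        // Some duplicates: "a" appears twice, "b" once.
        final Map<String, List<String>> some = fill("a", "b", "a");
        if (some.size() != 2 || some.get("a").size() != 2) throw new AssertionError();

        // All duplicates: a single name repeated, all occurrences retained.
        final Map<String, List<String>> all = fill("a", "a", "a");
        if (all.size() != 1 || all.get("a").size() != 3) throw new AssertionError();

        System.out.println("ok");
    }
}
```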