libzim
libzim copied to clipboard
Improve ZIM checksum handling
Hello everyone,
I believe the way the Creator generates the checksum of a newly created ZIM is highly inefficient and a bit suboptimal.
To my understanding, the Creator currently does the following:
- Create and write the ZIM file content
- Write headers
- Read the whole ZIM file again and generate the checksum, then write it to the ZIM file.
This process - more specifically step 3 - has two problems:
- It is highly inefficient to re-read the whole ZIM file. It's nearly twice the I/O necessary. In my usecase (big ZIM files and all content for the ZIM is already fully prepared) this nearly doubles the time it takes to create a ZIM file.
- Using this checksum generation method for file corruption detection is suboptimal. As the checksum is generated from the data already written to the disk, any data corruption occurring during the initial writing of the ZIM file to the disk can not be detected by the checksum.
Ideally, it would be possible to generate/update the checksum whenever a chunk is written to the disk. This would fix the problems mentioned above, However, it seems that the data (more specifically, the header) is written in a non-linear fashion (e.g. some fields seem to be only written after the rest of the ZIM was written.
Unfortunately, I'm unable to provide a better way for generating the ZIM file. Perhaps someone with some more in-depth knowledge of the ZIM file structure has an idea on how to avoid the re-reading of the data.
AFAIK it works like you describe. The checksum is not there to detect problems during the file creation but during a copy of this file during a further state. We have zimcheck to detect problems in a ZIM file and this is systematically run in the Zimfarm. There might be an other way to compute the checksum (1) from the data before they are written (2) allowing easier update of the value... but I have no clue how this would look like... like you.
A possible approach could be to first calculate/update the checksum of the clusters when they are written, then update the checksum with the header/... data afterwards instead of calculating the checksum of the whole file from start to finish. However, this approach has severe drawbacks: it would likely break backwards-compatibility (at least when checking the checksum) and the checksum would no longer be equivalent to the md5 checksum of the ZIM minus the final 16 bytes.
It seems that it is not the IO which in cause here. On my test case, the checksum takes around 16s. If I change the code to compute a dummy checksum without doing any IO, I go to 14s. But md5sum on the final file takes a bit more than 3s.
Using the openssl's implementation instead of the libzim one takes also 3s, but make libzim depends of another dependency.
@mgautierfr We could make openssl an optional dependence and keep this version of the code as fauktback?
Maybe it sould be an opportunity to use a super fast hash, like http://cyan4973.github.io/xxHash/.
On the other side, if at some point we need to sign content, see #40, linking to open SSL will be necessary.
Small update: I still think having a faster checksum algorithm would be a nice to have improvement. Maybe adding a dedicated ZIM header would help to handle this clearly (in case this is not md5sum anymore).That said, this does not belong to any roadmap so far. But, if a volunteer wants to take it, glad to see it implemented.
willing to do :>
@juuz0 Any news on this? Or this is harder than you though first?
@kelson42 Sorry for the delay, I got mixed up with college and other PRs, I'll send a PR for this by tomorrow definitely :+1: (have a draft, but need to fix a few things)
@juuz0 Have you been able to make progress or you are stuck somehow?
@kelson42 I'll be working on it.
@kelson42 Is there a specific reason why the checksum creation differs in terms of size of pieces in these two?
creator.cpp
struct zim_MD5_CTX md5ctx;
unsigned char batch_read[1024+1];
lseek(out_fd, 0, SEEK_SET);
zim_MD5Init(&md5ctx);
while (true) {
auto r = read(out_fd, batch_read, 1024);
if (r == -1) {
throw std::runtime_error(std::strerror(errno));
}
if (r == 0)
break;
batch_read[r] = 0;
zim_MD5Update(&md5ctx, batch_read, r);
}
unsigned char digest[16];
zim_MD5Final(digest, &md5ctx);
fileimpl.cpp
struct zim_MD5_CTX md5ctx;
zim_MD5Init(&md5ctx);
offset_type checksumPos = header.getChecksumPos();
offset_type currentPos = 0;
for(auto part = zimFile->begin();
part != zimFile->end();
part++) {
std::ifstream stream(part->second->filename(), std::ios_base::in|std::ios_base::binary);
char ch;
for(/*NOTHING*/ ; currentPos < checksumPos && stream.get(ch).good(); currentPos++) {
zim_MD5Update(&md5ctx, reinterpret_cast<const uint8_t*>(&ch), 1);
}
if (stream.bad()) {
perror("error while reading file");
return false;
}
if (currentPos == checksumPos) {
break;
}
}
if (currentPos != checksumPos) {
return false;
}
unsigned char chksumCalc[16];
auto chksumFile = zimReader->get_buffer(offset_t(header.getChecksumPos()), zsize_t(16));
zim_MD5Final(chksumCalc, &md5ctx);
@kelson42 Is there a specific reason why the checksum creation differs in terms of size of pieces in these two?
creator.cpp
struct zim_MD5_CTX md5ctx; unsigned char batch_read[1024+1]; lseek(out_fd, 0, SEEK_SET); zim_MD5Init(&md5ctx); while (true) { auto r = read(out_fd, batch_read, 1024); if (r == -1) { throw std::runtime_error(std::strerror(errno)); } if (r == 0) break; batch_read[r] = 0; zim_MD5Update(&md5ctx, batch_read, r); } unsigned char digest[16]; zim_MD5Final(digest, &md5ctx);fileimpl.cpp
struct zim_MD5_CTX md5ctx; zim_MD5Init(&md5ctx); offset_type checksumPos = header.getChecksumPos(); offset_type currentPos = 0; for(auto part = zimFile->begin(); part != zimFile->end(); part++) { std::ifstream stream(part->second->filename(), std::ios_base::in|std::ios_base::binary); char ch; for(/*NOTHING*/ ; currentPos < checksumPos && stream.get(ch).good(); currentPos++) { zim_MD5Update(&md5ctx, reinterpret_cast<const uint8_t*>(&ch), 1); } if (stream.bad()) { perror("error while reading file"); return false; } if (currentPos == checksumPos) { break; } } if (currentPos != checksumPos) { return false; } unsigned char chksumCalc[16]; auto chksumFile = zimReader->get_buffer(offset_t(header.getChecksumPos()), zsize_t(16)); zim_MD5Final(chksumCalc, &md5ctx);
@mgautierfr ?
No real reason. Probably an errant developer (me) picking a value.