libzim Improve ZIM checksum handling

Hello everyone,

I believe the way the Creator generates the checksum of a newly created ZIM is highly inefficient and a bit suboptimal.

To my understanding, the Creator currently does the following:

Create and write the ZIM file content
Write headers
Read the whole ZIM file again and generate the checksum, then write it to the ZIM file.

This process - more specifically step 3 - has two problems:

It is highly inefficient to re-read the whole ZIM file. It's nearly twice the I/O necessary. In my usecase (big ZIM files and all content for the ZIM is already fully prepared) this nearly doubles the time it takes to create a ZIM file.
Using this checksum generation method for file corruption detection is suboptimal. As the checksum is generated from the data already written to the disk, any data corruption occurring during the initial writing of the ZIM file to the disk can not be detected by the checksum.

Ideally, it would be possible to generate/update the checksum whenever a chunk is written to the disk. This would fix the problems mentioned above, However, it seems that the data (more specifically, the header) is written in a non-linear fashion (e.g. some fields seem to be only written after the rest of the ZIM was written.

Unfortunately, I'm unable to provide a better way for generating the ZIM file. Perhaps someone with some more in-depth knowledge of the ZIM file structure has an idea on how to avoid the re-reading of the data.

Aug 22 '21 16:08 IMayBeABitShy

AFAIK it works like you describe. The checksum is not there to detect problems during the file creation but during a copy of this file during a further state. We have zimcheck to detect problems in a ZIM file and this is systematically run in the Zimfarm. There might be an other way to compute the checksum (1) from the data before they are written (2) allowing easier update of the value... but I have no clue how this would look like... like you.

Aug 22 '21 16:08 kelson42

A possible approach could be to first calculate/update the checksum of the clusters when they are written, then update the checksum with the header/... data afterwards instead of calculating the checksum of the whole file from start to finish. However, this approach has severe drawbacks: it would likely break backwards-compatibility (at least when checking the checksum) and the checksum would no longer be equivalent to the md5 checksum of the ZIM minus the final 16 bytes.

Aug 22 '21 16:08 IMayBeABitShy

It seems that it is not the IO which in cause here. On my test case, the checksum takes around 16s. If I change the code to compute a dummy checksum without doing any IO, I go to 14s. But md5sum on the final file takes a bit more than 3s.

Using the openssl's implementation instead of the libzim one takes also 3s, but make libzim depends of another dependency.

Sep 06 '21 13:09 mgautierfr

@mgautierfr We could make openssl an optional dependence and keep this version of the code as fauktback?

Nov 28 '21 13:11 kelson42

Maybe it sould be an opportunity to use a super fast hash, like http://cyan4973.github.io/xxHash/.

On the other side, if at some point we need to sign content, see #40, linking to open SSL will be necessary.

Dec 04 '21 05:12 kelson42

Small update: I still think having a faster checksum algorithm would be a nice to have improvement. Maybe adding a dedicated ZIM header would help to handle this clearly (in case this is not md5sum anymore).That said, this does not belong to any roadmap so far. But, if a volunteer wants to take it, glad to see it implemented.

Nov 26 '22 08:11 kelson42

willing to do :>

Mar 11 '23 08:03 juuz0

@juuz0 Any news on this? Or this is harder than you though first?

Mar 22 '23 13:03 kelson42

@kelson42 Sorry for the delay, I got mixed up with college and other PRs, I'll send a PR for this by tomorrow definitely :+1: (have a draft, but need to fix a few things)

Mar 30 '23 12:03 juuz0

@juuz0 Have you been able to make progress or you are stuck somehow?

Apr 29 '23 05:04 kelson42

@kelson42 I'll be working on it.

Feb 24 '24 08:02 aryanA101a

@kelson42 Is there a specific reason why the checksum creation differs in terms of size of pieces in these two?

creator.cpp

struct zim_MD5_CTX md5ctx;
      unsigned char batch_read[1024+1];
      lseek(out_fd, 0, SEEK_SET);
      zim_MD5Init(&md5ctx);
      while (true) {
         auto r = read(out_fd, batch_read, 1024);
         if (r == -1) {
           throw std::runtime_error(std::strerror(errno));
         }
         if (r == 0)
           break;
         batch_read[r] = 0;
         zim_MD5Update(&md5ctx, batch_read, r);
      }
      unsigned char digest[16];
      zim_MD5Final(digest, &md5ctx);

fileimpl.cpp

struct zim_MD5_CTX md5ctx;
    zim_MD5Init(&md5ctx);

    offset_type checksumPos = header.getChecksumPos();
    offset_type currentPos = 0;
    for(auto part = zimFile->begin();
        part != zimFile->end();
        part++) {
      std::ifstream stream(part->second->filename(), std::ios_base::in|std::ios_base::binary);

      char ch;
      for(/*NOTHING*/ ; currentPos < checksumPos && stream.get(ch).good(); currentPos++) {
        zim_MD5Update(&md5ctx, reinterpret_cast<const uint8_t*>(&ch), 1);
      }
      if (stream.bad()) {
        perror("error while reading file");
        return false;
      }
      if (currentPos == checksumPos) {
        break;
      }
    }

    if (currentPos != checksumPos) {
      return false;
    }

    unsigned char chksumCalc[16];
    auto chksumFile = zimReader->get_buffer(offset_t(header.getChecksumPos()), zsize_t(16));

    zim_MD5Final(chksumCalc, &md5ctx);

Feb 24 '24 10:02 aryanA101a

@kelson42 Is there a specific reason why the checksum creation differs in terms of size of pieces in these two?

creator.cpp

struct zim_MD5_CTX md5ctx;
      unsigned char batch_read[1024+1];
      lseek(out_fd, 0, SEEK_SET);
      zim_MD5Init(&md5ctx);
      while (true) {
         auto r = read(out_fd, batch_read, 1024);
         if (r == -1) {
           throw std::runtime_error(std::strerror(errno));
         }
         if (r == 0)
           break;
         batch_read[r] = 0;
         zim_MD5Update(&md5ctx, batch_read, r);
      }
      unsigned char digest[16];
      zim_MD5Final(digest, &md5ctx);

fileimpl.cpp

struct zim_MD5_CTX md5ctx;
    zim_MD5Init(&md5ctx);

    offset_type checksumPos = header.getChecksumPos();
    offset_type currentPos = 0;
    for(auto part = zimFile->begin();
        part != zimFile->end();
        part++) {
      std::ifstream stream(part->second->filename(), std::ios_base::in|std::ios_base::binary);

      char ch;
      for(/*NOTHING*/ ; currentPos < checksumPos && stream.get(ch).good(); currentPos++) {
        zim_MD5Update(&md5ctx, reinterpret_cast<const uint8_t*>(&ch), 1);
      }
      if (stream.bad()) {
        perror("error while reading file");
        return false;
      }
      if (currentPos == checksumPos) {
        break;
      }
    }

    if (currentPos != checksumPos) {
      return false;
    }

    unsigned char chksumCalc[16];
    auto chksumFile = zimReader->get_buffer(offset_t(header.getChecksumPos()), zsize_t(16));

    zim_MD5Final(chksumCalc, &md5ctx);

@mgautierfr ?

Feb 26 '24 05:02 aryanA101a

No real reason. Probably an errant developer (me) picking a value.

Feb 26 '24 09:02 mgautierfr

libzim libzim copied to clipboard

Improve ZIM checksum handling

libzim
libzim copied to clipboard