
Compressed docset

Open char101 opened this issue 10 years ago • 34 comments

Hi,

Do you have a plan to support compressed docsets, like a zipped archive? It would help a lot in reducing the size and the disk reading time.

char101 avatar May 04 '14 04:05 char101

This is something I tried to add to Dash, so for what it's worth I'll share my progress. I'm writing this from memory, so sorry for any mistakes.

What seems to be needed is an indexed archive format which lets you extract individual files really fast.

Archive formats I've tried:

  1. Zip has the best index as far as I can tell, and extraction of individual files is really fast. The problem with zip is that the compression benefits are minimal. Zip seems to be really bad at archiving folders with a lot of small files; some docsets even get bigger when you compress them with zip.
  2. 7Zip has an index, but it sucks. As far as I can tell, when you ask 7Zip to unarchive an individual file, it searches through its entire index to find files that match. This takes a very long time for large docsets.
  3. Tar has no index at all.
  4. There is a way to index data inside a gzip-compressed file, using https://code.google.com/p/zran/, so what I've tried is to make my own archive format which appends all of the files into one huge file, then compresses it with gzip and indexes it (see the sketch just below this list). This works great, but extracting individual files sometimes takes a lot longer for some files than for others (in tests on my Mac most files unarchived in 0.01s with this format, but some took 0.1-0.2s). I couldn't figure out why.
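
For illustration, here is a minimal sketch of the "one big file plus an index" idea, in a simpler variant than the zran approach described above: each file is stored as an independent gzip member and a small index maps paths to (offset, length), so a single file can be read with one seek and one small decompression. All names and paths below are made up for the example; this is not the format Dash actually uses.

import gzip, json, os

# Build the archive: concatenate independently gzip-compressed members and
# record where each one starts and how long it is.
def build_archive(src_dir, archive_path, index_path):
    index = {}
    with open(archive_path, "wb") as out:
        for root, _, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                rel = os.path.relpath(path, src_dir)
                with open(path, "rb") as f:
                    member = gzip.compress(f.read())
                index[rel] = (out.tell(), len(member))
                out.write(member)
    with open(index_path, "w") as f:
        json.dump(index, f)

# Read one file: seek straight to its member and decompress only that.
def read_file(archive_path, index, rel_path):
    offset, length = index[rel_path]
    with open(archive_path, "rb") as f:
        f.seek(offset)
        return gzip.decompress(f.read(length))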

If anyone has any experience with archive formats, help would be appreciated :+1:

Kapeli avatar May 04 '14 12:05 Kapeli

Hi,

Thanks for your explanation on this.

Personally I don't think the docset size itself really matters, given the size of current hard disks. The problem I'm facing is the number of small files: the storage size can be much larger than the actual data because each small file takes at least a filesystem block (4 KB?), even when the file is only 1 KB. Also, moving the docsets takes a very long time.
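
To put a rough number on that overhead, here is a quick sketch that walks a docset directory and estimates the slack caused by block rounding (the 4096-byte block size and the path are assumptions for the example):

import os

BLOCK = 4096  # assumed filesystem block size
total = allocated = 0
for root, _, files in os.walk("/path/to/SomeDocset.docset"):  # hypothetical path
    for name in files:
        size = os.path.getsize(os.path.join(root, name))
        total += size
        # every file occupies a whole number of blocks on disk
        allocated += ((size + BLOCK - 1) // BLOCK) * BLOCK

print(f"actual: {total}  allocated: {allocated}  slack: {allocated - total}")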

I tried compressing several of the docsets with zip and 7z (sizes in bytes):

Yii

Total size                       15 270 914
Size on disk                     15 904 768
Zipped (max)                      1 828 455
7z (PPMd)                           604 276

J2SE

Total size                      295 540 254
Size on disk                    318 820 352
Zipped (max)                     52 252 656
7z (PPMd)                        18 412 863

While zip does not compress as small as 7z with PPMd, it still achieves a pretty good compression ratio.

char101 avatar May 05 '14 02:05 char101

Unfortunately, I have no interest in pursuing this other than for file size issues. The size of the current HDDs is not an issue, but the size of SSDs is.

Also there are hosting and bandwidth issues, so size matters there as well, but I think one way around that would be to compress using one format (zip) and then recompress that using tgz or another format.

Zip does work for some docsets, but fails with others. I can't remember which. Sorry.

Kapeli avatar May 05 '14 03:05 Kapeli

I understand your reasoning, but speaking of file size, surely for the user having the docsets in zip format will still take much less space than the uncompressed files.

What do you think about distributing the docsets in 7z format (less distribution bandwidth, faster download times - why 7z? Because AFAIK only 7z supports the PPMd algorithm, which gives the fastest and smallest compression for text files) and converting them to zip after the user downloads them? Hopefully this can be done without using a temporary file (longer lifetime for the SSD).
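
Roughly what that conversion could look like, as a sketch using the third-party py7zr package (an assumption for illustration; Zeal itself would presumably do this in C++): the 7z entries are unpacked into memory and written straight into a zip, with no temporary files on disk.

import zipfile
import py7zr  # third-party, used here only as an assumption

def seven_zip_to_zip(src_7z, dst_zip):
    # Unpack the downloaded 7z archive into memory...
    with py7zr.SevenZipFile(src_7z, mode="r") as archive:
        entries = archive.readall()  # {name: BytesIO}
    # ...and repack it as a zip without touching a temporary directory.
    with zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as out:
        for name, data in entries.items():
            out.writestr(name, data.read())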

char101 avatar May 05 '14 03:05 char101

We use a zip format for storing text documents in Mono's documentation tool. We use our own indexing code. If you're interested, I can point you to the specific code in the mono project.

lobrien avatar Dec 06 '14 01:12 lobrien

Adding some thoughts...

I am planning to add QCH (Qt Assistant format) and CHM support to Zeal at some point in the future. Both formats provide everything in a single file.

CHM files are compressed with the LZX algorithm. QCH is just an SQLite database and does not provide any compression.

As the next step I'd like to evaluate a Zeal-specific format (most likely extended from QCH) which would provide some level of compression for the data. I am not sure whether that would work out with the planned full-text search.

trollixx avatar Feb 09 '15 20:02 trollixx

Compressing a single row in an SQLite database would be less effective since the compression dictionary would be limited to that single text, wouldn't it?

I think it's more practical to use zip as the archive format, then embed the ToC and index as JSON files inside the zip. The full-text search index can be created when the documentation is added for the first time. Converters can be created to convert from CHM/QCH to the zip format. This will also keep the binary size smaller since you do not have to embed the decoding libraries in Zeal. Users who want to create their own documentation can simply zip the HTML files and add them to Zeal.
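
A rough sketch of that layout, just to illustrate the idea (the file name index.json and the entry fields are made up): the HTML pages go into the zip as-is, the index travels as a JSON member, and opening a page is a single lookup via the zip's central directory.

import json, zipfile

def pack(html_files, index_entries, zip_path):
    # html_files: {path_in_docset: html_string}, index_entries: list of symbols
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as z:
        for path, content in html_files.items():
            z.writestr(path, content)
        z.writestr("index.json", json.dumps(index_entries))

def open_page(zip_path, page_path):
    with zipfile.ZipFile(zip_path) as z:
        return z.read(page_path)  # random access via the zip central directory

# Example usage with made-up content:
pack({"docs/array.html": "<html>Array</html>"},
     [{"name": "Array", "type": "Class", "path": "docs/array.html"}],
     "example.zip")
print(open_page("example.zip", "docs/array.html"))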

char101 avatar Feb 10 '15 02:02 char101

@Kapeli Could you try testing lrzip? http://ck.kolivas.org/apps/lrzip/ You can use -l or -Ul (LZO) for comparison if you want fast decompression. Benchmarks: http://ck.kolivas.org/apps/lrzip/README.benchmarks

zjzdy avatar Mar 28 '15 23:03 zjzdy

A lot has changed since I last posted in this issue. I forgot it even exists. Sorry!

Anyways, Dash for iOS supports archived docsets right now. Dash for OS X will get support for archived docsets in a future update too. Archived docsets are only supported for my "official" docsets (i.e. the ones at https://kapeli.com/docset_links) and for user-contributed docsets. This is enough, as those are the docsets that can get quite large; the others are not really an issue.

I still use tgz for the archived docset format; the only difference is that I compress the docsets using tarix, which has proven to be very reliable.

Performance-wise, it takes about 5-10 times longer to read a file from the archive than it takes to read it directly from disk. Directly from disk on my Mac it takes up to 0.001s for the larger doc pages, while from an archived docset it takes up to 0.01s.
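
The principle behind a tar index, shown as a minimal sketch on an uncompressed tar (tarix additionally handles the gzip layer, which is left out here): record each member's data offset and size once, then serve later reads with a plain seek instead of scanning the archive.

import tarfile

def build_index(tar_path):
    # One pass over the archive to remember where each member's data starts.
    index = {}
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index

def read_member(tar_path, index, name):
    # Later reads seek straight to the data, no scanning needed.
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)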

Despite that, there's no noticeable impact: when a page is loaded, the actual read of the files takes very little time compared to loading the WebView and the DOM (the WebView takes up about 90% of the load time).

Kapeli avatar Mar 29 '15 00:03 Kapeli

@Kapeli @trollixx So Dash doesn't need to fully decompress the tgz file? (I'm not sure if that is what you meant.) Zeal, however, still needs to fully decompress the tgz file, so I think you see what I'm getting at. :) I looked at the tarix project; it's a good project, but it seems it hasn't been updated in a long time?

zjzdy avatar Mar 29 '15 03:03 zjzdy

Dash does not need to decompress the tgz file anymore, no.

Kapeli avatar Mar 29 '15 12:03 Kapeli

Sounds interesting. I'll look into handling tarix indices to eliminate docset unpacking. I hadn't heard of tarix before.

trollixx avatar Mar 30 '15 02:03 trollixx

Kind reminder: the index file could be extracted in advance, because access to the index file is I/O intensive.

zjzdy avatar Apr 01 '15 10:04 zjzdy

In the meantime, Mac users can use HFS compression, and Linux users can put their docset folder on a filesystem with transparent compression like btrfs or ZFS.

RJVB avatar Feb 04 '17 09:02 RJVB

about bundling, in numbers:

  • a VHD container (NTFS, compression enabled) with docsets is 19 GB in size
  • the ~700 thousand files inside the VHD have a combined size of ~9 GB

It seems 10 GB is spent storing file tables, attributes, etc.

I think bundling (compressed or not) is a must.

reclaimed avatar Mar 31 '17 13:03 reclaimed

On Friday March 31 2017 06:35:40 evgeny g likov wrote:

about bundling, in numbers:

  • a VHD container (NTFS, compression enabled) with docsets is 19 GB in size
  • the ~700 thousand files inside the VHD have a combined size of ~9 GB

That many files will almost unavoidably lead to disk space overhead ("waste") because the chances are slim that the majority will be an exact multiple of the disk block size (4096 bytes for most modern disks). Not to mention the free-space fragmentation they can cause.

RJVB avatar Mar 31 '17 13:03 RJVB

That many files will almost unavoidably lead to disk space overhead ("waste") because the chances are slim that the majority will be an exact multiple of the disk block size (4096 bytes for most modern disks). Not to mention the free-space fragmentation they can cause.

I think what you meant was the filesystem block. A disk block (sector) is only used for addressing, while a single file cannot occupy less than a filesystem block.

char101 avatar Mar 31 '17 14:03 char101

How about using dar to store and compress the docsets?

livelazily avatar Aug 06 '18 02:08 livelazily

If the goal is not to preserve the docset bundle "as is", couldn't you use a lightweight key/value database engine like LMDB? File names (or paths) would be the keys, and then you can use whatever compression gives the desired cost/benefit trade-off to store the values (i.e. the file content). I've used this approach (with LZ4 compression) to replace a file-based data cache in my personal KDevelop fork, and it works quite nicely (with an API that mimics the file I/O API). This gives me 2 files on disk instead of thousands, which is evidently a lot more efficient.
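
For what it's worth, a minimal sketch of that approach using the third-party lmdb and lz4 Python packages (only an illustration of the shape; the KDevelop code mentioned above is C++, and the database name and map_size here are arbitrary): paths are the keys, LZ4-compressed file contents are the values.

import lmdb        # third-party
import lz4.frame   # third-party

env = lmdb.open("docset.lmdb", map_size=2 * 1024**3)  # arbitrary 2 GB upper bound

def put_file(path, content):
    # Store the compressed file content under its path.
    with env.begin(write=True) as txn:
        txn.put(path.encode(), lz4.frame.compress(content))

def get_file(path):
    # Mirror of a plain file read: look up by path, decompress, return bytes.
    with env.begin() as txn:
        return lz4.frame.decompress(txn.get(path.encode()))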

FWIW my docset collection is over 3 GB before HFS compression, and just over 1 GB after. I have enough disk space not to compress, but that doesn't mean I spit on saving 2 GB. "There are no small economies", as they say in France, and following that guideline is probably why I still have lots of free disk space.

RJVB avatar Nov 28 '18 16:11 RJVB

SQLite with LZ4 or zstd for blob compression is what I have in mind. There are also some larger goals that I hope to achieve with moving to the new docset format, such as embedded metadata, ToC support, etc.
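
Nothing is designed yet, but as a rough sketch of what "SQLite with zstd-compressed blobs" could look like (the schema, table name, and compression level are purely illustrative; zstandard is a third-party package):

import sqlite3
import zstandard  # third-party

db = sqlite3.connect("docset.db")
db.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, data BLOB)")
cctx = zstandard.ZstdCompressor(level=19)
dctx = zstandard.ZstdDecompressor()

def put_file(path, content):
    # Each documentation file becomes one compressed blob.
    db.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (path, cctx.compress(content)))
    db.commit()

def get_file(path):
    row = db.execute("SELECT data FROM files WHERE path = ?", (path,)).fetchone()
    return dctx.decompress(row[0])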

trollixx avatar Dec 02 '18 03:12 trollixx

Zstandard supports precomputed dictionaries, which should be beneficial for compressing a lot of small files.

char101 avatar Dec 02 '18 09:12 char101

On 02 Dec 2018, at 10:34, Charles wrote:

Zstandard supports precomputed dictionaries, which should be beneficial for compressing a lot of small files.

I think that argument is largely moot when you combine files in a single compressed file (which doesn't mean there can't be a benefit to using a dictionary; lz4 allows this too).

RJVB avatar Dec 02 '18 12:12 RJVB

I think that argument is largely moot when you combine files in a single compressed file

When storing the files in a key-value database or SQLite, each file is compressed independently, which is why a precomputed dictionary will improve the compression significantly, not to mention that the dictionary is stored once instead of effectively being duplicated in every row.
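
To make that concrete, here is a sketch of training one shared dictionary per docset and reusing it for every row, using the third-party zstandard package (the dictionary size and the path are arbitrary examples):

import os
import zstandard  # third-party

def train_docset_dict(src_dir, dict_size=100 * 1024):
    # Collect the raw files as training samples for one shared dictionary.
    samples = []
    for root, _, files in os.walk(src_dir):
        for name in files:
            with open(os.path.join(root, name), "rb") as f:
                samples.append(f.read())
    return zstandard.train_dictionary(dict_size, samples)

zdict = train_docset_dict("/path/to/SomeDocset.docset/Contents/Resources/Documents")  # hypothetical path
compressor = zstandard.ZstdCompressor(dict_data=zdict)
decompressor = zstandard.ZstdDecompressor(dict_data=zdict)
# Store zdict.as_bytes() once per docset, then compressor.compress(file_bytes)
# for each row; decompressor reverses it with the same dictionary.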

char101 avatar Dec 03 '18 01:12 char101

Using one dictionary per docset is an interesting idea, definitely worth benchmarking.

Regarding LZ4 and zstd, I just mentioned these two as an example, nothing has been decided so far.

trollixx avatar Dec 03 '18 05:12 trollixx

Just wanted to say that I feel this is the number 1 issue with Zeal and should be given much higher priority. Tens to hundreds of thousands of files mean that any time I perform a large disk I/O task on any of my systems, it gets choked on the Zeal docsets.

If I try to use a large-directory-finding program (like WinDirStat or KDirStat) I have to wait as Zeal's docsets take up roughly 1/2 to 1/3 of the total search time. Making backups or copies of my home directory takes ages as the overhead for reading each of these files is incredible. I bet the search cache must be much larger and slower on each of my systems because of having to index all of Zeal's docsets.

Even Doom (a very, very early example of a game we have the source code to) solved this problem back in the day. Almost all of the game's data is stored in a few "WAD" files (standing for "Where's All the Data?"). If users want to play user-made mods or back up their game's data they just need to copy and paste a WAD file.

Sorry if this comes across as bitching or complaining, I'm just trying to express how much this issue matters to me (and presumably many other users). I'm going to try to dust off my programming skills and work on this too.

fearofshorts avatar Mar 07 '20 11:03 fearofshorts

Here is a workaround for this issue:

#!/bin/bash
cd /home/user/.local/share/Zeal/Zeal
mkdir -p mnt mnt/lowerdir mnt/upperdir mnt/workdir
#sudo mount docsets.sqsh mnt/lowerdir -t squashfs -o loop
#sudo mount -t overlay -o lowerdir=mnt/lowerdir,upperdir=mnt/upperdir,workdir=mnt/workdir overlay docsets
mount mnt/lowerdir
mount docsets
/usr/bin/zeal "$@"
umount docsets
umount mnt/lowerdir

# prepare:
#   mksquashfs docsets docsets.sqsh

# fstab:
#   /home/user/.local/share/Zeal/Zeal/docsets.sqsh /home/user/.local/share/Zeal/Zeal/mnt/lowerdir squashfs user,loop,ro 0 0
#   /dev/loop0 /home/user/.local/share/Zeal/Zeal/mnt/lowerdir squashfs user,loop,ro 0 0
#   overlay /home/user/.local/share/Zeal/Zeal/docsets overlay noauto,lowerdir=/home/user/.local/share/Zeal/Zeal/mnt/lowerdir,upperdir=/home/user/.local/share/Zeal/Zeal/mnt/upperdir,workdir=/home/user/.local/share/Zeal/Zeal/mnt/workdir,user 0 0

# Name the file /usr/local/bin/zeal so it takes priority over /usr/bin/zeal;
# docsets.sqsh will then be mounted before running Zeal and unmounted after it exits.

coding-moding avatar Jul 03 '20 04:07 coding-moding

here is workaround for this issue:

I've got an even bigger/longer one ;)

  • migrate your entire root to ZFS
  • create a dataset for the docsets, with compression=gzip-9, decide where to mount it (I use /opt/docs/docsets)
  • move all docsets there, and point zeal to that path in its settings.

However, every solution that uses filesystem-based compression will still be suboptimal, because even the tiniest file in the docset will still occupy the minimum filesystem or disk block - and it is not cross-platform. The way around that would be for Zeal itself to support compressed docsets, or simply to use one of the existing libraries to access a compressed archive as a directory. Compressed archives can be packed much more compactly than a generic filesystem, and they're cross-platform.

RJVB avatar Jul 03 '20 08:07 RJVB

nope. you completely missed the point). look at the year:

char101 commented on 4 May 2014

and there wasn't a solution. just speculations like yours.)

coding-moding avatar Jul 03 '20 10:07 coding-moding

Any progress or formal thoughts on this, or where to start? Text compression would be 🔥. Docsets are currently using 1/3 of my expensive 256 GB Mac NVMe.

ashtonian avatar Sep 14 '20 21:09 ashtonian

Maybe you can zip your data directory, mount it using fuse, then put an overlay filesystem over it to allow for modifications.

char101 avatar Sep 16 '20 12:09 char101