request to search from 7z archive

Open charulix opened this issue 4 years ago • 6 comments

I would like to request a feature to read from 7zip archives when searching.

Search the contents of archives (

Thanks

charulix avatar Jan 13 '22 06:01 charulix

Saw that one coming from a mile away :smile:

I had this on my TODO list. It is nice to have, BUT sadly this will not be technically easy to add to ugrep or may not even be legally possible.

Firstly, 7zip has no usable C/C++ library to link ugrep with. Also, 7zip's internals are rather complex, and it seems to me they would have to be replicated accurately in ugrep, roughly as described here: https://www.romvault.com/Understanding7z.pdf As you can see, it would be like writing the source code for 7zip from scratch. It is not as simple as forking 7z x to extract files.

Secondly, we cannot use any of the 7zip source code for ugrep directly, because 7zip is LGPL and cannot be used with a BSD3 project like ugrep. The other (de)compression libraries are linked (not compiled) with ugrep, so these do not pose a legal problem from a licensing point of view.

To keep ugrep clean and unencumbered by licensing issues, I wrote my own tar, zip, pax and cpio unarchivers from scratch in C++ that call low-level external decompression functions of libraries (zlib, bzip2, lzma, lz4, zstd) linked with ugrep.

genivia-inc avatar Jan 13 '22 17:01 genivia-inc

As a follow on, and noting that I am not well versed on the complexities of software licensing, could you not link against 7z.dll (or 7za.dll) noting that your project uses it and that it (not your project) is LGPL? I guess my thought comes from https://www.7-zip.org/faq.html#developer_faq

Bladehawke avatar Jun 13 '22 05:06 Bladehawke

no worries, not sure on how to use the dll. not using ugrep on my project yet.

running all in pure batch at the moment. much easier to maintain.

thanks

charulix avatar Jun 14 '22 06:06 charulix

The LZMA SDK includes a CPP/7zip/Archive/7z directory with the C++ source code to extract files, which is a good starting point to look into this, because this code is placed in the public domain. It includes the C++ source code for .7z compression and decompression, albeit a reduced version, so not everything that 7-zip supports may work.

genivia-inc avatar Jun 14 '22 17:06 genivia-inc

So it turns out the documentation is practically non-existent on how to use this 7-zip LZMA SDK's API to decompress archives. No source code comments either. That's awful. It may be free, but comes at a high cost (reverse engineer the API + dev time + debugging/testing).

At a minimum we need to use the lower-level API to decompress .7z archives incrementally in memory, fetch the compressed path names (to display) and decompress their content in memory to send to the search engine.

genivia-inc avatar Jun 15 '22 00:06 genivia-inc

The following works well enough:

ugrep --filter='7z:7z x -so %' pattern filename.7z

The trick can be included in the default configuration:

filter=7z:7z x -so %
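As a rough illustration of what the filter spec above expresses: the part before the colon names the filename extension(s) and the part after it is the command, with % standing for the pathname being searched. The sketch below is not ugrep's implementation; `build_filter_command` is a hypothetical helper that handles a single `exts:command` pair only.

```python
import shlex
from typing import List, Optional


def build_filter_command(spec: str, path: str) -> Optional[List[str]]:
    """Return the argv of the filter for `path`, or None if no filter applies.

    Hypothetical helper for illustration; handles one 'exts:command' pair.
    """
    exts, _, command = spec.partition(":")
    ext = path.rsplit(".", 1)[-1] if "." in path else ""
    if ext not in exts.split(","):
        return None
    # '%' is replaced by the pathname of the file being searched.
    return [path if arg == "%" else arg for arg in shlex.split(command)]


print(build_filter_command("7z:7z x -so %", "filename.7z"))
# ['7z', 'x', '-so', 'filename.7z']
```

The resulting command writes the extracted content to standard output, which is what gets searched in place of the raw .7z bytes.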

hdatma avatar Sep 01 '22 10:09 hdatma

> The following works well enough:
>
> ugrep --filter='7z:7z x -so %' pattern filename.7z

Also specify option -W when searching 7zip archives that may contain binary files. This option prevents ugrep from bailing out with a "binary file matches" warning and instead shows binary matches in hex.

genivia-inc avatar Nov 14 '23 19:11 genivia-inc

thank you

charulix avatar Nov 16 '23 02:11 charulix

I'm reopening this request.

Upon closer inspection of the LZMA SDK it appears there is everything that I need to extract 7zip compressed files in memory to search. I was wrong about licensing. LZMA SDK is placed in the public domain. Part of the LZMA SDK source code files have to be copied in order to compile a decompressor. I will use the LZMA SDK C source code files, which produce smaller object files to link with ugrep.

I will give this a go in the coming days.

genivia-inc avatar Dec 25 '23 20:12 genivia-inc

thank you so much for the update

charulix avatar Dec 26 '23 12:12 charulix

Got it all working 😀

Now I need to make sure the extra code is portable. It includes the 7zip LZMA SDK subset of C files. There is no proper lib7z or something I can (or want to) use.

I could have done this much earlier and faster, if not for the fact that the LZMA SDK is not developer friendly, to put it mildly. The code is convoluted and has no documentation. Just an example 7zMain.c to decompress an archive, which helped.

What's worse about the 7zip LZMA SDK is that huge files are extracted in memory in one piece, allocating multiple GB of memory when you hit a huge compressed file in a 7z archive. At least that's what I could find out from the examples and other info. Decompression of a 7z-archived file is not done by blocks. But that's what I do with all other compression formats, because incrementally decompressing by blocks has several advantages: a) incremental decompression has excellent spatial and temporal locality, which is best for cache/memory access, b) pipelining of the partial results to the search engine allows thread execution parallelism, and c) early termination is possible when a single match is found, e.g. with option -l. We can't do any of that with 7z archives.
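To make the block-wise approach concrete, here is a minimal sketch in Python. It uses the standard `lzma` module on a plain .xz stream, not a .7z archive and not ugrep's C++ code. `stream_contains` is a hypothetical name; it decompresses at most `block_size` bytes at a time and stops at the first match, which is the early termination described in c).

```python
import lzma
from typing import Tuple


def stream_contains(compressed: bytes, needle: bytes,
                    block_size: int = 64 * 1024) -> Tuple[bool, int]:
    """Search an .xz stream block by block.

    Returns (found, number of decompressed bytes produced before stopping).
    """
    decomp = lzma.LZMADecompressor()
    tail = b""        # last len(needle)-1 bytes, to catch matches across blocks
    produced = 0
    pos = 0
    while not decomp.eof:
        if decomp.needs_input:
            if pos >= len(compressed):
                break                   # truncated input, nothing left to feed
            data = compressed[pos:pos + block_size]
            pos += len(data)
        else:
            data = b""                  # drain output that is still buffered
        block = decomp.decompress(data, max_length=block_size)
        produced += len(block)
        window = tail + block
        if needle in window:
            return True, produced       # early exit, as with a -l style search
        keep = len(needle) - 1
        tail = window[-keep:] if keep > 0 else b""
    return False, produced


payload = b"first line has the needle\n" + b"x" * (4 * 1024 * 1024)
blob = lzma.compress(payload, preset=1)
found, produced = stream_contains(blob, b"needle")
print(found, produced < len(payload))   # True True
```

Only a small working buffer is live at any time, regardless of how large the decompressed file is. An all-in-memory extraction has to allocate the full decompressed size before the search can even start.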

Another problem with 7zip overall is that it requires a 7z file to be seekable, i.e. a physical file. This means we can't search 7z files nested in archives. This limitation does not apply to other formats such as tar and zip. The streaming implementations I wrote for those formats accept any source, so such archives can also be searched when nested in other archives.
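The difference can be illustrated with Python's standard `tarfile` module, whose stream mode `"r|"` only needs a `read()` method on its source. This is a sketch of the idea, not ugrep's unarchiver. `ForwardOnly`, `make_tar` and `grep_tar_stream` are hypothetical names, and the `outer:inner` naming of nested hits is an assumption made for this example.

```python
import io
import tarfile
from typing import Dict, List


class ForwardOnly:
    """A source with read() only: no seek(), no tell(), like a pipe."""

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)

    def read(self, n: int = -1) -> bytes:
        return self._buf.read(n)


def make_tar(files: Dict[str, bytes]) -> bytes:
    """Build an uncompressed tar archive in memory."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return out.getvalue()


def grep_tar_stream(stream, needle: bytes) -> List[str]:
    """List members containing needle, descending into nested .tar members."""
    hits = []
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            content = tar.extractfile(member)
            if member.name.endswith(".tar"):
                # The nested archive is read through the same forward-only stream.
                nested = grep_tar_stream(content, needle)
                hits.extend(f"{member.name}:{hit}" for hit in nested)
            elif needle in content.read():
                hits.append(member.name)
    return hits


inner = make_tar({"notes.txt": b"a nested needle\n", "other.txt": b"nothing\n"})
outer = make_tar({"top.txt": b"a needle at top level\n", "inner.tar": inner})
print(grep_tar_stream(ForwardOnly(outer), b"needle"))
```

A format that requires seeking within the file cannot be read this way, which is why a 7z file nested inside another archive would first have to be extracted in full.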

genivia-inc avatar Dec 28 '23 02:12 genivia-inc

The Windows ug.exe and ugrep.exe -z now also work to search 7z files.

I created a static library project in MSVC++ to compile x64 and x86 libraries viiz-x32.lib and viiz-x64.lib with the 7z LZMA SDK parts that I need to link with ugrep. As usual, I will write up the build instructions in the ugrep project vs/ugrep/README.txt for future updates and maintenance.

genivia-inc avatar Dec 28 '23 18:12 genivia-inc

Awesome, thanks for the update!

charulix avatar Dec 29 '23 03:12 charulix

7zip search is now available with ugrep v4.5.0.

After performing several tests with 7zip, I decided to cap the size of files stored in 7zip archives that will be searched at 1GB. Files larger than 1GB will be skipped with a warning message. All other archived files will be searched.

The reason for my decision is that 7zip LZMA SDK requires in-memory expansion. This has several disadvantages as I've commented on earlier. Also, when files are (much) larger than 1GB and we are searching in parallel, then memory is essentially thrashed. I tried a 13GB file for example, which basically locked my machine up. Fortunately, searching files up to 1GB takes a few seconds at the most on a reasonably fast machine, even when searching several 7zip archives in parallel with threads.

I could set the threshold of 1GB a bit larger, but eventually such increases will come at a cost that I find unacceptable.
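The skip policy described above can be sketched as a simple filter over archive entries. This is an illustration only, not the ugrep code. `searchable_entries` is a hypothetical helper, the warning text is made up, and 1GB is assumed here to mean 2**30 bytes.

```python
import sys
from typing import Iterable, Iterator, TextIO, Tuple

MAX_SEARCH_SIZE = 1 << 30  # assumption: 1GB taken as 2**30 bytes


def searchable_entries(entries: Iterable[Tuple[str, int]],
                       limit: int = MAX_SEARCH_SIZE,
                       warn: TextIO = sys.stderr) -> Iterator[Tuple[str, int]]:
    """Yield (name, size) entries small enough to expand in memory.

    Larger entries are skipped with a warning; the rest are still searched.
    """
    for name, size in entries:
        if size > limit:
            print(f"warning: skipping {name}: {size} bytes exceeds "
                  f"the {limit}-byte limit", file=warn)
            continue
        yield name, size


entries = [("small.log", 10_000), ("huge.img", 13 * (1 << 30)), ("ok.txt", 1 << 30)]
print([name for name, _ in searchable_entries(entries)])
# ['small.log', 'ok.txt']
```

Because each file is expanded in memory in full, the worst case with N parallel search threads is roughly N times the cap, for example up to about 8GB with 8 threads. That is the cost that grows when the threshold is raised.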

genivia-inc avatar Jan 05 '24 16:01 genivia-inc