grep-windows
FR: Implement direct processing of UTF-16 (a.k.a. Unicode) files
Hi,
Thank you very much for the Windows build!
I'm tasked with the following thing: I analyze a lot of large (20 GB+) log files that are UTF-16 encoded (that is, a single char is 2 bytes). Today, in order to grep them, I need to run 'iconv' first, such as:
iconv -f utf-16 -t utf-8 logfile | grep ... | ...
It takes a few minutes to finish. The majority of time is spent on the string conversion.
My idea is: if there was a separate option, such as -utf16, that would natively read such files and natively grep them without converting first, it would make the whole thing way faster. I think your grep would then be the fastest way on the planet to parse UTF-16 files.
PowerShell has this Select-String command with an -Encoding parameter, but I had to stop it after ~25 minutes, whereas iconv | grep needed 3 minutes for a ~45 GB file. I think more than 2 minutes would be saved by not converting the string. When you do a lot of these, it adds up.
Just an idea. Thanks!
Hi.
I checked a large file being converted via iconv and piped to grep:
$ iconv -f utf16 -t utf8 y | ./grep aaa
In this case, grep spends a lot of time reading from the pipe.
For example, in the read function grep wants to read 50 megabytes at once, but the pipe delivers only 65-kilobyte blocks.
One optimization is therefore in the read function: keep reading 65-kilobyte blocks until the entire 50-megabyte buffer is filled, and only then hand the buffer over for processing.
This optimization increases the processing speed by an order of magnitude.
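For illustration, a minimal sketch of that kind of read loop (not grep's actual code, just the idea of draining the pipe until the large buffer is full):

```c
#include <unistd.h>
#include <errno.h>

/* Sketch: a pipe delivers data in small chunks (~65 KB here), so keep
   calling read() until the whole large buffer is full or EOF is reached,
   and only then hand the buffer to the matcher. */
static ssize_t read_full(int fd, char *buf, size_t want)
{
    size_t got = 0;
    while (got < want) {
        ssize_t n = read(fd, buf + got, want - got);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;          /* real read error */
        }
        if (n == 0)
            break;              /* EOF: return what we have */
        got += (size_t)n;
    }
    return (ssize_t)got;
}
```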
Also, I implemented inline UTF16->UTF8 conversion in grep's read function, but this feature is not ready yet: it requires more work - adding an option to set the locale so that grep expressions can be written in UTF-8 (like I did for sed), and opening the input file in binary mode (with manual CRLF->LF conversion)...
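For reference, the core of such an inline conversion looks roughly like this (a simplified sketch, not the actual patch; it assumes UTF-16LE input and omits BOM/byte-order detection and validation of unpaired surrogates):

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified UTF-16LE -> UTF-8 converter: reads 16-bit code units,
   combines surrogate pairs, and emits UTF-8 bytes. Assumes 'dst' has
   room for up to 4 bytes per decoded code point. */
static size_t utf16le_to_utf8(const uint8_t *src, size_t n_units, uint8_t *dst)
{
    size_t i = 0, o = 0;
    while (i < n_units) {
        uint32_t c = src[2*i] | (src[2*i + 1] << 8);
        i++;
        if (c >= 0xD800 && c <= 0xDBFF && i < n_units) {
            uint32_t lo = src[2*i] | (src[2*i + 1] << 8);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {      /* surrogate pair */
                c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
                i++;
            }
        }
        if (c < 0x80) {
            dst[o++] = (uint8_t)c;
        } else if (c < 0x800) {
            dst[o++] = 0xC0 | (c >> 6);
            dst[o++] = 0x80 | (c & 0x3F);
        } else if (c < 0x10000) {
            dst[o++] = 0xE0 | (c >> 12);
            dst[o++] = 0x80 | ((c >> 6) & 0x3F);
            dst[o++] = 0x80 | (c & 0x3F);
        } else {
            dst[o++] = 0xF0 | (c >> 18);
            dst[o++] = 0x80 | ((c >> 12) & 0x3F);
            dst[o++] = 0x80 | ((c >> 6) & 0x3F);
            dst[o++] = 0x80 | (c & 0x3F);
        }
    }
    return o;   /* number of UTF-8 bytes written */
}
```

The search itself then runs on the converted UTF-8 buffer, so the existing regex engine does not need to change.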
Unfortunately, I was wrong about optimizing the reading function - this does not increase processing speed.
My observation is as follows.
An example source file has 45,401,170,734 bytes when saved as UTF-16, 22,700,585,788 bytes when saved as UTF-8, and 72,553,147 lines.
iconv -f UTF-16 -t UTF-8 takes 0h3m55s030.
grep -F on the UTF-8 encoded file takes 0h0m22s590 - note that this is about 1 GB per second, which is impressive!
I/O is negligible: I have a fast NVMe PCIe 4.0 disk capable of 7 GB/s and 1 million IOPS.
I used the following script to measure command time: https://stackoverflow.com/a/6209392
Therefore about 91% of the total time (235.03 s out of 257.62 s) is the conversion from UTF-16 to UTF-8 (and I believe iconv is doing it as fast as possible). Only 9% is the actual filtering time (and I think grep has no bottlenecks here).
If we could skip the conversion altogether, we would win back that 91% of the time. For that I think we would need a separate, native UTF-16 regex engine that accepts 2-byte wide chars and processes them internally without doing any conversion. That's my suggestion - I don't know whether it's doable, that's for your consideration :)
Thanks!
Ok, I added a built-in UTF16->UTF8 conversion mode - enabled with the "--utf16" option.
To reduce memory copying, the input file is mapped directly into memory.
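For illustration, a minimal Win32 sketch of mapping a whole input file read-only into memory (the real patch of course integrates this into grep's read path; the file name "y" here is just the test file from below):

```c
#include <windows.h>
#include <stdio.h>

/* Sketch: map the entire file read-only so the UTF-16 data can be
   converted/scanned in place, without intermediate read() copies.
   Error handling is reduced to the bare minimum for brevity. */
int main(void)
{
    HANDLE f = CreateFileA("y", GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (f == INVALID_HANDLE_VALUE)
        return 1;

    LARGE_INTEGER size;
    GetFileSizeEx(f, &size);

    HANDLE map = CreateFileMappingA(f, NULL, PAGE_READONLY, 0, 0, NULL);
    const unsigned char *data = MapViewOfFile(map, FILE_MAP_READ, 0, 0, 0);
    if (data == NULL)
        return 1;

    /* ... walk data[0 .. size.QuadPart - 1] as UTF-16 code units here ... */
    printf("mapped %lld bytes\n", size.QuadPart);

    UnmapViewOfFile(data);
    CloseHandle(map);
    CloseHandle(f);
    return 0;
}
```

Mapping avoids copying tens of gigabytes through an intermediate read buffer; the converter reads the UTF-16 code units straight from the mapped view.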
The time measurements were as follows:
- my test file "y", encoded UTF16-BE, 9 GB in size:
$ ls -l y
9454188700 Nov 2 09:53 y
$ file y
y: Unicode text, UTF-16, big-endian text, with very long lines (338)
$ wc -l y
55000000
- convert it to UTF-8:
$ time iconv -f utf16 -t utf8 y > z
real 0m52.928s
user 0m43.234s
sys 0m9.452s
(the file contains mostly Russian letters - 2 UTF8 bytes per letter)
$ ls -l z
8597350300 Nov 3 18:02 z
$ file z
z: Unicode text, UTF-8 text, with very long lines (338)
$ wc -l z
55000000 z
- call grep via iconv:
$ time iconv -f utf16 -t utf8 y | ./grep-3.11-x64 -F 2023 > q3
real 0m52.634s
user 0m44.593s
sys 0m7.952s
- call grep on a UTF-8 file:
$ time ./grep-3.11-x64 -F 2023 ./z > q2
real 0m7.518s
user 0m0.000s
sys 0m0.015s
- and finally, calling grep on a UTF-16 file using built-in UTF16->UTF8 conversion:
$ time ./grep-3.11-x64 --utf16 -F 2023 ./y > q1
real 0m13.381s
user 0m0.000s
sys 0m0.000s
- the results of all three searches are the same:
$ diff q1 q2
$ diff q1 q3
$ wc -l q1
8683950 q1
$ wc -l q2
8683950 q2
$ wc -l q3
8683950 q3
I agree that working directly with UTF16 strings in grep would give speed comparable to searching UTF8 text.
But I'm not ready to dig through the entire regex engine yet :) Maybe I'll get to it when I have time...
Mapping the input file into memory gave an interesting result - if this mode finds wider use, it may be worth adding it as an option for regular searches as well.
Wow, so by implementing --utf16 there's a lot of gain! That would be a great improvement! Now if you can only release it :-)
I have already compiled versions with --utf16 support.
Interestingly, grep can now be used as a faster alternative to iconv (assuming the input file does not contain invalid UTF-16 characters):
$ time ./grep-3.11-x64 --utf16 -F "" ./yt > xx1
real 0m0.927s
user 0m0.000s
sys 0m0.015s
$ time iconv -f utf-16 -t utf-8 ./yt > xx2
real 0m1.798s
user 0m1.515s
sys 0m0.249s
$ tr -d '\r' < ./xx1 > ./xx3
$ ls -l xx1 xx2 xx3
-rw-rw-r--+ 1 292811098 Nov 20 09:15 xx1
-rw-rw-r--+ 1 290949800 Nov 20 09:15 xx2
-rw-rw-r--+ 1 290949800 Nov 20 09:16 xx3
$ diff --binary ./xx2 ./xx3
(tr is needed because the input file ./yt has LF line endings, but ./grep-3.11-x64, as a Windows application, prints lines with CRLF line endings).