grep-windows
FR: Implement direct processing of UTF-16 (a.k.a. Unicode) files
Hi,
Thank you very much for the Windows build!
I'm tasked with the following thing: I analyze a lot of large (20 GB+) log files that are UTF-16 encoded (that is, a single char is 2 bytes). Today, in order to grep them, I need to run 'iconv' first, such as:
iconv -f utf-16 -t utf-8 logfile | grep ... | ...
It takes a few minutes to finish. The majority of time is spent on the string conversion.
My idea is: if there was a separate option, such as -utf16, that would natively read such files and natively grep them without converting first, it would make the whole thing way faster. I think your grep would then be the fastest way on the planet to parse UTF-16 files.
PowerShell has this Select-String command with an -Encoding parameter, but I had to stop it after ~25 minutes, whereas iconv | grep needed 3 minutes for a ~45 GB file. I think more than 2 minutes would be saved by not converting the string. When you do a lot of these, it adds up.
Just an idea. Thanks!
Hi.
I checked a large file being converted via iconv and piped to grep:
$ iconv -f utf16 -t utf8 y | ./grep aaa
In this case, grep spends a lot of time reading from the pipe.
For example, in the read function grep wants to read 50 megabytes at once, but the pipe delivers only 65-kilobyte blocks.
One optimization is therefore in the read function: keep reading 65-kilobyte blocks until the entire 50-megabyte buffer is filled, and only then hand the buffer over for processing.
This optimization increases the processing speed by an order of magnitude.
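For illustration, a minimal sketch of that kind of read loop (not grep's actual code, just the idea of draining the pipe until the large buffer is full):

```c
#include <unistd.h>
#include <errno.h>

/* Sketch: a pipe delivers data in small chunks (~65 KB here), so keep
   calling read() until the whole large buffer is full or EOF is reached,
   and only then hand the buffer to the matcher. */
static ssize_t read_full(int fd, char *buf, size_t want)
{
    size_t got = 0;
    while (got < want) {
        ssize_t n = read(fd, buf + got, want - got);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;          /* real read error */
        }
        if (n == 0)
            break;              /* EOF: return what we have */
        got += (size_t)n;
    }
    return (ssize_t)got;
}
```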
Also, I implemented inline UTF16->UTF8 conversion in grep's read function, but this feature is not ready yet: it requires more work - adding an option to set the locale so that grep expressions can be written in UTF-8 (like I did for sed), and opening the input file in binary mode (with manual CRLF->LF conversion)...
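For reference, the core of such an inline conversion looks roughly like this (a simplified sketch, not the actual patch; it assumes UTF-16LE input and omits BOM/byte-order detection and validation of unpaired surrogates):

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified UTF-16LE -> UTF-8 converter: reads 16-bit code units,
   combines surrogate pairs, and emits UTF-8 bytes. Assumes 'dst' has
   room for up to 4 bytes per decoded code point. */
static size_t utf16le_to_utf8(const uint8_t *src, size_t n_units, uint8_t *dst)
{
    size_t i = 0, o = 0;
    while (i < n_units) {
        uint32_t c = src[2*i] | (src[2*i + 1] << 8);
        i++;
        if (c >= 0xD800 && c <= 0xDBFF && i < n_units) {
            uint32_t lo = src[2*i] | (src[2*i + 1] << 8);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {      /* surrogate pair */
                c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
                i++;
            }
        }
        if (c < 0x80) {
            dst[o++] = (uint8_t)c;
        } else if (c < 0x800) {
            dst[o++] = 0xC0 | (c >> 6);
            dst[o++] = 0x80 | (c & 0x3F);
        } else if (c < 0x10000) {
            dst[o++] = 0xE0 | (c >> 12);
            dst[o++] = 0x80 | ((c >> 6) & 0x3F);
            dst[o++] = 0x80 | (c & 0x3F);
        } else {
            dst[o++] = 0xF0 | (c >> 18);
            dst[o++] = 0x80 | ((c >> 12) & 0x3F);
            dst[o++] = 0x80 | ((c >> 6) & 0x3F);
            dst[o++] = 0x80 | (c & 0x3F);
        }
    }
    return o;   /* number of UTF-8 bytes written */
}
```

The search itself then runs on the converted UTF-8 buffer, so the existing regex engine does not need to change.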
Unfortunately, I was wrong about optimizing the reading function - this does not increase processing speed.
My observation is as follows.
An example source file has 45,401,170,734 bytes when saved as UTF-16, 22,700,585,788 bytes when saved as UTF-8, and 72,553,147 lines.
iconv -f UTF-16 -t UTF-8 takes 0h3m55s030.
grep -F on the UTF-8 encoded file takes 0h0m22s590 - note that this is about 1 GB per second, which is impressive!
I/O is negligible: I have a fast NVMe PCIe 4.0 disk capable of 7 GB/s and 1 million IOPS.
I used the following script to measure command time: https://stackoverflow.com/a/6209392
Therefore about 91% of the total time (235.03 s out of 257.62 s) is the conversion from UTF-16 to UTF-8 (and I believe iconv is doing it as fast as possible). Only 9% is the actual filtering time (and I think grep has no bottlenecks here).
If we could skip the conversion altogether, we would win back that 91% of the time. For that I think we would need a separate, native UTF-16 regex engine that accepts 2-byte wide chars and processes them internally without doing any conversion. That's my suggestion - I don't know whether it's doable, that's for your consideration :)
Thanks!
Ok, I added a built-in UTF16->UTF8 conversion mode - enabled with the "--utf16" option.
To reduce memory copying, the input file is mapped directly into memory.
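For illustration, a minimal Win32 sketch of mapping a whole input file read-only into memory (the real patch of course integrates this into grep's read path; the file name "y" here is just the test file from below):

```c
#include <windows.h>
#include <stdio.h>

/* Sketch: map the entire file read-only so the UTF-16 data can be
   converted/scanned in place, without intermediate read() copies.
   Error handling is reduced to the bare minimum for brevity. */
int main(void)
{
    HANDLE f = CreateFileA("y", GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (f == INVALID_HANDLE_VALUE)
        return 1;

    LARGE_INTEGER size;
    GetFileSizeEx(f, &size);

    HANDLE map = CreateFileMappingA(f, NULL, PAGE_READONLY, 0, 0, NULL);
    const unsigned char *data = MapViewOfFile(map, FILE_MAP_READ, 0, 0, 0);
    if (data == NULL)
        return 1;

    /* ... walk data[0 .. size.QuadPart - 1] as UTF-16 code units here ... */
    printf("mapped %lld bytes\n", size.QuadPart);

    UnmapViewOfFile(data);
    CloseHandle(map);
    CloseHandle(f);
    return 0;
}
```

Mapping avoids copying tens of gigabytes through an intermediate read buffer; the converter reads the UTF-16 code units straight from the mapped view.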
The time measurements were as follows:
- my test file "y", encoded UTF16-BE, 9 GB in size:
$ ls -l y
9454188700 Nov 2 09:53 y
$ file y
y: Unicode text, UTF-16, big-endian text, with very long lines (338)
$ wc -l y
55000000
- convert it to UTF-8:
$ time iconv -f utf16 -t utf8 y > z
real 0m52.928s
user 0m43.234s
sys 0m9.452s
(the file contains mostly Russian letters - 2 UTF8 bytes per letter)
$ ls -l z
8597350300 Nov 3 18:02 z
$ file z
z: Unicode text, UTF-8 text, with very long lines (338)
$ wc -l z
55000000 z
- call grep via iconv:
$ time iconv -f utf16 -t utf8 y | ./grep-3.11-x64 -F 2023 > q3
real 0m52.634s
user 0m44.593s
sys 0m7.952s
- call grep on a UTF-8 file:
$ time ./grep-3.11-x64 -F 2023 ./z > q2
real 0m7.518s
user 0m0.000s
sys 0m0.015s
- and finally, calling grep on a UTF-16 file using built-in UTF16->UTF8 conversion:
$ time ./grep-3.11-x64 --utf16 -F 2023 ./y > q1
real 0m13.381s
user 0m0.000s
sys 0m0.000s
- the results of all three searches are the same:
$ diff q1 q2
$ diff q1 q3
$ wc -l q1
8683950 q1
$ wc -l q2
8683950 q2
$ wc -l q3
8683950 q3
I agree that working directly with UTF16 strings in grep would give speed comparable to searching UTF8 text.
But I'm not ready to dig through the entire regex engine yet :) Maybe I'll get to it when I have time...
Mapping the input file into memory gave an interesting result - if this mode finds wider use, it may be worth adding it as an option for regular searches as well.
Wow, so by implementing --utf16 there's a lot of gain! That would be a great improvement! Now if you can only release it :-)
I have already compiled versions with --utf16 support.
Interestingly, grep can now be used as a faster alternative to iconv (assuming the input file does not contain invalid UTF-16 characters):
$ time ./grep-3.11-x64 --utf16 -F "" ./yt > xx1
real 0m0.927s
user 0m0.000s
sys 0m0.015s
$ time iconv -f utf-16 -t utf-8 ./yt > xx2
real 0m1.798s
user 0m1.515s
sys 0m0.249s
$ tr -d '\r' < ./xx1 > ./xx3
$ ls -l xx1 xx2 xx3
-rw-rw-r--+ 1 292811098 Nov 20 09:15 xx1
-rw-rw-r--+ 1 290949800 Nov 20 09:15 xx2
-rw-rw-r--+ 1 290949800 Nov 20 09:16 xx3
$ diff --binary ./xx2 ./xx3
(tr is needed because the input file ./yt has LF line endings, but ./grep-3.11-x64, as a Windows application, prints lines with CRLF line endings).