Kansa
Replace hashing function with faster implementation
Hey, I found that the hashing function used throughout Kansa is inefficient and slow. It's okay for small operations like getting hashes of processes, but very slow when you hash everything under a path or a whole disk. Basically the problem is the ReadAllBytes function that you are using. In my tests I was able to get a significant performance improvement by using a function based on StreamReader IO instead. Can you take a look at this commit - https://github.com/exp0se/Kansa/commit/13b9abaadaf69f6d37c65290ec715879d8ebc1d7 I messed up my commit and a whole bunch of other stuff also got committed - you only need to look at the hashing-related changes. Would you consider replacing the Kansa hashing function? I could send you a pull request later if you agree.
I think it's a simple one-line change, if I'm understanding this right:
#$fileData = [System.IO.File]::ReadAllBytes($FileName)
$fileData = ([IO.StreamReader]$FileName).BaseStream
I did some research here: http://learn-powershell.net/2013/03/25/use-powershell-to-calculate-the-hash-of-a-file/
On my Windows 7 host, I've been testing a portion of the ProcsNModules script that hashes the DLLs associated with each process using SHA256. My first test, without IO.StreamReader:
Days : 0
Hours : 0
Minutes : 0
Seconds : 13
Milliseconds : 308
Ticks : 133089160
TotalDays : 0.00015403837962963
TotalHours : 0.00369692111111111
TotalMinutes : 0.221815266666667
TotalSeconds : 13.308916
TotalMilliseconds : 13308.916
and again with IO.StreamReader:
Days : 0
Hours : 0
Minutes : 0
Seconds : 12
Milliseconds : 927
Ticks : 129271589
TotalDays : 0.000149619894675926
TotalHours : 0.00359087747222222
TotalMinutes : 0.215452648333333
TotalSeconds : 12.9271589
TotalMilliseconds : 12927.1589
That's not much of an improvement on this host, but perhaps the gain is bigger in a larger environment? @exp0se is that the function you were referring to?
What does the performance difference look like across a bunch of tests? Those times are so close, it could be due to other activity on the host.
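One way to answer that is to repeat each measurement several times and compare the spread rather than single runs. A rough sketch follows; `Measure-Average` is a hypothetical helper name and is not part of Kansa.

```powershell
# Hypothetical helper (not part of Kansa): run a script block several times
# and report the minimum, average and maximum wall-clock time in milliseconds.
function Measure-Average {
    param(
        [Parameter(Mandatory)] [scriptblock] $ScriptBlock,
        [int] $Runs = 10
    )
    # Collect one TotalMilliseconds value per run.
    $times = foreach ($i in 1..$Runs) {
        (Measure-Command -Expression $ScriptBlock).TotalMilliseconds
    }
    $stats = $times | Measure-Object -Average -Minimum -Maximum
    [pscustomobject]@{
        Runs      = $Runs
        MinimumMs = $stats.Minimum
        AverageMs = $stats.Average
        MaximumMs = $stats.Maximum
    }
}
```

Calling it once per variant, for example `Measure-Average -Runs 10 -ScriptBlock { <hashing code here> }`, gives a range for each. If the two ranges overlap heavily, the difference is probably noise from other activity on the host.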
True, I was cross-eyed late last night when looking at it. I can try to run a few modules that do hashing and see. I don't have access to a larger environment to test this on at the moment.
I did a little more research on MSDN and our favorite search engine. Some of the discussion I found says the real difference between the two is how they handle files. Both classes are in the System.IO namespace. I'll sum up what I've found; you can probably verify with some guys in Redmond which is better. IO.File deals with arrays of bytes, and can write text to files from pre-allocated buffers or arrays of strings. This implies a huge memory hit when the file being read is large (gigabytes). ReadAllBytes specifically reads the entire file into an array, then closes the file. I also found discussion claiming that the File class methods are wrappers around StreamReader/StreamWriter methods.
StreamReader/StreamWriter can read and write strings and bytes, and you get around the memory hit by reading a line at a time (which takes multiple calls for larger files). What I can't figure out is whether the code is doing essentially the same thing, allocating the entire file as a stream (based on the way the StreamReader is constructed), because the underlying base stream is still being accessed. If so, and the file is small, it may not really matter. Maybe it's a case of "six of one, half a dozen of the other"?
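On the open question above: `HashAlgorithm.ComputeHash` has an overload that takes a `Stream`, and as far as I know it reads the stream through a small internal buffer rather than loading the file as one array. A minimal sketch of a stream-based helper is below. `Get-StreamHash` is a hypothetical name, not an existing Kansa function, and opening the file with `[System.IO.File]::Open` instead of casting to `[IO.StreamReader]` is a choice made here for explicit sharing flags and cleanup.

```powershell
# Sketch of a streaming hash helper. Get-StreamHash is a hypothetical name,
# not a function that exists in Kansa.
function Get-StreamHash {
    param(
        [Parameter(Mandatory)] [string] $FileName,
        [ValidateSet('MD5', 'SHA1', 'SHA256')] [string] $HashType = 'SHA256'
    )
    # .NET resolves relative paths against the process directory, not the
    # PowerShell location, so resolve to a full path first.
    $fullPath = (Resolve-Path -LiteralPath $FileName).ProviderPath
    $hasher = switch ($HashType) {
        'MD5'    { [System.Security.Cryptography.MD5]::Create() }
        'SHA1'   { [System.Security.Cryptography.SHA1]::Create() }
        'SHA256' { [System.Security.Cryptography.SHA256]::Create() }
    }
    # Open read-only and allow other readers and writers on the file.
    $stream = [System.IO.File]::Open($fullPath, 'Open', 'Read', 'ReadWrite')
    try {
        # ComputeHash(Stream) consumes the stream; no whole-file byte array
        # is created by this function.
        $bytes = $hasher.ComputeHash($stream)
        [System.BitConverter]::ToString($bytes) -replace '-', ''
    }
    finally {
        # Release the file handle and the hash object even if hashing fails.
        $stream.Dispose()
        $hasher.Dispose()
    }
}
```

The `try`/`finally` matters more than it looks: the one-line `([IO.StreamReader]$FileName).BaseStream` form never closes the reader explicitly, so the handle stays open until the object is garbage collected.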
Sorry for the late reply - I was busy at work. Here is what I tested: the Kansa Get-FileHashes module
PS C:\Users\exp0se\Downloads\Kansa-master> Measure-Command {.\Modules\Disk\Get-FileHashes.ps1 MD5 C:\Windows}
Days : 0
Hours : 0
Minutes : 42
Seconds : 52
Milliseconds : 561
Ticks : 25725613939
TotalDays : 0,0297750161331019
TotalHours : 0,714600387194444
TotalMinutes : 42,8760232316667
TotalSeconds : 2572,5613939
TotalMilliseconds : 2572561,3939
I suspect the problem here is Workflows rather than the hashing function. It also consumes tons of resources (a few gigs of RAM and a lot of CPU as well), making it impossible to run during working hours.
Here is my alternative function from https://github.com/exp0se/Kansa/commit/e0780120756a4acf376b586e9cab31f13edc5809
PS C:\Users\exp0se\Downloads\Kansa-master> Measure-Command {.\Modules\Disk\Get-FileHash.ps1 C:\Windows MD5}
Get-ChildItem : Access to the path 'C:\Windows\CSC\v2.0.6' is denied.
At C:\Users\exp0se\Downloads\Kansa-master\Modules\Disk\Get-FileHash.ps1:155 char:1
+ Get-ChildItem -Path $BasePath -Recurse |Where-Object {$_.Name -match $extRegex}| ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : PermissionDenied: (C:\Windows\CSC\v2.0.6:String) [Get-ChildItem], UnauthorizedAccessException
+ FullyQualifiedErrorId : DirUnauthorizedAccessError,Microsoft.PowerShell.Commands.GetChildItemCommand
WARNING: Cannot calculate hash for directory: C:\Windows\Panther\setup.exe
Get-ChildItem : Access to the path 'C:\Windows\System32\LogFiles\WMI\RtBackup' is denied.
At C:\Users\exp0se\Downloads\Kansa-master\Modules\Disk\Get-FileHash.ps1:155 char:1
+ Get-ChildItem -Path $BasePath -Recurse |Where-Object {$_.Name -match $extRegex}| ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : PermissionDenied: (C:\Windows\Syst...es\WMI\RtBackup:String) [Get-ChildItem], UnauthorizedAccessException
+ FullyQualifiedErrorId : DirUnauthorizedAccessError,Microsoft.PowerShell.Commands.GetChildItemCommand
Days : 0
Hours : 0
Minutes : 3
Seconds : 24
Milliseconds : 590
Ticks : 2045902634
TotalDays : 0,00236794286342593
TotalHours : 0,0568306287222222
TotalMinutes : 3,40983772333333
TotalSeconds : 204,5902634
TotalMilliseconds : 204590,2634
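The access-denied errors and the "Cannot calculate hash for directory" warning in that run come from the file enumeration step, not from hashing. A possible way to quiet them is sketched below. `Get-ReadableFiles` and its parameters are hypothetical and not taken from the linked commit; `-File` requires PowerShell 3 or later.

```powershell
# Hypothetical helper (not part of Kansa): list files under a path, skipping
# directories that cannot be read instead of writing an error for each one.
function Get-ReadableFiles {
    param(
        [Parameter(Mandatory)] [string] $BasePath,
        [string] $ExtRegex = '.*'
    )
    $skipped = @()
    # -File drops directories, including ones whose names look like files.
    # -ErrorVariable collects the access-denied errors for a later summary.
    Get-ChildItem -Path $BasePath -Recurse -File -ErrorAction SilentlyContinue -ErrorVariable skipped |
        Where-Object { $_.Name -match $ExtRegex }
    if ($skipped.Count -gt 0) {
        Write-Warning ("Skipped {0} unreadable path(s) under {1}" -f $skipped.Count, $BasePath)
    }
}
```

Keeping the skipped paths in a variable means they can still be logged, which matters for incident response: a directory that could not be read is a gap in coverage, not something to discard silently.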
Some hashing function benchmarks. The function built into PowerShell 4:
PS C:\Users\exp0se\Downloads\SysinternalsSuite> Measure-Command { ls | Get-FileHash -Algorithm MD5 }
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 144
Ticks : 1440400
TotalDays : 1,66712962962963E-06
TotalHours : 4,00111111111111E-05
TotalMinutes : 0,00240066666666667
TotalSeconds : 0,14404
TotalMilliseconds : 144,04
Function from my code:
PS C:\Users\exp0se\Downloads\SysinternalsSuite> Measure-Command { ls | Get-FileHashCustom -Algorithm MD5 }
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 205
Ticks : 2053184
TotalDays : 2,37637037037037E-06
TotalHours : 5,70328888888889E-05
TotalMinutes : 0,00342197333333333
TotalSeconds : 0,2053184
TotalMilliseconds : 205,3184
The Get-Hashes function from the Kansa module Disk\Get-FileHashes:
PS C:\Users\exp0se\Downloads\SysinternalsSuite> Measure-Command { Get-Hashes -BasePath . -HashType MD5 }
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 147
Ticks : 1473163
TotalDays : 1,70504976851852E-06
TotalHours : 4,09211944444444E-05
TotalMinutes : 0,00245527166666667
TotalSeconds : 0,1473163
TotalMilliseconds : 147,3163
Turns out my function is even slower. Well, I guess we need to rename the issue to "fix the Get-FileHashes module" rather than "replace hashing function", since I initially tested with the full Get-FileHashes module. Still, I would prefer ReadAllBytes to be replaced with IO.StreamReader everywhere, since reading everything into memory is not good practice.
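If the swap is made, it is worth confirming that both approaches give the same digest for the same file, so existing baselines stay comparable. A small sketch of that check follows; `Test-HashEquivalence` is a hypothetical name, and SHA256 is picked only as an example.

```powershell
# Sketch: confirm the in-memory and streaming approaches agree on one file.
# Test-HashEquivalence is a hypothetical name, not part of Kansa.
function Test-HashEquivalence {
    param([Parameter(Mandatory)] [string] $FileName)
    $fullPath = (Resolve-Path -LiteralPath $FileName).ProviderPath
    $sha = [System.Security.Cryptography.SHA256]::Create()
    try {
        # Approach 1: read the whole file into a byte array, then hash it.
        $fromBytes = $sha.ComputeHash([System.IO.File]::ReadAllBytes($fullPath))
        # Approach 2: hash directly from a read-only stream.
        $stream = [System.IO.File]::OpenRead($fullPath)
        try { $fromStream = $sha.ComputeHash($stream) }
        finally { $stream.Dispose() }
        # Compare the two digests as hex strings; returns $true or $false.
        [System.BitConverter]::ToString($fromBytes) -eq [System.BitConverter]::ToString($fromStream)
    }
    finally { $sha.Dispose() }
}
```

Both paths feed the same bytes to the same algorithm, so the result should be `$true` for any readable file. The difference between them is memory use and handle lifetime, not the hash value.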