Possibly adding mtime or more metadata

Open lispstudent opened this issue 1 year ago • 9 comments

Hello,

Hashit is one of the fastest checksumming tools. I would very much like to use it for about 200TB of data spread over several servers, as it would offer significant time savings, especially on hard disks.

I would also need to collect a bit more metadata, at least mtime. To do that at the moment, I need to run another tool recursively.

Would you consider adding such an option, perhaps including even more metadata?

A possible example that comes to mind is using printf syntax, like rhash format options.

Thank you for considering this.

lispstudent avatar Jul 21 '24 08:07 lispstudent

Sure, more than happy to add this. What do you mean by mtime though? Could you show me some examples? If so I can probably add it very quickly.

boyter avatar Jul 21 '24 23:07 boyter

Thank you!

This page helped me clarify mtime vs ctime aspects in Unix/Linux systems.

md5deep has a -t option for "creation time", which is "changed time" on Unix, but mtime would be more useful, as we are tracking hashes of content, not metadata. Having both would be even better.

I find rhash's printf format switches more useful, as one can better compose the output. But even a switch like md5deep's would be excellent.

In the former case, as an uninformed example, the command hashit --format printf '%s,%{md5},%f,%d,%{mtime}\n' test.txt would return something like:

61537206,b2d3c0e03cd0c0e56e60c7a395d6f4cd,test.txt,/Users/lispstudent/,2019-11-25 22:02:37

lispstudent avatar Jul 22 '24 05:07 lispstudent

Ah, gotcha. Yeah, seems reasonable. I'll have a look at implementing this over the next few days then. I don't think it should be too hard, but then again I have been wrong before.

boyter avatar Jul 22 '24 06:07 boyter

Let me know if I can help with testing. I am using macOS, and I could try to compile on some FreeBSD servers too.

lispstudent avatar Jul 22 '24 06:07 lispstudent

Will do. I am able to test on macOS and Linux easily, and Windows with a bit of effort.

boyter avatar Jul 22 '24 07:07 boyter

https://github.com/djherbis/times

This looks like it might solve the issue

boyter avatar Jul 22 '24 07:07 boyter

Working on this in a branch, with the following output:

$ hashit -f hashdeep --mtime processor 
%%%% HASHDEEP-1.0
%%%% size,md5,sha256,filename,mtime
## Invoked from: /Users/boyter/Documents/projects/hashit
## $ hashit -f hashdeep --mtime processor
##
2840,ce29ce9a95713628e1d8e43a51027ac1,7dcc785a34ce95c4e741e92177f221e6d05d9c1663481f35c54286fc6645934f,processor/workers_test.go,2021-11-15 09:03:56
412,fea2253b11e12a134efebee40c7ca544,d6d0410c9fd662f08ccf2586661ff3c9623c68d209dec680ac8553ae5ebcf899,processor/file.go,2021-11-15 09:18:30
406,b8db244d45fa9eb0f1d510b107c6cf03,f432b5f092b7082cd5c2f01cba61d093e57e46b72678f1cfc7eb1b17ff30e2f6,processor/structs.go,2024-07-26 12:04:56
8179,61feb67b40e75ffd0279478ac288d697,e3ee2622a4b7747969514ce7f2e7cb4d01bed1797904793bb23d34e685a2ef90,processor/formatters.go,2024-07-26 12:12:20
4841,22fa73c10faca3bf40e576a5699cc479,bbfaa989480ca68ffc134ca396f900cd6b21e7bd4f5652d03c0692b49e2d84fb,processor/processor.go,2024-07-26 12:13:06
20906,7febbe6c7b0fb1d1e1f5a98ab0199a0f,a194f7f7719bfa06c3a1fd084f1cc62d04e5f8879aa00af076366cb991b339f7,processor/workers.go,2024-07-26 12:06:27

Also added to JSON and such

$ hashit -f json --mtime main.go | jq
[
  {
    "File": "main.go",
    "MD4": "",
    "MD5": "80ce62ad784fdcacaee9b6ff30fd5f3e",
    "SHA1": "9378850dc3f9833f3a0462643485d96a86fca348",
    "SHA256": "dd9ec0cabad718fd1bb248ed0f77351a072c57a957ca7aa3c72bf23ce29816ea",
    "SHA512": "05624869306a86f2820c6ad4d6b2f47a3de24221b980240a20e750c6fc3172bd834b4ec9cf66fdaa95460aef035bbf3c4dedada7a65ee6ae7b5a700656e55ce7",
    "Blake2b256": "",
    "Blake2b512": "",
    "Blake3": "",
    "Sha3224": "",
    "Sha3256": "",
    "Sha3384": "",
    "Sha3512": "",
    "Bytes": 2147,
    "MTime": "2024-07-26T12:03:41.782624482+10:00"
  }
]

@lispstudent is this the sort of thing you were thinking about?

boyter avatar Jul 26 '24 02:07 boyter

Note that I have not done a printf-style format yet... because I am lazy, but it's something I think I might add via a new format type, since it would not be compatible with the hashdeep format as I understand it.

boyter avatar Jul 26 '24 02:07 boyter

That is excellent. As long as the --mtime switch is there, I am fine without the printf format.

If I run sha1deep with -t, this is the output I get:

sha1deep -zt test.txt
   2291164  6dea7e33c15032c0db470e6d5efb9da2342d5c1b 2024:02:29:09:05:25  /Users/lispstudent/test.txt

Thank you!

lispstudent avatar Jul 26 '24 04:07 lispstudent

Latest release should have this for you.

boyter avatar Aug 12 '24 22:08 boyter

Thank you for the release!

Testing on macOS on an ARM CPU, it seems I can see the mtime value only with --format json.

e.g.

hashit --hash md5 --mtime --format json a.csv

[{"File":"a.csv","MD4":"31d6cfe0d16ae931b73c59d7e0c089c0","MD5":"08e30724d71d40b07e1b412ec30cda37","SHA1":"da39a3ee5e6b4b0d3255bfef95601890afd80709","SHA256":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","SHA512":"cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e","Blake2b256":"0e5751c026e543b2e8ab2eb06099daa1d1e5df47778f7787faab45cdf12fe3a8","Blake2b512":"786a02f742015903c6c6fd852552d272912f4740e15847618a86e217f71f5419d25e1031afee585313896444934eb04b903a685b1448b755d56f701afe9be2ce","Blake3":"af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9adc112b7cc9a93cae41f3262","Sha3224":"6b4e03423667dbb73b6e15454f0eb1abd4597f9a1b078e3f5b5a6bc7","Sha3256":"a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a","Sha3384":"0c63a75b845e4f7d01107d852e4c2485c51a50aaaa94fc61995e71bbee983a2ac3713831264adb47fb6bd1e058d5f004","Sha3512":"a69f73cca23a9ac5c8b567dc185a756e97c982164fe25859e0d1dcc1475c80a615b2123af1f5f94c11e3e9402c3ac558f500199d95b6d3e301758586281dcd26","Bytes":1258715,"MTime":"2024-02-13T13:05:11.3714732+01:00"}]%

With --format text, however, mtime is missing:

hashit --hash md5 --mtime --format text a.csv
a.csv (1258715 bytes)
        MD5 08e30724d71d40b07e1b412ec30cda37

Also, --format json outputs all checksums, not just the one requested. So it seems hashit computes all checksums regardless of the command-line switch.

lispstudent avatar Aug 14 '24 04:08 lispstudent

I am so sorry @lispstudent, I totally missed your response. I think I somehow screwed up a merge or something and never noticed. Mtime is now in all the outputs, as you would expect, for all format types. The one exception is standard input processing, where there is no file to calculate the time from, in which case you get this error:

$ echo "hello" | hashit --mtime
ERROR 2025-12-01T07:24:23Z: cannot use --mtime option with standard input, ignoring flag
stdin (6 bytes)
        MD5 b1946ac92492d2347c6235b4d2611184
       SHA1 f572d396fae9206628714fb2ce00f72e94f2258f
     SHA256 5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03
     SHA512 e7c22b994c59d9cf2b48e549b1e24666636045930d3da7c1acb299d1c3b7f931f94aae41edda2c2b207a36e10f8bcb8d45223e54878f5b316e7ce3b6bc019629

Thanks @clach04 for pointing this out.

boyter avatar Dec 01 '25 07:12 boyter

Thank you for fixing this.

I am still waiting for upstream to help us solve #15, then I will test this on our SmartOS servers. Much looking forward to it.

lispstudent avatar Dec 01 '25 07:12 lispstudent

If I ever forget again due to closing, PLEASE ping me. I am not bothered by this, and I'd rather fix the issue than have it hidden.

boyter avatar Dec 01 '25 07:12 boyter

If the SQLite thing isn't resolved quickly, I may remove the reliance on it being built in, and instead switch to using stdout redirects to run the SQL.

I wanted to avoid the issues with escaping, and thought having it built in would be useful and would allow the audit to use SQLite.

I think it would be better if it could be fixed upstream, but if not let me know.

boyter avatar Dec 01 '25 07:12 boyter

I am not sure if it is relevant, but there was a merge request last week.

Regarding stdout redirects, please forgive my ignorance: would that mean launching a subshell each time? Why was using native SQLite with a native Go driver not deemed satisfactory?

lispstudent avatar Dec 01 '25 07:12 lispstudent

Yes, it would mean doing something like the below

hashit --format sql DIRECTORY | sqlite3 hashit.db

to populate the database. It means having sqlite installed, however.

The reason for not using the mattn sqlite driver is that it means you have a CGO dependency, which makes cross compiling a bit of a pain. While the modernc Go port is slightly slower, it makes building trivial without needing to set up the C toolchains. It's why I use it in every project where I want SQLite these days with Go integration. Just so much easier to work with.

boyter avatar Dec 01 '25 21:12 boyter
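
To illustrate the escaping concern with the pipe approach, here is a hedged Go sketch that emits SQL text suitable for feeding to the sqlite3 command-line tool. It does not reproduce hashit's actual --format sql output, which is not shown in this thread: the file_hashes table, its columns and the helper names sqlQuote and insertStatement are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// sqlQuote renders s as a SQL string literal by wrapping it in single
// quotes and doubling any embedded single quote.
func sqlQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

// insertStatement builds one INSERT for a hypothetical file_hashes table.
func insertStatement(path, md5 string, size int64, mtime string) string {
	return fmt.Sprintf(
		"INSERT INTO file_hashes (path, md5, size, mtime) VALUES (%s, %s, %d, %s);",
		sqlQuote(path), sqlQuote(md5), size, sqlQuote(mtime),
	)
}

func main() {
	// Hypothetical schema; the real tool may use a different one.
	fmt.Println("CREATE TABLE IF NOT EXISTS file_hashes (path TEXT, md5 TEXT, size INTEGER, mtime TEXT);")
	// A file name containing a quote is the case that breaks naive output.
	// The digest is the MD5 of empty input, matching size 0.
	fmt.Println(insertStatement("it's.txt", "d41d8cd98f00b204e9800998ecf8427e", 0, "2024-02-13 13:05:11"))
}
```

The output of such a program could be piped into sqlite3 in the same way as the command above. A driver with bound parameters avoids the quoting question entirely, which is one argument for keeping SQLite built in.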

Understood, thank you.

lispstudent avatar Dec 03 '25 10:12 lispstudent