DISCoHAsH
DISCoHAsH copied to clipboard
Incosistent buffer reading
Hello,
When I do discohash /dev/random discohash doesn't read /dev/random data as a non-stoppable data stream, rather it treats the file as empty. This behavior can be verified by hashing an empty file or /dev/null. Generally, this behavior is certainly a bug, at least when I compare this with sha utils.
Thanks @rilysh for this report. Agree that sounds like a bug. I’ll take a look. Is there anything else I should be aware of?
Also do you perchance know why this occurs and how to fix it?
Also do you perchance know why this occurs and how to fix it?
It happens because /dev/random isn't a normal file on disk, more generally it doesn't even exist on disk, thus when you do tellg() on that file, it returns the file offset 0. More importantly, tellg() doesn't tell the file size rather it returns the offset. I'm not a big fan of C++ I/O API, because of how messy and implicit they are.
Since read() doesn't do any blocking, one thing that can be done is do a blocking by peeking at the stream to check whether it ends or not. For example,
while (file.peek() != EOF) {
if (!file.read(reinterpret_cast<char *>(buffer.data()), size)) {
throw std::runtime_error("Failed to read the file: " + filename);
}
}
I'm not a big fan of this solution as it doesn't constantly read the buffer stream but it's still not that bad.
Another thing that can be done is to use filesystem APIs
For example,
#include <iostream>
#include <filesystem>
int main()
{
try {
auto a = std::filesystem::file_size("test.txt");
std::cout << "Size: " << a << " bytes\n";
} catch (std::filesystem::filesystem_error& e) {
std::cerr << e.what() << '\n';
}
}
However, file_size can't read pseudo devices. Reading /dev/random or any other pseudo file will result in a panic. One is just slightly better than the other solution.
I forgot to mention that the filesystem has some platform-dependent functions. Some operating systems may or may not provide the functionality, and depends on the implementation. But the majority of operating systems should work just fine!
Thank you so much @rilysh I'm learning a lot from your comments! :)
I had a go at reproducing it and I'm not sure I'm doing this right! 😂
Regarding the above issue, I tested just now and:
$ make && cd bin && ls
discosum discotool
$ ./discosum /dev/null
296480cd95fd7f5c18e422272c206a4b87c8d19421637ca8b59765b33332b532 /dev/null
$ ./discosum /dev/null
296480cd95fd7f5c18e422272c206a4b87c8d19421637ca8b59765b33332b532 /dev/null
$ ./discotool /dev/null
296480cd95fd7f5c18e422272c206a4b87c8d19421637ca8b59765b33332b532 /dev/null
$ touch xyz; ./discotool xyz
296480cd95fd7f5c18e422272c206a4b87c8d19421637ca8b59765b33332b532 xyz
$ touch xyz; ./discotool xyz
296480cd95fd7f5c18e422272c206a4b87c8d19421637ca8b59765b33332b532 xyz
$ ./discotool xyz
296480cd95fd7f5c18e422272c206a4b87c8d19421637ca8b59765b33332b532 xyz
$ ./discosum < xyz
296480cd95fd7f5c18e422272c206a4b87c8d19421637ca8b59765b33332b532
$ ./discotool < xyz
$ cat /dev/urandom | head -n 10000 | ./discotool
48cce2a30e1d0a41b628dd100dd2053f0b59c95c095c92f3c299365bc9234c8d
so I'm afraid I'm a little confused, as it does seem to operate in a way which works. I don't think I understand quite what you mean, although I get it, I think your meaning is (to paraphrase to check my understanding):
If reading /dev/random as a 'file' not a redirect, then it will treat it as empty, which is incorrect
I agree it's incorrect from the view that it should do this, but whether it should or not, I don't know. Also, I'm sorry but I'm not sure how to test it, and I'd like to if we're going to fix it.
Finally, I don't know how we'd actually make it part of the CLI. If you have ideas on how that would work, I'd really love for you to share it with me because I think it would be very useful to know!
I'm ok to not put it in, but I think if we can make it useful then it would be good to include.
Also, @rilysh, more generally, ultimately I would like to have the code and repo of the same quality as https://github.com/dosyago/rain (although I'm not sure you'd call that "quality": I think there is much that can be improved! but I don't know)
I tested:
$ rainsum /dev/null
b735f3165b474cf1a824a63ba18c7d087353e778b6d38bd1c26f7b027c6980d9 /dev/null
$ touch xyz; rainsum xyz
b735f3165b474cf1a824a63ba18c7d087353e778b6d38bd1c26f7b027c6980d9 xyz
$ cat /dev/random | head -n 10000 | rainsum
f0906f1654ef10dfb0ab67e6f4dfb2e52432d6f9f279c7c2cde54fabb81c4e68 stdin
and it looks like it handles it well over there.
Looking at the code I see that if no input is provided we:
std::istream& getInputStream() {
#ifdef _WIN32
// On Windows, use std::cin
return std::cin;
#else
// On Unix-based systems, use std::ifstream with /dev/stdin
static std::ifstream in("/dev/stdin");
return in;
#endif
}
or:
if (!inpath.empty()) {
infile.open(inpath, std::ios::binary);
if (infile.fail()) {
throw std::runtime_error("Cannot open file for reading: " + inpath);
}
// ...
Yet I did not know how to test the unlimited streaming from /dev/random case, at least not easily from the command line. I thought if I'd had an SIGINT handler in there to trigger a hash output that might be nice, but I don't know.
I don't grasp your idea yet I think! :)
@00000o1 I think you take a quite complicated route lol. The first answer using peek() is actually here way to go. I could've used the filesystem APIs but they seem to not support any kind of block device.
so I'm afraid I'm a little confused, as it does seem to operate in a way which works. I don't think I understand quite what you mean, although I get it
I actually mean that file data should be treated as a stream of data and until it doesn't stop there's no reason to stop ourselves either. The thing is that discohash treats /dev/random as empty (it should treat that file as I mentioned) which is why it can be confusing.
Using peek() is absolutely okay, as it just has to check whether there's more data or not. The runtime penalty is fairly low here.
Yet I did not know how to test the unlimited streaming from /dev/random case, at least not easily from the command line. I thought if I'd had an SIGINT handler in there to trigger a hash output that might be nice, but I don't know.
We don't need to test if there's any buffer exists or not. We can (as I mentioned above) peek at this and see if the file reaches the end of the line. Since /dev/random guarantees that random bits will be generated, we practically never see the end of it.
I thought if I'd had an SIGINT handler in there to trigger a hash output that might be nice, but I don't know.
It's quite a nice idea.
Finally, I don't know how we'd actually make it part of the CLI. If you have ideas on how that would work, I'd really love for you to share them with me because I think it would be very useful to know!
Right now, this issue is a very slight (low-priority) "issue". There are some other improvements that can be made with discohash. This will require some rewrite. Since it's fairly small, I believe it's quite easy to do a fresh rewrite conforming to C++17 (currently it doesn't use all C++17 features).
I think you take a quite complicated route lol.
@rilysh I know it seems like that 🤣
I trying to understand. I am not a C expert but want to learn more.
Right now, this issue is a very slight (low-priority) "issue". There are some other improvements that can be made with discohash. This will require some rewrite. Since it's fairly small, I believe it's quite easy to do a fresh rewrite conforming to C++17 (currently it doesn't use all C++17 features).
I'm happy to accept your rewrite if this is your intention, and learn from it, or points on how if it's not!
Thank you again for your contribution so far!
Thanks! I'll do a quick cleanup. I actually wrote a little wrong thereby "rewrite", so instead I'll clean up some bits here and there, and then you can take a look at whether it fits the need. Not the most efficient way, but still some issues can be addressed.
@rilysh that's fantastic! I'm looking forward to learning more from your work. Thank you very much!
@00000o1 I just noticed that something is fishy with the hashing algorithm. Are you sure the algorithm used here is correct? It just happens that I edit the compiler optimization and as soon as I recompile, it produces a different hash of that same file.
With -O3 enabled:
101fb5f5406e6801733dcbfec9a54f731f69eb0952f4312540d958f9aae98d35 Makefile
With -O2 enabled:
5f9e9f27d4b5f4c6e6fe30717270d0a5a3d47787c0a273b6eda94f7037a52178 Makefile
Removing only -march=native:
8221621400fde17e291779e774ada741e6bf171efefa1effe2cf09c546816405 Makefile
With defaults (removing explicit optimizations):
37029e7845c8d491322a86b04deb87dd2197406cb8de2198e173955e46be3ade Makefile
This is very common with hashing algorithms, that compiler may do X and it may strip away something, so being explicit always becomes mandatory. However, I'm not sure whether the algorithm is correct at all or not.
That's a good catch, @rilysh! It seem a bit too complex to look into right now to determine for sure, but I'm quite sure the algo is correct, although my implementation may be unreliable--it likely is!
The node.js code should be used as reference as it's less likely to suffer from these kinds of irregularities. If you're looking for something that is a little more stable as a ref to similar ideas, then I based rain on a more stable boilerplate, and the algorithm is quite similar.
Although, granted, I have not tested rain similarly! This compiler optimization idea is definitely one I might consider adding to the tests! 🥹
Rain seems to be okay! The hash it produces is correct w/ or w/o optimizations. Yeah, some added tests regarding this would be also nice!
Ok cool @rilysh that’s good news! That’s what hoped 🥹😂 and thought. Thank you for your contributions, I really appreciate it! ☺️