
Is fuzzing / error correction a possibility to decrease write times?

Open · rm-you opened this issue 9 years ago · 7 comments

I want to start by saying that I am not a mathematician (in fact I am very bad with mathematics), so this may be really dumb:

It occurs to me that if it were possible to "fuzz" the matching a little bit and then store some error-correction data along with the size and index, that might dramatically reduce the time needed to calculate the initial position for writes. For example: find the first occurrence that is really close to the file's data (maybe a 97% match or so), then calculate the binary diff/patch needed to "fix" the data after retrieval. There are a lot of ways to do checksums and data recovery (usually for combating corruption), and here we have the advantage of knowing ahead of time exactly which bits will be "corrupt"!

It is possible this would take just as long as (or longer than) finding an exact match, and I am not sure how much the stored error-correction data would hurt the compression ratio, but I was curious whether it would be feasible.
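To make the idea concrete, here's a rough sketch in Python (not the real pifs code). A random byte string stands in for pi's digit stream, and the function names, the error threshold, and the raw XOR patch format are all placeholders for illustration:

```python
import os

def find_near_match(data: bytes, stream: bytes, max_errors: int):
    """Scan `stream` for the window closest to `data`, stopping early
    once a window differs in at most `max_errors` bytes."""
    n = len(data)
    best_offset, best_errors = 0, n + 1
    for offset in range(len(stream) - n + 1):
        window = stream[offset:offset + n]
        errors = sum(a != b for a, b in zip(data, window))
        if errors < best_errors:
            best_offset, best_errors = offset, errors
            if errors <= max_errors:
                break
    return best_offset, best_errors

def encode(data: bytes, stream: bytes, max_errors: int):
    """Store (offset, length, patch); the patch is the XOR of the file with
    the near-match, so we know exactly which bytes are "corrupt"."""
    offset, _ = find_near_match(data, stream, max_errors)
    window = stream[offset:offset + len(data)]
    patch = bytes(a ^ b for a, b in zip(data, window))
    return offset, len(data), patch

def decode(offset: int, length: int, patch: bytes, stream: bytes) -> bytes:
    window = stream[offset:offset + length]
    return bytes(a ^ b for a, b in zip(window, patch))

# Toy demo: a fake "pi" stream and a tiny file.
stream = os.urandom(1 << 16)
data = os.urandom(8)
offset, length, patch = encode(data, stream, max_errors=2)
assert decode(offset, length, patch, stream) == data
```

Of course a raw XOR patch is as long as the file itself; you'd only store the positions and values of the mismatched bytes, which is where the 97%-or-so threshold matters.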

Any thoughts from people who aren't horribly lost in the world of mathematics?

rm-you avatar Jun 05 '15 03:06 rm-you

I like it. Make the file descriptor a list of 'inodes' and I think you have something there. Reduces the cost of appending to the end of the file.

jdmarshall avatar Jun 09 '15 17:06 jdmarshall

It seems like this would just be a tradeoff between compression speed and compression ratio. Since this is a novelty program, I personally think it would have more use if the ratio were higher; at least then there would be SOME practical use for it, say if transfer speed between computers were a heavy bottleneck (floppy drives). If write speed is the bottleneck, there are much, much better filesystems and compression algorithms that satisfy that niche, but a heavy compression algorithm has many more use cases imo.

typecasto avatar Mar 19 '19 15:03 typecasto

Oh gosh, I completely forgot about this project.

I think the difference of opinion comes down to whether you're thinking about files at rest or files in motion. Encoding a file in pifs could take an arbitrarily long time if you represent the file as a single offset, and during that time you can report no progress to an interested party.

If you made the file contents into a series of offsets (leaving aside Hamming distance and correction codes for a moment), then you could stream the encoding at creation time. This could mean writing a new file to disk (moving from memory to disk) or lazily encoding a file for transport to another system. Additionally, as I mentioned before, appending to a file becomes cheap because the common prefix shared by the old and new file can be represented by the same records.
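A sketch of what I mean, with a random byte string standing in for pi and exact (not fuzzy) matching of tiny chunks so the demo actually terminates; everything here is illustrative, not the real pifs layout:

```python
import os
from typing import Iterable, Iterator, Tuple

CHUNK = 2  # tiny chunks so exact matches are easy to find in a small stand-in stream

def offset_of(chunk: bytes, stream: bytes) -> int:
    """Exact-match lookup; the real thing would walk the digits of pi."""
    offset = stream.find(chunk)
    if offset < 0:
        raise ValueError("chunk not found in this finite stand-in stream")
    return offset

def encode_stream(data: bytes, stream: bytes) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) records one chunk at a time, so the caller can
    report progress or start shipping records to another system immediately."""
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        yield offset_of(chunk, stream), len(chunk)

def decode_stream(records: Iterable[Tuple[int, int]], stream: bytes) -> bytes:
    return b"".join(stream[off:off + n] for off, n in records)

stream = os.urandom(1 << 20)  # 1 MiB stand-in for pi's digits
data = b"appending is cheap: a shared prefix means shared records"
records = list(encode_stream(data, stream))
assert decode_stream(records, stream) == data
```

Appending then just adds records for the new chunks; the existing records don't change.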

jdmarshall avatar Mar 20 '19 22:03 jdmarshall

I'd honestly say that pi-based compression makes more sense as a program like 7-Zip rather than as a filesystem, but I'd gladly sacrifice speed for a heavy compression algorithm.

It seems to me (just based on intuition) that as the space required to store the location in pi grows linearly, the amount of data it can represent grows exponentially (this may not be the case, or the break-even point might be far more data than any sensible person could store, or it might take longer than anyone could actually be expected to wait).

While this might have some application, I doubt that write speed would ever be very good, and it might not even be feasible for large amounts of data.

Though error correction COULD be used to vastly decrease storage size, if we find a partial match early in pi and then correct the errors from there. I'd say that if we have an n-byte file and we find a match with more than n/2 bytes correct, we generate error-correction data based on that match and write the location, ECC data, and length to a file; in some cases that could even be shorter than the file itself.
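Roughly the record format I'm picturing, as a Python sketch. The sparse patch layout, the struct widths, and the contrived "good match" (with a random byte string standing in for pi) are all assumptions for illustration, not a real implementation:

```python
import os
import struct

def sparse_patch(data: bytes, window: bytes) -> bytes:
    """Error-correction data: (position, correct byte) pairs for every mismatch."""
    fixes = [(i, b) for i, (a, b) in enumerate(zip(window, data)) if a != b]
    return b"".join(struct.pack(">IB", i, b) for i, b in fixes)

def apply_patch(window: bytes, patch: bytes) -> bytes:
    out = bytearray(window)
    for i in range(0, len(patch), 5):
        pos, val = struct.unpack(">IB", patch[i:i + 5])
        out[pos] = val
    return bytes(out)

def record(offset: int, length: int, patch: bytes) -> bytes:
    """What we'd actually write out: location, length, then the ECC data."""
    return struct.pack(">QI", offset, length) + patch

# Contrived example: pretend a window at some offset already gets
# well over n/2 of the bytes right (only 3 of 64 differ).
stream = os.urandom(1 << 16)
offset, n = 12345, 64
data = bytearray(stream[offset:offset + n])
for pos in (3, 17, 40):
    data[pos] ^= 0xFF
data = bytes(data)

patch = sparse_patch(data, stream[offset:offset + n])
rec = record(offset, n, patch)
assert apply_patch(stream[offset:offset + n], patch) == data
print(len(rec), "bytes stored for a", n, "byte file")  # 12 + 3*5 = 27
```

So the record beats the file whenever a good-enough match exists at a small-enough offset; whether that ever actually happens in pi is the real question.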

I might honestly just write my own version at this point. Thoughts?

typecasto avatar Apr 01 '19 15:04 typecasto

In my experience, when I tried to store a small image, the pi-compressed file was dramatically larger than the file before compression. It could be the luck of the draw, but it could also be that the index value has to be massive to represent the data.

mmhobi7 avatar Apr 01 '19 15:04 mmhobi7

That's why the error-correction data would be added: we could settle for a match much earlier in pi that still covers a majority of the file, and store less data overall.

typecasto avatar Apr 01 '19 16:04 typecasto

Pigeonhole principle. If you want a library with an infinite number of books (which is what you're building here), then the catalog is also infinite, and the catalog number on a book will on average be the same size as the contents of the book: in some cases shorter, in others longer. Compression just exploits the fact that some sequences of bytes are gibberish and not useful to human beings, so gibberish gets longer numbers and comprehensible data shorter ones. The Library of Congress has standards that don't include admitting books full of gibberish, so they only have to index books that make some sense. Finnegans Wake notwithstanding.

But pi's digits are, as far as anyone can tell, statistically random. It's gibberish by definition of the exercise: you're trying to find one sensible sequence of bytes in a sea of random noise. Fuzzing makes a smaller file only if you can approximate the output with a simpler pattern, i.e. predict what a more boring version of the file would look like. There is no simpler pattern here. The file is always going to take as many bytes as it takes.
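A quick back-of-the-envelope to see the pigeonhole effect, using random bytes as a stand-in for pi's digits (the stream length and trial count are arbitrary): the first occurrence of an n-byte pattern tends to sit near offset 256^n, so writing down the offset costs about as many bytes as the data it replaces.

```python
import os
import statistics

def first_offset(target: bytes, stream_len: int) -> int:
    """Offset of the first occurrence of `target` in a fresh random byte
    stream (our stand-in for pi), or stream_len if it never appears."""
    stream = os.urandom(stream_len)
    hit = stream.find(target)
    return hit if hit >= 0 else stream_len

for n in (1, 2):
    trials = [first_offset(os.urandom(n), stream_len=1 << 20) for _ in range(50)]
    print(f"{n}-byte target: mean first offset ~{statistics.mean(trials):,.0f} "
          f"(256**{n} = {256 ** n:,}), i.e. ~{n} byte(s) just to store the offset")
```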

The counterproposal I made doesn't improve space either. In fact it's a little worse. I'm constrained by the same laws of Information Theory as the rest of you. But it would improve time, and quite substantially.

But as we're essentially talking about BogoSort for files, there's not enough beer in this conversation, I'm afraid :)

jdmarshall avatar Apr 16 '19 19:04 jdmarshall