
split instead of rar?

Open sjpotter opened this issue 4 years ago • 24 comments

If one has a big file (say a 40GB ISO), one can create a par2 set that can repair the whole file, split the file (using normal file-system splitting tools), and then use just a single par2 file (assuming no block errors) to reconstruct the file from all the splits.

Now, this can be a big help on IO if one is trying to post the file.

Currently, one essentially has to do three passes over all the data:

  1. rar to split it (and encrypt if so desired)
  2. par to create the pars on the rar splits
  3. read to upload

If, on the other hand, one just needed to split, one could reduce this to:

  1. par to create pars on the big file
  2. read to upload; the splits are generated on the fly during the read

A 33% IO win.

Yes, one doesn't get encryption this way, but that's the trade-off. Some would be willing to make that trade-off, especially with obfuscated names.

One thing I might also add: in the context of splitting, one also has the ability to randomize the order of the splits (e.g. a 40GB file split into 500MB chunks is going to have 80 parts; instead of uploading part 1, 2, 3, ..., it can seek to part 75, upload that, seek to part 33, upload that, and so on and so forth).

thoughts?

sjpotter avatar Feb 17 '20 09:02 sjpotter

Hum... then what tool would be used to reconstruct the file? You kind of need to use some standard tool... When I'm compressing and generating the par2, I'm creating a new list of files to upload. There is nothing yet to distinguish the par2 from potential input files that need special treatment (being cut).

mbruel avatar Feb 17 '20 15:02 mbruel

Try it: take a 1GB file, decide what split size you want (say 100MB, for 10 parts), create par files whose parity blocks align with that chunk size (say 10MB or 100MB parity blocks), then split the file.

par2 r my.par2 *chunks*

par2 will rebuild the original

sjpotter avatar Feb 17 '20 17:02 sjpotter

Let me be a bit clearer. Let's be naive about it, without saving any I/O:

  1. create large iso
  2. create a par set to repair the iso if damage happens to it
  3. split the iso into multiple files (i.e. "damage" it; the equivalent of the rar step)
  4. use the par set to recover the file (par2 is smart and should be able to determine that all the data needed to reconstruct the file is within the set of split files, at least if the split is done correctly, i.e. that each split file corresponds to a complete multiple of par2 blocks; see the sketch below)

In step 3, instead of creating the split files (i.e. writing them out), ngPost can just serve them directly out of the single large file without having to write them to disk.
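
To make the block-alignment requirement in step 4 concrete, here is a minimal sketch (illustrative numbers only, not ngPost code): round the desired split size down to a whole multiple of the par2 block size, so every split boundary falls on a par2 block boundary.

```cpp
#include <cstdint>
#include <iostream>

int main()
{
    // Hypothetical sizes for illustration only.
    const uint64_t fileSize      = 40ULL * 1024 * 1024 * 1024; // 40 GiB iso
    const uint64_t par2BlockSize = 10ULL * 1024 * 1024;        // 10 MiB par2 blocks
    const uint64_t wantedSplit   = 500ULL * 1024 * 1024;       // ~500 MiB parts

    // Round down so each split part holds only complete par2 blocks.
    const uint64_t splitSize = (wantedSplit / par2BlockSize) * par2BlockSize;
    const uint64_t nbParts   = (fileSize + splitSize - 1) / splitSize;

    std::cout << "split size: " << splitSize << " bytes, "
              << nbParts << " parts\n";
}
```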

sjpotter avatar Feb 17 '20 19:02 sjpotter

Hi, yeah I see that, but I believe most people nowadays would rather use RAR so they can protect with a password. Someone asked me to develop the option of file name obfuscation, but that was for the files inside the RAR archives, to avoid using a password. So I agree, ngPost could do the cutting itself and this way avoid:

  • wasting some disk space for the temporary copy made to archive the original file
  • the time taken by the rar process, which is not multithreaded...
  • some I/O

As I said, this sounds quite specific. And it would require redesigning the whole process of handling the input files; well, not all of them, as the par2 would be excluded, just certain ones... It could be done... I won't have time in the next few months. Let's see if other people are interested...

mbruel avatar Feb 17 '20 20:02 mbruel

Plus one on this one. Especially for big files this is a unique and excellent way to do things.

ghost avatar Feb 17 '20 21:02 ghost

I'm not arguing that this should be done as a replacement for rar, just that it could be useful for some, in parallel.

Basically, the way I'd think of it is to create some sort of "FileProvider" interface that has a Read() method.

A normal FileProvider is pretty easy:

fp = new NormalFileProvider(fileName) (probably just a wrapper around the QFile you already use)

but we can also do

fp = new SplitFileProvider(fileName, from, to) (probably also a wrapper, but with a seek and making sure not to read past to)

sjpotter avatar Feb 17 '20 21:02 sjpotter

I.e. in the above case, you'd generate par2 files like normal (so they would be read in with a "NormalFileProvider"), but the big iso would be read in by many SplitFileProvider instances.
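
A minimal sketch of what such an interface could look like, assuming Qt types since that's what ngPost uses (the class and method names are just the ones suggested above, not actual ngPost code):

```cpp
#include <QFile>
#include <QString>
#include <QByteArray>

// Hypothetical interface: hands out bytes regardless of whether they
// come from a whole file or from a window of a larger one.
class FileProvider
{
public:
    virtual ~FileProvider() = default;
    virtual QByteArray read(qint64 maxSize) = 0;
};

// Reads a whole file from start to end (thin wrapper around QFile).
class NormalFileProvider : public FileProvider
{
public:
    explicit NormalFileProvider(const QString &fileName) : _file(fileName)
    {
        _file.open(QIODevice::ReadOnly);
    }
    QByteArray read(qint64 maxSize) override { return _file.read(maxSize); }

private:
    QFile _file;
};

// Reads only the byte range [from, to) of a larger file:
// seek once to `from` and never read past `to`.
class SplitFileProvider : public FileProvider
{
public:
    SplitFileProvider(const QString &fileName, qint64 from, qint64 to)
        : _file(fileName), _end(to)
    {
        _file.open(QIODevice::ReadOnly);
        _file.seek(from);
    }
    QByteArray read(qint64 maxSize) override
    {
        const qint64 remaining = _end - _file.pos();
        if (remaining <= 0)
            return QByteArray(); // this split part is exhausted
        return _file.read(qMin(maxSize, remaining));
    }

private:
    QFile  _file;
    qint64 _end;
};
```

With providers like these, the 80 parts of a 40GB iso could also be handed to the posting loop in a shuffled order, as suggested earlier.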

sjpotter avatar Feb 17 '20 21:02 sjpotter

Well if you look at the code, things are not so easy...

A condition to start a PostingJob is to have created all the NntpFiles. An NntpFile knows exactly its number of NntpArticles. It asynchronously receives a signal each time one of its NntpArticles is posted. It then knows when the whole NntpFile is fully posted and signals it to the PostingJob. When this one has finished all its NntpFiles, it knows the posting is finished... Basically it's the NntpFile that would need to be modified, or better, inherited to have a child SplitNntpFile. A simple static method could create all the SplitNntpFiles from a file we want to split. This has to be done just after the creation of the par2, as the list PostingJob::_files is then rebuilt. There is an issue here as we're using QFileInfos... cf PostingJob::_postFiles.

PostingJob::_readNextArticleIntoBufferPtr will also need to be updated to take that into consideration. That's where all the actual file reading is done, in a multi-threaded sequential way (thread safe). There is only one file handle (QFile) opened, shared by all the ArticleBuilders, which don't need to know what file is currently being processed.
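
For illustration only, here is a rough, simplified sketch (not the actual ngPost code, just the idea) of how a shared, thread-safe read could be bounded to one split window of the big file instead of stopping at the physical end of file:

```cpp
#include <QFile>
#include <QMutex>
#include <QMutexLocker>

// Simplified sketch, not ngPost's real classes: one shared QFile,
// with reads restricted to the window [splitStart, splitEnd).
class SplitWindowReader
{
public:
    SplitWindowReader(QFile *file, qint64 splitStart, qint64 splitEnd)
        : _file(file), _splitEnd(splitEnd)
    {
        _file->seek(splitStart);
    }

    // Called by the ArticleBuilder threads; returns the number of bytes
    // read into the buffer, or 0 when the current split part is exhausted.
    qint64 readNextArticle(char *buffer, qint64 articleSize)
    {
        QMutexLocker lock(&_mutex); // sequential, thread-safe reads
        const qint64 remaining = _splitEnd - _file->pos();
        if (remaining <= 0)
            return 0; // done with this part, switch to the next window
        return _file->read(buffer, qMin(articleSize, remaining));
    }

private:
    QFile  *_file;     // the single shared file handle
    qint64  _splitEnd; // exclusive end offset of the current split part
    QMutex  _mutex;
};
```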

So quite a lot of changes to do... also probably the file name obfuscation and a way to integrate that in the GUI...

Something also to bear in mind: par2cmdline only generates the par2 in the same folder as the files it's working on. ParPar doesn't seem to have this limitation.

mbruel avatar Feb 18 '20 02:02 mbruel

I think what I'm proposing is orthogonal to file name obfuscation (at least within the archive); the obfuscation is only at the post level (i.e. each part posted gets a random subject / poster).

sjpotter avatar Mar 06 '20 08:03 sjpotter

I think what I'm proposing is orthogonal to file name obfuscation (at least within the archive); the obfuscation is only at the post level (i.e. each part posted gets a random subject / poster).

I don't get what you mean. I don't see the connection between file splitting and obfuscation.

mbruel avatar Mar 07 '20 15:03 mbruel

There are 2 forms of obfuscation that I know about (and they can be used together):

  1. renaming an iso to scrambled filename and then archiving that renamed file (so can't determine contents by filename)

  2. archiving an iso into a normal set of rar files, creating a par2 set for that set of rar files, then randomly renaming the files in the rar set (requiring one to use the par2 to recover their original names)

I believe you also added obfuscation of the poster, by creating a random poster for each file to make it harder to correlate the different pieces without the nzb file.

Number 1 above is irrelevant in my example: if I wanted a randomized file name, I would just rename it in advance.

Number 2 above is similar to what I'm proposing, but less efficient. The main thing it provides you (which is valuable, but takes time) is encryption. A secondary thing it provides you is the ability to upload more than a single file (but split mode only makes sense for a single large file).

I.e., to a first approximation, with rar files this is what happens:

  1. create a rar set
  2. create a par set
  3. randomly rename files, requiring par2 file to recover original names.
  4. upload individual files

With file splitting, we can skip the first step:

  1. create a par set for the single file we are going to split
  2. upload the single file as split parts, splitting it dynamically without requiring it to be written to disk; these split parts would be obfuscated (i.e. given a random subject / file name / poster name), as sketched below
  3. upload the par set
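
As a rough illustration of what "obfuscated" means for those split parts (my own sketch, not how ngPost actually builds its headers), each posted part would just get throw-away random subject and poster strings:

```cpp
#include <QString>
#include <QRandomGenerator>

// Sketch only: random subject and poster for one posted part, so the
// headers reveal nothing about the original file.
static QString randomString(int length)
{
    static const QString chars =
        QStringLiteral("abcdefghijklmnopqrstuvwxyz0123456789");
    QString result;
    for (int i = 0; i < length; ++i)
        result += chars.at(QRandomGenerator::global()->bounded(int(chars.size())));
    return result;
}

struct PartHeaders
{
    QString subject; // e.g. "q7f3k2m9p1x0a4b8"
    QString poster;  // e.g. "d8a2b5c1@z4y7w0.com"
};

static PartHeaders makeObfuscatedHeaders()
{
    return { randomString(16),
             randomString(8) + QStringLiteral("@") + randomString(6)
                 + QStringLiteral(".com") };
}
```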

sjpotter avatar Mar 07 '20 23:03 sjpotter

Hi,

  1. renaming an iso to scrambled filename and then archiving that renamed file (so can't determine contents by filename)

Your first type of obfuscation is implemented, as someone requested it for automated posts without an encryption password. It is the "file name obfuscation" in the Parameter section.

  2. archiving an iso into a normal set of rar files, creating a par2 set for that set of rar files, then randomly renaming the files in the rar set (requiring one to use the par2 to recover their original names)

Your second type of obfuscation is not implemented. Instead, Article obfuscation, which is far more powerful, can be used. BUT I wouldn't recommend using it, as your posts would be lost if you lose the NZB file.

  2. upload the single file as split parts, splitting it dynamically without requiring it to be written to disk; these split parts would be obfuscated (i.e. given a random subject / file name / poster name)

About the splitting, I've explained above that it is problematic with the architecture of ngPost: before starting the posting (uploading) loop we expect to have created all the NntpFile objects, and the reading function (which is used by multiple threads) is written on the assumption that all the NntpFiles point to different files.

So really, dealing with splitting is a totally different thing from obfuscation in terms of implementation with the current ngPost. I guess it could have been designed in another way, but that's not the case...

mbruel avatar Mar 08 '20 08:03 mbruel

I'd argue that article obfuscation is just number 2 taken to its logical extreme. In the simple case I described, one obfuscates at the file level (but one still needs an nzb to get all the files, as they are randomly named), though conceptually any indexer would show each individual file as complete, so if one knew which random set of files to download, one could hit the indexer to get them.

If I understand article obfuscation, it changes every subject / poster randomly, preventing an indexer from being able to group them together, so really requiring the generated nzb. Without it, there is no way to recover anything; it is just noise in the stream. I don't think it's much different in terms of the practicality of requiring one to depend on the nzb.

Article obfuscation with par file creation but without rar creation might work for my needs. I.e. create a par set for the 30GB file and upload it and the pars with article obfuscation.

Could that work? My main concern would be how download clients would deal with a 30-40GB file instead of smaller parts. I.e. if they are designed to keep all articles in memory until they combine them, that won't be so nice to most computers.

sjpotter avatar Mar 08 '20 10:03 sjpotter

Yep you're all right about the article obfuscation.

It should work without issue with a 30GB file. I suppose most grabbers use a temp file for each Article, but I didn't check. I kind of remember nzbget did it. But anyway, don't you have 50GB of space left on your drive? Archiving without compression doesn't take much time. Well, I didn't try it on such huge files, but... I presume it's no more than a few minutes.

mbruel avatar Mar 10 '20 22:03 mbruel

Depends. A SATA SSD is only going to get ~300MB/s if it's only doing that (NVMe obviously can do much better; spinning disks will do much worse, especially under IO load). In my case, my large data drive is a spinning disk and is often under heavy use, so minimizing IOPS is important.

sjpotter avatar Mar 10 '20 22:03 sjpotter

Yeah, that's what I'm saying: worst case on a SATA drive, it would take a few minutes... probably still less than the posting time for most people... What could be good otherwise would be to be able to start posting while compressing and generating the par2, but that also would require a "massive" change in the architecture. I don't use HD or 4K stuff, so I never deal with files bigger than 5GB; I didn't think of that...

mbruel avatar Mar 11 '20 15:03 mbruel

Speaking from experience, when my storage is starved for IOPS, performance goes into the crapper. I.e. even if normally I can manage 150-200MB/s on a spinning disk, I'm lucky to get single digits (especially for writes onto a RAID5, which hurts even more).

sjpotter avatar Mar 11 '20 15:03 sjpotter

Hum, it's good practice I guess to have one extra disk in direct access to be used as the /tmp or /Download directory ;) But I get your point. Well, if/when I find some time, I'll try to implement the split, but I'm not sure it will be this year... Have you tried to post a 30GB file without compression? So just one file, potentially using the Article obfuscation. How are grabbers handling the download? Is their memory going up or do they use temp files?

mbruel avatar Mar 14 '20 14:03 mbruel

This is good and dandy, but I don't think it will fit me or many others if you replaced the rar password support with this split. What if I want a password and rar splits? What if I want to deal with a folder; how are you going to split a folder? Or a 150+ GB iso, or a 50 GB AVI file, or even more? If other guys are worried about one single file, maybe you can add an extra option, but please don't take the WinRAR parts with password options away, please.

A small piece of advice on disk space and reads/writes, to spare the RAID HDD: everyone nowadays should have an SSD for such work, at least 250GB. Do the rar or split or par or anything on that drive, and when it's about to die, get a new one. SSDs don't last, but that is what they're made for: quick work. If you limit yourself you will always have problems with every new thing you're going to use. Thank you.

DEVUVO avatar Mar 29 '20 13:03 DEVUVO

I wasn't asking to get rid of the current mechanism, just asking for a new mechanism for uploading a single (large) file with obfuscation, so that a password doesn't really matter.

And yes, SSDs are faster, but they still have IO limits; not everyone can just use NVMe, and SATA SSDs are also quite limited.

the point is to create optimizations that have value to people.

sjpotter avatar Mar 29 '20 13:03 sjpotter

SSDs don't last, but that is what they're made for

Why do you say SSDs don't last? My 1TB NVMe in my laptop is 3 years old and still working perfectly... I was thinking flash drives would last longer than magnetic ones, as you won't have a failure of the reading head... Am I wrong?

the point is to create optimizations that have value to people.

Well yeah, but it's hard to please everyone and fulfil all use cases ;) As I said, even if you don't have an SSD, get a direct-access (non-RAID) drive for temporary files; that's a minimum and allows you not to pollute your RAID.

mbruel avatar Mar 29 '20 14:03 mbruel

SSDs only have a limited amount of writes they can take. (i.e. 1000TB of writes is a good SSD).

sjpotter avatar Mar 29 '20 17:03 sjpotter

SSDs only have a limited amount of writes they can take. (i.e. 1000TB of writes is a good SSD).

Euh... where did you see that? I don't see any reason why there would be a limited number of writes... oO In general you just have a 1-year warranty; you can write as much as you want... Am I wrong?

mbruel avatar Mar 29 '20 17:03 mbruel

As an example:

https://www.ontrack.com/uk/blog/pieces-of-interest/how-long-do-ssds-really-last/

"On its website, Samsung even promises that the product is withstanding up to 600 terabytes written (TBW)."

sjpotter avatar Mar 29 '20 19:03 sjpotter