Enhancement Request: support wget -i type multifile download
It is currently not possible to specify the same output path for two URLs.
@cxplay What is the use case where different sources should map to the exact same file on disk?
What should the behavior be if:
example.com/mymodel-weights.safetensor is targeted to /src/weights.safetensor
and
example.com/my-other-model-weights.safetensor is targeted to /src/weights.safetensor
Should we take the latest one in the list? Should the first one take priority? Since PGET is focused on download performance, I'd prefer not to download both files just to overwrite one. Additionally, there would likely be an error, as we do not use O_TRUNC unless the --force flag is provided.
I'm open to a change, but I want to make sure we're clear about the intent and the expected behavior. At the very least, warning that the local filesystem path is duplicated and clearly defining which file will be downloaded is important. However, it could also be argued that telling the downloader to download two distinct files to the same target is simply incorrect.
The one caveat is the null consumer (-o null), which doesn't do the deduplication check and just funnels all the data to io.Discard, so nothing is written. (The null consumer is interesting for testing/cache warming/etc., but not useful in real-world inference environments as far as I've seen.)
What I mean is that I used the following manifest.
https://example.com/res.bin ./temp
https://example.com/res-2.bin ./temp
But PGet refuses to read it, telling me the destination is the same.
Yes, pget works with full file paths, so the correct format would be:
https://example.com/res.bin ./temp/res.bin
https://example.com/res-2.bin ./temp/res-2.bin
There is a bug in the 0.6.2 release where pget doesn't create the directory if it doesn't exist, but the upcoming 0.7.x release will create the directory if needed.
Thanks for the explanation, I understand. But does that mean I have to specify the output filename, or PGet will still refuse to read the manifest? My main goal is to quickly batch-download files, like wget -i url.txt does.
Ok! I think I understand now and I think we can add a flag to support this type of behavior. Let me run down what I read from your request:
wget -i like behavior.
This means that all elements in the manifest (multifile mode) would use the target as a directory:
https://example.com/my-file.txt /src
https://example.com/my-file2.txt /src
This would result in:
/src/my-file.txt
/src/my-file2.txt
Assumptions
- PGET creates the target directory if it doesn't exist
- PGET will deconflict the filenames (same as today), e.g. specifying https://example.com/blah.txt and https://example.org/blah.txt would still be an error case
- If the target does exist and is not a directory, this is an error case
- Maintain --force behaviour that utilizes O_TRUNC
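The deconfliction assumption can be sketched as a pre-flight check that maps each resolved destination back to the URL that claimed it, and errors when two URLs collide (checkConflicts is a hypothetical name, not pget's API):

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"path/filepath"
)

// checkConflicts reports an error when two URLs would resolve to the same
// local file under the shared target directory, as in the example above
// where two hosts serve a file with the same basename.
func checkConflicts(targetDir string, rawURLs []string) (map[string]string, error) {
	seen := map[string]string{} // destination -> URL that claimed it
	for _, raw := range rawURLs {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, err
		}
		dest := filepath.Join(targetDir, path.Base(u.Path))
		if prev, dup := seen[dest]; dup {
			return nil, fmt.Errorf("%s and %s both target %s", prev, raw, dest)
		}
		seen[dest] = raw
	}
	return seen, nil
}

func main() {
	_, err := checkConflicts("/src", []string{
		"https://example.com/blah.txt",
		"https://example.org/blah.txt", // same basename, different host
	})
	fmt.Println(err)
}
```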
Further Questions
- As we have work coming to support un-tar in multifile mode, should the untar just land all the files in the target directory? Should untar be exempted from the wget -i type behavior?
- Should the new behavior keep the potential for multiple target directories (e.g. each URL has a target), or should it take a simple "target directory" type switch so all files go to the same location?
@cxplay Let me know if this all makes sense and if you can answer the items in "further questions" I'd appreciate it. I think this is a small change (relative to some others) and should be easy to get added to pget.
- Yes, I think we should observe the global flag setting on the command line first, and then read it from the manifest, assuming that we are compatible with the following format in the future:

  https://example.com/my-file.txt /src
  https://example.com/my-file2.txt /src
  https://example.com/my-file3.tar /src un-tar
  https://example.com/my-file4.tar /src

  If the un-tar flag is not set on the command line, file3 will still be un-tarred to the target, but file4 will not, because un-tar is not set for it.
- As above, we should support setting the target (folder) explicitly on the command line, and if it is not set, then try to read it from the manifest.
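That per-entry override could be parsed along these lines. An illustrative sketch of the proposed format only; parseManifest and entry are hypothetical names, not pget's actual parser:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// entry is one manifest line: URL, target directory, and an optional
// per-entry "un-tar" token, matching the format proposed above.
type entry struct {
	URL    string
	Target string
	Untar  bool
}

// parseManifest reads that format. globalUntar models the command-line
// flag; a per-line "un-tar" token can add to it but not remove it.
func parseManifest(manifest string, globalUntar bool) ([]entry, error) {
	var out []entry
	sc := bufio.NewScanner(strings.NewReader(manifest))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		switch len(fields) {
		case 0:
			continue // skip blank lines
		case 2:
			out = append(out, entry{fields[0], fields[1], globalUntar})
		case 3:
			if fields[2] != "un-tar" {
				return nil, fmt.Errorf("unknown option %q", fields[2])
			}
			out = append(out, entry{fields[0], fields[1], true})
		default:
			return nil, fmt.Errorf("malformed line: %q", sc.Text())
		}
	}
	return out, sc.Err()
}

func main() {
	manifest := `https://example.com/my-file3.tar /src un-tar
https://example.com/my-file4.tar /src`
	entries, _ := parseManifest(manifest, false)
	for _, e := range entries {
		fmt.Println(e.URL, "un-tar:", e.Untar)
	}
}
```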
@cxplay I wanted to circle back and let you know this hasn't fallen off the radar; as we implement the multifile form for tar extraction we'll hit this. 0.8.1 needed to land for some memory-utilization reasons before we took on new work.