pywb
WIP: allow symlinking and hardlinking files instead of just copying
Description
This allows users to manage collections of large WARC files without duplicating space. Hardlinks are used instead of symlinks to reflect the original mechanism, where the file is copied (so it can be safely removed from the source). If we used symlinks, we would break that expectation which could lead to data loss.
Inversely, hardlinks can lead to data loss as well. For example, pywb could somehow edit the file, which would modify the original as well. But we assume here pywb does not modify the file, and each side of the hardlink can have their own permissions to ensure this (or not) as well.
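The trade-off described above can be illustrated with a small standard-library sketch. This is not pywb code: the function and file names are made up for illustration, and it assumes a filesystem where both os.link and os.symlink are permitted. For each method it reports whether an edit made through the added file reaches the source, and whether the added file is still readable once the source is removed.

```python
import os
import shutil
import tempfile


def add_file(source, target, method):
    """Make source available at target by copying or linking (illustrative helper)."""
    if method == "copy":
        shutil.copy2(source, target)
    elif method == "hardlink":
        os.link(source, target)
    elif method == "symlink":
        os.symlink(source, target)
    else:
        raise ValueError("unknown method: %s" % method)


def compare(method):
    """Return (edit_reaches_source, survives_source_removal) for one method."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "source.warc.gz")    # hypothetical file name
        target = os.path.join(tmp, "archived.warc.gz")  # hypothetical file name
        with open(source, "wb") as fh:
            fh.write(b"fake WARC payload")
        add_file(source, target, method)

        # Append through the added file, then look at the source.
        with open(target, "ab") as fh:
            fh.write(b" edited")
        with open(source, "rb") as fh:
            edit_reaches_source = fh.read().endswith(b" edited")

        # Remove the source; os.path.exists() follows symlinks,
        # so a dangling symlink reports False here.
        os.remove(source)
        survives = os.path.exists(target)
        return edit_reaches_source, survives


if __name__ == "__main__":
    for method in ("copy", "hardlink", "symlink"):
        print(method, compare(method))
```

A copy is isolated in both directions, a hardlink shares its data with the source but survives the source's removal, and a symlink shares the data and dangles once the source is gone.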
Closes: #408
Types of changes
- [ ] Replay fix (fixes a replay specific issue)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
Checklist:
- [x] My change requires a change to the documentation.
- [ ] I have updated the documentation accordingly.
- [ ] I have added or updated tests to cover my changes.
- [ ] All new and existing tests passed.
This is WIP because I haven't worked on the docs or tests yet, as I want feedback on the idea first. Furthermore, tests fail here but that's unrelated to the patch here: they've been failing on master since at least https://github.com/webrecorder/pywb/commit/08b0ac87f70fbfa3c352d4a8c9498915609c5162
Advice on which docs to update and insights on the test suite would be very much welcome as well.
Hey, if your WARC files are already present somewhere else in a structured way, you might have success with configuring multiple archive paths, see https://pywb.readthedocs.io/en/latest/manual/configuring.html#archive-paths
The problem with that approach is that this expects a certain layout in the filesystem. Right now WARC files are stored like this:
archive/example.com-YYYY-MM-DD-HASH/fooo-NNNN.warc.gz
Where:
- YYYY-MM-DD is the date of the crawl
- HASH is a unique hash
- NNNN is an incrementing number (e.g. 0000 for the first WARC file, 0001 for the second one; files are split on 5GB boundaries)
This layout is commonly used by grab-site and archivebot and other crawlers. It does not match the expectations of pywb, including custom archive paths, which still look in collection/<coll name>/example.warc.gz.
So I don't think that's a sufficient solution to my problem.
I can't speak for the technical implementation (although it looks good to me), but I'd definitely appreciate anything that allows using existing WARCs – whatever file structure they may be in – in pywb without having to make any copies. Collections of WARCs can be huge, and storing them twice is wasteful and in some cases not even possible due to space constraints.
Of course, it would be possible to create a separate directory with the expected structure, make hardlinks or symlinks in there, and then use that as another archive path. But from a usability perspective, it would be much better in my opinion if pywb/wb-manager simply had an option that can take care of that directly (like the one proposed in this PR) instead of having to do that manually or with helper scripts.
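For reference, the manual workaround mentioned above could look roughly like the following helper script. It is only a sketch: link_warcs is a hypothetical name, the flat naming scheme is an arbitrary choice, and it assumes the destination is the collection's WARC directory. Hardlinks additionally require the source and destination to be on the same filesystem.

```python
import os
from pathlib import Path


def link_warcs(source_root, archive_dir, symlink=False):
    """Link every WARC found under source_root into archive_dir (flat layout).

    Hypothetical helper, not part of pywb or wb-manager.
    """
    source_root = Path(source_root)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    linked = []
    for warc in sorted(source_root.rglob("*")):
        if not warc.is_file() or not warc.name.endswith((".warc", ".warc.gz")):
            continue
        # Prefix with the crawl directory name so that the 0000 files
        # of different crawls do not collide in the flat directory.
        target = archive_dir / (warc.parent.name + "-" + warc.name)
        if os.path.lexists(target):
            continue  # already linked on a previous run
        if symlink:
            target.symlink_to(warc.resolve())
        else:
            os.link(warc, target)
        linked.append(target)
    return linked
```

It could then be called as link_warcs("archive", "collections/my-coll/archive"), where my-coll is a placeholder collection name. The collection would still need to be indexed afterwards, which is the part an option in wb-manager would handle directly.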
@anarcat thanks for the PR.
I believe the appropriate docs section that would need updating is docs/manual/usage.rst: Using Existing Web Archive Collections
Still thinking about how to test this because we will also need to support this functionality on Windows.
It may also be a good idea to add moving WARCs to the expected locations. Would also be useful for CDX per #410.
/cc @ikreymer
Thanks for suggesting this. I agree that this should be supported, but I'm not sure that symlinking/hardlinking is the way to go. As @N0taN3rd noted, this would complicate Windows support and would potentially make the setup more brittle.
pywb is close to supporting what you want with external paths, but unfortunately it's not automated.
It seems like the best option would be to support per-collection overrides, say an overrides.yaml which can set a list of one or more paths to directories that contain WARC files, e.g.:
collections/external-data/overrides.yaml:
archive_paths:
- /path/to/warcs/
- /path/to/more/warcs/
- /path/to/some-warcs/warcfile.warc.gz
Then, instead of the local archive directory, pywb will look for WARCs in those directories.
This can also work with auto-indexing, so any time a new WARC is added to those directories, it can be indexed automatically. pywb can already check and index subdirectories, so you can use whatever structure you'd like in the external directory.
Of course, the overrides.yaml could also be managed with something like:
wb-manager add-external-path <coll> <path>
wb-manager remove-external-path <coll> <path>
If you wanted to add only a specific WARC file in a directory instead of all WARCs, that too can be supported by specifying the file path instead of a directory (e.g. /path/to/some-warcs/warcfile.warc.gz)
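To make the proposal concrete, here is a rough sketch of how a list like archive_paths could be expanded into individual WARC files, with directories searched recursively and file entries taken as-is. Neither overrides.yaml nor this function exists in pywb at this point; expand_archive_paths is a hypothetical name and the list is passed in directly rather than read from YAML.

```python
import os


def expand_archive_paths(archive_paths):
    """Yield WARC files for entries that are either directories or single files.

    Hypothetical illustration of the proposed archive_paths semantics.
    """
    for entry in archive_paths:
        if os.path.isfile(entry):
            # A file entry selects exactly that WARC.
            yield entry
        elif os.path.isdir(entry):
            # A directory entry selects every WARC below it, at any depth.
            for dirpath, dirnames, filenames in os.walk(entry):
                dirnames.sort()  # deterministic traversal order
                for name in sorted(filenames):
                    if name.endswith((".warc", ".warc.gz")):
                        yield os.path.join(dirpath, name)
        # Entries that do not exist are silently skipped in this sketch.
```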
Where would the CDX files be stored with that setup?
Regarding Windows support, is that a hard requirement or could that feature simply be unavailable if not supported by the OS, the Python version, or the configuration (only admins can create symlinks on Windows according to the Python docs)?
> Where would the CDX files be stored with that setup?
By default, still in the indexes directory, although can provide an override for that also, if needed.
> Regarding Windows support, is that a hard requirement or could that feature simply be unavailable if not supported by the OS, the Python version, or the configuration (only admins can create symlinks on Windows according to the Python docs)?
Ideally, the external directory feature would be available on all platforms.
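One possible middle ground for the linking approach, sketched here under assumptions, is to attempt the link and fall back to a plain copy when the OS, the filesystem, or missing privileges make linking impossible. link_or_copy is a hypothetical helper, not code from this PR.

```python
import os
import shutil


def link_or_copy(source, target, method="hardlink"):
    """Add source at target with the requested method, copying if linking fails.

    Returns the method that was actually used. Hypothetical helper.
    """
    if method not in ("copy", "hardlink", "symlink"):
        raise ValueError("unknown method: %s" % method)
    if method != "copy":
        try:
            if method == "hardlink":
                os.link(source, target)
            else:
                os.symlink(os.path.abspath(source), target)
            return method
        except FileExistsError:
            # Never silently overwrite an existing file.
            raise
        except (OSError, NotImplementedError):
            # e.g. link across filesystems, or symlink without the
            # required privilege: fall through to a regular copy.
            pass
    shutil.copy2(source, target)
    return "copy"
```

Returning the method actually used lets the caller warn the user when a requested link silently became a copy, since that changes the disk usage they should expect.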
what I hear from the various comments so far is this:
- LGTM, some fix like this is necessary
- this can be done by hand
- add support for moving files (os.rename, presumably, although that fails if we cross FS boundaries)
- external paths are close to answering that requirement, but would require a new config file, maybe modifiable with a new set of commands
- concerns about Windows support
- pointers to which docs need updating (usage.rst, thanks!)
I understand where you are coming from: multi-platform compatibility is important, and there are existing features which might fit this requirement.
however, i would argue that "a bird in the hand is worth two in the bush": I have a working patch to work around a real scalability issue with pywb, right now. it might not work that effectively on Windows, but I want to point out that both os.link and os.symlink are actually supported on Windows, at least since Vista. so i don't think it's as much of a blocker as people would tend to believe.
the proposals to automate editing of the YAML file seem to be an entirely different approach, one that would require many more changes to the documentation and seems to me like feature creep. i just want to copy files lightly in the archive, not redesign how the entire YAML configuration system works! :) if this is the approach you want to take, I'm not sure I can help since I would need to dive deeper into the internals of pywb again, which might mean this would never be done at all. ;)
so to move this forward, I would propose that we keep on following the approach I proposed here. this would mean adding tests for the functionality and documentation. i would be happy to push that forward, if the proposal is accepted, otherwise I'm afraid I won't be able to provide a solution to #408 myself going forward.
have a nice day!
PS: the travis test failure here does not seem related to the patch, you might want to look into that... i'll re-trigger the build to see if it works better now.
Codecov Report
Merging #409 into master will decrease coverage by 0.15%. The diff coverage is 66.66%.
@@ Coverage Diff @@
## master #409 +/- ##
==========================================
- Coverage 88.04% 87.89% -0.16%
==========================================
Files 59 59
Lines 7227 7235 +8
Branches 1286 1288 +2
==========================================
- Hits 6363 6359 -4
- Misses 570 575 +5
- Partials 294 301 +7
| Impacted Files | Coverage Δ | |
|---|---|---|
| pywb/manager/manager.py | 97.58% <66.66%> (-1.35%) | :arrow_down: |
| pywb/apps/static_handler.py | 90% <0%> (-2.5%) | :arrow_down: |
| pywb/warcserver/index/aggregator.py | 89.72% <0%> (-1.98%) | :arrow_down: |
| pywb/recorder/multifilewarcwriter.py | 77.84% <0%> (-1.14%) | :arrow_down: |
Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 1b151b7...011640a.
hi @anarcat apologies for the delay -- was away on vacation and neglected to respond earlier!
I reread your comment and thought about it more, and have considered our current work. Since we don't have the bandwidth to implement the config-based approach now, and it could still be done at a later time, I think you're right and we should add this solution, even if it is not cross-platform as it will help your (and possibly others') use cases. This solution is a simple change to the wb-manager while the config option would be a much more extensive change, as you've mentioned.
To proceed, could you add:
- tests (and mark them with something like @pytest.mark.skipif(sys.platform == 'win32', reason="does not run on windows") to skip Windows). Existing tests for manager are mostly in ./tests/test_auto_colls.py, and ./tests/test_redirects.py is another example. A test module similar to those (but simpler probably!) should be good.
- docs in usage.rst, and include some tradeoffs between using the hardlink vs symlink vs copy approaches.
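As a starting point for the requested tests, a module along these lines might work. It is a sketch only: add_by_hardlink is a hypothetical stand-in for whatever the manager ends up exposing, and the test relies on pytest's built-in tmp_path fixture rather than on any pywb test helpers.

```python
import os
import sys

import pytest


def add_by_hardlink(source, archive_dir):
    """Hypothetical stand-in for the manager code under test."""
    os.makedirs(archive_dir, exist_ok=True)
    target = os.path.join(archive_dir, os.path.basename(source))
    os.link(source, target)
    return target


@pytest.mark.skipif(sys.platform == "win32", reason="does not run on windows")
def test_hardlink_shares_data_and_survives_removal(tmp_path):
    source = tmp_path / "example.warc.gz"
    source.write_bytes(b"fake WARC payload")

    target = add_by_hardlink(str(source), str(tmp_path / "archive"))

    # Both names point at the same file, so no space is duplicated.
    assert os.path.samefile(str(source), target)
    assert os.stat(target).st_nlink == 2

    # Removing the source must not remove the archived data.
    source.unlink()
    with open(target, "rb") as fh:
        assert fh.read() == b"fake WARC payload"
```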
And we'll try to merge it in for next release! Thanks again!
(Yes, the travis-ci issue is/was unrelated, we're looking at that)
awesome! i'm not sure I'll have time to do this before the next year (and I don't mind at all if someone else beats me to it), but hopefully I'll be able to come back to this soon-ish.
Any progress with this? It's something I'm finding problematic at the moment.