urlwatch
Allow changing URLs/commands while keeping change history
Sometimes I need to tweak the url or command for a given entry, without materially changing the content that gets fetched. But since urlwatch generates GUIDs by hashing the url or command string itself, this means the changed entry always gets treated as "new".
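To illustrate the problem, here is a minimal sketch of how a hash-derived GUID behaves. It assumes the GUID is the SHA-1 hex digest of the url/command string; the function name guid_for and the example URLs are made up for illustration and this is not urlwatch's actual code.

```python
import hashlib


def guid_for(location: str) -> str:
    # Assumption: the GUID is a SHA-1 hex digest of the url or command string.
    return hashlib.sha1(location.encode("utf-8")).hexdigest()


# A small tweak to the URL yields a completely unrelated GUID,
# so the stored history for the old GUID is no longer found.
old_guid = guid_for("http://example.org/page.html")
new_guid = guid_for("http://example.org/page.html?lang=en")

print(old_guid == new_guid)  # False
```

Since any change to the hashed string gives a different digest, the tweaked entry has no link back to its earlier snapshots.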
What about using the "name" field for identification instead? It would be a breaking change for people who currently have duplicate names, but I don't know how common that is.
This seems to be a duplicate of #232?
I think it's a different, although related, issue. #232 is about having different GUIDs for the same url, whereas this issue is about keeping the same GUID when tweaking the url/command. Generating the GUID from the name does seem to resolve both issues though. However, it comes with its own set of problems (duplicate names, renaming jobs, etc.).
@cfbao True, thanks for noticing.
So this could be solved by having a command-line switch that will change the URL of a Job and also "move" the corresponding cache entry in the cache database, so that it will "diff" to the old version of the URL?
Made-up (not yet implemented) proposal:
urlwatch --change-url http://example.org/old.html http://example.net/new.html
What this would do:
- Change the "url" of the matching job (and save the joblist)
- Copy the cache entry for the old URL to the new URL
Of course, it might only work for "url" (and "browser"?) jobs, but then it might not make sense for things like command jobs or any other kind of job type.
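The two steps of the proposal could be sketched roughly as follows. This is only an illustration under stated assumptions: jobs are plain dicts, the cache is a dict mapping GUID to a list of snapshots, GUIDs are SHA-1 digests of the URL, and change_url is a hypothetical helper, not an existing urlwatch function.

```python
import hashlib


def guid_for(location):
    # Assumption: GUID is the SHA-1 hex digest of the URL string.
    return hashlib.sha1(location.encode("utf-8")).hexdigest()


def change_url(jobs, cache, old_url, new_url):
    """Hypothetical helper: retarget a job and carry its history along."""
    matches = [job for job in jobs if job.get("url") == old_url]
    if not matches:
        raise KeyError(f"no job with url {old_url!r}")

    # Step 1: change the "url" of the matching job(s).
    for job in matches:
        job["url"] = new_url

    # Step 2: copy the cache entry for the old URL to the new URL,
    # so the next run diffs against the old snapshots.
    old_guid, new_guid = guid_for(old_url), guid_for(new_url)
    if old_guid in cache:
        cache[new_guid] = list(cache[old_guid])
    return new_guid


jobs = [{"name": "example", "url": "http://example.org/old.html"}]
cache = {guid_for("http://example.org/old.html"): ["<html>old snapshot</html>"]}

new_guid = change_url(
    jobs, cache, "http://example.org/old.html", "http://example.net/new.html"
)
```

The sketch copies rather than moves the entry, matching the wording of the proposal; a real implementation would also need to save the job list and write to the actual cache database.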
It does seem like a reasonable solution without breaking the current design.
It probably still makes sense for command jobs. For example, I might be using "ls *" to monitor a folder with all text files. But later the folder contains some other types of files that I don't care about, so I change the command to "ls *.txt". From my point of view, the two commands do essentially the same job.
Another approach could be:
- Add feature to list current GUIDs
- Make it possible to override (explicitly specify) the GUID in urls.yaml
This way, if somebody wants to change the URL, they could:
- Get the current GUID and add it explicitly to the urls.yaml file
- Change the URL at will -- the GUID will always be used
Also, it would provide a way to solve #232 by explicitly specifying a different GUID for each task even if the URL is the same (but then again, the "anchor hack" is probably fine).
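A rough sketch of how such an override might work, assuming a hypothetical "guid" key on the job (no such key is confirmed by this thread) and SHA-1 hashing as the fallback:

```python
import hashlib


def job_guid(job):
    # Hypothetical "guid" key: if present, it wins over the computed hash.
    explicit = job.get("guid")
    if explicit:
        return explicit
    location = job.get("url") or job.get("command")
    return hashlib.sha1(location.encode("utf-8")).hexdigest()


# Step 1: look up the current (computed) GUID and pin it on the job.
job = {"url": "http://example.org/somedata-2020.pdf"}
pinned = job_guid(job)
job["guid"] = pinned

# Step 2: change the URL at will; the pinned GUID is still used.
job["url"] = "http://example.org/somedata-2021.pdf"
still_same = job_guid(job) == pinned

# Two jobs with the same URL can also be told apart with explicit GUIDs.
job_a = {"url": "http://example.org/", "guid": "example-a"}
job_b = {"url": "http://example.org/", "guid": "example-b"}
```

With this approach the GUID becomes part of the user-facing configuration, which is the trade-off compared to a command-line switch.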
This feature would also be very useful to me. I monitor a lot of web pages that link to PDFs full of additional data. To update that additional data, the page maintainers will typically upload a new version of the PDF with a different filename (think: somedata-2020.pdf changes to somedata-2021.pdf) and change their page's link href accordingly.
So, in order to urlwatch those PDFs, I need to keep the URLs to those PDFs up to date. I'm sure I can figure out a way to do that, but the missing piece is this feature request right here--being able to update somedata-2020.pdf to somedata-2021.pdf while still comparing the 2020 data to the 2021 data.
Anyway, @thp's first suggestion seems a bit more straightforward for users and keeps the GUID internals, well, internal. I'd be happy to implement the --change_url command and make a PR. Just assign to me if that sounds ok, or let me know if a different approach is preferred.
Sounds like a plan.
@zsau this is probably far too late, but I did just discover a workaround, which is to use the user_visible_url key. You can define a user_visible_url for your job, and that string is actually used to compute the guid instead (here and then here). Thus, as long as you keep user_visible_url constant, you can modify the command or url values at will without modifying the associated guid.
I suppose the downside is that the "true" url or command used for a job is hidden in reports.
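The effect of the workaround can be modelled like this. It is a simplified sketch, not the real urlwatch code: jobs are plain dicts, location_for and guid_for are illustrative names, and SHA-1 is assumed as the hash.

```python
import hashlib


def location_for(job):
    # Assumed behaviour described above: user_visible_url, when set,
    # takes precedence over the real url/command.
    return job.get("user_visible_url") or job.get("url") or job.get("command")


def guid_for(job):
    return hashlib.sha1(location_for(job).encode("utf-8")).hexdigest()


job = {"command": "ls *", "user_visible_url": "text-files-in-folder"}
before = guid_for(job)

# Tweak the command; the guid is unchanged because user_visible_url is constant.
job["command"] = "ls *.txt"
after = guid_for(job)

print(before == after)    # True
print(location_for(job))  # text-files-in-folder
```

The last line also shows the downside: what gets displayed is the user_visible_url string, not the command that actually ran.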
@thp Great, I'll get a PR together soon.