Allow changing URLs/commands while keeping change history

Open zsau opened this issue 5 years ago • 5 comments

Sometimes I need to tweak the url or command for a given entry, without materially changing the content that gets fetched. But since urlwatch generates GUIDs by hashing the url or command string itself, this means the changed entry always gets treated as "new".
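
To illustrate the problem, here is a minimal sketch of the behaviour described above. It assumes the GUID is a SHA-1 hex digest of the url or command string; the helper name guid_for is made up for this example and is not urlwatch's actual API.

```python
import hashlib


def guid_for(location: str) -> str:
    """Hypothetical helper: derive a job ID by hashing the location string."""
    return hashlib.sha1(location.encode("utf-8")).hexdigest()


old = guid_for("http://example.org/page.html")
new = guid_for("http://example.org/page.html?lang=en")

# Any edit to the string yields an unrelated ID, so the history stored
# under the old ID is no longer found and the job is treated as "new".
print(old == new)
```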

What about using the "name" field for identification instead? It would be a breaking change for people who currently have duplicate names, but I don't know how common that is.

zsau avatar Oct 27 '18 03:10 zsau

This seems to be a duplicate of #232?

thp avatar Nov 02 '18 20:11 thp

I think it's a different, although related, issue. #232 is about having different GUIDs for the same url, whereas this issue is about keeping the same GUID when tweaking the url/command. Generating the GUID from the name does seem to resolve both issues, though. However, it comes with its own set of problems (duplicate names, renaming jobs, etc.).

cfbao avatar Nov 04 '18 17:11 cfbao

@cfbao True, thanks for noticing.

So this could be solved by having a command-line switch that will change the URL of a Job and also "move" the corresponding cache entry in the cache database, so that it will "diff" to the old version of the URL?

Made-up (not yet implemented) proposal:

urlwatch --change-url http://example.org/old.html http://example.net/new.html

What this would do:

  1. Change the "url" of the matching job (and save the joblist)
  2. Copy the cache entry for the old URL to the new URL

Of course, it might only work for "url" (and "browser"?) jobs, but then it might not make sense for things like command jobs or any other kind of job type.
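
A rough sketch of those two steps, purely for illustration. It assumes the job list is a list of dicts and the cache is a dict mapping GUID to a list of snapshots; change_url and guid_for are hypothetical names, not existing urlwatch functions, and the real cache storage may look quite different.

```python
import hashlib


def guid_for(location):
    # Assumption: IDs are a SHA-1 hash of the URL string.
    return hashlib.sha1(location.encode("utf-8")).hexdigest()


def change_url(jobs, cache, old_url, new_url):
    """Hypothetical helper: point a job at new_url and carry its history over."""
    matches = [job for job in jobs if job.get("url") == old_url]
    if not matches:
        raise KeyError(f"no job with url {old_url!r}")
    for job in matches:
        job["url"] = new_url  # step 1: update the job list
    old_guid, new_guid = guid_for(old_url), guid_for(new_url)
    if old_guid in cache:
        # step 2: copy (not move) the snapshots to the new GUID
        cache[new_guid] = list(cache[old_guid])
    return new_guid


jobs = [{"name": "Example", "url": "http://example.org/old.html"}]
cache = {guid_for("http://example.org/old.html"): ["snapshot 1", "snapshot 2"]}
change_url(jobs, cache, "http://example.org/old.html", "http://example.net/new.html")
```

After the call, the next run for the new URL would find the old snapshots and diff against them instead of reporting a new job.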

thp avatar Nov 05 '18 20:11 thp

It does seem like a reasonable solution without breaking the current design.

It probably still makes sense for command jobs. For example, I might be using ls * to monitor a folder containing only text files. But later the folder contains some other types of files that I don't care about, so I change the command to ls *.txt. From my point of view, the two commands do essentially the same job.

cfbao avatar Nov 10 '18 22:11 cfbao

Another approach could be:

  • Add feature to list current GUIDs
  • Make it possible to override (explicitly specify) the GUID in urls.yaml

This way, if somebody wants to change the URL, they could:

  • Get the current GUID and add it explicitly to the urls.yaml file
  • Change the URL at will -- the GUID will always be used

Also, it would provide a way to solve #232 by explicitly specifying a different GUID for each task even if the URL is the same (but then again, the "anchor hack" is probably fine).
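
Roughly what the override could look like, as a sketch only. The "guid" key and the effective_guid helper are hypothetical (nothing like this exists yet), and the SHA-1 fallback is an assumption about how IDs are derived today.

```python
import hashlib


def effective_guid(job):
    """Hypothetical lookup: an explicit 'guid' key wins over the hashed location."""
    explicit = job.get("guid")
    if explicit:
        return explicit
    location = job.get("url") or job.get("command")
    return hashlib.sha1(location.encode("utf-8")).hexdigest()


job = {"url": "http://example.org/old.html"}
pinned = effective_guid(job)  # what a "list GUIDs" feature would print

job["guid"] = pinned          # added explicitly to the job entry
job["url"] = "http://example.net/new.html"

# Two jobs with the same URL can be told apart by distinct explicit GUIDs.
job_a = {"url": "http://example.org/", "guid": "frontpage-a"}
job_b = {"url": "http://example.org/", "guid": "frontpage-b"}
```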

thp avatar Jul 30 '20 09:07 thp

This feature would also be very useful to me. I monitor a lot of web pages that link to PDFs full of additional data. To update that additional data, the page maintainers will typically upload a new version of the PDF with a different filename (think: somedata-2020.pdf changes to somedata-2021.pdf) and change their page's link href accordingly.

So, in order to urlwatch those PDFs, I need to keep the URLs to those PDFs up to date. I'm sure I can figure out a way to do that, but the missing piece is this feature request right here--being able to update somedata-2020.pdf to somedata-2021.pdf while still comparing the 2020 data to the 2021 data.

Anyway, @thp's first suggestion seems a bit more straightforward for users and keeps the GUID internals, well, internal. I'd be happy to implement the --change_url command and make a PR. Just assign to me if that sounds ok, or let me know if a different approach is preferred.

trevorshannon avatar Dec 08 '22 05:12 trevorshannon

> Anyway, @thp's first suggestion seems a bit more straightforward for users and keeps the GUID internals, well, internal. I'd be happy to implement the --change_url command and make a PR. Just assign to me if that sounds ok, or let me know if a different approach is preferred.

Sounds like a plan.

thp avatar Dec 12 '22 16:12 thp

@zsau this is probably far too late, but I did just discover a workaround, which is to use the user_visible_url key. You can define a user_visible_url for your job, and that string is actually used to compute the guid instead (here and then here). Thus, as long as you keep user_visible_url constant, you can modify the command or url values at will without modifying the associated guid.

I suppose the downside is that the "true" url or command used for a job is hidden in reports.
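
A small sketch of why the workaround holds, assuming the guid is a SHA-1 hash of the job's "location" and that user_visible_url takes precedence over url/command when present. The helper names location_of and guid_of are made up for this example.

```python
import hashlib


def location_of(job):
    # Assumption: user_visible_url, when set, takes precedence over url/command.
    return job.get("user_visible_url") or job.get("url") or job.get("command")


def guid_of(job):
    return hashlib.sha1(location_of(job).encode("utf-8")).hexdigest()


before = {"command": "ls *", "user_visible_url": "watch text files"}
after = {"command": "ls *.txt", "user_visible_url": "watch text files"}

# Same user_visible_url, so the same guid even though the command changed.
print(guid_of(before) == guid_of(after))
```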

@thp Great, I'll get a PR together soon.

trevorshannon avatar Dec 13 '22 02:12 trevorshannon