
Feature Request: Query-aware filtering (CGI, PHP, etc.)

Open mpheyse opened this issue 8 years ago • 4 comments

While I have many CGI-handling suggestions, the first and most straightforward is simplifying and filtering CGI options (Queries).

Simplification
CGI, PHP, and other active content generators get data (Options) through URL Queries (?key1=value1&key2=value2). HTTrack uses Links, which contain those Key/Value pairs.

The problem is that Links with Queries can be written in lots of different ways that are the same link:

  • Link1: cgi?Country=USA&State=MI
  • Link2: cgi?State=MI&Country=USA

Order does not matter, so all links with the same options point at the same page.

HTTrack should internally re-sort all option pairs alphabetically by Key. This would make Link1 and Link2 above both point to Link1.
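A minimal sketch of that re-sort, in Python for illustration (normalize_query is a hypothetical helper, not existing HTTrack code; HTTrack itself is written in C):

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    def normalize_query(url):
        # Sort the query's Key/Value pairs alphabetically by Key so that
        # equivalent links collapse to one canonical form.
        parts = urlsplit(url)
        pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
        return urlunsplit(parts._replace(query=urlencode(pairs)))

    # Both spellings of the same page normalize to one URL:
    assert normalize_query("cgi?Country=USA&State=MI") == \
           normalize_query("cgi?State=MI&Country=USA")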

Filtering
A single CGI program is really equivalent to thousands of individual pages, so there should be some filtering based on the options it's sent.

  • Key/Value filtering: accept or reject a link that has a specific Key/Value pair
  • Ignore a specific Key: any link with that Key has that Key ignored
  • Block a specific Key: any link with that Key is dropped
  • Value filtering: choose a Key and filter its Value to accept or reject
  • Default Key Value: add a Key to a link if it's not present
  • Multi Key Filtering: allow combinations of the above to apply as a single rule

Key/Value filtering
Some sites use a single CGI that controls different sections of a site. To get only one of the sections, we want to include all links that contain a specific Key/Value pair.

Ignore a specific Key
Some Keys are trivial and do not change the page content. HTTrack doesn't know this, so telling it to ignore these Keys will limit page duplication.

Personally, I see comment systems where what should be a Fragment (#JumptoHere) is instead a CGI key/value pair. A page with 50 comments will have 50 jump-to links and 50 replyto links that all point back to itself. HTTrack needs to be told to drop the Keys "comment" and "replyto", or one valid page becomes 101 duplicates.
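A sketch of what that ignore step could look like, assuming a user-supplied Key list (the helper name and Keys here are hypothetical):

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    IGNORED_KEYS = {"comment", "replyto"}  # user-supplied ignore list

    def drop_ignored(url):
        # Remove ignored Keys so the 101 variants collapse to one page.
        parts = urlsplit(url)
        pairs = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                 if k not in IGNORED_KEYS]
        return urlunsplit(parts._replace(query=urlencode(pairs)))

    print(drop_ignored("page.cgi?story=4&comment=17&replyto=9"))
    # -> page.cgi?story=4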

Block a specific Key
Drop any link that contains this Key, regardless of value. This could also be used to fix the above example, by dropping any link that carries comment or replyto.

Value filtering
Set a scan filter/rule that is applied to the Value of a specific Key.

Default Key Value
Most CGI programs have internal defaults for the values of Keys. Normally that's not a problem for us, unless some links have them hard-set to the default value. To fix this, we hard-set a default value that HTTrack adds to any link where the Key is not present.

Example: this CGI uses a Key "Country", and if the Key is not present it defaults to "us". The problem is that some links have "Country=us" in them, which would cause duplicates. We can't simply ignore the "Country" Key, because we still want to drop all the "it", "en", and "fr" pages. So we set a Default Key Value of "Country=us"; then any link that doesn't have a hard-defined "Country" value now does, eliminating the duplicates.
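A sketch of that default-filling step (apply_defaults and DEFAULTS are hypothetical; the final sort keeps the result canonical, as in the earlier sketch):

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    DEFAULTS = {"Country": "us"}  # hypothetical user-configured default

    def apply_defaults(url):
        # Add the default Key only where it is missing, so implicit and
        # hard-set links collapse to the same URL.
        parts = urlsplit(url)
        pairs = dict(parse_qsl(parts.query, keep_blank_values=True))
        for key, value in DEFAULTS.items():
            pairs.setdefault(key, value)
        return urlunsplit(parts._replace(query=urlencode(sorted(pairs.items()))))

    # Both of these now point at the same page:
    assert apply_defaults("cgi?State=MI") == apply_defaults("cgi?Country=us&State=MI")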

Multi Key Filtering
Build a set of Key filters that are all applied as a single rule.

Currently, attempting any of this with the standard rules is very dependent on the order of Key/Value pairs, and is fairly complex.

mpheyse commented Oct 12 '16

Syntax Proposal

Here is my proposal for a Filter Definition syntax, to be added inline with the existing filter rule set:

+query:QueryRules

or maybe just

+q:QueryRules

But I think the best is

+?QueryRules

As this makes the Queries look almost natural. For example:

+bin.cgi?times=fun

This would accept all links to bin.cgi that have the query times=fun.

+bin.cgi -bin.cgi?times=fun

This would accept all links to bin.cgi, except ones that have the query times=fun.

Use '>' as Default specifier (Duplication Reduction)

+bin.cgi?city>LA

This would accept all links to bin.cgi. Any link that doesn't have a city Key defined would get city=LA added (this is programmatically the same as removing any instance of city=LA from bin.cgi links).

Multi Key with & specifier

+bin.cgi?times=fun&end=now

This would accept links to bin.cgi that contain both times set to fun and end set to now.

Fast Block with !

+bin.cgi?times=fun&end=now -bin.cgi?times=fun&end=now&Bad=*

This would accept links to bin.cgi that have both "times" set to "fun" and "end" set to "now", but that do not contain any value for Bad (i.e., anywhere the Key "Bad" isn't set). To simplify this to a single line, the NOT prefix '!' can be used:

+bin.cgi?times=fun&end=now&!Bad=*

would be equivalent to the two-line version. Links with ....&Bad=&... would not be blocked, as "Bad=" means Bad=NULL, which is the same as Bad not being defined/set.
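To make the intended semantics concrete, here is one possible reading of a rule's query part as a matcher (a sketch only: exact-value terms and '!' terms, no wildcard values; query_matches is hypothetical):

    from urllib.parse import urlsplit, parse_qsl

    def query_matches(url, rule_query):
        # Keys the rule does not mention are implied to match; a '!key'
        # term requires the Key to be absent or empty ("Bad=" is NULL).
        link = dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))
        for term in filter(None, rule_query.split("&")):
            key, _, value = term.partition("=")
            if key.startswith("!"):
                if link.get(key[1:]):
                    return False
            elif link.get(key) != value:
                return False
        return True

    assert query_matches("bin.cgi?times=fun&end=now", "times=fun&end=now&!Bad=*")
    assert not query_matches("bin.cgi?times=fun&end=now&Bad=yes",
                             "times=fun&end=now&!Bad=*")
    assert query_matches("bin.cgi?times=fun&end=now&Bad=", "times=fun&end=now&!Bad=*")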

Ignore with ':' (Duplication Reduction)

+bin.cgi?map=England&date=1796&:referer

This drops the 'referer' Key from any otherwise matching link, to eliminate producing duplicates.

':' only matches the Key, not the Value, so any given Value is ignored:

+bin.cgi?map=England&date=1796&:referer=page76

is still functionally equivalent to the first version.

Ignore all But, ':*' (Duplication Reduction)

+bin.cgi?:*&times=fun*

This gets all links that match the equivalent +bin.cgi?times=fun* rule, but then drops ALL other Keys from the links.
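A sketch of that ':*' reading (keep_only is hypothetical): drop every Key the rule does not explicitly list, before the link is recorded.

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    def keep_only(url, keys):
        # ':*': strip all Keys except the ones the rule names.
        parts = urlsplit(url)
        pairs = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                 if k in keys]
        return urlunsplit(parts._replace(query=urlencode(pairs)))

    print(keep_only("bin.cgi?times=fun1&sess=9&ref=x", {"times"}))
    # -> bin.cgi?times=fun1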

Implied everything else
All Keys that are not in the rule are implied to match.

+bin.cgi?times=fun

would match links like

  • bin.cgi?times=fun
  • bin.cgi?times=fun&time=now

This means that

+bin.cgi?

would match all links to bin.cgi the same way

+bin.cgi*

does currently.

To stop the implied match, use the aforementioned Ignore all But (':*').

'!', ':', and '>' are disallowed or restricted in URLs, so there should be no collisions; '?' and '&' already have special meaning in Queries and are used here as intended, so there should be no issues with them as specifiers either.

mpheyse commented Jul 24 '17

Advanced Syntax

Forced Index with $

When downloading 3,400 different pages that all come from a bin.cgi, HTTrack adds a random 4-digit hex index (and can then add a "-" and more digits) to the name to prevent collisions. This works, and all the links get set correctly, but it makes jumping directly to a specific page impossible.

Specify an Index Key with $
Lots of sites have unique Keys for things, like Amazon's ASIN or the date on a newspaper site.

+bin.cgi?$date

is the same as +bin.cgi?, but the local on-disk version will be a list of pages named bin.cgi-datevalue.html.

Defining an Index does not change the rule, and Index definitions can be used inline with rules:

+bin.cgi?$date=199*

This sets "date" as the Index and requires its value to start with "199".

If multiple pages have the same Index, then the normal hex-based index is added to the end.

Specify a sub-index Key with a second $
Order matters, but an indexable value could be spread across multiple Query Keys on the site, so you can build up your Index from several Keys, like a newspaper with a date AND a page.

+bin.cgi?$date&$page

is the same as +bin.cgi?, but the local on-disk version will be a list of pages named bin.cgi-datevalue-pagevalue.html.

Specify an Index text with $=
Need to make that on-disk archive a little more user-friendly? Add constant strings to the Index by simply not specifying a Key name, only a Value with '='. Index text has no effect on the rule other than the file name on disk.

+bin.cgi?$=ShipDate&$date

is the same as +bin.cgi?, but the local on-disk version will be a list of pages named bin.cgi-ShipDate-datevalue.html.

Notes

  1. Index Definitions don't have to be static; any existing rule can be made into an Index by adding '$'.

+bin.cgi?$month=*ember&$day

  2. Any link with an Index Key that's not defined uses an empty string for the missing value in the file name.
  3. Order is important only for building up the Index, not for the underlying Query Rule. Query Rules have no order, other than that the left-most wins in case of duplicates.
  4. The '-' hyphen, used as an Index separator in the above examples, is shown for clarity and could be dropped from the actual syntax.
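A sketch of how the '$' Index could drive the on-disk name, following the notes above (local_name and its index_parts encoding are hypothetical):

    from urllib.parse import urlsplit, parse_qsl

    def local_name(url, index_parts):
        # index_parts mixes Key names and "=text" constants, e.g.
        # ["=ShipDate", "date"] for the rule +bin.cgi?$=ShipDate&$date.
        parts = urlsplit(url)
        query = dict(parse_qsl(parts.query, keep_blank_values=True))
        pieces = [parts.path]
        for part in index_parts:
            if part.startswith("="):
                pieces.append(part[1:])             # '$=text' constant
            else:
                pieces.append(query.get(part, ""))  # '$key'; '' if missing
        return "-".join(pieces) + ".html"

    print(local_name("bin.cgi?date=1796&page=3", ["date", "page"]))
    # -> bin.cgi-1796-3.html
    print(local_name("bin.cgi?date=199605", ["=ShipDate", "date"]))
    # -> bin.cgi-ShipDate-199605.html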

mpheyse commented Jul 24 '17

A Little More (Future Additions)

These rules require a bit more work than simple link-text manipulation.

Use only Once, but Drop ':-'

+bin.cgi?:-SessionID

All other Query Rules are applied as links are found. This rule works differently and uses two different versions of the link. The first contains the affected Key and is used to download the page. The second does not contain the Key and is the URL recorded for the page in the local on-disk cache. The net effect is similar to dropping the Key, but it can be used where the Key is required.

The example shown is a Key for a SessionID, a unique number that all pages must be fed but that doesn't contribute to the page content. The problem is that if you drop it, the pages don't load properly, and if you keep it, later runs of Update Existing Download will see a different SessionID value and the entire site will look like new pages.

Like Ignore all But, this can be expanded to all other unlisted Keys with ':-*'.
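A sketch of that two-URL behavior (split_link is hypothetical): the link is fetched with the Key intact, but recorded without it.

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    def split_link(url, drop_keys):
        # Return (fetch_url, cache_url): fetch with the Key so the page
        # loads; record without it so later updates see the same URL.
        parts = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                if k not in drop_keys]
        cache_url = urlunsplit(parts._replace(query=urlencode(kept)))
        return url, cache_url

    fetch, cache = split_link("bin.cgi?page=4&SessionID=ab12", {"SessionID"})
    # fetch -> bin.cgi?page=4&SessionID=ab12
    # cache -> bin.cgi?page=4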

mpheyse commented Jul 30 '17

It would be great if this feature request were accepted. Without it, URLs with variable lists just get random numbers, which makes proper archiving difficult.

https://www.literotica.com/s/an-angels-wish?page=2

gets saved as an-angels-wish4658.html instead of an-angels-wish-page2.

Hope this gets resolved.

Thanks.

Michael9450 commented Sep 05 '17