Wildcard URIs
Wildcard URIs as described in the AP designate their wildcarded components by using "empty strings". This is confusing for many. A single wildcard "*" is what most will immediately recognize. Consider changing the spec.
I was among those expecting * to be the wildcard token when I first read the spec. Seems more intuitive that way.
I was expecting * as well. I vote for changing it.
+1
+1 - it's both what people expect and much less easy to overlook.
By the way, if you used * for prefix matching as well, you could get rid of the SUBSCRIBE.Options.match|string option. In this case a * with a leading or trailing . would indicate a wildcard, and a trailing * without a leading . would indicate a prefix.
Example:
com.myapp.topic.emergency*
for prefix:
com.myapp.topic.emergency.11
com.myapp.topic.emergency-low
com.myapp.topic.emergency.category.severe
com.myapp.topic.emergency
and
com.myapp.*.userevent
for wildcard:
com.myapp.foo.userevent
com.myapp.bar.userevent
com.myapp.a12.userevent
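For illustration only, the proposal above could be sketched roughly as follows. This is not how WAMP works; `infer_policy` and `matches` are hypothetical names, and the sketch assumes a `*` component stands for exactly one URI component.

```python
def infer_policy(pattern):
    """Return (policy, uri) under the *proposed* '*' notation (hypothetical)."""
    # A trailing '*' that does not follow a '.' marks a prefix.
    if pattern.endswith("*") and not pattern.endswith(".*"):
        return "prefix", pattern[:-1]
    # A '*' standing alone between dots (or at either end) marks a wildcard.
    if "*" in pattern.split("."):
        return "wildcard", pattern
    return "exact", pattern


def matches(pattern, uri):
    policy, base = infer_policy(pattern)
    if policy == "prefix":
        return uri.startswith(base)
    if policy == "wildcard":
        p, u = base.split("."), uri.split(".")
        return len(p) == len(u) and all(a == "*" or a == b for a, b in zip(p, u))
    return uri == base


for uri in ["com.myapp.topic.emergency.11", "com.myapp.topic.emergency-low"]:
    print(uri, matches("com.myapp.topic.emergency*", uri))
for uri in ["com.myapp.foo.userevent", "com.myapp.a12.userevent"]:
    print(uri, matches("com.myapp.*.userevent", uri))
```

Under this scheme a pattern such as com.myapp.* can only ever be read as a wildcard, because the policy is derived from the string itself.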
By the way, if you used * for prefix matching as well, you could get rid of the SUBSCRIBE.Options.match|string option.
No, because wildcard matches only apply to whole URI components, whereas prefix matches do not.
com.myapp.* under a prefix match policy will match more URIs than under a wildcard match policy.
Encoding the match policy in the URI .. no.
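A minimal sketch of why the same pattern string behaves differently under the two policies. It uses the current empty-component spelling, com.myapp., and assumes that prefix matching is a plain string-prefix test and that each empty component stands for exactly one component. The function names are made up for this example.

```python
def prefix_match(pattern, uri):
    # Character-level comparison: component boundaries play no role.
    return uri.startswith(pattern)


def wildcard_match(pattern, uri):
    # Component-level comparison: an empty pattern component matches
    # any single URI component.
    p, u = pattern.split("."), uri.split(".")
    return len(p) == len(u) and all(a == "" or a == b for a, b in zip(p, u))


pattern = "com.myapp."
for uri in ["com.myapp.foo", "com.myapp.foo.bar", "com.myapp.foo.bar.baz"]:
    print(uri, prefix_match(pattern, uri), wildcard_match(pattern, uri))
```

All three URIs pass the prefix test, but only com.myapp.foo passes the wildcard test, so the policy cannot be recovered from the pattern string alone.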
That makes sense. I didn't know that com.myapp. was a possible prefix value. I thought only com.myapp was allowed.
Yes, com.myapp. is allowed for prefix matching, and the result is different from prefix matching com.myapp. Whether this fine distinction should be used in an app is a different question, but technically, it's possible.
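To make the distinction concrete, here is a small sketch that assumes prefix matching is a plain string-prefix comparison. The URIs are invented for the example.

```python
uris = ["com.myapp", "com.myapp.foo", "com.myapp2.foo", "com.myapplication"]

# Prefix "com.myapp" (no trailing dot) also catches sibling names.
without_dot = [u for u in uris if u.startswith("com.myapp")]

# Prefix "com.myapp." only catches URIs below com.myapp, and not com.myapp itself.
with_dot = [u for u in uris if u.startswith("com.myapp.")]

print(without_dot)
print(with_dot)
```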
Looks like this is not reflected in the spec yet (I looked at https://github.com/wamp-proto/wamp-proto/blob/master/rfc/text/advanced/ap_pubsub_pattern_based_subscription.md). Is it agreed upon that wildcard will now use * (i.e. if I'm implementing WAMP today, should I support this instead of the empty component)?
And if so, I assume that * will now join #, . and spaces as invalid characters in URI components?
@PuerkitoBio I would stick with current implementations: that is, * isn't special. Wildcard URI components are identified as being empty (zero-length) URI components. Whether a string is interpreted as exact URI, prefix matching or wildcard matching URI (pattern) is solely determined from the SUBSCRIBE.Options.match|string attribute. And this definitely won't change, as we can't get away with * alone (without match) option ..
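A rough sketch of what "solely determined from the match attribute" means in practice. `uri_matches` is a hypothetical helper, not a router API; it assumes a missing match option means exact matching and that an empty component stands for exactly one component.

```python
def uri_matches(pattern, uri, options=None):
    """Decide a match from the pattern plus the 'match' option only."""
    match = (options or {}).get("match", "exact")
    if match == "exact":
        return uri == pattern
    if match == "prefix":
        return uri.startswith(pattern)
    if match == "wildcard":
        p, u = pattern.split("."), uri.split(".")
        # Only an empty component is wild; '*' is compared literally.
        return len(p) == len(u) and all(a == "" or a == b for a, b in zip(p, u))
    raise ValueError(f"unknown match policy: {match!r}")


wild = {"match": "wildcard"}
print(uri_matches("com.myapp..userevent", "com.myapp.foo.userevent", wild))
print(uri_matches("com.myapp.*.userevent", "com.myapp.foo.userevent", wild))
print(uri_matches("com.myapp.*.userevent", "com.myapp.*.userevent", wild))
```

The second call does not match, because * is an ordinary character here; only the third call, where the URI literally contains *, does.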
Ok, thanks.
@oberstet Does de.. also match de or is at least one more uri component needed like de.konsultaner? This is not exactly clear when reading the docs.
@oberstet One more question. What happens if I register a procedure de.konsultaner with a match wildcard? Must a wildcard matching registration contain a .. in the uri?
Hi @konsultaner !
Well, none of your examples are correct, sorry.
de.. means that the URI should consist of a minimum of 3 portions, not 2. A dot means that something more must follow it.
Rgd ur last question above:
Well, you are right, that is not precisely described in the spec, so it can be implemented differently. But to me it seems that it should be treated just like an exact match, as there are no placeholders inside and it is not a prefix. So the URI can only contain the de.konsultaner string and all
@konsultaner yes, agreed, the spec lacks in precision. I think, one of the easiest ways of improvement would be adding an agreed set of test URIs with expected pattern types to the spec.
I've had a look around in AB/CB, and we indeed have automated tests https://github.com/crossbario/crossbar/blob/master/crossbar/router/test/test_wildcard.py for wildcards and matching
what does your code produce when used with the cases from above test? do you have a list of test cases like above?
IMO, this would be quickest: sync up CI test cases between implementors, then discuss what is "right and wrong", and only then come up with additional spec text which tries to capture the semantics implied by the list.
@oberstet Oh ok. I totally misunderstood then. I thought .. indicates a wildcard. Happy refactoring 😄
- So basically an empty uri component indicates a wildcard uri
- the number of empty uri components indicates the minimum number of uri components to be filled in
So this means:
| uri | match | not match |
|---|---|---|
| empty | com, any.route.will.match | - |
| . | com.connectanum, com.connectanum.www | com |
| com. | com.connectanu.www.x.y.z | com |
| .www | connectanum.www, com.connectanum.www | www |
| com..www | com.connectanum.www, com.connectanum.x.www | com.www |
| com...www | com.connectanum.x.www | com.www, com.connectanum.www |
| .com..www. | de.com.connectanum.www.x | com.connectanum.www |
Am I correct?
@oberstet if we introduced a *-suffix for prefix matching, we could get rid of the match policy in the REGISTER and SUBSCRIBE. That would help in some of my classes. It would also make hybrid pattern matching possible. Something like de..www.xy* which could be beneficial to someone... maybe 😅
A leading or trailing . as well as a .. (or more) somewhere in the middle or a trailing * would indicate the prefix, wildcard or hybrid matching policy.
Have you ever thought of just allowing regex? Either by setting a fourth policy or by a leading ^ and trailing $. At least Java allows to precompile Pattern, that speeds up matching quite a lot.
Am I correct?
Maybe. I haven't thought manually through your examples. Too lazy, there is a unit test;) Have you run above unit test with these examples?
or a trailing * would indicate the prefix
the character * is a valid raw URI character in WAMP. eg ********************* is a valid, concrete WAMP URI
the character level design is defining the (an) absolute minimum. an empty URI component is only valid in URI patterns, and as such can already be used as a marker
so I would not want to add some additional * semantics - also the WAMP AP spec feature is "stable"
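As a sketch of the character-level rule, the two regular expressions below assume the loose rule mentioned earlier in the thread: a component may contain anything except whitespace, `.` and `#`. The names `LOOSE_URI` and `LOOSE_URI_PATTERN` are made up for this example.

```python
import re

# Concrete URI: one or more non-empty components separated by dots.
LOOSE_URI = re.compile(r"^([^\s.#]+\.)*([^\s.#]+)$")

# URI pattern: same rule, but components may also be empty.
LOOSE_URI_PATTERN = re.compile(r"^(([^\s.#]+\.)|\.)*([^\s.#]+)?$")

for candidate in ["*********************", "com.myapp.*.userevent", "com..www"]:
    print(
        candidate,
        bool(LOOSE_URI.match(candidate)),
        bool(LOOSE_URI_PATTERN.match(candidate)),
    )
```

Under these assumptions the two strings containing * are accepted as concrete URIs, while com..www is only accepted as a pattern because of its empty component.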
Have you ever thought of just allowing regex?
yes, it is not included by design. the design was following roughly this thinking: we want pattern based subs/regs in WAMP which is:
- standardized and predefined in the spec (BP or AP), and hence widely available and compatible for app devs
- powerful enough to cover most of the practically important use cases
- efficient to implement: worst case matching run-time of O(N*log(N)) for URI strings of length N
the 3 matching policies exact, prefix and wildcard all qualify for above.
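One way to avoid testing every registered wildcard pattern in turn is to index the patterns in a trie keyed by component. The sketch below is hypothetical, is not taken from any router, and makes no claim about meeting the bound quoted above. It assumes each empty component stands for exactly one component.

```python
_END = object()  # sentinel key: a pattern ends at this node


class WildcardIndex:
    """Hypothetical trie of wildcard patterns, keyed by URI component."""

    def __init__(self):
        self._root = {}

    def add(self, pattern):
        node = self._root
        for component in pattern.split("."):
            node = node.setdefault(component, {})
        node[_END] = pattern

    def match(self, uri):
        nodes = [self._root]
        for component in uri.split("."):
            # Follow the literal child and, if present, the wildcard ("") child.
            nodes = [
                node[key]
                for node in nodes
                for key in {component, ""}
                if key in node
            ]
            if not nodes:
                return []
        return sorted(node[_END] for node in nodes if _END in node)


index = WildcardIndex()
for pattern in ["com.myapp..userevent", "com...userevent", "com.other..userevent"]:
    index.add(pattern)
print(index.match("com.myapp.foo.userevent"))
```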
in crossbar, there are only 2 additions:
- you can choose between strict and loose URI processing. with "strict", URI components are valid identifiers (in regular programming languages)
- for wildcard matching, the router can automatically match and extract each individual URI component against eg "int" or "uuid" and such. pls have a look here https://github.com/crossbario/autobahn-python/blob/master/autobahn/wamp/test/test_wamp_uri_pattern.py
At least Java allows to precompile Pattern, that speeds up matching quite a lot.
can you match a string of length N against M regex patterns in worst case run-time O(N*log(N))?
fwiw: I don't think this is possible, in Java or in any other Turing machine .. but maybe I'm wrong
Maybe. I haven't thought manually through your examples. Too lazy, there is a unit test;) Have you run above unit test with these examples?
Some of my examples are not covered by your unit test. If you have a minute. would be nice if you could take a quick look at it 🙏?
so I would not want to add some additional * semantics - also the WAMP AP spec feature is "stable"
I get that. I think an additional * semantic would have some advantages.
- it is backward compatible (if * would be declared reserved, as # is)
- it would reduce the message size
- make some routing internals simpler -> less data to be stored/sorted out
- would make a mix of wildcard and prefix matching possible
I see it as an update, but ok, if you don't like it.
for wildcard matching, the router can automatically match and extract each individual URI component against eg "int" or "uuid" and such.
this is a really cool idea!
can you match a string of length N against M regex patterns in worst case run-time O(N*log(N))?
I have absolutely no idea. I've just added regex matching to my security layer that decides whether a callee/subscriber is allowed to register/subscribe, which is only indirectly related to this topic.
So well yes, I don't need regex matching 😉
would be nice if you could take a quick look at it
looking closer, assuming the first column is not URI, but URI pattern, the table misses the URI pattern type that is to be used for those ...
Crossbar docs for Wildcard matching, for reference: https://crossbar.io/docs/Pattern-Based-Subscriptions/#wildcard-matching
The way I currently interpret it, is that missing "components" before or after a dot are effectively the wildcards. I interpret "component" to be a non-empty string token with no dots.
Crossbar wildcard pattern matching tests, for reference: https://github.com/crossbario/crossbar/blob/master/crossbar/router/test/test_wildcard.py
Relevant snippet:
WILDCARDS = ['.', 'a..c', 'a.b.', 'a..', '.b.', '..', 'x..', '.x.', '..x', 'x..x',
             'x.x.', '.x.x', 'x.x.x']
MATCHES = {
    'abc': [],
    'a.b': ['.'],
    'a.b.c': ['a..c', 'a.b.', 'a..', '.b.', '..'],
    'a.x.c': ['a..c', 'a..', '..', '.x.'],
    'a.b.x': ['a.b.', 'a..', '.b.', '..', '..x'],
    'a.x.x': ['a..', '..', '.x.', '..x', '.x.x'],
    'x.y.z': ['..', 'x..'],
    'a.b.c.d': []
}
The wildcard matching logic in Crossbar uses a string split function to break the URIs into dot-separated tokens. An empty token is effectively a wildcard, as I currently understand it.
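The tokenizing approach described above can be sketched in a few lines of Python. This is my own reading of the test data, not Crossbar's code; `wildcard_match` and `matching_patterns` are hypothetical names, and `WILDCARDS` is copied from the snippet above.

```python
WILDCARDS = ['.', 'a..c', 'a.b.', 'a..', '.b.', '..', 'x..', '.x.', '..x', 'x..x',
             'x.x.', '.x.x', 'x.x.x']


def wildcard_match(pattern, uri):
    # Split both strings on dots; an empty pattern token matches any one token.
    p, u = pattern.split("."), uri.split(".")
    return len(p) == len(u) and all(a == "" or a == b for a, b in zip(p, u))


def matching_patterns(uri):
    return [p for p in WILDCARDS if wildcard_match(p, uri)]


for uri in ["abc", "a.b", "a.b.c", "x.y.z", "a.b.c.d"]:
    print(uri, matching_patterns(uri))
```

With this reading, the results agree with the MATCHES table above (compared as sets, since ordering may differ).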
An empty token is effectively a wildcard, as I currently understand it.
yes, this is what the spec calls "empty URI components"
https://wamp-proto.org/wamp_latest_ietf.html#name-uris
and those are only possible in URI patterns.
fwiw, there is also a bunch of URI related tests in autobahn, but those are more for the automatic client side processing of URI, eg you can register com.myapp.<product:int>.update and get your procedure called with a proper product parameter parsed as int
https://github.com/crossbario/autobahn-python/blob/master/autobahn/wamp/test/test_wamp_uri_pattern.py
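To give a feel for the idea, here is a hand-rolled sketch of typed URI components. It is not Autobahn's implementation or API; `CONVERTERS`, `compile_pattern` and `extract` are hypothetical names, and only the two converter types shown are supported.

```python
import re

# Hypothetical converter table: type name -> (regex, cast function).
CONVERTERS = {"int": (r"\d+", int), "string": (r"[^\s.#]+", str)}
_PLACEHOLDER = re.compile(r"^<([a-z_][a-z0-9_]*):([a-z]+)>$")


def compile_pattern(pattern):
    """Turn e.g. 'com.myapp.<product:int>.update' into a regex plus casts."""
    parts, casts = [], {}
    for component in pattern.split("."):
        m = _PLACEHOLDER.match(component)
        if m:
            name, kind = m.groups()
            regex, cast = CONVERTERS[kind]
            parts.append(f"(?P<{name}>{regex})")
            casts[name] = cast
        else:
            parts.append(re.escape(component))
    return re.compile(r"\.".join(parts)), casts


def extract(pattern, uri):
    """Return the typed parameters parsed from uri, or None if it does not match."""
    regex, casts = compile_pattern(pattern)
    m = regex.fullmatch(uri)
    if m is None:
        return None
    return {name: casts[name](value) for name, value in m.groupdict().items()}


print(extract("com.myapp.<product:int>.update", "com.myapp.123.update"))
print(extract("com.myapp.<product:int>.update", "com.myapp.abc.update"))
```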
@konsultaner
Am I correct?
I think you incorrectly assume that wildcards are "greedy" and can consume more than one "component", where "component" means the labels between the dots.
If wildcards were in fact greedy, then the a.b.c.d URI in the Crossbar tests would match the a.b. pattern (it doesn't). That's because a.b.c.d contains 4 components, whereas a.b. only contains 3 components with the last one being wild.
I think an empty wildcard pattern would match nothing if I follow the logic correctly, because it would have zero components. If a subscriber wants to match every possible topic, they could use prefix matching with an empty pattern.
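To make the greedy versus one-component distinction concrete, the sketch below implements both readings. `single_match` follows the behaviour implied by the Crossbar tests; `greedy_match` is a hypothetical interpretation, written only for comparison, in which an empty component may span one or more components.

```python
import re


def single_match(pattern, uri):
    # One empty pattern component matches exactly one URI component.
    p, u = pattern.split("."), uri.split(".")
    return len(p) == len(u) and all(a == "" or a == b for a, b in zip(p, u))


def greedy_match(pattern, uri):
    # Hypothetical reading: an empty component spans one or more components.
    parts = [
        r"[^.]+(?:\.[^.]+)*" if component == "" else re.escape(component)
        for component in pattern.split(".")
    ]
    return re.fullmatch(r"\.".join(parts), uri) is not None


cases = [
    ("a.b.", "a.b.c"),
    ("a.b.", "a.b.c.d"),
    ("com..www", "com.connectanum.www"),
    ("com..www", "com.connectanum.x.www"),
]
for pattern, uri in cases:
    print(pattern, uri, single_match(pattern, uri), greedy_match(pattern, uri))
```

The two readings agree on the first and third case and differ on the second and fourth, which is exactly where the table earlier in the thread goes wrong.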
BTW, I don't like the term "component" being used here because it has a different meaning for URLs. I personally use the term "token" in my code to mean the labels between the dots.
The one-component-per-wildcard mechanism that is implied by the Crossbar tests leads to a very trivial implementation for checking if it matches a pattern. It's like 7 lines of C++ code after the URI and pattern have been split into sub-strings:
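#include <string>
#include <vector>

using SplitUri = std::vector<std::string>;

// Returns true if every non-empty pattern token equals the URI token at the
// same position; empty pattern tokens act as wildcards.
bool uriMatchesWildcardPattern(const SplitUri& uri, const SplitUri& pattern)
{
    // Not yet tested
    auto uriSize = uri.size();
    if (uriSize != pattern.size())
        return false;
    for (SplitUri::size_type i = 0; i != uriSize; ++i)
        if (!pattern[i].empty() && uri[i] != pattern[i])
            return false;
    return true;
}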
@oberstet
efficient to implement: worst case matching run-time of O(N*log(N)) for URI strings of length N
~Did you really mean "O(N*log(M)) for URI strings of length N", where M is the number of patterns?~ Never mind, I was thinking of the wrong data structure.
BTW, I don't like the term "component" being used here because it has a different meaning for URLs. I personally use the term "token" in my code to mean the labels between the dots.
I guess I messed up with the name "URI component".
FWIW, here is what happened: I was looking for a word for "smallest unit of meaning" of a bigger thing, and in German this would be called "Wortbestandteil".
Which I split by its two part "Wort" = word => URI and "Bestandteil" = component.
I now learned that linguists call it "Monem" (French) or "Morpheme" (English).
https://en.wikipedia.org/wiki/Bound_and_free_morphemes
So if one takes "word == URI", the closest thing would be "free morpheme", provided every wildcard-matched part could also occur on its own. If not, it's a "bound morpheme".
Now, this is linguists' speak. What would a normal person call the individual words "fact" and "finding" in factfinding? Which is one word according to https://www.merriam-webster.com/legal/factfinding
Sorry if above seems exaggerated, somehow I find this interesting. Eg I also learned what's the real name of this feature of German language with the very long words: German is a synthetic language rather than analytic language (English) according to above Wikipedia page.
This is fascinating .. I wasn't aware English has such long words: Antidisestablishmentarianism.
It means: "the movement to prevent revoking the Church of England's status as the official church [of England, Ireland, and Wales]."
Right;)
I guess I messed up with the name "URI component".
"Component" makes sense purely from an English perspective, it's just that Berners-Lee used the term to mean the parts of a URL between the / slashes.
I've looked it up, and RFC1123 uses the word "label" for the parts between the dots in a hostname:
The DNS defines domain name syntax very generally -- a string of labels each containing up to 63 8-bit octets, separated by dots, and with a maximum total of 255 octets.
For example, www, google, and com would be labels in the hostname www.google.com.
I was using the term "token" because of the C strtok function that splits a string into what they call "tokens". Java also uses the same term with StringTokenizer.
So we should choose from:
- URI component (keep it as currently used)
- URI label (as in RFC1123)
- URI token (C/Java stdlib)
- URI morpheme (linguists)
?
I like "label" (RFC1123) a lot, would be what I'd choose now, but is it worth changing and confusing people even more by changing it now? If others think so, I'd be cool with that as well!
Is it worth changing and confusing people even more by changing it now?
I'm fine either way. I think just about everyone will understand the spec's current usage of "URI component" via context and the given examples. I was just nitpicking about the term.
@ecorm Thanks for clarifying my misunderstanding. I think the name wildcard is much more confusing than component. I think wildcard is associated with more variability, like .* from regex.
@oberstet I'm +1 on label to match RFC1123, which makes absolute sense.