Wildcard URIs

Open oberstet opened this issue 10 years ago • 35 comments

Wildcard URIs as described in the AP designate their wildcarded components by using "empty strings". This is confusing for many. A single wildcard "*" is what most would immediately recognize. Consider changing the spec.

oberstet avatar Aug 27 '15 20:08 oberstet

I was among those expecting * to be the wildcard token when I first read the spec. Seems more intuitive that way.

ecorm avatar Aug 28 '15 19:08 ecorm

I was expecting * as well. I vote for changing it.

jcelliott avatar Aug 28 '15 19:08 jcelliott

+1

konsultaner avatar Aug 29 '15 09:08 konsultaner

+1 - it's both what people expect and much less easy to overlook.

goeddea avatar Aug 29 '15 12:08 goeddea

By the way, if you used * for prefix matching as well, you could get rid of the SUBSCRIBE.Options.match|string option. In this case, a * with a leading or trailing . would indicate a wildcard, and a trailing * without a leading . would indicate a prefix.

Example:

com.myapp.topic.emergency*

for prefix:

com.myapp.topic.emergency.11
com.myapp.topic.emergency-low
com.myapp.topic.emergency.category.severe
com.myapp.topic.emergency

and

com.myapp.*.userevent

for wildcard:

com.myapp.foo.userevent
com.myapp.bar.userevent
com.myapp.a12.userevent
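
The convention proposed above could be sketched roughly as follows. This only illustrates the suggestion and is not part of the WAMP spec; `infer_policy` is a hypothetical helper name.

```python
def infer_policy(uri):
    """Guess (policy, pattern) from the '*' placement proposed above.

    Hypothetical helper: WAMP itself takes the policy from the
    'match' option, not from the URI.
    """
    components = uri.split(".")
    if "*" in components:
        # A '*' standing alone between dots marks a wildcard component.
        return "wildcard", uri
    if uri.endswith("*"):
        # A trailing '*' glued to the last component marks a prefix.
        return "prefix", uri[:-1]
    return "exact", uri


prefix_case = infer_policy("com.myapp.topic.emergency*")
wildcard_case = infer_policy("com.myapp.*.userevent")
exact_case = infer_policy("com.myapp.topic1")
```

Under this sketch a URI ending in `.*` is read as a wildcard rather than a prefix, which is the "leading dot" rule from the proposal.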

konsultaner avatar Sep 16 '15 12:09 konsultaner

By the way, if you used * for prefix matching as well, you could get rid of the SUBSCRIBE.Options.match|string option.

No, because wildcard matches only apply to URI components, whereas prefix matches do not.

com.myapp.* under a prefix match policy will match more URIs than under a wildcard match policy.

Encoding the match policy in the URI .. no.
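
A minimal sketch of that difference, written with the empty-component notation of the current spec and assuming each empty component stands for exactly one URI component (which is what the Crossbar tests quoted later in this thread imply). The function names are illustrative only.

```python
def prefix_match(uri, pattern):
    # Prefix matching works on raw characters and ignores component boundaries.
    return uri.startswith(pattern)


def wildcard_match(uri, pattern):
    # Wildcard matching works per component; an empty component matches any one.
    uri_parts = uri.split(".")
    pattern_parts = pattern.split(".")
    if len(uri_parts) != len(pattern_parts):
        return False
    return all(p == "" or p == u for u, p in zip(uri_parts, pattern_parts))


pattern = "com.myapp."
topics = ["com.myapp.foo", "com.myapp.foo.bar", "com.myapp.foo.bar.baz"]
by_prefix = [t for t in topics if prefix_match(t, pattern)]
by_wildcard = [t for t in topics if wildcard_match(t, pattern)]
```

Here `by_prefix` keeps all three topics, while `by_wildcard` keeps only `com.myapp.foo`, so the same string means different things under the two policies.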

oberstet avatar Sep 16 '15 12:09 oberstet

That makes sense. I didn't know that com.myapp. was a possible prefix value. I thought only com.myapp was allowed.

konsultaner avatar Sep 16 '15 12:09 konsultaner

Yes, com.myapp. is allowed for prefix matching, and the result is different from prefix matching com.myapp. Whether this fine distinction should be used in an app is a different question, but technically, it's possible.
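
To make the fine distinction concrete, here is a small sketch that treats prefix matching as a plain string-prefix test (an assumption for illustration; the topic names are made up).

```python
topics = ["com.myapp", "com.myapp.topic1", "com.myapp2.topic1"]

# Prefix "com.myapp." only reaches topics below the com.myapp component.
with_dot = [t for t in topics if t.startswith("com.myapp.")]

# Prefix "com.myapp" also catches com.myapp itself and sibling names such as com.myapp2.
without_dot = [t for t in topics if t.startswith("com.myapp")]
```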

oberstet avatar Sep 16 '15 12:09 oberstet

Looks like this is not reflected in the spec yet (I looked at https://github.com/wamp-proto/wamp-proto/blob/master/rfc/text/advanced/ap_pubsub_pattern_based_subscription.md). Is it agreed upon that wildcard will now use * (i.e. if I'm implementing WAMP today, should I support this instead of the empty component)?

And if so, I assume that * will now join #, . and spaces as invalid characters in URI components?

mna avatar Feb 26 '16 15:02 mna

@PuerkitoBio I would stick with current implementations: that is, * isn't special. Wildcard URI components are identified as being empty (zero-length) URI components. Whether a string is interpreted as exact URI, prefix matching or wildcard matching URI (pattern) is solely determined from the SUBSCRIBE.Options.match|string attribute. And this definitely won't change, as we can't get away with * alone (without match) option ..
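
As a rough sketch of what that means on the wire: the policy travels in the options dict of the SUBSCRIBE message, not in the topic string. The helper name `build_subscribe` and the request IDs are made up for illustration; the message layout `[SUBSCRIBE, Request|id, Options|dict, Topic|uri]` with type code 32 follows the spec.

```python
import json

SUBSCRIBE = 32  # WAMP message type code for SUBSCRIBE


def build_subscribe(request_id, topic, match="exact"):
    """Build a SUBSCRIBE message as a list (hypothetical helper)."""
    if match not in ("exact", "prefix", "wildcard"):
        raise ValueError("unknown match policy: %r" % (match,))
    # "exact" is the default, so the option can be omitted in that case.
    options = {} if match == "exact" else {"match": match}
    return [SUBSCRIBE, request_id, options, topic]


# Empty second-to-last component plus match="wildcard" makes this a pattern.
wildcard_sub = build_subscribe(1, "com.myapp..userevent", match="wildcard")

# Without the option, '*' is an ordinary character in a concrete topic URI.
literal_sub = build_subscribe(2, "com.myapp.*.userevent")

wire = json.dumps(wildcard_sub)
```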

oberstet avatar Feb 26 '16 16:02 oberstet

Ok, thanks.

mna avatar Feb 26 '16 16:02 mna

@oberstet Does de.. also match de or is at least one more uri component needed like de.konsultaner? This is not exactly clear when reading the docs.

konsultaner avatar Sep 05 '22 15:09 konsultaner

@oberstet One more question. What happens if I register a procedure de.konsultaner with a match wildcard? Must a wildcard matching registration contain a .. in the uri?

konsultaner avatar Sep 05 '22 15:09 konsultaner

Hi @konsultaner !

Well, none of your examples are correct, sorry.

de.. means that the URI should consist of a minimum of 3 portions, not 2. A dot means that something more should be placed after it.

Regarding your last question above: you are right, that is not precisely described in the spec, so it can be implemented differently. But to me it seems that it should be treated just like exact matching, as there are no placeholders inside and it is not a prefix. So the URI can only contain the de.konsultaner string, and that is all.

KSDaemon avatar Sep 05 '22 15:09 KSDaemon

@konsultaner yes, agreed, the spec lacks in precision. I think, one of the easiest ways of improvement would be adding an agreed set of test URIs with expected pattern types to the spec.

I've had a look around in AB/CB, and we indeed have automated tests https://github.com/crossbario/crossbar/blob/master/crossbar/router/test/test_wildcard.py for wildcards and matching

what does your code produce when used with the cases from above test? do you have a list of test cases like above?

IMO, this would be quickest: sync up CI test cases between implementors, then discuss what is "right and wrong", and only then come up with additional spec text which tries to capture the semantics implied by the list.

oberstet avatar Sep 05 '22 16:09 oberstet

@oberstet Oh ok. I totally misunderstood then. I thought .. indicates a wildcard. Happy refactoring 😄

  • So basically an empty uri component indicates a wildcard uri
  • the number of empty uri components indicates the minimum number of uri components to be filled in

So this means:

uri        | match                                      | not match
empty      | com, any.route.will.match                  | -
.          | com.connectanum, com.connectanum.www       | com
com.       | com.connectanu.www.x.y.z                   | com
.www       | connectanum.www, com.connectanum.www       | www
com..www   | com.connectanum.www, com.connectanum.x.www | com.www
com...www  | com.connectanum.x.www                      | com.www, com.connectanum.www
.com..www. | de.com.connectanum.www.x                   | com.connectanum.www

Am I correct?

konsultaner avatar Sep 06 '22 11:09 konsultaner

@oberstet if we introduced a *-suffix for prefix matching, we could get rid of the match policy in the REGISTER and SUBSCRIBE. That would help in some of my classes. It would also make hybrid pattern matching possible. Something like de..www.xy* which could be beneficial to someone... maybe 😅

A leading or trailing . as well as a .. (or more) somewhere in the middle or a trailing * would indicate the prefix, wildcard or hybrid matching policy.

Have you ever thought of just allowing regex? Either by setting a fourth policy or by a leading ^ and trailing $. At least Java allows to precompile Pattern, that speeds up matching quite a lot.

konsultaner avatar Sep 06 '22 12:09 konsultaner

Am I correct?

Maybe. I haven't thought manually through your examples. Too lazy, there is a unit test;) Have you run above unit test with these examples?

or a trailing * would indicate the prefix

the character * is a valid raw URI character in WAMP. eg ********************* is a valid, concrete WAMP URI

the character level design is defining the (an) absolute minimum. an empty URI component is only valid in URI patterns, and as such can already be used as a marker

so I would not want to add some additional * semantics - also the WAMP AP spec feature is "stable"
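
A small sketch of the character-level rules just described. The regular expressions are reproduced from the spec's URI section as best I recall them, so treat the exact patterns as an assumption; the function names are illustrative.

```python
import re

# Anchors are omitted because fullmatch() already requires a whole-string match.
LOOSE_URI = re.compile(r"([^\s\.#]+\.)*([^\s\.#]+)")
LOOSE_URI_PATTERN = re.compile(r"(([^\s\.#]+\.)|\.)*([^\s\.#]+)?")
STRICT_URI = re.compile(r"([0-9a-z_]+\.)*([0-9a-z_]+)")


def is_concrete_uri(uri, strict=False):
    """True if uri is a valid concrete URI (no empty components)."""
    rule = STRICT_URI if strict else LOOSE_URI
    return rule.fullmatch(uri) is not None


def is_uri_pattern(uri):
    """True if uri is acceptable as a pattern (empty components allowed)."""
    return LOOSE_URI_PATTERN.fullmatch(uri) is not None


stars = "*" * 21
stars_is_loose_uri = is_concrete_uri(stars)          # '*' is an ordinary character
stars_is_strict_uri = is_concrete_uri(stars, True)   # not an identifier character
empty_component_as_uri = is_concrete_uri("com..www")
empty_component_as_pattern = is_uri_pattern("com..www")
```

Under these rules the run of asterisks is a valid loose URI, while `com..www` is rejected as a concrete URI and accepted only as a pattern, which is why the empty component can serve as the marker.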

Have you ever thought of just allowing regex?

yes, it is not included by design. the design was following roughly this thinking: we want pattern based subs/regs in WAMP which is:

  • standardized and predefined in the spec (BP or AP), and hence widely available and compatible for app devs
  • powerful enough to cover most of the practically important use cases
  • efficient to implement: worst case matching run-time of O(N*log(N)) for URI strings of length N

the 3 matching policies exact, prefix and wildcard all qualify for above.
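
To illustrate the efficiency point for one of the three policies, here is a minimal sketch of a prefix index whose lookup walks at most one node per character of the URI, regardless of how many prefixes are registered. This is an illustrative data structure only, not how Crossbar or any other router is known to implement it; `PrefixTrie` and the observer names are hypothetical.

```python
class PrefixTrie:
    """Character trie mapping registered prefixes to sets of observers."""

    def __init__(self):
        self.children = {}
        self.observers = set()

    def add(self, prefix, observer):
        node = self
        for ch in prefix:
            node = node.children.setdefault(ch, PrefixTrie())
        node.observers.add(observer)

    def lookup(self, uri):
        """Collect observers of every registered prefix of uri."""
        node = self
        found = set(node.observers)  # the empty prefix matches everything
        for ch in uri:
            node = node.children.get(ch)
            if node is None:
                break
            found |= node.observers
        return found


index = PrefixTrie()
index.add("com.myapp", "sub1")
index.add("com.myapp.", "sub2")
index.add("", "sub3")

hits_topic = index.lookup("com.myapp.topic1")
hits_sibling = index.lookup("com.myapp2")
hits_other = index.lookup("org.example")
```

Exact matching is a single dictionary lookup; a comparable index for wildcard patterns is left out of this sketch.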

in crossbar, there are only 2 additions:

  • you can choose between strict and loose URI processing. with "strict", URI components are valid identifiers (in regular programming languages)
  • for wildcard matching, the router can automatically match and extract each individual URI component against eg "int" or "uuid" and such. pls have a look here https://github.com/crossbario/autobahn-python/blob/master/autobahn/wamp/test/test_wamp_uri_pattern.py

At least Java allows to precompile Pattern, that speeds up matching quite a lot.

can you match a string of length N against M regex patterns in worst case run-time O(N*log(N))?

fwiw: I don't think this is possible, in Java or in any other Turing machine .. but maybe I'm wrong

oberstet avatar Sep 06 '22 19:09 oberstet

Maybe. I haven't thought manually through your examples. Too lazy, there is a unit test;) Have you run above unit test with these examples?

Some of my examples are not covered by your unit test. If you have a minute, it would be nice if you could take a quick look at it 🙏

so I would not want to add some additional * semantics - also the WAMP AP spec feature is "stable"

I get that. I think an additional * semantic would have some advantages.

  1. it is backward compatible (if * were declared reserved, just as # is)
  2. it would reduce the message size
  3. make some routing internals simpler -> less data to be stored/sorted out
  4. would make a mix of wildcard and prefix matching possible

I see it as an update, but ok, if you don't like it.

for wildcard matching, the router can automatically match and extract each individual URI component against eg "int" or "uuid" and such.

this is a really cool idea!

can you match a string of length N against M regex patterns in worst case run-time O(N*log(N))?

I have absolutely no idea. I've just added regex matching into my security layer that decides if a callee/subscriber is allowed to register/subscribe. Which is just indirectly related to this topic.

So well yes, I don't need regex matching 😉

konsultaner avatar Sep 06 '22 19:09 konsultaner

would be nice if you could take a quick look at it

looking closer, assuming the first column is not URI, but URI pattern, the table misses the URI pattern type that is to be used for those ...

oberstet avatar Sep 06 '22 20:09 oberstet

Crossbar docs for Wildcard matching, for reference: https://crossbar.io/docs/Pattern-Based-Subscriptions/#wildcard-matching

The way I currently interpret it, is that missing "components" before or after a dot are effectively the wildcards. I interpret "component" to be a non-empty string token with no dots.

ecorm avatar Jan 04 '23 08:01 ecorm

Crossbar wildcard pattern matching tests, for reference: https://github.com/crossbario/crossbar/blob/master/crossbar/router/test/test_wildcard.py

Relevant snippet:

WILDCARDS = ['.', 'a..c', 'a.b.', 'a..', '.b.', '..', 'x..', '.x.', '..x', 'x..x',
             'x.x.', '.x.x', 'x.x.x']

MATCHES = {
    'abc': [],
    'a.b': ['.'],
    'a.b.c': ['a..c', 'a.b.', 'a..', '.b.', '..'],
    'a.x.c': ['a..c', 'a..', '..', '.x.'],
    'a.b.x': ['a.b.', 'a..', '.b.', '..', '..x'],
    'a.x.x': ['a..', '..', '.x.', '..x', '.x.x'],
    'x.y.z': ['..', 'x..'],
    'a.b.c.d': []
}

The wildcard matching logic in Crossbar uses a string split function to break the URIs into dot-separated tokens. An empty token is effectively a wildcard, as I currently understand it.
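
That reading can be checked against the test data above with a short sketch. This is not Crossbar's code, only a split-based matcher written under the assumption that an empty token matches exactly one token; `wildcard_match` and `matching_patterns` are illustrative names.

```python
WILDCARDS = ['.', 'a..c', 'a.b.', 'a..', '.b.', '..', 'x..', '.x.', '..x', 'x..x',
             'x.x.', '.x.x', 'x.x.x']

MATCHES = {
    'abc': [],
    'a.b': ['.'],
    'a.b.c': ['a..c', 'a.b.', 'a..', '.b.', '..'],
    'a.x.c': ['a..c', 'a..', '..', '.x.'],
    'a.b.x': ['a.b.', 'a..', '.b.', '..', '..x'],
    'a.x.x': ['a..', '..', '.x.', '..x', '.x.x'],
    'x.y.z': ['..', 'x..'],
    'a.b.c.d': []
}


def wildcard_match(uri, pattern):
    """True if uri and pattern have equally many tokens and each
    non-empty pattern token equals the corresponding uri token."""
    uri_tokens = uri.split(".")
    pattern_tokens = pattern.split(".")
    if len(uri_tokens) != len(pattern_tokens):
        return False
    return all(p == "" or p == u for u, p in zip(uri_tokens, pattern_tokens))


def matching_patterns(uri):
    return [p for p in WILDCARDS if wildcard_match(uri, p)]


# URIs for which the sketch disagrees with the expected lists (order ignored).
mismatches = {
    uri: matching_patterns(uri)
    for uri, expected in MATCHES.items()
    if sorted(matching_patterns(uri)) != sorted(expected)
}
```

`mismatches` being empty means the one-token-per-empty-component reading reproduces every expectation in the snippet.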

ecorm avatar Jan 04 '23 09:01 ecorm

An empty token is effectively a wildcard, as I currently understand it.

yes, this is what the spec calls "empty URI components"

https://wamp-proto.org/wamp_latest_ietf.html#name-uris

and those are only possible in URI patterns.

fwiw, there is also a bunch of URI related tests in autobahn, but those are more for the automatic client side processing of URI, eg you can register com.myapp.<product:int>.update and get your procedure called with a proper product parameter parsed as int

https://github.com/crossbario/autobahn-python/blob/master/autobahn/wamp/test/test_wamp_uri_pattern.py
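
The idea of typed URI components can be sketched without any library. The code below is not autobahn's implementation or API; `compile_pattern`, `extract` and the converter table are hypothetical, and only the `<name:type>` notation is taken from the example above.

```python
import re

CONVERTERS = {"int": int, "string": str}  # hypothetical subset of types


def compile_pattern(pattern):
    """Turn 'com.myapp.<product:int>.update' into per-component rules."""
    rules = []
    for comp in pattern.split("."):
        m = re.fullmatch(r"<(\w+)(?::(\w+))?>", comp)
        if m:
            name, type_name = m.group(1), m.group(2) or "string"
            rules.append((name, CONVERTERS[type_name]))
        else:
            rules.append((None, comp))  # literal component
    return rules


def extract(rules, uri):
    """Return a dict of parsed parameters, or None if uri does not match."""
    comps = uri.split(".")
    if len(comps) != len(rules):
        return None
    params = {}
    for (name, rule), comp in zip(rules, comps):
        if name is None:
            if comp != rule:
                return None
        else:
            try:
                params[name] = rule(comp)
            except ValueError:
                return None
    return params


rules = compile_pattern("com.myapp.<product:int>.update")
parsed = extract(rules, "com.myapp.123.update")
not_an_int = extract(rules, "com.myapp.abc.update")
wrong_suffix = extract(rules, "com.myapp.123.delete")
```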

oberstet avatar Jan 04 '23 09:01 oberstet

@konsultaner

Am I correct?

I think you incorrectly assume that wildcards are "greedy" and can consume more than one "component", where "component" means the labels between the dots.

If wildcards were in fact greedy, then the a.b.c.d URI in the Crossbar tests would match the a.b. pattern (it doesn't). That's because a.b.c.d contains 4 components, whereas a.b. only contains 3 components with the last one being wild.

I think an empty wildcard pattern would match nothing if I follow the logic correctly, because it would have zero components. If a subscriber wants to match every possible topic, they could use prefix matching with an empty pattern.
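
A quick sketch of both points, assuming split-based wildcard matching and plain string-prefix matching (illustrative function names). One caveat worth hedging: what an empty wildcard pattern matches depends on how the split treats the empty string. Python's `"".split(".")` yields one empty token rather than zero, so this particular sketch would let an empty pattern match any single-token URI unless that case is handled separately.

```python
def wildcard_match(uri, pattern):
    u, p = uri.split("."), pattern.split(".")
    return len(u) == len(p) and all(pc == "" or pc == uc for uc, pc in zip(u, p))


def prefix_match(uri, pattern):
    return uri.startswith(pattern)


# Not greedy: the trailing empty token stands for exactly one token.
greedy_check = wildcard_match("a.b.c.d", "a.b.")

# The same string used as a prefix does reach deeper URIs.
prefix_check = prefix_match("a.b.c.d", "a.b.")

# An empty prefix matches every topic.
match_all = all(prefix_match(t, "") for t in ["abc", "a.b", "a.b.c.d"])

# Split behaviour for the empty pattern in Python: one empty token.
empty_tokens = "".split(".")
```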

BTW, I don't like the term "component" being used here because it has a different meaning for URLs. I personally use the term "token" in my code to mean the labels between the dots.

The one-component-per-wildcard mechanism that is implied by the Crossbar tests leads to a very trivial implementation for checking if it matches a pattern. It's like 7 lines of C++ code after the URI and pattern have been split into sub-strings:

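#include <string>
#include <vector>

using SplitUri = std::vector<std::string>;

// Returns true if uri and pattern have the same number of tokens and every
// non-empty pattern token equals the corresponding uri token.
bool uriMatchesWildcardPattern(const SplitUri& uri, const SplitUri& pattern)
{
    // Not yet tested
    auto uriSize = uri.size();
    if (uriSize != pattern.size())
        return false;
    for (SplitUri::size_type i = 0; i != uriSize; ++i)
        if (!pattern[i].empty() && uri[i] != pattern[i])
            return false;
    return true;
}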

@oberstet

efficient to implement: worst case matching run-time of O(N*log(N)) for URI strings of length N

~Did you really mean "O(N*log(M)) for URI strings of length N", where M is the number of patterns?~ Never mind, thinking of the wrong data structure.

ecorm avatar Jan 05 '23 04:01 ecorm

BTW, I don't like the term "component" being used here because it has a different meaning for URLs. I personally use the term "token" in my code to mean the labels between the dots.

I guess I messed up with the name "URI component".

FWIW, here is what happened: I was looking for a word for "smallest unit of meaning" of a bigger thing, and in German this would be called "Wortbestandteil".

Which I split into its two parts: "Wort" = word => URI and "Bestandteil" = component.

I have now learned that linguists call it "Monem" (French) or "morpheme" (English).

https://en.wikipedia.org/wiki/Bound_and_free_morphemes

So if one takes "word == URI", the closest thing would be "free morpheme". If every wildcard matched part could occur on its own as well. If not, it's a "bound morpheme".

Now, this is linguists' speak. What would a normal person call the individual words "fact" and "finding" in factfinding? Which is one word according to https://www.merriam-webster.com/legal/factfinding

Sorry if above seems exaggerated, somehow I find this interesting. Eg I also learned what's the real name of this feature of German language with the very long words: German is a synthetic language rather than analytic language (English) according to above Wikipedia page.

oberstet avatar Jan 05 '23 05:01 oberstet

This is fascinating .. I wasn't aware English has such long words: Antidisestablishmentarianism.

It means: "the movement to prevent revoking the Church of England's status as the official church [of England, Ireland, and Wales]."

Right;)

oberstet avatar Jan 05 '23 05:01 oberstet

I guess I messed up with the name "URI component".

"Component" makes sense purely from an English perspective, it's just that Berners-Lee used the term to mean the parts of a URL between the / slashes.

I've looked it up, and RFC1123 uses the word "label" for the parts between the dots in a hostname:

The DNS defines domain name syntax very generally -- a string of labels each containing up to 63 8-bit octets, separated by dots, and with a maximum total of 255 octets.

For example, www, google, and com would be labels in the hostname www.google.com.

I was using the term "token" because of the C strtok function that splits a string into what they call "tokens". Java also uses the same term with StringTokenizer.

ecorm avatar Jan 05 '23 06:01 ecorm

So we should choose from:

  1. URI component (keep it as currently used)
  2. URI label (as in RFC1123)
  3. URI token (C/Java stdlib)
  4. URI morpheme (linguists)

?

I like "label" (RFC1123) a lot, would be what I'd choose now, but is it worth changing and confusing people even more by changing it now? If others think so, I'd be cool with that as well!

oberstet avatar Jan 05 '23 06:01 oberstet

Is it worth changing and confusing people even more by changing it now?

I'm fine either way. I think just about everyone will understand the spec's current usage of "URI component" via context and the given examples. I was just nitpicking about the term.

ecorm avatar Jan 05 '23 06:01 ecorm

@ecorm Thanks for clarifying my misunderstanding. I think the name wildcard is much more confusing than component. I think wildcard is associated with more variability, like .* from regex.

@oberstet I'm +1 on label to match RFC1123, which makes absolute sense.

konsultaner avatar Jan 05 '23 09:01 konsultaner