The behaviour attribute value doesn't specify its parsing.

Open buckett opened this issue 10 years ago • 14 comments

There's no specification of how the behaviour attribute's value should be parsed. How should strings, URIs and XPath expressions be quoted?

buckett avatar Mar 23 '15 14:03 buckett

When attempting to parse behaviour="cit(.,'uri://something')", it would be good to know how I should parse the arguments.

buckett avatar Mar 23 '15 14:03 buckett

in tei-pm, there should be a datatype for each parameter of a function. That should deal with this? XPaths are not quoted, strings are.

sebastianrahtz avatar Mar 24 '15 15:03 sebastianrahtz

For example how is a " escaped in a string? I'm guessing the existing implementation treats the function as an XSLT function and so the parsing rules are the same as XSLT function parsing rules.

buckett avatar Mar 24 '15 15:03 buckett

um. we have no idea! we don't know how we'd handle that in XSLT.

sebastianrahtz avatar Mar 24 '15 15:03 sebastianrahtz

So are strings assumed to be XML-encoded, so that a string of "Hello" said the policeman should be written as &quot;Hello&quot; said the policeman?

buckett avatar Mar 24 '15 15:03 buckett

That doesn't help you, because the XML parser expands the entities into Unicode anyway. I honestly don't know how to deal with this.

sebastianrahtz avatar Mar 24 '15 17:03 sebastianrahtz

This came about because an XPath expression may contain a comma (I think) so I was thinking about how to parse the function to extract out the 2 XPath expressions for alternate(xpath,xpath)

buckett avatar Mar 24 '15 17:03 buckett
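
A minimal sketch of the splitting problem described above, in Python. It assumes a convention the thread has not agreed on: arguments are separated only by commas that sit outside any quotes, parentheses or square brackets. The function name `split_behaviour` is hypothetical and is not part of TEI Simple or tei-pm.

```python
def split_behaviour(call: str) -> tuple[str, list[str]]:
    """Split 'name(arg, arg, ...)' into the name and its top-level arguments.

    Assumption: a comma only separates arguments when it is outside
    string literals and outside any (...) or [...] nesting.
    """
    name, sep, rest = call.strip().partition("(")
    if not sep or not rest.endswith(")"):
        raise ValueError(f"not a function call: {call!r}")
    body = rest[:-1]            # text between the outermost parentheses

    args: list[str] = []
    current: list[str] = []
    depth = 0                   # nesting depth of () and [] inside an argument
    quote = None                # the open quote character, if inside a literal

    for ch in body:
        if quote:
            current.append(ch)
            if ch == quote:     # a doubled quote closes and reopens at once
                quote = None
        elif ch in "'\"":
            quote = ch
            current.append(ch)
        elif ch in "([":
            depth += 1
            current.append(ch)
        elif ch in ")]":
            depth -= 1
            current.append(ch)
        elif ch == "," and depth == 0:
            args.append("".join(current).strip())
            current = []
        else:
            current.append(ch)

    if quote or depth != 0:
        raise ValueError(f"unbalanced quotes or brackets: {call!r}")
    tail = "".join(current).strip()
    if tail or args:
        args.append(tail)
    return name.strip(), args


print(split_behaviour("cit(.,'uri://something')"))
print(split_behaviour("alternate(concat(a, ','), b[position() = (1, 2)])"))
```

This handles commas nested inside predicates, function calls and string literals, but it cannot tell an argument separator from a top-level XPath sequence comma (`1, 2`), and it ignores XPath comments. Resolving those cases would need a rule in the specification itself.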

Ah, I see where you are going now, and I just met a similar problem. I just wrote behaviour="break('page',if (@n) then @n else @facs)" and it doesn't look right at all.

I am beginning to think we should change this spec to say that the XPath expression should be passed as a string, i.e. surrounded by quotes. Doesn't help with how to pass quotes, but does deal with the embedded comma.

sebastianrahtz avatar Mar 24 '15 22:03 sebastianrahtz

I suggest elements should be used instead of attributes (for behaviour and predicate). Otherwise I think this is going to be a source of endless pain. EDIT: on the other hand, if this stuff will typically be implemented in XSLT etc then perhaps it makes sense to use attributes, so that encoders are forced to write XPath expressions in a way that will work in XSLT, however awkward it may make certain expressions.

Conal-Tuohy avatar Mar 24 '15 23:03 Conal-Tuohy

Since this is XPath 2, we have the codepoints-to-string() function, but it's not pretty.

"concat(codepoints-to-string(34), 'Hello', codepoints-to-string(34), ' said the policeman')"

Conal-Tuohy avatar Mar 25 '15 00:03 Conal-Tuohy
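
XPath 2.0 string literals also allow the delimiter to appear inside the literal by doubling it (`''` inside a single-quoted literal, `""` inside a double-quoted one). The Python sketch below shows what quoting and unquoting would look like if the behaviour value adopted that rule, which the thread has not decided. The function names are hypothetical.

```python
def quote_xpath_string(value: str, delim: str = "'") -> str:
    """Wrap value as an XPath 2.0 string literal, doubling the delimiter."""
    if delim not in ("'", '"'):
        raise ValueError("delimiter must be a single or double quote")
    return delim + value.replace(delim, delim * 2) + delim


def unquote_xpath_string(literal: str) -> str:
    """Reverse quote_xpath_string; reject literals with a lone delimiter."""
    if len(literal) < 2 or literal[0] not in "'\"" or literal[-1] != literal[0]:
        raise ValueError(f"not a quoted string literal: {literal!r}")
    delim = literal[0]
    inner = literal[1:-1]
    # After removing every doubled delimiter, none may remain on its own.
    if delim in inner.replace(delim * 2, ""):
        raise ValueError(f"unescaped delimiter in literal: {literal!r}")
    return inner.replace(delim * 2, delim)


text = '"Hello" said the policeman'
single = quote_xpath_string(text)        # no doubling needed with ' as delimiter
double = quote_xpath_string(text, '"')   # inner " characters are doubled
print(single)
print(double)
print(unquote_xpath_string(double) == text)
```

Inside an XML attribute delimited by double quotes, any `"` in the literal would still have to be written as `&quot;` at the XML level. The XML parser expands it before the behaviour parser sees the value, so the two layers of escaping stay independent.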

It's a fair point, Conal. I don't want to change horses mid-race when the problem right now is checking functionality is there, but after we have a stable 1.0 using attributes, it would be a good idea to reconsider the choice of using attributes rather than element children.

sebastianrahtz avatar Mar 25 '15 20:03 sebastianrahtz

I've compared the TEI Simple DTD with the DTA schema. Simple is more generous than DTA, but DTA has the following elements that Simple does not allow for:

addName country foreName genName nameLink orgName persName roleName surname

Should we include them? I can see three different arguments in favour of doing so. First, DTA has been adopted by CLARIN as its base format. Other things being equal, there is a benefit if a text in that format validates under Simple.

Second, and perhaps more substantively, named entity extraction seems to be the chief, and often the only, thing that people are interested in when they work with texts.

Third, when I showed Simple to the Perseus folks, they were very interested in the processing model but objected to the exclusion of the name elements.

On the minus side, you can just use type attributes for subspecification of names, and Simple may run the risk of no longer being simple. Do we want to slide down that slippery slope?

martinmueller39 avatar Apr 08 '15 14:04 martinmueller39

I think we quite consciously made the decision to exclude 'syntactic sugar' options for types and subtypes of names, all for the sake of leaving the editor with precisely one way of encoding things. To accommodate DTA and other corpora we provided a conversion piece from 'general TEI' to 'Simple TEI' that converts all <addName> & co. into typed <name>. Funnily enough I can't find the conversion stylesheet on GitHub now.

tuurma avatar Apr 09 '15 09:04 tuurma
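
A rough Python sketch of the kind of conversion described above: specific name elements are rewritten as a generic `<name>` carrying a `type` attribute. This is not the actual conversion stylesheet. The `NAME_TYPES` mapping and the function name `simplify_names` are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical mapping from specific elements to @type values; the real
# stylesheet may use different values.
NAME_TYPES = {
    "persName": "person",
    "orgName": "org",
    "surname": "surname",
    "foreName": "forename",
}


def simplify_names(root: ET.Element) -> ET.Element:
    """Rename specific TEI name elements to <name type="..."> in place."""
    for el in root.iter():
        if not isinstance(el.tag, str):     # skip comments and PIs
            continue
        ns, _, local = el.tag.rpartition("}")
        if ns == "{" + TEI_NS and local in NAME_TYPES:
            el.tag = f"{{{TEI_NS}}}name"
            # Keep an existing @type rather than overwriting it.
            el.attrib.setdefault("type", NAME_TYPES[local])
    return root


doc = ET.fromstring(
    f'<p xmlns="{TEI_NS}"><persName>Ada</persName> at <orgName>RS</orgName></p>'
)
simplify_names(doc)
for name in doc.findall(f"{{{TEI_NS}}}name"):
    print(name.get("type"), name.text)
```

Going in this direction loses nothing as long as each specific element maps to a distinct `type` value, which is the argument for keeping only the generic form in Simple.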

the naming thing is hard. we can put back all the specific ones, but then we'd have to remove the generic @type version. would that actually be better? i.e. not to support <name> at all?

the conversion stylesheet is now in the TEI Stylesheets

sebastianrahtz avatar Apr 19 '15 11:04 sebastianrahtz