Transfer: Begin of sentence (or of text)

Open hectoralos opened this issue 4 years ago • 9 comments

It would be interesting to have some way to refer, in the selection, to the beginning of the text we are translating. For sentences other than the first, the full stop of the previous sentence marks the beginning of the sentence, but for the first sentence there is nothing to refer to. I am thinking of something like a default category that could be referenced in a pattern-item.

It is possible to do it using a variable, but it is complicated.

hectoralos avatar Oct 28 '21 18:10 hectoralos

I've done the ugly variable thing. Not recommended: you have to make sure to set the variable for every possible word. Would be nice to have something built-in.

unhammer avatar Oct 29 '21 10:10 unhammer

As I mentioned in https://github.com/apertium/apertium-lex-tools/issues/88, beginning-of-text is pretty easy but beginning-of-sentence probably has to be language-specific.

Here we have the added consideration of what happens if you try to manipulate the BOS token.

One thing we could do is insert a fake beginning-of-stream token like BOS<$^> and then replace it with empty string if anyone tries to actually output it. (Moving it would do weird things to blanks, but Don't Do That.)
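A minimal Python sketch of the fake-token idea, not of any existing Apertium code: the sentinel `^<bos>$`, the helper names, and the simplified regex (which ignores escaped characters in the stream) are all assumptions made for illustration.

```python
import re

# Hypothetical sentinel; the thread does not settle on a token name or tag.
BOS = "^<bos>$"

# Simplified lexical-unit matcher: ignores escaped ^ and $ in real streams.
LU = re.compile(r"\^.*?\$")


def lexical_units(stream: str) -> list:
    """Return the lexical units of a stream, dropping the blanks between them."""
    return LU.findall(stream)


def with_bos(stream: str) -> str:
    """Prepend the sentinel so rules have something to match before word one."""
    return BOS + stream


def without_bos(stream: str) -> str:
    """Replace any sentinel that reaches the output with the empty string."""
    return stream.replace(BOS, "")


def follows_bos(stream: str, index: int) -> bool:
    """True if the unit at `index` comes directly after the sentinel."""
    units = lexical_units(stream)
    return index > 0 and units[index - 1] == BOS
```

In this sketch a rule would treat the sentinel like any other pattern item, and the final output step calls `without_bos` so the token never appears in the translation.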

mr-martian avatar Nov 03 '21 13:11 mr-martian

We've had discussions before about delimiters in Apertium. CG has the concept of delimiters and injects an untouchable invisible pseudotoken >>> at the start of each sentence (either beginning-of-input or after-delimiter), precisely so that rules can see if they are at the start of a sentence. But the Apertium stream format doesn't yet have this concept.

TinoDidriksen avatar Nov 03 '21 14:11 TinoDidriksen

@mr-martian, you are right. Beginning of text is the real problem. If it could be easily referenced in the rules, it would be excellent.

hectoralos avatar Nov 03 '21 14:11 hectoralos

A possibly cleaner solution that I just came up with:

Have an element like <at-beginning/> that can be used in conditionals and evaluates to true if there are no words between the beginning of the current rule and \0 or the beginning of the stream and false otherwise.

Depending on how such position information is currently being used, this might be the simplest solution to implement (on the other hand, this might make things really messy for users, in which case some sort of <pattern-item/> is probably better).

mr-martian avatar Nov 06 '21 20:11 mr-martian

Actually, we could make it so that a certain <def-cat> could be marked as defining a delimiter, and whenever that element is seen, it resets <at-beginning/>. I'll have to think more about what happens if a rule applies across the delimiter, but that seems like a rare enough situation to me that it's probably ok if the result is a little weird.
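As a rough Python sketch of the flag-plus-delimiter behaviour described here: the function name, the representation of words as tag lists, and the default delimiter tag `sent` are assumptions, and the sketch works per word rather than per rule, so it says nothing about rules that span a delimiter.

```python
def at_beginning_flags(words, delimiter_tags=frozenset({"sent"})):
    """For each word (a list of tags), report whether it is at a beginning.

    A word is at a beginning if it is the first word of the stream or if
    the word right before it carries a delimiter tag.
    """
    flags = []
    at_beginning = True  # nothing precedes the first word of the stream
    for tags in words:
        flags.append(at_beginning)
        # A delimiter resets the flag for whatever follows it;
        # any other word clears it.
        at_beginning = any(tag in delimiter_tags for tag in tags)
    return flags
```

With the input `[["adv"], ["n", "sg"], ["sent"], ["det"], ["n"]]` the first word and the word after the `sent` delimiter are flagged, and the rest are not.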

mr-martian avatar Nov 06 '21 21:11 mr-martian

The cleaner way from a developer's point of view would be to be able to reference it in a def-cat clause. This way we could change:

    <def-cat n="sent">
      <cat-item tags="sent"/>
    </def-cat>

into:

    <def-cat n="sent">
      <cat-item tags="sent"/>
      <whatever/>
    </def-cat>

and everything would work as expected. An element that can be used in conditionals would be good too, but I am not sure I have ever used conditionals for sent.

What does @unhammer think about this?

hectoralos avatar Nov 07 '21 14:11 hectoralos

I mostly work in macros, so a general <at-beginning/> conditional would be confusing (which clip does it refer to now?).

Either of the following would be helpful:

  1. an attribute on def-cats:

         <def-cat n="adv-at-beginning">
           <cat-item tags="adv" at-beginning="true"/>
           <cat-item tags="adj.nt.*" at-beginning="true"/>
         </def-cat>

  2. a new test, but referring to a pattern index: <test><at-beginning pos="1"/></test>
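To make the first option concrete, here is a hedged Python sketch of how a cat-item carrying the proposed `at-beginning` attribute might be matched. `CatItem`, `in_category`, and the tag-matching rule are hypothetical; in particular, the treatment of a trailing `*` is a simplification and may differ from what the real transfer matcher does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatItem:
    tags: str                   # e.g. "adv" or "adj.nt.*"
    at_beginning: bool = False  # hypothetical attribute from the proposal

    def matches(self, word_tags, word_at_beginning):
        # The positional constraint is checked first: an item marked
        # at-beginning never matches a word elsewhere in the sentence.
        if self.at_beginning and not word_at_beginning:
            return False
        wanted = self.tags.split(".")
        if wanted[-1] == "*":
            # Simplification: a trailing * accepts any remaining tags.
            prefix = wanted[:-1]
            return word_tags[:len(prefix)] == prefix
        return word_tags == wanted


# Mirrors the adv-at-beginning def-cat from option 1.
ADV_AT_BEGINNING = [CatItem("adv", True), CatItem("adj.nt.*", True)]


def in_category(items, word_tags, word_at_beginning):
    """True if any cat-item of the category matches the word."""
    return any(item.matches(word_tags, word_at_beginning) for item in items)
```

The point of the sketch is that the position check lives on the category itself, so a macro never has to ask which clip a free-standing conditional refers to.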

unhammer avatar Nov 08 '21 12:11 unhammer

The option of an attribute on def-cats works for me, provided tags can be omitted. This way we could have:

    <def-cat n="sent">
      <cat-item tags="sent"/>
      <cat-item at-beginning="true"/>
    </def-cat>

Another option would be:

    <def-cat n="sent">
      <cat-item tags="sent"/>
      <cat-item tags="*" at-beginning="true"/>
      <cat-item tags="" at-beginning="true"/>
    </def-cat>

As for the test, I have no strong opinion on it. OK for me.

hectoralos avatar Nov 08 '21 13:11 hectoralos