text-fabric A new notion of adjacency

Suppose you want to look for the first clause after an earlier clause, but not necessarily tightly adjacent. Can we define a spatial operator for that?

This is in response to an ongoing request by Cody Kingham.

May 23 '19 09:05 dirkroorda

Here is a proposal: the relational operators <~, ~>, ~~, defined as follows

x <~ y holds for those nodes n and m where n is an instance of x, and m is a first instance of y that lies completely past n.

m is a first instance of y means that there is no other instance of y that starts before m.

We could also say:

m is a first instance of y, meaning that there is no other instance of y that precedes m in the canonical ordering.

May 23 '19 09:05 dirkroorda

example

word lex=L
<~ subphrase

It could be that we have multiple subphrases after the word.

Consider this case: the word has slot 1, and there are subphrases with slots as follows:

s1 = {1, 2, 3} s2 = {1, 2, 3} s3 = {1, 2} s4 = {1, 3}

May 23 '19 09:05 dirkroorda

Then there is more than one first subphrase after the word, however you define it.

In the canonical ordering, s1 and s2 precede s3 and s4.

Maybe the best proposal is to let all of s1, s2, s3, s4 qualify

We could also make operators

L.b(node, nodeType or set) (before)

and

L.a(node, nodeType or set) (after)

which select the first node in a nodeType or set before or after the reference node, walking from the reference node.

Then we can return all candidates in the canonical order, like the other L operators do.

May 23 '19 09:05 dirkroorda

Back to <~

It works also for custom sets:

myword <~ mysubphrase

May 23 '19 09:05 dirkroorda

But <~ and ~> are not each others converse.

If a certain subphrase is the first one after a certain word,

then

that word does not have to be the first word before that subphrase.

w1 w2 w3 w4
-- -- s1 s1

If s1 has slots w3 and w4, and w2 is not in any subphrase, then s1 is the first subphrase after w1, but w2 is the first word before s1.

May 23 '19 09:05 dirkroorda

So it makes sense to also have ~~ meaning both:

word ~~ subphrase

matches (w, s) such that s is the first subphrase after w and w is the first word before s

May 23 '19 09:05 dirkroorda

Still there is a subtlety that keeps puzzling me

word lex=L <~ word lex=M

could be interpreted in two ways:

(A) (n, m) is a result if n is a word with lex=L and m is the first word with lex=M after n

(B) (n, m) is a result if n is a word with lex=L and m is the first word after n and m has lex=M

Which one makes more "algebraic sense"?

May 23 '19 09:05 dirkroorda

If we go with A, then it will make it complicated to specify features while finding adjacent objects. I feel like B makes more sense intuitively. Algebra is not my strong suit.

May 23 '19 10:05 codykingham

On my bike ride I also felt that B is better. B is more a spatial relationship, whereas A needs arbitrary conditions on the nodes in terms of features.

Moreover, I think you can use quantifiers to approximate A. Is it possible to express B with quantifiers, I wonder? Not easily if you cannot refer to multiple parent nodes.

May 23 '19 18:05 dirkroorda

I have been thinking what is needed for computing these relations efficiently, both for L.a L.b and in queries. You need to pre-compute for each slot the list of nodes that are first after that slot among their brethren of the same type. That will cost a bit of time once in a while, it will take a few 100 MB extra in RAM, but since we have saved a lot recently, that must be doable. If you have custom sets in your query, then TF has to pre-compute similar data for those sets on the fly as preparation to run the query. I think this is all doable.

May 23 '19 18:05 dirkroorda

Maybe we can chunk this task in smaller bits and ask people to work on it. For example, the precompute step is a nice task for a medium-experienced python programmer.

Task: with one run over all nodes in canonical order, produce indexes that give for each slot and each node type the first node of that type before and after that slot.

May 23 '19 18:05 dirkroorda

These relations are innovative. All earlier relations can be defined in terms of nodes and the slots they are linked to.

The new ones need something more: the concept that a node can be part of a set. The relations are defined by stipulating that a node is the first or last node in its set with a certain property.

That makes me think, can we do the same trick with other spatial relationships?

May 23 '19 19:05 dirkroorda

Yes. We can see ~ as an operator that modifies relations.

n < m means: n comes before m in the canonical ordering

A query

clause <~ phrase

then looks for nodes c,p such that p is the first phrase after clause c.

And n << m means that all slots of n lie before the first slot of m

A query

clause <<~ phrase

selects clauses with the first phrase completely after it.

May 23 '19 19:05 dirkroorda

Then

clause <:~ phrase

means the first phrase that is adjacent after the preceding clause. In theory, there can be multiple phrases adjacent to the same clause, if phrases can be embedded in each other.

May 23 '19 19:05 dirkroorda

The # operator means: unequal as nodes, i.e. not the same node.

phrase #~ phrase

means: a phrase and the first other phrase in the dataset. Not very meaningful.

May 23 '19 19:05 dirkroorda

Embedding

n [[ m means n embeds m.

book [[~ chapter

looks for books with their first embedded chapter. This includes books with non-chapter material before the first chapter.

Shorthand with indentation:

book ~ chapter

And for a book with its last chapter

book chapter ~

May 23 '19 19:05 dirkroorda

I think these are the relations that could be augmented with ~

<~ ~<
>~ ~>
<<~ ~<<
>>~ ~>>
[[~ ~[[
]]~ ~]]

May 23 '19 19:05 dirkroorda

What does

phrase ~<< clause

mean?

The last phrase before the clause.

The same as

clause
>>~ phrase

May 23 '19 19:05 dirkroorda

But maybe this is getting over the top.

May 23 '19 19:05 dirkroorda

I think only the << and [[ operators and their opposites are candidates for the ~ operator. We could replace on of the characters by ~ resultiing in

<~ ~>
>~ ~<
[~ ~]
]~ ~[

May 24 '19 05:05 dirkroorda

Somehow the urgency for this seems to be lost.

Sep 12 '22 11:09 dirkroorda

text-fabric text-fabric copied to clipboard

A new notion of adjacency

text-fabric
text-fabric copied to clipboard