text-fabric
text-fabric copied to clipboard
A new notion of adjacency
Suppose you want to look for the first clause after an earlier clause, but not necessarily tightly adjacent. Can we define a spatial operator for that?
This is in response to an ongoing request by Cody Kingham.
Here is a proposal: the relational operators <~, ~>, ~~, defined as follows
x <~ y holds for those nodes n and m where n is an instance of x, and m is a first instance of y that lies completely past n.
m is a first instance of y means that there is no other instance of y that starts before m.
We could also say:
m is a first instance of y, meaning that there is no other instance of y that precedes m in the canonical ordering.
example
word lex=L
<~ subphrase
It could be that we have multiple subphrases after the word.
Consider this case: the word has slot 1, and there are subphrases with slots as follows:
s1 = {1, 2, 3} s2 = {1, 2, 3} s3 = {1, 2} s4 = {1, 3}
Then there is more than one first subphrase after the word, however you define it.
In the canonical ordering, s1 and s2 precede s3 and s4.
Maybe the best proposal is to let all of s1, s2, s3, s4 qualify
We could also make operators
L.b(node, nodeType or set) (before)
and
L.a(node, nodeType or set) (after)
which select the first node in a nodeType or set before or after the reference node, walking from the reference node.
Then we can return all candidates in the canonical order, like the other L operators do.
Back to <~
It works also for custom sets:
myword <~ mysubphrase
But <~ and ~> are not each others converse.
If a certain subphrase is the first one after a certain word,
then
that word does not have to be the first word before that subphrase.
w1 w2 w3 w4
-- -- s1 s1
If s1 has slots w3 and w4, and w2 is not in any subphrase, then s1 is the first subphrase after w1, but w2 is the first word before s1.
So it makes sense to also have ~~ meaning both:
word ~~ subphrase
matches (w, s) such that s is the first subphrase after w and w is the first word before s
Still there is a subtlety that keeps puzzling me
word lex=L <~ word lex=M
could be interpreted in two ways:
(A) (n, m) is a result if n is a word with lex=L and m is the first word with lex=M after n
(B) (n, m) is a result if n is a word with lex=L and m is the first word after n and m has lex=M
Which one makes more "algebraic sense"?
If we go with A, then it will make it complicated to specify features while finding adjacent objects. I feel like B makes more sense intuitively. Algebra is not my strong suit.
On my bike ride I also felt that B is better. B is more a spatial relationship, whereas A needs arbitrary conditions on the nodes in terms of features.
Moreover, I think you can use quantifiers to approximate A. Is it possible to express B with quantifiers, I wonder? Not easily if you cannot refer to multiple parent nodes.
I have been thinking what is needed for computing these relations efficiently, both for L.a L.b and in queries. You need to pre-compute for each slot the list of nodes that are first after that slot among their brethren of the same type. That will cost a bit of time once in a while, it will take a few 100 MB extra in RAM, but since we have saved a lot recently, that must be doable. If you have custom sets in your query, then TF has to pre-compute similar data for those sets on the fly as preparation to run the query. I think this is all doable.
Maybe we can chunk this task in smaller bits and ask people to work on it. For example, the precompute step is a nice task for a medium-experienced python programmer.
Task: with one run over all nodes in canonical order, produce indexes that give for each slot and each node type the first node of that type before and after that slot.
These relations are innovative. All earlier relations can be defined in terms of nodes and the slots they are linked to.
The new ones need something more: the concept that a node can be part of a set. The relations are defined by stipulating that a node is the first or last node in its set with a certain property.
That makes me think, can we do the same trick with other spatial relationships?
Yes. We can see ~ as an operator that modifies relations.
n < m means: n comes before m in the canonical ordering
A query
clause <~ phrase
then looks for nodes c,p such that p is the first phrase after clause c.
And n << m means that all slots of n lie before the first slot of m
A query
clause <<~ phrase
selects clauses with the first phrase completely after it.
Then
clause <:~ phrase
means the first phrase that is adjacent after the preceding clause. In theory, there can be multiple phrases adjacent to the same clause, if phrases can be embedded in each other.
The # operator means: unequal as nodes, i.e. not the same node.
phrase #~ phrase
means: a phrase and the first other phrase in the dataset. Not very meaningful.
Embedding
n [[ m means n embeds m.
book [[~ chapter
looks for books with their first embedded chapter. This includes books with non-chapter material before the first chapter.
Shorthand with indentation:
book ~ chapter
And for a book with its last chapter
book chapter ~
I think these are the relations that could be augmented with ~
<~ ~<
>~ ~>
<<~ ~<<
>>~ ~>>
[[~ ~[[
]]~ ~]]
What does
phrase ~<< clause
mean?
The last phrase before the clause.
The same as
clause
>>~ phrase
But maybe this is getting over the top.
I think only the << and [[ operators and their opposites are candidates for the ~ operator. We could replace on of the characters by ~ resultiing in
<~ ~>
>~ ~<
[~ ~]
]~ ~[
Somehow the urgency for this seems to be lost.