Fix Index Conversion from Text to TextRef

Open jbdyn opened this issue 7 months ago • 3 comments

Hey @davidbrochart :wave:

I played around with some emojis in Text and noticed that insertion works differently than expected:

🐍 test script
from pycrdt import Doc, Text

## setup

ydoc = Doc()
ytext = Text()
ydoc["text"] = ytext

state = ""            # track state of ytext


def callback(event):
    """Print change record"""
    global state

    new_state = str(event.target)
    delta = str(event.delta)
    print(f"{delta}: '{state}' -> '{new_state}'")

    # update current state
    state = new_state


ytext.observe(callback)


## Manipulate Text

print("Insert and delete single emoji '🌴'")
# works as expected
ytext.insert(0, "🌴")
assert state == "🌴"

# the index passed in counts Unicode code points,
# but the delta reported to the callback counts UTF-8 bytes
del ytext[0:1]
assert state == ""

print("\nInsert '🌴abcde' sequentially")
for c, char in enumerate("🌴abcde"):
    ytext.insert(c, char)
assert state == "🌴abcde"

Output:

Insert and delete single emoji '🌴'
[{'insert': '🌴'}]: '' -> '🌴'
[{'delete': 4}, {'insert': ''}]: '🌴' -> ''

Insert '🌴abcde' sequentially
[{'insert': '🌴'}]: '' -> '🌴'
[{'retain': 4}, {'insert': 'a'}]: '🌴' -> '🌴a'
[{'retain': 4}, {'insert': 'b'}]: '🌴a' -> '🌴ba'
[{'retain': 4}, {'insert': 'c'}]: '🌴ba' -> '🌴cba'
[{'retain': 4}, {'insert': 'd'}]: '🌴cba' -> '🌴dcba'
[{'retain': 5}, {'insert': 'e'}]: '🌴dcba' -> '🌴decba'

In the Python code, the index is given in Unicode code points. However:

TextRef structure internally uses UTF-8 encoding and its length is described in a number of bytes rather than individual characters

[source]

So I gave some thought to converting the given index into an offset in the UTF-8 encoded string. With this PR, the output becomes:

Insert and delete single emoji '🌴'
[{'insert': '🌴'}]: '' -> '🌴'
[{'delete': 4}]: '🌴' -> ''

Insert '🌴abcde' sequentially
[{'insert': '🌴'}]: '' -> '🌴'
[{'retain': 4}, {'insert': 'a'}]: '🌴' -> '🌴a'
[{'retain': 5}, {'insert': 'b'}]: '🌴a' -> '🌴ab'
[{'retain': 6}, {'insert': 'c'}]: '🌴ab' -> '🌴abc'
[{'retain': 7}, {'insert': 'd'}]: '🌴abc' -> '🌴abcd'
[{'retain': 8}, {'insert': 'e'}]: '🌴abcd' -> '🌴abcde'
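The index conversion described above can be sketched in pure Python. This is only an illustration of the idea, not necessarily how the PR implements it; the helper name `codepoint_to_utf8_offset` is hypothetical, and it assumes a non-negative index into the current text.

```python
def codepoint_to_utf8_offset(text: str, index: int) -> int:
    """Return the UTF-8 byte offset of code point position `index` in `text`.

    Hypothetical helper for illustration; assumes `index` is non-negative.
    """
    # Encode only the prefix before `index` and count its bytes.
    return len(text[:index].encode("utf-8"))


current = "🌴abc"
# '🌴' is 1 code point but 4 UTF-8 bytes, so index 1 maps to byte offset 4.
print(codepoint_to_utf8_offset(current, 0))  # 0
print(codepoint_to_utf8_offset(current, 1))  # 4
print(codepoint_to_utf8_offset(current, 4))  # 7
```

The last value matches the `{'retain': 7}` reported above when 'd' is inserted at index 4 into '🌴abc'.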

However, I am not sure how to handle the numbers reported in event.delta for a TextEvent: they are also based on the UTF-8 encoded form, so they can be off relative to the Python string representation. (My use case: keeping Text in sync with the contents of the Textual TextArea widget.)
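If the user had to handle this in their own code, one option might look like the sketch below. It assumes the delta is a list of dicts shaped like the printed output above, that the text before the change is known (as `state` is in the test script), and that all byte lengths fall on code point boundaries. The function name `delta_to_codepoints` is hypothetical.

```python
def delta_to_codepoints(delta, old_text):
    """Convert 'retain'/'delete' byte lengths in a delta to code point counts.

    Hypothetical helper: `delta` is assumed to be a list of dicts such as
    [{'retain': 4}, {'insert': 'a'}], and `old_text` the text before the change.
    """
    data = old_text.encode("utf-8")
    pos = 0  # byte cursor into the old text
    converted = []
    for op in delta:
        key = next((k for k in ("retain", "delete") if k in op), None)
        if key is None:
            # 'insert' operations consume nothing from the old text.
            converted.append(dict(op))
            continue
        chunk = data[pos:pos + op[key]]
        new_op = dict(op)  # keep any other keys untouched
        # Decoding raises UnicodeDecodeError if the span splits a code point.
        new_op[key] = len(chunk.decode("utf-8"))
        converted.append(new_op)
        pos += op[key]
    return converted


print(delta_to_codepoints([{"retain": 4}, {"insert": "a"}], "🌴"))
print(delta_to_codepoints([{"delete": 4}], "🌴"))
```

With the inputs shown, the retain of 4 bytes and the delete of 4 bytes each become a count of 1, which is what a Python string (or a widget indexed by characters) would expect.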

Should users handle that in their own code? Should Text try to report the numbers in terms of the Python string representation? Or should Text be capable of handling rich text, as TextRef does:

TextRef offers a rich text editing capabilities (it’s not limited to simple text operations). Actions like embedding objects, binaries (eg. images) and formatting attributes are all possible using TextRef.

[source]

I also thought about limiting Text to inserted values for which len(val) == len(val.encode()), but this does not feel right to me.
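To make the cost of that restriction concrete: the condition holds only when every code point encodes to a single UTF-8 byte, which means ASCII-only text. A small check, with the hypothetical helper name `is_single_byte_only`:

```python
def is_single_byte_only(val: str) -> bool:
    """True if `val` has as many code points as UTF-8 bytes (hypothetical helper)."""
    return len(val) == len(val.encode())


# "na\u00efve" contains U+00EF (2 bytes in UTF-8); "🌴" takes 4 bytes.
for sample in ["abcde", "na\u00efve", "🌴"]:
    # The condition agrees with str.isascii() for each sample.
    print(repr(sample), is_single_byte_only(sample), sample.isascii())
```

So such a limit would reject not only emojis but also accented letters and most non-Latin scripts.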

jbdyn, Jun 29 '24 05:06