google-api-python-client

Google doc dates returned as unicode (e.g., \ue907)

Open nick-youngblut opened this issue 11 months ago • 2 comments

Example code:

import os
from google.oauth2.service_account import Credentials  # from_service_account_file lives on the service_account class
from googleapiclient.discovery import build
from google.auth import default

def get_document_dates(doc_id, creds_file=None):
    scopes = ['https://www.googleapis.com/auth/documents.readonly']
    if creds_file and os.path.exists(creds_file):
        creds = Credentials.from_service_account_file(creds_file, scopes=scopes)
    else:
        creds, project = default(scopes=scopes)
    
    # Build the Docs API service
    service = build('docs', 'v1', credentials=creds)
    
    # Get the document
    document = service.documents().get(
        documentId=doc_id,
        fields='body'  
    ).execute()
    
    # Access the document's content
    content = document.get('body').get('content')
    
    # Process each element
    for element in content:
        if 'paragraph' in element:
            paragraph = element.get('paragraph')
            elements = paragraph.get('elements', [])
            
            for elem in elements:
                print(elem)

The first section of the doc:

Image

I want to parse the date via the Python API: Jan 13, 2025.

The first few elements printed:

{'startIndex': 1, 'endIndex': 5, 'textRun': {'content': '\ue907 | ', 'textStyle': {}}}
{'startIndex': 5, 'endIndex': 6, 'richLink': {'richLinkId': 'kix.p3Xj3hkh7bXl', 'textStyle': {}, 'richLinkProperties': {'title': 'Asana Board New NGS Submissions', 'uri': 'https://www.google.com/calendar/event?eid=XXX'}}}
{'startIndex': 6, 'endIndex': 7, 'textRun': {'content': '\n', 'textStyle': {}}}
{'startIndex': 7, 'endIndex': 18, 'textRun': {'content': 'Attendees: ', 'textStyle': {}}}

The date is returned in the first element as the single character \ue907. How can that be converted to a date?

Note: there is a richLinkId in the second element, but that is for a separate calendar element, and not the Jan 13, 2025 date element.

More generally, why are date elements returned as unicode instead of something easier to work with?

nick-youngblut, Jan 13 '25 19:01

I believe (though I cannot find it documented anywhere) that Docs uses Private Use Area Unicode characters as placeholders for special elements like chips and code blocks.
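To check that reading against the output above, here is a minimal sketch that flags Private Use Area characters in the textRun elements. It relies only on the standard-library unicodedata module, where Private Use characters have the general category "Co". The helper names find_private_use_chars and flag_placeholder_runs are made up for illustration, and the sample list is copied from the elements printed in the report. Note that this only locates the placeholder; it does not recover the date, which does not appear anywhere in the response shown.

```python
import unicodedata


def find_private_use_chars(text):
    """Return (index, code point) pairs for Private Use characters in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"  # "Co" = Other, private use
    ]


def flag_placeholder_runs(elements):
    """Yield (startIndex, hits) for each textRun that contains such characters."""
    for elem in elements:
        content = elem.get("textRun", {}).get("content", "")
        hits = find_private_use_chars(content)
        if hits:
            yield elem.get("startIndex"), hits


# Two of the elements printed in the original report.
sample = [
    {'startIndex': 1, 'endIndex': 5, 'textRun': {'content': '\ue907 | ', 'textStyle': {}}},
    {'startIndex': 7, 'endIndex': 18, 'textRun': {'content': 'Attendees: ', 'textStyle': {}}},
]

print(list(flag_placeholder_runs(sample)))
# [(1, [(0, 'U+E907')])]
```

U+E907 falls in the Basic Multilingual Plane Private Use Area (U+E000 to U+F8FF), so the character carries no standard meaning and cannot be decoded into a date on its own.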

eseidohl, Feb 06 '25 17:02

While this issue is about Docs, it looks like the feature is not available in Sheets as of this writing: https://stackoverflow.com/questions/79331123/how-to-extract-both-name-and-link-from-google-sheets-smart-chip-place-using-ap

eseidohl, Feb 06 '25 17:02

Thanks for reporting this issue! This sounds very much like an API endpoint issue rather than a client library issue; you would probably get the same response if you issued a curl request directly from the terminal.
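One way to test that without the client library is to call the Docs REST endpoint directly. The following is a rough sketch using only the standard library; it assumes you already hold a valid OAuth 2.0 access token with a Docs read scope (obtaining one is out of scope here), and the function names build_raw_get_request and fetch_document_raw are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# REST endpoint for documents.get in the Docs API v1.
DOCS_ENDPOINT = "https://docs.googleapis.com/v1/documents/"


def build_raw_get_request(doc_id, access_token, fields="body"):
    """Build the same GET request the client library would issue, by hand."""
    query = urllib.parse.urlencode({"fields": fields})
    url = f"{DOCS_ENDPOINT}{urllib.parse.quote(doc_id, safe='')}?{query}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {access_token}"}
    )


def fetch_document_raw(doc_id, access_token):
    """Send the request and return the parsed JSON response (needs network access)."""
    with urllib.request.urlopen(build_raw_get_request(doc_id, access_token)) as resp:
        return json.load(resp)
```

If the first textRun in the response from fetch_document_raw still contains \ue907, the placeholder is coming from the service itself and not from google-api-python-client.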

I suggest following the guidance on the Docs support page to check whether this issue has surfaced before, and filing an issue with the service team if needed.

Thanks!

vchudnov-g, Apr 24 '25 18:04