Save and read Document object from JSON file

Open paulthemagno opened this issue 4 years ago • 8 comments

I need to save the analysis of a text to a JSON file and later read it back from the file as a Document object. I see a few methods that might help:

  • The best option for me would be the to_dict() method, since it saves the document in a readable form in the JSON:
import json
import stanza

nlp = stanza.Pipeline('en')
doc = nlp("Barack Obama was born in Hawaii.  He was elected president in 2008.")
with open("output.json","w") as json_file:
    json.dump(doc.to_dict(), json_file, indent=4, ensure_ascii = False)

Then I could read it back like this:

with open("output.json") as json_file:
    data = json.load(json_file)

But how can I re-compose the Document object from that data?

  • I also tried the to_serialized() method:

import json
import stanza

nlp = stanza.Pipeline('en')
doc = nlp("Barack Obama was born in Hawaii.  He was elected president in 2008.")
with open("output.json","w") as json_file:
    json.dump(doc.to_serialized(), json_file, indent=4, ensure_ascii = False)

But it gives me this error: TypeError: Object of type bytes is not JSON serializable. I searched for how to dump a bytes object and found many posts suggesting json.dumps(bytes_object.decode("utf-8")), so I tried:

with open("output.json","w") as json_file:
    json.dump(doc.to_serialized().decode("utf-8"), json_file, indent=4, ensure_ascii = False)

And it said: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte.
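
The 0x80 byte is consistent with pickle output: a pickle stream written with protocol 2 or later starts with that byte, so the data is binary rather than UTF-8 text. One common workaround is to base64-encode the bytes before placing them in JSON. The sketch below assumes to_serialized() returns pickle bytes, as the reproduction later in this thread suggests. The helper names bytes_to_json and json_to_bytes are hypothetical, and a stand-in pickle payload is used so the example runs without stanza:

```python
import base64
import json
import pickle


def bytes_to_json(blob: bytes) -> str:
    """Wrap arbitrary bytes in a JSON string using base64 (hypothetical helper)."""
    return json.dumps({"serialized": base64.b64encode(blob).decode("ascii")})


def json_to_bytes(json_str: str) -> bytes:
    """Recover the original bytes from the JSON produced by bytes_to_json."""
    return base64.b64decode(json.loads(json_str)["serialized"])


# Stand-in for doc.to_serialized(); assumed here to be pickle output.
blob = pickle.dumps(("some text", [[{"id": 1, "text": "some"}]]))

restored = json_to_bytes(bytes_to_json(blob))
```

The restored bytes could then be passed to Document.from_serialized. The result is valid JSON but not human-readable, so it would not meet the readability goal stated above.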

Is there a solution?

paulthemagno avatar Oct 12 '21 09:10 paulthemagno

If we back up for a moment, what you want is a method to serialize and later reload annotations? Does it have to be JSON, or will anything do?

AngledLuffa avatar Oct 12 '21 17:10 AngledLuffa

Yes, I only need a way to load the Document object later, because I have to iterate over each token and access attributes such as .parent, .ner and so on.

For my purposes it would be better to save the analysis in a JSON file rather than a pickle. It’s not necessary to have the Document itself in the JSON. The important thing is to be able to rebuild the Document from a JSON file.

paulthemagno avatar Oct 12 '21 17:10 paulthemagno

Hey guys, I tried to reproduce to_serialized and from_serialized:

import json
import pickle
from stanza.models.common.doc import Document

# doc.to_serialized
json_str = json.dumps(doc.to_dict(), indent=4, ensure_ascii=True)
serialized_string = pickle.dumps((sequence, json_str))

# Document.from_serialized
(text, sentences) = pickle.loads(serialized_string)
print(text, sentences)
doc = Document(sentences, text=text)

but I get an error when creating the Document instance:

  File "/stanza/models/common/doc.py", line 80, in __init__
    self._process_sentences(sentences, comments)
  File "/stanza/models/common/doc.py", line 146, in _process_sentences
    sentence = Sentence(tokens, doc=self)
  File "/stanza/models/common/doc.py", line 374, in __init__
    self._process_tokens(tokens)
  File "/stanza/models/common/doc.py", line 381, in _process_tokens
    entry[ID] = (i+1, )
TypeError: 'str' object does not support item assignment
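
The traceback is consistent with Document receiving a JSON string where it expects nested Python data. Iterating over a string yields single-character strings, and assigning a key on one of those raises exactly this TypeError. The sketch below reproduces the effect without stanza. Note that add_ids is a hypothetical stand-in for the per-token processing shown in the traceback, not stanza's actual code:

```python
import json


def add_ids(sentences):
    """Hypothetical stand-in: walk sentences and tokens, setting an id on each token dict."""
    for tokens in sentences:
        for i, entry in enumerate(tokens):
            entry["id"] = (i + 1,)
    return sentences


sentences = [[{"text": "Barack"}, {"text": "Obama"}]]
json_str = json.dumps(sentences)

try:
    add_ids(json_str)  # a str: each "token" is a single character
    error = None
except TypeError as exc:
    error = str(exc)

# Decoding the JSON first restores the list of lists of dicts.
fixed = add_ids(json.loads(json_str))
```

In the snippet above, that would mean pickling doc.to_dict() itself rather than json_str, or calling json.loads on the unpickled string before passing it to Document.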

loretoparisi avatar Oct 13 '21 12:10 loretoparisi

So I came up with this solution, which works for me:

    @classmethod
    def stanza2Json(cls, doc, ensure_ascii=True, indent=4):
        '''
            convert stanza document to json string
        '''
        try:
            json_str = json.dumps({
                'text': doc.text,
                'sentences': doc.to_dict()
            }, indent=indent, ensure_ascii=ensure_ascii)
            return json_str
        except Exception:
            # Return None if the document cannot be serialized
            return None

    @classmethod
    def json2Stanza(cls, json_str):
        '''
            convert json string to stanza Document instance
        '''
        try:
            json_doc = json.loads(json_str)
            doc = Document(json_doc['sentences'], text=json_doc['text'])
            # Rebuild the document-level entities from the token annotations
            doc.build_ents()
            return doc
        except Exception:
            # Return None if the JSON is malformed or lacks the expected keys
            return None

loretoparisi avatar Oct 13 '21 13:10 loretoparisi

@AngledLuffa solved. I thought that instantiating a new Document required a list of Sentence objects: https://github.com/stanfordnlp/stanza/blob/f91ca215e175d4f7b202259fe789374db7829395/stanza/models/common/doc.py#L368 Instead, I found that if I dump both Document.to_dict() and the text of the analysis to the JSON, I can easily read them back from the file and obtain the Document this way:

import json
import stanza
from stanza.models.common.doc import Document

nlp = stanza.Pipeline('en')
text = "Barack Obama was born in Hawaii.  He was elected president in 2008."
doc = nlp(text)
with open("output.json", "w", encoding="utf-8") as json_file:
    json.dump({"dict": doc.to_dict(), "text": text}, json_file, indent=4, ensure_ascii=False)

with open("output.json", encoding="utf-8") as json_file:
    data = json.load(json_file)

# Use a name that does not shadow the built-in dict
sentences = data["dict"]
text = data["text"]
read_doc = Document(sentences=sentences, text=text)

Can you confirm that this is the right way to do it?
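
The round trip above can be wrapped in two small helpers. This is a sketch, not part of stanza. save_doc and load_doc are hypothetical names, and the Document constructor is passed in as a parameter so the example runs without stanza installed. It assumes only that the document exposes .text and .to_dict(), as used in this thread.

```python
import json
import os
import tempfile


def save_doc(doc, path):
    """Write doc.text and doc.to_dict() to a JSON file (hypothetical helper)."""
    payload = {"text": doc.text, "sentences": doc.to_dict()}
    with open(path, "w", encoding="utf-8") as json_file:
        json.dump(payload, json_file, indent=4, ensure_ascii=False)


def load_doc(path, make_document):
    """Read the JSON file and rebuild a document via make_document(sentences, text=...)."""
    with open(path, encoding="utf-8") as json_file:
        payload = json.load(json_file)
    return make_document(payload["sentences"], text=payload["text"])


class FakeDocument:
    """Stand-in with the constructor shape used in this thread; not stanza's class."""

    def __init__(self, sentences, text=None):
        self.sentences = sentences
        self.text = text

    def to_dict(self):
        return self.sentences


original = FakeDocument([[{"id": 1, "text": "Hawaii"}]], text="Hawaii")
path = os.path.join(tempfile.mkdtemp(), "output.json")
save_doc(original, path)
restored = load_doc(path, FakeDocument)
```

With stanza, the equivalent call would be load_doc("output.json", Document), using Document from stanza.models.common.doc.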

paulthemagno avatar Oct 13 '21 14:10 paulthemagno

Sorry, got caught up in some stuff today. I'll address it tomorrow, but glad something's working for now, at least

AngledLuffa avatar Oct 14 '21 07:10 AngledLuffa

While it certainly looks like this will work, it's unfortunate that the toolkit has no existing method for converting to and from JSON. Also, this method loses per-sentence information such as sentiment scores or constituency trees. I'll see if I can add something that converts the whole thing in a future version.
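
Until such a method exists, one possible stopgap is to store the per-sentence fields alongside the token data. The sketch below covers the export side only. It assumes sentences expose these values as attributes named sentiment and constituency, which may not hold for every stanza version. sentence_extras is a hypothetical helper, and stand-in objects are used so the example runs without stanza:

```python
import json
from types import SimpleNamespace


def sentence_extras(doc):
    """Collect per-sentence fields that the token-level dict may omit (hypothetical helper)."""
    extras = []
    for sentence in doc.sentences:
        tree = getattr(sentence, "constituency", None)
        extras.append({
            "sentiment": getattr(sentence, "sentiment", None),
            # Trees are stored as their string form; None stays None.
            "constituency": None if tree is None else str(tree),
        })
    return extras


# Stand-in objects so the sketch runs without stanza.
fake_doc = SimpleNamespace(sentences=[
    SimpleNamespace(sentiment=2, constituency="(ROOT (S (NP He)))"),
    SimpleNamespace(),  # a sentence with no extra annotations
])

extras_json = json.dumps(sentence_extras(fake_doc))
```

Restoring these values onto a rebuilt Document is left open here, since it depends on how the installed stanza version represents them.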

AngledLuffa avatar Oct 15 '21 20:10 AngledLuffa

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 14 '21 23:12 stale[bot]