Save and read Document object from JSON file
I need to save the analysis of a text to a JSON file and then read it back from the file as a Document object. I see there are some methods that could work:
- The best option for me would be the to_dict() method, so that the document is saved in a readable form in the JSON file:
import json
import stanza
nlp = stanza.Pipeline('en')
doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
with open("output.json","w") as json_file:
json.dump(doc.to_dict(), json_file, indent=4, ensure_ascii = False)
Then I could read it in this way:
with open("output.json") as json_file:
data = json.load(json_file)
But how can I rebuild the Document object from it?
- Another way could be to use the to_serialized() method (since there is a Document.from_serialized(cls, serialized_string) method):
import json
import stanza
nlp = stanza.Pipeline('en')
doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
with open("output.json","w") as json_file:
json.dump(doc.to_serialized(), json_file, indent=4, ensure_ascii = False)
But it gives me this error: TypeError: Object of type bytes is not JSON serializable. So I searched for how to dump a bytes object, and many posts say to do json.dumps(bytes_object.decode("utf-8")), so I tried:
with open("output.json","w") as json_file:
json.dump(doc.to_serialized().decode("utf-8"), json_file, indent=4, ensure_ascii = False)
And it said: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte.
Is there a solution?
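(A side note on the bytes error above: pickle output is binary data, so it cannot be decoded as UTF-8. If it has to travel inside JSON, the usual workaround is base64 encoding. A minimal sketch using only standard-library calls, not a stanza feature, and reusing the doc from the pipeline above:)

import base64
import json
from stanza.models.common.doc import Document

# encode the pickled bytes as base64 text so JSON can carry them
with open("output.json", "w") as json_file:
    json.dump({"doc": base64.b64encode(doc.to_serialized()).decode("ascii")}, json_file)

# decode the base64 text back to bytes and rebuild the Document
with open("output.json") as json_file:
    data = json.load(json_file)
read_doc = Document.from_serialized(base64.b64decode(data["doc"]))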
If we back up for a moment, what you want is a method to serialize and later reload annotations? Does it have to be JSON, or will anything do?
Yes, I only need a way to load the Document object back at a later time, because I have to iterate over each token and use properties such as .parent, .ner and so on...
For my purpose it would be better to save the analysis in a JSON file rather than a pickle. It's not necessary to have the Document directly in the JSON; the important thing is to be able to rebuild the Document from a JSON file.
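(For reference, if a plain binary file were acceptable, the two methods mentioned above already round-trip without JSON; a minimal sketch:)

import stanza
from stanza.models.common.doc import Document

nlp = stanza.Pipeline('en')
doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")

# write the serialized bytes to a binary file
with open("output.bin", "wb") as bin_file:
    bin_file.write(doc.to_serialized())

# read them back and rebuild the Document
with open("output.bin", "rb") as bin_file:
    read_doc = Document.from_serialized(bin_file.read())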
Hey guys, I tried to reproduce to_serialized and from_serialized:
import json
import pickle
from stanza.models.common.doc import Document

# doc.to_serialized
json_str = json.dumps(doc.to_dict(), indent=4, ensure_ascii=True)
serialized_string = pickle.dumps((sequence, json_str))  # sequence presumably holds the original text

# Document.from_serialized
(text, sentences) = pickle.loads(serialized_string)
print(text, sentences)
doc = Document(sentences, text=text)
but I get an error when creating the Document class instance:
File "/stanza/models/common/doc.py", line 80, in __init__
self._process_sentences(sentences, comments)
File "/stanza/models/common/doc.py", line 146, in _process_sentences
sentence = Sentence(tokens, doc=self)
File "/stanza/models/common/doc.py", line 374, in __init__
self._process_tokens(tokens)
File "/stanza/models/common/doc.py", line 381, in _process_tokens
entry[ID] = (i+1, )
TypeError: 'str' object does not support item assignment
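The TypeError is most likely because json_str is a plain string, while the Document constructor expects the list of sentence dicts itself; pickling the to_dict() output directly avoids the problem. A minimal sketch, assuming the serialized tuple holds the text and the to_dict() list:

import pickle
from stanza.models.common.doc import Document

# serialize: store the raw text and the list-of-dicts form, not a JSON string
serialized_string = pickle.dumps((doc.text, doc.to_dict()))

# deserialize: rebuild the Document from the list of sentence dicts
text, sentences = pickle.loads(serialized_string)
rebuilt_doc = Document(sentences, text=text)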
so I came up with this solution that works for me:
@classmethod
def stanza2Json(cls, doc, ensure_ascii=True, indent=4):
    '''
    convert stanza document to json string
    '''
    try:
        json_str = json.dumps({
            'text': doc.text,
            'sentences': doc.to_dict()
        }, indent=indent, ensure_ascii=ensure_ascii)
        return json_str
    except Exception:
        return None

@classmethod
def json2Stanza(cls, json_str):
    '''
    convert json string to stanza Document instance
    '''
    try:
        json_doc = json.loads(json_str)
        doc = Document(json_doc['sentences'], text=json_doc['text'])
        doc.build_ents()
        return doc
    except Exception:
        return None
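A quick usage sketch for the two helpers above, assuming they live on a helper class (here called StanzaJsonHelper, a name invented for illustration):

# StanzaJsonHelper is a hypothetical enclosing class for the two classmethods above
json_str = StanzaJsonHelper.stanza2Json(doc)

with open("output.json", "w") as json_file:
    json_file.write(json_str)

with open("output.json") as json_file:
    read_doc = StanzaJsonHelper.json2Stanza(json_file.read())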
@AngledLuffa solved it. I thought that a list of Sentence objects was necessary to instantiate a new Document: https://github.com/stanfordnlp/stanza/blob/f91ca215e175d4f7b202259fe789374db7829395/stanza/models/common/doc.py#L368
Instead, I saw that by dumping Document.to_dict() together with the text of the analysis into the JSON, I can easily read it back from the file and obtain the Document in this way:
import json
import stanza
from stanza.models.common.doc import Document
nlp = stanza.Pipeline('en')
text = "Barack Obama was born in Hawaii. He was elected president in 2008."
doc = nlp(text)
with open("output.json","w") as json_file:
json.dump({"dict": doc.to_dict(), "text":text}, json_file, indent=4, ensure_ascii = False)
with open("output.json") as json_file:
data = json.load(json_file)
dict = data["dict"]
text = data["text"]
read_doc = Document(sentences = dict, text = text)
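As a quick check that the reloaded Document supports the per-token access mentioned earlier (a sketch, assuming the default English pipeline with its NER processor):

# iterate over the reloaded Document just like the original one
for sentence in read_doc.sentences:
    for token in sentence.tokens:
        print(token.text, token.ner)
    for word in sentence.words:
        # word.parent is the Token that the Word belongs to
        print(word.text, word.parent.text)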
Can you confirm that this is the right way to do it?
Sorry, got caught up in some stuff today. I'll address it tomorrow, but glad something's working for now, at least
While it certainly looks like this will work, it's very unfortunate that there's no existing method for converting to & from JSON already in the toolkit. Also, this method is lacking per-sentence information such as sentiment scores or constituency trees. I'll see if I can add something which converts the whole thing in a future version.
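One possible stopgap for that limitation is to store the per-sentence extras alongside to_dict(); a sketch, assuming a pipeline that includes the sentiment and constituency processors (note these values are not restored onto the reloaded Document automatically):

# store per-sentence extras that to_dict() does not include
extras = [{"sentiment": sentence.sentiment,
           "constituency": str(sentence.constituency)}
          for sentence in doc.sentences]
with open("output.json", "w") as json_file:
    json.dump({"dict": doc.to_dict(), "text": text, "extras": extras},
              json_file, indent=4, ensure_ascii=False)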