Standoff property value retrieving performance
Our project contains huge transcriptions based on our own standoff mapping, and retrieving the value of this property performs poorly. I mainly use Gravsearch to retrieve data, but even with the v2/resources endpoint performance is poor.
We are talking about 20 to 30 seconds to retrieve such a resource.
If needed, @loicjaouen will provide our memory/CPU stack configuration.
I'm going to prepare a test case so you can try to reproduce this performance problem on your side.
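In the meantime, here is a minimal sketch of the two retrieval paths mentioned above, so the timings can be reproduced. The host, the resource IRI, and the ex: ontology with its Transcription class and hasTranscription property are hypothetical placeholders, not our actual project data.

```python
import time
import urllib.parse

import requests

KNORA = "http://localhost:3333"  # assumed local DSP-API stack

# Path 1: fetch the resource directly via the v2/resources endpoint.
resource_iri = "http://rdfh.ch/0001/someResource"  # placeholder IRI
encoded = urllib.parse.quote(resource_iri, safe="")
start = time.time()
r = requests.get(f"{KNORA}/v2/resources/{encoded}")
print(f"v2/resources took {time.time() - start:.1f}s ({len(r.content)} bytes)")

# Path 2: fetch the same data via a Gravsearch query.
gravsearch = """
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX ex: <http://0.0.0.0:3333/ontology/0001/example/simple/v2#>
CONSTRUCT {
    ?trans knora-api:isMainResource true .
    ?trans ex:hasTranscription ?text .
} WHERE {
    ?trans a ex:Transcription .
    ?trans ex:hasTranscription ?text .
}
"""
start = time.time()
r = requests.post(f"{KNORA}/v2/searchextended", data=gravsearch)
print(f"Gravsearch took {time.time() - start:.1f}s ({len(r.content)} bytes)")
```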
Have you read https://discuss.dasch.swiss/t/large-texts-and-xml-databases/134 ?
You have two options:
- Break your text into smaller pieces, instead of storing a huge text in a single TextValue.
- Wait until Knora supports storing text in an XML database.
I suggested the same thing to you last April:
https://github.com/dasch-swiss/knora-api/issues/1293#issuecomment-480269479
Have you read https://discuss.dasch.swiss/t/large-texts-and-xml-databases/134 ?
No, sorry, I no longer have enough motivation or time to follow your upcoming developments; I'm just trying to find solutions with the existing Knora :)
I suggested the same thing to you last April:
Yep, I remember. @mrivoal and I thought about that, but it's not so easy for us to automatically split our users' data during the migration from their MySQL database into Knora. And anyway, at the end of the day, they probably won't want to split their data :|
Just have a look at their work: http://lumieres.unil.ch/fiches/trans/1088/ . In edit mode (you need an account for that), they use CKEditor, which produces a kind of pseudo-HTML; we provided a standoff mapping and it works very well. It's a shame that, probably just because of a few transcriptions, we get this kind of poor performance :(
The test case, if you want to reproduce it: PerfTrans.zip
@mrivoal The only solution I see right now is to ask them to split their existing transcriptions in their database before our final migration.
@benjamingeer the save process is also very slow. It's not a problem for our migration process, but it will probably be a problem in our web app client if the end user has to wait more than 30 seconds to save something... They haven't given us feedback about that yet, but they probably will in the near future!
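For what it's worth, a rough sketch of how the save path could be timed, assuming the v2 values API; the resource IRI, the ex:hasTranscription property, the mapping IRI, and the credentials are all placeholders, and the JSON-LD is abbreviated.

```python
import time

import requests

payload = {
    "@id": "http://rdfh.ch/0001/someResource",  # placeholder resource IRI
    "@type": "ex:Transcription",                # hypothetical class
    "ex:hasTranscription": {
        "@type": "knora-api:TextValue",
        "knora-api:textValueAsXml": "<text>...large transcription...</text>",
        "knora-api:textValueHasMapping": {
            "@id": "http://rdfh.ch/projects/0001/mappings/ourMapping"  # placeholder
        },
    },
    "@context": {
        "knora-api": "http://api.knora.org/ontology/knora-api/v2#",
        "ex": "http://0.0.0.0:3333/ontology/0001/example/v2#",
    },
}

start = time.time()
r = requests.post(
    "http://localhost:3333/v2/values",      # creates a new value on the resource
    json=payload,
    auth=("user@example.org", "password"),  # placeholder credentials
)
print(f"save took {time.time() - start:.1f}s (HTTP {r.status_code})")
```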
the save process is also very slow
If you can split the text into smaller pieces, both saving and loading will be faster.
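For instance, a rough sketch of what such splitting could look like during migration, assuming the transcriptions are XML with <p> paragraph elements (your actual markup may differ): group paragraphs into chunks of bounded size, and store each chunk as its own TextValue.

```python
import xml.etree.ElementTree as ET

MAX_WORDS = 1000  # a chunk size that keeps retrieval fast

def split_transcription(xml_text):
    """Yield lists of paragraph elements, each totalling at most MAX_WORDS words."""
    root = ET.fromstring(xml_text)
    chunk, count = [], 0
    for para in root.iter("p"):  # assumes paragraphs are <p> elements
        words = len("".join(para.itertext()).split())
        if chunk and count + words > MAX_WORDS:
            yield chunk
            chunk, count = [], 0
        chunk.append(para)
        count += words
    if chunk:
        yield chunk

# Each yielded chunk would then be serialized and stored as a separate
# TextValue (or as a separate resource, e.g. one per chapter).
```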
Yes, the modeling solution, as usual. However, artificially splitting long editions that users can easily handle with other tools (eXist-db) is not an acceptable solution (this is already the feedback we have from another of our edition projects).
Then I guess, in the long run, Knora will have to store long texts in XML databases.
However, artificially splitting long editions that users can easily handle with other tools (eXist-db) is not an acceptable solution (this is already the feedback we have from another of our edition projects).
It's a trade-off. If you can store texts in small enough pieces (1000 words is a good size if you have a lot of markup), you can store them as RDF, and get functionality that you wouldn't get by storing the text in eXist-db, like "find me a text that mentions a person who was born after 1720 and who was a student of Euler". (Maybe you could do that in eXist-db if you were willing to store all your data as XML.)
Otherwise, you can store the text in eXist-db: storage and retrieval will be faster, and some queries will be faster, but you will lose some search capabilities.
I think the best we can do is offer both options, and let each project decide which is best for them.
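As an illustration of the kind of query meant above, here is a Gravsearch sketch (sent to the same /v2/searchextended endpoint). Every ex: class and property in it is hypothetical; the point is that such a query is only possible if persons, birth dates, and teacher links are modelled as RDF alongside the texts.

```python
import requests

# Hypothetical ontology: texts link to persons via ex:mentionsPerson, persons
# have an ex:hasBirthDate and link to their teachers via ex:studentOf.
euler_query = """
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX ex: <http://0.0.0.0:3333/ontology/0001/example/simple/v2#>
CONSTRUCT {
    ?text knora-api:isMainResource true .
} WHERE {
    ?text a ex:Transcription .
    ?text ex:mentionsPerson ?person .
    ?person ex:hasBirthDate ?birthDate .
    FILTER(?birthDate > "GREGORIAN:1720"^^knora-api:Date)
    ?person ex:studentOf ?teacher .
    ?teacher ex:hasName ?name .
    FILTER(?name = "Euler")
}
"""
response = requests.post("http://localhost:3333/v2/searchextended", data=euler_query)
```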
What do you consider to be "a lot of markup"?
What do you consider to be "a lot of markup"?
In the test I did, nearly every word had a tag. The more markup you have, the more triples have to be retrieved, and the slower it's going to be. If you have a big text with very little markup, GraphDB can still retrieve it pretty quickly.
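To make that concrete, a small sketch (a hypothetical helper, not part of Knora) for estimating markup density in an XML text. A ratio near 1.0 means roughly one tag per word, the worst case described above; each standoff tag adds several triples that have to be retrieved.

```python
import xml.etree.ElementTree as ET

def markup_density(xml_text):
    """Return the ratio of markup tags to words in an XML document."""
    root = ET.fromstring(xml_text)
    tags = sum(1 for _ in root.iter())  # every element becomes a standoff tag
    words = len("".join(root.itertext()).split())
    return tags / words if words else 0.0

print(markup_density("<text><w>one</w> <w>two</w> <w>three</w></text>"))  # ~1.33
```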
Ok, thanks.
Just have a look at their work: http://lumieres.unil.ch/fiches/trans/1088/
That text has chapters. Why not store one chapter per resource? That would also make navigation and editing a lot easier. Do you really want to scroll through that much text on one HTML page?