
Standoff property value retrieving performance

Open • gfoo opened this issue 5 years ago • 12 comments

Our project has huge transcriptions based on our own standoff mapping, and retrieving the value of this property performs poorly. I mainly use Gravsearch to retrieve data, but even with the v2/resources endpoint the performance is poor. We are talking about 20 or 30 seconds to retrieve this resource.

If needed, @loicjaouen will provide our Mem/CPU stack configuration.

I'm going to prepare a test case so you can try to reproduce this performance problem on your side.
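For context, a minimal sketch of how such a retrieval can be timed against the v2 resources endpoint, assuming a local DSP-API instance and a hypothetical resource IRI (both placeholders, not from this project):

```python
import time
import urllib.parse

import requests

# Placeholders: adjust to your own DSP-API instance and resource.
API_HOST = "http://localhost:3333"                     # hypothetical host
RESOURCE_IRI = "http://rdfh.ch/0001/example-resource"  # hypothetical IRI

# The v2 resources endpoint expects the resource IRI percent-encoded.
url = f"{API_HOST}/v2/resources/{urllib.parse.quote(RESOURCE_IRI, safe='')}"

start = time.monotonic()
response = requests.get(url)
elapsed = time.monotonic() - start

response.raise_for_status()
print(f"Retrieved {len(response.content)} bytes in {elapsed:.1f} s")
```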

gfoo avatar Jan 21 '20 15:01 gfoo

Have you read https://discuss.dasch.swiss/t/large-texts-and-xml-databases/134 ?

You have two options:

  1. Break your text into smaller pieces, instead of storing a huge text in a single TextValue (see the sketch after this list).
  2. Wait until Knora supports storing text in an XML database.
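For option 1, a minimal sketch of what splitting could look like, assuming the transcription is well-formed XML and that a hypothetical `<chapter>` element marks a natural boundary (both assumptions about the data, not about Knora):

```python
import xml.etree.ElementTree as ET

def split_by_chapter(xml_text: str) -> list[str]:
    """Split one big transcription into one XML document per <chapter>.

    <chapter> is a hypothetical element name; adapt it to your own
    standoff mapping. Each returned piece can then be stored as its
    own TextValue instead of one huge one.
    """
    root = ET.fromstring(xml_text)
    return [
        ET.tostring(chapter, encoding="unicode")
        for chapter in root.iter("chapter")
    ]

# Tiny usage example:
doc = "<text><chapter>one</chapter><chapter>two</chapter></text>"
print(split_by_chapter(doc))  # two independent pieces
```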

benjamingeer avatar Jan 21 '20 17:01 benjamingeer

I suggested the same thing to you last April:

https://github.com/dasch-swiss/knora-api/issues/1293#issuecomment-480269479

benjamingeer avatar Jan 21 '20 22:01 benjamingeer

Have you read https://discuss.dasch.swiss/t/large-texts-and-xml-databases/134 ?

No, sorry, I no longer have enough motivation and time to follow your upcoming developments; I'm just trying to find solutions with the existing Knora :)

I suggested the same thing to you last April:

Yep, I remember. @mrivoal and I thought about that, but it's not so easy for us to automatically split our users' data during the migration from their MySQL database into Knora. And anyway, at the end of the day, they probably won't want to split their data :|

Just have a look at their work: http://lumieres.unil.ch/fiches/trans/1088/ (in edit mode; you need an account for that). They use CKEditor, which produces a kind of pseudo-HTML; we provided a standoff mapping and it works very well. It's a shame that, probably just for a few transcriptions, we get this kind of poor performance :(

gfoo avatar Jan 22 '20 06:01 gfoo

The test case, if you want to reproduce it: PerfTrans.zip

gfoo avatar Jan 22 '20 06:01 gfoo

@mrivoal The only solution I see right now is to ask them to split their existing transcriptions in their database before our final migration.

@benjamingeer The save process is also very slow. It is not a problem for our migration process, but probably a problem in our web app client if the end user has to wait more than 30 seconds to save something... They haven't given us feedback about that yet, but they probably will in the near future!

gfoo avatar Jan 22 '20 06:01 gfoo

The save process is also very slow.

If you can split the text into smaller pieces, both saving and loading will be faster.

benjamingeer avatar Jan 22 '20 08:01 benjamingeer

Yes, the modeling solution, as usual. However, artificially splitting long editions that users can easily deal with in other tools (eXist-db) is not an acceptable solution (this is already the feedback we have from another of our edition projects).

Then I guess, in the long run, Knora will have to store long texts in XML databases.

mrivoal avatar Jan 22 '20 10:01 mrivoal

However, artificially splitting long editions that users can easily deal with in other tools (eXist-db) is not an acceptable solution (this is already the feedback we have from another of our edition projects).

It's a trade-off. If you can store texts in small enough pieces (1000 words is a good size if you have a lot of markup), you can store them as RDF, and get functionality that you wouldn't get by storing the text in eXist-db, like "find me a text that mentions a person who was born after 1720 and who was a student of Euler". (Maybe you could do that in eXist-db if you were willing to store all your data as XML.)

Otherwise, you can store the text in eXist-db: storage and retrieval will be faster, and some queries will be faster, but you will lose some search capabilities.

I think the best we can do is offer both options, and let each project decide which is best for them.
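As an illustration of the kind of cross-resource query meant above, a hedged sketch of a Gravsearch request sent from Python to the /v2/searchextended endpoint. Everything in the `ex:` namespace (class and property names) is an invented placeholder for a project ontology; only the knora-api parts follow the documented Gravsearch structure:

```python
import requests

API_HOST = "http://localhost:3333"  # hypothetical DSP-API instance

# The ex: names below are invented placeholders for a project ontology.
QUERY = """
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX ex: <http://0.0.0.0:3333/ontology/0001/example/simple/v2#>

CONSTRUCT {
    ?text knora-api:isMainResource true .
} WHERE {
    ?text a ex:Transcription .
    ?text ex:mentionsPerson ?person .
    ?person ex:hasBirthDate ?birthDate .
    ?person ex:wasStudentOf ?teacher .
    ?teacher ex:hasName "Euler" .
    FILTER(?birthDate > "GREGORIAN:1720"^^knora-api:Date)
}
"""

# Gravsearch queries are POSTed as the raw request body.
response = requests.post(f"{API_HOST}/v2/searchextended", data=QUERY)
response.raise_for_status()
print(response.json())
```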

benjamingeer avatar Jan 22 '20 10:01 benjamingeer

What would you consider "a lot of markup"?

mrivoal avatar Jan 22 '20 10:01 mrivoal

What would you consider "a lot of markup"?

In the test I did, nearly every word had a tag. The more markup you have, the more triples have to be retrieved, and the slower it's going to be. If you have a big text with very little markup, GraphDB can still retrieve it pretty quickly.
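To get a rough feel for that before importing, a small sketch (an illustration, not part of Knora) that counts elements in an XML transcription; each standoff tag ends up as its own RDF entity with several triples behind it, so the element count is a rough proxy for retrieval cost:

```python
import xml.etree.ElementTree as ET

def count_tags(xml_text: str) -> int:
    """Count the elements in an XML transcription.

    Each standoff tag is stored as its own RDF entity with several
    triples, so the element count roughly tracks how many triples
    a retrieval has to touch.
    """
    return sum(1 for _ in ET.fromstring(xml_text).iter())

doc = "<text><p>A <name>word</name> with <em>markup</em>.</p></text>"
print(count_tags(doc))  # 4 elements: text, p, name, em
```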

benjamingeer avatar Jan 22 '20 10:01 benjamingeer

Ok, thanks.

mrivoal avatar Jan 22 '20 11:01 mrivoal

Just have a look at their work: http://lumieres.unil.ch/fiches/trans/1088/

That text has chapters. Why not store one chapter per resource? That would also make navigation and editing a lot easier. Do you really want to scroll through that much text on one HTML page?

benjamingeer avatar Jan 22 '20 11:01 benjamingeer