
Test performance of standoff searches with large texts

benjamingeer opened this issue 6 years ago · 10 comments

This needs large texts containing lots of markup.

benjamingeer avatar Dec 07 '18 09:12 benjamingeer

@tobiasschweizer @SepidehAlassi To make a large text containing lots of markup, could we combine a lot of BEOL texts into one text? How hard do you think that would be?

benjamingeer avatar Jul 10 '19 13:07 benjamingeer

I think it won't be hard if the standoff is created from XML. We could just copy and paste an existing text several times inside the same XML doc, reusing the same structure and mapping.
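
That could be scripted, e.g. with lxml. A minimal sketch, assuming the markup lives under a single root element (the file names are placeholders):

```python
import copy
from lxml import etree

# Parse an existing XML document whose markup follows an existing mapping
# (the file names are placeholders).
doc = etree.parse("beol_letter.xml")
root = doc.getroot()

# Append deep copies of the root's children several times, so the document
# grows while reusing exactly the same structure and mapping.
original_children = list(root)
for _ in range(20):
    for child in original_children:
        root.append(copy.deepcopy(child))

doc.write("beol_letter_large.xml", xml_declaration=True, encoding="utf-8")
```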

tobiasschweizer avatar Jul 10 '19 13:07 tobiasschweizer

> We could just copy and paste an existing text several times

I think that for a realistic performance test, it would be better not to repeat the same content.

benjamingeer avatar Jul 10 '19 13:07 benjamingeer

OK, then let's create a huge text by combining existing texts that use the same mapping.

tobiasschweizer avatar Jul 10 '19 13:07 tobiasschweizer

A digital edition corpus might not be the most demanding text for performance tests. As discussed with @tobiasschweizer and two researchers from a new SNF project, corpora from linguistics are probably more intensively tagged.

We (meaning @loicjaouen) were also thinking about creating a fake corpus tagged with NLP libraries, such as the one provided by Stanford, which, according to one of the researchers, produces 8 different tags. But we were planning this for the end of July.

mrivoal avatar Jul 10 '19 13:07 mrivoal

> we were planning this for the end of July.

That would be great. No reason we can't test with both kinds of texts.

benjamingeer avatar Jul 10 '19 13:07 benjamingeer

Yes, exactly!

mrivoal avatar Jul 10 '19 13:07 mrivoal

> @tobiasschweizer @SepidehAlassi To make a large text containing lots of markup, could we combine a lot of BEOL texts into one text? How hard do you think that would be?

@benjamingeer It is easy to do; I can combine all of the Euler correspondence texts, which are full of markup. When do you need it?

SepidehAlassi avatar Jul 24 '19 09:07 SepidehAlassi

I downloaded 50 large books from Project Gutenberg. Each is at least 500 KB, and many are over 1 MB.

The current plan is to use knora-py to create a simple ontology for these books and to add markup (using the standard mapping; see the sketch after this list):

  1. On each word.
  2. On each sequence of 10 words (simulating sentences).
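
A minimal sketch of how that markup could be generated from plain text; the `<word>` and `<sentence>` element names are placeholders and would have to be tags defined in the mapping that is actually used:

```python
import html

def add_markup(plain_text: str, words_per_group: int = 10) -> str:
    """Wrap each word in its own tag and each run of ten words in an
    enclosing tag. <word> and <sentence> are placeholder element names;
    the real tags must be ones defined in the mapping that is used."""
    words = plain_text.split()
    groups = []
    for i in range(0, len(words), words_per_group):
        chunk = words[i:i + words_per_group]
        tagged = " ".join(f"<word>{html.escape(w)}</word>" for w in chunk)
        groups.append(f"<sentence>{tagged}</sentence>")
    return "<text>" + "\n".join(groups) + "</text>"

with open("gutenberg_book.txt", encoding="utf-8") as f:
    xml_body = add_markup(f.read())
```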

Then test:

  1. Retrieving a book without markup.
  2. Retrieving a book with markup.
  3. Searching for a book using full-text search.
  4. Searching for a book using standoff in Gravsearch.

The goal is to provide some guidelines about how and when to split up large texts into multiple text values.
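
A rough timing harness for these tests could look like the sketch below, assuming a local DSP-API instance on port 3333. The resource IRI, the `books` ontology prefix, its `hasText` property, and the standoff tag class are hypothetical; the Gravsearch query follows the `knora-api:matchTextInStandoff` pattern from the Gravsearch documentation.

```python
import time
import urllib.parse
import requests

API = "http://localhost:3333"  # assumed local DSP-API instance

# Tests 1 and 2: retrieve a book resource (the IRI is a placeholder).
book_iri = "http://rdfh.ch/0001/someBookIri"
t0 = time.time()
r = requests.get(f"{API}/v2/resources/{urllib.parse.quote(book_iri, safe='')}")
print(f"resource retrieval: {time.time() - t0:.2f}s, HTTP {r.status_code}")

# Test 3: full-text search for a term occurring in the books.
t0 = time.time()
r = requests.get(f"{API}/v2/search/{urllib.parse.quote('whale')}")
print(f"full-text search: {time.time() - t0:.2f}s, HTTP {r.status_code}")

# Test 4: standoff search via Gravsearch. The `books` ontology and its
# `hasText` property are hypothetical; the standoff tag class would be
# whichever class the mapping produces.
gravsearch = """
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>
PREFIX standoff: <http://api.knora.org/ontology/standoff/v2#>
PREFIX books: <http://0.0.0.0:3333/ontology/0001/books/v2#>

CONSTRUCT {
    ?book knora-api:isMainResource true .
    ?book books:hasText ?text .
} WHERE {
    ?book a books:Book .
    ?book books:hasText ?text .
    ?text knora-api:textValueHasStandoff ?tag .
    ?tag a standoff:StandoffParagraphTag .
    FILTER knora-api:matchTextInStandoff(?text, ?tag, "whale")
}
"""
t0 = time.time()
r = requests.post(f"{API}/v2/searchextended", data=gravsearch.encode("utf-8"))
print(f"Gravsearch standoff search: {time.time() - t0:.2f}s, HTTP {r.status_code}")
```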

benjamingeer avatar Sep 03 '19 14:09 benjamingeer

We initially hoped we could help you test this. However, the project I mentioned earlier won't use Knora as a research tool during the project, but only as a data curation tool at the end.

So, we are not going to simulate a corpus tagged with NLP libraries right now.

mrivoal avatar Sep 03 '19 14:09 mrivoal