
Test performance of standoff searches with large texts

benjamingeer opened this issue 6 years ago · 10 comments

This needs large texts containing lots of markup.

benjamingeer avatar Dec 07 '18 09:12 benjamingeer

@tobiasschweizer @SepidehAlassi To make a large text containing lots of markup, could we combine a lot of BEOL texts into one text? How hard do you think that would be?

benjamingeer avatar Jul 10 '19 13:07 benjamingeer

I think it won't be hard if the standoff is created from XML. We could just copy and paste an existing text several times inside the same XML doc, reusing the same structure and mapping.
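
That could be scripted, e.g. with lxml. A minimal sketch, assuming the markup lives under a single root element (the file names are placeholders):

```python
import copy
from lxml import etree

# Parse an existing XML document whose markup follows an existing mapping
# (the file names are placeholders).
doc = etree.parse("beol_letter.xml")
root = doc.getroot()

# Append deep copies of the root's children several times, so the document
# grows while reusing exactly the same structure and mapping.
original_children = list(root)
for _ in range(20):
    for child in original_children:
        root.append(copy.deepcopy(child))

doc.write("beol_letter_large.xml", xml_declaration=True, encoding="utf-8")
```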

tobiasschweizer avatar Jul 10 '19 13:07 tobiasschweizer

> We could just copy and paste an existing text several times

I think that for a realistic performance test, it would be better not to repeat the same content.

benjamingeer avatar Jul 10 '19 13:07 benjamingeer

OK, then let's create a huge text by combining existing texts that use the same mapping.

tobiasschweizer avatar Jul 10 '19 13:07 tobiasschweizer

A digital edition corpus might not be the most demanding text for performance tests. As discussed with @tobiasschweizer and two researchers from a new SNF project, corpora from linguistics are probably more intensively tagged.

We (meaning @loicjaouen) were also thinking about creating a fake corpus tagged with NLP libraries, such as the one provided by Stanford, which, according to one of the researchers, produces 8 different tags. But we were planning this for the end of July.

mrivoal avatar Jul 10 '19 13:07 mrivoal

> we were planning this for the end of July.

That would be great. No reason we can't test with both kinds of texts.

benjamingeer avatar Jul 10 '19 13:07 benjamingeer

Yes, exactly!

mrivoal avatar Jul 10 '19 13:07 mrivoal

> @tobiasschweizer @SepidehAlassi To make a large text containing lots of markup, could we combine a lot of BEOL texts into one text? How hard do you think that would be?

@benjamingeer It is easy to do; I can combine all of the Euler correspondence texts, which are full of markup. When do you need it?

SepidehAlassi avatar Jul 24 '19 09:07 SepidehAlassi

I downloaded 50 large books from Project Gutenberg. Each is at least 500 KB, and many are over 1 MB.

The current plan is to use knora-py to create a simple ontology for these books and to add markup (using the standard mapping; see the sketch after this list):

  1. On each word.
  2. On each sequence of 10 words (simulating sentences).
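
A minimal sketch of how that markup could be generated from plain text; the `<word>` and `<sentence>` element names are placeholders and would have to be tags defined in the mapping that is actually used:

```python
import html

def add_markup(plain_text: str, words_per_group: int = 10) -> str:
    """Wrap each word in its own tag and each run of ten words in an
    enclosing tag. <word> and <sentence> are placeholder element names;
    the real tags must be ones defined in the mapping that is used."""
    words = plain_text.split()
    groups = []
    for i in range(0, len(words), words_per_group):
        chunk = words[i:i + words_per_group]
        tagged = " ".join(f"<word>{html.escape(w)}</word>" for w in chunk)
        groups.append(f"<sentence>{tagged}</sentence>")
    return "<text>" + "\n".join(groups) + "</text>"

with open("gutenberg_book.txt", encoding="utf-8") as f:
    xml_body = add_markup(f.read())
```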

Then test:

  1. Retrieving a book without markup.
  2. Retrieving a book with markup.
  3. Searching for a book using full-text search.
  4. Searching for a book using standoff in Gravsearch.

The goal is to provide some guidelines about how and when to split up large texts into multiple text values.
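
A rough timing harness for these tests could look like the sketch below, assuming a local DSP-API instance on port 3333. The resource IRI, the `books` ontology prefix, its `hasText` property, and the standoff tag class are hypothetical; the Gravsearch query follows the `knora-api:matchTextInStandoff` pattern from the Gravsearch documentation.

```python
import time
import urllib.parse
import requests

API = "http://localhost:3333"  # assumed local DSP-API instance

# Tests 1 and 2: retrieve a book resource (the IRI is a placeholder).
book_iri = "http://rdfh.ch/0001/someBookIri"
t0 = time.time()
r = requests.get(f"{API}/v2/resources/{urllib.parse.quote(book_iri, safe='')}")
print(f"resource retrieval: {time.time() - t0:.2f}s, HTTP {r.status_code}")

# Test 3: full-text search for a term occurring in the books.
t0 = time.time()
r = requests.get(f"{API}/v2/search/{urllib.parse.quote('whale')}")
print(f"full-text search: {time.time() - t0:.2f}s, HTTP {r.status_code}")

# Test 4: standoff search via Gravsearch. The `books` ontology and its
# `hasText` property are hypothetical; the standoff tag class would be
# whichever class the mapping produces.
gravsearch = """
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>
PREFIX standoff: <http://api.knora.org/ontology/standoff/v2#>
PREFIX books: <http://0.0.0.0:3333/ontology/0001/books/v2#>

CONSTRUCT {
    ?book knora-api:isMainResource true .
    ?book books:hasText ?text .
} WHERE {
    ?book a books:Book .
    ?book books:hasText ?text .
    ?text knora-api:textValueHasStandoff ?tag .
    ?tag a standoff:StandoffParagraphTag .
    FILTER knora-api:matchTextInStandoff(?text, ?tag, "whale")
}
"""
t0 = time.time()
r = requests.post(f"{API}/v2/searchextended", data=gravsearch.encode("utf-8"))
print(f"Gravsearch standoff search: {time.time() - t0:.2f}s, HTTP {r.status_code}")
```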

benjamingeer avatar Sep 03 '19 14:09 benjamingeer

We initially hoped we could help you test this. However, the project I mentioned earlier won't use Knora as a research tool during the project, but only as a data curation tool at the end.

So, we are not going to simulate a corpus tagged with NLP libraries right now.

mrivoal avatar Sep 03 '19 14:09 mrivoal