
Include attribute values of matching element in CSV

Open jan-niestadt opened this issue 4 years ago • 2 comments

If a query matches an XML element (e.g. a paragraph element), it would be nice if the attribute values of the element (such as the paragraph's ID) could be included in the CSV.

Original request by Martin Reynaert:

As we have FoLiA and therefore have the exact references to the paragraphs that match, would it be feasible to get these references in the CSV-output too? Perhaps make this dependent on the choice of unit to be searched in (document or paragraph or sentence)?

This would help us a lot as we want to refer to these text units in papers and want to extract paragraphs based on queries for further annotation.

jan-niestadt, Apr 24 '20 10:04

For now, if you would like to solve this yourself with a script: first request /hits from BlackLab Server, then for each hit request /docs/pid/snippet (where pid is the document's PID, with conc=orig and with wordstart and wordend parameters based on the hit positions). This returns a snippet from the original XML, so it includes the paragraph tags with their IDs. See https://inl.github.io/BlackLab/blacklab-server-overview.html#requests
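A rough sketch of such a script in Python, for illustration only. The base URL and corpus name are hypothetical, the conc, wordstart and wordend parameter names are taken from the description above, and the docPid, start and end fields read from the /hits response are assumptions to check against the BlackLab Server documentation for your version. Because a snippet may begin or end in the middle of an element, the attributes are pulled out with a regular expression rather than an XML parser.

```python
import csv
import json
import re
import sys
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Hypothetical server location and corpus name; adjust to your installation.
BASE_URL = "http://localhost:8080/blacklab-server/my-corpus"

# Matches name="value" pairs inside an opening tag (double-quoted values only).
ATTR_RE = re.compile(r'([\w:.-]+)\s*=\s*"([^"]*)"')


def hits_url(base_url, cql_pattern):
    """Build the /hits request URL for a CQL pattern, asking for JSON output."""
    return f"{base_url}/hits?" + urlencode({"patt": cql_pattern, "outputformat": "json"})


def snippet_url(base_url, doc_pid, word_start, word_end):
    """Build the snippet request URL for one hit (parameter names as described above)."""
    params = {"wordstart": word_start, "wordend": word_end, "conc": "orig"}
    return f"{base_url}/docs/{quote(doc_pid, safe='')}/snippet?" + urlencode(params)


def element_attributes(snippet_text, tag_name):
    """Return one attribute dict per opening <tag_name ...> tag found in the text."""
    tag_re = re.compile(r"<" + re.escape(tag_name) + r"\b([^<>]*)>")
    return [dict(ATTR_RE.findall(m.group(1))) for m in tag_re.finditer(snippet_text)]


def export_hits(base_url, cql_pattern, tag_name, id_attribute, out=sys.stdout):
    """Write one CSV row per hit: document PID, hit positions, element ID (if found)."""
    with urlopen(hits_url(base_url, cql_pattern)) as response:
        hits = json.load(response)["hits"]  # assumed response layout
    writer = csv.writer(out)
    writer.writerow(["docPid", "start", "end", id_attribute])
    for hit in hits:
        url = snippet_url(base_url, hit["docPid"], hit["start"], hit["end"])
        with urlopen(url) as response:
            snippet_text = response.read().decode("utf-8")
        found = element_attributes(snippet_text, tag_name)
        element_id = found[0].get(id_attribute, "") if found else ""
        writer.writerow([hit["docPid"], hit["start"], hit["end"], element_id])
```

Calling `export_hits(BASE_URL, "<p/>", "p", "xml:id")` would then print the CSV to standard output. This makes one extra request per hit, so it is slow for large result sets, and whether the opening tag actually falls inside the snippet depends on the word window requested; widening wordstart and wordend may be necessary.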

jan-niestadt, Apr 24 '20 10:04

Adding conc=orig to the hits request will generate concordances from the original input XML, which will include inline tags such as <s/> and their attributes. This is not CSV, though, and it would be challenging to add this information to the CSV output in a performant way.

To solve this in a generic/performant way, you would probably need a forward index for XML elements. A complication would be that multiple XML tags can occur at a single token position.
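To make the complication concrete, here is a toy Python sketch of a position-keyed tag index. It is purely illustrative and not how BlackLab stores anything; the `TagSpan` and `TagIndex` names are made up for this example. The point is that each token position has to map to a list of elements rather than a single value, since a paragraph and its first sentence start at the same token.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TagSpan:
    """One XML element, located by the token positions it covers."""
    name: str
    start: int  # position of the first token inside the element
    end: int    # position just past the last token inside the element
    attributes: dict = field(default_factory=dict)


class TagIndex:
    """Maps a start position to every element that opens there."""

    def __init__(self):
        self._starting_at = defaultdict(list)

    def add(self, name, start, end, attributes=None):
        span = TagSpan(name, start, end, dict(attributes or {}))
        self._starting_at[start].append(span)

    def tags_starting_at(self, position):
        """All elements opening at this position (possibly several)."""
        return list(self._starting_at.get(position, []))

    def find(self, name, start, end):
        """The element with this name covering exactly [start, end), or None."""
        for span in self._starting_at.get(start, []):
            if span.name == name and span.end == end:
                return span
        return None


index = TagIndex()
index.add("p", 0, 12, {"xml:id": "doc1.p.1"})
index.add("s", 0, 5, {"xml:id": "doc1.p.1.s.1"})  # same start position as the paragraph
```

With such a structure, a hit that matched a paragraph spanning tokens 0 to 12 could be resolved with `index.find("p", 0, 12)` and its attributes written to the CSV row, without going back to the original XML.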

jan-niestadt, Apr 25 '22 11:04