BlackLab
Include attribute values of matching element in CSV
If a query matches an XML element (e.g. a paragraph element), it would be nice if the attribute values of that element (such as the paragraph's ID) could be included in the CSV output.
Original request by Martin Reynaert:
As we have FoLiA and therefore have the exact references to the paragraphs that match, would it be feasible to get these references in the CSV-output too? Perhaps make this dependent on the choice of unit to be searched in (document or paragraph or sentence)?
This would help us a lot as we want to refer to these text units in papers and want to extract paragraphs based on queries for further annotation.
For now, if you would like to solve this yourself with a script: first request /hits from BlackLab Server, then request /docs/pid/snippet (with conc=orig, and with the wordstart and wordend parameters set from the hit positions) to get a snippet of the original XML, which includes the paragraph tags with their IDs. See https://inl.github.io/BlackLab/blacklab-server-overview.html#requests
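A minimal sketch of such a script, in Python, might look like the following. It is an illustration under stated assumptions, not a tested client: the corpus URL is a placeholder, the hit field names docPid, start and end are assumed to match the JSON returned by /hits, xml:id is assumed to be the attribute of interest (as in FoLiA), and fetch_snippet is a hypothetical callable you supply that returns the original-XML text of a snippet. Because a snippet is a fragment and may not be well-formed, the attributes are pulled out with a regular expression rather than an XML parser.

```python
import csv
import io
import re
from urllib.parse import quote, urlencode

# A quoted attribute value, in double or single quotes.
_QUOTED = r'"[^"]*"' + "|" + r"'[^']*'"
ATTR_RE = re.compile(r"([\w:.-]+)\s*=\s*(" + _QUOTED + ")")
# An opening (or self-closing) tag: group 1 is the name, group 2 the attributes.
TAG_RE = re.compile(
    r"<([A-Za-z_][\w:.-]*)((?:\s+[\w:.-]+\s*=\s*(?:" + _QUOTED + r"))*)\s*/?>"
)


def snippet_url(corpus_url, doc_pid, start, end):
    """Build the snippet request described above for one hit."""
    query = urlencode({"wordstart": start, "wordend": end, "conc": "orig"})
    return f"{corpus_url}/docs/{quote(str(doc_pid), safe='')}/snippet?{query}"


def element_attributes(xml_fragment, element):
    """Return one attribute dict per opening tag named `element`."""
    results = []
    for tag in TAG_RE.finditer(xml_fragment):
        if tag.group(1) != element:
            continue
        # Strip the surrounding quotes from each attribute value.
        results.append(
            {a.group(1): a.group(2)[1:-1] for a in ATTR_RE.finditer(tag.group(2))}
        )
    return results


def hits_to_csv(hits, fetch_snippet, corpus_url, element="p", attribute="xml:id"):
    """Write one CSV row per hit, with the matching element's attribute value.

    `hits` is a list of dicts with docPid/start/end keys (assumed field names);
    `fetch_snippet` is a hypothetical callable: URL -> original-XML snippet text.
    """
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["docPid", "start", "end", f"{element}_{attribute}"])
    for hit in hits:
        url = snippet_url(corpus_url, hit["docPid"], hit["start"], hit["end"])
        found = element_attributes(fetch_snippet(url), element)
        values = [attrs[attribute] for attrs in found if attribute in attrs]
        writer.writerow([hit["docPid"], hit["start"], hit["end"], ";".join(values)])
    return out.getvalue()
```

In practice fetch_snippet would wrap an HTTP request and extract the snippet text from the server's response; that part is left out here because it depends on the output format you request. Note that this issues one extra request per hit, so it is only practical for modest result sets.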
Adding conc=orig to the hits request will generate concordances from the original input XML, which will include inline tags such as <s/> and their attributes. This is not CSV, though, and adding it to the CSV output in a performant way would be challenging.
To solve this in a generic/performant way, you would probably need a forward index for XML elements. A complication would be that multiple XML tags can occur at a single token position.
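To make the complication concrete, here is a toy Python sketch of what a per-position lookup for XML elements could look like. It is purely illustrative and does not describe BlackLab's actual or planned index format; the names ElementSpan and ElementForwardIndex are hypothetical. The point it shows is that each token position has to map to a list of elements rather than a single value, because (for example) a paragraph and its first sentence open at the same token.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ElementSpan:
    """One XML element, located by token positions (end is exclusive)."""
    name: str
    start: int
    end: int
    attributes: dict = field(default_factory=dict)


class ElementForwardIndex:
    """Toy lookup: token position -> all elements that open at that position."""

    def __init__(self):
        self._starting_at = defaultdict(list)

    def add(self, span):
        # Several elements may share a start position, hence a list per position.
        self._starting_at[span.start].append(span)

    def starting_at(self, position, name=None):
        """Return the elements opening at `position`, optionally filtered by name."""
        spans = self._starting_at.get(position, [])
        return [s for s in spans if name is None or s.name == name]


index = ElementForwardIndex()
index.add(ElementSpan("p", 0, 12, {"xml:id": "d1.p.1"}))
index.add(ElementSpan("s", 0, 5, {"xml:id": "d1.p.1.s.1"}))
index.add(ElementSpan("s", 5, 12, {"xml:id": "d1.p.1.s.2"}))
```

With such a structure, a CSV writer could look up the element at each hit's start position directly instead of re-reading the original XML, but the variable number of elements per position is what makes a compact on-disk layout harder than for ordinary token annotations.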