grobid-ner
Define a TEI format to publish the annotated data
If we want to publish the annotated data officially, we need to define a new, TEI-based format.
Example of current input:
<?xml version="1.0" encoding="UTF-8"?>
<corpus>
<subcorpus>
<document name="EHRI_institutions_be-006093.en">
<p xml:lang="en" xml:id="P0">
<sentence xml:id="P0E0">In retrospect, the <ENAMEX type="INSTITUTION">National Archives of Belgium</ENAMEX> were established by the <ENAMEX type="LEGAL">French law of October 26th 1796 (5 Brumair V)</ENAMEX>, which, amongst others, foresaw in the organisation of departmental depots (amongst others, in <ENAMEX type="LOCATION">Brussels</ENAMEX>), in which the archives of the disbanded institutions of the <ENAMEX type="PERIOD">Ancien Régime</ENAMEX> would be stored.</sentence>
<sentence xml:id="P0E1">In <ENAMEX type="PERIOD">1831</ENAMEX>, the archive depot in <ENAMEX type="LOCATION">Brussels</ENAMEX> was officially named the <ENAMEX type="INSTITUTION">National Archives of Belgium</ENAMEX>.</sentence>
<sentence xml:id="P0E2">Already in the <ENAMEX type="PERIOD">early nineteenth century</ENAMEX>, more archival depots in the provinces were installed, which were officially placed under the direction of the <ENAMEX type="TITLE">National State Archivist</ENAMEX> (who holds his office in the <ENAMEX type="INSTITUTION">National Archives</ENAMEX>) in <ENAMEX type="PERIOD">1851</ENAMEX>.</sentence>
<sentence xml:id="P0E3"><ENAMEX type="INSTITUTION">The “Archives Générales du Royaume”(National Archives of Belgium) and the “Archives de l’État dans les Provinces”(State Archives in the Provinces)</ENAMEX>, in other words the <ENAMEX type="INSTITUTION">State Archives</ENAMEX> are a federal academic establishment that forms part of the <ENAMEX type="INSTITUTION">“Service Public Fédéral de Programmation Politique scientifique”(Belgian Federal Science Policy Office)</ENAMEX>.</sentence>
<sentence xml:id="P0E4">The institution includes the “<ENAMEX type="INSTITUTION">Archives Générales du Royaume</ENAMEX>” in <ENAMEX type="LOCATION">Brussels</ENAMEX> and <ENAMEX type="MEASURE">18</ENAMEX> <ENAMEX type="INSTITUTION">State Archives</ENAMEX> that are distributed throughout the country.</sentence>
<sentence xml:id="P0E5">The <ENAMEX type="INSTITUTION">State Archives</ENAMEX> ensure the proper preservation of archival documents produced and managed by the state authorities.</sentence>
<sentence xml:id="P0E6">For this purpose, the <ENAMEX type="INSTITUTION">State Archives</ENAMEX> issue directives and recommendations, conduct inspections, organises training for civil servants and act as an advisory body for the construction and preparation of premises for the conservation of archives and for the organisation of archive management within a public authority.</sentence>
<sentence xml:id="P0E7">The <ENAMEX type="INSTITUTION">State Archives</ENAMEX> obtain and preserve (following sorting) archive documents that are at least <ENAMEX type="PERIOD">30 years</ENAMEX> old from courts, tribunals, public authorities, notaries and from the private sector and private individuals (companies, politicians, associations and societies, influential families, etc. that have played an important role in society).</sentence>
<sentence xml:id="P0E8">They ensure that public archives are transferred according to strict archival standards.</sentence>
</p>
<p xml:lang="en" xml:id="P1">
<sentence xml:id="P1E0">The <ENAMEX type="INSTITUTION">National Archives of Belgium 2 – Joseph Cuvelier repository</ENAMEX> preserves the archives of the external services of the <ENAMEX type="INSTITUTION">Federal Public Service Justice</ENAMEX> (penal institutions), the courts and tribunals under the responsiblity of the <ENAMEX type="LOCATION">Brussels-Capital Region</ENAMEX> (<ENAMEX type="INSTITUTION">justices of peace, police tribunals, Court of Cassation</ENAMEX>, etc.), the <ENAMEX type="INSTITUTION">Federal Public Service Economy</ENAMEX> (patents), the <ENAMEX type="INSTITUTION">Ministry for Reconstruction</ENAMEX> (files on war damages) and business archives.</sentence>
</p>
<p xml:lang="en" xml:id="P2">
<sentence xml:id="P2E0">There are several online search engines: keyword, archives, creator, persons, themes (<ENAMEX type="WEBSITE">http://search.arch.be/</ENAMEX>).</sentence>
<sentence xml:id="P2E1">In order to facilitate access to documents, archivists produce academic reference works aimed at users, such as archive group overviews, guides, historical source studies and, in particular, inventories and search guides with detailed indexes.</sentence>
<sentence xml:id="P2E2">The search guides can be consulted in the reading room, and they are currently subject to a digitisation initiative, which aims to make them fully accessible on-line or via the intranet available on the computers in all the depositories of the <ENAMEX type="INSTITUTION">State Archives</ENAMEX>.</sentence>
</p>
</document>
</subcorpus>
</corpus>
My proposal is to use <rs type="class"> for the annotations; I'm not sure how to encode the rest.
<?xml version="1.0" encoding="UTF-8"?>
<corpus>
<subcorpus>
<document name="EHRI_institutions_be-006093.en">
<p xml:lang="en" xml:id="P0">
<s xml:id="P0E0">In retrospect, the <ENAMEX type="INSTITUTION">National Archives of Belgium</ENAMEX> were established by the <ENAMEX type="LEGAL">French law of October 26th 1796 (5 Brumair V)</ENAMEX>, which, amongst others, foresaw in the organisation of departmental depots (amongst others, in <ENAMEX type="LOCATION">Brussels</ENAMEX>), in which the archives of the disbanded institutions of the <ENAMEX type="PERIOD">Ancien Régime</ENAMEX> would be stored.</s>
<s xml:id="P0E1">In <ENAMEX type="PERIOD">1831</ENAMEX>, the archive depot in <ENAMEX type="LOCATION">Brussels</ENAMEX> was officially named the <ENAMEX type="INSTITUTION">National Archives of Belgium</ENAMEX>.</s>
<s xml:id="P0E2">Already in the <ENAMEX type="PERIOD">early nineteenth century</ENAMEX>, more archival depots in the provinces were installed, which were officially placed under the direction of the <ENAMEX type="TITLE">National State Archivist</ENAMEX> (who holds his office in the <ENAMEX type="INSTITUTION">National Archives</ENAMEX>) in <ENAMEX type="PERIOD">1851</ENAMEX>.</s>
<s xml:id="P0E3"><ENAMEX type="INSTITUTION">The “Archives Générales du Royaume”(National Archives of Belgium) and the “Archives de l’État dans les Provinces”(State Archives in the Provinces)</ENAMEX>, in other words the <ENAMEX type="INSTITUTION">State Archives</ENAMEX> are a federal academic establishment that forms part of the <ENAMEX type="INSTITUTION">“Service Public Fédéral de Programmation Politique scientifique”(Belgian Federal Science Policy Office)</ENAMEX>.</s>
<s xml:id="P0E4">The institution includes the “<ENAMEX type="INSTITUTION">Archives Générales du Royaume</ENAMEX>” in <ENAMEX type="LOCATION">Brussels</ENAMEX> and <ENAMEX type="MEASURE">18</ENAMEX> <ENAMEX type="INSTITUTION">State Archives</ENAMEX> that are distributed throughout the country.</s>
<s xml:id="P0E5">The <ENAMEX type="INSTITUTION">State Archives</ENAMEX> ensure the proper preservation of archival documents produced and managed by the state authorities.</s>
<s xml:id="P0E6">For this purpose, the <ENAMEX type="INSTITUTION">State Archives</ENAMEX> issue directives and recommendations, conduct inspections, organises training for civil servants and act as an advisory body for the construction and preparation of premises for the conservation of archives and for the organisation of archive management within a public authority.</s>
<s xml:id="P0E7">The <ENAMEX type="INSTITUTION">State Archives</ENAMEX> obtain and preserve (following sorting) archive documents that are at least <ENAMEX type="PERIOD">30 years</ENAMEX> old from courts, tribunals, public authorities, notaries and from the private sector and private individuals (companies, politicians, associations and societies, influential families, etc. that have played an important role in society).</s>
<s xml:id="P0E8">They ensure that public archives are transferred according to strict archival standards.</s>
</p>
<p xml:lang="en" xml:id="P1">
<s xml:id="P1E0">The <ENAMEX type="INSTITUTION">National Archives of Belgium 2 – Joseph Cuvelier repository</ENAMEX> preserves the archives of the external services of the <ENAMEX type="INSTITUTION">Federal Public Service Justice</ENAMEX> (penal institutions), the courts and tribunals under the responsiblity of the <ENAMEX type="LOCATION">Brussels-Capital Region</ENAMEX> (<ENAMEX type="INSTITUTION">justices of peace, police tribunals, Court of Cassation</ENAMEX>, etc.), the <ENAMEX type="INSTITUTION">Federal Public Service Economy</ENAMEX> (patents), the <ENAMEX type="INSTITUTION">Ministry for Reconstruction</ENAMEX> (files on war damages) and business archives.</s>
</p>
<p xml:lang="en" xml:id="P2">
<s xml:id="P2E0">There are several online search engines: keyword, archives, creator, persons, themes (<ENAMEX type="WEBSITE">http://search.arch.be/</ENAMEX>).</s>
<s xml:id="P2E1">In order to facilitate access to documents, archivists produce academic reference works aimed at users, such as archive group overviews, guides, historical source studies and, in particular, inventories and search guides with detailed indexes.</s>
<s xml:id="P2E2">The search guides can be consulted in the reading room, and they are currently subject to a digitisation initiative, which aims to make them fully accessible on-line or via the intranet available on the computers in all the depositories of the <ENAMEX type="INSTITUTION">State Archives</ENAMEX>.</s>
</p>
</document>
</subcorpus>
</corpus>
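As a rough sketch of what the <rs> proposal could look like, here is sentence P0E1 from the sample with each <ENAMEX> replaced by <rs>, keeping the class name in the type attribute. Reusing the existing class names and xml:id values unchanged is an assumption, not a settled decision, and this is only a fragment, not a complete TEI document.

```xml
<!-- Sketch only: ENAMEX replaced one-for-one by rs, class kept in @type -->
<p xml:lang="en" xml:id="P0">
  <s xml:id="P0E1">In <rs type="PERIOD">1831</rs>, the archive depot in <rs type="LOCATION">Brussels</rs> was officially named the <rs type="INSTITUTION">National Archives of Belgium</rs>.</s>
</p>
```

A one-for-one mapping like this would keep the conversion from the current format mechanical, since only the element name changes.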
Just a few comments:
- In the document header, we need to keep track of the Wikipedia article used (acknowledgement), with its version/download date.
- In the header, include a link to the NE schema used, I guess with <classDecl> under <encodingDesc>, pointing to readthedocs.
- I personally find it easier to manage/debug/etc. a list of separate files, one per document, rather than one single huge document with teiCorpus (external people can then use GitHub to correct the annotations easily), the annotated data will live :)
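The header comments above could be sketched roughly as follows, for one file per document. Everything concrete here is a placeholder or an assumption: the source URL, the download date, the readthedocs address, the taxonomy xml:id, and the choice of the document name as the title. Only the element names (<sourceDesc>, <encodingDesc>, <classDecl>, <taxonomy>) come from TEI itself.

```xml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <!-- Assumption: reuse the document name from the current format as title -->
      <title>EHRI_institutions_be-006093.en</title>
    </titleStmt>
    <publicationStmt>
      <p>Placeholder: publication and licence details still to be decided.</p>
    </publicationStmt>
    <sourceDesc>
      <!-- Acknowledgement of the source article; URL and date are placeholders -->
      <bibl>
        <ref target="https://en.wikipedia.org/wiki/Example_article">Example article</ref>,
        downloaded <date type="download" when="2016-01-01">2016-01-01</date>
      </bibl>
    </sourceDesc>
  </fileDesc>
  <encodingDesc>
    <classDecl>
      <!-- Link to the NE schema documentation; the target is a placeholder -->
      <taxonomy xml:id="ne-classes">
        <bibl>
          <ref target="https://example.readthedocs.io/">Named-entity class definitions</ref>
        </bibl>
      </taxonomy>
    </classDecl>
  </encodingDesc>
</teiHeader>
```

With one such header per file, each document carries its own provenance, which fits the preference for separate files over a single teiCorpus.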