
Merging two functionalities in grobid

Open Tanmay98 opened this issue 1 year ago • 4 comments

Right now, grobid follows a cascade like: Segmentation -> Header, Fulltext, Reference segmenter

And I also want to use: Custom segmentation -> Custom feature

Is it possible to combine both in one build? What I mean is that right now, when I run processFullText, grobid follows the set hierarchy, but if I run processMyFeature, I want grobid to follow a different, custom hierarchy like the one mentioned above.

All in all, is it possible to add both of these separate cascades in one single build?

Thanks in advance!

Tanmay98, Jul 20 '22 07:07

Hi @Tanmay98

Is "Custom feature" an existing submodel of Grobid, or would you like to write your own?

The existing sub-models are constrained in terms of input/output, and not all of their outputs have a final serialization, that is, something to return. At the very least, a new result serialization (in XML or JSON) would be necessary.

If one wants to add their own process at any stage of the processing hierarchy, some Java development for this new process is currently required. This is done in the grobid modules listed here, which introduce additional models applied after segmentation or fulltext, on certain relevant substructures.
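As a rough illustration of what "two cascades in one build" could look like structurally, here is a minimal, self-contained sketch. It does not use any Grobid class: `CascadeRegistry`, the stage lambdas, and the way endpoint names are mapped to stage lists are all assumptions made for illustration, and real stages would wrap trained models and produce TEI rather than strings.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: none of these names belong to the Grobid API. */
public class CascadeRegistry {
    // Endpoint name -> ordered list of stages. Each stage transforms the
    // intermediate result and hands it to the next stage in the cascade.
    private final Map<String, List<UnaryOperator<String>>> cascades = new LinkedHashMap<>();

    public void register(String endpoint, List<UnaryOperator<String>> stages) {
        cascades.put(endpoint, stages);
    }

    /** Runs the cascade registered for the given endpoint, stage by stage. */
    public String process(String endpoint, String input) {
        List<UnaryOperator<String>> stages = cascades.get(endpoint);
        if (stages == null) {
            throw new IllegalArgumentException("Unknown endpoint: " + endpoint);
        }
        String result = input;
        for (UnaryOperator<String> stage : stages) {
            result = stage.apply(result);
        }
        return result;
    }

    /** Builds a registry holding both the default-like and the custom cascade. */
    public static CascadeRegistry demo() {
        // Placeholder stages: they only record which step ran.
        UnaryOperator<String> segmentation = s -> s + " -> segmentation";
        UnaryOperator<String> fulltext = s -> s + " -> fulltext";
        UnaryOperator<String> customSegmentation = s -> s + " -> customSegmentation";
        UnaryOperator<String> customFeature = s -> s + " -> customFeature";

        CascadeRegistry registry = new CascadeRegistry();
        registry.register("processFullText", Arrays.asList(segmentation, fulltext));
        registry.register("processMyFeature", Arrays.asList(customSegmentation, customFeature));
        return registry;
    }

    public static void main(String[] args) {
        CascadeRegistry registry = demo();
        System.out.println(registry.process("processFullText", "pdf"));
        System.out.println(registry.process("processMyFeature", "pdf"));
    }
}
```

The point of the sketch is only that both hierarchies can live side by side in one process as long as each entry point selects its own ordered list of stages; how that maps onto Grobid's actual engine and service classes is not covered here.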

kermitt2, Jul 20 '22 08:07

Thank you for your quick response, @kermitt2!

Actually, no, the custom feature is not an existing submodule of grobid.

My concern is that I want two separate hierarchies to run. For example, I want to use the hierarchy that grobid follows by default as well as my own custom hierarchy. I was wondering whether that is possible.

Also, I did go through the grobid-dictionary submodule. From that, I assumed that using maven I would only be able to run the dictionary part, and not the default grobid features, on one single server. I am sorry, but I am new to Java, maven, etc. (I know machine learning very well). Is it possible to run both the grobid-dictionary modules and the default grobid modules by running only one server? That is, if I run maven or ./gradlew run, can I use both processFullText and processDictionary?

Tanmay98, Jul 20 '22 08:07

Hi @kermitt2, my goal was to train models such as segmentation (for grobid) and segmentation (for grobid-dictionary) from a single server run (./gradlew run), so I tried to combine the grobid-dictionary modules and the grobid modules in one single pipeline. I created the necessary files in grobid-core and grobid-trainer and attached two different TEI formatters (one for grobid-dictionary and the other for grobid). Finally, I also changed the gradle build file. I was able to build the library successfully, but when I run ./gradlew train_dictionary_body_segmentation, I get the following errors:

[Image: Screenshot 2022-07-26 at 11 10 13 AM]

Can you help me?

Tanmay98, Jul 26 '22 05:07

Hello @Tanmay98 !

Apparently you need to load a property file specific to grobid-dictionaries and instantiate a GrobidDictionaryProperties object.
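The general pattern of "load a property file, then build a properties object from it before anything else uses it" can be sketched as follows. This is not the real `GrobidDictionaryProperties` class, whose constructor and methods are not shown in this thread: `DictionaryPropertiesSketch`, its `load` factory, and the `models.path` key are hypothetical names used only to illustrate the idea with the standard `java.util.Properties` API.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical stand-in, NOT the real GrobidDictionaryProperties class. */
public class DictionaryPropertiesSketch {
    private final Properties props = new Properties();

    private DictionaryPropertiesSketch(Reader source) {
        try {
            // Parses key=value lines from the property file content.
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException("Could not read property file", e);
        }
    }

    /** Creates the properties object from an already opened property file. */
    public static DictionaryPropertiesSketch load(Reader source) {
        return new DictionaryPropertiesSketch(source);
    }

    /** Returns a required property and fails loudly if it is missing. */
    public String get(String key) {
        String value = props.getProperty(key);
        if (value == null) {
            throw new IllegalStateException("Missing property: " + key);
        }
        return value;
    }

    public static void main(String[] args) {
        // In a real setup the Reader would come from the module's property
        // file on disk; an in-memory string keeps the sketch runnable.
        Reader source = new StringReader("models.path=/hypothetical/models\n");
        DictionaryPropertiesSketch properties = load(source);
        System.out.println(properties.get("models.path"));
    }
}
```

Failing loudly on a missing key tends to surface configuration problems at startup rather than in the middle of a training run.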

But I was not one of the developers of grobid-dictionaries - you will certainly receive better help by asking in the grobid-dictionaries project.

kermitt2, Jul 26 '22 10:07