
refactorization and speedup: metadata extraction

Open matyaskopp opened this issue 11 months ago • 2 comments

Currently, the extraction of metadata to TSV appears in more than one place in the code:

https://github.com/clarin-eric/ParlaMint/blob/2de4c7c04567e8df740f72b70577a15adbf4cd90/Scripts/parlamintp-tei2text.pl#L51-L75
https://github.com/clarin-eric/ParlaMint/blob/2de4c7c04567e8df740f72b70577a15adbf4cd90/Scripts/parlamintp2conllu.pl#L107-L124

Before I try to speed up the process, I need to

  • [x] factorize it out

In parlamint2distro.pl, I want to call this metadata extraction separately, because it needs a different setup (fewer jobs in parallel). For ParlaMint-IL in particular, it is not possible to run it in 60 jobs, because it runs out of memory (all 45k files need to open the taxonomies and particDesc files, which are extremely large). @TomazErjavec, should it be backwards compatible? I can add an option -no-meta to the script, so that extracting metadata will not be called when it is present.
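To make the memory argument concrete, here is a minimal Python sketch (not part of the Perl scripts) of capping the job count by available memory. The function name `max_parallel_jobs` and all memory figures are hypothetical; the thread only states that 60 jobs run out of memory for ParlaMint-IL, not how much memory each job needs.

```python
def max_parallel_jobs(available_mem_gb: float,
                      per_job_mem_gb: float,
                      requested_jobs: int) -> int:
    """Cap the requested job count so that all jobs fit into memory together.

    All arguments are illustrative assumptions, not measured values.
    """
    if per_job_mem_gb <= 0:
        raise ValueError("per_job_mem_gb must be positive")
    # How many jobs fit side by side if each holds its own copy of the headers.
    fit = int(available_mem_gb // per_job_mem_gb)
    # Always allow at least one job, never more than requested.
    return max(1, min(requested_jobs, fit))
```

With these made-up numbers, a stage whose jobs each need 4 GB on a 64 GB machine would be limited to 16 jobs even if 60 were requested, while a lighter stage could still use all 60.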

I can then try to speed up the process:

  • [ ] print all translations in one run
  • [ ] process multiple component files in one run (chunking), so header files will be parsed fewer times
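The chunking idea in the second item can be sketched as follows. This is an illustrative Python sketch, not the project's Perl implementation; `chunked`, `process_in_chunks`, `load_header`, and `extract` are hypothetical names, and the sketch assumes that the expensive shared header data can be parsed once and reused for every component file in a chunk.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
H = TypeVar("H")
R = TypeVar("R")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    chunk: List[T] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # last, possibly shorter, chunk
        yield chunk


def process_in_chunks(files: Iterable[T],
                      size: int,
                      load_header: Callable[[], H],
                      extract: Callable[[H, T], R]) -> Iterator[R]:
    """Parse the shared header once per chunk instead of once per file."""
    for chunk in chunked(files, size):
        header = load_header()  # expensive step, done once per chunk
        for component_file in chunk:
            yield extract(header, component_file)
```

In this sketch the number of header parses drops from one per file to one per chunk, so larger chunks mean fewer parses, at the cost of coarser units of parallel work.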

@TomazErjavec, please let me know what you think about this; I will then implement it in #894.

matyaskopp avatar Feb 05 '25 10:02 matyaskopp

Before I try to speed up the process, I need to factorize it out

If I understand correctly, you will make a new script, say parlamintp-tei2meta.pl, put this code there and then call the original script + parlamintp-tei2meta.pl from parlamint2distro.pl. Which is certainly a good idea, better than having the same code twice in different scripts.

@TomazErjavec, should it be backwards compatible?

I don't quite understand what you mean by this, but my intuitive answer is "no". If we can run the new scripts and get the same result as with the old ones, that is quite ok.

I can add an option -no-meta to the script, so that extracting metadata will not be called when it is present.

Hm, I don't see the need for that. It might even be dangerous, as it might (although probably won't) happen that the metadata is present, but from some previous version.

@TomazErjavec, please let me know, what you think about this

I think it is a good idea.

TomazErjavec avatar Feb 05 '25 15:02 TomazErjavec

The metadata extraction is now in Scripts/parlamintp-tei2meta.pl. It uses the -inRoot parameter instead of the input directory, so I placed dirification before running the script in samples:
https://github.com/clarin-eric/ParlaMint/blob/f80fbb8a07e55355bc8299751b30c4d4ec7389e7/Scripts/parlamint2distro.pl#L339-L340
https://github.com/clarin-eric/ParlaMint/blob/f80fbb8a07e55355bc8299751b30c4d4ec7389e7/Scripts/parlamint2distro.pl#L348-L349

matyaskopp avatar Feb 07 '25 14:02 matyaskopp

This has all been nicely implemented a while back, thanks @matyaskopp. I will close this; in case you feel anything here is still open, please reopen.

TomazErjavec avatar Jul 07 '25 10:07 TomazErjavec