Refactoring and speedup: metadata extraction
Currently, the code that extracts metadata to TSV appears in several scripts:
https://github.com/clarin-eric/ParlaMint/blob/2de4c7c04567e8df740f72b70577a15adbf4cd90/Scripts/parlamintp-tei2text.pl#L51-L75
https://github.com/clarin-eric/ParlaMint/blob/2de4c7c04567e8df740f72b70577a15adbf4cd90/Scripts/parlamintp2conllu.pl#L107-L124
Before I try to speed up the process, I need to
- [x] factor it out
In parlamint2distro.pl, I want to call this metadata extraction separately, because it needs a different setup (fewer parallel jobs). For ParlaMint-IL in particular, it cannot be run as 60 jobs: it runs out of memory, because each of the 45k files has to open the taxonomies and particDesc files, which are extremely large.
@TomazErjavec, should it be backwards compatible? I can add an option -no-meta to the script, so that extracting metadata will not be called when it is present.
I can then try to speed up the process:
- [ ] print all translations in one run
- [ ] process multiple component files in one run (chunking), so header files will be parsed fewer times
@TomazErjavec, please let me know what you think about this; I will then implement it in #894.
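To illustrate the chunking idea, here is a minimal sketch in Python (the actual scripts are Perl, so this only shows the shape of the approach). The names `chunked`, `extract_in_chunks`, `load_headers` and `extract_one` are hypothetical placeholders, not functions from the ParlaMint scripts. The assumption is that the shared header data (taxonomies, particDesc) is loaded once per chunk and reused for every component file in that chunk.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def extract_in_chunks(
    files: Iterable[str],
    size: int,
    load_headers: Callable[[], Any],
    extract_one: Callable[[str, Any], Any],
) -> List[Any]:
    """Load the shared headers once per chunk, then process each file in it."""
    rows = []
    for chunk in chunked(files, size):
        # Expensive step (hypothetical): parse taxonomies and particDesc.
        headers = load_headers()
        for path in chunk:
            rows.append(extract_one(path, headers))
    return rows
```

With 45,000 component files and a chunk size of 500, the headers would be loaded 90 times instead of 45,000 times. In a real setup each chunk would presumably be handed to a separate job, so the chunk size and the number of parallel jobs together bound the memory use.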
> Before I try to speed up the process, I need to factor it out

If I understand correctly, you will make a new script, say parlamintp-tei2meta.pl, move this code there, and then call both the original script and parlamintp-tei2meta.pl from parlamint2distro.pl. That is certainly a good idea, and better than having the same code duplicated in different scripts.
> @TomazErjavec, should it be backwards compatible?

I don't quite understand what you mean by this, but my intuitive answer is "no". If we can run the new scripts and get the same result as with the old ones, that is quite ok.
> I can add an option -no-meta to the script, so that extracting metadata will not be called when it is present.

Hm, I don't see the need for that. It might even be dangerous, as it might (although probably won't) happen that the metadata is present, but from some previous version.
> @TomazErjavec, please let me know what you think about this

I think it is a good idea.
The metadata extraction is now in Scripts/parlamintp-tei2meta.pl
It takes an -inRoot parameter instead of the input directory, so for the samples I placed the dirification step before the script is run:
https://github.com/clarin-eric/ParlaMint/blob/f80fbb8a07e55355bc8299751b30c4d4ec7389e7/Scripts/parlamint2distro.pl#L339-L340
https://github.com/clarin-eric/ParlaMint/blob/f80fbb8a07e55355bc8299751b30c4d4ec7389e7/Scripts/parlamint2distro.pl#L348-L349
This was all nicely implemented a while back, thanks @matyaskopp. I will close this; if you feel anything here is still open, please reopen.