khiops
Parallelization of the collection of tokens for text analysis
Context: Khiops V11, AutoML for variables of type Text (a new type)
As part of automatic variable construction for Text variables, a first analysis phase of the database collects the most frequent tokens ("ngrams", "words", "tokens") in order to create sparse variable blocks. This phase is currently implemented sequentially. The goal is to parallelize it.
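To make the collection phase concrete, here is a minimal sequential sketch of counting tokens and keeping the most frequent ones. It is not the Khiops implementation: the helper names `CountNGrams` and `MostFrequent`, the choice of character n-grams, and the alphabetical tie-break are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using TokenFrequency = std::pair<std::string, long>;

// Hypothetical helper: count every character n-gram of length n in the given texts.
std::unordered_map<std::string, long> CountNGrams(const std::vector<std::string>& texts, std::size_t n)
{
    assert(n > 0);
    std::unordered_map<std::string, long> counts;
    for (const std::string& text : texts)
    {
        // Slide a window of n characters over the text
        for (std::size_t i = 0; i + n <= text.size(); i++)
            counts[text.substr(i, n)]++;
    }
    return counts;
}

// Hypothetical helper: keep the k most frequent tokens.
// Ties are broken alphabetically so that the result is deterministic.
std::vector<TokenFrequency> MostFrequent(const std::unordered_map<std::string, long>& counts, std::size_t k)
{
    std::vector<TokenFrequency> sorted(counts.begin(), counts.end());
    std::sort(sorted.begin(), sorted.end(), [](const TokenFrequency& a, const TokenFrequency& b) {
        if (a.second != b.second)
            return a.second > b.second;
        return a.first < b.first;
    });
    if (sorted.size() > k)
        sorted.resize(k);
    return sorted;
}
```

Each retained token would then typically correspond to one variable of a sparse block; how Khiops actually builds these blocks is not shown here.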
Entry point:
- KDDomainKnowledge/KDTextTokenSampleCollectionTask::CollectTokenSamples
- example of a basic task: KWDatabaseCheckTask (overall structure);
- parent task class: KWDatabaseTask;
- another task with basic parallel processing mechanics (database chunking by the master, traversal by the workers, aggregation by the master): KWDatabaseBasicStatsTask;
- relevant class here: KDTextTokenSampleCollectionTask:
  - the CollectTokenSamples method:
    - sequential implementation: SequentialCollectTokenSamples method;
    - beginning of the parallel implementation (some preparatory work for parallelization): InternalCollectTokenSamples, which contains only the first-pass implementation (to check for completeness); the second pass is not yet implemented.
  - MasterAggregateResults: aggregation of the tokens from the workers, not yet implemented.
  - SlaveInitialize: prepares the data structures for / by each worker; the first pass is already implemented, the second pass is not yet implemented.
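The mechanics mentioned above (database chunking by the master, traversal by the workers, aggregation by the master) can be sketched as follows. This is only an illustration under stated assumptions: it does not use the Khiops task framework, all names (`SplitIntoChunks`, `CountChunk`, `AggregateCounts`, `CollectCounts`) are hypothetical, tokens are plain whitespace-separated words, and the workers are called one after another in a loop to keep the sketch dependency-free. In a real parallel run, each `CountChunk` call would execute in a separate worker.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using TokenCounts = std::unordered_map<std::string, long>;
using Chunk = std::pair<std::size_t, std::size_t>; // record range [first, second)

// Master side (hypothetical): split recordCount records into at most chunkCount contiguous ranges.
std::vector<Chunk> SplitIntoChunks(std::size_t recordCount, std::size_t chunkCount)
{
    assert(chunkCount > 0);
    std::vector<Chunk> chunks;
    const std::size_t chunkSize = (recordCount + chunkCount - 1) / chunkCount;
    for (std::size_t begin = 0; begin < recordCount; begin += chunkSize)
        chunks.push_back(Chunk(begin, std::min(begin + chunkSize, recordCount)));
    return chunks;
}

// Worker side (hypothetical): count the whitespace-separated words of one chunk.
TokenCounts CountChunk(const std::vector<std::string>& records, const Chunk& chunk)
{
    TokenCounts local;
    for (std::size_t i = chunk.first; i < chunk.second; i++)
    {
        std::istringstream stream(records[i]);
        std::string word;
        while (stream >> word)
            local[word]++;
    }
    return local;
}

// Master side (hypothetical): add one worker's partial counts to the global counts.
void AggregateCounts(TokenCounts& total, const TokenCounts& partial)
{
    for (const auto& entry : partial)
        total[entry.first] += entry.second;
}

// Driver: chunk the records, process each chunk, and aggregate the partial results.
// The chunks are processed sequentially here; they are independent of each other.
TokenCounts CollectCounts(const std::vector<std::string>& records, std::size_t chunkCount)
{
    TokenCounts total;
    for (const Chunk& chunk : SplitIntoChunks(records.size(), chunkCount))
        AggregateCounts(total, CountChunk(records, chunk));
    return total;
}
```

Because the partial counts are summed in full, the result does not depend on the number of chunks. If each worker kept only its own most frequent tokens before sending them to the master, the aggregated counts would become approximate, since a token that is moderately frequent in every chunk could be dropped everywhere.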