
Parallelization of the collection of tokens for text analysis

Open • marcboulle opened this issue 2 years ago • 1 comment

Context: Khiops V11, autoML for variables of type Text (a new type)

As part of automatic variable construction for Text variables, a first phase of database analysis consists of collecting the most frequent tokens ("ngrams", "words", "tokens") in order to create sparse variable blocks. This phase is currently implemented sequentially. The goal is to make the implementation parallel.

Entry point:

  • KDDomainKnowledge/KDTextTokenSampleCollectionTask::CollectTokenSamples
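
To make the collection phase concrete, here is a minimal sequential sketch of "count tokens, keep the most frequent ones". It is not the Khiops code: `CountWordTokens` and `MostFrequentTokens` are hypothetical names, tokens are assumed to be whitespace-separated words, and everything is held in memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using TokenCounts = std::unordered_map<std::string, long long>;

// Hypothetical helper (not a Khiops API): count whitespace-separated
// word tokens over a collection of text values.
TokenCounts CountWordTokens(const std::vector<std::string>& texts) {
    TokenCounts counts;
    for (const std::string& text : texts) {
        std::istringstream stream(text);
        std::string token;
        while (stream >> token)
            ++counts[token];
    }
    return counts;
}

// Hypothetical helper: return the k most frequent tokens, most frequent first.
// Ties are broken alphabetically so that the result is deterministic.
std::vector<std::pair<std::string, long long>> MostFrequentTokens(
    const TokenCounts& counts, std::size_t k) {
    assert(k > 0);  // assumption of this sketch: the caller asks for at least one token
    std::vector<std::pair<std::string, long long>> sorted(counts.begin(), counts.end());
    auto by_frequency = [](const auto& a, const auto& b) {
        if (a.second != b.second)
            return a.second > b.second;
        return a.first < b.first;
    };
    k = std::min(k, sorted.size());
    // Only the first k positions need to be ordered.
    std::partial_sort(sorted.begin(), sorted.begin() + k, sorted.end(), by_frequency);
    sorted.resize(k);
    return sorted;
}
```

Calling `MostFrequentTokens(CountWordTokens(texts), k)` yields the tokens that would each back one variable of a sparse block. The deterministic tie-break matters once the work is parallelized, since the result should not depend on how the database is split.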

marcboulle, Sep 08 '23 14:09

  • example of a basic task: KWDatabaseCheckTask (shows the overall structure);
  • parent task class: KWDatabaseTask;
  • another task with the basic parallel processing mechanics (database chunking by the master, traversal by the workers, aggregation by the master): KWDatabaseBasicStatsTask;
  • relevant class here: KDTextTokenSampleCollectionTask:
    • the CollectTokenSamples method:
      • sequential implementation: the SequentialCollectTokenSamples method;
      • start of a parallel implementation (some preparatory work for parallelization): InternalCollectTokenSamples, which contains only the first-pass implementation (to be checked for completeness); the second pass is not yet implemented.
    • MasterAggregateResults: aggregation of the tokens from the workers is not yet implemented.
    • SlaveInitialize: prepares the data structures for / by each worker; the first pass is already implemented, the second pass is not yet implemented.
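
The master/worker mechanics listed above (chunking by the master, traversal by the workers, aggregation by the master) can be sketched outside of Khiops as follows. This is only an illustration under stated assumptions: it uses `std::async` threads instead of the Khiops task framework, an in-memory vector of records instead of a database, whitespace tokenization, and hypothetical function names (`CountChunk`, `AggregateCounts`, `ParallelCountTokens`). It does not model the two passes mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

using TokenCounts = std::unordered_map<std::string, long long>;

// Worker side: count whitespace-separated tokens in records [begin, end).
// Each worker fills its own local map, so no locking is needed.
TokenCounts CountChunk(const std::vector<std::string>& records,
                       std::size_t begin, std::size_t end) {
    TokenCounts local;
    for (std::size_t i = begin; i < end; ++i) {
        std::istringstream stream(records[i]);
        std::string token;
        while (stream >> token)
            ++local[token];
    }
    return local;
}

// Master side: merge one worker's partial counts into the global counts
// (the kind of work an aggregation step would have to do).
void AggregateCounts(TokenCounts& global, const TokenCounts& partial) {
    for (const auto& entry : partial)
        global[entry.first] += entry.second;
}

// Master side: split the records into contiguous chunks, dispatch one
// asynchronous job per chunk, then aggregate the partial results.
TokenCounts ParallelCountTokens(const std::vector<std::string>& records,
                                std::size_t num_workers) {
    assert(num_workers > 0);
    // Ceiling division, so that at most num_workers chunks are created.
    const std::size_t chunk_size = (records.size() + num_workers - 1) / num_workers;
    std::vector<std::future<TokenCounts>> pending;
    for (std::size_t begin = 0; begin < records.size(); begin += chunk_size) {
        const std::size_t end = std::min(begin + chunk_size, records.size());
        pending.push_back(std::async(std::launch::async, [&records, begin, end] {
            return CountChunk(records, begin, end);
        }));
    }
    TokenCounts global;
    for (auto& result : pending)
        AggregateCounts(global, result.get());
    return global;
}
```

Because counts are summed per token, the aggregated result does not depend on how the records are chunked, which makes it straightforward to check a parallel version against the sequential one on the same data.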

popescu-v, Sep 02 '24 11:09