New document class

Open corentin-larose opened this issue 11 years ago • 2 comments

Work in progress, just to see whether you like this direction...

corentin-larose, Jan 04 '14 23:01

I like the idea, but I am still thinking about the implications.

camspiers, Jan 05 '14 01:01

I just realized that I put my explanations in the commit message, and I don't know if you saw them (but yes, there is a lot to think about; it's just a starting point):

For instance, this document can be used directly in the classify() method, replacing the commented-out code and thus the related properties/accessors:

    public function classify(Document $document)
    {
        $results = array();

        /*
        if ($this->documentNormalizer) {
            $document = $this->documentNormalizer->normalize($document);
        }

        $tokens = $this->tokenizer->tokenize($document);

        if ($this->tokenNormalizer) {
            $tokens = $this->tokenNormalizer->normalize($tokens);
        }

        $tokens = array_count_values($tokens);
        */

        $tokens = $document;
        [...]

My pros

  • Strong contract for Documents through interface
  • Document is in a frequency state ASAP
  • The Document API is wide open (cf. the unit tests)
  • The Document can still be manipulated as an array/Iterable (a shame that the Symfony Config component (DataStore) doesn't like ArrayObject)
  • Since the document is an object, it is more RAM-efficient (no multiple copies as with an array)
  • Agnostic approach using SPL
  • One can even use closures/built-in functions for normalizers/tokenizers (faster?)
  • Hydrators/Extractors made simpler
  • Some more document-level calculations could be done in the instance
  • TokenCountByDocument no longer necessary
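
To make the points above concrete, here is a minimal sketch of what such a document could look like. It is not the code from the commit: the class name `FrequencyDocument`, the constructor arguments and the `getTokenCount()` method are all hypothetical, and it only assumes that the document extends SPL's `ArrayObject` and takes plain callables for tokenizing and normalizing.

```php
<?php

// Hypothetical sketch: class, parameter and method names are illustrative,
// not the ones used in the actual commit.
class FrequencyDocument extends ArrayObject
{
    /**
     * @param string        $text       Raw document text
     * @param callable      $tokenizer  Takes a string, returns an array of tokens
     * @param callable|null $normalizer Takes an array of tokens, returns an array of tokens
     */
    public function __construct($text, $tokenizer, $normalizer = null)
    {
        $tokens = call_user_func($tokenizer, $text);

        if ($normalizer !== null) {
            $tokens = call_user_func($normalizer, $tokens);
        }

        // Store token => count, so the document is in a frequency state right away.
        parent::__construct(array_count_values($tokens));
    }

    /** Total number of tokens: a document-level calculation kept in the instance. */
    public function getTokenCount()
    {
        return array_sum($this->getArrayCopy());
    }
}

$document = new FrequencyDocument(
    'The quick fox and the lazy dog',
    // A closure as tokenizer: split on whitespace.
    function ($text) {
        return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    },
    // A closure wrapping a built-in function as normalizer.
    function (array $tokens) {
        return array_map('strtolower', $tokens);
    }
);

echo $document['the'], "\n";            // 2
echo count($document), "\n";            // 6 distinct tokens
echo $document->getTokenCount(), "\n";  // 7 tokens in total
```

Because the instance is already a token => count map, a classify() method could iterate over it directly with `foreach ($document as $token => $count)`.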

My cons

  • Not sure whether it should be in a token state rather than a frequency state (are the raw tokens needed for calculations/counts? we could store this information too)
  • Loose contracts for normalizers/tokenizers, since it uses callables instead of classes with interfaces (this could still be enforced, but we would lose the closures/built-in functions advantage)
  • Slower than arrays? (not sure, it needs a benchmark, since SPL is incredibly fast and it removes a lot of surrounding logic/iterations)
  • Static approach for accessors, which is sometimes hated by developers (unit tests...)
  • Your cons?
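
On the loose-contract point, one possible compromise (an assumption on my side, not something in the commit) would be a small adapter that wraps any callable behind an interface. The names `TokenizerInterface`, `CallableTokenizer` and `countTokens()` below are hypothetical and only illustrate the idea.

```php
<?php

// Hypothetical names throughout; this is not code from the commit.
interface TokenizerInterface
{
    /** @return array List of string tokens */
    public function tokenize($text);
}

// Adapter giving any callable (closure or built-in function name) the typed contract.
class CallableTokenizer implements TokenizerInterface
{
    private $callable;

    public function __construct($callable)
    {
        if (!is_callable($callable)) {
            throw new InvalidArgumentException('Tokenizer must be callable');
        }

        $this->callable = $callable;
    }

    public function tokenize($text)
    {
        $tokens = call_user_func($this->callable, $text);

        // The return value is checked at runtime, since a callable cannot declare it.
        if (!is_array($tokens)) {
            throw new UnexpectedValueException('Tokenizer must return an array');
        }

        return $tokens;
    }
}

// The type hint enforces the contract at the call site.
function countTokens(TokenizerInterface $tokenizer, $text)
{
    return array_count_values($tokenizer->tokenize($text));
}

$frequencies = countTokens(
    new CallableTokenizer(function ($text) {
        return explode(' ', $text);
    }),
    'a b a'
);

print_r($frequencies);
```

The cost is one extra object and method call per tokenizer, which would have to be weighed in the same benchmark as the array vs. SPL question.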

corentin-larose, Jan 05 '14 08:01