statistical-classifier
New document class
Work in progress; I just want to know whether you like this direction...
I like the idea, but I am still thinking about the implications.
I just realized that I put my explanations in the commit message, and I don't know if you saw them (but yes, there is a lot to think about... It's just a direction):
For instance, this document can be used directly in the classify() method, replacing the commented-out code and thus the related properties/accessors:
public function classify(Document $document)
{
    $results = array();
    /*
    if ($this->documentNormalizer) {
        $document = $this->documentNormalizer->normalize($document);
    }
    $tokens = $this->tokenizer->tokenize($document);
    if ($this->tokenNormalizer) {
        $tokens = $this->tokenNormalizer->normalize($tokens);
    }
    $tokens = array_count_values($tokens);
    */
    $tokens = $document;
    [...]
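To make the `$tokens = $document;` line more concrete, here is a minimal sketch of a document that is already in frequency state when it reaches the classifier. It is not the class from this pull request: the name `SketchDocument`, the constructor signature and the whitespace default tokenizer are all assumptions made for illustration, and the callables are passed to the constructor only to keep the example short.

```php
<?php

// Hypothetical sketch: the class name, constructor signature and defaults
// are assumptions, not the API proposed in this pull request.
class SketchDocument extends ArrayObject
{
    /**
     * @param string        $text       raw document text
     * @param callable|null $tokenizer  maps a string to an array of tokens
     * @param callable|null $normalizer maps a string to a string (applied to the raw text)
     */
    public function __construct($text, $tokenizer = null, $normalizer = null)
    {
        if ($normalizer !== null) {
            $text = call_user_func($normalizer, $text);
        }

        // Default tokenizer (an assumption): split on runs of whitespace.
        $tokens = $tokenizer !== null
            ? call_user_func($tokenizer, $text)
            : preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

        // Store the frequency state directly: token => count.
        parent::__construct(array_count_values($tokens));
    }

    /** Document-level calculation: total number of tokens. */
    public function getTokenCount()
    {
        return array_sum($this->getArrayCopy());
    }
}

// A built-in function can serve as the normalizer because it is a callable.
$document = new SketchDocument('Hello hello world', null, 'strtolower');

// The document can be iterated like the array that array_count_values() returned.
foreach ($document as $token => $count) {
    echo $token, ': ', $count, PHP_EOL;
}
```

With something like this, classify() would only have to iterate over the token => count pairs, which is what the commented-out block used to prepare.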
My pros
- Strong contract for Documents through an interface
- Document is in a frequency state ASAP
- Document API is very wide open (cf. the unit tests)
- Document can still be manipulated as an array/Iterable (a shame that the Symfony Config component (DataStore) doesn't like ArrayObject)
- Since the document is an object, it should be more RAM-efficient (it is passed by handle, so there are no multiple copies as can happen with a modified array)
- Agnostic approach using SPL
- One can even use closures/built-in functions for normalizers/tokenizers (faster?)
- Hydrators/Extractors made simpler
- Some more document-level calculations could be done in the instance
- TokenCountByDocument no longer necessary
My cons
- Not sure if it should be in tokens state rather than in frequency state (needed for calculation/count? We could store this information as well)
- Loose contracts for normalizers/tokenizers since it uses callables instead of classes with interfaces (could still be enforced though, but we would lose the closures/built-in functions advantage)
- Slower than arrays? (not sure, needs a benchmark since SPL is incredibly fast, and it removes a lot of surrounding logic/iterations)
- Static approach for accessors, which is sometimes hated by developers (unit tests...)
- Your cons?
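On the "slower than arrays?" question, a rough micro-benchmark along these lines could give a first answer. This is only a sketch: the data size, the iteration count and the helper names `sumCounts()` and `timeIt()` are arbitrary assumptions, and it measures plain iteration only, not the tokenizing/normalizing logic that the document class would remove.

```php
<?php

// Hypothetical micro-benchmark: sizes and helper names are arbitrary assumptions.
$counts = array();
for ($i = 0; $i < 1000; $i++) {
    $counts['token' . $i] = $i + 1;
}
$object = new ArrayObject($counts);

// Works for both a plain array and an ArrayObject, since both are iterable.
function sumCounts($tokens)
{
    $sum = 0;
    foreach ($tokens as $count) {
        $sum += $count;
    }

    return $sum;
}

// Returns the elapsed wall-clock time in seconds for $iterations full passes.
function timeIt($tokens, $iterations)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        sumCounts($tokens);
    }

    return microtime(true) - $start;
}

printf("array:       %.4f s\n", timeIt($counts, 1000));
printf("ArrayObject: %.4f s\n", timeIt($object, 1000));
```

The timings will depend on the PHP version and the machine, so the numbers are only meaningful when compared on the same setup.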