GreynirServer icon indicating copy to clipboard operation
GreynirServer copied to clipboard

Grammar for parsing numbers in text form

Open sultur opened this issue 3 years ago • 2 comments

Grammar idea for parsing spoken numbers.

sultur avatar Aug 11 '21 13:08 sultur

Note: This only works if GreynirPackage doesn't join together numbers (such as "fimm hundruð þúsund") into a single token. (~~Currently working on a pull request for GreynirPackage which addresses this~~. New pull request)

sultur avatar Aug 11 '21 13:08 sultur

Large improvements to this pull request (and new ideas) :), but it is not ready yet. I would like to move repeated code from the query modules into the queries/utility folder, but I don't have time to do that yet as I'm prioritizing the IoT PR.

sultur avatar Oct 07 '22 14:10 sultur

Should we merge this pull request, after a review? @vthorsteinsson @sveinbjornt

Pros

It contains very helpful functionality for other pull requests (which could make them cleaner and more manageable) and currently doesn't break any functionality in the query modules or elsewhere that I know of.

The STT normalizes numbers for us at the moment, so the new tokenization and number parsing isn't needed for the query modules usual behaviour.

Also it doesn't require any radical changes to query module creation or Greynir's usage, despite the wall of text which I'm currently writing.

Cons?

It still has room for improvements, mainly specifying better the inflections of numbers in the grammar and extracting more functionality from other query modules than currency.py, but that could be done in other pull requests so this one doesn't become extremely large.

Main changes summarized

queries/utility/

Created the queries/utility folder which should house shared functionality between the query modules. This required changes to query.py, tree.py and minor changes to processor.py.

query.py

Functions imported from tree query modules are now kept in a ChainMap instead of a module object (the chain map contains nonterminal functions from the utility modules, along with functions from the imported module). So Query._tree_processors is now a list of ChainMap objects.

The chain maps prefer the imported processor's functions over the shared utility functions, so authors of query modules can overwrite utility module functionality on a module-by-module basis if needed.

tree.py

I modified some functions to accomodate the the chain maps instead of module objects, which mostly just involved changing getattr() calls to .get() calls.

I created a type called ProcEnv (processing environment), which is just a type describing a mapping of keywords/nonterminal names to functions (a ChainMap is a mapping, so it's also a ProcEnv). Functions which used to accept module objects now accept ProcEnv objects.

processor.py

The aforementioned changes however also meant I had to pass in vars(p) in some places in processor.py instead of just p (in order to change the module objects into mappings of functions to match the ProcEnv type). The functionality stays the same there though. Also added type hints and calls to vars() in test_processors.py where necessary.

Query tokenization

Incoming queries (note: only used in query processing) are now tokenized using no_multiply_numbers=True. This means written numbers, e.g. "tvö hundruð" are no longer a single token, but rather two word tokens which should be parsed using the the UHeilTala nonterminal (fractions can be parsed using UBrotaTala, see queries/currency.py and tests/test_queries.py::test_currency for examples of these).

pathlib.Path instead of os.path

util.py now has GREYNIR_ROOT_PATH: pathlib.Path available as a variable (found using LICENSE.txt in the root of the repo). This makes it easy (and cleaner in my opinion) to reference any file path in the Greynir repo (platform agnostic too). Refactored a bunch of usages of os.path.join to utilize this instead, while keeping the functionality the same. This also allowed removing of a bunch of ~~hacky~~ functions that manipulated the path in order to find certain files.

sultur avatar Oct 27 '22 10:10 sultur

Looks good to me! But before merging, it'd be great if @vthorsteinsson took a look as well.

sveinbjornt avatar Oct 27 '22 14:10 sveinbjornt

I think I've addressed all your comments, thanks for the review! :smile_cat: Ok to merge?

sultur avatar Oct 28 '22 11:10 sultur