
Some quirks when parsing a general text...

Open psychemedia opened this issue 6 years ago • 7 comments

I wrote a simple story and it threw up some interesting numbers...

from quantulum3 import parser

text = '''
Once upon a time, there was a thing. The thing weighed forty kilogrammes and cost £250. 
It was blue. It took forty five minutes to get it home. 
What a day that was. I didn't get back until 2.15pm. Then I had cake for tea.
'''

parser.inline_parse_and_expand(text)

returns:

"\nOnce upon one instance, there was a thing. The thing weighed forty kilograms and cost two hundred and fifty pounds sterling, zero pence. \nIt was blue. It took forty-five minutes to get it home. \nWhat one day that was. I didn't get back until two point one five picometres. Then I had cake for tea.\n"

and parser.parse(text) returns:

[Quantity(1, "Unit(name="count", entity=Entity("dimensionless"), uri=Count_data)"), Quantity(40, "Unit(name="kilogram", entity=Entity("mass"), uri=Kilogram)"), Quantity(250, "Unit(name="pound sterling", entity=Entity("currency"), uri=Pound_sterling)"), Quantity(45, "Unit(name="minute of arc", entity=Entity("angle"), uri=Minute_and_second_of_arc)"), Quantity(1, "Unit(name="day", entity=Entity("time"), uri=Day)"), Quantity(2.15, "Unit(name="picometre", entity=Entity("length"), uri=Picometre)")]

psychemedia, Aug 06 '19 13:08

There are some legitimate problems with the parse output of this text, thanks for the sample! I will have a look into certain issues.

  • [ ] "a thing" results in "1 count", which is actually not that wrong...
  • [x] pm/am are interpreted as pico-/attometres rather than time delimiters
  • [ ] "it took 45 minutes" is interpreted as minutes of arc

Disambiguation is not perfect yet as shown by the "minute of arc" interpretation. Still working on improving this...

nielstron, Aug 07 '19 16:08
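Until the am/pm handling is fixed in the parser itself, one possible workaround is to mask clock times before the text reaches quantulum3. The sketch below uses only the standard library; the `mask_clock_times` helper, the regular expression and the `<TIME>` placeholder are illustrative assumptions, not part of quantulum3, and the pattern is not exhaustive.

```python
import re

# Clock times such as "2.15pm", "2:15 pm" or "11am".
# The lookbehind keeps the pattern from firing in the middle of a
# decimal number such as "3.5 pm" (which may well mean picometres).
CLOCK_TIME = re.compile(
    r"(?<![\w.])\d{1,2}(?:[.:]\d{2})?\s?(?:am|pm)\b",
    re.IGNORECASE,
)


def mask_clock_times(text, placeholder="<TIME>"):
    """Replace clock times with a placeholder (hypothetical helper)."""
    return CLOCK_TIME.sub(placeholder, text)
```

The masked string can then be passed to `parser.parse` instead of the original text. Because masking changes the length of the string, any character offsets reported for the masked text will not line up with the original.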

In passing, I also just spotted this natural language time parsing package — ctparse — but I've not had a chance to play with it yet.

psychemedia, Aug 11 '19 21:08

I had several similar issues, the weirdest being 'PayPal' parsed into 'petayear year petayear litre'. Is there a way to restrict quantulum to basic units so that it does not try to guess these combinations, or some other way to adapt its behavior to my situation?

alberto-bracci, May 11 '20 17:05

I agree, there should be a parameter to disable parsing of combined units that are not separated by spaces. It might also help to accept a list of custom (application-specific) words that are not to be interpreted as units. PRs addressing this are welcome, otherwise I might at some point find the time to implement this myself :)

nielstron, May 12 '20 19:05
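Until such a parameter exists, a list of application-specific words can be applied as a post-filter on the parse result. This is only a sketch: `blocked_spans` and `drop_blocked` are hypothetical helpers, and the code assumes each returned quantity exposes a `span` attribute holding `(start, end)` character offsets into the parsed text.

```python
import re


def blocked_spans(text, blocked_words):
    """Return (start, end) offsets of every blocked word in text."""
    if not blocked_words:
        return []
    pattern = re.compile(
        "|".join(re.escape(word) for word in blocked_words),
        re.IGNORECASE,
    )
    return [match.span() for match in pattern.finditer(text)]


def drop_blocked(quantities, text, blocked_words):
    """Drop quantities whose span overlaps a blocked word.

    Assumes each quantity has a `span` attribute of (start, end) offsets.
    """
    spans = blocked_spans(text, blocked_words)

    def overlaps(quantity):
        start, end = quantity.span
        return any(start < b_end and b_start < end for b_start, b_end in spans)

    return [q for q in quantities if not overlaps(q)]
```

A call such as `drop_blocked(parser.parse(text), text, ["PayPal"])` would then discard anything the parser read out of "PayPal" while keeping the other quantities untouched.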

I'll see whether I can find the time to do it. On another note: is editing the entities.json or units.json files the only way to add custom units, or is there a way to do it from Python?

alberto-bracci, May 13 '20 13:05

Currently, editing those files is the easiest way without changing the source code of the project. You can of course also add your own entities and units by manipulating the cached Entities and Units objects stored in the _CACHE_DICT in load.py.

nielstron, May 13 '20 13:05

@alberto-bracci with #186 there will be an option to add custom entities and units to quantulum3 without any hassle :) Sorry for the delay, but this required some reworking of quantulum's inner structure that was pending anyway.

nielstron, Mar 02 '21 00:03