doc.lists() sensitivity
Hello and thanks for the great library! 👋
I've been having some difficulty getting the `doc.lists()` method to work; it seems to be very sensitive to how lists are structured and to the items they contain. The example material at https://observablehq.com/@spencermountain/compromise-lists contains an example of this sensitivity, where the code sample `nlp('he eats, shoots, and leaves.').lists().length` returns a length of zero. In that particular sample, it's the word "leaves" that causes it to fail (changing "leaves" to e.g. "runs" returns a positive result).
Here are some other cases I've noticed will cause it to fail:

```js
// correct
nlp('i saw Paris, Berlin, Warsaw, and Dublin').lists().items().out('array');
// > ["Paris,", "Berlin,", "Warsaw,", "Dublin"]

// OXFORD COMMA: remove the oxford comma and "Warsaw" gets lost
nlp('i saw Paris, Berlin, Warsaw and Dublin').lists().items().out('array');
// > ["Paris,", "Berlin,", "Dublin"]

// NO COMMA: a two-item list with no comma is empty
nlp('i saw Paris and Dublin').lists().items().out('array');
// > []

// COMPLEX ITEMS: determiners cause unpredictable matches
nlp('i saw the Eiffel Tower, the pyramids, and the Louvre').lists().items().out('array');
// > []
nlp('i saw the Eiffel Tower, the pyramids, and Warsaw').lists().items().out('array');
// > ["pyramids,", "Warsaw"]
```
On a related note, the items in the output array usually keep the trailing comma they had in the original text. Is that intended? The commas show up whether I'm using items() or things().
Any guidance would be appreciated!
hey Jackson, you're right. This method is really poorly done and needs work.
The idea was that since .clauses() splits by one comma-usage (properly), .lists() would split by list-commas. It ended up being pretty half-baked.
You can see the method to split by comma here.
It would be great to improve not just the method for identifying lists (it's pretty brittle, as you mentioned) but also parsing them into relevant items, and adding/removing items in a grammatical way. I'd love some help with this.
hah, i just noticed that addOxfordComma and removeOxfordComma don't actually do anything. Yikes.
yeah, maybe it makes sense to step back and think about what a natural-language list is in the first place. I made a modest attempt at preventing things like 'walked, two, and george.' - requiring that all three items resemble each other a little. Also, I don't think 2-item lists are a thing.
Lastly, it should be smart about things like "red, green and not blue." -> `.add('yellow')`...
Poorly done? Nah! Just a work-in-progress. I can see this being a very complex problem to tackle in a way that still fits in with the 'modest compromise' philosophy.
What files would you suggest I take a look at if I were to try and wrap my head around the current implementation? I imagine compromise/src/Subset/Lists.js would be a good place to start? I'm just getting started with the library but would love to help once I'm confident with the internals.
thanks, that would be lovely. Any pr is welcome.
Feel free to play around with different ways to find lists with the match-syntax in ./subset/lists.js. You can see the current one uses `.if('@hasComma')`, which returns any sentences with ... a comma... ☹
... but then tries some forms like `#Noun #Conjunction #Noun` - which matches 'cat and dog', etc.
I think there are a couple of basic tests if you run `npm run test`, but i'm sure anything you decide to do will be a step forward. cheers
i'd be wary about supporting two-item lists right now, like the 'Paris and Dublin' example. That may have a ton of false-positives, unless we were careful.
Even if there were a comma, there are things like 'Paris, France' or 'i loved paris, dublin was fun.'
hey, these examples have been fixed in 13.1.1, if you wanna check it out.
I've also implemented add/removeOxfordComma methods, and the lists().remove('item-match') method.
I'll leave this open, as there is probably a lot more we can do to improve the natural-language list parsing. thanks for the heads-up. cheers
Hey! It's been a busy few months on my end, but I haven't forgotten about this – happy to hear that improvements rolled out with 13.1.1; guess it's been a while since I checked back! I tried taking a stab at this back in February, but spent a lot of time learning the other internals so that I could make a quality PR, and then got sidetracked by life :)
Thanks for leaving this open; I'm thinking I should get a chance to try this behaviour out sometime in the next few weeks.