Lots of new country and corporate type country coverage from the 'disco' fork
I added a bunch of country and company types from the disco fork at https://github.com/rjurney/disco
Starts out #93
So @psolin what do you think? This should be backwards compatible but add a lot of regional coverage.
@psolin I made the changes you requested except... you want me to sort everything alphabetically? Let me see if Claude Code can do that...
@psolin I made the changes you requested except... you want me to sort everything alphabetically? Let me see if Claude Code can do that...
Great, yes, it should be able to do that, but it'll probably take a lot of tokens.
@psolin I made the changes you requested except... you want me to sort everything alphabetically? Let me see if Claude Code can do that...
Great, yes, it should be able to do that, but it'll probably take a lot of tokens.
Okay, I think we're good to go for this PR! The unit tests all pass and things are alphabetized. Once this is in, I'll work on some of the performance improvements. I don't suppose you can take a look and see what you think you'd like to pull over?
I didn't make the changes, @marekmodry did but they made cleanco (renamed disco) much faster. See for example:
- The use of a LRU cache: https://github.com/rjurney/disco/blob/main/disco/utils.py#L17-L26
- More LRU cache: https://github.com/rjurney/disco/blob/main/disco/legaltype/detector.py#L24-L76
I can make the changes to the rest of the code, but @psolin I'm just wondering if you aren't the better person to do them? :)
@rjurney given it appears you're more recently familiar with the performance issues & the improvements, it would be great if you could submit those as PR(s). Using LRU cache seems a nice improvement, although I never went as far as that. Did you actually profile the code to find the hotspots or how did you choose what to improve? IIRC there were some obvious things, too, the last I worked on it long time ago.
The test job failure (see below) is caused by an assertion error in tests/test_cleanname.py at line 478, specifically in a test checking unicode non-Latin script handling. The log indicates:
AssertionError: unicode non-Latin script test for Greek alphabet failed assert '' == 'Εταιρεία Περ...μένης Ευθύνης'
The greek text means "Limited Liability Company" — the Greek legal form equivalent to an LLC/Ltd.
Transliteration: Etaireía Periorisménis Efthýnis, or, as often written, Etaireia Periorismenis Efthynis. Abbreviation: Ε.Π.Ε. (E.P.E.)
The basename function is returning an empty string instead of the expected Greek text. Other tests pass - Greek seems to be the key issue).
Okay, I'll take a look.
@rjurney given it appears you're more recently familiar with the performance issues & the improvements, it would be great if you could submit those as PR(s). Using LRU cache seems a nice improvement, although I never went as far as that. Did you actually profile the code to find the hotspots or how did you choose what to improve? IIRC there were some obvious things, too, the last I worked on it long time ago.
We found the slow spots and improved them, although I've no idea how to fix the UTF issue...
I think that the testing may be off. The test should preserve the entire Greek string "Εταιρεία Περιορισμένης Ευθύνης", but since it is a term now, it removes it, creating a blank string and failing the test. I think that the test needs to read something like: "Greek alphabet": ('Ελληνική Επιχείρηση', 'Ελληνική Επιχείρηση'),, since "Εταιρεία Περιορισμένης Ευθύνης" is literally a term now and not just a generic name for a business. I can update that now, and then this should pass.
Sounds good, thanks.
I think that the testing may be off. The test should preserve the entire Greek string "Εταιρεία Περιορισμένης Ευθύνης", but since it is a term now, it removes it, creating a blank string and failing the test. I think that the test needs to read something like:
"Greek alphabet": ('Ελληνική Επιχείρηση', 'Ελληνική Επιχείρηση'),, since "Εταιρεία Περιορισμένης Ευθύνης" is literally a term now and not just a generic name for a business. I can update that now, and then this should pass.
Any update? I'd like to get this integrated.
@psolin I am basing this off my other PR, to see if it fixes it. Can you approve my runs plz? Of course at the moment there is duplicate code, but merging the other one first would fix that. I'm not sure how else to do it.
Okay, I think this is ready to go? cc @psolin
@petri I think it needs to run through the test? I am not sure how to do this.
Please approve the workflows so the tests can run. Or make me a contributor.
Ping :)
@psolin the tests pass, can you please merge? If you will review and approve in a timely manner I will work on the library. Otherwise can you give me permission as Contributor?
@rjurney given it appears you're more recently familiar with the performance issues & the improvements, it would be great if you could submit those as PR(s). Using LRU cache seems a nice improvement, although I never went as far as that. Did you actually profile the code to find the hotspots or how did you choose what to improve? IIRC there were some obvious things, too, the last I worked on it long time ago.
If you will approve this PR and merge it, I will move onto the performance enhancements and Chinese company name support.
God bless you every one, sirs!