cleanco Lots of new country and corporate type country coverage from the 'disco' fork

I added a bunch of country and company types from the disco fork at https://github.com/rjurney/disco

Nov 12 '25 06:11 rjurney

Starts out #93

Nov 12 '25 06:11 rjurney

So @psolin what do you think? This should be backwards compatible but add a lot of regional coverage.

Nov 13 '25 21:11 rjurney

@psolin I made the changes you requested except... you want me to sort everything alphabetically? Let me see if Claude Code can do that...

Nov 20 '25 22:11 rjurney

@psolin I made the changes you requested except... you want me to sort everything alphabetically? Let me see if Claude Code can do that...

Great, yes, it should be able to do that, but it'll probably take a lot of tokens.

Nov 20 '25 22:11 psolin

@psolin I made the changes you requested except... you want me to sort everything alphabetically? Let me see if Claude Code can do that...

Great, yes, it should be able to do that, but it'll probably take a lot of tokens.

Okay, I think we're good to go for this PR! The unit tests all pass and things are alphabetized. Once this is in, I'll work on some of the performance improvements. I don't suppose you can take a look and see what you think you'd like to pull over?

I didn't make the changes, @marekmodry did but they made cleanco (renamed disco) much faster. See for example:

The use of a LRU cache: https://github.com/rjurney/disco/blob/main/disco/utils.py#L17-L26
More LRU cache: https://github.com/rjurney/disco/blob/main/disco/legaltype/detector.py#L24-L76

I can make the changes to the rest of the code, but @psolin I'm just wondering if you aren't the better person to do them? :)

Nov 21 '25 04:11 rjurney

@rjurney given it appears you're more recently familiar with the performance issues & the improvements, it would be great if you could submit those as PR(s). Using LRU cache seems a nice improvement, although I never went as far as that. Did you actually profile the code to find the hotspots or how did you choose what to improve? IIRC there were some obvious things, too, the last I worked on it long time ago.

Nov 21 '25 12:11 petri

The test job failure (see below) is caused by an assertion error in tests/test_cleanname.py at line 478, specifically in a test checking unicode non-Latin script handling. The log indicates:

AssertionError: unicode non-Latin script test for Greek alphabet failed assert '' == 'Εταιρεία Περ...μένης Ευθύνης'

The greek text means "Limited Liability Company" — the Greek legal form equivalent to an LLC/Ltd.

Transliteration: Etaireía Periorisménis Efthýnis, or, as often written, Etaireia Periorismenis Efthynis. Abbreviation: Ε.Π.Ε. (E.P.E.)

The basename function is returning an empty string instead of the expected Greek text. Other tests pass - Greek seems to be the key issue).

Nov 21 '25 12:11 petri

Okay, I'll take a look.

Nov 21 '25 17:11 rjurney

@rjurney given it appears you're more recently familiar with the performance issues & the improvements, it would be great if you could submit those as PR(s). Using LRU cache seems a nice improvement, although I never went as far as that. Did you actually profile the code to find the hotspots or how did you choose what to improve? IIRC there were some obvious things, too, the last I worked on it long time ago.

We found the slow spots and improved them, although I've no idea how to fix the UTF issue...

Nov 22 '25 22:11 rjurney

I think that the testing may be off. The test should preserve the entire Greek string "Εταιρεία Περιορισμένης Ευθύνης", but since it is a term now, it removes it, creating a blank string and failing the test. I think that the test needs to read something like: "Greek alphabet": ('Ελληνική Επιχείρηση', 'Ελληνική Επιχείρηση'),, since "Εταιρεία Περιορισμένης Ευθύνης" is literally a term now and not just a generic name for a business. I can update that now, and then this should pass.

Nov 22 '25 23:11 psolin

Sounds good, thanks.

Nov 23 '25 01:11 rjurney

I think that the testing may be off. The test should preserve the entire Greek string "Εταιρεία Περιορισμένης Ευθύνης", but since it is a term now, it removes it, creating a blank string and failing the test. I think that the test needs to read something like: "Greek alphabet": ('Ελληνική Επιχείρηση', 'Ελληνική Επιχείρηση'),, since "Εταιρεία Περιορισμένης Ευθύνης" is literally a term now and not just a generic name for a business. I can update that now, and then this should pass.

Any update? I'd like to get this integrated.

Dec 06 '25 06:12 rjurney

@psolin I am basing this off my other PR, to see if it fixes it. Can you approve my runs plz? Of course at the moment there is duplicate code, but merging the other one first would fix that. I'm not sure how else to do it.

Dec 06 '25 09:12 rjurney

Okay, I think this is ready to go? cc @psolin

Dec 06 '25 23:12 rjurney

@petri I think it needs to run through the test? I am not sure how to do this.

Dec 07 '25 00:12 psolin

Please approve the workflows so the tests can run. Or make me a contributor.

Dec 07 '25 06:12 rjurney

Ping :)

Dec 14 '25 02:12 rjurney

@psolin the tests pass, can you please merge? If you will review and approve in a timely manner I will work on the library. Otherwise can you give me permission as Contributor?

Dec 15 '25 00:12 rjurney

@rjurney given it appears you're more recently familiar with the performance issues & the improvements, it would be great if you could submit those as PR(s). Using LRU cache seems a nice improvement, although I never went as far as that. Did you actually profile the code to find the hotspots or how did you choose what to improve? IIRC there were some obvious things, too, the last I worked on it long time ago.

If you will approve this PR and merge it, I will move onto the performance enhancements and Chinese company name support.

Dec 15 '25 00:12 rjurney

God bless you every one, sirs!

Dec 15 '25 00:12 rjurney