cleanco icon indicating copy to clipboard operation
cleanco copied to clipboard

Lots of new country and corporate type country coverage from the 'disco' fork

Open rjurney opened this issue 1 month ago • 16 comments

I added a bunch of country and company types from the disco fork at https://github.com/rjurney/disco

rjurney avatar Nov 12 '25 06:11 rjurney

Starts out #93

rjurney avatar Nov 12 '25 06:11 rjurney

So @psolin what do you think? This should be backwards compatible but add a lot of regional coverage.

rjurney avatar Nov 13 '25 21:11 rjurney

@psolin I made the changes you requested except... you want me to sort everything alphabetically? Let me see if Claude Code can do that...

rjurney avatar Nov 20 '25 22:11 rjurney

@psolin I made the changes you requested except... you want me to sort everything alphabetically? Let me see if Claude Code can do that...

Great, yes, it should be able to do that, but it'll probably take a lot of tokens.

psolin avatar Nov 20 '25 22:11 psolin

@psolin I made the changes you requested except... you want me to sort everything alphabetically? Let me see if Claude Code can do that...

Great, yes, it should be able to do that, but it'll probably take a lot of tokens.

Okay, I think we're good to go for this PR! The unit tests all pass and things are alphabetized. Once this is in, I'll work on some of the performance improvements. I don't suppose you can take a look and see what you think you'd like to pull over?

I didn't make the changes, @marekmodry did but they made cleanco (renamed disco) much faster. See for example:

  • The use of a LRU cache: https://github.com/rjurney/disco/blob/main/disco/utils.py#L17-L26
  • More LRU cache: https://github.com/rjurney/disco/blob/main/disco/legaltype/detector.py#L24-L76

I can make the changes to the rest of the code, but @psolin I'm just wondering if you aren't the better person to do them? :)

rjurney avatar Nov 21 '25 04:11 rjurney

@rjurney given it appears you're more recently familiar with the performance issues & the improvements, it would be great if you could submit those as PR(s). Using LRU cache seems a nice improvement, although I never went as far as that. Did you actually profile the code to find the hotspots or how did you choose what to improve? IIRC there were some obvious things, too, the last I worked on it long time ago.

petri avatar Nov 21 '25 12:11 petri

The test job failure (see below) is caused by an assertion error in tests/test_cleanname.py at line 478, specifically in a test checking unicode non-Latin script handling. The log indicates:

AssertionError: unicode non-Latin script test for Greek alphabet failed assert '' == 'Εταιρεία Περ...μένης Ευθύνης'

The greek text means "Limited Liability Company" — the Greek legal form equivalent to an LLC/Ltd.

Transliteration: Etaireía Periorisménis Efthýnis, or, as often written, Etaireia Periorismenis Efthynis. Abbreviation: Ε.Π.Ε. (E.P.E.)

The basename function is returning an empty string instead of the expected Greek text. Other tests pass - Greek seems to be the key issue).

petri avatar Nov 21 '25 12:11 petri

Okay, I'll take a look.

rjurney avatar Nov 21 '25 17:11 rjurney

@rjurney given it appears you're more recently familiar with the performance issues & the improvements, it would be great if you could submit those as PR(s). Using LRU cache seems a nice improvement, although I never went as far as that. Did you actually profile the code to find the hotspots or how did you choose what to improve? IIRC there were some obvious things, too, the last I worked on it long time ago.

We found the slow spots and improved them, although I've no idea how to fix the UTF issue...

rjurney avatar Nov 22 '25 22:11 rjurney

I think that the testing may be off. The test should preserve the entire Greek string "Εταιρεία Περιορισμένης Ευθύνης", but since it is a term now, it removes it, creating a blank string and failing the test. I think that the test needs to read something like: "Greek alphabet": ('Ελληνική Επιχείρηση', 'Ελληνική Επιχείρηση'),, since "Εταιρεία Περιορισμένης Ευθύνης" is literally a term now and not just a generic name for a business. I can update that now, and then this should pass.

psolin avatar Nov 22 '25 23:11 psolin

Sounds good, thanks.

rjurney avatar Nov 23 '25 01:11 rjurney

I think that the testing may be off. The test should preserve the entire Greek string "Εταιρεία Περιορισμένης Ευθύνης", but since it is a term now, it removes it, creating a blank string and failing the test. I think that the test needs to read something like: "Greek alphabet": ('Ελληνική Επιχείρηση', 'Ελληνική Επιχείρηση'),, since "Εταιρεία Περιορισμένης Ευθύνης" is literally a term now and not just a generic name for a business. I can update that now, and then this should pass.

Any update? I'd like to get this integrated.

rjurney avatar Dec 06 '25 06:12 rjurney

@psolin I am basing this off my other PR, to see if it fixes it. Can you approve my runs plz? Of course at the moment there is duplicate code, but merging the other one first would fix that. I'm not sure how else to do it.

rjurney avatar Dec 06 '25 09:12 rjurney

Okay, I think this is ready to go? cc @psolin

rjurney avatar Dec 06 '25 23:12 rjurney

@petri I think it needs to run through the test? I am not sure how to do this.

psolin avatar Dec 07 '25 00:12 psolin

Please approve the workflows so the tests can run. Or make me a contributor.

rjurney avatar Dec 07 '25 06:12 rjurney

Ping :)

rjurney avatar Dec 14 '25 02:12 rjurney

@psolin the tests pass, can you please merge? If you will review and approve in a timely manner I will work on the library. Otherwise can you give me permission as Contributor?

rjurney avatar Dec 15 '25 00:12 rjurney

@rjurney given it appears you're more recently familiar with the performance issues & the improvements, it would be great if you could submit those as PR(s). Using LRU cache seems a nice improvement, although I never went as far as that. Did you actually profile the code to find the hotspots or how did you choose what to improve? IIRC there were some obvious things, too, the last I worked on it long time ago.

If you will approve this PR and merge it, I will move onto the performance enhancements and Chinese company name support.

rjurney avatar Dec 15 '25 00:12 rjurney

God bless you every one, sirs!

rjurney avatar Dec 15 '25 00:12 rjurney