twitter-text
google.lv is not extracted while google.com is
Here's my test case:
String text = "\nhttp://www.lursoft.lv/address/riga-terbatas-iela-73-lv-1001" +
        "\ngoogle.com" +
        "\ngoogle.lv" +
        "\nwhatever.lv" +
        "\nwhatever.lt" +
        "\n$also $some $cash";
...
assertThat(urls, containsInAnyOrder("http://www.lursoft.lv/address/riga-terbatas-iela-73-lv-1001",
        "google.lv", "google.com", "whatever.lv", "whatever.lt"));
https://github.com/twitter/twitter-text/blob/cebd98612738011d8b65d4c22650d56a0bcda669/conformance/TldLists.java#L1420
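The difference here is that .com is treated as a GTLD (generic TLD), whereas .lv is a CTLD (country-code TLD), and the two have slightly different rules. Right now a .lv domain needs either a leading protocol or a trailing path to be recognised as a valid URL, which is why the bare google.lv, whatever.lv and whatever.lt in your test are skipped while google.com and the full lursoft.lv link are extracted. I started working on this project only recently and don't know the reasoning behind this rule. I'll try to get answers or a reevaluation.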
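To make the rule concrete, here is a minimal sketch of the behaviour as I understand it. It is not the library's code: the class name `TldRuleSketch`, the method `isExtractable`, and the two tiny TLD sets are all hypothetical stand-ins for illustration, and the real extractor works with regular expressions over the full TLD lists.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of the rule described above; not twitter-text's implementation.
final class TldRuleSketch {
    // Tiny assumed subsets, standing in for the library's full TLD lists.
    static final Set<String> GTLDS = Set.of("com", "net", "org");
    static final Set<String> CTLDS = Set.of("lv", "lt", "de");

    // Returns true if the candidate would be accepted under the described rule:
    // generic TLDs are accepted bare, country-code TLDs need a protocol or a path.
    static boolean isExtractable(String candidate) {
        String lower = candidate.toLowerCase(Locale.ROOT);
        boolean hasProtocol = lower.startsWith("http://") || lower.startsWith("https://");
        String rest = hasProtocol ? lower.substring(lower.indexOf("://") + 3) : lower;

        int slash = rest.indexOf('/');
        boolean hasPath = slash >= 0;
        String host = hasPath ? rest.substring(0, slash) : rest;

        int dot = host.lastIndexOf('.');
        if (dot <= 0 || dot == host.length() - 1) {
            return false; // no usable "name.tld" shape
        }
        String tld = host.substring(dot + 1);

        if (GTLDS.contains(tld)) {
            return true;
        }
        if (CTLDS.contains(tld)) {
            return hasProtocol || hasPath;
        }
        return false; // TLD not in either assumed list
    }

    public static void main(String[] args) {
        String[] samples = {"google.com", "google.lv", "http://google.lv", "google.lv/maps"};
        for (String s : samples) {
            System.out.println(s + " -> " + isExtractable(s));
        }
    }
}
```

Under these assumptions, only the bare `google.lv` is rejected; adding `http://` or a path such as `/maps` is enough for it to pass, which matches what the test case in the report shows.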