generalize get_list_name?

Open sbenthall opened this issue 2 years ago • 5 comments

There's a small function that strips a full Mailman archive URL down to its last part, which is then used as the mailing list 'name':

https://github.com/datactive/bigbang/blob/main/bigbang/mailman.py#L185-L198
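For readers who don't want to follow the link, here is a rough sketch of the idea. It is not the code in mailman.py; `last_url_segment` is a hypothetical name, and the example URL in the usage note is made up.

```python
from urllib.parse import urlparse


def last_url_segment(url: str) -> str:
    """Return the last non-empty path segment of a URL.

    Hypothetical helper for illustration; not BigBang's get_list_name.
    """
    path = urlparse(url).path
    # Drop empty pieces so a trailing slash does not yield an empty name.
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        raise ValueError(f"no path segment found in {url!r}")
    return segments[-1]
```

With an archive URL such as `https://example.org/pipermail/some-list/` this would give `some-list`. A URL that keeps the list name in the query string would not work with this approach.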

This gets used fairly widely; as of #500 there's a reference to it in archive.py.

That raises the question of whether this function should be generalized to cover other ingress methods, like w3c and listserv.

sbenthall • Nov 19 '21 18:11

Yeah, I think it would make sense to generalize the name function and use it for those other lists as well. (Maybe we need a list of regexps that work for different email archive systems? Or, one day, a way for a new ingress system to register a function that recognizes whether a URL is likely to be one of its mailing lists and returns various metadata about it.)
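To make the "register a function" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the names `register_recognizer` and `identify_list` do not exist in BigBang, and the two regexps are guesses at typical Mailman (`/pipermail/<name>/`) and Listserv (`wa.exe?A0=<name>`) archive URLs rather than tested patterns.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Maps a source label to a function that returns a list name,
# or None when the URL does not look like one of that source's lists.
_RECOGNIZERS: Dict[str, Callable[[str], Optional[str]]] = {}


def register_recognizer(source: str):
    """Decorator that registers a URL recognizer under a source label."""
    def decorator(func: Callable[[str], Optional[str]]):
        _RECOGNIZERS[source] = func
        return func
    return decorator


@register_recognizer("mailman")
def _mailman_name(url: str) -> Optional[str]:
    # Assumed URL shape: .../pipermail/<list name>/
    match = re.search(r"/pipermail/([^/]+)/?$", url)
    return match.group(1) if match else None


@register_recognizer("listserv")
def _listserv_name(url: str) -> Optional[str]:
    # Assumed URL shape: .../wa.exe?A0=<list name>
    match = re.search(r"wa\.exe\?A0=([^&]+)", url)
    return match.group(1) if match else None


def identify_list(url: str) -> List[Tuple[str, str]]:
    """Return every (source, list name) pair whose recognizer accepts the URL."""
    results = []
    for source, recognizer in _RECOGNIZERS.items():
        name = recognizer(url)
        if name is not None:
            results.append((source, name))
    return results
```

Returning all matches instead of the first one leaves it to the caller to decide what to do when more than one ingress system claims a URL.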

Typically a short-name is handy because we might want to save the files to a certain directory and re-load them from there, be able to refer to a list in your code without typing the full URL, etc.

But I can also see how there might eventually be problems: these short names are not going to be globally unique, whereas list archive URLs or list email addresses would be less ambiguous.

npdoty • Nov 21 '21 16:11

Yep, agreed, we should find a way to generalise this method and maybe place it in utils.py? For Listserv mailing lists I have a function, ListservList.get_name_from_url(mlist_url), that gets the list name from a URL. But the way that is done is currently specific to Listserv, maybe, as the URL structure is always:

https://list.etsi.org/scripts/wa.exe?A0=3GPP_TSG_SA_WG2
https://list.etsi.org/scripts/wa.exe?A0=3GPP_CT89_E_MEETING

etc.
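As a sketch only (this is not the body of ListservList.get_name_from_url, and `listserv_name_from_url` is a hypothetical name), reading the `A0` query parameter with the standard library could look like this, assuming the list name is always carried in `A0`:

```python
from urllib.parse import parse_qs, urlparse


def listserv_name_from_url(url: str) -> str:
    """Return the value of the A0 query parameter of a Listserv archive URL.

    Hypothetical helper; assumes the list name is always stored in A0.
    """
    query = parse_qs(urlparse(url).query)
    values = query.get("A0")
    if not values:
        raise ValueError(f"no A0 parameter found in {url!r}")
    return values[0]
```

For the first URL above this would give `3GPP_TSG_SA_WG2`.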

Christovis • Nov 29 '21 09:11

What if:

  • When we get data from a source, we put it in a subdirectory of the archive directory based on its source, e.g. archive/listserv/3GPP_TSG_SA_WG2, and something else appropriate for other sources.
  • A method in utils can take a "short name" and expand it into a full path, if that full path exists in the archive directory. It works as a kind of autocomplete.
  • get_list_name can shorten a long/complete URL into a short list name?
  • If there's a collision because of two sources with the same short name, these methods intelligently complain.
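A minimal sketch of the expand-and-complain behaviour described in the list above, assuming a layout of `<archive dir>/<source>/<list name>`. The function `expand_short_name` and the exception `AmbiguousListName` are hypothetical names, not existing BigBang code.

```python
from pathlib import Path
from typing import Union


class AmbiguousListName(Exception):
    """Raised when a short name exists under more than one source."""


def expand_short_name(short_name: str, archive_dir: Union[str, Path]) -> Path:
    """Expand a short list name into its directory under archive_dir.

    Assumes the layout archive_dir/<source>/<list name>.
    """
    root = Path(archive_dir)
    # Look for the short name under every source subdirectory.
    matches = sorted(
        source_dir / short_name
        for source_dir in root.iterdir()
        if (source_dir / short_name).is_dir()
    )
    if not matches:
        raise FileNotFoundError(f"no archived list named {short_name!r} under {root}")
    if len(matches) > 1:
        found = ", ".join(str(match.relative_to(root)) for match in matches)
        raise AmbiguousListName(f"{short_name!r} is ambiguous; candidates: {found}")
    return matches[0]
```

Listing the colliding candidates in the error message is one way for the method to "intelligently complain": the user can then pass the longer `source/name` form instead.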

On the other hand, I feel like the notebook workflow that this 'short name' stuff was intended to support is increasingly old-fashioned and not how BigBang is currently being used.

I wouldn't mind officially deprecating a lot of the old notebooks and trying to come up with a better workflow.

sbenthall • Dec 03 '21 15:12

I'm not sure I want subdirectories based on method/source, as that isn't always consistent across a project or even across an SDO.

Could we use the email address of the list as the directory name? Does [email protected] cause any problems as a directory name? Can archive ingest code always determine the email address of the list?

The email address should generally be unique and descriptive. And list archive URLs can vary over time, but the mailing list email address itself is unlikely to.
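On the directory-name question, a quick way to check is to try it. The sketch below assumes the address is already known; `list_directory` is a hypothetical helper and `some-list@example.org` in the usage note is a made-up address. As far as I know, '@' is not a path separator on POSIX systems and is not among the characters Windows reserves in file names, so the main things to guard against are path separators and malformed input.

```python
from pathlib import Path
from typing import Union


def list_directory(archive_dir: Union[str, Path], list_address: str) -> Path:
    """Create (if needed) and return the directory named after a list address.

    Hypothetical helper; only does minimal sanity checks on the address.
    """
    local, separator, domain = list_address.partition("@")
    if not (separator and local and domain) or "@" in domain:
        raise ValueError(f"{list_address!r} does not look like an email address")
    # Refuse anything that could escape the archive directory or hide the folder.
    if "/" in list_address or "\\" in list_address or list_address.startswith("."):
        raise ValueError(f"{list_address!r} is not safe to use as a directory name")
    target = Path(archive_dir) / list_address
    target.mkdir(parents=True, exist_ok=True)
    return target
```

Calling `list_directory("archives", "some-list@example.org")` would create `archives/some-list@example.org`. It does not answer whether every ingest path can determine the list address in the first place.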

I agree that it's fine to deprecate some older notebooks or styles.

npdoty • Dec 03 '21 17:12

I like the idea of using the email address of a list as its directory name. At this time I don't know the answers to your other questions.

sbenthall • Dec 03 '21 18:12