backend
backend copied to clipboard
Make python package for url normalization code
As decided in today's tech meeting, I'm transferring this task into an issue.
To be more specific, here:
https://github.com/berkmancenter/mediacloud/blob/master/apps/common/src/python/mediawords/util/url/init.py#L158-L246
we normalize URLs with various "cruft" (tracking parameters, etc.) into their canonical form. For some examples, see the unit test:
https://github.com/berkmancenter/mediacloud/blob/master/apps/common/tests/python/mediawords/util/test_url.py#L117-L201
After a quick scan I'd say this code looks well-isolated enough, with lots of test cases, that Eric could take this task on. The package could just assume the the input is already a python str and eliminate the decode_object_from_bytes_if_needed
call.
Some wishlist items of mine:
- Merge with
normalize_youtube_url()
- Users (if any) will probably want to use their own user agent (web client) and logging for the module, so we could encapsulate this functionality in separate classes, provide default implementations (with
requests
/urllib3
andlogging
respectively), and make them configurable, e.g.:
class CustomLogger(AbstractLogger):
def info(self, message: str) -> None:
logging.info(message)
def debug(self, message: str) -> None:
logging.debug(message)
# ...
# Another class implementing AbstractUserAgent's interface
normalized_url = normalize_url(url=url, logger=CustomLogger(), ua=CustomUserAgent())