backend icon indicating copy to clipboard operation
backend copied to clipboard

Make python package for url normalization code

Open dsjen opened this issue 3 years ago • 3 comments

As decided in today's tech meeting, I'm transferring this task into an issue.

dsjen avatar Jul 07 '20 17:07 dsjen

To be more specific, here:

https://github.com/berkmancenter/mediacloud/blob/master/apps/common/src/python/mediawords/util/url/init.py#L158-L246

we normalize URLs with various "cruft" (tracking parameters, etc.) into their canonical form. For some examples, see the unit test:

https://github.com/berkmancenter/mediacloud/blob/master/apps/common/tests/python/mediawords/util/test_url.py#L117-L201

pypt avatar Jul 08 '20 12:07 pypt

After a quick scan I'd say this code looks well-isolated enough, with lots of test cases, that Eric could take this task on. The package could just assume the the input is already a python str and eliminate the decode_object_from_bytes_if_needed call.

rahulbot avatar Jul 08 '20 14:07 rahulbot

Some wishlist items of mine:

  • Merge with normalize_youtube_url()
  • Users (if any) will probably want to use their own user agent (web client) and logging for the module, so we could encapsulate this functionality in separate classes, provide default implementations (with requests / urllib3 and logging respectively), and make them configurable, e.g.:
class CustomLogger(AbstractLogger):

    def info(self, message: str) -> None:
        logging.info(message)

    def debug(self, message: str) -> None:
        logging.debug(message)

    # ...

# Another class implementing AbstractUserAgent's interface

normalized_url = normalize_url(url=url, logger=CustomLogger(), ua=CustomUserAgent())

pypt avatar Jul 09 '20 13:07 pypt