
--episode-template regex options

Open melmatsuoka opened this issue 1 year ago • 1 comments

I'd love to see an option that lets you use regular expressions to parse out elements of the episode title, and then create an episode-template based on the selected regex capture-group.

This idea comes out of the need to handle certain podcast feeds, such as the Command Control Power podcast, which appears to limit access to only the 100 most recent episodes. When I specify an --episode-template of {{episode_num}}-{{title}}_{{release_date}}, I end up with filenames that look like this:

Original episode title:

Best Of CCP - 467: Interview with Brian Best from BestMacs and Mac-MSP Gruntwork

Downloaded file name:

_0100-Best Of CCP - 467_ Interview with Brian Best from BestMacs and Mac-MSP Gruntwork_20240528_.mp3

So in this case the episode_num is actually incorrect because while this episode may be the 100th episode listed in the feed, it's definitely not the actual episode number itself!

To handle edge-case feeds like this (and also for more granular control over file naming), it would be nice if you could use a regular expression to parse the current episode title, and then map the regex capture groups to the actual podcast-dl episode_template keywords.

For example, if I could "pre-filter" the episode title using the regex \d+(?=:.*), I could extract the actual episode number from the episode title (the number that appears before the first colon character in the title), and then use a special keyword like episode_num_1 to tell the template to use the value from regex capture group \1 as the episode number.
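As a rough illustration of what that pattern pulls out of the title quoted above, here is a minimal sketch using Python's re module (podcast-dl itself is a Node.js tool, so this only demonstrates the regex, not the tool's behavior). Note that \d+(?=:.*) contains no capture group, so the value of interest is the whole match rather than group \1.

```python
import re

# Episode title quoted earlier in this issue.
title = "Best Of CCP - 467: Interview with Brian Best from BestMacs and Mac-MSP Gruntwork"

# \d+ matches a run of digits; the lookahead (?=:.*) requires a colon right
# after them without including the colon in the match.
match = re.search(r"\d+(?=:.*)", title)

# The pattern has no capture group, so the whole match (group 0) is used.
episode_number = match.group(0) if match else None
print(episode_number)
```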

melmatsuoka avatar May 29 '24 20:05 melmatsuoka

Hey! Thanks for taking the time to explain the issue in excellent detail.

I think this is doable and could be quite powerful! Let me noodle on this in a couple days and get back to you.

lightpohl avatar May 30 '24 19:05 lightpohl

@melmatsuoka Apologies for the delay!

I've opened a PR to add episode-custom-template-options here. It required updating the option parsing library, so I'm going to take some time to make sure the update didn't cause any regressions. Please let me know if you have any thoughts on the API!

npx podcast-dl --url "https://cmdctrlpwr.libsyn.com/rss" --episode-custom-template-options "\d+(?=:.*)" --limit 1 --episode-template "{{custom_0}}-{{title}}"

lightpohl avatar Aug 23 '24 00:08 lightpohl

@lightpohl This is fantastic, thanks for implementing this!

The example you posted works great for parsing out the episode number embedded within the <title> of the episode. However, it does not seem to work properly (in 10.3.2, https://github.com/lightpohl/podcast-dl/releases/tag/v10.3.2) if you want to extract everything after the embedded episode number in the <title>.

For example, in the same cmdctrlpower RSS feed, the episode with the <title>

593: Navigating IT's Past and Future with Tim Nyberg of The MacGuys+

will download as

_ Navigating IT's Past and Future with Tim Nyberg of The MacGuys+.mp3

if I use : (.*) as the custom template option, and {{custom_0}} as the episode template.

It almost seems like the colon in the episode title is being "sanitized" into an underscore before the regex defined in episode-custom-template-options ever gets a chance to parse the title.

So if I wanted to parse out the episode number from the <title>, as well as the actual title itself, using (\d+): (.*) as the custom template option, and "{{custom_0}}-{{custom_1}}" as the episode template, the resulting file ends up looking like this:

593_ Navigating IT's Past and Future with Tim Nyberg of The MacGuys+-{{custom_1}}.mp3

When I would expect it to look like this:

593-Navigating IT's Past and Future with Tim Nyberg of The MacGuys+.mp3

Seems like the custom template option should operate on the raw <title>, rather than a sanitized version of it.

As a related aside, I noticed that the <itunes:title> tag in RSS feeds contains the episode titles without the episode number embedded in it, which I guess is one of Apple's requirements (https://podnews.net/article/episode-numbers-faq) for including a podcast feed in the Apple Podcasts app? It would be great if podcast-dl could use that tag as one of the template options!
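One way to narrow this down is to compare the whole regex match with its first capture group. The sketch below uses Python's re module purely to illustrate the pattern; the colon-to-underscore replacement is a hypothetical stand-in for filename sanitization, not podcast-dl's actual code. Under that assumption, the filename reported above is also what you would get if the whole match (leading colon included) were taken from the raw title and sanitized afterwards, so the output alone may not prove that sanitization runs first.

```python
import re

# Raw episode title from the feed, as quoted above.
title = "593: Navigating IT's Past and Future with Tim Nyberg of The MacGuys+"

m = re.search(r": (.*)", title)
whole_match = m.group(0)  # entire match, including the leading ": "
first_group = m.group(1)  # only the text captured by (.*)

# Hypothetical sanitization step (an assumption for illustration only):
# replace the colon, which is not allowed in filenames on some systems.
sanitized = whole_match.replace(":", "_")

print(repr(whole_match))
print(repr(first_group))
print(repr(sanitized))
```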

melmatsuoka avatar Nov 25 '24 03:11 melmatsuoka

Hey @melmatsuoka! It took me a few tries, but I think I was able to get what you're looking for with a tweak to how the expressions are being passed in and changing up the second expression a bit. Passing regex in via the command line is a bit of a pain!

npx podcast-dl --url "https://cmdctrlpwr.libsyn.com/rss" --episode-custom-template-options "(\d+)" "(?<=: ).*" --limit 1 --episode-template "{{custom_0}}-{{custom_1}}"

> 594-Navigating Apple's Changing Ecosystem and the Future of Tech Support.mp3
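For anyone experimenting with expressions before putting them on the command line, here is a small Python sketch of one reading of the command above: each expression fills the {{custom_N}} slot at its position, using the whole match. The render helper is hypothetical and is not podcast-dl's implementation, and the episode title is an assumption inferred from the output filename shown above.

```python
import re

def render(template: str, title: str, patterns: list) -> str:
    # Hypothetical helper: replace {{custom_N}} with the whole match of the
    # Nth pattern; slots whose pattern does not match are left untouched.
    out = template
    for i, pattern in enumerate(patterns):
        m = re.search(pattern, title)
        if m:
            out = out.replace("{{custom_%d}}" % i, m.group(0))
    return out

# Assumed raw title, inferred from the output filename in this comment.
title = "594: Navigating Apple's Changing Ecosystem and the Future of Tech Support"

# (\d+) matches the episode number; (?<=: ).* is a lookbehind that matches
# everything after ": " without including the colon or the space.
filename = render("{{custom_0}}-{{custom_1}}", title, [r"(\d+)", r"(?<=: ).*"])
print(filename)
```

The lookbehind keeps the colon out of the second match entirely, so nothing is left for a later sanitization step to turn into an underscore.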

Let me know if that helps!

lightpohl avatar Nov 28 '24 05:11 lightpohl