Before this commit, the file extension matcher could not match compound file extensions; e.g., '.md.html' would match .html

Pull Request Checklist

Resolves: #2780

[x] Ensured tests pass and (if applicable) updated functional test output
[x] Conformed to code style guidelines by running appropriate linting tools
[x] Added tests for changed code
[ ] Updated documentation for changed code

Oct 29 '20 18:10 holden-nelson

Some problems:

Pretty sure this will fail to include some.file.with.dots.rst
You don't actually handle reader selection logic (inside readers.py:Readers.read_file). foo.md.html would already be selected with previous code, if a reader that handles .html was enabled. The issue is that it was triggering HTMLReader instead of a reader specific to md.html. And same still happens.

Oct 29 '20 19:10 avaris

I see.

I'll admit, and this is probably obvious, but I'm kind of new to open source submissions and all that, so I probably jumped the gun by making this change without a better holistic view of how Pelican works. But I'm very determined to contribute, so I will spend more time getting acquainted, maybe write a plugin, and then revisit this topic.

Thanks!

Oct 30 '20 17:10 holden-nelson

That's all right and we appreciate the contribution :). If you need any help along the way, feel free to ask.

Oct 30 '20 19:10 avaris

Hi @holden-nelson. Just wanted to check in regarding the status of this PR. Do you intend to keep working on it?

Feb 18 '21 14:02 justinmayer

Hello @justinmayer. I had kind of put this on the backburner as other projects and real life happened, but ultimately I would like to do this. I'm finally wrapping up some of that other stuff, but I wouldn't expect to see any progress on this, on my end, for another couple of months. Let me know how that works on your end, I understand if you'd like to close it and move on.

Thanks!

Feb 19 '21 19:02 holden-nelson

Hello @justinmayer and @avaris. I'm ready to make this contribution, but I need to know how to handle the situation regarding a filename with multiple dots, like some.file.with.dots.rst.

Right now, extensions are stripped out of file names with os.path.splitext(basename)[1][1:] which will regard everything after the last dot, if there is a dot, as the extension. My thought was to use Pathlib.Path.suffixes() to grab everything that appears after the first dot and join it together to get an extension, like in my first pull request. But then a file with dots in its name would have the wrong extension and wouldn't be properly read.

I imagine it would come down to either

Closing the issue and not allowing compound file extensions.
Having a setting ALLOW_COMPOUND_FILE_EXTENSIONS=False with a big warning that if you set it equal to True you can't have file names with dots anywhere in your article path or it will break. And then in the code we can account for both scenarios.

Or is there some other way to allow arbitrary compound file extensions and dots in file names? What are your thoughts on this?

Dec 23 '21 22:12 holden-nelson

File with dots can be handled if we use a_string.endswith(extensions). The real challenge is selecting the appropriate reader. e.g. foo.md.html should trigger a reader with .md.html extension rather than .html one. I'm not sure how to handle that nicely (yet).

Dec 28 '21 10:12 avaris

Forgive me if this comment shows ignorance of the Pelican software.

I see how a_string.endswith() could be used to check for a particular file extension, but I don't see how it's useful for an arbitrary compound one if we still allow dots in file names. I guess what I'm saying is, the way I see it,

if we're only trying to add support for md.html files specifically, we could add a check in Generators.generate_context() for f.endswith('.md.html') . If True, we call Readers.read_file() on it with the fmt=md.html parameter. Otherwise we call it without the fmt parameter. We'd need a quick check in Generators._include_path() well. I don't know that this is the most clean way, especially if down the road we wanted to add support for more compound extensions, but conceptually something like this.
If we want to add support for arbitrary compound extensions and dropped support for filenames with dots we should be able to just switch to Pathlib.Path.suffixes() and be good.
I don't understand how it's possible to allow both arbitrary compound extensions and dots in filenames. Like beyond Python specifically I don't understand how it's logically possible.

Jan 02 '22 05:01 holden-nelson

If we are going to support this, it should be generic and not pinned down to a specific extension. Also I don't see the particular issue with filenames with dots. It's a pattern matching problem. We are trying to match the longest possible extension that matches available extensions. So a file like file.with.dots.md.html can be considered as one of the following possibilities:

file.with.dots.md + .html
file.with.dots + .md.html
file.with + .dots.md.html
file + .with.dots.md.html

Now the problem is which of these combinations is compatible with available readers. And the challenging part would be selecting the longest matching extension from available readers. One possible (and simple) option is "ranking the extensions and their readers" (say, based on the number of dots) and going them one by one starting from the highest until a match is found and that would be the preferred reader. But this would generally be slower than current approach, which is extract the last extension and check if it is in a dict...

Another approach is building a "tree of readers" based on the extensions and each level of compound extension makes it one more level deep. Something like

readers = {
    '.html': {
        '': HTMLReader,
        '.md': MDHTMLReader,
    },
    # ...
}

So, once .html is matched, it looks for the next extension and so on until it can't match anymore. You can think of it like working backwards from pathlib.Path.suffixes. This is quite a bit more involved but would perform a lot better.

Jan 02 '22 16:01 avaris

Ahh. Okay, now I see how that would work. Thanks for taking the time to explain.

I'll continue working on this feature and submit a PR soon.

Jan 03 '22 22:01 holden-nelson

Support compound file extensions. Fix #2780

Pull Request Checklist