Support compound file extensions. Fix #2780
Before this commit, the file extension matcher could not match compound
file extensions; e.g., '.md.html' would match .html
Pull Request Checklist
Resolves: #2780
- [x] Ensured tests pass and (if applicable) updated functional test output
- [x] Conformed to code style guidelines by running appropriate linting tools
- [x] Added tests for changed code
- [ ] Updated documentation for changed code
Some problems:
- Pretty sure this will fail to include
some.file.with.dots.rst - You don't actually handle reader selection logic (inside
readers.py:Readers.read_file).foo.md.htmlwould already be selected with previous code, if a reader that handles.htmlwas enabled. The issue is that it was triggeringHTMLReaderinstead of a reader specific tomd.html. And same still happens.
I see.
I'll admit, and this is probably obvious, but I'm kind of new to open source submissions and all that, so I probably jumped the gun by making this change without a better holistic view of how Pelican works. But I'm very determined to contribute, so I will spend more time getting acquainted, maybe write a plugin, and then revisit this topic.
Thanks!
That's all right and we appreciate the contribution :). If you need any help along the way, feel free to ask.
Hi @holden-nelson. Just wanted to check in regarding the status of this PR. Do you intend to keep working on it?
Hello @justinmayer. I had kind of put this on the backburner as other projects and real life happened, but ultimately I would like to do this. I'm finally wrapping up some of that other stuff, but I wouldn't expect to see any progress on this, on my end, for another couple of months. Let me know how that works on your end, I understand if you'd like to close it and move on.
Thanks!
Hello @justinmayer and @avaris. I'm ready to make this contribution, but I need to know how to handle the situation regarding a filename with multiple dots, like some.file.with.dots.rst.
Right now, extensions are stripped out of file names with os.path.splitext(basename)[1][1:] which will regard everything after the last dot, if there is a dot, as the extension. My thought was to use Pathlib.Path.suffixes() to grab everything that appears after the first dot and join it together to get an extension, like in my first pull request. But then a file with dots in its name would have the wrong extension and wouldn't be properly read.
I imagine it would come down to either
- Closing the issue and not allowing compound file extensions.
- Having a setting
ALLOW_COMPOUND_FILE_EXTENSIONS=Falsewith a big warning that if you set it equal toTrueyou can't have file names with dots anywhere in your article path or it will break. And then in the code we can account for both scenarios.
Or is there some other way to allow arbitrary compound file extensions and dots in file names? What are your thoughts on this?
File with dots can be handled if we use a_string.endswith(extensions). The real challenge is selecting the appropriate reader. e.g. foo.md.html should trigger a reader with .md.html extension rather than .html one. I'm not sure how to handle that nicely (yet).
Forgive me if this comment shows ignorance of the Pelican software.
I see how a_string.endswith() could be used to check for a particular file extension, but I don't see how it's useful for an arbitrary compound one if we still allow dots in file names. I guess what I'm saying is, the way I see it,
- if we're only trying to add support for
md.htmlfiles specifically, we could add a check inGenerators.generate_context()forf.endswith('.md.html'). If True, we callReaders.read_file()on it with thefmt=md.htmlparameter. Otherwise we call it without thefmtparameter. We'd need a quick check inGenerators._include_path()well. I don't know that this is the most clean way, especially if down the road we wanted to add support for more compound extensions, but conceptually something like this. - If we want to add support for arbitrary compound extensions and dropped support for filenames with dots we should be able to just switch to
Pathlib.Path.suffixes()and be good. - I don't understand how it's possible to allow both arbitrary compound extensions and dots in filenames. Like beyond Python specifically I don't understand how it's logically possible.
If we are going to support this, it should be generic and not pinned down to a specific extension. Also I don't see the particular issue with filenames with dots. It's a pattern matching problem. We are trying to match the longest possible extension that matches available extensions. So a file like file.with.dots.md.html can be considered as one of the following possibilities:
file.with.dots.md + .html
file.with.dots + .md.html
file.with + .dots.md.html
file + .with.dots.md.html
Now the problem is which of these combinations is compatible with available readers. And the challenging part would be selecting the longest matching extension from available readers. One possible (and simple) option is "ranking the extensions and their readers" (say, based on the number of dots) and going them one by one starting from the highest until a match is found and that would be the preferred reader. But this would generally be slower than current approach, which is extract the last extension and check if it is in a dict...
Another approach is building a "tree of readers" based on the extensions and each level of compound extension makes it one more level deep. Something like
readers = {
'.html': {
'': HTMLReader,
'.md': MDHTMLReader,
},
# ...
}
So, once .html is matched, it looks for the next extension and so on until it can't match anymore. You can think of it like working backwards from pathlib.Path.suffixes. This is quite a bit more involved but would perform a lot better.
Ahh. Okay, now I see how that would work. Thanks for taking the time to explain.
I'll continue working on this feature and submit a PR soon.