anystyle
brainstorming extending anystyle to create csl styles
I was thinking about this again recently.
Am I correct that anystyle would be fairly easy to extend to allow creation of a CSL style after parsing references, and that logically it would involve something like this at a high level?
- match parsed variables to matching csl macros
- perhaps some additional parameter options selected by the user
- sequence the macro calls into a completed style
- maybe give users a choice among a few output examples
If that's right, then anystyle would cover a lot of the work (maybe 70%) required to implement the linked tool?
citation-style-language/schema#244
This is an interesting idea that I have not considered before (*). Which CSL macros are we talking about here? Do you envision a canonical set of macros or all the macros of all styles in the repository? Or do you mean to use a set of references to pick one matching style (see below) and if it's not a 100% match, let the user change some of the parameters/macros of that style?
(*) I know we've discussed a style predictor before, i.e., returning a list of matching styles for a given set of references. This would be very easy to implement (I think I wrote two different prototypes already), but it would require further evaluation of how viable the approach is, given that you may need a considerably large set of references to get a suitably short list of results.
I'm not sure on your first question, but was speculating the software could analyze and extract a canonical subset from all styles.
The basic hypothesis is that styling for any variation has already been written.
And yeah, after the process ran, the idea is that it would present one or more suggestions; the user would choose one and from there could download or import it directly (depending on the application), or click an edit link to tweak it.
On the second point, what about using CSL to round trip?
As in, take all independent CSL styles, run them on some common data, and use that output to train the software to find the macros that created that output?
I think to decide if this is a viable option, you'd have to test with real data.
- You start with a set of references in an unknown style.
- Parse the references into CSL-JSON data (maybe allowing for user corrections like on anystyle.io)
- Process this data with all independent styles
- Compare the generated references to the original input using Levenshtein distance (or something more appropriate)
- Select the best matches
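The comparison and selection steps above could look roughly like the following. This is only a sketch under assumptions: it takes the rendered output of each style as given (produced elsewhere by whatever CSL processor you use), the names `rank_styles` and `rendered_by_style` are made up for illustration, and the length-normalized Levenshtein score is just one possible metric.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] so long references do not dominate."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def rank_styles(originals, rendered_by_style, top_n=5):
    """Return the top_n (style, mean distance) pairs, closest match first.

    originals: the input references as plain strings.
    rendered_by_style: hypothetical mapping of style name to the same
    references rendered with that style, in the same order.
    """
    if not originals:
        raise ValueError("need at least one reference")
    scores = []
    for style, rendered in rendered_by_style.items():
        if len(rendered) != len(originals):
            raise ValueError(f"reference count mismatch for {style}")
        total = sum(normalized_distance(o, r)
                    for o, r in zip(originals, rendered))
        scores.append((total / len(originals), style))
    scores.sort()
    return [(style, score) for score, style in scores[:top_n]]
```

With real data, the interesting output is how quickly the gap between the best few scores and the rest opens up as more references are added.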
I think the main issue here is how many references you will need to end up with a small list of matches. My first impression while working on the prototype was that you will get hundreds of styles that match the input very closely. However, you would need a much smaller result list for this to be a useful tool; otherwise you could just browse all the independent styles to begin with. I think the factors that determine whether you get a reasonably small list are the parsing accuracy, the size and diversity of the input data, and the algorithm used for comparing the references. I'd say it's certainly a worthwhile project!
Yes, I've considered using CSL styles to generate training data but haven't tried it out yet. One reason why I think this may not be that helpful is because very consistent references (like those produced by CSL) are typically not that problematic to parse; it's good for training data to contain inconsistencies, typos, errors, etc., because that's what we usually have to work with. We also have so much curated training data via anystyle.io that there's no real need to generate more.
A CSL round trip sounds like a promising project too. Using a common data set, generating references with all known styles, then parsing them back and comparing the result to the input would be a good way to identify styles which the parser does not parse well. We could then use data generated using those styles to create more training material. I really like that idea!
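To make the round trip a bit more concrete, here is a rough sketch of the scoring step only. It assumes the rendering and parsing have already happened elsewhere; `parsed_by_style` and both function names are hypothetical, and exact field equality is a deliberately crude measure of parsing quality.

```python
def field_accuracy(expected: dict, parsed: dict) -> float:
    """Fraction of the expected fields that came back unchanged."""
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items()
               if parsed.get(key) == value)
    return hits / len(expected)


def styles_by_accuracy(source_items, parsed_by_style):
    """Rank styles from worst to best parsed.

    source_items: the common data set, one dict of fields per reference.
    parsed_by_style: hypothetical mapping of style name to the parser
    output for the references rendered with that style, in the same order.
    """
    results = []
    for style, parsed_items in parsed_by_style.items():
        accuracies = [field_accuracy(expected, parsed)
                      for expected, parsed in zip(source_items, parsed_items)]
        mean = sum(accuracies) / len(accuracies) if accuracies else 0.0
        results.append((style, mean))
    results.sort(key=lambda pair: pair[1])  # lowest accuracy first
    return results
```

The styles at the top of the returned list would be the candidates for generating extra training material.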