staticSearch icon indicating copy to clipboard operation
staticSearch copied to clipboard

Should we make more config items mandatory?

Open martindholmes opened this issue 3 years ago • 12 comments

If, as argued in #194 and originally arising out of #115, schema-based default values for configuration items are not practical or desirable, another option for making our use of de facto defaults in config processing more explicit would be to make more -- or all -- configuration elements mandatory. I'm actually in favour of this. It would mean that we could provide a single straightforward template for a config file (at least, for the initial option collection -- and we could also provide commented-out examples for all the other components). We could then validate the config file and simply fail the build if anything is missing, forcing users to see all the config they're using. We could have an initial deprecation period in which we would simply warn about missing elements, but then in the subsequent release start failing builds. The interdependence of various config options could also then be tested with Schematron, so incompatibilities would be explicit and easily understood by the user.

For cases where one option makes another redundant, the latter could have an explicit value of "redundant because X".

This would also mean that there would be less pressure on the documentation to explain any mysterious defaulting that might take place in the background to handle a missing element; there could be no missing element, so regular documentation of the values would be sufficient.

martindholmes avatar Jan 26 '22 17:01 martindholmes

I think I'm convinced that we should make these mandatory. My only thought is whether some of these config options are necessary at all—that said, I don't want to spin this ticket out into a broader discussion of particular elements. So, for now, I'll just ask: if we are to cull config options, should we deprecated them at the same time that we make all options mandatory or should we do it after? I'd say we deprecated them at the same time so that whenever the rule comes into effect, the mandatory elements are ones we actually want.

joeytakeda avatar Jan 26 '22 18:01 joeytakeda

I agree: deprecate unnecessary elements and make the others mandatory at the same time.

Thinking about interdependent config items: could we unify them into single elements with a more sophisticated and explicit range of options?

martindholmes avatar Jan 26 '22 18:01 martindholmes

For our reference, here's a table of the child elements of <params> noting whether that option depends on anything and whether it's already mandatory.

Element Depends Currently Mandatory
<searchFile>
<versionFile>
<stemmerFolder>
<recurse>
<minWordLength>
<createContexts>
<linkToFragmentId> <createContexts> = true
<scrollToTextFragment> <createContexts> = true
<phrasalSearch> <createContexts> = true
<wildcardSearch> <createContexts> = true (and possibly <phrasalSearch> = true)
<maxKwicsToShow> <createContexts> = true
<totalKwicLength> <createContexts> = true
<kwicTruncateString> <createContexts> = true
<maxKwicsToHarvest> <createContexts> = true , <phrasalSearch> = false, <wildcardSearch> = false
<scoringAlgorithm>
<verbose>
<stopwordsFile>
<dictionaryFile>
<indentJSON>
<outputFolder>
<resultsPerPage>
<resultsLimit>

joeytakeda avatar Jan 26 '22 19:01 joeytakeda

In terms of interdependence, it all comes back to <createContexts>. My feeling here is that not creating contexts is going to be fairly rare—I can't imagine many cases where resources are so limited that you wouldn't want to store any contexts at all. That said, I also don't want there to be a case where someone has a completely valid config, but we have schematron warning them about some option they've been forced to set.

Brainstorming a few possible solutions:

  1. We keep the structure the same, but use schematron to dictate the content model. That's simplest, but the least helpful when it comes to editing in oXygen, of course.

  2. We could transform createContexts into an attribute on <params>, which would then dictate the content model. If @createContexts is true, then linkToFragmentId, scrollToTextFragment, phrasalSearch, wildcardSearch, maxKwicsToShow, totalKwicLength, and kwicTruncateString are all mandatory; otherwise, they're forbidden. We still have the issue where maxKwicsToHarvest is superfluous unless phrasalSearch and wildcardSearch are both set to false, but it's mostly just annoying.

  3. The elements that depend on <createContexts> could be child elements on <createContexts>: we'd have to put some sort of boolean attribute on <createContexts> so that it could dictate the content model. if <createContexts value="false"/> then it must be empty; otherwise, you must put in all of the other elements. maxKwicsToHarvest could then just be an attribute that would have no effect unless createContexts/phrasalSearch and createContexts/wildcardSearch are both set to false (with schematron warnings etc that say that)

Option 3 is the most disruptive and introduces some nesting, but it is also the most clear in my opinion: those options all have to do with the creation of contexts, so it makes sense to nest them. Option 2 is less disruptive, but introduces an attribute and still has the superfluous option. Option 1 is the least disruptive, but not particularly helpful for those wanting content completion.

joeytakeda avatar Jan 26 '22 20:01 joeytakeda

Thanks for this detailed breakdown. I like option 3. I also think that a) we should deprecate scrollToTextFragment and get rid of the related code, because it doesn't show signs of taking hold; b) we could get rid of linkToFragmentId because we should just always do that if createContexts is true.

martindholmes avatar Jan 26 '22 20:01 martindholmes

I agree re: deprecating scrollToTextFragment and linkToFragmentId. If we are deprecating options, I think indentJSON could be deprecated (mea culpa since I know I advocated for it) and verbose could be deprecated and superseded by a parameter sent to ant since it's annoying to have to change the verbose parameter in the config to get verbose logging when trying to debug something in a single place.

joeytakeda avatar Jan 26 '22 23:01 joeytakeda

Attached to this ticket is a zip archive that contains an ODD, a produced RNG, and some sample configuration files that model the changes proposed based on the above discussion. For my own edification, I did this via ODD chaining (using the GitHub repo ODD as the base and modifying) to show exactly what we're changing, which worked precisely as advertised—of course when we actually do this, we'll do this in the real ODD, but it was easier to show the changes up for discussion.

staticSearchSchemaFor195.zip

Here's a sample of the ODD (with schematron removed for brevity):

      <elementSpec ident="params" ns="http://hcmc.uvic.ca/ns/staticSearch" module="ss" mode="change">
          <content>
            <!--Order isn't preserved, but
              listed here in logical groups anyway-->
            <sequence preserveOrder="false">
              <!--Input stuff-->
              <elementRef key="searchFile"/>
              <elementRef key="recurse"/>
              <elementRef key="versionFile"/>
              <elementRef key="stemmerFolder"/>
              <elementRef key="stopwordsFile"/>
              <elementRef key="dictionaryFile"/>
              <!--Tokenizing/Parsing-->
              <elementRef key="scoringAlgorithm"/>
              <elementRef key="createContexts"/>
              <!--Output stuff-->
              <elementRef key="outputFolder"/>
              <elementRef key="resultsPerPage"/>
              <elementRef key="resultsLimit"/>
            </sequence>
          </content>
        </elementSpec>
        
        <elementSpec ident="createContexts" ns="http://hcmc.uvic.ca/ns/staticSearch" module="ss" mode="change">
          <content>
            <!--Could also be <alternate> with <empty/>, I think-->
            <sequence minOccurs="0" maxOccurs="1">
              <elementRef key="phrasalSearch"/>
              <elementRef key="wildcardSearch"/>
              <elementRef key="maxKwicsToHarvest"
                minOccurs="0" maxOccurs="1"/>
              <elementRef key="maxKwicsToShow"/>
              <elementRef key="totalKwicLength"/>
              <elementRef key="kwicTruncateString"/>
              <elementRef key="linkToFragmentId"/>
            </sequence>
          </content>
       <!--snip: bunch of constraintSpecs-->
          <attList>
            <attDef ident="value" usage="req" mode="add">
              <datatype>
                <dataRef name="boolean"/>
              </datatype>
            </attDef>
          </attList>
        </elementSpec>
        
        <!--DEPRECATIONS-->
        <elementSpec ident="verbose" module="ss" ns="http://hcmc.uvic.ca/ns/staticSearch" mode="delete"/>
        <elementSpec ident="indentJSON" module="ss" ns="http://hcmc.uvic.ca/ns/staticSearch" mode="delete"/>
        <elementSpec ident="scrollToTextFragment" module="ss" ns="http://hcmc.uvic.ca/ns/staticSearch" mode="delete"/>
        <elementSpec ident="linkToFragmentId" module="ss" ns="http://hcmc.uvic.ca/ns/staticSearch" mode="delete"/>
        

joeytakeda avatar Feb 09 '22 05:02 joeytakeda

Thanks for this; I'm slightly surprised that the ODD-chaining worked, but that's good. :-)

I suggest we set everything to version 2 right now, and first do the deletions (deleting rather than deprecating; we can handle the mechanics of alerting users by having Ant do a quick check of the file version, and if it's not 2, then run a special XSLT to help users upgrade, then exit. So I don't think we actually need a deprecation period for these four specific things. As they're deleted, we can a) remove any handling for them from the XSLT, and b) add the info to the what's new page.

It's refreshing to be able to delete stuff from a project, rather than having all the old cruft just pile up with the new floating on top of it.

martindholmes avatar Feb 09 '22 16:02 martindholmes

I'm going to do the deletions now in a separate branch--I'll open a ticket for that to keep things clean.

joeytakeda avatar Feb 14 '22 04:02 joeytakeda

Arising from #270—we should make sure the documentation is clear about the mandatory elements and resolve the ambiguous language that params "requires two elements"

joeytakeda avatar Jul 22 '23 05:07 joeytakeda

I've added in #271 another case where the "default" behaviour of Static Search seems problematic. My two cents worth: version 2 (I think that is what we are talking about) should enact what the current version states: just two <params> should need to be explicitly declared in each project's config file, with default values implemented in the build. Any serious user is going to need to delve into various <params> and <rules> soon enough, and will find her or his own way to declaring these as needed. This approach would also spare people like me the initial puzzlement: why isn't this working?

peterrobinson avatar Jul 26 '23 06:07 peterrobinson

This is the key deliverable for version 2.0 at this point. Once this is sorted, we can release, assuming no other breaking or blocking things emerge from testing with real sites.

martindholmes avatar Sep 13 '24 18:09 martindholmes

Following a very productive meeting today (Oct 18), @martindholmes and I have rethought this quite significantly so that each configuration element describes its purpose with properties specified as attributes — this means far fewer elements and more straightforward values with better ways of constraining values (and enforcing co-occurrence constraints). Here's what we're thinking:

    <params>
        <searchPage file="test/search.html"/>
        <index recurse="yes"/>
        <stopwords
            file="test/test_stopwords.txt"/>
        <dictionary
            file="xsl/english_words.txt"/>
        <tokenizer
            minWordLength="2"/>
        <scoringAlgorithm name="tf-idf"/>
        <stemmer
            dir="stemmers/en/"/>
        <contexts
            create="true"
            phrasalSearch="true" 
            wildcardSearch="true"
            maxKwicsToHarvest="5"
            maxKwicLength="15"
            kwicTruncateString="..."
            linkToFragmentId="true"
        />
        <results 
            resultsPerPage="5"
            maxKwicsToShow="#"/>
        <version file="test/VERSION"/>
        <outputDir name="ssTest"/>
    </params>
  • @file can be an attribute class
  • Some elements (like index), which are more generic, will allow for better future-proofing
  • outFolder => outDir
  • createContexts => contexts
  • totalKwicLength => maxKwicLength

This will be a major change, so we'll certainly need to make sure that we have clear guidance and mapping for upgrading from 1.x to 2.0 (and an automated process for doing so).

joeytakeda avatar Oct 18 '24 22:10 joeytakeda