Add support for case-insensitive VBA comparisons
VBA is (in)famous for helping developers by globally applying letter-case changes without regard to actual identifier scope, as I wrote about here: VBA's Case Changing "Feature".
I've attempted to minimize the impact of this "feature" by sticking closely to a Pascal-case-only naming convention: PascalCase All the Things. Unfortunately, no matter how zealous you are about that, false positives in the form of case-only changes still pop up frequently when version controlling VBA with Git (and my preferred DVCS, Mercurial).
Years ago, I got sufficiently fed up with the problem that I wrote an AutoHotkey script that would, after each export to source:
- Use Mercurial to revert any pending file changes with backup (which would create a second file with a
.origfile extension) - Read in the text from the original and changed files
- Convert the text from both files to all lowercase
- Compare the lower case only text
- If the files match (i.e., the only differences are in letter casing) then ignore the change for that file
Obviously, this approach means that--in theory--you can miss meaningful changes, such as fixing a typo in a MsgBox string (e.g., "Me, myself and i" vs. "Me, myself and I"). But, that requires that the ONLY change to the entire file is a case change (e.g., if my previous fix involved adding an Oxford comma in addition to the case change, that would get picked up because of the extra comma: "Me, myself, and I").
In practice, you avoid lots of false positives and almost never get a false negative.
I think an option like this (and it should probably be opt-in-only or maybe included in the more aggressive "sanitize" option) could be added relatively easily to vcs-addin by incorporating it into the hashing feature and IDbComponent_IsModified functions of the following classes:
- clsDbForm
- clsDbModule
- clsDbReport
The next question would be whether to store both hashes in the index (mixed case AND lowercase) or just the hash that corresponds with the .accdb file's chosen vcs options. I would lean toward storing only one for performance reasons and because the index can be rebuilt by manually forcing a full export.
I also agree this is a big problem and I've had to deal with this too many times.
However, in my case, the issue is usually something like:
I modify a file A, write Dim x As String : x = "Derp".
This modifies the file B and file C which had Dim X As String.
The status now shows 3 modified files.
If I'm understanding your proposal, this would revert the changes to file B and C but we still have file A with the mismatched lettercase. We want to modify the file A so that it's now Dim X As String: X = "Derp" to eliminate the unwanted changes -- not only in file B/C but also in future commit/pull which will mutate the codebase. It would do us no good to revert the changes on file B and C but commit the file A anyway and helping our fellow developers the next time they pull.
What I think we really want is a way to peek in the patches, and identify any line changes where the lettercase is the only difference. Gather all the differences and then maybe alert the user so that they can confirm if letter case is deliberate or accidental. If accidental, either execute VBIDE.CodeModule.Replace to revert the line and restore the lettercase and re-export or just revert the file and rebuild/merge from source. Note that re-exporting or re-building would be necessary to ensure that the lettercase mismatch is corrected at VBA source code. Else, we end up going into a grudge match slugging it out perpetually.
A pie in the sky would be to stop us from changing the lettercase in the first place by alerting us as soon as we type in a new identifier, but I don't see an easy way to do this without maintaining a symbol dictionary.
I agree that A needs to be changed and not just the change to C and B should be ignored, otherwise there may be random lettercase when importing. Re-importing C and/or B would adjust the lettercase in A.
I usually correct this by writing a declaration with the variable name in the desired case in any codemodule and then deleting it again. That way, I don't have to search for the relevant places. A new export is then required.
For some more reading, see #299 and #227
@mwolfe02, I actually referenced your article in that issue :)
FWIW, the sanitize functions already do kind of do what you're talking about for hash comparison; they export them to a temp directory, hash, then compare the hashes to determine if more work is needed. Adding a "set all to lower case" function could be feasible. The downside of this is non-ASCII encoding (aka, lots of folks) could be corrupted in the process if we're not careful.
without maintaining a symbol dictionary.
We resorted to this; and ... it's pretty okay. Not great, but has dramatically cut down on issues. We don't have all the names in there, just the ones most commonly causing headaches, and it's been pretty significant in terms of cutting it down. The other thing I noticed is if you have a DLL or a SQL / External Database connection and one of the columns/items in that DLL share a name, VBAIDE will preferentially use the SQL case, regardless of your wishes. This is outlined a bit in the SO Post StackOverflow "Stop Visual Basic 6 from changing my casing".
We accomplish our dictionary by having clsSpellingHelper detect declaration changes. Declaring a class instance in modConstants lets us use SpellingHelper.[whatever] as it is declared globally (and therefore relevant to the whole of the code). We then add a comment after the variable with the "correct" spelling.
While this dictionary based approach requires building, once built has made it much easier to keep cases correct and correct them if they stray.
Example declarations.
Public Mid As Variant ' Mid
Public Mid$ As String ' Mid$
Public hWnd As Long ' hWnd
Public VersionBuildID As Long ' VersionBuildID
Public VersionNumber As String ' VersionNumber
Public VersionNote As String
@mwolfe02 @bclothier @josef-poetzl @joyfullservice
Eureka?
... I just had a eureka moment about the cases.
- We already have a sanitize tool.
- VBA declares things in a pretty standard way.
-
We could have the AddIn build the dictionary.
- Parse all definitions in all code modules.
- Discard any duplicates so things are only declared one time.
- Include enums and static values, too.
- Users would then check the dictionary (or not, whatever)
- Once checked, the AddIn could check the code module against the case in the comment (or a unit testing framework similar to
modUnitTestingor how RubberDuck does it.
This process could be automated, and run against ONLY the clsSpellingHelper or clsCaseHelper or whatever one desires which would fix the rest of the code modules on reimport (I think; we could also potentially force a compile, though that doesn't always work in my experience).
Downside: devs would have to have this code module in their code.
Alternately, this wouldn't even need to be in the IDE, this could be an import prior to exporting and at the end of build to correct any issues; the only place the dictionary would live is in the .src file, perhaps under \CaseHandling?
@hecon5 wrote:
The downside of this is non-ASCII encoding (aka, lots of folks) could be corrupted in the process if we're not careful.
Two points:
- Making this new feature non-default behavior would ensure we didn't break existing repos
- I'm not suggesting changing how we store the code itself; only what we use to generate the hash
- The hashing should be deterministic, so even if converting a non-Western ANSI code page to lowercase "corrupted" the code, the "corruption" should happen the same way every time and the hashes would still match if the underlying code was identical
@bclothier wrote:
If I'm understanding your proposal, this would revert the changes to file B and C but we still have file A with the mismatched lettercase.
Yes, that's exactly right. This proposal is very much an 80/20 solution. Having used a version of it for years myself, though, I can tell you it makes a big difference in reducing the number of false positives. At the very least, it limits the number of changed files to only those that have actual changes. Now, you do still have to deal with the false positive case changes within those changed files, but it makes it much easier to see at a glance which files have changed. And that goes a long way to boosting the signal-to-noise ratio when looking at the commit window.
@mwolfe02, completely agree; just wanted to bring them up for consideration. I presumed the "corruption" or whatever would end up being a non-issue, but wanted to put it out there in case there's a caveat I missed.
I will emphasize that the 80/20 solution that we implemented (creating a "dictionary" of sorts) hasn't eliminated false positives (and due to case inconsistencies in SQL vs the VBA code we still get some irritating recasing, but the number of them has gone way down, and the time to fix has, too, because "fixing" it involves going to the dictionary, correcting the case there, and re-exporting, vs trying to go through the arduous process of sifting through each and every diff.
NOTE: Somewhat off-topic (i.e., these questions are not relevant to a potential vcs-addin solution but would likely be of interest to other readers of this thread)
@hecon5 Do you have an automated approach for updating the "dictionary" module, as the canonical casing would get changed there the same way it would everywhere else when an identifier is updated, right?
I'm imagining three possible approaches to that:
- Reimport the "dictionary" module from text
- Compare the commented canonical case with the declared case and fix any mismatches manually
- Use VBIDE Extensibility to rewrite the "dictionary" module replacing the declared case with the canonical commented version automatically
Or are you using some fourth approach? I've been thinking about this approach a lot recently, but haven't decided on a mechanism to minimize the ongoing maintenance effort. I'm curious to know what others are doing.
I understand the 80/20 rationale but what I am not understanding is how that helps us detect the letter case changes. Because the changes to B and C has been reverted, you are left with A's change and no clue that the lettercase now has changed? As mentioned, if that is then committed, then how the lettercase will change in next pull will be somewhat random, which would then pollute further commits/builds.
When I'm writing code especially in someone else's codebase, I'm often wondering, "did this guy use x or X in his code?" and the ctrl+space can be a false friend, especially where we have one library that uses Text while other library uses text (Yes, I'm talking about you, Mister MSXML library).
The dictionary sounds like it could be used pre-export by doing a import of the dictionary then deleting it before proceeding with the export. That would then eliminate all bad lettercase changes... right?
Right now, this "automation" is done by observation and fixing in post, as it were. I just check if there's any changes to the clsSpellingHelper code, and if so, fix that by hand and reexport. I haven't spent much time on automating it, as once I'd spent the time building it, it made a huge reduction in case changes overall. So, I guess you could say I'm doing option 2 in my case; it's all manual, but pretty rare overall to see many changes now.
as the canonical casing would get changed there the same way it would everywhere else when an identifier is updated, right?
Correct, it is! BUT: Comments are NOT changed. so, in the example code I posted above, this might happen:
-- Public Mid As Variant ' Mid
++ Public mid As Variant ' Mid
Note just the name changes, the comment remains the same; and that's the key. We could read the commented portion case, and compare to the declared case, and fix that.
As for the options:
- Reimport the "dictionary" module from text
it SEEMS that reimporting (building) the code module fixes it, but I've not forced a case change error and reimported the dictionary.
- Compare the commented canonical case with the declared case and fix any mismatches manually
That's what we're doing now; it works pretty well overall. Arduous to build the file manually at first though, but the work was worth it in the end (to me).
- Use VBIDE Extensibility to rewrite the "dictionary" module replacing the declared case with the canonical commented version automatically
That'd be cool to do. IMHO, the BIGGEST time saver is if we can automate building this file initially. Reimporting and fixing cases automatically would be icing, but an auto-built dictionary of spelling would be 95% of the solution, as you could check one file in your repo, fix it in one known place in VBA or your repo and poof.
Just a thought: Let's say a VBE add-in fills up a dictionary with all declarations when loading the project (or the add-in). As soon as a change is made in a codemodule, the dictionary is checked, missing values are added and a selection for the correct spelling is displayed if the values differ.
I thought this would be particularly complex for pure VBA code. It gets interesting with forms and reports, because the control/data field names also have an influence on the capitalization as soon as they are used in the VBA code. And these are then not so easy to change.
It gets interesting with forms and reports, because the control/data field names also have an influence on the capitalization as soon as they are used in the VBA code. And these are then not so easy to change.
Yep, same thing with linked tables names, too. I don't think this will solve 'all' the problems, but getting nearly there will be much better than not.
So this is what I'm envisaging....
- We introduce a new text file. Doesn't have to be JSON; could be just a simple list of all identifiers that is used in a project. We are not interested in identifiers defined in other libraries or projects.
- We build a dictionary of identifiers and write it out to this text file. It is then added to the repository. It then functions as a canonical list of names.
- During export and before the add-in actually performs an export, we would then do the following:
- Create a new module. We want to use temporary in-memory objects to avoid messing with the VBA project.
- Using VBIDE, write out something like
Private <identifier>. We don't care about scope or data type; just declaring it is enough to change the lettercase globally. - Delete the temporary module
- Proceed with the export as normal
- Post-export, we would then update the identifier list for any new identifier introduced. This would then show up in the commit window.
For the identifier that are affected by external objects (control, linked table, whatever), there'll be some churns but that is something that'll show up in the commit window and the user can then change the identifier list and thus stop the churning.
If this all works out, then every time we build from source, the lettercase will always be the same and we should hopefully minimize the manual steps that the user has to perform to correct the lettercasing since after exporting, the bad lettercasing will be corrected by the temporary module generated just prior to the export.
This avoid having to have a permanent module in user's project, and any deliberate lettercase change can be achieved by updating the text file directly, then building from source. Furthermore, that can be easily executed as an standalone operation (e.g., without exporting) for those who are OCD about this king of thing (definitely not talking about me here, nuh-uh).
The only uncertainty is how intensive the task will be. I can easily see identifiers going into hundreds or thousands. I'm fairly sure we can re-case the identifier quickly enough with the temporary module. Detecting whether there is a lettercase or building the list from scratch, OTOH...
Linking to this comment as I think the suggestion of editing a patch file might be very relevant for handling updates to the identifier lists and detecting lettercases.
As a test, I created a small Access application (accda) to read the declarations and their changes. Declaration Dictionary add-in
Note: Although this is only VBA code, the execution time is acceptable. It should run faster with compiled code (.net/twinBasic). Presumably not all relevant values are currently being read in - it is only a test. Declarations with Dim, Public, Private and procedure variables should be recognized.
@josef-poetzl that's awesome; seems like a great start to this.
I am experimenting with the add-in. If necessary, we can then adopt the code or use it as a suggestion. :)
Searching the VBA code works quite well in my opinion. This is my test CodeModule: DeclarationDictTestCodemodule.cls
Now comes the tricky part: how do I deal with the form/report objects (Controls, AccessFields, ...) ?
- The easiest way: ignore it, it only becomes relevant when it affects the VBA code.
- Consider only indirectly: when exporting the forms, use the content of the letter case from the (code only) dictionary.
- Include the spellings from the form objects in the dictionary as well.
My $0.02:
I think for starters we go with Option 1: Ignore it. We're not trying to solve the world, and we don't want to cause too much export delay.
We can always add those features in, perhaps there's a multi-tiered approach, like reverse - sanitization.
Thoughts on how this might work:
Off - don't use CaseDetect features. Minimal - VBA Code only, only exports dictionary for diff tools. Standard - VBA Code only, imports changes from dictionary Aggressive - Includes Form/Report objects as well in dictionary
Video (Youtube): Dictionary add-in meets MSAccessVCS
Over the next few days, I'll be testing its usability in everyday life. :-)
Video (Youtube): Dictionary add-in meets MSAccessVCS
Is there a non-video writeup? I can read and understand printed material 10-20x faster than videos.
Building on @bclothier's suggestion, this morning I started building out some of the functionality for scanning the current project to build a list of symbol names. At this point I am just parsing variable names, but I am really pleased with what I am seeing on the performance side. Testing against this add-in project, (parsing 38K lines of code and comments) I am seeing parsing at a tenth of a second. I am optimistic that we will be able to implement this with minimal performance impact.
Given how the case-change "feature" works (I actually figured this out 3 years ago, but forgot about it), correcting Access' case-changing errors is trivial.
Say Access changes one of your identifiers to the wrong case, and you detect this when, for instance doing git diff.
All you need to do is open any code module and type
Dim [name-with-correct-case] as [anything]
and then re-compile your project (Debug/Compile). This will force all instances of the name back to the correct case. You can then delete the statement you just typed.
In my most recent run-in with Access, it changed Class_Initialize to class_initialize because that's how I carelessly typed it when creating the method in a new class module.
Entering Dim Class_Initialize as Long and "committing" the statement by pressing Enter or moving the cursor away causes the symbol table to be updated back to the correct case.
If you do end up implementing case-insensitive hash generation please make it possible to disable it. I'd rather manage this myself since it's so easy to fix it just before committing a changeset to git.
@jhgarrison - Yes, it will be an optional feature, and basically automate what you are describing. The key part is that it will help developers make an intentional decision how they want to case their names, and keep the exports more consistent.
@joyfullservice: In my opinion, it is better to use regular expression to filter out the declarations. This allows a simpler code structure.
[...]
Dim Patterns(2) As String
'Procedures
Patterns(ProcIndex) = "(?:\r|\n|^)\s*(?:Sub|Function|Property Get|Property Let|Property Set|Event)\s+([^\r\n]*)"
' Enum/Type
Patterns(EnumTypeIndex) = "(?:\r|\n|^)\s*(?:Enum|Type)([\s\S]*?)(?:End\s+(?:Enum|Type))"
' Variable/Const/...
Patterns(2) = "(?:\r|\n|^)\s*(?:Dim|ReDim|Private|Friend|Public|Const|Static|Global|Implements)\s+([^\r\n]*)"
Dim i As Long
For i = 0 To UBound(Patterns)
RegEx.Pattern = Patterns(i)
AddFromCode Code, RegEx, i = ProcIndex, i = EnumTypeIndex
Next
Private Sub AddFromCode(ByVal Code As String, ByVal RegEx As RegExp, ByVal IsProcedure As Boolean, ByVal IsEnumTypeBlock As Boolean)
Dim Match As Match
Dim i As Long
For Each Match In RegEx.Execute(Code)
For i = 0 To Match.SubMatches.Count - 1
AddWordFromDeclaration Match.SubMatches(i), IsProcedure, IsEnumTypeBlock
Next
Next
End Sub
My test codemodule: DeclarationDictTestCodemodule.cls This code contains different variants of how variables, procedures etc. can be declared.
@josef-poetzl - Thanks for the suggestion! I was going off of the assumption that RegEx would be slower than parsing the lines using native VBA InStr functions. But after cloning your project this morning, adding in some performance testing, I think in the end both approaches are actually pretty similar in performance. Your overall approach was very efficient in that you are using RegEx to scan the entire code module at once rather than using the line parsing approach that I was doing. Nicely done! I also agree that RegEx is a much more elegant way to approach this from a code readability and maintainability standpoint.
What are your thoughts going forward? Your add-in is a nice purpose-built tool as it is, and we could certainly direct people to install and use that to manage letter case changes in VBA. We could also consider incorporating it as a feature directly in this add-in to help manage these case change issues during the routine export process. Either way seems like a reasonable approach to me.
@bclothier, @mwolfe02, and others, feel free to chime in with any thoughts on this... Do you think this would be a helpful feature in this add-in, or does a stand-alone add-in make more sense? In the big picture, I can imagine other add-ins for unit testing and deployment that would work alongside this add-in, but focus on different needs. I can also see a benefit to a growing ecosystem of open-source Access add-ins that are as easy to clone, build and install as @josef-poetzl's Declaration Dictionary add-in.
The advantage of the additional add-in as an independent but pluggable solution is that it can also be used for other VCS add-ins. The direct installation of the functionality would have the advantage that you have everything with one installation and there can be no interface problems. From a code perspective, both could exist, as the one working class can be further developed or adopted in both add-ins without any problems. The class just needs to be optimized a little so that you can resolve all dependencies via dependency injection or similar, for example.
I am generally a fan of not putting too much functionality in one add-in. When solving the upper/lower case problem, I am not sure how best to implement this, as it is very closely related to VCS.
For example: I extended in my msaccess-vcs-addin fork the procedures RunSubBefore/AfterExport so that I can call several other add-ins from the add-in. I start the DeclarationDict add-in and AccUnit one after the other. However, I am also overloads the VCS add-in, as the unit tests has nothing to do with VCS. But it is a good idea to run the tests before or after the (full) export. ;-) The next thing missing is another add-in/add-on that creates an accde in 32 and 64 bit after the export. :-) :-)
I don't have strong feelings either way. There are good points for each approach.
With the Access team planning built-in support for version control (but no plans to fix the case changing issue in VBA), having a standalone add-in that could work with both Access's future version control offering and this add-in starts to sound pretty compelling.
That said, there's enormous benefit to removing one of the major sources of annoyance for developers using version control with Access for the first time by including the add-in's functionality automatically with VCS Addin. What about updating the vcs-addin installer to include a checkbox (defaulted to True) to download and install the DeclarationDict add-in as part of running the VCS Addin setup?
@josef-poetzl , would you have any objection to that?
What do others (@bclothier, @jhgarrison, @joyfullservice, @hecon5) think of that approach?
A small thought experiment:
Adam:
In the big picture, I can imagine other add-ins for unit testing and deployment that would work alongside this add-in, but focus on different needs. I can also see a benefit to a growing ecosystem of open-source Access add-ins
Josef:
The next thing missing is another add-in/add-on ...
Kevin/Karl (DevCon 2023):
Click Here to Add Feature X!
Mike:
What about updating the vcs-addin installer to include a checkbox (defaulted to True) to download and install the DeclarationDict add-in as part of running the VCS Addin setup?
msaccess-vsc + add-ins/add-ons (not only DeclarationDict add-in):
- The download should not be a problem.
- An add-in does not necessarily have to be installed if it is called by the msaccess-vcs (RunSubBefore/AfterExport or Build).
- The add-ins could support a common interface so that communication with the msaccess-VCS add-in is possible. For example: output text via log, cancel export, etc.
In my opinion, other add-ins are particularly interesting for the msaccess-vcs add-in for calling before or after an export. What if you could not only set the call via the text box, but also add another option to define add-in calls in a list? Then you could also have the option of starting the download there.
Which extensions could enhance the VCS add-in?
- A build tool that, for example, creates an accde, sets the usual locks, assigns a version number, sets the ribbon, etc. I currently implement this using a VB script, which I always adapt and start manually for each application.
- Code quality inspection
- Maybe something like this: #507
- ...
TL/DR: both are valid approaches. I lean to having it be delivered with the Addin like @mwolfe02 mentioned, but as a separate tool...maybe even as an Addin-Addin, if you want to go meta...
Detail version: I think for me, the case change issue arises directly from doing version control tracking. Case change being an actual issue goes hand in hand with version control and tracking, and really isn't an issue if you're not tracking or controlling changes. So it makes sense it would be integrated with this addin.
I also think the "do one thing well" approach should keep us from getting code bloat.
Additionally, having a portable module would be AMAZING to add to non Access VBA tools. And having the case change class stay outside of the built / delivered file will ensure end users don't have excess code bloat.
All this to say, we already facilitate addin-like behavior with the run before/after settings, and @josef-poetzl I think is on the right track with this with overloading it.
What if we delivered the case change module along with vcs, and made it easier to add other extensions? This would keep the VCS Addin scope to what it is, but also make it so other things like the case change can be added as needed?
I'm inclined toward doing what we did with the hook; make it a dependent package of the VCS tool.
Yes, it could be exposed as a separate add-in, and it could be automatically included in the installation, but from what I saw what Eclipse did, I'm convinced that too much extensibility is a bad thing™. If I were to look at 100 different users of Visual Studio and load up their Visual Studio, I can probably figure out what's what -- some might have it dark, others have it light; some might have the solution explorer on the left, others might have on right. No biggie. I can wrap my brain around that. But If I were to look at 100 different users of Eclipse, the odd is that I'll be asking, "is that a different IDE? Doesn't look like Eclipse". Please don't get me started with how they ship barebone installer, then tell you that to do what you want do, you have to get 100+ optional addins, agree to 10 different licenses, and after installing, configure, configure, configure until everything breaks.
And all I wanted to do was use this one lousy tool that only runs in Eclipse!
So, the Eclipse model is absolutely what we should NOT do.
Exposing an add-in model potentially opens that door that should not be opened.
There are other considerations:
-
Having add-ins means we cannot guarantee the quality of the final product. Keep in mind that while we might say we are not responsible for the add-in's quality, the final product ultimately gets judged. See: SimCity hack.
-
This presents the user with more configuration options and therefore a larger matrix of things that could go wrong.
-
I'm generally more sympathetic toward the Apple model where the less is more, especially in relation to configuration. Even Visual Studio has adopted this model to some degree (consider for example the relative difficulty in changing whether C# code will use the K & R indenting style). The more popular OSS out there tend to be opinionated and I think that's a good thing because that simplifies the expectation for the same reasons people like to buy all-in-ones over building their own computers.
So using Josef's add-in as a dependent package that is installed means that it will just work and if for some reason a newer version no longer is compatible, the VCS tool is free to continue using the prior version that is known to work.
So that's my $0.02ZWD.
So using Josef's add-in as a dependent package that is installed means that it will just work and if for some reason a newer version no longer is compatible, the VCS tool is free to continue using the prior version that is known to work.
I think that's a very important aspect. In my opinion, this is easy to secure with GitHub. For the automated integration of an add-on, you only use the fork that you manage yourself. Then you can only use what you need.
Example:
From a VCS perspective, only the area highlighted in green is actually interesting. I do not have to load or save any values from a table or file via the add-in. I would control this from the VCS add-in. I don't even need to be able to switch to the entire list, as I am only interested in the differences.
To make this possible, I would design this area as an extra form that can be used either as a subform for the complete dialog or as a separate form for the VCS display.
In the VCS add-in, it should also be taken into account that this check should also run when exporting an single object. Note: The class already provides for the ability to check single codemodules. This would then have to be made available for add-on use via an API.
From a practical point of view, however, I believe that direct integration into the VCS add-in is easier. The class can still be improved together. You use the class in both add-ins and use your own suitable form for user access. Then there are no conflicts in the localization/translation.
However, the use of additional add-ins has a special charm. :)