XlsxWriter Updated xl_col_to_name() method to add recursion

Updated xl_col_to_name() method to add recursion

Open agriyakhetarpal opened this issue 2 years ago • 15 comments

This PR links to issue #881 which I opened recently – I propose that the xl_col_to_name() function (here) can be improved to a recursive function to bring it to O(log₂₆ n) time complexity and O(1) auxiliary space.

I created a GitHub Gist here to compare the times of the older and the newer functions I propose with %timeit. IPython defaults to n = 100 anyway. The updated function is faster.

Writing tests

I haven't written any unit tests to resolve the issue as of now – I'm not aware of how to do so, but I can learn.

Jun 26 '22 18:06 agriyakhetarpal

Hi @jmcnamara, how may I resolve the unsuccessful checks? The test is failing due to a whitespace of some sort in my code

Jun 27 '22 18:06 agriyakhetarpal

Update: I fixed "W293 blank line contains whitespace" with autopep8 in another commit. The tests probably should not fail now

Jun 27 '22 18:06 agriyakhetarpal

You can run the tests as follows:

# One time:
python -m pip install --upgrade pip
pip install pytest flake8

# Then run the tests:
make test_flake8
pytest

Jun 27 '22 19:06 jmcnamara

I'm on a Windows machine so I cannot run make, is there any other way to go about this? I tried it on WSL (Ubuntu) as well but there it says test_flake8 does not exist.

Running pytest by itself without make works on Windows, 1572 passed and 0 errors detected – albeit the same gives me 15 errors during collection on WSL

Jun 27 '22 19:06 agriyakhetarpal

Update: resolved the above by installing make using Chocolatey.

I ran the flake8 tests, and resolved four issues – E501, a couple of W291s, and a E115. pytest also passes 1572, no errors. Pushed once again

Jun 27 '22 20:06 agriyakhetarpal

For future reference you don't need make. You just need to run flake8 on the file(s) you modified:

# Change path to suit
flake8 xlsxwriter\utility.py

Also, you will need to squash your 3 commits into 1.

In order to merge this I will need to see some benchmarks using timeit or similar.

You will need to benchmark for the 1, 2 and 3 digit column numbers for example 0, 701 and 16383 (equivalent to A, ZZ and XFD).

You will also need to benchmark with and without your "alphabets" lookup to compare like for like with the existing implementation (which uses this col_letter = chr(ord('A') + remainder - 1).

So in summary can you put a benchmark together to test:

Column numbers 0, 701, 16383 (and maybe a combination) for:
- The existing implementation
- The new implementation
- The new implementation with chr(ord('A')

Jun 28 '22 09:06 jmcnamara

Thanks a lot @jmcnamara, – I shall look into benchmarking them this week and get back!

Jun 28 '22 09:06 agriyakhetarpal

Thinking about this a little more and there are a few things to consider:

In general most real life column name conversions are in the column range A-Z so there won't be any tail recursion.
The maximum conversion would be for column XFD which would only be 2 recursions
The conversion is hidden behind a hash/cache lookup so it is only called once per column

Anyway, it would still be interesting to see your performance data when you get it. There is no rush on it.

Jul 01 '22 10:07 jmcnamara

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
0.0% Duplication

Jul 02 '22 10:07 sonarcloud[bot]

I get your points. The recursive method may be a tad confusing to write (and understand!) in code, and also may not even be needed for practical purposes. I have just squashed the commits into one and force-pushed it.

I'm working on testing the functions now using the timeit library – in what way are the benchmarks needed? Should I put them in a Colab notebook and upload as a GitHub Gist? Or, are they to be put in a .py file in the xlsxwriter directory?

Jul 02 '22 10:07 agriyakhetarpal

Should I put them in a Colab notebook and upload as a GitHub Gist

Yes, that would be great.

Jul 02 '22 18:07 jmcnamara

Apologies for the slight delay in sending these performance benchmarks – here they are in a GitHub Gist, as requested.

The new implementation with chr(ord('A') has not been pushed yet – I shall push them in another commit and squash the two into one later, based on your feedback on the above.

Jul 14 '22 12:07 agriyakhetarpal

Apologies for the slight delay in sending these performance benchmarks

No problem. I'm not in a rush. :-)

The new implementation with chr(ord('A') has not been pushed yet

You don't need to put this in a PR. Just add it to your benchmark as another option. So to be clear, just add another function to your gist called xl_col_to_name3() or similar, that uses chr(ord('A') instead of alphabets so that we can compare like with like.

Jul 14 '22 14:07 jmcnamara

So to be clear, just add another function to your gist called xl_col_to_name3() or similar

Oh, I've actually done that in the benchmark file, just not in the PR. So it contains a comparison of three functions; the original implementation, the newer implementation, and the chr(ord('A') implementation.

(I did not define the functions with different names. I ran the benchmarks sequentially)

Jul 14 '22 18:07 agriyakhetarpal

Oh, I've actually done that in the benchmark file,

Sorry I missed that. Thanks.

Jul 15 '22 11:07 jmcnamara

XlsxWriter XlsxWriter copied to clipboard

Updated xl_col_to_name() method to add recursion

Writing tests

XlsxWriter
XlsxWriter copied to clipboard