sheet2dict icon indicating copy to clipboard operation
sheet2dict copied to clipboard

Empty cells in xlsx become the non-empty string "None" in the dictionary

Open JoshuaCapel opened this issue 3 years ago • 1 comments

Thank you for writing this. I have recently moved from using csv files to excel files and this has made the transition easier.

One issue I have encountered is that empty cells are not being returned as a Falsey value like None, or the empty string ''.

From what I have been able to understand, openpyxl reads the empty cell as None, and the following code in sheet2dict uses str to convert this into the string 'None'.

https://github.com/Pytlicek/sheet2dict/blob/93bf731d327d620755ffd5585aedb0b23a17a6a8/sheet2dict/main.py#L30-L34

I have filtered my own dictionaries to turn 'None' into '' but that will cause issues when any of my sheet really do contain the word None.

I just did some quick testing, and it seems that the csv.DictReader() in the python standard library uses the empty string '' for missing values in the a csv file, so that might be a good thing to do for these cases.

JoshuaCapel avatar Apr 24 '22 12:04 JoshuaCapel

Good idea @JoshuaCapel 👍 thanks for pointing this out 🙏 I hope I will have some spare time this weekend and I can work on this.

Pytlicek avatar Apr 27 '22 11:04 Pytlicek

@JoshuaCapel thank you for raising this. I dusted off sheet2dict today for a new project and tried the latest version and experienced this. I had a un-merged a lot of cells and was shocked to see them full of "None". It's a real shame that openpxyl doesn't support merged cells... but then again, I have no idea how sheet2dict would handle them.

@Pytlicek - I lost your contact information somehow, please ping me. In the meantime, let's talk about a solution. I need to re-set up my IDE/computer to push/pull correctly.

Maybe we can add an if just to check to see if the result is None. I did not modify the sheet2dict source, I made allowances in my code.

  • This is untested.
  • I cannot recall if the current Python convention is to use equality checks or is/is not. Is there a more elegant solution? Is type checking the way to go?
  • I know this implementation will add CPU cycles... I'm not sure if we should use if/else instead since every cell is checked twice with my implementation.
 def item(row, col):
     if type(sheet.cell(row=row, column=col).value) is None:
        return (sheet.cell(row=1, column=col).value, '')
    if type(sheet.cell(row=row, column=col).value) is not None:
        return (sheet.cell(row=1, column=col).value, str(sheet.cell(row=row, column=col).value)) 

Also, whatever we do, we should add some test cases to check for this (You taught me well @Pytlicek!)

bai-yi-bai avatar Oct 20 '22 10:10 bai-yi-bai

@bai-yi-bai do you have an example file? I want to look at this and better understand what is the case, what exactly we need and so on. Thanks

Pytlicek avatar Oct 23 '22 18:10 Pytlicek

@Pytlicek:

I created a PR https://github.com/Pytlicek/sheet2dict/pull/9 (Actually there were 2 PRs, my IDE didn't like the paths hard-coded into the test files.)

  • inventory.xlsx is sufficient since it has empty cells.

  • The existing tests were insufficient to detect this bug.

  • I'm not sure why empty headers are not affected, but this uncovered a new bug: if two header columns are empty (None), then one of the column's data will be ignored/overwritten. We should examine this further; meaning we need to generate a new excel file with two empty columns.

  • main.py

        def item(row, col):
            if sheet.cell(row=row, column=col).value is None:
                return (sheet.cell(row=1, column=col).value, '')
            else:
                return (sheet.cell(row=1, column=col).value, str(sheet.cell(row=row, column=col).value))
  • \tests\test_3_xlsx.py
def test_parse_xlsx_sheet_items(worksheet):
      ws = worksheet
      ws.xlsx_to_dict(path="inventory.xlsx", select_sheet='SJ3')
      ws_items = ws.sheet_items
      assert "Taipei" in str(ws_items)
      assert "Bratislava" not in str(ws_items)
      assert "None:" in str(ws_items)
      assert "'None'," not in str(ws_items) # Tests whether empty cells were converted to "" strings and not "None"
      assert None in ws_items[0]
      assert len(ws_items) > 1
      assert len(ws_items) == 6
  • I inserted a print(ws_items) and compared the output with https://www.diffchecker.com/diff
  • Compare the difference between the d8d363b and my proposed changes. I tried to use BOLD on 'None' and ''. Theref are four changes.
  • d8d363b:
  • test_3_xlsx.py::test_parse_xlsx_sheet_items [{'country': 'Taiwan', 'city': 'Taipei', 'area (km)': '271.8', None: '88', 'rand_field': 'ee'}, {'country': 'South Korea', 'city': 'Seoul', 'area (km)': '605.2', None: '77', 'rand_field': 'ee'}, {'country': 'USA', 'city': 'New York', 'area (km)': '783.8', None: '66', 'rand_field': 'ee'}, {'country': 'None', 'city': 'None', 'area (km)': 'None', None: '55', 'rand_field': 'ee'}, {'country': 'USA', 'city': 'Miami', 'area (km)': '143.1', None: 'None', 'rand_field': 'ee'}, {'country': ' ', 'city': ' ', 'area (km)': ' ', None: ' ', 'rand_field': ' '}]
  • My change:
  • test_3_xlsx.py::test_parse_xlsx_sheet_items [{'country': 'Taiwan', 'city': 'Taipei', 'area (km)': '271.8', None: '88', 'rand_field': 'ee'}, {'country': 'South Korea', 'city': 'Seoul', 'area (km)': '605.2', None: '77', 'rand_field': 'ee'}, {'country': 'USA', 'city': 'New York', 'area (km)': '783.8', None: '66', 'rand_field': 'ee'}, {'country': '', 'city': '', 'area (km)': '', None: '55', 'rand_field': 'ee'}, {'country': 'USA', 'city': 'Miami', 'area (km)': '143.1', None: '', 'rand_field': 'ee'}, {'country': ' ', 'city': ' ', 'area (km)': ' ', None: ' ', 'rand_field': ' '}]

bai-yi-bai avatar Oct 31 '22 05:10 bai-yi-bai

Thank you @bai-yi-bai 🙏 Merged and published on PyPI with version: sheet2dict 0.1.5 @JoshuaCapel pls test if you are still interested in this issue. THX 👍

Pytlicek avatar Oct 31 '22 12:10 Pytlicek