sheet2dict
sheet2dict copied to clipboard
Empty cells in xlsx become the non-empty string "None" in the dictionary
Thank you for writing this. I have recently moved from using csv files to excel files and this has made the transition easier.
One issue I have encountered is that empty cells are not being returned as a Falsey value like None, or the empty string ''.
From what I have been able to understand, openpyxl reads the empty cell as None, and the following code in sheet2dict uses str to convert this into the string 'None'.
https://github.com/Pytlicek/sheet2dict/blob/93bf731d327d620755ffd5585aedb0b23a17a6a8/sheet2dict/main.py#L30-L34
I have filtered my own dictionaries to turn 'None' into '' but that will cause issues when any of my sheet really do contain the word None.
I just did some quick testing, and it seems that the csv.DictReader() in the python standard library uses the empty string '' for missing values in the a csv file, so that might be a good thing to do for these cases.
Good idea @JoshuaCapel 👍 thanks for pointing this out 🙏 I hope I will have some spare time this weekend and I can work on this.
@JoshuaCapel thank you for raising this. I dusted off sheet2dict today for a new project and tried the latest version and experienced this. I had a un-merged a lot of cells and was shocked to see them full of "None". It's a real shame that openpxyl doesn't support merged cells... but then again, I have no idea how sheet2dict would handle them.
@Pytlicek - I lost your contact information somehow, please ping me. In the meantime, let's talk about a solution. I need to re-set up my IDE/computer to push/pull correctly.
Maybe we can add an if just to check to see if the result is None. I did not modify the sheet2dict source, I made allowances in my code.
- This is untested.
- I cannot recall if the current Python convention is to use equality checks or is/is not. Is there a more elegant solution? Is type checking the way to go?
- I know this implementation will add CPU cycles... I'm not sure if we should use if/else instead since every cell is checked twice with my implementation.
def item(row, col):
if type(sheet.cell(row=row, column=col).value) is None:
return (sheet.cell(row=1, column=col).value, '')
if type(sheet.cell(row=row, column=col).value) is not None:
return (sheet.cell(row=1, column=col).value, str(sheet.cell(row=row, column=col).value))
Also, whatever we do, we should add some test cases to check for this (You taught me well @Pytlicek!)
@bai-yi-bai do you have an example file? I want to look at this and better understand what is the case, what exactly we need and so on. Thanks
@Pytlicek:
I created a PR https://github.com/Pytlicek/sheet2dict/pull/9 (Actually there were 2 PRs, my IDE didn't like the paths hard-coded into the test files.)
-
inventory.xlsx is sufficient since it has empty cells.
-
The existing tests were insufficient to detect this bug.
-
I'm not sure why empty headers are not affected, but this uncovered a new bug: if two header columns are empty (None), then one of the column's data will be ignored/overwritten. We should examine this further; meaning we need to generate a new excel file with two empty columns.
-
main.py
def item(row, col):
if sheet.cell(row=row, column=col).value is None:
return (sheet.cell(row=1, column=col).value, '')
else:
return (sheet.cell(row=1, column=col).value, str(sheet.cell(row=row, column=col).value))
- \tests\test_3_xlsx.py
def test_parse_xlsx_sheet_items(worksheet):
ws = worksheet
ws.xlsx_to_dict(path="inventory.xlsx", select_sheet='SJ3')
ws_items = ws.sheet_items
assert "Taipei" in str(ws_items)
assert "Bratislava" not in str(ws_items)
assert "None:" in str(ws_items)
assert "'None'," not in str(ws_items) # Tests whether empty cells were converted to "" strings and not "None"
assert None in ws_items[0]
assert len(ws_items) > 1
assert len(ws_items) == 6
- I inserted a print(ws_items) and compared the output with https://www.diffchecker.com/diff
- Compare the difference between the d8d363b and my proposed changes. I tried to use BOLD on 'None' and ''. Theref are four changes.
- d8d363b:
- test_3_xlsx.py::test_parse_xlsx_sheet_items [{'country': 'Taiwan', 'city': 'Taipei', 'area (km)': '271.8', None: '88', 'rand_field': 'ee'}, {'country': 'South Korea', 'city': 'Seoul', 'area (km)': '605.2', None: '77', 'rand_field': 'ee'}, {'country': 'USA', 'city': 'New York', 'area (km)': '783.8', None: '66', 'rand_field': 'ee'}, {'country': 'None', 'city': 'None', 'area (km)': 'None', None: '55', 'rand_field': 'ee'}, {'country': 'USA', 'city': 'Miami', 'area (km)': '143.1', None: 'None', 'rand_field': 'ee'}, {'country': ' ', 'city': ' ', 'area (km)': ' ', None: ' ', 'rand_field': ' '}]
- My change:
- test_3_xlsx.py::test_parse_xlsx_sheet_items [{'country': 'Taiwan', 'city': 'Taipei', 'area (km)': '271.8', None: '88', 'rand_field': 'ee'}, {'country': 'South Korea', 'city': 'Seoul', 'area (km)': '605.2', None: '77', 'rand_field': 'ee'}, {'country': 'USA', 'city': 'New York', 'area (km)': '783.8', None: '66', 'rand_field': 'ee'}, {'country': '', 'city': '', 'area (km)': '', None: '55', 'rand_field': 'ee'}, {'country': 'USA', 'city': 'Miami', 'area (km)': '143.1', None: '', 'rand_field': 'ee'}, {'country': ' ', 'city': ' ', 'area (km)': ' ', None: ' ', 'rand_field': ' '}]
Thank you @bai-yi-bai 🙏 Merged and published on PyPI with version: sheet2dict 0.1.5 @JoshuaCapel pls test if you are still interested in this issue. THX 👍