umya-spreadsheet icon indicating copy to clipboard operation
umya-spreadsheet copied to clipboard

Possibly make cells be stored in the order in which they are encountered

Open iancormac84 opened this issue 10 months ago • 11 comments

The Cells type stores the list of cells in a HashMap that keeps track of the coordinates and the cell data. I have run into a scenario where it is better to have the cells stored in the order in which they would be seen in the workbook. I had to work around the problem by sorting the cells by their column number, after the fact. Would it be possible to change the HashMap to an IndexMap, or if this isn't desirable due to possible performance regressions caused by using IndexMap, can it be made into a feature? I have a branch in my fork where I used an IndexMap instead of the HashMap.

iancormac84 avatar Mar 11 '25 16:03 iancormac84

Another issue with using the HashMap is that it is very slow to get a list of all cells in a row or column, operations that are common.

Moreover people would probably want to iterate all the cells in a row in order, so the list should be sorted.

I'm thinking a different data structure might be more convenient. Or build two indices over the collection, one for rows and one for columns.

schungx avatar Mar 12 '25 02:03 schungx

@iancormac84 Thank you for your suggestion. I too do not believe that holding in HashMap type is the only correct answer. If there is a more suitable format, I would like to change to it. (Since this will be a major change, I would like to implement it in ver 3.0.0.)

Would it be best if sorting could be done by row number as well as column number?

can it be made into a feature? Yes, it can be made into a feature. This can of course be supported. The impact is minimal, so it can be handled in the current version.

MathNya avatar Mar 14 '25 06:03 MathNya

Just build two separate indices with maybe mapping row and column to lists of cells. They can then help greatly with searching and in order iteration.

If you use BTreeSets with (row,col) and (col,row), then you get the in order iteration for free and very fast row/column based searching.

schungx avatar Mar 14 '25 12:03 schungx

@iancormac84 is there an advantage to using IndexMap over adding the indices as suggested by @schungx?

c-git avatar Mar 14 '25 21:03 c-git

@c-git I don't have a strong preference for either using an IndexMap or adding indices, TBH. It's just that I was writing code that needed to be able to reference the contents of previous columns in the spreadsheet (my code makes use of the get_collection_by_row() API, which returns Vec<&Cell>) and because of the non-deterministic nature of HashMap, I was getting panics. I came up with the workaround of sorting the cells after the fact. Storing the cells in an IndexMap was the solution that came to mind, but the plumbing that allows one to get a Vec<&Cell> in the order in which they occur in the worksheet doesn't matter that much to me.

iancormac84 avatar Mar 14 '25 22:03 iancormac84

Ok thanks for explaining. I guess given that. Adding the index accomplishes the same thing with the bonus of having them be "naturally" ordered.

c-git avatar Mar 14 '25 23:03 c-git

@iancormac84 is there an advantage to using IndexMap over adding the indices as suggested by @schungx?

IndexMap in the back is just an index pointing to slots in a Vec. It keeps insertion order when you iterate it, but does not keep order in rows/columns. Therefore, you always have to iterate the entire Vec to find all the cells for a particular row/column. It is for different uses.

schungx avatar Mar 15 '25 01:03 schungx

Ok thanks for explaining. I guess given that. Adding the index accomplishes the same thing with the bonus of having them be "naturally" ordered.

Correct. Using a BTreeSet<(u32, u32)> or BTreeMap<u32, u32> makes it naturally ordered in the way you want it. BTreeSet<(u32, u32)> has the additional benefit of sorting the column after the sorted row, and vice versa.

The caveat is that now you have three data structures to update whenever you change something, not one. Maybe it should be encapsulated into the CellsCollection type...

schungx avatar Mar 15 '25 01:03 schungx

I'd second the wrapping data structure. How would BTreeMap<u32, u32> work though? Wouldn't it need to be a collections of u32's for each key? Personally I prefer the set but I'd want to play around with the API of BTreeSet a bit first to ensure we could get all for a given column or row.

Edit: In the past I've used structures similar to BTreeMap<u32, BTreeSet<u32>> so that I could quickly get all for a row or column. And we would need two of them again one for rows as the key and one for columns as the key. I'm not sure there is an easier way.

c-git avatar Mar 15 '25 01:03 c-git

How would BTreeMap<u32, u32> work though?

Ah, you're right. BTreeMap<u32, u32> won't work because the index needs to be unique.

In the past I've used structures similar to BTreeMap<u32, BTreeSet< u32>>

You don't need to. BTreeSet<(u32, u32)> works just fine as each row/column tuple is unique.

Using BTreeSet::range, you can get an iterator over the range of keys for a particular group of indices. For example, (4, 1) to (4, u32::MAX) inclusive will iterate all the keys for row/column 4, and the great thing is that iteration will naturally be ordered for the subsequent columns/rows.

For example it is a row/column index, then the cells stream will start from column 1 and upwards in a sorted manner. And if you simply iterate the index, it'll yield coordinates that are sorted by row/column or vice versa.

And since Cells is already an encapsulated type, we can just add the two indices there without disrupting existing code. Then we can over add iter_by_row_column, iter_by_column_row, iter_row_cells, iter_rows_cells, iter_column_cells, iter_columns_cells, iter_row_cells_range, iter_column_cells_range etc.

I'll see if I can make a PR.

schungx avatar Mar 15 '25 04:03 schungx

Here is the PR: https://github.com/MathNya/umya-spreadsheet/pull/273

schungx avatar Mar 15 '25 11:03 schungx