[Question/Feature request] virtualized data model

Open jcmonnin opened this issue 4 years ago • 8 comments

I would like to show a huge data set in a datagrid. The source data can have multiple million rows and sometimes hundreds of columns. It's not practical to fetch all the data as it won't fit in browser memory (~10GB).

I'd like to be able to specify the columns (names and width, and the number of rows) and then fetch asynchronously the data as the user scrolls through.

I haven't seen a way to use the datagrid without having the full dataset in browser memory.

I just started to look at this library. Would this fit well into the existing code, or are there assumptions about having the full set of data upfront in many places in the library?

I have started to read through #334, which is very vaguely related.

jcmonnin avatar Feb 09 '21 05:02 jcmonnin

Would #145 provide that functionality, or is the DataProvider about something else?

jcmonnin avatar Feb 09 '21 08:02 jcmonnin

I just started maintaining this library in my spare time, so I cannot answer with the knowledge and experience of the original author (@TonyGermaneri), but I think your use case fits the design, for the most part. I don't think sorting or filtering would make any sense on a dataset that large, in terms of (as you mentioned) memory usage, compute cycles, or UX. But you can disable those two features.

A data provider interface would, I believe, definitely make that type of functionality possible, although we'd have to hammer out the details of such an interface a bit. It's certainly something I'd have a use for eventually as well, but I don't have a lot of time to spend on it myself at the moment, unfortunately.

(I do wonder, more out of curiosity than anything else, what type of user experience scrolling through literally millions of rows of data would be like?)

ndrsn avatar Feb 09 '21 20:02 ndrsn

I think the XHR paging demo here is a good starting point: https://canvas-datagrid.js.org/xhrPagingDemo.html (you have to "enable unsafe scripts" because GitHub Pages is HTTPS and the source of data in the demo is HTTP). You can also just look at the code.

The way it works is by loading the grid with an empty array (e.g.: const i = []; i.length = 10000000;), then attaching an event listener to the scroll event, which fires any time the user moves the view area. From there you can check which columns and rows are visible (using the grid and event data), request the data from your remote, populate the array, and then call grid.draw() to update the view. It's very fast as long as your API is fast. Data in memory stays in memory. You can also write a data adapter that uses IndexedDB in the browser to sync data. This is in fact the original use case of this program.
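A minimal sketch of that flow, for illustration only. The helper names (createSparseRows, missingPages, attachLazyLoading), the fetchRows callback, and the page size are hypothetical and not part of canvas-datagrid; the grid is only assumed to expose the scroll event, scrollIndexRect, and draw() mentioned in this thread. Whether the grid accepts rows that are still undefined is questioned further down, so treat that part as an assumption too.

```javascript
// Hypothetical helper: an array with a length but no allocated rows.
function createSparseRows(rowCount) {
  const rows = [];
  rows.length = rowCount; // holes only, no per-row objects
  return rows;
}

// Hypothetical helper: which pages overlapping [top, bottom] still need a request.
// A page counts as loaded once its first row has been filled in.
function missingPages(rows, top, bottom, pageSize, pending) {
  const lastPage = Math.floor((rows.length - 1) / pageSize);
  const first = Math.max(0, Math.floor(top / pageSize));
  const last = Math.min(lastPage, Math.floor(bottom / pageSize));
  const pages = [];
  for (let page = first; page <= last; page++) {
    if (rows[page * pageSize] === undefined && !pending.has(page)) {
      pages.push(page);
    }
  }
  return pages;
}

// Hypothetical wiring: fetchRows(start, count) must return a promise (or thenable)
// that resolves to an array of row objects.
function attachLazyLoading(grid, rows, fetchRows, pageSize = 100) {
  const pending = new Set(); // pages with a request in flight
  grid.addEventListener('scroll', () => {
    const { top, bottom } = grid.scrollIndexRect;
    for (const page of missingPages(rows, top, bottom, pageSize, pending)) {
      const start = page * pageSize;
      pending.add(page);
      fetchRows(start, pageSize).then(
        (chunk) => {
          chunk.forEach((row, i) => { rows[start + i] = row; });
          pending.delete(page);
          grid.draw(); // repaint with the newly loaded rows
        },
        () => pending.delete(page) // allow a retry on the next scroll
      );
    }
  });
}
```

Fetching whole pages instead of exactly the visible rows keeps the number of requests down while scrolling; the pending set avoids requesting the same page twice.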

Like @ndrsn I am not actively maintaining the project but I'm happy to jump in and answer questions when I can.

(in case you want to run the demo and can't figure out how to disable security: https://stackoverflow.com/questions/37387711/page-loaded-over-https-but-requested-an-insecure-xmlhttprequest-endpoint)

(paging demo source and relevant event hook: https://github.com/TonyGermaneri/canvas-datagrid/blob/master/tutorials/xhrPagingDemo.js#L111)

The size of the data is not related to the performance of the grid, which is pretty cool. So you shouldn't have a problem with millions of rows or columns. It was designed with that use case in mind.

TonyGermaneri avatar Feb 11 '21 07:02 TonyGermaneri

Thanks for the detailed answers. I had a look at the approach outlined in the XHR demo and could get it to work. Given it needs placeholder object per row, it doesn't scale optimally for really big tables. In the test page I did, I had to select a subset of columns to keep it really responsive, especially for the initial loading time (on a 1 million row table, I selected 100 of the 1500 available columns). With the subset of columns, it runs nicely. I haven't been able to do any further tests yet. For my use case, I think it would be useful if I didn't have to provide placeholder rows (e.g. a fully abstract virtual data provider).

jcmonnin avatar Mar 15 '21 00:03 jcmonnin

Given it needs placeholder object per row

oh no, don't do that. Specify the schema in the schema object then just set the length of the data. Don't actually put data in the rows until the XHR response comes back with the real data. The grid will make it look nice even if the rows have no data (undefined vs. a stub object). It should go quite fast and scale to millions of rows.
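A rough sketch of what that could look like. buildSchema, createEmptyData, and the default width are hypothetical names and values; the schema entries are assumed to use name and width properties, and the commented-out assignments to grid.schema and grid.data are how I would expect to hand them to the grid, not something verified here.

```javascript
// Hypothetical helper: describe the columns up front, without any row data.
function buildSchema(columns, defaultWidth = 120) {
  return columns.map((col) => ({
    name: col.name,
    width: col.width !== undefined ? col.width : defaultWidth,
  }));
}

// Hypothetical helper: an array that only has a length, every row is still a hole.
function createEmptyData(rowCount) {
  return new Array(rowCount);
}

const schema = buildSchema([{ name: 'id', width: 80 }, { name: 'label' }]);
const data = createEmptyData(1000000);

// Assumed usage with an existing grid instance:
// grid.schema = schema;
// grid.data = data;
// Rows are then assigned into `data` as XHR responses arrive, followed by grid.draw().
```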

TonyGermaneri avatar Mar 15 '21 17:03 TonyGermaneri

Thanks for the hints. I have done a few more experiments and could get usable results with more columns, but encountered some issues. I have prepared a jsfiddle to illustrate the issues.

  • I couldn't just set the length of the data array, but had to initialize each row with an empty object, otherwise it gives an error when assigning the data. You seem to say that an array of undefined should work. Am I doing anything wrong in my example? This is not a major issue, as preparing the array of empty objects only takes ~80ms per 1 million rows.
  • When implementing the lazy loading of the column, I got some unexpected empty cells. It looks like the scrollIndexRect for the columns is not always reliable. Sometimes left > right. It can be reproduced in the jsfiddle by scrolling to the very right. Any hints of how to investigate that issue? "scrollIndexRect issue", { bottom: 11, left: 987, right: 969, top: 0 }
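One defensive workaround for the second point, sketched under assumptions: it does not explain why left ends up greater than right, it only treats the two values as unordered bounds and clamps them to the table size before deciding which columns to load. normalizeIndexRect is a hypothetical helper, not a library function.

```javascript
// Hypothetical helper: return a rect with top <= bottom and left <= right,
// clamped to the valid row and column index ranges.
function normalizeIndexRect(rect, rowCount, columnCount) {
  const clamp = (value, max) => Math.min(Math.max(value, 0), max);
  return {
    top: clamp(Math.min(rect.top, rect.bottom), rowCount - 1),
    bottom: clamp(Math.max(rect.top, rect.bottom), rowCount - 1),
    left: clamp(Math.min(rect.left, rect.right), columnCount - 1),
    right: clamp(Math.max(rect.left, rect.right), columnCount - 1),
  };
}

// The rect reported above, on an assumed 1,000,000 x 1000 table:
const fixed = normalizeIndexRect(
  { bottom: 11, left: 987, right: 969, top: 0 },
  1000000,
  1000
);
```

Loading the union of both bounds may fetch a few more columns than strictly visible, which seems preferable to leaving cells empty.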

jcmonnin avatar Mar 17 '21 01:03 jcmonnin

Looking at the example I provided, it looks like it is creating an object for each row to show some information. IIRC I made that the same object ref or undefined somehow. It's been several years since I made this program and I've since left the company that owns the source. Try assigning every index to the same object, then replacing the index with a new object as the request comes back from the server. I'm pretty sure you can get it working with a bit of effort. Your use case is the one that this project was created for. I had several million rows and dozens of columns I had to move through and got it working. I wish I could be more precise in helping.
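The shared-placeholder idea could be sketched roughly like this. All names here (PLACEHOLDER, createPlaceholderRows, isLoaded, applyChunk) are hypothetical, and this is an untested guess at the approach rather than what the original demo does.

```javascript
// One shared object stands in for every row that has not been loaded yet.
// It must be treated as read-only, since every unloaded index points at it.
const PLACEHOLDER = {};

// Hypothetical helper: every index references the same placeholder,
// so no per-row object is allocated up front.
function createPlaceholderRows(rowCount) {
  return new Array(rowCount).fill(PLACEHOLDER);
}

// Hypothetical helper: a row is loaded once it no longer points at the placeholder.
function isLoaded(rows, index) {
  return rows[index] !== PLACEHOLDER;
}

// Hypothetical helper: replace placeholders with real rows from a server response.
function applyChunk(rows, start, chunk) {
  chunk.forEach((row, i) => { rows[start + i] = row; });
}
```

After applyChunk, a call to grid.draw() would presumably be needed to show the new rows, as in the scroll-handler description earlier in the thread.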

TonyGermaneri avatar Mar 17 '21 02:03 TonyGermaneri

the company that owns the source

What source is owned by that company?

The code in this repository is BSD-3-Clause-licensed and the copyright owner is you, @TonyGermaneri, correct?

pdkovacs avatar Jun 24 '23 20:06 pdkovacs