mwparserfromhell icon indicating copy to clipboard operation
mwparserfromhell copied to clipboard

Implement Wikicode.nodes as a doubly-linked list

Open Rua opened this issue 10 years ago • 3 comments

In my experience, it's not very common to want random access to nodes by their index. Instead, it's more common to go through the list sequentially with one of the filter iterators, possibly skipping some along the way. Modifying the node list by insertion or deletion is doubly-slow with the current implementation: first the list has to be traversed to find the index of the insertion point, then to insert means that all subsequent elements have to be shifted over. Both insertion and deletion of nodes are common operations.

So I think that it would make more sense to use linked lists for this implementation. Insertion and deleting will be fast, and iteration is not affected, nor is searching. Indexing will be slow, but that's not such a common thing to do anyway. Another thing that is sorely needed is, given a node, to find the previous and following nodes. This is also much simpler and faster with a linked list, as it doesn't mean having to find the index first. I think that linked lists would also allow modifying the list during iteration, but I'm not sure. If possible, that would be yet another pro.

If this is done, it should probably be done for template parameters and other lists of nodes too.

Rua avatar Oct 28 '15 19:10 Rua

Out of curiosity: Is this a real problem in the parsing process, i.e., are there lots of inserts or deletes? And would linked lists solve this? Complexity theory tells you that lists are better than arrays when manipulated. However, copying bits of memory is really fast on modern hardware. Lists are implemented as arrays in CPython and those have no problem with being changed a lot. I think John Carmack once said you shouldn't bother with lists and just use arrays always.

jfolz avatar Oct 28 '15 23:10 jfolz

Actually, @jfolz, I think you might be right. I recall that the overhead of linked lists (both in terms of memory usage and their conflicts with CPU caching, etc) often make them less-than-helpful on modern machines when compared to regular arrays. Also, even though mwpfh nodes can contain a fair bit of data overall, individual SmartListss usually don't have that many elements.

Still, it might be worth profiling.

There is a broader question of making nodes aware of their parents and siblings. I can see some use cases for that.

earwig avatar Oct 28 '15 23:10 earwig

That's what I tell my students, though they never listen ;) My suspicion is that it's not really spending a substantial amount of time in list at all.

jfolz avatar Oct 29 '15 11:10 jfolz