Overlapping words / inconsistencies?
Hey,
I've noticed that dictionary words with significant overlap might display in an inconsistent manner. I'm highlighting my questions with (Q).
Consider the set of valid dictionary terms sharing the term "week" (星期):
(days of the week):
- 星期一
- 星期二
- 星期三
- 星期四
- 星期五
- 星期六
- 星期日
- 星期天
as well as:
- 星期 (week)
- 下个星期 (next week)
Searching for 星期 and then clicking on the edge between 星 and 期 should, I assume, list all of the words above. However:
- I only get 星期 (week) and 星期五 (Friday) listed; (Q) why not all other dictionary words containing 星期?
Also:
- Considering the nodes 星 and 期 and the edge between them, (Q) what determines the word appearing on the edge, given that there are multiple candidates, in this case all words in the list above? I sometimes get 星期, other times 星期天, for example.
Thanks a lot!
Thanks for the questions!
(Q) why not all other dictionary words containing 星期?
I set this up to display only the first 2 words encountered in the word list; the limit is a constant in the code. You're probably right that I should increase that, though I think it's still best to keep some upper limit (when the word list has the entire dictionary in it, it's common to get a bunch of rarely used words on most edges in the graph). Or I could allow more than 2 if the word being processed is in the first N words encountered in the word list. Lots of options. This would be an easy one to make configurable with env variables.
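A minimal sketch of that cap, in Python purely for illustration. The function name `words_per_edge`, the env variable `MAX_WORDS_PER_EDGE`, and the choice to treat each adjacent character pair as a directed edge are all assumptions, not the project's actual code.

```python
import os
from collections import defaultdict

# Hypothetical env variable; the default of 2 matches the limit described above.
MAX_WORDS_PER_EDGE = int(os.environ.get("MAX_WORDS_PER_EDGE", "2"))


def words_per_edge(word_list, max_words=MAX_WORDS_PER_EDGE):
    """Map each adjacent character pair to the first `max_words` words containing it.

    `word_list` is assumed to be ordered by frequency, most common first.
    """
    edges = defaultdict(list)
    for word in word_list:
        # Every pair of neighboring characters in a word is one edge.
        for left, right in zip(word, word[1:]):
            words = edges[(left, right)]
            if len(words) < max_words and word not in words:
                words.append(word)
    return dict(edges)


word_list = ["星期", "星期五", "星期一", "星期天", "下个星期"]
edges = words_per_edge(word_list, max_words=2)
print(edges[("星", "期")])  # ['星期', '星期五']
```

With a cap of 2, only the two earliest entries in the list survive on the 星/期 edge, which would match the behaviour described in the question.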
(Q) what determines the word appearing on the edge, given that there are multiple candidates, in this case all words in the list above?
It should be the first word it encounters in the word list, which is assumed to be in order of word frequency (i.e., higher priority to learn the words earlier in the list than later). If it's inconsistent with the same word list, then I have a bug!
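To make the order dependence concrete, here is a small sketch in Python. The helper `edge_label` is hypothetical and only mirrors the "first word encountered" rule described above; it is not taken from the project.

```python
def edge_label(word_list, left, right):
    """Return the first word in `word_list` in which `left` is immediately followed by `right`."""
    pair = left + right
    for word in word_list:
        if pair in word:
            return word
    return None  # no word in the list connects these two characters


by_frequency = ["星期", "星期五", "星期天"]
custom_order = ["星期天", "星期", "星期五"]

print(edge_label(by_frequency, "星", "期"))  # 星期
print(edge_label(custom_order, "星", "期"))  # 星期天
```

The same set of words gives a different edge label as soon as the list is reordered, so switching between word lists can look like inconsistent behaviour.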
Hey,
Thanks for the explanations, that makes a lot of sense. I think the design choices are reasonable - I was actually wondering how you avoid graph "explosions".
Yes, I agree, having some configurability (why not via Docker) would be useful. This is also a very good idea:
Or I could allow more than 2 if the word being processed is in the first N words encountered in the word list.
If it's inconsistent with the same word list, then I have a bug!
I later realised those inconsistencies were due to me playing with the custom wordlist sets.
On the back of this discussion, can I also assume that:
- The word appearing on the edge is always the most frequent amongst the candidates?
- There is also an upper limit to the number of edges drawn from a node when it is clicked? How many is "too many" in that case?
Thanks!
word appearing on the edge is always the most frequent amongst the candidates
yes, assuming the word list is in order of frequency (this is the case for hanzigraph.com, though I suppose user-supplied word lists could be anything).
avoid graph "explosions"
ah, not sure where I said it, but besides the words per edge limit, it might've also been referring to:
an upper limit to the number of edges drawn from a node when clicked
yes, there's a limit of 8 right now (code). I've been thinking I should bump this to 10 or 12 for very common characters though. For additional context, I used to show all characters in HSK in the graph, and so 不 had like 100 edges or something, which made it hard to use (see #1 for some feedback to that effect).
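A rough sketch of such a per-node cap, in Python for illustration. The thread only states the limit of 8; the constant name `MAX_EDGES_PER_NODE`, the helper `neighbors_for`, and the rule that neighbors are kept in word-list order are assumptions.

```python
MAX_EDGES_PER_NODE = 8  # the limit mentioned above; the constant name is hypothetical


def neighbors_for(character, word_list, max_edges=MAX_EDGES_PER_NODE):
    """Collect up to `max_edges` distinct neighbors of `character`, in word-list order."""
    neighbors = []
    for word in word_list:
        for left, right in zip(word, word[1:]):
            if left == character:
                other = right
            elif right == character:
                other = left
            else:
                continue
            if other != character and other not in neighbors:
                neighbors.append(other)
                if len(neighbors) == max_edges:
                    return neighbors  # stop once the cap is reached
    return neighbors


word_list = ["不是", "不要", "对不起", "不好", "不同", "不过", "不能", "不会", "不用", "不行"]
print(neighbors_for("不", word_list))
```

Raising `max_edges` to 10 or 12 for very common characters would then be a one-argument change.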
though I suppose user-supplied word lists could be anything
Therefore users should be made aware of this when activating custom wordlists: order matters.
Regarding the matter of "multiple edges" between nodes (that is, several candidate words for one edge), I was thinking: would it be difficult to implement some arrow buttons on the edge that would cycle through the words? Something like a spinner input widget. This would allow the user to select the word they would like to jump to, instead of deciding that with word ordering in data files.
that is a really good question. I will look into whether Cytoscape.js supports something like that.
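Independent of how the arrow buttons would be rendered, the cycling state itself is small. A sketch in Python, for illustration only: the class `EdgeWordCycler` is hypothetical and does not use or imply any Cytoscape.js API.

```python
class EdgeWordCycler:
    """Track which candidate word is currently shown on one edge."""

    def __init__(self, candidates):
        if not candidates:
            raise ValueError("an edge needs at least one candidate word")
        self.candidates = list(candidates)
        self.index = 0  # start on the first (most frequent) word

    @property
    def current(self):
        return self.candidates[self.index]

    def next(self):
        # Wrap around to the first word after the last one.
        self.index = (self.index + 1) % len(self.candidates)
        return self.current

    def previous(self):
        # Wrap around to the last word before the first one.
        self.index = (self.index - 1) % len(self.candidates)
        return self.current


cycler = EdgeWordCycler(["星期", "星期五", "星期天"])
print(cycler.current)     # 星期
print(cycler.next())      # 星期五
print(cycler.previous())  # 星期
print(cycler.previous())  # 星期天
```

The default would still be the most frequent word, so word-list order would decide only the starting point rather than the only reachable word.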