third-stats

Semantic analysis features for fun and elucidation.

Open gessel opened this issue 1 year ago • 3 comments

Is your feature request related to a problem? Please describe.

I'd really enjoy additional semantic data, both for fun and for research:

Describe the solution you'd like

One smol bug:

The activity heat map year select has a bug if there are future dated emails - the current year (2024 as of this writing) includes messages from 2033 which smooshes the heat map. yes, that should be impossible, but junk mail does that.

Semantic Visualizations:

There are a few useful/interesting semantic analysis measures I'd like to visualize in addition to raw message count. That is, I'd like to be able to switch the view from raw message count to:

  • Word count (excluding quoted/reply text and signatures)
  • Flesch reading-ease value
  • Vocabulary density

By changing the statistical basis from raw message count to a semantic analysis measure, the "Most received from" chart would follow suit, ranking senders by the relevant semantic score.

An additional widget might be a word cloud.

gessel avatar Mar 13 '24 15:03 gessel

Hi David, thanks for your great suggestions 👏🏻

The activity heat map year select has a bug if there are future dated emails - the current year (2024 as of this writing) includes messages from 2033 which smooshes the heat map. yes, that should be impossible, but junk mail does that.

Good catch. I've never had this case on my end before, but it should be easily fixable. We could just ignore messages from the future in the stats analysis, or we could even classify those as junk.
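A minimal sketch of the first option (ignoring future-dated messages), written in Python for illustration only and not taken from the add-on's code. The message dicts and the `drop_future_messages` helper are hypothetical:

```python
from datetime import datetime, timezone

def drop_future_messages(messages, now=None):
    """Return only the messages whose date is not later than `now`.

    `messages` is assumed to be a list of dicts with a timezone-aware
    `date` key; that shape is an assumption for this sketch.
    """
    now = now or datetime.now(timezone.utc)
    return [m for m in messages if m["date"] <= now]
```

Filtering before the per-year aggregation would keep a stray 2033 message from stretching the heat map's range.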

  • Word count (excluding quoted/reply text and signatures)

This is only possible if Thunderbird provides a function that returns only the actual email content (without quotes, signatures, etc.) via the webext API. I'm afraid this doesn't exist yet.

  • Flesch reading-ease value

This is easy to calculate, but depends on the first point. If we don't have an accurate word or sentence count, this would make no sense.
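For reference, a rough sketch of the calculation in Python (illustrative, not the add-on's code). It uses the standard Flesch reading-ease formula, which is calibrated for English; the vowel-group syllable counter and the function names are assumptions of this sketch and only approximate real syllable counts:

```python
import re

def count_syllables(word):
    """Rough English-only heuristic: one syllable per run of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    """Return the Flesch reading-ease score, or None for empty input."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
```

The score is only as good as the word and sentence counts fed into it, which is why quoted text and signatures would need to be removed first.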

  • Vocabulary density

How is this calculated? But again this depends on having the actual email content.

An additional widget might be a word cloud.

What value can be retrieved from a word cloud? In the past, I haven't found them very useful.

devmount avatar Mar 13 '24 15:03 devmount

Hi devmount!

Thanks for an awesome project. I didn't realize email content wasn't available :-( As an edge case, and probably only useful to an audience of approximately 1, it wouldn't be that hard to write a server-side script to append headers with the semantic values.
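A sketch of what such a script might look like, using Python's standard `email` package. The `X-Word-Count` header name and the `add_semantic_headers` function are made up for illustration, and this version counts every word in the plain-text part, quoted text included:

```python
from email import message_from_string, policy

def add_semantic_headers(raw_message):
    """Parse a raw message, add a word-count header, and re-serialize it."""
    msg = message_from_string(raw_message, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    # Hypothetical header name; any X- header the client can read would do.
    msg["X-Word-Count"] = str(len(text.split()))
    return msg.as_string()
```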

Vocabulary Density would count the unique words after generating a word count, then compute the ratio of unique words to total words. As I think about examples, it'd be pretty useless for short messages. Lexical Density might be more interesting, but that requires not only having access to message contents but also to a part-of-speech model.
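In code, the vocabulary density ratio is tiny. A minimal Python sketch, assuming a simple `\w+` tokenizer and case-insensitive matching (both are choices made for this example):

```python
import re

def vocabulary_density(text):
    """Ratio of unique words to total words; 0.0 for empty text."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    return len(set(words)) / len(words)
```

A one-word reply like "Thanks" scores 1.0, which illustrates why the measure says little about short messages.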

Word clouds... I feel your lack of enthusiasm. I imagine it as being a more useful exploratory tool, comparing the dominant results for one folder or correspondent vs. another. Also the TF-IDF internal function yields some useful semantic metrics.

-David

gessel avatar Mar 13 '24 18:03 gessel

My pleasure. Yeah, it's unfortunate. I could try to parse the contents myself (e.g. ignoring every line starting with > and every line below --), but that might be both inefficient and imprecise. It would most probably multiply the stats computation time.
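The heuristic would look roughly like this, sketched in Python rather than the add-on's JavaScript. The function name is hypothetical, and the approach has known gaps: it leaves attribution lines such as "On ... wrote:" in place and it cuts the message off if a bare "--" line appears in the body:

```python
def strip_quotes_and_signature(body):
    """Drop quoted lines (starting with '>') and everything from the
    signature delimiter ('--' or '-- ') onward."""
    kept = []
    for line in body.splitlines():
        if line.rstrip() == "--":
            break  # signature starts here; ignore the rest
        if line.lstrip().startswith(">"):
            continue  # quoted reply text
        kept.append(line)
    return "\n".join(kept).strip()
```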

Lexical density sounds interesting. I've worked with part-of-speech models in the past and trained some language models myself. But ThirdStats is a multilingual tool, which would mean having a model for each supported language, and that is definitely out of scope for this project 😅

Sorry, I didn't mean to downplay your idea 🙏🏻 I was actually interested in use cases for word clouds. Just because I didn't find them useful doesn't mean they aren't 🤷🏻‍♂️ I guess the crucial part is the metric they're based on. Count of occurrences? TF-IDF?
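To make the difference concrete, here is a minimal TF-IDF sketch in Python. It treats each folder or correspondent as one document; that grouping, the `tf_idf` name and the plain `log(N / df)` weighting are assumptions of this example, not anything ThirdStats does today:

```python
import math
import re
from collections import Counter

def tf_idf(documents):
    """Map each document name to a {word: tf-idf score} dict.

    `documents` maps a name (e.g. a folder) to its combined text.
    """
    tokenized = {name: re.findall(r"\w+", text.lower())
                 for name, text in documents.items()}
    # Number of documents each word appears in.
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))
    n_docs = len(tokenized)
    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        total = len(words)
        scores[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores
```

Unlike a raw occurrence count, a word that appears in every folder scores zero, so a cloud built on these weights would surface what is distinctive about one folder or correspondent compared with the others.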

Again thank you for your suggestions. It makes me think more in the language statistics direction, which is nice.

devmount avatar Mar 13 '24 19:03 devmount