deltachat-core remove Re: when using the subject as name for ad-hoc groups

When creating ad-hoc groups, the Subject: is used as the name for the group. If this string starts with the prefix Re:, the prefix should be removed. The same possibly applies to the prefix Fwd:.

Mar 06 '18 16:03 r10s

Should prefixes in other languages be handled as well or is it better to stick to Re: and Fwd: first as long as we don't have tests (I think so based on another issue I saw)?

Apr 14 '18 19:04 mbeko

When parsing messages, we prepend non-Re-subjects from non-Delta-messages to the message. For detection, we only check whether the 2nd or 3rd character is a colon, see https://github.com/deltachat/deltachat-core/blob/master/src/mrmimeparser.c#L1483

maybe this is also sufficient here?

Apr 14 '18 21:04 r10s

Good idea. This could be improved by finding the last occurrence of : and removing it together with the preceding string. That way, whole chains of Re: and Fwd: are removed, and I assume : is rarely part of the actual subject. Also, including the space avoids confusion with emotes.

Apr 16 '18 16:04 mbeko

GitHub seems to have removed the space, so just in case: There is a space after the colon in the search string 😉

Apr 16 '18 16:04 mbeko

Better not to assume that the colon is never used in the subject, even if only in "new subject (was: this old subject)".

Maybe remove all repetitions from left to right, using something like "at the beginning, all permutations of the known case-insensitive strings, optionally separated with colons or spaces, and with a colon and space at the end"? For example "^([re|fwd]*:* *): " (not sure if it is a valid regexp).

Apr 16 '18 17:04 testbird

Then we're back at the question whether to check only for Re: and Fwd: or their translations, too. If we're to check for the specific words, then I indeed thought about something along the lines of that regex.

So, now the questions are:

Check for the colons?
- At the second or third position only?
- At the last position in the string?
Check for specific words?
- Only Re: and Fwd:?
- Also the translations?

Apr 16 '18 17:04 mbeko

I would search for the first colon, if it is at the 2nd or 3rd position, remove the colon and the text left of this colon and left-trim the rest. After that I would repeat this procedure until there are no more matches.

This way we would get rid of all Re: Fwd: AW: and so on stuff at the beginning of the subject.
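The procedure described above could be sketched roughly as follows. This is only an illustration, not code from deltachat-core: the function name skip_short_prefixes is hypothetical, and it assumes that "2nd or 3rd position" means the colon follows a 2- or 3-character prefix (zero-based offset 2 or 3), which covers Re:, AW: and Fwd:.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of deltachat-core: returns a pointer into
 * `subject` just past any leading chain of short prefixes ("Re: Fwd: AW: ").
 * Assumption: the colon must sit at zero-based offset 2 or 3, i.e. directly
 * after a 2- or 3-byte prefix. */
const char* skip_short_prefixes(const char* subject)
{
    assert(subject != NULL);
    const char* s = subject;
    for (;;) {
        /* Look at the first colon of the remaining text only. */
        const char* colon = strchr(s, ':');
        if (colon == NULL) {
            break;
        }
        ptrdiff_t offset = colon - s;
        if (offset != 2 && offset != 3) {
            break; /* colon belongs to the real subject, stop here */
        }
        /* Drop the prefix and the colon, then left-trim the rest. */
        s = colon + 1;
        while (*s == ' ' || *s == '\t') {
            s++;
        }
    }
    return s;
}
```

Because only the first colon of the remaining text is inspected, a subject like "Re: new subject (was: old subject)" keeps its inner colon. The rule is purely positional, though, so any other 2- or 3-byte word followed by a colon at the start (for example "PS:") would be trimmed as well.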

Questions: are the 2 or 3 characters sufficient? has anyone seen a localized Re: or Fwd: prefix with more or less characters? If it is not sufficient, we may use a list. Btw. maybe we should move all this into a handy function (mrtools.c, which needs some cleanup some time); it seems to be needed in at least two places.

Regarding RegEx: Currently there is no RegEx-library available and i think we should not add one for such a minor thing.

Apr 16 '18 18:04 r10s

[repeat remove up to + trimming] first colon, if it is at the 2nd or 3rd position

Good idea, works even for the "Re: new subject (was: old subject)" case.

Apr 16 '18 22:04 testbird

I would add a final test to check that the resulting string is not empty. If it is empty, I would discard all modifications and use the original subject. Only if the original is empty as well can we use a fallback such as "no subject". All this could go into a separate function, e.g. char* mr_cleanup_subject(const char* subject)
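A minimal sketch of such a function, under stated assumptions: only the signature char* mr_cleanup_subject(const char* subject) comes from this thread. The body, the static helpers skip_short_prefixes and dup_string, the choice to return a heap-allocated copy that the caller frees, and the literal fallback text are all illustrative guesses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: skips leading 2- or 3-byte prefixes ending in a colon. */
static const char* skip_short_prefixes(const char* s)
{
    assert(s != NULL);
    for (;;) {
        const char* colon = strchr(s, ':');
        if (colon == NULL || (colon - s != 2 && colon - s != 3)) {
            return s;
        }
        s = colon + 1;
        while (*s == ' ' || *s == '\t') {
            s++;
        }
    }
}

/* Hypothetical helper: portable replacement for strdup(). */
static char* dup_string(const char* s)
{
    size_t n = strlen(s) + 1;
    char* copy = malloc(n);
    if (copy != NULL) {
        memcpy(copy, s, n);
    }
    return copy;
}

/* Returns a newly allocated string (assumption: the caller frees it). */
char* mr_cleanup_subject(const char* subject)
{
    if (subject == NULL || subject[0] == '\0') {
        return dup_string("no subject"); /* fallback for an empty original */
    }
    const char* cleaned = skip_short_prefixes(subject);
    if (cleaned[0] == '\0') {
        return dup_string(subject); /* stripping left nothing: keep original */
    }
    return dup_string(cleaned);
}
```

With this sketch, a subject consisting only of "Re: " is returned unchanged, while an empty subject yields the fallback text.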

Apr 17 '18 10:04 r10s

Sounds great with the first colon and the repetition! Thanks for the hint where to put the code for reuse and about the empty check.

Questions: are the 2 or 3 characters sufficient? has anyone seen a localized Re: or Fwd: prefix with more or less characters?

Yes, in the Wikipedia article I linked, you can see there can be even 10 characters. Still, most cases seem to be 2 or 3 characters. In the case of Chinese, Arabic etc. it's 2, but there we need to keep in mind that 2 characters are not necessarily 2 bytes. I don't know how well Unicode can be handled in C nowadays, I remember something with wchar.
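To illustrate the bytes-versus-characters point: if the subject is UTF-8 encoded, characters can be counted without wchar by skipping continuation bytes (those of the form 10xxxxxx). The sketch below assumes valid UTF-8 input; utf8_char_count and has_short_prefix are hypothetical names, and "character" here means code point, not grapheme cluster.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts UTF-8 code points in the first n_bytes of s by counting every byte
 * that is not a continuation byte. Assumes s is valid UTF-8. */
size_t utf8_char_count(const char* s, size_t n_bytes)
{
    assert(s != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n_bytes; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}

/* Returns 1 if the text before the first colon is 2 or 3 characters long,
 * 0 otherwise. Searching for ':' byte-wise is safe in UTF-8 because all
 * bytes of multi-byte sequences are >= 0x80. */
int has_short_prefix(const char* subject)
{
    assert(subject != NULL);
    const char* colon = strchr(subject, ':');
    if (colon == NULL) {
        return 0;
    }
    size_t chars = utf8_char_count(subject, (size_t)(colon - subject));
    return chars == 2 || chars == 3;
}
```

A two-character CJK prefix such as 回复 occupies 6 bytes in UTF-8, so a byte-offset check would miss it, while the character count still gives 2.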

Regarding RegEx: Currently there is no RegEx-library available and i think we should not add one for such a minor thing.

Good to know. I had already imagined that there is no regex library in use, it's C after all :) So I planned anyway to imitate the functionality with the standard library functions.

Apr 17 '18 20:04 mbeko

there can be even 10 characters

Maybe the idea can still work, if you can exclude spaces (i.e. cut only single words plus colon and space)?
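One way to read this suggestion, as a rough sketch: treat any leading run of non-space characters followed by a colon and a space as a prefix, regardless of its length. The function name skip_single_word_prefixes is hypothetical, and the exact rule (colon must be followed by a space, no space allowed before the colon) is an interpretation, not something settled in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: skips leading "word: " prefixes of any length, where
 * "word" contains no spaces and the colon is directly followed by a space. */
const char* skip_single_word_prefixes(const char* subject)
{
    assert(subject != NULL);
    const char* s = subject;
    for (;;) {
        const char* colon = strchr(s, ':');
        /* Need a non-empty word before the colon and a space after it. */
        if (colon == NULL || colon == s || colon[1] != ' ') {
            break;
        }
        /* A space before the colon means it is not a single-word prefix. */
        if (memchr(s, ' ', (size_t)(colon - s)) != NULL) {
            break;
        }
        s = colon + 2;
        while (*s == ' ') {
            s++;
        }
    }
    return s;
}
```

Requiring the space after the colon leaves emotes such as ":)" alone, and "Re: new subject (was: old subject)" still keeps its inner colon. The trade-off is that a genuine one-word lead such as "Reminder: lunch" would be cut down to "lunch".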

Apr 17 '18 20:04 testbird

I don't know how well Unicode can be handled in C nowadays, I remember something with wchar.

wchar is not used by Delta Chat; we use UTF-8 everywhere. But you're right, the simple 2-3 bytes approach only works for ASCII. To be safe, it seems we need a dict of all translations, or at least of the prefixes not caught by the 2-3-bytes rule. I would suggest hard-coding the dict into the code: there won't be that many entries, they won't grow, and the current locale is no good hint when receiving mails from another language.

However, we can also start with the 2-3-bytes rule and keep the dict for later optimization. I think it's not the most important thing to do currently.
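A hard-coded dict along these lines might look like the sketch below. The prefix list is a small illustrative sample, not a complete or verified set, the names known_prefixes, match_known_prefix and strip_known_prefixes are hypothetical, and the case-insensitive comparison only handles ASCII letters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Illustrative sample only; a real list would need more entries. */
static const char* const known_prefixes[] = { "Re", "Fwd", "Fw", "AW", "WG", "SV" };

/* Returns the number of bytes to skip (prefix plus colon) if s starts with a
 * known prefix followed by a colon, compared ASCII case-insensitively;
 * returns 0 if nothing matches. */
static size_t match_known_prefix(const char* s)
{
    size_t n = sizeof known_prefixes / sizeof known_prefixes[0];
    for (size_t i = 0; i < n; i++) {
        const char* p = known_prefixes[i];
        size_t len = strlen(p);
        size_t j = 0;
        while (j < len &&
               tolower((unsigned char)s[j]) == tolower((unsigned char)p[j])) {
            j++;
        }
        if (j == len && s[len] == ':') {
            return len + 1;
        }
    }
    return 0;
}

/* Returns a pointer into subject just past any chain of known prefixes. */
const char* strip_known_prefixes(const char* subject)
{
    assert(subject != NULL);
    const char* s = subject;
    size_t skip;
    while ((skip = match_known_prefix(s)) > 0) {
        s += skip;
        while (*s == ' ') {
            s++;
        }
    }
    return s;
}
```

Compared with the 2-3-bytes rule, a list avoids trimming unrelated short words (a subject starting with "PS:" stays intact), at the cost of having to maintain the entries. Non-ASCII prefixes would additionally need their own lower/upper case mappings.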

Apr 17 '18 20:04 r10s

There should be a concept called topic, which applies to all chats, not just group chats. A topic and a group name are two different things. I could talk to the same group (or individual) on different topics.

The user should optionally be able to use the topic in creating the email Subject. One possible Subject is "Chat: [topic] [first few words]". However, some users might prefer the Subject to be something else. Not all conversations need a topic. See #239

Aug 16 '18 16:08 WinAuthFan

Incidentally in this issue, the received subject is to be used as the default group name for ad-hoc groups. The received subject might refer to a topic, event, name or anything the sender chose. (And might be changed later.)

I don't think introducing a "topic" concept would help in maintaining simplicity and email compatibility. Just let deltachat use good default subjects, and allow the user to specify any subject they like (with or without brackets in there).

But let's keep this issue only for removing the Re: and #239 to discuss subject handling in general.

Aug 16 '18 17:08 testbird

A short info about the progress on this:

I have implemented the removal of the subject prefixes listed on Wikipedia. The prefixes are recognised case-insensitively, including those with non-ASCII characters.

During my research, I found prefixes missing on Wikipedia, so the code is written in such a way that prefixes can be added anytime by adding entries to an array and additionally related lower and upper case letter mappings if necessary.

~I have written quick tests in an own variation of MinUnit and ran everything through Valgrind. It looks good now. What's left for the PR is rewriting the tests in Python and documenting the function signatures.~ After discussion on the mailing list, I've used Cmocka for the unit tests.

I'd also like to add real world examples with subjects in different languages to the tests because my trust in the Wikipedia list has been lowered. Where do you think would be a good place to ask people for this? The Delta Chat forum?

Jan 28 '19 20:01 mbeko

great, that you push this forward :)

Where do you think would be a good place to ask people for this? The Delta Chat forum?

you can try it there, of course. maybe also point to the forum entry from the mailing list and/or IRC.

Jan 28 '19 21:01 r10s

As far as I understood, the PR probably won't be reviewed due to the switch to Rust.

I'll try to port the code and tests, but it can take some time as I'm currently working on another project and have no experience with Rust.

Aug 12 '19 06:08 mbeko

yip, currently there is a huge ongoing effort to port core-c to core-rust.

when the basic port is done, new features can be added to core-rust. core-c is probably sort of deprecated then.

Aug 12 '19 09:08 r10s
