wikum
New branch created after "merge_fin".
It resolves issues other than those covered by the 7 mini pull requests.
- Cleans up text before parsing in website/import_data.py, for example:
  - fixes malformed signatures that have spaces or newlines between the user name and the time, or between the time and (UTC)
  - removes italics around user names, which broke parsing
  - removes NBSP characters, which break parsing
  - adds a newline after "(UTC)}} " so the comment is parsed properly
  - fixes "(UTC" to "(UTC)" so the line is recognized as a signature
  - fixes malformed outdent templates, such as when an editor puts ":" in front of {{outdent}}, which breaks parsing
  - removes templates like "" that cause comments to be blobbed together
  - removes certain characters that break parsing
  - fixes cases where ":" and "*" are mixed together, such as in "\n:*::"
- In wikichatter/indentblock.py, the signature check changed from checking only "timestamp" to also checking "username", because there are cases where an editor mentions another user.
old_contains_sig = _contains_user_sig(line) and _contains_timestamp(line)
- Handles cases where the signature is added by a bot, such as in " <small… …><span class="autosigned">\u2014 Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:Dougweller|Dougweller]] ([[User talk:Dougweller|talk]] \u2022 [[Special:Contributions/Dougweller|contribs]]) ".
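As a rough illustration of detecting such a bot-added signature, here is a minimal sketch. The pattern and the helper name `autosigned_user` are hypothetical; the expression actually used in this branch is not shown in this thread and may differ.

```python
import re

# Hypothetical pattern for illustration only; not taken from the PR.
_AUTOSIGNED_RE = re.compile(
    r"unsigned\]\]\s+comment\s+added\s+by\s+\[\[User:([^|\]]+)",
    re.I,
)


def autosigned_user(line):
    """Return the user name from a bot-added signature, or None."""
    match = _AUTOSIGNED_RE.search(line)
    return match.group(1) if match else None
```

For the example line above, this sketch would return the name that follows "User:" in the first user link.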
- Enabled grabbing signatures without timestamps, which is a frequent case. In this case the author will be "Anonymous" (change in wikichatter/signatureutils.py).
- Added more keywords for outdent in wikichatter/indentutils.py:
_INDENT_TEMPLATE_RE = re.compile(r'|'.join(["out(dent)?", "un(in)?dent", "od", "anchor\|Lbelow"]), re.I)
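To show which template names these keywords accept, here is a small sketch. It uses the same alternatives written as raw strings, so `\|` is a valid escape under Python 3. The helper `is_outdent_name` and the use of `fullmatch` are assumptions; how wikichatter actually applies the pattern is not shown here.

```python
import re

# Same alternatives as quoted above, written as raw strings for Python 3.
_INDENT_TEMPLATE_RE = re.compile(
    r"|".join([r"out(dent)?", r"un(in)?dent", r"od", r"anchor\|Lbelow"]), re.I
)


def is_outdent_name(template_name):
    # Assumption: the whole name must match; the PR may use match() or search().
    return _INDENT_TEMPLATE_RE.fullmatch(template_name.strip()) is not None
```

With whole-name matching, "Outdent", "od", "unindent", and "anchor|Lbelow" are accepted, while an unrelated name such as "quote" is not.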
- Fixed cases where indentation is broken because a comment contains files or templates (change in wikichatter/indentblock.py):
re.match(re.compile(ur'({{.*}}|<.*?>.*</.*?>)|\[\[File:.*?\]\]', re.DOTALL|re.I), str(line))
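The expression above uses a `ur''` literal, which only exists in Python 2. A Python 3 rendering of the same pattern might look like the sketch below. The wrapper name `starts_with_markup` is hypothetical, and what the branch does with the match result is not shown in this thread.

```python
import re

# Python 3 rendering of the quoted pattern, with the braces escaped explicitly.
_MARKUP_RE = re.compile(
    r"(\{\{.*\}\}|<.*?>.*</.*?>)|\[\[File:.*?\]\]", re.DOTALL | re.I
)


def starts_with_markup(line):
    # Hypothetical wrapper; re.match only matches at the start of the line.
    return _MARKUP_RE.match(str(line)) is not None
```

Because `re.match` anchors at the beginning, a line is flagged only when it starts with a template, an HTML-style element, or a file link.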
- Loosened the time format in wikichatter/signatureutils.py.
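A few of the text cleanup steps from the first item of this description can be sketched as follows. This is an illustration only: the function name `clean_wikitext`, the regular expressions, and the order of the steps are assumptions, and the actual code in website/import_data.py may differ.

```python
import re


# Hypothetical helper; not the actual code from website/import_data.py.
def clean_wikitext(text):
    # Replace non-breaking spaces (U+00A0) with ordinary spaces.
    text = text.replace("\u00a0", " ")
    # Close an unterminated "(UTC" so the line is recognized as a signature.
    text = re.sub(r"\(UTC(?!\))", "(UTC)", text)
    # Collapse spaces or newlines between the time and "(UTC)".
    text = re.sub(
        r"(\d{2}:\d{2}, \d{1,2} \w+ \d{4})\s+\(UTC\)", r"\1 (UTC)", text
    )
    # Drop ":" indentation placed in front of {{outdent}}.
    text = re.sub(r"^:+\s*(\{\{outdent)", r"\1", text, flags=re.M | re.I)
    return text
```

Running the cleanup before parsing keeps these special cases out of the parser itself.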
Also, do we still need this pull request? If so, can you clean up the merge conflicts?
Fixed the conflicts; they were simple ones. But since this branch contains changes from #73, could you take a look at #73 first?
To clarify, this branch also resolves issues other than those covered by the 7 mini pull requests.
@trusttri, there are merge conflicts here again. Thanks!
Hi @trusttri, I think there are still conflicts above.