New branch after "merge_fin".

Open trusttri opened this issue 7 years ago • 4 comments

This branch resolves issues other than those covered by the 7 mini pull requests.

  1. Cleans up text before parsing in website/import_data.py, such as:
  • fixing malformed signatures that have spaces or newlines between the user name and the time, or between the time and (UTC)
  • removing italics around user names, which broke parsing
  • removing NBSP characters, which break parsing
  • adding a newline after "(UTC)}} " so the comment is parsed properly
  • changing "(UTC" to "(UTC)" so the line is recognized as a signature
  • fixing malformed outdent templates, such as when an editor puts ":" in front of {{outdent}}, which breaks parsing
  • removing templates like "" that cause comments to be blobbed
  • removing certain characters that break parsing
  • fixing cases where ":" and "*" are mixed together, such as in "\n:*::"
  2. In wikichatter/indentblock.py, the signature check now looks for a user name as well as a timestamp, because editors sometimes mention another user:
     old_contains_sig = _contains_user_sig(line) and _contains_timestamp(line)

  3. Handles cases where the signature is added by a bot, such as in " <small… …><span class="autosigned">\u2014 Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:Dougweller|Dougweller]] ([[User talk:Dougweller|talk]] \u2022 [[Special:Contributions/Dougweller|contribs]]) ".

  4. Signatures without timestamps, which are frequent, are now captured. In this case the author is set to "Anonymous" (change in wikichatter/signatureutils.py).

  5. Added more keywords for outdent in wikichatter/indentutils.py:
     _INDENT_TEMPLATE_RE = re.compile(r'|'.join(["out(dent)?", "un(in)?dent", "od", "anchor\|Lbelow"]), re.I)

  6. Fixed cases where indentation breaks because a comment contains files or templates (change in wikichatter/indentblock.py):
     re.match(re.compile(ur'({{.*}}|<.*?>.*</.*?>)|\[\[File:.*?\]\]', re.DOTALL|re.I), str(line))

  7. Loosened the time format in wikichatter/signatureutils.py.
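
For reviewers, here is a rough sketch of the kind of pre-parse cleanup described in the first item. It is not the code in website/import_data.py: the function name `clean_wikitext` and every regex below are assumptions written for illustration (in Python 3), and only a few of the listed fixes are covered.

```python
import re

NBSP = "\u00a0"


def clean_wikitext(text: str) -> str:
    """Hypothetical pre-parse cleanup; the patterns are illustrative guesses."""
    # Replace non-breaking spaces with ordinary spaces.
    text = text.replace(NBSP, " ")
    # Close an unterminated "(UTC" so the line can be read as a signature.
    text = re.sub(r"\(UTC(?!\))", "(UTC)", text)
    # Rejoin a year and "(UTC)" that were split by extra spaces or newlines.
    text = re.sub(r"(\d{4})\s+\(UTC\)", r"\1 (UTC)", text)
    # Drop ":" indentation placed in front of an outdent template.
    text = re.sub(r"^:+[ \t]*(\{\{outdent)", r"\1", text, flags=re.I | re.M)
    # Start a new line after "(UTC)}} " so the following comment is separate.
    text = re.sub(r"\(UTC\)\}\} +(?=\S)", "(UTC)}}\n", text)
    return text
```

Running it on a signature split across lines, followed by a ":"-prefixed outdent, should give one signature line and a bare {{outdent}} line. The order of the substitutions matters: "(UTC" is closed before the whitespace in front of it is collapsed.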

trusttri avatar Jul 02 '17 04:07 trusttri

Also, do we still need this pull request? If so, can you clean up the merge conflicts?

amyxzhang avatar Jul 17 '17 06:07 amyxzhang

Fixed conflicts. They were simple ones. But since this branch contains changes from #73, could you first take a look at it?

TO CLARIFY, THIS BRANCH resolves issues other than those covered by the 7 mini pull requests.

  1. Cleans up text before parsing in website/import_data.py, such as:
  • fixing malformed signatures that have spaces or newlines between the user name and the time, or between the time and (UTC)
  • removing italics around user names, which broke parsing
  • removing NBSP characters, which break parsing
  • adding a newline after "(UTC)}} " so the comment is parsed properly
  • changing "(UTC" to "(UTC)" so the line is recognized as a signature
  • fixing malformed outdent templates, such as when an editor puts ":" in front of {{outdent}}, which breaks parsing
  • removing templates like "" that cause comments to be blobbed
  • removing certain characters that break parsing
  • fixing cases where ":" and "*" are mixed together, such as in "\n:*::"
  2. In wikichatter/indentblock.py, the signature check now looks for a user name as well as a timestamp, because editors sometimes mention another user:
     old_contains_sig = _contains_user_sig(line) and _contains_timestamp(line)

  3. Handles cases where the signature is added by a bot, such as in " <small… …><span class="autosigned">\u2014 Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:Dougweller|Dougweller]] ([[User talk:Dougweller|talk]] \u2022 [[Special:Contributions/Dougweller|contribs]]) ".

  4. Signatures without timestamps, which are frequent, are now captured. In this case the author is set to "Anonymous" (change in wikichatter/signatureutils.py).

  5. Added more keywords for outdent in wikichatter/indentutils.py:
     _INDENT_TEMPLATE_RE = re.compile(r'|'.join(["out(dent)?", "un(in)?dent", "od", "anchor\|Lbelow"]), re.I)

  6. Fixed cases where indentation breaks because a comment contains files or templates (change in wikichatter/indentblock.py):
     re.match(re.compile(ur'({{.*}}|<.*?>.*</.*?>)|\[\[File:.*?\]\]', re.DOTALL|re.I), str(line))

  7. Loosened the time format in wikichatter/signatureutils.py.
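
To make the two regexes quoted above easier to review, here is a small Python 3 sketch that exercises them. The patterns are taken from this description, with braces escaped and raw strings used. The helper names `is_outdent_template` and `is_file_or_template_line`, and the choice of `fullmatch` for template names, are assumptions for illustration; the description does not show how the project applies the patterns.

```python
import re

# Outdent keywords as listed above; raw strings keep "\|" a literal pipe.
_INDENT_TEMPLATE_RE = re.compile(
    r"|".join([r"out(dent)?", r"un(in)?dent", r"od", r"anchor\|Lbelow"]), re.I
)

# Escaped-brace form of the pattern quoted for wikichatter/indentblock.py.
_FILE_OR_TEMPLATE_RE = re.compile(
    r"(\{\{.*\}\}|<.*?>.*</.*?>)|\[\[File:.*?\]\]", re.DOTALL | re.I
)


def is_outdent_template(name: str) -> bool:
    """Hypothetical helper: True if a template name is an outdent keyword."""
    return _INDENT_TEMPLATE_RE.fullmatch(name.strip()) is not None


def is_file_or_template_line(line: str) -> bool:
    """Hypothetical helper: True if a line starts with a template, tag pair, or file link."""
    return _FILE_OR_TEMPLATE_RE.match(line) is not None


if __name__ == "__main__":
    for name in ["Outdent", "od", "unindent", "anchor|Lbelow", "quote"]:
        print(name, is_outdent_template(name))
    for line in ["[[File:Example.png|thumb]]", "{{quote|text}}", ":: plain reply"]:
        print(repr(line), is_file_or_template_line(line))
```

Using `fullmatch` here avoids treating a longer name that merely begins with "od" as an outdent template; if the project matches with `re.match` or `re.search` instead, the results for such names would differ.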

trusttri avatar Jul 18 '17 09:07 trusttri

@trusttri merge conflicts here again. Thanks!

amyxzhang avatar Aug 05 '17 18:08 amyxzhang

Hi @trusttri, I think there are still conflicts above.

amyxzhang avatar Aug 12 '17 06:08 amyxzhang