
How do you clean your dataset?

Open kodxana opened this issue 4 years ago • 4 comments

After cleaning, I'm getting something like this:

User1#6840
n/m

 User2#6840
got it working

 User3#6840
anyone else in same boat as i was

kodxana avatar Feb 24 '21 10:02 kodxana

If you're using Python, loop through the lines and call line.lstrip() on each one to remove the leading whitespace.
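
A minimal sketch of that suggestion, assuming the problem is the stray leading space in front of some usernames in the cleaned output. The function name `strip_leading_whitespace` and the sample text are hypothetical, made up for illustration:

```python
def strip_leading_whitespace(text: str) -> str:
    """Return text with leading whitespace removed from every line."""
    # str.lstrip() with no arguments removes leading spaces and tabs;
    # empty lines stay empty, so the blank separators between messages survive.
    return "\n".join(line.lstrip() for line in text.splitlines())


# Hypothetical sample shaped like the output quoted in the issue.
sample = "User1#6840\nn/m\n\n User2#6840\ngot it working"
print(strip_leading_whitespace(sample))
```

To clean a whole file, you could read it with `open(path, encoding="utf-8").read()`, pass the text through the function, and write the result to a new file. Note that `splitlines()` drops the final newline, so add one back if your training tool expects it.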

h4nkyn avatar May 19 '21 02:05 h4nkyn

Can you share Python code for cleaning it?

shreesha345 avatar Jul 01 '21 05:07 shreesha345

Hey, I had this issue as well, so I made a script that does this.

#!/bin/sh
# Description: Scrub txt data downloaded by DiscordChatExporter for use in yourAI
# Usage: discrub <input file> <output file>

# Remove 'Guild' message up top
tail -n +6 "$1" |
	# Delete unnecessary data, bad users, ^M characters, urls and magnet links and format code blocks better
	awk '/{Embed}/ || /{Attachments}/ || /{Reactions}/ || /Joined the server./ || /Pinned a message/ || /Dad Bot/ || /NotSoBot/ { found1 = 1 ; next } /\[..-.*\]/ { found1 = 0 } ! found1 { gsub(/\r/, "") ; if (!/\[..-.*\]/) { gsub(/http[s]?:\/\/[^[:space:]]*/, "") ; gsub(/magnet:\?xt=urn:btih:.*/, "") } sub(/```/, "|||") ; print }' |
	# Remove empty messages
	awk 'found1 { if (/\[..-.*\]/ || /^$/) { prev = $0 ; next } found1 = 0 ; print prev ; print ; next } /\[..-.*\]/ { found1 = 1 ; prev = $0 ; next } { print }' |
	# Change two empty lines to one empty line after each message, remove timestamp in message, and add ':' after usernames
	awk 'found2 { found2 = 0 ; if (/^$/) { next } } found1 { if (/^$/) { found2 = 1 } } /\[..-.*\]/ { found1 = 1 ; sub(/\[..-.*\] /, "") ; print $0":" ; next } { print }' > "$2"

It only works on Unix systems, but I hope it helps.

ghost avatar Nov 13 '21 13:11 ghost

For some reason, on my system (Debian 11), that script just created a blank text file.

wisplite avatar Dec 24 '21 04:12 wisplite