yourAI
How do you clean your dataset?
For me, I'm getting something like this after cleaning:
User1#6840
n/m
User2#6840
got it working
User3#6840
anyone else in same boat as i was
If you're using Python, loop through each line as line.lstrip().
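Here is a minimal sketch of that lstrip idea, assuming the export is plain text and leading whitespace is the only thing you need to drop. The function name `lstrip_lines` is made up for illustration. It splits on line boundaries first, because calling `lstrip()` on a blank line that still has its newline would swallow the newline too.

```python
def lstrip_lines(text):
    """Strip leading whitespace from every line of text.

    splitlines() drops the line endings, so blank lines survive as
    empty strings and are kept when the lines are joined again.
    Note: a trailing newline at the very end of text is not preserved.
    """
    return "\n".join(line.lstrip() for line in text.splitlines())


if __name__ == "__main__":
    sample = "  User1#6840\n\tn/m\n\n   got it working"
    print(lstrip_lines(sample))
```

For a real file you would read it with `open(path, encoding="utf-8").read()`, pass the result through the function, and write it back out; the encoding is an assumption about your export.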
Can you send some code for cleaning it with Python?
Hey, I had this issue as well, so I made a script that does this.
#!/bin/sh
# Description: Scrub txt data downloaded by DiscordChatExporter for use in yourAI
# Usage: discrub <input file> <output file>
# Remove 'Guild' message up top
tail -n +6 "$1" |
# Delete unnecessary data, bad users, ^M characters, urls and magnet links and format code blocks better
awk '/{Embed}/ || /{Attachments}/ || /{Reactions}/ || /Joined the server./ || /Pinned a message/ || /Dad Bot/ || /NotSoBot/ { found1 = 1 ; next } /\[..-.*\]/ { found1 = 0 } ! found1 { gsub(/\r/, "") ; if (!/\[..-.*\]/) { gsub(/http[s]?:\/\/[^[:space:]]*/, "") ; gsub(/magnet:\?xt=urn:btih:.*/, "") } sub(/```/, "|||") ; print }' |
# Remove empty messages
awk 'found1 { if (/\[..-.*\]/ || /^$/) { prev = $0 ; next } found1 = 0 ; print prev ; print ; next } /\[..-.*\]/ { found1 = 1 ; prev = $0 ; next } { print }' |
# Change two empty lines to one empty line after each message, remove timestamp in message, and add ':' after usernames
awk 'found2 { found2 = 0 ; if (/^$/) { next } } found1 { if (/^$/) { found2 = 1 } } /\[..-.*\]/ { found1 = 1 ; sub(/\[..-.*\] /, "") ; print $0":" ; next } { print }' > "$2"
It only works on Unix systems, but I hope it helps.
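If you're not on a Unix system, the last step of the script (drop the timestamp, add ':' after the username, squeeze blank lines) could be sketched in Python roughly like this. It is not a port of the whole script. The header shape is an assumption taken from the script's own pattern `\[..-.*\] `, the function name `format_messages` is made up, and blank-line handling is simplified to collapsing any run of empty lines into one.

```python
import re

# Assumed header shape, taken from the script's pattern:
# "[<2 chars>-<anything>] <username>", e.g. a bracketed timestamp.
HEADER = re.compile(r"\[..-.*\] ")


def format_messages(lines):
    """Turn '[timestamp] user' headers into 'user:' and squeeze blank lines."""
    out = []
    prev_blank = False
    for line in lines:
        line = line.rstrip("\r\n")  # also drops stray ^M characters
        if HEADER.search(line):
            # Remove the bracketed timestamp and mark the username with ':'
            out.append(HEADER.sub("", line, count=1) + ":")
            prev_blank = False
        elif line == "":
            # Keep only the first blank line of a run (simplified vs. the awk)
            if not prev_blank:
                out.append(line)
            prev_blank = True
        else:
            out.append(line)
            prev_blank = False
    return out


if __name__ == "__main__":
    sample = [
        "[12-Jan-21 03:45 PM] User1#6840",
        "n/m",
        "",
        "",
        "[12-Jan-21 03:46 PM] User2#6840",
        "got it working",
    ]
    print("\n".join(format_messages(sample)))
```

The sample timestamps are invented for the demo; check them against your own export before relying on the pattern.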
For some reason, on my system (Debian 11), that script just created a blank text file.