RedditExtractor
Encoding of text seems inconsistent
Hi there - and thank you for this package!
The "get_thread_content" function seems to return comments and threads in an inconsistent format. Sometimes Unicode characters come through intact, sometimes they are replaced by numeric escapes. Disclaimer: I'm not an expert in text encoding, so I could be wrong.
Here are three examples:
Example 1
o <- get_thread_content("https://www.reddit.com/r/FundRise/comments/1003dqq/anyone_else_feel_its_deceptive_to_post_this/")
thread <- o$threads
comments <- o$comments
print(thread)
url author date timestamp
1 https://www.reddit.com/r/FundRise/comments/1003dqq/anyone_else_feel_its_deceptive_to_post_this/ CherylStoned 2022-12-31 1672519912
title
1 Anyone else feel it\031s deceptive to post this \034increase\035 especially after the recent drop? Last login was 12/27 and valued at $1,569.81&
text subreddit score upvotes downvotes up_ratio total_awards_received golds cross_posts comments
1 FundRise 13 13 0 0.74 0 0 0 36
The apostrophe in "it's" and the opening and closing quotation marks around "increase" are printed as numeric escapes (\031, \034 and \035).
Example 2
o <- get_thread_content("https://www.reddit.com/r/FundRise/comments/q28nv5/doubling_down_on_fundrise_vs_buying_a_property/")
thread <- o$threads
comments <- o$comments
print(thread)
url
1 https://www.reddit.com/r/FundRise/comments/q28nv5/doubling_down_on_fundrise_vs_buying_a_property/
author date timestamp title
1 theenlivened 2021-10-05 1633477751 Doubling down on fundrise vs. buying a property?
text
1 I\031ve decided to have 30% of my assets in real estate and 70% in stocks. For the real estate portion, I\031m torn on putting everything in fundrise vs. splitting between fundrise and down payment towards a single family home. I love the passive style of putting everything in fundrise, but worried about the worst case scenarios (e.g. fundrise going bankrupt, fees increasing when someone else manages the underlying LLCs). How are you all thinking about this?
subreddit score upvotes downvotes up_ratio total_awards_received golds cross_posts comments
1 FundRise 21 21 0 1 0 0 0 26
The apostrophes in "I've" and "I'm" show the same \031 pattern.
Example 3
o <- get_thread_content("https://www.reddit.com/r/FundRise/comments/owq9ls/east_coast_ereit_now_open_for_direct_investment/")
thread <- o$threads
comments <- o$comments
print(thread)
url author date timestamp
1 https://www.reddit.com/r/FundRise/comments/owq9ls/east_coast_ereit_now_open_for_direct_investment/ fatagrafah 2021-08-02 1627945644
title
1 East Coast eREIT now open for direct investment
text
1 Looks like the East Coast eREIT just opened up for direct investment if you're at the Core level or higher.\n\nI've had my eye on it for a while. It's a balanced fund that provides some strategy diversification outside of Sunbelt rental housing. (There's also some geographic diversification, too; right now there's a property in New Jersey and a few around DC.)
subreddit score upvotes downvotes up_ratio total_awards_received golds cross_posts comments
1 FundRise 15 15 0 0.94 0 0 0 8
Here the apostrophes (for example in "I've had") are printed correctly.
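One pattern I noticed, which may or may not be the actual cause: \031, \034 and \035 are octal for 0x19, 0x1C and 0x1D, which are the low bytes of U+2019, U+201C and U+201D (the curly apostrophe and curly double quotes). Example 3 uses plain ASCII apostrophes, which would explain why it prints fine. As a stopgap I am using something like the sketch below. The helper name fix_quotes and the mapping are my own assumptions, not part of RedditExtractor, and the mapping only covers the characters seen above plus the matching opening single quote.

```r
# Hypothetical helper (not part of RedditExtractor): replace control
# characters with the curly punctuation they most plausibly stood for.
fix_quotes <- function(x) {
  # Assumed mapping: each control character is the low byte of a
  # Unicode punctuation code point (e.g. 0x19 from U+2019).
  map <- c(
    "\030" = "\u2018",  # left single quotation mark
    "\031" = "\u2019",  # right single quotation mark / apostrophe
    "\034" = "\u201c",  # left double quotation mark
    "\035" = "\u201d"   # right double quotation mark
  )
  for (bad in names(map)) {
    # fixed = TRUE treats the pattern as a literal string, not a regex
    x <- gsub(bad, map[[bad]], x, fixed = TRUE)
  }
  x
}
```

It can be applied per column, for example thread$title <- fix_quotes(thread$title). It does not touch the trailing "&" in Example 1, since a character damaged this way could not be told apart from a genuine ampersand.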
Would be very grateful for any suggestion on how to deal with this!