pokeapi flavor_text_entries contain newline and other characters within the string

Summary

The source data for flavor_text uses characters such as \n and \u000c, which make the text hard to read as is.
PokéAPI passes the same characters through unchanged, which might not be desirable.
I suggest parsing the raw text into a readable format for the benefit of the most common use cases.
If not, consider adding this information to the docs so users know about it and can handle the data appropriately.

Problem

Some flavor_text_entries have issues with additional unexpected characters.

"flavor_text_entries" : [
{ "flavor_text" : "It hates light and\nshock. If attack\ned, it inflates\u000cits body to pump\nup its counter\nstrike." }
...
{ "flavor_text" : "To keep its pitch-\nblack tail hidden,\nit lives quietly\u000cin the darkness.\nIt is never first\nto attack." }
...
{ "flavor_text" : "In order to con\nceal its black\ntail, it lives in\u000ca dark cave and\nonly moves about\nat night." }
...
]

This does not seem to be limited to PokemonSpecies, nor to any one language. See the move Seismic Toss:

"Inflicts damage identical\nto the user’s level."
"L’ennemi est projeté grâce au pouvoir de\nla gravité. Inflige des dégâts équivalents\nau niveau du lanceur."
"いんりょくを　つかい　なげとばす。\nじぶんの　レベルと　おなじ　ダメージを\nあいてに　あたえる。"

Details of the problem

These characters seem to originate from the CSV data files in veekun/pokedex. A comment on this closed issue about "parsing pokemon_species_flavor_text.csv" confirms that their presence in the files is intended behaviour: they reflect the original format of the text within the game files, which contain "explicit line breaks, hyphenation, and page breaks".

There is also a suggested method to replace these characters to make it more readable, although it is a little more complicated than simply "replace \n with whitespace/empty string".
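As a rough illustration of why this takes more than one replace call, here is a minimal Python sketch. The rules are my own reading of the kind of approach suggested there, not something confirmed in this thread: a page break is treated like a line break, a soft hyphen at a line end rejoins the word, a real hyphen at a line end is kept, and any other line break becomes a space. The function name clean_flavor_text is hypothetical.

```python
def clean_flavor_text(raw: str) -> str:
    """Flatten game-formatted flavor text into one readable line.

    The replacement rules are assumptions for illustration only.
    Order matters: hyphen cases must run before the generic newline rule.
    """
    text = raw.replace("\u000c", "\n")   # page break behaves like a line break
    text = text.replace("\u00ad\n", "")  # soft hyphen at line end: rejoin the word
    text = text.replace("\u00ad", "")    # drop any remaining soft hyphens
    text = text.replace(" -\n", " - ")   # spaced dash at line end stays a dash
    text = text.replace("-\n", "-")      # real hyphen at line end: keep it, no space
    text = text.replace("\n", " ")       # every other line break becomes a space
    return text


example = (
    "To keep its pitch-\nblack tail hidden,\nit lives quietly\u000c"
    "in the darkness.\nIt is never first\nto attack."
)
print(clean_flavor_text(example))
```

A plain replace of \n with a space would turn "pitch-\nblack" into "pitch- black", which is why the hyphen cases are handled first.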

Impact and Suggestions

Above is an example of someone's application using the unprocessed flavor_text directly, which has the unfortunate result of making it look somewhat unrefined. It's an avoidable outcome which the PokéAPI team can perhaps help mitigate.

The team may want to implement parsing of the flavor_text for the API, since I would expect that most users of the API would simply take the flavor_text as it is, assuming that it is formatted correctly.

Alternatively, if it is neither feasible nor desirable to do this parsing on the API side, then perhaps the documentation should state that flavor_text values are in an unprocessed form which will require parsing on the user's side. As far as I can tell, this information is currently not reflected in the docs. If it were, then at least some users would take note and be able to process the data on their end.

May 19 '22 19:05 tanxh33

I agree with this change. Formatting should be done by the client, not hardcoded into the data.

Jun 08 '22 20:06 DDriggs00

Anyone who wants it authentically spaced would want it as is, otherwise it is relatively simple to strip all newline characters from the string. It is minimal processing before presenting it to the user.

Jun 08 '22 20:06 Selim042

It's not authentically spaced. The flavor text is designed to function as fake paragraphs in some cases. They shouldn't be using Unicode space characters in the place of actual spaces. There are improvements that can be made here. But we shouldn't just be purging all formatting from these.

Some sanitization needs to take place (namely for characters such as <0xad> and <0x0c>), and a choice needs to be made whether to abandon veekun's internal formatting (which in some cases functions as a sort of markdown-like formatting via their front-end) in favor of none at all, markdown, or raw HTML.

At the very least, we should convert Unicode to a human-readable format. As for spacing, newlines and returns, that is going to have to be handled on a case-by-case basis. While a majority of the flavor_text_entries for Pokemon are one line, that isn't the case for other things that have flavor text in the db. In some instances veekun has used multiple newlines to create fake paragraphs, which also puts a burden on the end-user to figure out how to parse these.

This is a heavy lift all around, with pokemon_species_flavor_text coming in at 182,667 lines, all of which would have to be checked to confirm they function as expected. Simple enough, but time-consuming. Going forward, I think a good first step is swapping out the Unicode characters for their equivalents (spaces, hyphens, etc.), and then looking into how to manage potential paragraph changes.

Jun 14 '22 02:06 merfed

Hi, thanks for bringing back this issue. First, I'll add it to the docs so people know what to expect from the data.

Then, I guess the best scenario is that we read the CSV data as is and replace the characters with their visible alternatives. We have to figure out how to do it in Python, though.

Jun 21 '22 13:06 Naramsim
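For reference, the "read the CSV as is, then swap in visible characters" idea could look roughly like the sketch below. It is only an illustration under assumptions: the column layout (species_id, version_id, language_id, flavor_text), the sample row with placeholder IDs, the helper name make_visible, and the choice of stand-ins (form feed to newline, soft hyphen to a plain hyphen) are all mine, not PokéAPI's actual build code.

```python
import csv
import io

# Assumed stand-ins: form feed -> newline, soft hyphen -> visible hyphen.
VISIBLE = str.maketrans({"\u000c": "\n", "\u00ad": "-"})


def make_visible(rows):
    """Yield each row with invisible characters in flavor_text replaced."""
    for row in rows:
        yield {**row, "flavor_text": row["flavor_text"].translate(VISIBLE)}


# Hypothetical in-memory sample; a real run would open the CSV file instead.
sample = io.StringIO(
    "species_id,version_id,language_id,flavor_text\n"
    '1,1,9,"If attack\u00ad\ned, it inflates\u000cits body"\n',
    newline="",
)
rows = list(make_visible(csv.DictReader(sample)))
print(repr(rows[0]["flavor_text"]))
```

This keeps the line structure from the source data and only makes the hidden characters visible, leaving the question of paragraphs and spacing to a later step.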
