pokeapi
flavor_text_entries contain newline and other characters within the string
Summary
- The source data for flavor_text uses characters like \n and \u000c, which make it difficult to read normally. PokéAPI contains the same characters, which might not be desirable
- Suggest parsing the raw text into a readable format for the benefit of the most common use cases
- If not, consider adding this info into the docs so users might have a chance to know about it and handle the data appropriately
Problem
Some flavor_text_entries contain additional unexpected characters.
Example from Wobbuffet's entry:
"flavor_text_entries" : [
{ "flavor_text" : "It hates light and\nshock. If attack\ned, it inflates\u000cits body to pump\nup its counter\nstrike." }
...
{ "flavor_text" : "To keep its pitch-\nblack tail hidden,\nit lives quietly\u000cin the darkness.\nIt is never first\nto attack." }
...
{ "flavor_text" : "In order to con\nceal its black\ntail, it lives in\u000ca dark cave and\nonly moves about\nat night." }
...
]
It seems this is not limited to PokemonSpecies, nor to language. See the move Seismic Toss:
"Inflicts damage identical\nto the user’s level."
"L’ennemi est projeté grâce au pouvoir de\nla gravité. Inflige des dégâts équivalents\nau niveau du lanceur."
"いんりょくを つかい なげとばす。\nじぶんの レベルと おなじ ダメージを\nあいてに あたえる。"
Details of the problem
These characters seem to originate from the CSV file data from veekun/pokedex. A comment on this closed issue about "parsing pokemon_species_flavor_text.csv" confirms that the presence of these characters within the files is the intended behaviour, to reflect the original format of the text within the game files, which contain "explicit line breaks, hyphenation, and page breaks".
There is also a suggested method to replace these characters to make the text more readable, although it is a little more complicated than simply "replace \n with whitespace/empty string".
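One way such a cleanup could look in Python is sketched below. The rules are an assumption derived from the description of the source text (explicit line breaks, hyphenation, and page breaks) and are not a verbatim copy of the suggested method; `clean_flavor_text` is a hypothetical helper name.

```python
def clean_flavor_text(text: str) -> str:
    """Flatten game-style line and page breaks into a single readable line.

    Assumed rules (illustrative, not an official PokéAPI behaviour):
      1. A form feed (page break) is treated like a line break.
      2. A soft hyphen followed by a line break only marks a layout split,
         so both are removed and the word is rejoined.
      3. A real hyphen followed by a line break keeps the hyphen.
      4. Every other line break becomes a single space.
    """
    text = text.replace("\u000c", "\n")   # rule 1: page break -> line break
    text = text.replace("\u00ad\n", "")   # rule 2: rejoin soft-hyphenated words
    text = text.replace("\u00ad", "")     # drop any stray soft hyphens
    text = text.replace("-\n", "-")       # rule 3: keep real hyphens
    text = text.replace("\n", " ")        # rule 4: remaining breaks -> spaces
    return text


raw = (
    "To keep its pitch-\nblack tail hidden,\nit lives quietly\u000c"
    "in the darkness.\nIt is never first\nto attack."
)
print(clean_flavor_text(raw))
```

The order of the replacements matters: the hyphen rules have to run before the generic newline rule, otherwise "pitch-\nblack" would become "pitch- black".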
Impact and Suggestions
[Image: screenshot of an application displaying raw flavor_text]
Above is an example of someone's application using the unprocessed flavor_text directly, which has the unfortunate result of making it look somewhat unrefined. It's an avoidable outcome which the PokéAPI team can perhaps help mitigate.
The team may want to implement parsing of the flavor_text for the API, since I would expect that most users of the API would simply take the flavor_text as it is, assuming that it is formatted correctly.
Alternatively, if it is neither feasible nor desirable to do this parsing on the API side, then perhaps the documentation should state that the flavor_text values are in an unprocessed form which will require parsing on the user's side. As far as I can tell, this information is currently not reflected in the docs. If it were, then at least some users would take note and be able to process the data on their end.
I agree with this change. Formatting should be done by the client, not hardcoded into the data.
Anyone who wants it authentically spaced would want it as is, otherwise it is relatively simple to strip all newline characters from the string. It is minimal processing before presenting it to the user.
It's not authentically spaced. The flavor text is designed to function as fake paragraphs in some cases. They shouldn't be using Unicode space characters in the place of actual spaces. There are improvements that can be made here. But we shouldn't just be purging all formatting from these.
Some sanitization needs to take place (namely for characters such as <0xad> and <0x0c>), and a choice needs to be made whether this should abandon veekun's internal formatting (which in some cases functions as a sort of markdown-like formatting via their front-end) in favor of none at all, markdown, or raw HTML.
At the very least, we should convert Unicode to a human readable format. As for spacing, newlines and returns, that is going to have to be handled on a case-by-case basis. While a majority of the flavor_text_entries for Pokemon are one line, that isn't the case for other things that have flavor text in the db. In some instances veekun has used multiple newlines to create fake paragraphs, which also puts a burden on the end-user to figure out how to parse these.
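As a rough illustration of the paragraph case, the sketch below keeps runs of two or more newlines as paragraph breaks and collapses everything else into single spaces. It assumes that a blank line is what marks a fake paragraph; `reflow_paragraphs` is a hypothetical name, and real entries may need different rules.

```python
import re


def reflow_paragraphs(text: str) -> str:
    """Collapse single line breaks but keep blank-line paragraph breaks.

    Assumption: two or more consecutive newlines mark a paragraph boundary.
    """
    # Split on runs of two or more newlines to find the paragraphs.
    paragraphs = re.split(r"\n{2,}", text)
    # Inside each paragraph, collapse all whitespace runs (including single
    # newlines and form feeds) into one space.
    reflowed = [" ".join(p.split()) for p in paragraphs]
    # Rejoin the non-empty paragraphs with one blank line between them.
    return "\n\n".join(p for p in reflowed if p)


sample = "First line\nstill first.\n\nSecond\nparagraph."
print(reflow_paragraphs(sample))
```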
This is a heavy lift all around, with pokemon_species_flavor_text coming in at 182,667 lines, all of which would have to be checked to confirm they function as expected. Simple enough, but time consuming. Going forward, I think a good first step is swapping out the Unicode characters for their equivalents (spaces, hyphens, etc.), and then looking into how to manage potential paragraph changes.
Hi, thanks for bringing back this issue. First, I'll add it to the docs so people know what to expect from the data.
Then, I guess the best scenario is that we read the CSV data as is, and replace the characters with their visible alternatives. We have to figure out how to do it in Python, though.
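A minimal sketch of how this might be done with the standard library, assuming the CSV has a `flavor_text` column. The replacement table is a guess at what "visible alternative" could mean, `normalize_rows` is a hypothetical helper, and the sample row uses placeholder IDs rather than real data.

```python
import csv
import io

# Assumed mapping from special characters to visible alternatives.
REPLACEMENTS = str.maketrans({
    "\u000c": "\n",   # form feed (page break) -> ordinary line break
    "\u00ad": None,   # soft hyphen -> removed
    "\u00a0": " ",    # non-breaking space -> regular space
})


def normalize_rows(csv_file, column="flavor_text"):
    """Yield each CSV row as a dict with the given column normalized."""
    for row in csv.DictReader(csv_file):
        row[column] = row[column].translate(REPLACEMENTS)
        yield row


# Illustrative in-memory CSV; the numeric IDs are placeholders.
sample_csv = io.StringIO(
    "species_id,version_id,language_id,flavor_text\n"
    '1,1,1,"it lives quietly\x0cin the darkness."\n'
)
rows = list(normalize_rows(sample_csv))
print(repr(rows[0]["flavor_text"]))
```

When reading from a real file instead of `io.StringIO`, the `csv` documentation recommends opening it with `newline=""` so that line breaks embedded in quoted fields are preserved.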