Removes control characters in the input/output of content
JIRA Ticket: https://jira.duraspace.org/browse/ISLANDORA-2412

What does this Pull Request do?
Sanitize MODS data into Google Scholar submodule
This does not change the data being stored, it should only impact the data as it processes for Google Scholar meta tags.
What's new?
String replace for ASCII control characters
How should this be tested?
Please use the example in the JIRA ticket for a proof of concept and MODS file with a Control Character embedded into the abstract.
Try ingesting prior to testing this PR.
- View the error during ingest and/or viewing the object
Pull this PR
- Navigate to previous submission
- Error should be gone
- Ingest a 2nd ETD to test the ingest process works without errors now.
Additional Notes:
N/A
Interested parties
@Islandora/7-x-1-x-committers
@DonRichards First things first, theres someting about the JIRA issue you linked to that makes me unable to see it (says "You can't view this issue. It may have been deleted or you don't have permission to view it."), so I don't know the backstory to this.
Second, unless I'm mistaken these control characters are not allowed in XML to begin with, so any MODS record with control characters is not well-formed and shouldn't have made it into Fedora to begin with. If that line of thinking is true, then it seems like sanitizing the output for Google Scholar is slapping a band-aid on the problem. I'd argue that anything that reduces the pain of using ill-formed XML is encouraging bad practice.
@bryjbrown are you logged in to JIRA? I think the ticket should be visible to you, but not to anon right now.
@manez @DonRichards I'm logged in and also getting the "You can't view this" message.
@DonRichards First things first, theres someting about the JIRA issue you linked to that makes me unable to see it (says "You can't view this issue. It may have been deleted or you don't have permission to view it."), so I don't know the backstory to this.
Second, unless I'm mistaken these control characters are not allowed in XML to begin with, so any MODS record with control characters is not well-formed and shouldn't have made it into Fedora to begin with. If that line of thinking is true, then it seems like sanitizing the output for Google Scholar is slapping a band-aid on the problem. I'd argue that anything that reduces the pain of using ill-formed XML is encouraging bad practice.
If this is the case would this helper be more useful: https://github.com/Islandora/islandora/blob/7.x/includes/utilities.inc#L878
Second, unless I'm mistaken these control characters are not allowed in XML to begin with, so any MODS record with control characters is not well-formed and shouldn't have made it into Fedora to begin with. If that line of thinking is true, then it seems like sanitizing the output for Google Scholar is slapping a band-aid on the problem. I'd argue that anything that reduces the pain of using ill-formed XML is encouraging bad practice.
So would these bad characters get into the MODS record in the first place if they were ingested via a form? (Can't see the ticket, so can't access the example to test.)
IF the form strips it, then the bad characters are perhaps getting in via batch ingest?
In general, it would be a good idea to stop bad XML by validating it at the start of the ingest process.
@bondjimbond If you can discern that the datastream you're ingesting in a batch is XML, it could conceivably get sanitized there. Still wouldn't stop someone who is manually jamminng in bad XML through Fedora's API, though.
@bryjbrown We allow students to submit directly through the GUI and this comes up about twice a semester for us. And you should be able to see the Jira ticket if you're logged in and are on the security response group (you should be). @jordandukart Thanks for that, that might be a good long term solution to integrate. @bondjimbond The form isn't stripping this out currently.
Assuming Github doesn't strip this stuff out of postings I'm going to paste a string that breaks it here. The control character is between the two "nn". Paste this into the title or abstract.
total Lagrangiannite
To check if Github filtered go to https://www.textmagic.com/free-tools/unicode-detector and paste it in. It should say "Character not present in GSM charset, forces to use Unicode encoding"
@DonRichards
And you should be able to see the Jira ticket if you're logged in and are on the security response group (you should be).
I am logged in, but I'm not part of the security response group. Is that what could be causing me not to be able to see it?
If you paste that directly into a input field this is the results.

@bryjbrown Yes, I set it to sensitive because it was a security response ticket. I've basically showed most of what was in the ticket in the comments here.
The "islandora_sanitize_input_for_valid_xml" @jordandukart suggested is currently firing for Full_text generation but I don't see it being fired prior to original file is ingested.
Apologies for the confusion @bryjbrown . The message on the ticket implies that folks in the Developers group should be able to see, but I forgot that our permissions are customized, so that's not accurate under the hood. My bad 😞
@bryjbrown Yes, I set it to sensitive because it was a security response ticket. I've basically showed most of what was in the ticket in the comments here.
Out of curiosity, if this is a sensitive security issue being handled by the Islandora Security Group, are there not protocols in place to keep things private until a fix has been identified and released? I mean otherwise, having the exploit posted publically as it is here and on Jira puts the entire community at risk, no? I'm not privy to the protocols for handling security concerns but, if this is standard practice, I'd suggest the protocols be reviewed asap.
@fuel37 There are protocols in place. They just weren't followed. I'm trying to pull the conversation out of this and into private channels, although I see nothing wrong with taking the time to review how we handle these sorts of things.
@fuel37 Through a few conversations it became apparent to me that this may cause headaches for viewing, it isn't providing any obvious platform for a malicious user to leverage an exploit. This causes functionality to not render completely but because it originates from a user submission it was originally set as sensitive.
I worried when I submitted it that this might have a further reaching issue. It turns out that Google Scholar submodule is using the fields as strings to generate embedded tags and it doesn't handle them well. Drupal gives too many options for filtering and cleaning these strings.
ALthough we have a PR coming in to filter everything, this should be good to filter existing material.
I'm sorry but this isn't isolated to Scholar and should probably live further upstream. I just did a large image ingest with the same problem.
Maybe this shouldn't be closed. Not sure.
@DonRichards Did you find a place "further upstream" to fix the display issue?
@bondjimbond No, I'm not sure where it would be.
I believe this this needs to be merged here and not upstream at this point.
This PR is being marked as stale from inactivity and will be automatically closed in 90 days unless further action is taken. If this PR is still relevant please comment. Please also consider attending the weekly Tech Call to discuss the PR