JIRA Ticket: https://jira.duraspace.org/browse/ISLANDORA-2412

Broken MODS Error

What does this Pull Request do?

Sanitize MODS data into Google Scholar submodule

This does not change the data being stored, it should only impact the data as it processes for Google Scholar meta tags.

What's new?

String replace for ASCII control characters

How should this be tested?

Please use the example in the JIRA ticket for a proof of concept and MODS file with a Control Character embedded into the abstract.

Try ingesting prior to testing this PR.

View the error during ingest and/or viewing the object

Pull this PR

Navigate to previous submission
- Error should be gone
Ingest a 2nd ETD to test the ingest process works without errors now.

Additional Notes:

N/A

Interested parties

@Islandora/7-x-1-x-committers

Apr 03 '19 12:04 DonRichards

@DonRichards First things first, theres someting about the JIRA issue you linked to that makes me unable to see it (says "You can't view this issue. It may have been deleted or you don't have permission to view it."), so I don't know the backstory to this.

Second, unless I'm mistaken these control characters are not allowed in XML to begin with, so any MODS record with control characters is not well-formed and shouldn't have made it into Fedora to begin with. If that line of thinking is true, then it seems like sanitizing the output for Google Scholar is slapping a band-aid on the problem. I'd argue that anything that reduces the pain of using ill-formed XML is encouraging bad practice.

Apr 03 '19 12:04 bryjbrown

@bryjbrown are you logged in to JIRA? I think the ticket should be visible to you, but not to anon right now.

Apr 03 '19 13:04 manez

@manez @DonRichards I'm logged in and also getting the "You can't view this" message.

Apr 03 '19 13:04 bondjimbond

@DonRichards First things first, theres someting about the JIRA issue you linked to that makes me unable to see it (says "You can't view this issue. It may have been deleted or you don't have permission to view it."), so I don't know the backstory to this.

Second, unless I'm mistaken these control characters are not allowed in XML to begin with, so any MODS record with control characters is not well-formed and shouldn't have made it into Fedora to begin with. If that line of thinking is true, then it seems like sanitizing the output for Google Scholar is slapping a band-aid on the problem. I'd argue that anything that reduces the pain of using ill-formed XML is encouraging bad practice.

If this is the case would this helper be more useful: https://github.com/Islandora/islandora/blob/7.x/includes/utilities.inc#L878

Apr 03 '19 13:04 jordandukart

Second, unless I'm mistaken these control characters are not allowed in XML to begin with, so any MODS record with control characters is not well-formed and shouldn't have made it into Fedora to begin with. If that line of thinking is true, then it seems like sanitizing the output for Google Scholar is slapping a band-aid on the problem. I'd argue that anything that reduces the pain of using ill-formed XML is encouraging bad practice.

So would these bad characters get into the MODS record in the first place if they were ingested via a form? (Can't see the ticket, so can't access the example to test.)

IF the form strips it, then the bad characters are perhaps getting in via batch ingest?

In general, it would be a good idea to stop bad XML by validating it at the start of the ingest process.

Apr 03 '19 14:04 bondjimbond

@bondjimbond If you can discern that the datastream you're ingesting in a batch is XML, it could conceivably get sanitized there. Still wouldn't stop someone who is manually jamminng in bad XML through Fedora's API, though.

Apr 03 '19 14:04 dannylamb

@bryjbrown We allow students to submit directly through the GUI and this comes up about twice a semester for us. And you should be able to see the Jira ticket if you're logged in and are on the security response group (you should be). @jordandukart Thanks for that, that might be a good long term solution to integrate. @bondjimbond The form isn't stripping this out currently.

Apr 03 '19 15:04 DonRichards

Assuming Github doesn't strip this stuff out of postings I'm going to paste a string that breaks it here. The control character is between the two "nn". Paste this into the title or abstract.

total Lagrangiannite

To check if Github filtered go to https://www.textmagic.com/free-tools/unicode-detector and paste it in. It should say "Character not present in GSM charset, forces to use Unicode encoding"

Apr 03 '19 15:04 DonRichards

@DonRichards

And you should be able to see the Jira ticket if you're logged in and are on the security response group (you should be).

I am logged in, but I'm not part of the security response group. Is that what could be causing me not to be able to see it?

Apr 03 '19 15:04 bryjbrown

If you paste that directly into a input field this is the results. Islandora Vagrant

Apr 03 '19 15:04 DonRichards

@bryjbrown Yes, I set it to sensitive because it was a security response ticket. I've basically showed most of what was in the ticket in the comments here.

Apr 03 '19 15:04 DonRichards

The "islandora_sanitize_input_for_valid_xml" @jordandukart suggested is currently firing for Full_text generation but I don't see it being fired prior to original file is ingested.

Apr 03 '19 16:04 DonRichards

Apologies for the confusion @bryjbrown . The message on the ticket implies that folks in the Developers group should be able to see, but I forgot that our permissions are customized, so that's not accurate under the hood. My bad 😞

Apr 03 '19 18:04 manez

@bryjbrown Yes, I set it to sensitive because it was a security response ticket. I've basically showed most of what was in the ticket in the comments here.

Out of curiosity, if this is a sensitive security issue being handled by the Islandora Security Group, are there not protocols in place to keep things private until a fix has been identified and released? I mean otherwise, having the exploit posted publically as it is here and on Jira puts the entire community at risk, no? I'm not privy to the protocols for handling security concerns but, if this is standard practice, I'd suggest the protocols be reviewed asap.

Apr 05 '19 12:04 fuel37

@fuel37 There are protocols in place. They just weren't followed. I'm trying to pull the conversation out of this and into private channels, although I see nothing wrong with taking the time to review how we handle these sorts of things.

Apr 05 '19 12:04 dannylamb

@fuel37 Through a few conversations it became apparent to me that this may cause headaches for viewing, it isn't providing any obvious platform for a malicious user to leverage an exploit. This causes functionality to not render completely but because it originates from a user submission it was originally set as sensitive.

I worried when I submitted it that this might have a further reaching issue. It turns out that Google Scholar submodule is using the fields as strings to generate embedded tags and it doesn't handle them well. Drupal gives too many options for filtering and cleaning these strings.

Apr 05 '19 17:04 DonRichards

ALthough we have a PR coming in to filter everything, this should be good to filter existing material.

Apr 08 '19 13:04 DonRichards

I'm sorry but this isn't isolated to Scholar and should probably live further upstream. I just did a large image ingest with the same problem.

Apr 16 '19 13:04 DonRichards

Maybe this shouldn't be closed. Not sure.

Apr 16 '19 14:04 DonRichards

@DonRichards Did you find a place "further upstream" to fix the display issue?

Jan 22 '20 14:01 bondjimbond

@bondjimbond No, I'm not sure where it would be.

Jan 22 '20 16:01 DonRichards

I believe this this needs to be merged here and not upstream at this point.

Apr 14 '20 15:04 DonRichards

This PR is being marked as stale from inactivity and will be automatically closed in 90 days unless further action is taken. If this PR is still relevant please comment. Please also consider attending the weekly Tech Call to discuss the PR

Jul 29 '25 12:07 github-actions[bot]

Removes control characters in the input/output of content

What does this Pull Request do?

What's new?

How should this be tested?

Additional Notes:

Interested parties