Property type: RichText
Description
To support formatted text that could be used for advanced comments and descriptions, a new type `RichText` needs to be developed. Both the need for this type and the complexity of implementing it lie in its dual nature: formatted text expressed in a markup language, and core text (i.e., the same text without markup).
For example, given Markdown as the markup language, the following are examples of formatted text and core text respectively:
`cons` **does not** evaluate its arguments in a *lazy* language.
cons does not evaluate its arguments in a lazy language.
For values of RichText to be searchable, the markup needs to be ignored – only the core text needs to be searched. However, in order to preserve formatting for display purposes, there is also the need to store the text value with markup (formatted text).
Therefore, a single value of the RichText type should provide access to both core and formatted text. From the search perspective, which happens at the database engine level and should ignore formatting, there is a need to persist both the core and formatted values. EQL should interpret a RichText value as its core text component.
Initially, Markdown was used as the markup language for RichText, but then it was decided that HTML should be used instead.
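The dual nature described above can be pictured with a minimal sketch. The class name `RichTextSketch` and the method `contains` are hypothetical and chosen for illustration only; they are not the platform's actual API.

```java
import java.util.Objects;

/** Hypothetical stand-in for the RichText type; all names are illustrative only. */
final class RichTextSketch {
    private final String formattedText; // text with markup, preserved for display
    private final String coreText;      // text without markup, used for searching

    RichTextSketch(String formattedText, String coreText) {
        this.formattedText = Objects.requireNonNull(formattedText);
        this.coreText = Objects.requireNonNull(coreText);
    }

    String formattedText() { return formattedText; }

    String coreText() { return coreText; }

    /** Searching ignores markup: only the core text component is consulted. */
    boolean contains(String term) {
        return coreText.contains(term);
    }
}
```

With such a value, a search for `does not evaluate` succeeds even when the formatted text wraps `does not` in markup, while a search for the markup itself finds nothing.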
- [x] 1. Introduce `RichText` at the model level.
- [x] 2. Extract core text from formatted text (depends on the markup).
- [x] 3. Add Hibernate support for `RichText`: define a Hibernate user type and make it a composite one (see `MoneyUserType`).
- [x] 4. Make sure `PersistDomainMetadataModel` supports properties of type `RichText`. It might be necessary to modify and generalise it to support arbitrary composite types.
- [x] 5. Add support for `RichText` to EQL: interpret `RichText` as core text. For example, in a query that contains `prop("richText").eq().val(str)`, where `richText: RichText`, the resulting SQL should compare the value of `str` to the core text component of property `richText`.
  - [x] 5.1. Properties of type `RichText`.
  - [x] 5.2. Values of type `RichText`.
- [x] 6. Determine the necessity for HTML sanitization at the model level. In particular, the question of "When should sanitization be performed?".
  The goal is to ensure safety of the formatted text component. In the case of Markdown, which allows embedding arbitrary HTML, this boils down to the safety of HTML. One approach is to enforce safety as an integrity constraint, through a validator which rejects `RichText` values that contain unsafe HTML. Safety can be determined by using an HTML sanitizer.
  - ~~[ ] 6.1. Implement an implicit validator for `RichText` properties, enabled by default, that disallows unsafe content.~~
  - [x] 6.1. Implement validation of input text used to construct `RichText` values: reject inputs that contain unsafe HTML.
- [x] 7. Define a set of rules for markup stripping/transformation, which will be used to form the core text component.
- [x] 8. `IsProperty.length` at the level of a `RichText` property should apply to the `coreText` component.
  - [x] 8.1. Max. length validation
  - [x] 8.2. DB schema generation
  - ~~[ ] 8.3. Property metadata~~ (excluded due to unwarranted complexity: no strict need to have it implemented at the moment)
- [x] 9. The type of `RichText.formattedText` should be mapped to a DB type for variable-length text (`varchar` for SQL Server, `text` for PostgreSQL).
- [x] 10. `RichText` serialisation into and deserialisation from JSON objects.
  - Serialised `RichText` should have the following shape: `{ "formattedText": string, "coreText": string }`
  - Deserialisation can assume validity of serialised objects that are received. This is due to the fact that only unmodified property values are subject to deserialisation.
- [x] 11. Deserialisation of modified `RichText` values. `ua.com.fielden.platform.web.utils.EntityResourceUtils#convert` should be enhanced. Only `formattedText` need be considered. Validation must be performed.
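The serialised shape from task 10 can be illustrated with a small sketch. It assumes nothing about the platform's actual serialiser, which would normally rely on a JSON library; the class `RichTextJsonSketch` and its hand-written escaping are hypothetical and exist only to keep the example self-contained.

```java
/** Hypothetical illustration of the JSON shape for RichText; not the platform's serialiser. */
final class RichTextJsonSketch {

    /** Produces { "formattedText": string, "coreText": string } without extra whitespace. */
    static String toJson(String formattedText, String coreText) {
        return "{\"formattedText\":" + quote(formattedText)
                + ",\"coreText\":" + quote(coreText) + "}";
    }

    /** Minimal JSON string escaping: quotes, backslashes and control characters. */
    private static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }
}
```

Both components travel together when a value is serialised, which fits the note in task 10 that received objects can be assumed valid, whereas task 11 rebuilds a modified value from `formattedText` alone and validates it.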
HTML
- Extract core text from formatted text. Consider using jsoup.
- Define a set of rules for markup stripping/transformation, which will be used to form the core text component.
  - [x] 7.1. Inline tags that modify text style should be stripped (e.g., `<b>`, `<i>`, `<code>`): `<b>text</b> ==> text`.
  - [x] 7.2. Blocks should be replaced by the core text of their contents. Formatted text: `<pre> hello world </pre>`. Core text: `hello world`.
  - [x] 7.3. Links should be transformed as follows: `<a href='link'>text</a> ==> text (link)`
  - [x] 7.4. Image links should be transformed as follows: `<img src='link' alt='text' /> ==> text (link)`
  - [x] 7.5. Thematic breaks should be removed.
  - [x] 7.6. Newline characters should be removed to facilitate search without a clumsy use of wildcards.
  - [x] 7.7. Lists should be flattened and then joined into a single line. For example, `<ul> <li> one <ol> <li> two three </ol> <li> first <ul> <li> second third </ul> </ul>` turns into `one two three first second third`.
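The HTML rules above can be sketched as follows. This is a deliberately simplified, regex-based stand-in that uses only the JDK; it is not how jsoup works, and a real implementation would walk a parsed document tree instead. The class name `HtmlCoreTextSketch` and the choice of inline tags are assumptions. Entities are not decoded, and attribute values containing `>` are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical regex-based approximation of rules 7.1 to 7.7; a parser such as jsoup is the robust choice. */
final class HtmlCoreTextSketch {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
    private static final Pattern IMAGE = Pattern.compile("<img\\b[^>]*>", FLAGS);
    private static final Pattern LINK = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*(['\"])(.*?)\\1[^>]*>(.*?)</a\\s*>", FLAGS);
    // Assumed set of style-only inline tags (rule 7.1); they vanish without leaving a space.
    private static final Pattern INLINE_TAG = Pattern.compile(
            "</?(?:b|i|em|strong|code|u|s|span)\\b[^>]*>", FLAGS);
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    static String coreText(String html) {
        String s = replaceImages(html);                  // 7.4: <img> ==> alt (src)
        s = LINK.matcher(s).replaceAll("$3 ($2)");       // 7.3: <a> ==> text (href)
        s = INLINE_TAG.matcher(s).replaceAll("");        // 7.1: style tags are stripped
        s = ANY_TAG.matcher(s).replaceAll(" ");          // 7.2, 7.5, 7.7: other tags become gaps
        return s.replaceAll("\\s+", " ").trim();         // 7.6: newlines and runs of spaces collapse
    }

    private static String replaceImages(String html) {
        Matcher m = IMAGE.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String tag = m.group();
            String replacement = attr(tag, "alt") + " (" + attr(tag, "src") + ")";
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    /** Returns the quoted value of the named attribute within a single tag, or an empty string. */
    private static String attr(String tag, String name) {
        Matcher m = Pattern.compile("\\b" + name + "\\s*=\\s*(['\"])(.*?)\\1", FLAGS).matcher(tag);
        return m.find() ? m.group(2) : "";
    }
}
```

Replacing block-level tags with a space rather than nothing keeps adjacent list items and paragraphs from running together, while inline style tags are removed outright so that words with partial styling stay intact.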
Markdown
- Extract core text from formatted text. Consider using commonmark-java.
- Define a set of rules for markup stripping/transformation, which will be used to form the core text component.
  - [x] 7.1. Boldface should be stripped (`**text** ==> text`).
  - [x] 7.2. Italics should be stripped (`*text* ==> text` and `_text_ ==> text`).
  - [x] 7.3. Code backticks should be stripped: `` `text` ==> text ``
  - [x] 7.4. Quote blocks should be transformed into regular text by removing the block quote marker (`>`).
  - [x] 7.5. Links should be transformed as follows: `[text](link title) ==> text (link title)`
  - [x] 7.6. Fenced code blocks should be transformed into regular text by removing backticks and tildes.
  - [x] 7.7. Image links should be transformed in the same way as ordinary links.
  - [x] 7.8. Heading characters should be removed (Setext, ATX).
  - [x] 7.9. Thematic breaks should be removed.
  - [x] 7.10. Newline characters should be removed to facilitate search without a clumsy use of wildcards.
  - [x] 7.11. List markers, both bullet and ordered, should be removed. For example, the list `1. one 2. two three * first - second third` (line breaks collapsed here) turns into `one two three first second third`.
  - [x] 7.12. Inline HTML should be removed. It is highly unlikely that searching by HTML elements will be needed. `the <b>big</b> bang ==> the big bang`
  - [x] 7.13. HTML blocks should be removed. Markdown already provides a number of useful block structures, so occurrences of HTML blocks containing information that will need to be searched are highly unlikely.
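The Markdown rules above can be approximated with the following sketch. It is a regex-based simplification that uses only the JDK and is not a CommonMark parser; a faithful implementation would traverse the document tree produced by a library such as commonmark-java. The class name `MarkdownCoreTextSketch` is hypothetical. Known gaps, by assumption: HTML blocks (7.13) only lose their tags rather than their contents, and markup inside code spans is not protected.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical line-and-regex approximation of rules 7.1 to 7.12; not a CommonMark parser. */
final class MarkdownCoreTextSketch {

    static String coreText(String markdown) {
        List<String> kept = new ArrayList<>();
        for (String raw : markdown.split("\\R")) {
            String line = raw.trim();
            boolean thematicBreak = line.matches("([-*_])(\\s*\\1){2,}");   // 7.9
            boolean setextUnderline = line.matches("=+|-+");                // 7.8 (Setext)
            boolean fence = line.matches("(`{3,}|~{3,}).*");                // 7.6
            if (thematicBreak || setextUnderline || fence) {
                continue;
            }
            line = line.replaceFirst("^(>\\s*)+", "");                      // 7.4: quote markers
            line = line.replaceFirst("^#{1,6}\\s+", "");                    // 7.8 (ATX)
            line = line.replaceFirst("^(\\d+[.)]|[-*+])\\s+", "");          // 7.11: list markers
            if (!line.isEmpty()) {
                kept.add(line);
            }
        }
        String s = String.join(" ", kept);                                  // 7.10: one line
        s = s.replaceAll("!?\\[(.*?)\\]\\((.*?)\\)", "$1 ($2)");            // 7.5, 7.7: links, images
        s = s.replaceAll("<[^>]*>", "");                                    // 7.12: inline HTML tags
        s = s.replaceAll("`", "");                                          // 7.3: code backticks
        s = s.replaceAll("\\*\\*(.+?)\\*\\*", "$1");                        // 7.1: boldface
        s = s.replaceAll("\\*(.+?)\\*", "$1");                              // 7.2: *italics*
        s = s.replaceAll("\\b_(.+?)_\\b", "$1");                            // 7.2: _italics_
        return s.replaceAll("\\s+", " ").trim();
    }
}
```

Applied to the formatted example from the description, this yields the core text shown there.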
HTML sanitization
CommonMark establishes very liberal rules for embedded HTML (4.6 HTML blocks). Just some examples:
1. A block can start with a closing tag.
2. An open tag need not be closed.
3. A partial tag need not even be completed.
4. The initial tag doesn’t even need to be a valid tag, as long as it starts like one.
To sanitize HTML inside a CommonMark document, there are 2 approaches:
- Run the whole document through a sanitizer. This is likely to go awry because a sanitizer treats everything as HTML, which can result in unintended transformation of non-HTML content (e.g., backticks may be escaped).
- Sanitize only the HTML parts. This approach guarantees that non-HTML parts of a document won't be touched, but requires the additional effort of processing a document and sanitizing only its HTML parts, which can be accomplished with the `commonmark-java` library.
Since we don't want to modify anything but the unsafe parts of a document, the second approach is preferred.
For HTML sanitization, the OWASP Java HTML sanitizer can be used.
Validate and Sanitize
Given the above described integrity constraint that ensures safety of RichText contents, a mechanism for detecting unsafe parts is required so that the validator can do its job.
However, there are certain considerations to be taken into account when using the OWASP Java HTML sanitizer:
- The OWASP Java HTML sanitizer does not provide a predicate that would determine validity of a given HTML document.
- Policy violations can be tracked via `HtmlChangeListener`. This is the primary means for the validator to do its job.
- > If a string sanitizes with no change notifications (via `HtmlChangeListener`), it is not the case that the input string is necessarily safe to use. Only use the output of the sanitizer.

  This is taken from the OWASP Java HTML sanitizer project page. If no policy violations are reported, it should be safe to treat the validation as successful, but the correct value (the sanitizer's output) might not necessarily be the same as the validated one (the sanitizer's input). Therefore, the actual value assigned to a `RichText` property must be the one produced by the sanitizer. Validators have no control over the value that ultimately gets assigned, so value substitution must happen somewhere else:
  - In a definer. This will require an implicit definer to be installed for all `RichText` properties.
  - Inside the property setter's body. Running a sanitizer twice for the same input is not efficient and might cause performance issues for large inputs.

  Yet another approach is to shift from property validation to value validation. Specifically, to limit the validation to construction of `RichText` values by prohibiting invalid inputs. Then, validation of properties would no longer be necessary due to the invariant that guarantees validity of `RichText` values.
For more details see the page on validation.
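The value-validation approach can be sketched as follows. The `Sanitizer` interface and `TOY_SANITIZER` are hypothetical stand-ins introduced to keep the example self-contained: the interface only mimics the idea of a sanitizer that reports discarded content to a listener, and the toy policy (dropping `<script>` elements) is for illustration and is not a real sanitizer. All names are assumptions, not the OWASP API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of validating input at construction time; all names are illustrative. */
final class RichTextValidationSketch {

    /** Stand-in for a real sanitizer: returns sanitized output and reports each discarded fragment. */
    interface Sanitizer {
        String sanitize(String input, Consumer<String> onViolation);
    }

    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b.*?</script\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Toy policy for illustration only: removes <script> elements. NOT a real sanitizer. */
    static final Sanitizer TOY_SANITIZER = (input, onViolation) -> {
        Matcher m = SCRIPT.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            onViolation.accept(m.group());   // report the discarded fragment
            m.appendReplacement(out, "");    // and drop it from the output
        }
        m.appendTail(out);
        return out.toString();
    };

    /**
     * Rejects input for which the sanitizer reports violations; otherwise returns the
     * sanitizer's output, which is the value to keep (never the raw input).
     */
    static String validatedFormattedText(String input, Sanitizer sanitizer) {
        List<String> violations = new ArrayList<>();
        String output = sanitizer.sanitize(input, violations::add);
        if (!violations.isEmpty()) {
            throw new IllegalArgumentException("Input contains unsafe HTML: " + violations);
        }
        return output;
    }
}
```

Because the check runs once, when a value is constructed, the sanitizer is not invoked a second time in a setter or definer, and every existing value can be assumed valid.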
Database mapping
The types for database columns should be chosen as follows:
- `formattedText: String` should be mapped to a column with name `propertyName__formattedText` of the largest text type with UTF-16 support. For SQL Server this would be `NVARCHAR(MAX)`, for PostgreSQL this would be `TEXT`. No indexes are required when generating a database schema.
- `coreText: String` should be mapped to a column with name `propertyName__coreText` of a text type with UTF-8 support, with the size specified in attribute `length` of `@IsProperty` for `propertyName`. For both SQL Server and PostgreSQL this would be `VARCHAR(length)` (SQL Server requires a collation name ending in `_UTF8` to support UTF-8 in `VARCHAR`). Indexes are required when generating a database schema.
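The mapping rules above can be summarised in a small sketch that derives the two column definitions for a property. The class `RichTextColumnsSketch`, the `Dialect` enum and the property name `notes` are hypothetical; collation clauses and index generation are left out.

```java
/** Hypothetical helper that spells out the column naming and typing rules; not the platform's schema generator. */
final class RichTextColumnsSketch {

    enum Dialect { SQL_SERVER, POSTGRESQL }

    /** Column for the formatted text: the largest text type of the dialect, no index. */
    static String formattedTextColumn(String propertyName, Dialect dialect) {
        String type = dialect == Dialect.SQL_SERVER ? "NVARCHAR(MAX)" : "TEXT";
        return propertyName + "__formattedText " + type;
    }

    /** Column for the core text: sized by the length attribute of @IsProperty, same type in both dialects. */
    static String coreTextColumn(String propertyName, int length) {
        return propertyName + "__coreText VARCHAR(" + length + ")";
    }
}
```

For a property `notes` with `length = 255`, this gives `notes__formattedText NVARCHAR(MAX)` on SQL Server, `notes__formattedText TEXT` on PostgreSQL, and `notes__coreText VARCHAR(255)` on both.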
Note for future self:
- The Hibernate type for `RichText` was implemented to map `formattedText` using the regular `StringType`, which has proven to work well with the `NVARCHAR` column type in SQL Server.
Future work
- Disallow `RichText` as a composite key member. This should be enforced by the verifier.
- Update the verifier to allow properties with type `RichText`.

Note: the property verifier will have to be updated to support the new property type `RichText`.