HTML code in fasta description

Open tomas-pluskal opened this issue 5 years ago • 9 comments

Hi,

I noticed that the beta versions of 1.1.0 changed the way HTML tags are rendered in FASTA file description fields.

In version 1.0.9, HTML tags were interpreted as HTML code, which allowed us to place things like links (<a href=...) or formatted annotations into the FASTA files that are loaded into sequenceserver. However, in version 1.1.0 all HTML tags are shown as plain text instead (I assume the < and > characters are translated to their corresponding HTML entities &lt; and &gt;).
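To illustrate the assumed change, here is a minimal Python sketch of what escaping does to a description. This is not sequenceserver's code (sequenceserver is not written in Python), and the description string is a made-up example:

```python
import html

# Hypothetical FASTA description containing an HTML link (not from a real database).
description = 'kinase <a href="https://example.org/gene1">details</a>'

# Assumed 1.1.0-style behaviour: markup characters are replaced by entities,
# so a browser displays the tag as literal text instead of rendering a link.
escaped = html.escape(description)
print(escaped)
```

Passing `description` to the page unchanged would correspond to the 1.0.9 behaviour, where the browser renders the link.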

Is this a desired change? In my view, it actually limits the scope of sequenceserver - I thought being able to add HTML formatting to the FASTA descriptions was quite useful.

tomas-pluskal avatar Oct 06 '18 15:10 tomas-pluskal

To minimize security risks, not interpreting any unknown HTML is the right thing to do by default. Any HTML snippet not already defined in the software's source code is unknown HTML, i.e., any user input. So the new behavior is indeed the preferred one.

Why not use the link generation feature to create custom links instead? http://www.sequenceserver.com/doc/#plugin

yeban avatar Oct 09 '18 14:10 yeban

The link generator is nice, but some basic formatting capabilities would be useful too (at least translating \n to <br>). Perhaps the descriptions could support something like Markdown syntax?

tomas-pluskal avatar Oct 09 '18 23:10 tomas-pluskal

Hi, I would like to return to this issue. I think an option to add simple formatting to the sequence descriptions would be useful. With a Markdown parser like kramdown (https://kramdown.gettalong.org/), this would be very easy to implement. What do you think?

tomas-pluskal avatar Feb 17 '19 18:02 tomas-pluskal

I see the utility of it. Currently, adding custom links requires a bit of Ruby. But with embedded markdown, users can add them to the FASTA files using Perl, Python, bash, etc. Maybe embedded markdown can become the standard for adding custom links, while the link generator remains for automatic linking to public databases based on ID/title pattern. I think this feature should be opt-in (that is, disabled by default).
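A rough sketch of how an opt-in version could behave, written in Python purely for illustration. The function name `render_description` and the `markdown_links` flag are hypothetical, and only the Markdown link syntax is handled here; a real implementation would more likely delegate to a full Markdown parser:

```python
import html
import re

# Matches [label](http://...) or [label](https://...); other URL schemes are ignored.
LINK = re.compile(r"\[([^\]]+)\]\((https?://[^\s)]+)\)")

def render_description(text, markdown_links=False):
    """Escape a FASTA description; optionally turn Markdown links into anchors."""
    safe = html.escape(text)   # always escape first, so raw HTML stays inert
    if not markdown_links:     # opt-in: disabled by default
        return safe
    return LINK.sub(r'<a href="\2">\1</a>', safe)
```

Because the text is escaped before any link is generated, raw HTML in the FASTA file still shows up as plain text, and only http(s) links written in Markdown become clickable.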

yeban avatar Mar 03 '19 14:03 yeban

I agree with the opt-in. I think it is useful not only for links, but also for highlighting stuff etc.

I can try to code this and make a pull request.

tomas-pluskal avatar Mar 03 '19 17:03 tomas-pluskal

@photocyte any thoughts on this?

tomas-pluskal avatar Mar 03 '19 17:03 tomas-pluskal

I think it is a good idea. Markdown formatting would support encoding of links and newlines, and the other formatting would be useful too.

photocyte avatar Mar 03 '19 19:03 photocyte

I like these ideas as they should make it easier to customize outlinks & add complementary information.

However, it can be considered bad practice to modify a FASTA file just to add metadata, because this makes it difficult to verify its integrity in comparison with reference databases/original downloads.

So I suggest a slightly different approach:

  • alongside mygenome.fasta (which has been formatted into mygenome.fasta.nin and all the other blast database files), optionally have a file called mygenome.fasta.links. This could be a 2-column file, where the left-most column is the sequence id and the other one contains the html or markdown.
  • when we display results, we

A major reason against this approach is that it doesn't piggy-back off BLAST's indexing. It is unclear to me how much of a burden (on server or on client-side) the additional RAM/time/download overhead of parsing the links files would be.
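For a sense of what parsing such a file involves, here is a minimal Python sketch. It assumes the two columns are tab-separated, which the proposal does not specify, and the function name `load_links` is hypothetical:

```python
import os

def load_links(fasta_path):
    """Read the optional <fasta_path>.links file into a {sequence_id: markup} dict."""
    links_path = fasta_path + ".links"
    if not os.path.exists(links_path):
        return {}  # the sidecar file is optional
    links = {}
    with open(links_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if "\t" not in line:
                continue  # skip blank or malformed lines
            seq_id, markup = line.split("\t", 1)  # assumed tab-separated columns
            links[seq_id] = markup
    return links
```

This version holds the whole file in memory, which is exactly the overhead in question for large databases; an indexed lookup would avoid that at the cost of extra machinery.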

yannickwurm avatar Oct 27 '21 10:10 yannickwurm

Hello, thanks for the feedback. Regarding verifying a FASTA file's integrity against the original downloads: internally I've come up with a seqkit-based FASTA checksum that looks at the sorted sequence content at different levels (e.g. all uppercase, to ignore whether softmasking was performed). In brief, it examines a FASTA file at 4 levels of scrutiny, with a standard md5sum checksum being the highest level, and produces a 4-part checksum (so parts 1, 2, 3 and 4 all matching means something different from just part 4 matching). In my opinion, a checksum of the file content alone breaks too easily with minor modifications of the FASTA file (e.g. shortening the FASTA record names). I would have thought the bioinformatics field had already come up with such a FASTA-specific checksum, but I haven't come across one... If there is interest, I could try to polish the documentation and release the checksum publicly.
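To make the idea concrete, a multi-level checksum along these lines could look roughly like the Python sketch below. This is not the seqkit-based tool described above: apart from the uppercase level and the plain md5 of the file, the choice and ordering of levels are assumptions made for illustration, and the function names are hypothetical.

```python
import hashlib

def read_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def _md5(items):
    """md5 of the items joined by newlines."""
    return hashlib.md5("\n".join(items).encode("utf-8")).hexdigest()

def fasta_checksums(text):
    """Return a 4-part checksum, from most tolerant to strictest (assumed levels)."""
    records = read_fasta(text)
    return (
        # 1: sorted sequences, uppercased (ignores order, names and softmasking)
        _md5(sorted(seq.upper() for _, seq in records)),
        # 2: sorted sequences, case-sensitive (ignores order and names)
        _md5(sorted(seq for _, seq in records)),
        # 3: sorted "id<TAB>sequence" pairs (ignores order and descriptions)
        _md5(sorted(h.partition(" ")[0] + "\t" + seq for h, seq in records)),
        # 4: exact file content, equivalent to md5sum on UTF-8 text
        hashlib.md5(text.encode("utf-8")).hexdigest(),
    )
```

With levels like these, editing only the description fields changes part 4 but leaves parts 1 to 3 intact, which would indicate that the sequences themselves are untouched.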

Regarding this case: conceptually, I still like the idea of encoding metadata for sequenceserver to display in the FASTA header, because as a general rule I prefer metadata to be explicitly linked to its file (it gets lost too easily in a separate file). To my recollection, pure Markdown doesn't encode newlines, so it may be less suitable than escaped HTML. But I think @tomas-pluskal came up with a different approach for making metadata links with sequenceserver for https://github.com/transXpress/transXpress , though I am not familiar with how it was done.

photocyte avatar Oct 27 '21 16:10 photocyte