DnaChisel icon indicating copy to clipboard operation
DnaChisel copied to clipboard

AvoidPattern with regex

Open deto opened this issue 2 years ago • 1 comments

I ran into an issue recently where I was using a regex in my AvoidPattern and didn't realize that I needed to specify the size. What happened was that even though there plenty of available solutions avoiding the pattern, I was still getting an error that no solution could be found. This was because without a size, regexes dont' optimize on small windows and instead were introducing random perturbations on the entire sequence. Since it was a large sequence, the odds of any of these perturbations fixing the constraint was very low.

The fix was just to specify it as a SequencePattern instead (where I could specify the size) and feed that into AvoidPattern.

However, this really wasn't obvious to me - took quite a while of debugging and reading through the codebase to figure out.

I wanted to file this issue to suggest either a different approach to documenting or warning about this. Maybe either:

A) allow specifying 'size' in AvoidPattern (if providing a string for the pattern instead of SequencePattern object) and document the necessity of 'size' if it's a regex

or B) throw an error when a regex is specified without a size

or C) disallow specifying patterns with strings with AvoidPattern so that users have to go through SequencePattern and are more likely to see the note about how specifying size is needed (still relies on people reading all the docs, though).

Thanks for this tool - it has been incredibly useful and I'm just trying to think of ways to make it better.

deto avatar Oct 07 '22 21:10 deto

Thanks and sorry for your pain! That's one of the risks of user-friendly interfaces that try to do too much with the inputs.

My two cents:

  • (B) isn't great because some regex patterns don't have a size, for instance "all ORFs" ATG(.*)TGA (I'm simplifying!). These are useful, DnaChisel might have some trouble removing them, but it will work well in the case where the pattern is originally absent and you just want to prevent sequence edits that would make them appear.
  • (C) I understand that basically enforcing stronger typing would make things safer, but it would be some work (checks, redoing examples, redoing the Genbank import feature, which relies on that automatic detection)
  • (A) could work but will make the understanding/docstring a bit more complicated.

Leaving the decision to @veghp, but I would suggest (D) no change in the code but have the docstring (and everywhere else relevant!) clearly specify that if using a regex to actively remove sites, then it should be provided as SequencePattern(my_regex, size=6) so as to inform DnaChisel during optimization.

Zulko avatar Oct 10 '22 14:10 Zulko

This has already been pointed out in the SequencePattern docstring. I noted this in other places where relevant. I point to the docstring to single-source the information. But in most SequencePattern classes the size is fix and automatically calculated.

I deleted the SequencePattern docstring "(if none provided, the size of the pattern string is used)" as that's not true.

https://github.com/Edinburgh-Genome-Foundry/DnaChisel/commit/a060a9e25c7dbaa62f6a7b1b5581152eeb6b3edc

veghp avatar Jun 03 '23 13:06 veghp