taxonworks icon indicating copy to clipboard operation
taxonworks copied to clipboard

Import nexus file as ObservationMatrix

Open kleintom opened this issue 9 months ago • 18 comments

Some questions and comments

  • Do we want to make ObservationMatrix documentable and document it with the nexus file, or should that be an option?
  • I added the beginnings of Document filter queries so that I could match on nexus file extensions when selecting a document. If that's not something you want to be able to filter on, no problem just let me know and I can revert it. If it is:
    • Right now I'm passing around a text string '.nex, .nxs' indicating nexus extensions, repeating that string in javascript, ruby, and specs - what's the right way to do that?
    • Are there other nexus extensions we should recognize here? .nexus (.tre?)
  • Do we want an option to generate descriptor short names like mx has?
  • I assume that ExceptionNotifier in production sends an email to somebody when a background job fails?
  • Currently ExceptionNotifier is configured by default to ignore the following errors: ["ActiveRecord::RecordNotFound", "Mongoid::Errors::DocumentNotFound", "AbstractController::ActionNotFound", "ActionController::RoutingError", "ActionController::UnknownFormat", "ActionController::UrlGenerationError"] - ActiveRecord::RecordNotFound is one I can theoretically hit in the background if I pass a bad document id (which I don't), so currently I'm converting that error to TaxonWorks::Error. I just wanted to check that it's intentional that we ignore RecordNotFound in background jobs (even if it's not intentional it may not be worth changing just for this).

kleintom avatar Apr 29 '24 15:04 kleintom

Code organization: I suspect this may need some work, feel free to say so.

Right now the import code is basically in two places:

  • code that returns a controller error is in the observation_matrices_controller
  • the rest of the import code is in helpers/lib/vendor/nexus_helper.rb (is that a good place?)

Partly that's so the helper code can be shared by both preview in the controller and the actual import in a background job. Partly it's to avoid nexus bloat in the controller.

You'll also notice I focused quite a bit on error detection in the specs; feel free to note where I maybe overdid it.

  • We do the nexus file parse in two places when the user clicks Convert to start the background job: once in the controller to check for errors that we can catch and report back to the user, then again in the background job. The alternative would be to serialize/deserialize the NexusParser object so it can be passed directly to the ActiveJob (or to require Preview before Convert so that we can just check for errors on the preview and then not worry about it on Convert). None of that really matters until you get into matrices in the multiple hundeds by multiple hundreds sizes, in which case you should expect to wait anyway...

kleintom avatar Apr 29 '24 15:04 kleintom

Some numbers from my desktop computer with 16G ram. Parsing:

  • a 100x100 matrix takes .1 sec
  • 200x200 takes .7 sec
  • 998x998 takes 69 seconds (nexus_parser fails consistently on a strange-to-me error on 999x999)

Preview times are basically the same as parse times.

Import (the background job):

  • 100x100 otus: 1.2 sec, descriptors: 2, codings: 27, total 30 sec
  • 200x200 otus: 2.4, descriptors: 3.9, codings: 102, total: 109
  • 998x998 otus: 11.4, descriptors: 20.7, codings: job failed with

[Worker(host:b1c58fa834ac pid:1107)] Job ImportNexusJob [d57458c2-dd48-4c93-8ca2-11036687b0b4] from DelayedJob(import_nexus) with arguments: [14, {"_aj_globalid"=>"gid://taxon-works/ObservationMatrix/2"}, {"_aj_hash_with_indifferent_access"=>true}, 1, 1] (id=2) (queue=import_nexus) FAILED (0 prior attempts) with Delayed::WorkerTimeout: execution expired (Delayed::Worker.max_run_time is only 14400 seconds) = 4 hours

That's a million observations to create for a 1000x1000 matrix.

kleintom avatar Apr 29 '24 15:04 kleintom

Remaining issues:

  • If the user chooses to match both otus and descriptors (rows and columns), then it's possible that we're affecting existing matrices with the same otu/descriptor pair but different observations: observations would get merged on both new and existing matrices. That seems bad (if rare), I'm not sure how to handle it yet, any thoughts? We might need to require preview before doing the import if this is something we would only catch in either preview or in the background.

  • nexus_parser allows repeated taxa and repeated descriptor names. For now we tolerate repeats

    • if there's no otu or descriptor matching then we just create repeats and all is fine (apart from possible confusion when viewing repeats in isolation)
    • if there's otu and/or descriptor matching then we're using the same otu/descriptor in multiple rows/columns and so observations get merged for those positions in the matrix. Options: just match the first otu/descriptor in that case and create new ones for the rest of the repeats? Don't match at all for repeats? Just return an error if there are repeats? Other thoughts?
  • Currently after the user clicks the Convert button to kick off the background job, we display a message with a link to the newly created matrix, basically saying refresh that link to check if the import has completed, or the link goes bad if the import failed (the matrix should be destroyed in that case). That seems fine for small imports, but I'll think some more about alternatives showing progress.

kleintom avatar Apr 29 '24 15:04 kleintom

@kleintom Lovely! Looking forward to poking at this, and maybe even mentioning it next week at TWT (btw, email coming to you about that shortly). Without testing here is a rapid-fire pass at your questions:

Do we want to make ObservationMatrix documentable and document it with the nexus file, or should that be an option?

Yes! Let's just do it, it can be removed if for some reason it's not desired, however it's important provenance.

added the beginnings of Document filter queries so that I could match on nexus file extensions when selecting a document. If that's not something you want to be able to filter on, no problem just let me know and I can revert it.

Sounds fine to me. I would say we want to operate as little as possible on strings from the file name if possible and rather perhaps shell-out to detect the image type using some *nix utility/mime-type etc. to the best of its ability (going to struggle with .nex regardless). I haven't looked at how specifically you're implementing this.

Right now I'm passing around a text string '.nex, .nxs' indicating nexus extensions, repeating that string in javascript, ruby, and specs - what's the right way to do that?

This is generally tricky, but we've done a buch of work via this route- Server-side you can add a config/initialize/constant, that will make it accessilble to specs and app. You can expose this constant as a micro JSON endpoint to populate the JS use.

Are there other nexus extensions we should recognize here? .nexus (.tre?)

Best to explore Mesquite documentation, if they can handle it then we should try to emulate.

I assume that ExceptionNotifier in production sends an email to somebody when a background job fails?

Yes, our core team gets emails.

I just wanted to check that it's intentional that we ignore RecordNotFound in background jobs

Yes, very intentional, we don't want to get emails when API calls, or users, hit pages that don't exist. In-App we catch these and return 404 not found in the vast majority of cases.

the rest of the import code is in helpers/lib/vendor/nexus_helper.rb (is that a good place?)

It can named exactly the same name as the vendor library nexus_parser.rb

Seems non-linear doesn't it? Not suprising but

...then it's possible that we're affecting existing matrices with the same otu/descriptor pair but different observations: observations would get merged on both new and existing matrices.

Perfect. We want this. Matrices are views onto observations in TaxonWorks. Coding in one means you get it everywhere. Can explain further but this is (our should be) learned/expected beviour to TW users.

mjy avatar Apr 29 '24 15:04 mjy

Hello @kleintom !

I see the smart selector shows all documents and not only nexus files

image

You can filter them on Smart Selector using filter prop, for example:

    <SmartSelector
      klass="Documents"
      model="documents"
      v-model="nexusDoc"
      label="document_file_file_name"
      pin-section="Documents"
      pin-type="Document"
      :add-tabs="['new', 'filter']"
      class="selector"
      :filter="(item) => ALLOW_CONTENT_TYPES.includes(item.document_file_content_type)'"
    >

//...

<script setup>
const ALLOW_CONTENT_TYPES = ['application/nexus']

//...

Not sure if that's the correct content type, but you get the idea

EDIT:

It seems the content type for it is text/plain, maybe check by extension:

    <SmartSelector
      klass="Documents"
      model="documents"
      v-model="nexusDoc"
      label="document_file_file_name"
      pin-section="Documents"
      pin-type="Document"
      :add-tabs="['new', 'filter']"
      class="selector"
      :filter="(item) => ALLOW_EXTENSIONS.some(ext => item.document_file_file_name.endsWith(ext))"
    >

//...

<script setup>
const ALLOW_EXTENSIONS = ['.nex']

//...

jlpereira avatar Apr 29 '24 19:04 jlpereira

Tried with this file: https://github.com/mjy/nexus_parser/blob/main/test/MX_test_03.nex

After click "Preview conversion" I'm getting: "Error converting nexus to TaxonWorks: TaxonWorks character names must be unique for a given descriptor - duplicate name(s): '' detected for character 'Undefined'"

jlpereira avatar Apr 29 '24 20:04 jlpereira

Tried with this one too: https://morphobank.org/index.php/Projects/Matrices/project_id/5216

In this case, I got this meesage: Nexus parse error: File is missing at least some required headers, check formatting.

jlpereira avatar Apr 29 '24 21:04 jlpereira

Thanks for the feedback jlpereira!

You can filter them on Smart Selector using filter prop

Ah, thanks. This is filtered client side (right?) so it could return no recent nexus files if enough non-nexus files have been created recently after the nexus files, but I'm guessing seasoned users are aware of that?

Tried with this file: https://github.com/mjy/nexus_parser/blob/main/test/MX_test_03.nex

After click "Preview conversion" I'm getting: "Error converting nexus to TaxonWorks: TaxonWorks character names must be unique for a given descriptor - duplicate name(s): '' detected for character 'Undefined'

Funnily enough, and to my surprise, the error message is actually correct! The issue here is that TaxonWorks requires CharacerState names associated with a particular Descriptor to be unique: https://github.com/SpeciesFileGroup/taxonworks/blob/abffe3fc2a08ce37d804791575ad5c3b9e36f4b5/app/models/character_state.rb#L37-L38

My thinking has been that if that's not the case for a nexus file then we just have to fail instead of trying to make names unique in some way (with the exception of gap names, which I'll comment on elsewhere).

In the (cursed) file you referenced the error is on character 4, a ghost character that's only defined by its absence in the CHARSTATELABELS list between characters 3 and 5. In such a case nexus_parser parses the character as:

#<NexusParser::NexusParser::Character:0x00007f7937d15080
    @name=nil,
    @notes=
     [#<NexusParser::NexusParser::Note:0x00007f7937eed880
       @vars={:text=>"", :type=>"AN", :c=>"4", :a=>"JC", :dc=>"2008.4.13.20.31.50", :dm=>"2008.4.13.20.32.10", :id=>"01194a584b9f2", :i=>"_", :tf=>"(CM 'This is an annotation to charcter 4, that has no name.')"}>],
    @opt={:name=>"", :label=>"-"},
    @states={"0"=>#<NexusParser::NexusParser::ChrState:0x00007f7937df5798 @name="">, "1"=>#<NexusParser::NexusParser::ChrState:0x00007f7937df1f58 @name="">, "-"=>#<NexusParser::NexusParser::ChrState:0x00007f7937e12230 @name="gap">}>

The name of the character (= TW descriptor) is nil and the CharacterState names that are used in the matrix are '', '', and 'gap', so that the name '' is repeated. I've added the descriptor number to the error output, which should help in this case.

I should try to work through the rest of that file as well to catch any other tricky cases - character 8 might cause problems too - thanks for trying it (I guess :P).

Tried with this one too: https://morphobank.org/index.php/Projects/Matrices/project_id/5216

In this case, I got this meesage: Nexus parse error: File is missing at least some required headers, check formatting.

This is https://github.com/mjy/nexus_parser/issues/10, if you make that change in the file then you hit https://github.com/mjy/nexus_parser/issues/11

kleintom avatar Apr 30 '24 16:04 kleintom

@kleintom I released 1.2.1 of nexus_parser if you want to get the latest into this branch

mjy avatar May 03 '24 14:05 mjy

@kleintom I need to look further on this:

duplicate name(s): '' detected for character 'Undefined'

Can you describe what you are doing at the wrapper level for ? cells? There should be no CharacterState or Observation, they are skipped in TW.

mjy avatar May 03 '24 14:05 mjy

duplicate name(s): '' detected for character 'Undefined'

Can you describe what you are doing at the wrapper level for ? cells? There should be no CharacterState or Observation, they are skipped in TW.

Yes, ? cells are being skipped, nothing is being done at the wrapper level with them. The error you're seeing there is one I would expect to come from the situation I mentioned previously that arises with the MX_test_03.nex file - its character 4 is never actually defined, there's nothing in the characters list between character 3 and character 5. When that file gets parsed, character 4 is given a nil name and the character states from the non-? entries in the matrix are given names '' (empty strings). Since there are two character states with names '' and character state names associated with the same descriptor can't be repeated in TW (as opposed to nexus_parser) we raise the error you mention.

The most recent pushed code actually catches the other error with that descriptor 4 of a nexus_parser descriptor with an empty/nil name trying to be imported into TW, so you'll see a different error for that file now.

@kleintom I released 1.2.1 of nexus_parser if you want to get the latest into this branch

Great! Will do.

The failing tests are on matrix codings of the form (01). The test has always passed locally and I think it passed the last time I pushed here; my only guess right now is that somehow those multi-codings are sometimes getting created in the wrong order, but I don't see that in the code. (It could happen if the 1 observation already existed and then we created the 0 observation, but that shouldn't be happening in the spec.) I'll think about it some more but might not get to it until next week.

kleintom avatar May 03 '24 15:05 kleintom

When that file gets parsed, character 4 is given a nil name and the character states from the non-? entries in the matrix are given names '' (empty strings). Since there are two character states with names '' and character state names associated with the same descriptor can't be repeated in TW (as opposed to nexus_parser) we raise the error you mention.

I understand now. So in this case I think I might have missled you. I think we should Create a Qualitative Descriptor something like "Undefined (#{position}) from [#{filename}]" (I think maybe you did this) and the CharacterStates need to be normalized from the cells themselves (1,2,-) in a pre-process step.

mjy avatar May 03 '24 15:05 mjy

That makes sense, so given the following as an example with a descriptor with no name and two character states with no names:

#<NexusParser::NexusParser::Character:0x00007f7937d15080
    @name=nil,
    @notes=
     [#<NexusParser::NexusParser::Note:0x00007f7937eed880
       @vars={:text=>"", :type=>"AN", :c=>"4", :a=>"JC", :dc=>"2008.4.13.20.31.50", :dm=>"2008.4.13.20.32.10", :id=>"01194a584b9f2", :i=>"_", :tf=>"(CM 'This is an annotation to charcter 4, that has no name.')"}>],
    @opt={:name=>"", :label=>"-"},
    @states={"0"=>#<NexusParser::NexusParser::ChrState:0x00007f7937df5798 @name="">, "1"=>#<NexusParser::NexusParser::ChrState:0x00007f7937df1f58 @name="">, "-"=>#<NexusParser::NexusParser::ChrState:0x00007f7937e12230 @name="gap">}>
  • if a descriptor passed from nexus_parser doesn't have a name, we assign it one, like "Undefined (#{position}) from [#{filename}]" (I haven't been doing that, I've been erroring out)
  • when character states are undefined in the nexus file, nexus_parser does still pass in character states based on the cells that have non-? states (as in the example above), so we know which characters need to be created based on that, and they already have the corresponding labels from the codings (1, 2, etc.) in the character state data, we'd just need to create names for them, something like "Undefined (#{label}) from [#{filename}]".

I should be able to get to this next week some time. See you at the meeting!

kleintom avatar May 03 '24 16:05 kleintom

I should be able to get to this next week some time. See you at the meeting!

Sounds great, see you then! BTW - make sure you come to the SFG business meeting & new awards if it fits in your schedule.

mjy avatar May 03 '24 16:05 mjy

The failing tests are on matrix codings of the form (01). my only guess right now is that somehow those multi-codings are sometimes getting created in the wrong order, but I don't see that in the code.

Note to self: are multi-observations returned from observation_matrix#observations_in_grid explicitly ordered in any way? Does the ui even order them explicitly? If not (or maybe even if so) then the test can be changed.

kleintom avatar May 03 '24 18:05 kleintom

are multi-observations returned from observation_matrix#observations_in_grid explicitly ordered in any way

There are no semantics that depend on position. Rendering would likely be alphabetical in most cases, but this doesn't change meaning.

mjy avatar May 03 '24 18:05 mjy

Very interested in road-testing this to move data matrices from MX to TW! Let me know when I can jump in to try it.

JimWoolley avatar May 09 '24 14:05 JimWoolley

Are there other nexus extensions we should recognize here? .nexus (.tre?)

Best to explore Mesquite documentation, if they can handle it then we should try to emulate.

Regarding nexus extensions: the only place we actually check extensions is in the document chooser (where the user can still choose files with other non-nexus extensions) - otherwise we leave it up to nexus_parser to decide if the file is nexus or not, and it does so without consideration for extension. It looks like Mesquite basically does the same thing: https://github.com/MesquiteProject/MesquiteCore/blob/72ac14f5c3df2e58c20037e2b4cb4ae9dd474f36/Source/mesquite/minimal/InterpretNEXUS/InterpretNEXUS.java#L258-L277 (except it takes it on faith, at least to start with, that a .nex file doesn't need checking).

FWIW the nexus extensions we offer to recognize, .nex and .nxs, are the ones listed by Wikipedia as 'usually' used for nexus files: https://en.wikipedia.org/wiki/Nexus_file

kleintom avatar May 10 '24 18:05 kleintom

Codecov Report

Attention: Patch coverage is 23.33333% with 115 lines in your changes are missing coverage. Please review.

Project coverage is 72.47%. Comparing base (f7b716d) to head (139bd3a). Report is 5 commits behind head on development.

:exclamation: Current head 139bd3a differs from pull request most recent head b5dae7f

Please upload reports for the commit b5dae7f to get more accurate results.

Files Patch % Lines
app/controllers/observation_matrices_controller.rb 10.60% 59 Missing :warning:
app/models/document.rb 10.00% 18 Missing :warning:
lib/queries/document/filter.rb 40.00% 18 Missing :warning:
app/jobs/import_nexus_job.rb 29.41% 12 Missing :warning:
app/controllers/documents_controller.rb 20.00% 8 Missing :warning:

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@               Coverage Diff                @@
##           development    #3929       +/-   ##
================================================
+ Coverage        58.68%   72.47%   +13.78%     
================================================
  Files             1651     1905      +254     
  Lines            52614    67975    +15361     
================================================
+ Hits             30879    49265    +18386     
+ Misses           21735    18710     -3025     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar May 13 '24 13:05 codecov-commenter

Updated to nexus_parser 1.2.2 - I was able to import 15 random morphobank nexus files (w/ no notes) without error - meaning the jobs succeeded and at a cursory glance the matrices look believable. The largest one was 697x151, that took me 9 minutes to import.

Some nexus_parser fails on character names that I ran into with other files that may not get fixed:

  • ''Zangerlia'' neimongolensis' (those are all single quotes, known issue with quoting)
  • '[Code as state 2 if single crest is present at any point]' (known issue with handling of comments (removed globally without context))

Remaining issue here:

  • When I import the 15 files and then re-preview matching on otus and descriptors, every otu matches, but in some cases descriptors don't match, I'm looking into that.

kleintom avatar May 16 '24 12:05 kleintom

Maybe another issue: on long descriptor names the ui gets pretty jumbled, though you can more or less read the names when they change to black on hover: Screenshot 2024-05-15 at 18-13-44 TaxonWorks - _tasks_observation matrices_view

kleintom avatar May 16 '24 12:05 kleintom

  • When I import the 15 files and then re-preview matching on otus and descriptors, every otu matches, but in some cases descriptors don't match, I'm looking into that.

Most of these seem to be because of descriptor/state names with double spaces in them getting squished by https://github.com/SpeciesFileGroup/taxonworks/blob/d4cace210e9dfd1d12eb53c40d8764d04fa988ea/config/initializers/application_record/nilify_blanks.rb#L10-L14

Would we want to ignore_whitespace_on names in descriptors and character_states? If not I think I could still do correct matching by comparing nillified names from the nexus file with what's in the db.

kleintom avatar May 16 '24 15:05 kleintom

Would we want to ignore_whitespace_on names in descriptors and character_states?

We'd have to ignore whitespace on otu names too I think if we went that route.

To clarify, the issue is with descriptor names like 'Character 150 Maxilla, shape of the toothrow posterior to the first six alveoli: (after Brochu, 1997a [135]; Clark, 1994 [79]).' That one has a double space before the (. (It also has the issue that the bracketed references are removed, but in this case that doesn't cause an error.)

kleintom avatar May 16 '24 15:05 kleintom

Ignoring whitespace on those attributes is not good. We'll have to find another route. Can we pre-process the nexus names with the same string cleaning method before match? I'm actually working on some related improvements, I think it should be possible.

mjy avatar May 16 '24 19:05 mjy

@kleintom I haven't looked at the UI here at all yet, but we can plan for a 2 stage release, first is just match as best as possible, second (post release) is a UI that lets users manually choose target when no match, or over-ride choice that was made as well. I think a "create all as new" is going to be useful as well. Again, I haven't looked at UI, so apologies if this is variously addressed.

mjy avatar May 16 '24 19:05 mjy

Maybe another issue: on long descriptor names the ui gets pretty jumbled, th

Yes, definitely, a matrix renderer is another issue- that reminds me of another OS project I was involved in a long time ago.. for another story/issue.

mjy avatar May 16 '24 19:05 mjy

Maybe another issue: on long descriptor names the ui gets pretty jumbled, though you can more or less read the names when they change to black on hover: Screenshot 2024-05-15 at 18-13-44 TaxonWorks - _tasks_observation matrices_view

I pushed a fix for this, now long descriptors should be cut and visible on hover:

image

jlpereira avatar May 16 '24 20:05 jlpereira

Can we pre-process the nexus names with the same string cleaning method before match?

Yes that worked out fine (I think). Importing those 15 morphobank nexus files and then importing them again with otus and descriptors matched results in no new otus, descriptors, character states, or observations.

I think a "create all as new" is going to be useful as well. Again, I haven't looked at UI, so apologies if this is variously addressed.

No problem, that's helpful. We do already support 'create all as new' if you just don't select any of the matching options, but I can see where being able to adjust the matching that does occur could be very helpful in the future.

One other related comment: currently you can select 'match otu by taxon' and 'match otu by name' at the same time: it matches by taxon first, then by name. That's what made sense to me, but I'm not actually sure matching both at the same time is useful, and if so if we should give an option in which order to match. Feedback can wait though.

I pushed a fix for this, now long descriptors should be cut and visible on hover:

Thanks, looks great to me!

kleintom avatar May 17 '24 14:05 kleintom

@kleintom Thanks for the matching, sounds good. In your mind are the next steps to have us do some testing and get you feedback?

mjy avatar May 17 '24 14:05 mjy

In your mind are the next steps to have us do some testing and get you feedback?

Yes, I think things have pretty much settled on my end, testing and feedback would be great, thanks!

kleintom avatar May 17 '24 14:05 kleintom