droid icon indicating copy to clipboard operation
droid copied to clipboard

No Profile Mode - issues and questions.

Open nishihatapalmer opened this issue 4 years ago • 41 comments

The No Profile mode in the command line project suffers from multiple issues.

  • doesn't match extensions the same as profile mode.
  • only outputs filename and puid.
  • archive file processing support is limited (it usually can't handle nested archives)

Origin The original idea of no profile mode was to support outputting droid metadata without the overhead of a database. If the results were just going to go to the standard out, or a file, then there was no need to record them in a database as well.

Implementation A decision was made to re-implement all the core identification and archive handling in the command line project for the no profile mode. This was a short term decision made due to some annoying incompatibilities (filters being one). At the time there was only zip, gz and tar archive handling.

Outcomes The result of this is that there is a considerable maintenance burden imposed by the no profile mode, as the archive handling must be re-implemented every time a new type is added. In fact, the implementations provided by no profile mode are much less capable than the ones already available in the main droid projects (they can't handle nested archives for the most part).

Identification also suffers, as the logic doesn't match what is available in the rest of DROID, and results suffer as very little metadata is outputted in this mode (just filename and puid).

Original design In the original design, the idea of no profile mode was that it should use all the same profile machinery, but just plug in a csv output ResultsHandler (whether to standard out or a file), instead of the database results handler.

Questions

  1. Would implementing no profile mode to the original design be of interest? Doing so would massively simplify the command line project, as it would just use the standard droid modules used by the rest of DROID.
  2. Do people depend on the current output (just filename and puid) of no profile mode? This could be preserved, although I suspect most people will want the full metadata most of the time.

Note: I see there is an issue with no-profile-mode filenames containing commas here: https://github.com/digital-preservation/droid/issues/34 This would be fixed by using a CSV output format for droid metadata. The idea would be that no profile mode could output data essentially the same as if we had profiled with a database, and then performed an export. The output could still be given in the current format for backwards compatibility reasons.

nishihatapalmer avatar Feb 02 '20 17:02 nishihatapalmer

Error reporting for archival formats This issue surfaces when we identify an archival format - and record the result, then later encounter an error processing the archival format. The database handler updates the original entry to indicate an error. Obviously, you can't go back and correct a row previously written to standard out or a file.

The only approaches I can see to errors subsequent to the initial identification when streaming the output are:

  1. Defer recording the identification metadata of the parent archival format until we've successfully processed it. This would lead to archival parents always being recorded after their children in the output.
  2. Allow error reporting entries in the output. These are corrections which should be applied to the status of previous entries.
  3. Don't record errors processing an archival format in the results - the file was still identified as that format, even if there was a later problem processing it. The log will still contain evidence of the problems processing it.

nishihatapalmer avatar Feb 02 '20 20:02 nishihatapalmer

Would implementing no profile mode to the original design be of interest?

Yes! It would be great to have consistent results with CLI (no profile) and GUI DROID. See also #224 which I think is related.

marhop avatar Feb 03 '20 12:02 marhop

I also would have interested in seeing consistency between the two versions. I would first want to know how this would affect our enterprise installation of Rosetta and how it implements DROID. I believe it uses the CLI version, but also has added a way to set the Max Byte Scan size setting. It would also allow us to add the ability to scan more archival formats within the Rosetta preservation system.

thorsted avatar Feb 03 '20 19:02 thorsted

@nishihatapalmer we just have been thinking that we should really refactor the whole archive management. I had to fix a few inconsistencies between GUI and CLI like in https://github.com/digital-preservation/droid/pull/378 and https://github.com/digital-preservation/droid/pull/376, and this is truly a maintenance burden.

jcharlet avatar Feb 04 '20 18:02 jcharlet

I'm having a play with what's needed for this. So far looks mostly like implementing a WriterResultHandlerDao class in the results.handlers package.

nishihatapalmer avatar Feb 10 '20 23:02 nishihatapalmer

really looking forward to it Matt. I think it should be prioritized for next release, if you can do it, that would be amazing otherwise we'll work on it with @sparkhi .

jcharlet avatar Feb 11 '20 16:02 jcharlet

Well, definitely not this release!

Turns out profiles have a few other database dependencies (like reporting and exporting), although they all live behind interfaces, and they aren't needed for direct output to console or file. More to explore.

nishihatapalmer avatar Feb 11 '20 19:02 nishihatapalmer

yeah, this is definitely not trivial.

jcharlet avatar Feb 11 '20 23:02 jcharlet

Update

I've successfully run DROID GUI directly writing CSV results to the console, by just implementing a CSV result handler dao class and using it instead of the JDBC one. I also re-used the existing CSV item writer from the export module, so very little new code. So the principle seems solid, and doesn't require a lot of change.

I investigated the idea of creating profiles whose only output would be the results, with no database dependencies, or other on-disk dependencies. That started to look quite complex; profiles support reporting, exporting, saving and querying, and most of those things are tied to an on-disk representation in a temporary profile folder.

The simplest and safest approach is to continue to allow DROID to initialise a standard temporary profile folder, including an empty database, but then just to write the results out to a Writer (standard out or a file). DROID deletes the temporary profile folder when the profile is closed.

All the other functions with database or disk dependencies (e.g. reporting, exporting or saving the profile spec) still have a profile folder with an empty database, so they can't do anything useful, but they won't crash either, and we don't have to create lots of additional classes and reconfigure everything. In command line no-database profile mode, these functions will not be invoked in any case. Indeed, in any use-case where you don't write the results out to the database, you should not expect to be able to query or update it.

There's some additional refactoring required to obtain the list of formats. The database result handler can do this, the CSV one can't, so we need to abstract and refactor how formats are acquired to make this completely work.

nishihatapalmer avatar Feb 15 '20 20:02 nishihatapalmer

Turns out refactoring how formats are acquired is hugely complex. It hits into all kinds of horrid database initialisation problems and init fixup methods apparently caused by problems with spring not initialising things in the order you'd hope.

That code is a mess and and really should be refactored one day, but it's not necessary to achieve writing results out to a CSV file. Look at how database templates are initialized in the init() and initializeForNewTemplate() methods on the JDBCBatchResultHandler for an example of how messed up it is (e.g. using a mutable static flag to determine when it's safe to do certain operations). And why should a result handler be involved in setting up a database in the first place?

Instead of refactoring half the code base (well, many usages) to initialise databases and acquire formats sanely, it's going to be easier to just add a database dependency to the new CSV result handler dao (we can abstract the common functionality from the JDBCResultHandlerDao out into an abstract base class or inherit from it). This means all result handlers will have a database dependency to initialise database templates and acquire formats, but might need something else for result writing.

If we accept that a profile assumes a database, even if the results are never written to it, then this creates the least change for the result we want.

nishihatapalmer avatar Feb 16 '20 20:02 nishihatapalmer

@adamretter @jcharlet I have a spring question for you guys, as I'm really not an expert in it.

I have two result handler dao beans - one for JDBC output to database, and one to output CSV to standard out or a file. What's a good spring way to select the right bean to be used during profile initialization?

Right now, the JDBCBatchResultHandlerDao bean is set directly in the spring-results.xml file, as there was only ever that result handler dao previously. Should I create a factory class with properties passed in that select which bean is ultimately used? Or is there some other canonical spring way to select which bean is ultimately used?

nishihatapalmer avatar Feb 17 '20 19:02 nishihatapalmer

Current code is in my fork at https://github.com/nishihatapalmer/droid/tree/NoDatabaseProfiling

nishihatapalmer avatar Feb 17 '20 19:02 nishihatapalmer

@nishihatapalmer I prefer dependency injection from annotation, but it's the same as the one in place using xml. I wouldn't do anything smarter than you do already, probably inject the two beans in a service using the interface while giving the implementation (bean) name in the injection annotation, then use one or the other based on some condition. factory class sounds better, if it's not creating too much boilerplate code, in my humble opinion.

jcharlet avatar Feb 18 '20 10:02 jcharlet

right, the initialisations in the handler are... complex. So in the case of the GUI, we would use both handlers right, the jdbc one to enable all db related features (export, reporting), and the writerResultsHandlerDao for displaying the results?

jcharlet avatar Feb 18 '20 10:02 jcharlet

Yes, the whole dB initialisation is pretty hairy.

The idea would be that the JDBC would be used in the GUI, and the Writer handler would be used in the command line to enable pushing results directly to std out or a file, rather than a database.

Not much need for direct file output in the GUI I suspect (although it is a possibility). If you want the main profile window to have any visible results, it needs to go in the dB.

nishihatapalmer avatar Feb 18 '20 11:02 nishihatapalmer

@nishihatapalmer I am afraid that I stopped working with Spring in any serious capacity several years before I even contributed to DROID, so I am probably not the best person to ask about "Spring Best Practice" ;-)

adamretter avatar Feb 18 '20 12:02 adamretter

@adamretter A wise decision IMHO! I am in general not a fan of huge frameworks. They make easy things trivial, medium things hard, and hard things impossible! You spend more time fighting the framework and twisting the code to fit it than you gain by using it.

I'm sure there are useful frameworks out there. I have just found all the big all encompassing ones to be more trouble than they're worth.

nishihatapalmer avatar Feb 18 '20 13:02 nishihatapalmer

Managed to wire things in using a FactoryBean and a new profile property {$outputFilePath}.

If blank, it writes to the database as normal. If "stdout", it writes to the console, otherwise it creates a file at the file path provided.

Next step is to try this in the command line project.

nishihatapalmer avatar Feb 25 '20 16:02 nishihatapalmer

Hi @nishihatapalmer , we will start our new work on DROID by auditing the functionality and code discrepancies between GUI and CLI, to then merging the common codebase between them and fixing those inconsistencies. Any code or thoughts you'd like to share on this on top of those in that thread?

jcharlet avatar Jun 09 '20 11:06 jcharlet

Only a few comments; I haven't been coding much for the last few months, what with everything that's going on right now. I'll have to dig out all my old code to give a proper update.

To summarise where I got to:

  1. I still believe the correct way to merge the code bases is to essentially ditch most of the CLI code and make it use the code used by the GUI project. This was always the original intention - and that code wasn't supposed to be "GUI" code - it was just supposed to be DROID code, to be used by whatever interface needed to do DROIDy things.

  2. I have created code which replaces the database writer with a CSV file writer, which works in the GUI by specifying a new profile property $outputFilePath. This is really easy to do and involves almost no new code.

  3. I have realised that so many things assume that a profile has a database attached to it, that it's not worth trying to remove databases from profiles. It's simply easier to always have an empty database with a profile, but just to write out the results to a file instead of the database.

  4. The reason I submitted the PR https://github.com/digital-preservation/droid/pull/407 is to allow the command line to set profile properties. In order for the command line to use the profile code, it needs to set the $outputFilePath property. This also has the nice effect of allowing the command line to override things like maxBytesToScan and things like that.

Next steps

  1. On its own, this still wasn't enough, as $outputFilePath isn't currently a global profile property, so can't be passed in. Arbitrary properties can be added programmatically at runtime (which is how I hacked it into the GUI for experimentation), but there is no good mechanism to allow them to be user-specified at run time. So the command line (or the GUI for that matter) can't pass the one property it actually needs to currently to enable this functionality - assuming we want the user to be able to select the file path that is written to!

  2. I've played with expanding the command line properties to allow arbitrary properties to be passed in, but this requires lots of changes to existing classes and code paths. Maybe when I look at it again, it will all click into place and it will be easy... An alternative strategy is just to make the one property I really need to pass ($outputFilePath) a global property and be done with it. This is less flexible, but much, much easier to implement.

Summary I'm close to being able to use all the profile machinery from the command line writing out to a file instead of to the database. At that point, it would be possible to ditch the command line implementations of binary and container matching and archive handling.

Backwards Compatibility The current no profile mode outputs just the path and PUID. Do we need to preserve that behaviour, or would we just have a CSV output? I suspect most people will actually want more of the metadata, but we may need to consider a backwards compatible output mode. This would be very easy to do.

nishihatapalmer avatar Jun 09 '20 19:06 nishihatapalmer

I'll dig through my code to see what branches I have that I haven't pushed to my fork. Then you'll be able to review what I have. None of it is quite ready to be a PR yet.

nishihatapalmer avatar Jun 09 '20 19:06 nishihatapalmer

to be fixed with https://github.com/digital-preservation/droid/issues/470

jcharlet avatar Jun 10 '20 10:06 jcharlet

Brilliant Matt. We are now focusing on refactoring the CLI and GUI code, so this will be really helpful and we can carry on from there.

jcharlet avatar Jun 10 '20 10:06 jcharlet

Hi @nishihatapalmer Been trying to find where to override the bean definition of resultsDao to put the writer instead of the db one when using the CLI.

There are spring contexts for each module like ui-spring.xml which I expected they would import spring contexts of submodules like spring-results.xml. But instead, they register factory beans (like SpringProfileInstanceFactory) which when called, generate the modules beans contexts. Why make this so? Isn't it causing the issues you're having in #407 with some properties not being easily overridden? I understand you want to instantiate some beans for every new profile which are maybe not thread safe, but what about daos, shouldn't they be singletons registered once and for all in the app context?

I couldn't find your $outputFilePath property you mentioned being used in the GUI, where can I find that? But also I'm understanding for your latest comment on #407 that you haven't sorted this out just yet am I right? So in the current state, #407 won't help just yet with this issue. Do you have any idea how to use this resultsDao otherwise? I'll carry on searching this afternoon.

Matt you complained about Spring framework above, although I don't have your much larger experience, I agree with the fact that they make uncommon things hard to implement, but I strongly disagree on the overall benefits. If the project was running on springboot, we wouldn't be having this conversation, properties overriding through command line would be available out of the box. Context management would be sorted with a couple of conditional annotations on beans. If we don't abandon entirely some responsibilities to the framework (like IoC), and do a mix of custom code and framework use like here, we're bound to get in trouble, lose the benefits of using the framework, and make our lives harder indeed for anything involving the framework. I also think than keeping some of those libraries with a higher level of abstraction in place (rather than rewriting stuff at a lower level) make the application easier to maintain over time and might still be favourable to small performance gains, which feels paramount to me in this context of a project with both a very long life expectancy and a high developer turn over.

Not sure if this is the right place to debate on this, I should probably have bounced back right away on your conversation with Adam above. Matt, @adamretter , I'm really keen to hear your thoughts on this and better understand from your perspectives.

jcharlet avatar Jun 30 '20 12:06 jcharlet

Hi @jcharlet ,

I didn't write the spring implementation used in DROID and I'm certainly no spring expert. It does seem like it manages to make quite simple changes really hard. In fact I tend to dislike huge frameworks in general as I always seem to end up fighting them. Maybe I'd learn to love springboot, who knows?

The outputFilePath property was introduced in a couple of later commits (I've now pushed them to my branch), along with a FactoryBean to instantiate the right result Dao. You could only use it by hard coding this property, it wasn't properly wired in yet (in fact, the GUI is a red herring, it's just what I was using to test it). But if you set this property, you can make the profile output to a file rather than the database. There is a reason why this branch isn't yet a PR!

I've now pushed a few more commits that allow a ProfileInstance to set the outputFilePath property. My next step is to combine this with being able to set command line profile properties. If we can then set the outputFilePath via a command line property, I can begin the work of removing the special archive and container signature processing in the command line project.

nishihatapalmer avatar Jun 30 '20 17:06 nishihatapalmer

If you ever wanted to refactor the spring implementation in DROID to be sane, I'd love to see that. It does seem to manage to make simple things hard and hard things impossible, which I'm thinking isn't generally the idea of using things like spring.

My preferred method would just be to get rid of it entirely and instantiate everything in code with constructor injection, but then I've made my feelings on big frameworks known already!

nishihatapalmer avatar Jun 30 '20 17:06 nishihatapalmer

hi @nishihatapalmer ,

I'm sorry, I think Richard set up that spring system, and I'm sure he had really good reasons to follow that pattern in the first place. I got frustrated because it's nothing like I saw, and that makes this simple task way more complicated than it should.

haha indeed you made it very clear :), and I really agree with you, in the current state, Spring is not helping, on the contrary, I really don't like this setup. I am not even sure it would help to setup springboot since many of the things that it abstracts away are already implemented. But it's a good point, there is space for refactoring, maybe removing the Spring context initialisation might simplify things, but if I do it, I have to see how we can include that in our priorities.

Ok great thanks for those additions. Would it help if I merge the properties PR, and merge it into your branch? I know it's not enabling setting properties used in droid-results, but it can maybe get us a step closer.

jcharlet avatar Jul 03 '20 09:07 jcharlet

Ah nevermind I saw your comments on the PR I created, let's carry on the discussion there

jcharlet avatar Jul 03 '20 09:07 jcharlet

Yes, they do need to be merged at some point. I'm not a git expert though, so I will be guided by you.

My plan was to create another branch off my "no database profiling" branch (just to be safe), and merge in the profile properties PR branch. This would give me a merged branch, which I could then submit as a new PR once it was ready.

But... you have already created a PR from my incomplete branch, so I'm not sure what the best strategy is. I guess if all the same code is merged, it won't be hard to reconcile (plus or minus a few minor merge conflict differences I guess)...

nishihatapalmer avatar Jul 03 '20 16:07 nishihatapalmer

I'd rather merge your properties PR first, which I'm working on (but I'm facing random unit test fails on the CI with appveyor).

then merge/rebase master (including that properties PR) on your branch. Sorry I'm a bit picky and I prefer to rebase master on feature branches (rather than merge, to keep git history graph clearer), and I didn't want to risk losing stuff in the merge conflicts' resolution, this is why I created my own branch. Then I cherry picked your new commits. but we can just as simply use your branch, merge master on it, and create a PR from it (I will close the one I did)

jcharlet avatar Jul 06 '20 09:07 jcharlet