lucene-solr icon indicating copy to clipboard operation
lucene-solr copied to clipboard

SOLR-14701: GuessSchemaFields URP to replace AddSchemaFields URP in schemaless mode

Open arafalov opened this issue 4 years ago • 13 comments

#Description

New URP with tests. Everything is based on AddSchemaFieldsUpdateProcessorFactory but is sufficiently different to be a separate entry for backward compatibility.

Solution

This solves the problem with original solution by:

  • splitting accumulation of information about data seen (in processAdd) and actual schema modification (in commit)
  • it uses parameter flag instead of disabling the whole chain, which causes problems if you do want to parse dates
  • it supports parameter widening for numeric types to achieve sane results (e.g. Integer promoted to Double)
  • It tracks multiplicity of values, so allows the baseline fieldTypes to be single-valued and fields declare multiValued as appropriate

It removes

  • Sub-selecting which fields this process applies to. Mostly because we are now much more explicit about it being a learning schema. But also because nobody seems to be using that. Nor was it tested.
  • Default options. They were interactive in non-predictable ways and the only sane option is mapping for String type anyway.

Not in this PR

  • RefGuide
  • New Examples
  • Change file notice They should be done together in a separate Jira, as they may need additional discussion.

Tests

The existing schemaless tests were copied and adjusted to work with new limitations. A couple more tests could be added later to test better for multiplicity (already tested with xpath), but also to test type widening more directly.

Checklist

Please review the following and check all that apply:

  • [x] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • [x] I have created a Jira issue and added the issue ID to my pull request title.
  • [x] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • [x] I have developed this patch against the master branch.
  • [x] I have run ./gradlew check.
  • [x] I have added tests for my changes.
  • [-] I have added documentation for the Ref Guide (for Solr changes only).

arafalov avatar Sep 13 '20 03:09 arafalov

@arafalov Can I see an example of how a user would use this?

noblepaul avatar Sep 22 '20 12:09 noblepaul

@noblepaul Good point:

  1. bin/solr start
  2. Edit the default configset's add-schema-fields definition to
    • Replace AddSchema... with GuessSchema... URP
    • Remove default parameter (code automatically widens to "String" as last step)
    • Remove Integer mapping, where Long already exists (code automatically widens to nearest acceptable definition for numerics)
  3. bin/solr create -c guess
  4. bin/post -c guess -params "guess-schema=true" example/exampledocs/*.xml
  5. As post does indexing and then commit, it triggers both phases of the URP
  6. At the point, the schema is created (based on all data) but nothing is actually indexed; the user can review and adjust the schema without any issues
  7. Flag value can actually be anything for now, in the future I was thinking that maybe this could be a tag to mark new fields (see SOLR-14855)
  8. bin/post -c guess example/exampledocs/*.xml
  9. Now the data is entered and can be searched
  10. bin/post -c guess -params "guess-schema=true" example/films/films.json
  11. bin/post -c guess example/films/films.json
  12. This shows that we don't need the "extra steps" instructions, the films example current requires.

arafalov avatar Sep 22 '20 13:09 arafalov

At the point, the schema is created (based on all data) but nothing is actually indexed; the user can review and adjust the schema without any issues

How do you expect the schema to be adjusted? User has no idea what just happened

I'm assuming Step 2 won't be required as it will be built into the end point.

Is it possible to just return the commands to create a schema?

example

bin/post -c guess -params "guess-schema=true" example/films/films.json

the response can be

curl -X POST -H 'Content-type: application/json' -d '{"add-field":[
{
"name":"id",  
"type":"string",
 "stored":true },
{
"name":"desc",  
"type":"text",
 "stored":true
 }
]}' http://localhost:8983/solr/guess/schema

The user has the following choices

  1. Run the command as is
  2. edit the command and post it

My point is , our guess schema is worse than useless and anything that it creates will be WRONG for 99% of users. Why bother creating the fields in schema when we know fully well that user will have to edit the schema in ZK soon after

noblepaul avatar Sep 23 '20 00:09 noblepaul

Strong words there "worse than useless", especially considering that this - to me - seems a strong improvement on the current schemaless mode as it looks at more values and actually supports single/multivalued fields.

In general, I was trying to implement Hoss's proposal, but I am open to the other ideas, if we can clarify the use case.

My understanding is that the use case is of having a lot of data that one does not quite know the shape off. So, they want to index it quickly, explore and then do some manual adjustments. I am not expecting this to be anywhere near production. Schemaless mode should not have been either.

I am not sure how many people will know how to do step 6, but currently they don't even have that option. Switching from single-value to multi-value is impossible (very hard?) once the actual values are in the index. One has to basically delete everything and start again. As happens in the films example, if one misses the README. With this one, they can look at field definitions in Admin UI and remove or add fields as required without underlying lucene indexes throwing complains.

The way I am seeing this (as well as for other example) is to have a super minimal learning configuration where every additional field is quite obvious. That learning schema, clearly, would not need the step 2 as it would be all setup. I thought your question was about how you would test the code for yourself.

Additionally, to help see what was changed, I think the tag JIRA could be helpful. And frankly, in my imagination, it is not a cloud setup, but a simple learning one. Whether that, by itself, is a breaking point for you, we shall have to see.

Generating Schema JSON raises its own questions, such as the shape of the schema it will be applied to, as guessing is currently happening as a differential to the existing schema. Also, this does not seem like the code that should be in this particular URP, but more of a general utility. If one existed, maybe it would make sense to leverage on top of it.

In general, I am open to implement it any way that seems most useful. I will wait for another couple of opinions rather than chasing one very strong one.

arafalov avatar Sep 23 '20 07:09 arafalov

Also, I am not even sure there is a pathway to return a non-error message from commit that bin/post will echo to the user as a positive statement. For queries, yes. But we are talking an Update handler with a URP chain.

arafalov avatar Sep 23 '20 07:09 arafalov

Strong words there "worse than useless", especially considering that this - to me - seems a strong improvement on the current schemaless mode as it looks at more values and actually supports single/multivalued fields.

I'm sorry for the confusion.

I was referring to the current solution we have in Solr (schemaless, guess schema thing) . It's not a comment on the new solution. The current schemaless is indeed worse than useless

Generating Schema JSON raises its own questions, such as the shape of the schema it will be applied to, as guessing is currently happening as a differential to the existing schema.

The command is only relevant for that moment. If you execute it right away, it's useful. Users most likely will just copy paste the command (and edit, if required)

noblepaul avatar Sep 23 '20 07:09 noblepaul

Ok, I am glad we are on the same page that the current (let's call it Add) solution is rather bad despite all the great work put into it. Let's now get onto the same page about the next step you are actually proposing. I can read the rest of your statement in one of the following ways:

  1. Neither original Add nor proposed Guess solutions will address problem. Next step: that discussion is not about code and should be taken up in the parent JIRA. That's exactly what it is there for and this code/PR is here to push the discussion from theoretical to practical.
  2. Guess approach is ok overall, but the schema creation is still bad, could it return schema generation commands instead. I just double-checked code and there is no way for the current architecture to return non-error feedback (from either processCommit or SimplePostTool side). Next step: Propose a way this could be done. Do note that the reason we are still an URP is because any schema guessing or creation depends on previous chain URPs to be always enabled (e.g. for custom dates formats); that is one of the things really broken with enable/disable flag for Add solution and why I am doing the single-URP level flag.
  3. We need some other Guess approach. Next action: propose alternative architecture, preferably as straw-man implementation. This would give people on JIRA a chance to select from TWO ways forward, that would be amazing whether we end on one, another or merged solution.
  4. ??? Use veto and keep status quo until somebody yet different has a much better idea than people in last 3 JIRA?
  5. ??? (I don't claim to read your mind, but I want to move this discussion forward in concrete non-blocking steps)

arafalov avatar Sep 23 '20 14:09 arafalov

Purely responding to the URP response part, it’s definitely not possible for URP to send non-error responses. I do think its something we should implement though, since it will expand the use cases that URPs can solve. Ill create a JIRA for it.

HoustonPutman avatar Sep 23 '20 14:09 HoustonPutman

Purely responding to the URP response part, it’s definitely not possible for URP to send non-error responses. I do think its something we should implement though, since it will expand the use cases that URPs can solve. Ill create a JIRA for it.

It may be possible to future proof this implementation by making guess-schema being a mode switch, instead of current present/absent flag. So, maybe rename it to guess-mode instead with options of

  • update - current (only) option basically,
  • show - (if/when there is a way to return suggested JSON),
  • update-all - (if we wanted to - sometimes - have specific fields even if dynamicField definition matches; could be done now if useful,
  • none to support tools easier.

arafalov avatar Sep 23 '20 18:09 arafalov

I recommend a new request handler such as /update/guess-schema

This way we do not need to add any new functionality, nor do we need to pass any extra params

curl -F '[email protected]' http://localhost:8983/gettingstarted/update/guess-schema

The response can be

curl -X POST -H 'Content-type: application/json' -d '{"add-field":[
{
"name":"id",  
"type":"string",
 "stored":true },
{
"name":"desc",  
"type":"text",
 "stored":true
 }
]}' http://localhost:8983/solr/gettingstarted/schema


noblepaul avatar Sep 25 '20 05:09 noblepaul

This may be me wearing my "marketing hat", but I think I was somewhat resistent to this whole idea. This morning I was thinking... Maybe I just don't like the word "guess", it has some negative connations when it comes to data and schemas and sotware, though may be accurate! What about Explore-Schema or Predict-Schema ?

epugh avatar Sep 25 '20 10:09 epugh

@noblepaul I don't think your proposal is fully thought out.

  1. This seems not orthogonal to being able to send CSV/JSON/XML/DIH to it, if you are proposing for it to be another one of pathVsLoaders
  2. The whole 'I give you schema' proposal ignores the fact that DateParser or BoolParser URP present in guess process also needs to be present in whatever schema you send it to. That 'curl' command is hiding the user from the actual issues the guessing is supposed to help with.
  3. Even your types "string vs text" is not something that can currently can guess.

Can you do a counter-proposal in code that skips all the guessing and just shows this send/return path? But is still in the execution path to take the concerns above into the account. I could not find such a place.

@epugh Predict works for me. Explore feels a bit too general and confusing (more interactive/UI feel). But, in general, I am not stuck on Guess at all.

arafalov avatar Sep 25 '20 11:09 arafalov

I’ve always thought “guess" was mostly a reflection of how inadequate us devs thought the whole process was. I agree its not very confidence-inspiring…

Off the top of my head, the idea of a new handler has a lot of appeal. It’s easy to get tell a user “index as many training documents as you want to this handler to create your schema. NO DOCUMENTS WILL BE INDEXED during this stage, this is for refining your schema. When you believe you’ve send enough training documents for Solr to have a reasonable chance of defining the correct schema, you must send the documents again, this time to the update handler after reloading the collection to be able to search them”. Maybe “schema-trainer”?

It seems to me that there could be a some efficiencies here.

1> Could we bypass reloading the collection each time? We’d have to read the schema directly and modify it. Or at least only reload it at the end of each batch if there were changes.

2> The response could indicate whether there were any changes to the schema. I can imagine a process whereby I send 10 docs, see there were changes. Then send 10 more and see there were changes. Rinse/repeat until I’d been able to send N batches without any changes.

3> This still isn’t going to bullet-proof semi-structured docs. Or any other really, but it’ll at least make things far more robust. What happens if I index 1,000 Word docs, call it good, then index PDFs. Or PNGs or…. anyway, I try to index some new type of document. Or do we change the update handler to put all unrecognized fields into a text field? Probably should log warnings that this is the case.

4> We’d have to throw an error if the training handler was used after there were documents in the index and any existing field needed to be modified. If we did that users could try to send docs through the training handler at any time, which would partially handle <3>.

5> One tricky bit would be how to train on a bunch of documents after some docs were already indexed. Prior to indexing any docs, we can freely modify existing fields in the schema. Back to <3>. I’m indexing all the Word docs just fine. Now I want to index PDFs and they have new fields, so I want to throw a bunch of them at the training handler. If I send them one-by-one, how to distinguish between a field that had been defined that no currently-indexed documents use and can be changed .vs. one that has documents indexed against it? This’ll bear some thought because the replica processing the training could theoretically examine the local index and see no docs using that field but some replica of some other shard does have one or more docs using the field. And in the case where implicit routing is used or even composite keys this gets worse. This either gets really complicated or we make a rule like “if you use the training handler after documents are indexed, you get one chance to send a batch of documents. Training will fail if any existing field needs to be modified”.

6> We’d be able to confine field guessing exclusively to the training handler.

Erick

On Sep 25, 2020, at 7:38 AM, Alexandre Rafalovitch [email protected] wrote:

@noblepaul I don't think your proposal is fully thought out.

• This seems not orthogonal to being able to send CSV/JSON/XML/DIH to it, if you are proposing for it to be another one of pathVsLoaders • The whole 'I give you schema' proposal ignores the fact that DateParser or BoolParser URP present in guess process also needs to be present in whatever schema you send it to. That 'curl' command is hiding the user from the actual issues the guessing is supposed to help with. • Even your types "string vs text" is not something that can currently can guess. Can you do a counter-proposal in code that skips all the guessing and just shows this send/return path? But is still in the execution path to take the concerns above into the account. I could not find such a place.

@epugh Predict works for me. Explore feels a bit too general and confusing (more interactive/UI feel). But, in general, I am not stuck on Guess at all.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

ErickErickson avatar Sep 25 '20 13:09 ErickErickson