graphql-ruby icon indicating copy to clipboard operation
graphql-ruby copied to clipboard

Fields "missing" randomly

Open nesrual opened this issue 1 year ago • 10 comments

Describe the bug

I have a type which has the following field:

field :pit_logs, [Types::PitLogType], null: false, scope: false do
  argument :dates, [String], required: false, default_value: '[]'
end

The query works fine (most of the times):

{
  initiative(id: "902fba2e-08b9-4848-a994-d1dc02ab2c00") {
    id
    pitLogs(dates: ["2024-01-11", "2023-01-01"]) {
      date
      progress
    }
  }
}

And it returns the correct reply:

{
  "data": {
    "initiative": {
      "id": "902fba2e-08b9-4848-a994-d1dc02ab2c00",
      "pitLogs": [
        {
          "date": "2023-01-01",
          "progress": 100
        },
        {
          "date": "2024-01-11",
          "progress": 100
        }
      ]
    }
  }
}

However after some time I start to see the following errors returned from the GraphQL api:

{
  "errors": [
    {
      "message": "Field 'pitLogs' doesn't exist on type 'Initiative'",
      "locations": [
        {
          "line": 4,
          "column": 5
        }
      ],
      "path": [
        "query",
        "initiative",
        "pitLogs"
      ],
      "extensions": {
        "code": "undefinedField",
        "typeName": "Initiative",
        "fieldName": "pitLogs"
      }
    }
  ]
}

I have not been able to reproduce the bug in development. It only appears in production as far as I can tell.

Versions

    graphql (2.2.8)
    graphql-batch (0.5.4)
      graphql (>= 1.12.18, < 3)
      promise.rb (~> 0.7.2)
    graphql-c_parser (1.0.7)
      graphql
    graphql-persisted_queries (1.7.0)
      graphql (>= 1.12)

Ruby on Rails 7.1.3, Ruby version 3.2.3

GraphQL schema

Include relevant types and fields (in Ruby is best, in GraphQL IDL is ok). Any custom extensions, etc?

# frozen_string_literal: true

module Types
  class InitiativeType < Types::BaseObject
    pundit_role nil

    description 'Initiative'
    field :id, ID, null: false

    field :pit_logs, [Types::PitLogType], null: false, scope: false do
      argument :dates, [String], required: false, default_value: '[]'
    end
  end
end

GraphQL query

Example GraphQL query and response (if query execution is involved)

{
  initiative(id: "902fba2e-08b9-4848-a994-d1dc02ab2c00") {
    id
    pitLogs(dates: ["2024-01-11", "2023-01-01"]) {
      date
      progress
    }
  }
}
{
  "data": {
    "initiative": {
      "id": "902fba2e-08b9-4848-a994-d1dc02ab2c00",
      "pitLogs": [
        {
          "date": "2023-01-01",
          "progress": 100
        },
        {
          "date": "2024-01-11",
          "progress": 100
        }
      ]
    }
  }
}

and sometimes:

{
  "errors": [
    {
      "message": "Field 'pitLogs' doesn't exist on type 'Initiative'",
      "locations": [
        {
          "line": 4,
          "column": 5
        }
      ],
      "path": [
        "query",
        "initiative",
        "pitLogs"
      ],
      "extensions": {
        "code": "undefinedField",
        "typeName": "Initiative",
        "fieldName": "pitLogs"
      }
    }
  ]
}

Steps to reproduce

I can't reproduce it in development. It seems to be a problem that appears after some time so perhaps a caching issue?

Steps to reproduce the behavior

Expected behavior

I expect the correct data to be returned

Actual behavior

The correct data is returned sometimes - other times just an error message that the field cannot be found.

nesrual avatar Feb 15 '24 10:02 nesrual

Wow! That's so strange. Thanks for the detailed writeup.

My first idea is, maybe some types are being hidden from the client, so the field is being hidden. Do you have any visible? methods implemented in your schema? Or any other features that might be using GraphQL-Ruby's "visibility" feature (https://graphql-ruby.org/authorization/visibility.html)?

Another idea is, sometimes this happens when there's an error loading a source (.rb) file during the schema boot. But that would cause it to fail every time, not just sometimes. (That said -- do you run multiple server processes in production? Do you know if this error happens on one server but not other?)

I see that graphql-persisted_queries is in the gemfile, but as far as I know, it wouldn't affect this, because that gem caches query strings, but not results.

Let me know about visibility. Other than that, are there any other plugins or configurations in your schema, or in PitLog that you can share?

rmosolgo avatar Feb 15 '24 15:02 rmosolgo

Hi

I do not have any visible? methods implemented in any of my schemas. Actually we opt out of that feature in our schema:

use GraphQL::PersistedQueries, compiled_queries: true, store: :memory
use GraphQL::Batch

# Opt out of GraphQL-Ruby's visibility feature:
use GraphQL::Schema::AlwaysVisible

mutation(Types::MutationType)
query(Types::QueryType)

The application server we use is puma and it is started with the following parameters:

bundle exec puma -e production -t 1:5 -w 5 --pidfile tmp/puma.pid -S tmp/status.txt --preload

The pitLogs is as follows:

From the initiative model:

def pit_logs(dates: [])
  # Return PIT data for the given dates
  GetPitData.call(object: self, dates:).data
end

It simply returns objects as follows:

# frozen_string_literal: true

class PitLog
  attr_accessor :date, :progress, :weight, :period_progress_in_percent, :deviation_min, :deviation_max, :deviation_status, :finished_at, :is_finished, :is_finished_overdue
end

The pitLog type is as follows:

# frozen_string_literal: true

module Types
  class PitLogType < Types::BaseObject
    pundit_role nil

    description 'PIT data'
    field :date, String, null: false
    field :progress, Integer, null: false
    field :weight, Float, null: false
    field :finished_at, GraphQL::Types::ISO8601DateTime, null: true
    field :period_progress_in_percent, Integer, null: false
    field :deviation_min, Integer, null: false
    field :deviation_max, Integer, null: false
    field :deviation_status, String, null: false
    field :is_finished, Boolean, null: false
    field :is_finished_overdue, Boolean, null: false
  end
end

nesrual avatar Feb 16 '24 08:02 nesrual

I have investigated further. We are running several docker images with the same application deployed. It is random if a node returns the error or the correct data; so the issue is not due to one node returning an error and the other nodes returning correct data. It is completely random.

nesrual avatar Feb 16 '24 10:02 nesrual

After some time all the nodes will fail. So it might be some caching / parsing / stale object problem?

nesrual avatar Feb 16 '24 10:02 nesrual

The returned "documentation" from the server also changes; some times the pitLogsentry is in the docs; other times not.

Screenshot 2024-02-16 at 11 47 36

nesrual avatar Feb 16 '24 10:02 nesrual

Ok, thanks so much for sharing all these details. AlwaysVisible is certainly the most dubious of all the plugins -- I'm going to dig in there and see if I can spot any bugs that might make this fail sometimes...

rmosolgo avatar Feb 16 '24 14:02 rmosolgo

Feel free to contact me directly so you can have a deeper look into the source code. [email protected]

nesrual avatar Feb 16 '24 16:02 nesrual

One more clarifying question: when a node returns an error response, does it keep returning error responses after that? Is it "stuck" in that state, or does it go back to returning (mostly) correct data responses?

Also, was there any change to the system that made this start happening? (Eg gem upgrade, schema change?) Or did you just notice that it's been happening, but not sure when it started?

I've been investigating AlwaysVisible, but I just can't come up with anything that would create this irregular behavior -- no caching (that's the point). And besides, what would cause this to be hidden 🤔 I can only think of two reasons that would be true:

  • visible? has returned false -- but you said that there aren't any visibility implementations in your app. And besides, AlwaysVisible would skip those calls even if the method was defined.
  • The field really isn't defined. But obviously it is...
  • Ok, a third one: is the GraphQL Schema's internal list of fields getting messed up? Each type has a hash of fields (source), and if somehow that was modified during execution, maybe it would cause behavior like this. But how could it be ... I don't know of any API for removing entries from that Hash.

I'm at a loss for clues ... could you share the code of your BaseField and BaseObject classes? Maybe there's some clue in there. Feel free to email me ([email protected]) if you prefer.

rmosolgo avatar Feb 16 '24 18:02 rmosolgo

Hi

# frozen_string_literal: true

module Types
  class BaseObject < GraphQL::Schema::Object
    include GraphQL::Pro::PunditIntegration::ObjectIntegration
    pundit_role nil
    field_class Types::BaseField
  end
end
# frozen_string_literal: true

module Types
  class BaseField < GraphQL::Schema::Field
    # Add the Pundit integration:
    include GraphQL::Pro::PunditIntegration::FieldIntegration
    # By default, don't require a role at field-level:
    pundit_role nil
    argument_class Types::BaseArgument
  end
end

Please note that we are using GraphQL Pro.

nesrual avatar Feb 19 '24 08:02 nesrual

Thanks for sharing those. Unfortunately, I don't see any clues there at all 😖

As a matter of simplifying and ruling out some possibilities, could you deploy without AlwaysVisible and/or GraphQL::PersistedQueries? I think they're both performance improvements, but IIUC, the API should work with one or the other, or both, disabled for a short time. Would that be possible, even in a staging environment, if this error happens there?

Is it just this field that has this bug, or does it happen to other fields, too? If it's just this field, then I guess we should look for unique configurations that might affect just pitLogs. (Nothing in the code above gives me any clue ... maybe there's some global handling for things named "log"? Maybe some strange error somehow coming from GetPitData? But I don't even know how something from that runtime code could make a field disappear....) But, if it's happening to other fields too, then I guess we should look for something higher-level, that would affect all the fields.

This is a tough one 😖

rmosolgo avatar Feb 19 '24 17:02 rmosolgo

Update: I have run tons of tests via RSpec shooting off 100K requests to the endpoint without any issues. I have tested the application with production settings also and I couldn't replicate the issue. It makes me think that it might be an issue with the application server (Puma). So I have changed to Passenger for now and will see how that works compared to Puma. I literally can't figure out how the app and the GraphQL gem should suddenly change - so my main suspect is Puma and perhaps how threads are handled when spawned from the workers. I will let you know my findings.

nesrual avatar Feb 21 '24 09:02 nesrual

Sadly the bug persists:

{
  "errors": [
    {
      "message": "Field 'pitLogs' doesn't exist on type 'Initiative'",
      "locations": [
        {
          "line": 4,
          "column": 5
        }
      ],
      "path": [
        "query",
        "initiative",
        "pitLogs"
      ],
      "extensions": {
        "code": "undefinedField",
        "typeName": "Initiative",
        "fieldName": "pitLogs"
      }
    }
  ]
}

nesrual avatar Feb 21 '24 10:02 nesrual

Ok I found the error!

When an object is updated we send out requests to connected clients to fetch queries from the backend. In order to do that we construct the GraphQL query to be performed by the webapp (Angular) to the backend.

For a given object I have an interactor that will return the GraphQL query. However I filter out some fields that should not be included in the query.

How to replicate the bug.

Let us check to see if the Types::InitiativeType has a pitLogs key (it should have):

irb(main):036> "Types::InitiativeType".constantize.own_fields.keys.include?("pitLogs")
=> true

Now lets do the same except filtering out the pitLogs field:

irb(main):037> "Types::InitiativeType".constantize.own_fields.except!("pitLogs").keys.include?("pitLogs")
=> false

So far so good. But now when I check to see if pitLogs is a key I get this:

irb(main):038> "Types::InitiativeType".constantize.own_fields.keys.include?("pitLogs")
=> false

It should return true in this case!

Reloading the app solves the issue:

irb(main):039> reload!
Reloading...
=> true
irb(main):040> "Types::InitiativeType".constantize.own_fields.keys.include?("pitLogs")
=> true

Does the .constantize in any way alter the definition of Types::InitiativeType - it seems like it is the case?

nesrual avatar Feb 21 '24 10:02 nesrual

.except! is typically used with hashes when you want to modify the original hash directly. Be cautious with .except! since it changes the hash it's called on, which might affect other parts of your code relying on the original state of the hash.

So I guess that it would be correct to use .except instead of .except! since the latter would change the keys of the Types::InitiativeType and thereby cause this error.

It would be ok to close this ticket now since it is not an error in the gem.

nesrual avatar Feb 21 '24 12:02 nesrual

use .except instead of .except!

Yes, I'd agree that's the best way to go forward. It's nice to make sense of what was going on! Thanks for sharing all your findings.

As of .constantize, I wouldn't expect it to modify the definition, but reload would certainly do the trick. reload would tell Rails to forget all the constants it loaded and rebuild them from source. So that would make the definition's own_fields hash go back to its initial state.

For the record, one possibility here would be for GraphQL-Ruby to call freeze on all its definition state objects (like this Hash), but the challenge is that we can't really be sure when definitions are "finished". Ruby being Ruby, there are always ways to continue modifying classes. It might be possible to add .freeze/.dup to public APIs for modifying definitions, so that only those intended methods could be used, but I don't think that'd be work the overhead right now. (It would add overhead to definiting the schema during application boot, slowing down development and deployment.)

rmosolgo avatar Feb 21 '24 16:02 rmosolgo