
🚀💻 – resolve frictions in frictionless spec

Open botanize opened this issue 3 years ago • 5 comments

Describe the feature you want and how it meets your needs or solves a problem

As a spec writer I want to communicate features of the spec, without creating hidden dependencies or requirements because of strange interpretations within frictionless.

I'm currently running up against two strange design choices in frictionless while trying to create sample data:

  1. Optional fields in frictionless tables are interpreted as required in the file AND nullable. In GTFS, an optional (not-required) field is optional in the file AND nullable. I think most people find the GTFS usage more natural. There's currently no way to say a field is optional in a file. This leads to validation failing if an "optional" field is missing, because the frictionless validator requires all fields defined in a schema to be present, in the same order they're defined. Update: you can correctly validate a file without optional fields using the --schema-sync option, e.g., frictionless validate --schema-sync samples/50027-MetroTransitMN/datapackage.json.
  2. (Resolved via #81.) A foreign key in a table schema suggests that there's an optional link with another table, but in fact frictionless enforces that the other table exists. For example, trip_id_performed in vehicle_locations is defined as a foreign key on trips_performed.trip_id_performed. That means that if you provide vehicle_locations.csv in your data package, you MUST also provide trips_performed.csv, operators.csv (via trips_performed), vehicles.csv, devices.csv and train_cars.csv (via devices), even if you only provide the required fields, which contain no information not already in vehicle_locations.csv. I'm not even sure you can provide only the required information. For example, if you don't run train service but provide vehicle_locations.csv, the foreign key on devices requires you to provide devices.csv. Since devices also has a foreign key on train_cars, you're then required to provide train_cars.csv, yet train_cars.train_car_id is required even though you don't have any train cars, let alone any train car ids that would necessitate an entire train_cars.csv. This is clearly not our intention when specifying foreign keys! I'm totally happy to run some aggregation of vehicle_locations.csv on vehicle_id without knowing anything else about the vehicle.

Unintended foreign key dependencies:

  • fare_transactions:
    • vehicles
    • trips_performed
      • operators
    • devices
  • passenger_events:
    • vehicles
    • trips_performed
      • operators
    • devices
    • train_cars
  • vehicle_locations:
    • vehicles
    • trips_performed
      • operators
    • devices
      • train_cars
  • stop_visits:
    • vehicles
    • trips_performed
      • operators
  • trips_performed:
    • vehicles
    • operators
  • devices:
    • vehicles
    • train_cars
  • vehicle_train_cars:
    • vehicles
    • train_cars
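
To make the knock-on effect concrete, here is a minimal sketch that computes which tables become mandatory once a given table is provided. It assumes that the top-level bullets under each table in the list above are its direct foreign-key references; the dictionary and the function name required_tables are illustrative only and are not part of frictionless or the TIDES spec.

```python
# Direct foreign-key references, transcribed from the list above
# (assumption: top-level bullets under a table are its direct references).
DIRECT_REFERENCES = {
    "fare_transactions": ["vehicles", "trips_performed", "devices"],
    "passenger_events": ["vehicles", "trips_performed", "devices", "train_cars"],
    "vehicle_locations": ["vehicles", "trips_performed", "devices"],
    "stop_visits": ["vehicles", "trips_performed"],
    "trips_performed": ["vehicles", "operators"],
    "devices": ["vehicles", "train_cars"],
    "vehicle_train_cars": ["vehicles", "train_cars"],
}


def required_tables(table, references=DIRECT_REFERENCES):
    """Return every table that must exist if `table` is provided and all
    foreign keys are enforced (the transitive closure of its references)."""
    required = set()
    stack = list(references.get(table, []))
    while stack:
        current = stack.pop()
        if current in required:
            continue  # already visited via another path
        required.add(current)
        stack.extend(references.get(current, []))
    return required


print(sorted(required_tables("vehicle_locations")))
# ['devices', 'operators', 'train_cars', 'trips_performed', 'vehicles']
```

Providing vehicle_locations alone pulls in five other tables, which matches the chain described in item 2.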

Describe the solution you'd like

I don't see one, but I'd like to hear some ideas.

I filed an issue for the "not-required field is required in the file" problem, but the initial response is not promising.
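
For reference, the GTFS-style reading of "optional" can be sketched in a few lines of plain Python. This is not how frictionless works and it does not use the frictionless API; the schema fragment, the field names other than vehicle_id, and the function check_optional_columns are hypothetical and only illustrate the semantics we'd like: required fields must be present and non-empty, optional fields may be absent from the file altogether.

```python
import csv
import io

# Hypothetical, trimmed-down schema for illustration only.
SCHEMA_FIELDS = [
    {"name": "location_ping_id", "constraints": {"required": True}},
    {"name": "vehicle_id", "constraints": {"required": True}},
    {"name": "speed"},  # optional: may be absent from the file entirely
]


def check_optional_columns(csv_text, fields=SCHEMA_FIELDS):
    """Validate with GTFS-style semantics: required fields must be present
    and non-empty; optional fields may be missing or empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    required = [
        f["name"] for f in fields if f.get("constraints", {}).get("required")
    ]
    errors = []
    for name in required:
        if name not in header:
            errors.append(f"missing required column: {name}")
    # The header is line 1 of the file, so data rows start at line 2.
    for row_number, row in enumerate(reader, start=2):
        for name in required:
            if name in header and not row[name]:
                errors.append(f"row {row_number}: empty required field {name}")
    return errors
```

With this reading, a file whose header is just location_ping_id,vehicle_id passes even though speed is declared in the schema.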

I should probably file an issue for the "foreign key creates a dependency on an additional file" problem. The problem seems to be that the foreign key definition in the table schema links not to another schema but to a data resource. This is nonsensical: a schema should never have a relationship to a data resource, only to other schemas.

Describe alternatives you've considered

  • submit PRs to frictionless to clarify the spec language and fix the validation behavior
  • fork the frictionless validator and modify it so that it behaves the way we expect
  • find an alternate representation

Additional context

botanize avatar Oct 08 '22 02:10 botanize

I see and agree with the issue. Without knowing much about frictionless (I hadn't heard of it before this repo), it appears from your description of it that it is aimed at describing relational data as it would exist in a database, where tables do need to exist before foreign keys are created, and where the existence of columns is fixed, not optional depending on whether the data contains entries for the column. I remember that back when I first started working with GTFS I found the flexibility of the schema odd and hard to work with, because you couldn't write a simple parser assuming a predictable number and order of columns. Most datasets I work with have a more rigid schema.

In this particular case, I don't mind so much that all columns have to be specified, although I can see how it could be an annoyance to those used to working with GTFS or JSON data, and wouldn't mind if flexibility were allowed. The foreign key issue, on the other hand, is unworkable.

I see three alternatives:

  1. From a relational database perspective, what we're calling a foreign key is not, strictly speaking, a foreign key. The issue is one of semantics, and the solution would be to not define foreign keys at all. The downside is that we do want to document a relationship between columns across tables, one that is very similar to a foreign key but not as strict. We'd have to document that another way, not using frictionless.
  2. The concept of a loose and unchecked foreign key does come up in other cases, and it would be useful for many use cases, not just TIDES, to have a "key" that is there just to document a relationship rather than to enforce it. We could add that concept to frictionless through a contribution, or a fork if necessary. A contribution would be preferred so that we don't have to maintain the fork.
  3. We could write our own validator. This is more work, but in time I suspect we will want to do this anyway because there will be lots of subtle, situation-dependent things to check in the data, and the kind of logic that would require is beyond the scope of frictionless.

gabriel-korbato avatar Oct 11 '22 13:10 gabriel-korbato

> 2. The concept of a loose and unchecked foreign key does come up in other cases, and it would be useful for many use cases, not just TIDES, to have a "key" that is there just to document a relationship rather than to enforce it. We could add that concept to frictionless through a contribution, or a fork if necessary. A contribution would be preferred so that we don't have to maintain the fork.

We could move the foreign key definitions to a custom field, e.g., _foreignKeys. The foreign keys could be used for documentation, or custom validation, without causing the validator to enforce their existence.
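
A minimal sketch of that move, operating on a table schema loaded as a plain dict. The helper name demote_foreign_keys is made up for illustration, and the sketch assumes that a validator simply ignores properties it doesn't recognize, which I haven't verified for frictionless.

```python
import copy


def demote_foreign_keys(schema, custom_key="_foreignKeys"):
    """Return a copy of a table schema with `foreignKeys` moved to a custom
    property, so the references stay documented but (by assumption) a
    validator that only knows `foreignKeys` no longer enforces them."""
    result = copy.deepcopy(schema)
    if "foreignKeys" in result:
        result[custom_key] = result.pop("foreignKeys")
    return result


schema = {
    "fields": [{"name": "trip_id_performed", "type": "string"}],
    "foreignKeys": [
        {
            "fields": "trip_id_performed",
            "reference": {
                "resource": "trips_performed",
                "fields": "trip_id_performed",
            },
        }
    ],
}
demoted = demote_foreign_keys(schema)
print(sorted(demoted))  # ['_foreignKeys', 'fields']
```

The original schema is left untouched, so the strict and the relaxed variant could be generated from the same source.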

> 3. We could write our own validator. This is more work, but in time I suspect we will want to do this anyway because there will be lots of subtle, situation-dependent things to check in the data, and the kind of logic that would require is beyond the scope of frictionless.

Or fork the existing one. Definitely more work! I'm not sure at this point what kinds of special validation we could need. One thing you can do in frictionless is define custom types. We could use that to define a 28 hour transit clock (or longer). I don't know how or if custom types are currently validated, so that's one area of potential development.
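
As a rough idea of what validating such a clock could look like, here is a standalone sketch. It does not use any frictionless custom-type mechanism; the function name parse_transit_time, the HH:MM:SS format, and the reading of "28 hour clock" as 00:00:00 through 27:59:59 are all assumptions.

```python
import re

# Hours may run past 23; minutes and seconds stay within 00-59.
_TRANSIT_TIME = re.compile(r"^(\d{1,3}):([0-5]\d):([0-5]\d)$")


def parse_transit_time(value, max_hours=28):
    """Parse 'HH:MM:SS' where HH may exceed 23, returning seconds since the
    start of the service day. Raises ValueError on malformed input."""
    match = _TRANSIT_TIME.match(value)
    if match is None:
        raise ValueError(f"not a transit time: {value!r}")
    hours, minutes, seconds = (int(part) for part in match.groups())
    if hours >= max_hours:
        raise ValueError(f"hour {hours} exceeds the {max_hours}-hour clock")
    return hours * 3600 + minutes * 60 + seconds


print(parse_transit_time("25:30:00"))  # 91800
```

The max_hours parameter covers the "or longer" case without changing the parsing logic.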

botanize avatar Oct 11 '22 17:10 botanize

> 2. The concept of a loose and unchecked foreign key does come up in other cases, and it would be useful for many use cases, not just TIDES, to have a "key" that is there just to document a relationship rather than to enforce it. We could add that concept to frictionless through a contribution, or a fork if necessary. A contribution would be preferred so that we don't have to maintain the fork.

> We could move the foreign key definitions to a custom field, e.g., _foreignKeys. The foreign keys could be used for documentation, or custom validation, without causing the validator to enforce their existence.

Or add a boolean property of the foreignKey object: enforced = [true | false], but that would require modifying frictionless.
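
A sketch of how a validator might honor such a flag. The enforced property does not exist in frictionless today; the function foreign_key_errors is hypothetical, handles single-field keys only, and treats a missing flag as enforced so that current behavior is the default.

```python
def foreign_key_errors(child_rows, foreign_key, resources):
    """Check one foreign key against the resources in a package.

    `resources` maps resource name -> list of row dicts. A key whose
    hypothetical `enforced` flag is False is only checked when the
    referenced resource is actually present.
    """
    reference = foreign_key["reference"]
    parent_rows = resources.get(reference["resource"])
    if parent_rows is None:
        if foreign_key.get("enforced", True):
            return [f"missing resource: {reference['resource']}"]
        return []  # documented relationship only; nothing to check against
    known = {row[reference["fields"]] for row in parent_rows}
    field = foreign_key["fields"]
    # Empty values are skipped: a nullable key has nothing to look up.
    return [
        f"{field}={row[field]!r} not found in {reference['resource']}"
        for row in child_rows
        if row.get(field) not in (None, "") and row[field] not in known
    ]
```

One design choice here is that an unenforced key is still checked when the referenced file happens to be provided; whether that is desirable is open for discussion.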

> 3. We could write our own validator. This is more work, but in time I suspect we will want to do this anyway because there will be lots of subtle, situation-dependent things to check in the data, and the kind of logic that would require is beyond the scope of frictionless.

> Or fork the existing one. Definitely more work! I'm not sure at this point what kinds of special validation we could need. One thing you can do in frictionless is define custom types. We could use that to define a 28 hour transit clock (or longer). I don't know how or if custom types are currently validated, so that's one area of potential development.

I was thinking of complex row identification constraints, such as "either A or (B and C) must be defined and unique". Or checks dependent on metadata: many kinds of data at different stages of processing will be converted to TIDES, and it could be that we want to relax some constraints for unprocessed data, so one could imagine validation of a table dependent on the values in some other table.
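
To illustrate the first kind of constraint, here is a small sketch of an "either A or (B and C) must be defined and unique" check over rows held as dicts. The field names A, B and C are the placeholders from the sentence above, the function row_identity_errors is made up, and the sketch assumes "unique" means that no two rows share the same identifying value(s).

```python
def row_identity_errors(rows, single="A", pair=("B", "C")):
    """Check that each row is identified by `single`, or failing that by
    both fields in `pair`, and that no identity occurs twice."""
    seen = set()
    errors = []
    for number, row in enumerate(rows, start=1):
        if row.get(single) not in (None, ""):
            identity = (single, row[single])
        elif all(row.get(name) not in (None, "") for name in pair):
            identity = (pair, tuple(row[name] for name in pair))
        else:
            errors.append(f"row {number}: needs {single} or both of {pair}")
            continue
        if identity in seen:
            errors.append(f"row {number}: duplicate identity {identity}")
        seen.add(identity)
    return errors
```

A check like this needs to see the whole table at once, which is part of why it doesn't fit neatly into per-field constraints.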

gabriel-korbato avatar Oct 11 '22 18:10 gabriel-korbato

A few thoughts:

  1. Frictionless is designed to be extensible, so we should extend it if and when we need it to satisfy a use case
  2. It wasn't trivial, but also not difficult, to write my own validator for a frictionless spec. I did this for GMNS before the frictionless framework was well-developed (and it took forever!). Other validation packages like pandera now support reading frictionless schemas and could also be used if they are extensible (note: I haven't looked into this, so it might be a nightmare). We could also fork the frictionless framework.
  3. I agree that this seems like a fairly common use case, and one that frictionless should support in general.
  4. Perhaps instead of calling it a foreign key, we should call it a foreign reference?

e-lo avatar Oct 11 '22 20:10 e-lo

> 4. Perhaps instead of calling it a foreign key, we should call it a foreign reference?

Yes, it's really not a foreign key, at least not as commonly meant in DB jargon, so maybe a different name would clear that up and make the update palatable to the maintainers of frictionless. Alternatively, we could suggest an enforced property, which should also be accepted as reasonable.

gabriel-korbato avatar Oct 11 '22 21:10 gabriel-korbato