open-bus
SiriRide calculated attributes - v1
Following our last discussions and the restart of the SiriRide entity task, I'm creating this issue to list the planned SiriRide calculated attributes (v1).
I have separated them into different classes according to their complexity:
- Class 1: SiriRide Raw data.
- Class 2: Simple calculations over Siri data.
- Class 3: Simple calculations over GTFS data.
- Class 4: Complex calculations - ride level.
- Class 5: Complex calculations - aggregated.
- Class 6: Models.
Class | Attr name | Attr desc | Comments |
---|---|---|---|
1 | agency_id | ||
1 | route_id | ||
1 | route_short_name | ||
1 | bus_id | ||
1 | planned_start_date | ||
1 | planned_start_time | ||
1 | points_time_list | list of points timestamp | by time_recorded |
1 | points_latlon_list | list of points latlon | |
2 | points_cnt | number of Geo points in SiriRide | |
3 | ride_in_gtfs | specific ride is listed in the GTFS | agency_id + route_id + planned_start_date + planned_start_time |
3 | ride_date_in_gtfs | ride date is listed in the GTFS | agency_id + route_id + planned_start_date |
3 | ride_route_in_gtfs | ride route is listed in the GTFS | agency_id + route_id |
3 | ride_agency_in_gtfs | ride agency is listed in the GTFS | agency_id |
4 | stops_matching_pct_500 | stops match percentage with buffer of 500m over each stop | will be calculated only for ride_date_in_gtfs = 1 |
4 | stops_matching_pct_1000 | stops match percentage with buffer of 1000m over each stop | will be calculated only for ride_date_in_gtfs = 1 |
4 | start_time_est | estimated ride start time in first station | will be calculated only for X% of stops_matching_pct_1000 |
4 | end_time_est | estimated ride end time in last station | will be calculated only for X% of stops_matching_pct_1000 |
4 | driving_time_est | estimated driving time from first station to last station | will be calculated only for X% of stops_matching_pct_1000 |
4 | driving_speed_est | estimated average driving speed from first station to last station | will be calculated only for X% of stops_matching_pct_1000 |
It's a very initial list. Please edit it with your own insights.
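To make the `stops_matching_pct_*` attributes more concrete, here is a minimal sketch of one possible reading: the percentage of GTFS stops that have at least one SiriRide point within the buffer radius. The function names, the haversine distance and the input shapes (lists of `(lat, lon)` tuples) are assumptions for illustration only, not an agreed implementation.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def stops_matching_pct(points_latlon_list, stops_latlon_list, buffer_m):
    """Percentage of stops with at least one ride point within buffer_m meters."""
    if not stops_latlon_list:
        return None  # nothing to match against
    matched = sum(
        1
        for s_lat, s_lon in stops_latlon_list
        if any(
            haversine_m(s_lat, s_lon, p_lat, p_lon) <= buffer_m
            for p_lat, p_lon in points_latlon_list
        )
    )
    return 100.0 * matched / len(stops_latlon_list)
```

Under this reading, `stops_matching_pct_500` and `stops_matching_pct_1000` would be the same function called with `buffer_m=500` and `buffer_m=1000`.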
I think we should add the attribute "makat number", in addition to agency_id, route_id, route_short_name.
Some thoughts and fields I think we also need:
- Where in your classification do you put the estimated time at each stop?
- Time of first and last record (as opposed to departure and arrival times)
- Time of first and last record that had actual lat,lon (not 0.0,0.0)
- Fields indicating where the missing stops are (start, end or middle of the trip)
- Did the bus go back and forth or loop (arrive at a point in the shape that it had already been at earlier)
- General point - I think we should try to have all relevant fields that exist in trip_stats in SiriRide, and make sure we are comparable to the trip_stats fields (and also have the same names?). Most of them you already mentioned. I think only these two are missing:
  - `distance`
  - `is_loop` (kind of matches the thing I mentioned above about back and forth)
- Once we have these we will be able to easily create in the future something like `siri_route_stats`, with fields matching the ones in the gtfs route_stats
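As a rough illustration of the "first and last record" fields suggested above, the sketch below derives them from `points_time_list` and `points_latlon_list`. The function name and the output field names are hypothetical, and it assumes a missing location is reported exactly as `(0.0, 0.0)`.

```python
from datetime import datetime
from typing import List, Optional, Tuple


def first_last_record_times(
    points_time_list: List[datetime],
    points_latlon_list: List[Tuple[float, float]],
) -> Optional[dict]:
    """First/last timestamps overall and among points with a real location."""
    if not points_time_list:
        return None
    # Keep only timestamps whose location is not the (0.0, 0.0) placeholder.
    valid_times = [
        t
        for t, latlon in zip(points_time_list, points_latlon_list)
        if tuple(latlon) != (0.0, 0.0)
    ]
    return {
        "first_record_time": min(points_time_list),
        "last_record_time": max(points_time_list),
        "first_valid_location_time": min(valid_times) if valid_times else None,
        "last_valid_location_time": max(valid_times) if valid_times else None,
    }
```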
- We could also have a class for siri-ride that holds multiple siri-records (each with date-time and lat-lon attributes). It could be easier to have one list than two.
- What do you say about merging planned_start_date and planned_start_time together?
- In case we are going to have those 2 classes (siri-ride and siri-record), each of them could have an "analytics" member that holds a dictionary with all the metrics. For example, "points_cnt" will be in the siri-ride object while "speed" will be in siri-record.
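A minimal sketch of the two-class idea, using plain dataclasses. The attribute names and the `update_analytics` helper are assumptions for illustration; only `points_cnt` and the "analytics" dictionary come from the suggestion itself.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List


@dataclass
class SiriRecord:
    """One observation of the vehicle (hypothetical field names)."""
    time_recorded: datetime
    lat: float
    lon: float
    analytics: Dict[str, Any] = field(default_factory=dict)  # e.g. "speed"


@dataclass
class SiriRide:
    """One ride holding a single list of records instead of two parallel lists."""
    agency_id: int
    route_id: int
    planned_start: datetime  # merged planned_start_date + planned_start_time
    records: List[SiriRecord] = field(default_factory=list)
    analytics: Dict[str, Any] = field(default_factory=dict)  # e.g. "points_cnt"

    def update_analytics(self) -> None:
        """Recompute ride-level metrics from the current records."""
        self.analytics["points_cnt"] = len(self.records)
```

Ride-level metrics live on `SiriRide.analytics` and per-observation metrics on `SiriRecord.analytics`, so adding a new metric does not change either class signature.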
I like the idea of dividing the variables into complexity classes. My suggestions/comments:
- I think we should classify each variable by 2 criteria:
- Data needed (siri ride only, gtfs, etc.)
- Data Science work needed (e.g. straightforward aggregation vs statistical model required)
- I don't understand the difference between complex calculations (class 4) and models (class 6). I prefer a clearer definition of the data science solution complexity (see above), one that does not require us to decide in advance which type of DS solution (ML/statistical model...) will be the best for each "complex" variable.
- Let's add dependencies - if driving time requires start_time and end_time, and given them it is a straightforward calculation, let's mention it.
- On top of Dan's suggestions I would also add:
- total_ride_time_raw : time from first non 0 time point until the last one. This variable will help us to easily detect data anomalies with too long and too short rides.
- is_match_route: is the route ID mentioned in SIRI matching the expected route shape (from GTFS)?
- I didn't understand the variables: stops_matching_pct_*. Maybe add further description?
- In general I think we should focus now on defining and creating the "straightforward" variables, and later focus on variables that require statistics/modeling.
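To illustrate the dependencies point, here is a small sketch in which the derived attributes are computed only from other SiriRide attributes. The `DEPENDENCIES` map, the function signatures and the units (seconds, meters, km/h) are assumptions, not part of the design.

```python
from datetime import datetime
from typing import Optional

# Hypothetical dependency map: attribute -> attributes it is derived from.
DEPENDENCIES = {
    "driving_time_est": ["start_time_est", "end_time_est"],
    "driving_speed_est": ["driving_time_est", "distance"],
}


def driving_time_est(start_time_est: datetime, end_time_est: datetime) -> float:
    """Estimated driving time in seconds, from first to last station."""
    return (end_time_est - start_time_est).total_seconds()


def driving_speed_est(driving_time_s: float, distance_m: float) -> Optional[float]:
    """Estimated average speed in km/h; None when the time is not positive."""
    if driving_time_s <= 0:
        return None
    return (distance_m / 1000) / (driving_time_s / 3600)
```

Given `start_time_est` and `end_time_est`, both functions are straightforward arithmetic, so the hard part stays isolated in estimating the start and end times.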
@AvivSela - I didn't understand your suggestion in (1), what is the purpose of each class? Why should they be separated? Regarding (2) - I think that merging the date and time can hurt efficiency of indexing. Maybe we would like to index the date and not the time.
- It's easier to loop over them:

    for ind, point_time in enumerate(points_time_list):
        time = point_time
        lat, lon = points_latlon_list[ind]

  Vs.

    for record in records:
        time = record.time
        lat, lon = record.latlon

- It's less error-prone: with two lists, adding a new record means splitting it in two and inserting each part at the same index in both lists.
- It's easier to sort in case of modifications.
Thank you all for your comments and insights!
I updated the design accordingly.
The variables list became too long, so I ended up opening a design doc for it. Please see here.
In summary:
- I added most of the suggested variables (see exceptions in the "open issues" section below) and some more (total ~30 raw data/straightforward calculations and ~10 complex calculations).
- I added variables dependencies.
- I separated the data categories (what was called "classes" in the previous comment) by the 2 criteria @adiwaz mentioned.
Open issues:
- "makat number" - @evyatark I didn't find the column in the Splunk siri data. What is the meaning of this column? Do you know its "Splunk" name?
- "is_match_route" - @adiwaz, I think that for this version it will be simpler (from IT and DS perspective) not to use GTFS shape files, and to build our "match route" variables based on GTFS route_stats only (stops data).
- SiriRecord class - @AvivSela, I assume this is more of an IT-related issue than a data-related issue.
- planned_end_datetime_gtfs - I didn't find this data in gtfs route_stats. Don't we get/collect it?
Following the 15/4 Zoom meeting, some required updates in the data design:
- We will need to mention which variables are based directly on Siri or GTFS and which are based solely on other SiriRide variables (@EyalBerger)
- Data types: We should add data types.
- Naming: We will need to make sure that variable names are as they are in the siri sources, e.g. "service_id" is "trip_id_to_date". Who is familiar with the siri sources and can help me with that task?
I added data types and updated the dependencies (when a variable is based directly on Siri or GTFS) in the data design.
Hi, I looked at SIRI 2.8. It might take some time but we will get there. There are some more fields there that come "free of charge", without the need to calculate them. Here is an example of the JSON format (from ICD_SM_2_8_ver25.pdf):
{
  "-version": "2.8",
  "ResponseTimestamp": "2020-10-16T06:32:30+03:00",
  "Status": "true",
  "MonitoredStopVisit": [
    {
      "RecordedAtTime": "2020-10-16T06:32:19+03:00",
      "ItemIdentifier": "1455075547",
      "MonitoringRef": "47507",
      "MonitoredVehicleJourney": {
        "LineRef": "28209",
        "DirectionRef": "1",
        "FramedVehicleJourneyRef": {
          "DataFrameRef": "2020-10-16",
          "DatedVehicleJourneyRef": "50698246"
        },
        "PublishedLineName": "52",
        "OperatorRef": "3",
        "DestinationRef": "47453",
        "OriginAimedDepartureTime": "2020-10-16T06:25:00+03:00",
        "VehicleLocation": {
          "Longitude": "35.079803",
          "Latitude": "32.823952"
        },
        "Bearing": "8",
        "Velocity": "29",
        "VehicleRef": "7576269",
        "MonitoredCall": {
          "StopPointRef": "47507",
          "Order": "26",
          "ExpectedArrivalTime": "2020-10-16T06:49:00+03:00",
          "DistanceFromStop": "4009"
        }
      }
    }
  ]
}
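A rough sketch of how one `MonitoredStopVisit` item from a response like the one above could be flattened into a single-level record with typed values. The function name `flatten_visit` and the choice of output keys are assumptions; the input field names are the ones in the example.

```python
from datetime import datetime


def flatten_visit(visit):
    """Flatten one MonitoredStopVisit item into a single-level dict."""
    journey = visit["MonitoredVehicleJourney"]
    framed = journey["FramedVehicleJourneyRef"]
    location = journey["VehicleLocation"]
    call = journey["MonitoredCall"]
    return {
        # fromisoformat keeps the +03:00 offset as tzinfo
        "RecordedAtTime": datetime.fromisoformat(visit["RecordedAtTime"]),
        "LineRef": int(journey["LineRef"]),
        "DirectionRef": int(journey["DirectionRef"]),
        "DataFrameRef": framed["DataFrameRef"],
        "DatedVehicleJourneyRef": int(framed["DatedVehicleJourneyRef"]),
        "PublishedLineName": journey["PublishedLineName"],
        "OperatorRef": int(journey["OperatorRef"]),
        "VehicleRef": int(journey["VehicleRef"]),
        "Longitude": float(location["Longitude"]),
        "Latitude": float(location["Latitude"]),
        "Velocity": int(journey["Velocity"]),
        "StopPointRef": int(call["StopPointRef"]),
        "Order": int(call["Order"]),
        "DistanceFromStop": int(call["DistanceFromStop"]),
    }
```

Usage would be along the lines of `[flatten_visit(v) for v in response["MonitoredStopVisit"]]`, where `response` is the parsed JSON.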
If I take those fields and combine them into one object that represents a ride, holding a list of records with the observations over time, I get the following schema:
SiriRide
  LineRef: "Reference to a LINE"
  DirectionRef: "Reference to a DIRECTION the VEHICLE is running along the LINE"
  FramedVehicleJourneyRef_DataFrameRef: "The date part of the trip ID"
  FramedVehicleJourneyRef_DatedVehicleJourneyRef: "The number part of trip ID"
  PublishedLineName: "The bus number, as published on the bus"
  OperatorRef: "The Operator code"
  DestinationRef: "The destination stop code"
  VehicleRef: "Vehicle number. The value should match the license number of the Vehicle"
  OriginAimedDepartureTime: "The start time of the Journey, according to the licensing system. The value should match DepartureTime at TripIdToDate.txt file at the GTFS"
  SiriRecords
    ResponseTimestamp: "The time of the Response"
    RecordedAtTime: "Time at which data was recorded at the Vehicle"
    VehicleLocation
      Longitude: "Longitude from Greenwich meridian"
      Latitude: "Latitude from equator"
    Bearing: "Vehicle bearing with respect to the North"
    Velocity: "Vehicle speed at Km/h."
    StopPointRef: "The stop code of the stop that the Vehicle is stopping at now, or recently visited"
    Order: "The stop order of the stop that the Vehicle is stopping at now, or recently visited"
    DistanceFromStop: "The distance that the Vehicle travelled from the start of the journey, in meters"
{
  "title": "SiriRide",
  "type": "object",
  "properties": {
    "LineRef": {
      "title": "Lineref",
      "description": "Reference to a LINE",
      "type": "integer"
    },
    "DirectionRef": {
      "title": "Directionref",
      "description": "Reference to a DIRECTION the VEHICLE is running along the LINE",
      "type": "integer"
    },
    "FramedVehicleJourneyRef_DataFrameRef": {
      "title": "Framedvehiclejourneyref Dataframeref",
      "description": "The date part of the trip ID",
      "type": "string",
      "format": "date"
    },
    "FramedVehicleJourneyRef_DatedVehicleJourneyRef": {
      "title": "Framedvehiclejourneyref Datedvehiclejourneyref",
      "description": "The number part of trip ID",
      "type": "integer"
    },
    "PublishedLineName": {
      "title": "Publishedlinename",
      "description": "The bus number, as published on the bus",
      "type": "string"
    },
    "OperatorRef": {
      "title": "Operatorref",
      "description": "The Operator code",
      "type": "integer"
    },
    "DestinationRef": {
      "title": "Destinationref",
      "description": "The destination stop code",
      "type": "integer"
    },
    "VehicleRef": {
      "title": "Vehicleref",
      "description": "Vehicle number. The value should match the license number of the Vehicle",
      "type": "integer"
    },
    "OriginAimedDepartureTime": {
      "title": "Originaimeddeparturetime",
      "description": "The start time of the Journey, according to the licensing system. The value should match DepartureTime at TripIdToDate.txt file at the GTFS",
      "type": "string",
      "format": "date-time"
    },
    "SiriRecords": {
      "title": "Sirirecords",
      "description": "Each item represents one observation of a vehicle over time",
      "type": "array",
      "items": {
        "$ref": "#/definitions/SiriRecord"
      }
    }
  },
  "required": [
    "LineRef",
    "DirectionRef",
    "FramedVehicleJourneyRef_DataFrameRef",
    "FramedVehicleJourneyRef_DatedVehicleJourneyRef",
    "PublishedLineName",
    "OperatorRef",
    "DestinationRef",
    "VehicleRef",
    "OriginAimedDepartureTime",
    "SiriRecords"
  ],
  "definitions": {
    "GeoPoint": {
      "title": "GeoPoint",
      "type": "object",
      "properties": {
        "Longitude": {
          "title": "Longitude",
          "description": "Longitude from Greenwich meridian",
          "type": "number"
        },
        "Latitude": {
          "title": "Latitude",
          "description": "Latitude from equator",
          "type": "number"
        }
      },
      "required": [
        "Longitude",
        "Latitude"
      ]
    },
    "SiriRecord": {
      "title": "SiriRecord",
      "type": "object",
      "properties": {
        "ResponseTimestamp": {
          "title": "Responsetimestamp",
          "description": "The time of the Response",
          "type": "string",
          "format": "date-time"
        },
        "RecordedAtTime": {
          "title": "Recordedattime",
          "description": "Time at which data was recorded at the Vehicle",
          "type": "string",
          "format": "date-time"
        },
        "VehicleLocation": {
          "title": "Vehiclelocation",
          "description": "Vehicle Location",
          "allOf": [
            {
              "$ref": "#/definitions/GeoPoint"
            }
          ]
        },
        "Bearing": {
          "title": "Bearing",
          "description": "Vehicle bearing with respect to the North",
          "minimum": 0,
          "maximum": 360,
          "type": "integer"
        },
        "Velocity": {
          "title": "Velocity",
          "description": "Vehicle speed at Km/h.",
          "minimum": 0,
          "type": "integer"
        },
        "StopPointRef": {
          "title": "Stoppointref",
          "description": "The stop code of the stop that the Vehicle is stopping at now, or recently visited",
          "type": "integer"
        },
        "Order": {
          "title": "Order",
          "description": "The stop order of the stop that the Vehicle is stopping at now, or recently visited",
          "type": "integer"
        },
        "DistanceFromStop": {
          "title": "Distancefromstop",
          "description": "The distance that the Vehicle travelled from the start of the journey, in meters",
          "type": "integer"
        }
      },
      "required": [
        "ResponseTimestamp",
        "RecordedAtTime",
        "VehicleLocation",
        "Bearing",
        "Velocity",
        "StopPointRef",
        "Order",
        "DistanceFromStop"
      ]
    }
  }
}
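As a sketch of how flat observations could be grouped into objects of this shape, the function below builds one SiriRide dict per distinct combination of ride-level fields and collects the rest into `SiriRecords`. The name `build_rides`, the grouping key (all ride-level fields together) and the input format (one flat dict per observation, with `VehicleLocation` already a `{"Longitude", "Latitude"}` dict) are assumptions for illustration.

```python
RIDE_KEYS = (
    "LineRef",
    "DirectionRef",
    "FramedVehicleJourneyRef_DataFrameRef",
    "FramedVehicleJourneyRef_DatedVehicleJourneyRef",
    "PublishedLineName",
    "OperatorRef",
    "DestinationRef",
    "VehicleRef",
    "OriginAimedDepartureTime",
)
RECORD_KEYS = (
    "ResponseTimestamp",
    "RecordedAtTime",
    "VehicleLocation",
    "Bearing",
    "Velocity",
    "StopPointRef",
    "Order",
    "DistanceFromStop",
)


def build_rides(flat_records):
    """Group flat observation dicts into SiriRide dicts with a SiriRecords list."""
    rides = {}
    for rec in flat_records:
        key = tuple(rec[k] for k in RIDE_KEYS)
        if key not in rides:
            ride = {k: rec[k] for k in RIDE_KEYS}
            ride["SiriRecords"] = []
            rides[key] = ride
        rides[key]["SiriRecords"].append({k: rec[k] for k in RECORD_KEYS})
    for ride in rides.values():
        # Assumes RecordedAtTime values are mutually comparable (e.g. datetimes,
        # or ISO strings that share the same UTC offset).
        ride["SiriRecords"].sort(key=lambda r: r["RecordedAtTime"])
    return list(rides.values())
```

Each returned dict carries every field listed under `required` in the schema, so it can be passed to whatever validation step is chosen later.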