Data Class

rove_parameters module

class backend.data_class.rove_parameters.ROVE_params(agency: str, month: str, year: str, date_type: str, data_option: str, input_paths: Dict, output_paths: Dict, start_date: str = '', end_date: str = '')

Bases: object

Data structure that stores all parameters needed throughout the backend.

Parameters:

agency (str) – name of the analyzed agency
month (str) – 2-character string of the analyzed month, e.g. 03 for March
year (str) – 4-character string of the analyzed year, e.g. 2022
date_type (str) – type of dates to be analyzed. One of “Workday”, “Saturday”, “Sunday”
data_option (str) – input data option. One of ‘GTFS’, ‘GTFS-AVL’

agency: str: Analyzed transit agency, see parameter definition.

month: str: Analyzed month, see parameter definition.

year: str: Analyzed year, see parameter definition.

date_type: str: Analyzed date option, see parameter definition.

data_option: str: Analyzed data option, see parameter definition.

suffix: str: Suffix used in input and output file names, string concatenation in the form of “<agency>_<month>_<year>”, e.g. “MBTA_02_2021”.

input_paths: Dict[str, str]: Dict of paths to input data, i.e. gtfs, avl, backend_config, frontend_config, shapes file (if shape generation has been run previously).

output_paths: Dict[str, str]: Dict of paths to output data, i.e., shapes file, timepoints lookup, stop name lookup, aggregated metrics by time periods, aggregated metrics by 10-min intervals.

redValues: Dict[str, str]: A dict serving as the lookup for “redValues”, i.e. whether the visualization of a metric value is red when the value is high or low. This information is required in the frontend_config JSON file, where an object named “redValues” must exist and consist of name-value pairs of each metric to be calculated, where the value must be “High” or “Low”. e.g. “scheduled_frequency” : “Low” means that the scheduled frequency of stop pairs/routes will be colored red if the value is low and blue if high; whereas “High” means high values are colored red and low values blue.
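As an illustration of the “redValues” requirement above, the following is a minimal sketch of reading and checking such an object from a frontend_config JSON string. The function name `read_red_values`, the sample config text, and the “crowding” metric are hypothetical and are not part of the ROVE backend; the exception types are assumptions as well.

```python
import json
from typing import Dict

# Hypothetical frontend_config content. Only the "redValues" object is
# described in this documentation; "crowding" is a made-up metric name.
FRONTEND_CONFIG_TEXT = """
{
  "redValues": {
    "scheduled_frequency": "Low",
    "crowding": "High"
  }
}
"""


def read_red_values(config_text: str) -> Dict[str, str]:
    """Return the redValues lookup, checking that every value is High or Low."""
    config = json.loads(config_text)
    if "redValues" not in config:
        raise KeyError("frontend_config must contain a 'redValues' object")
    red_values = config["redValues"]
    for metric, direction in red_values.items():
        if direction not in ("High", "Low"):
            raise ValueError(f"redValues['{metric}'] must be 'High' or 'Low'")
    return red_values


red_values = read_red_values(FRONTEND_CONFIG_TEXT)
```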

backend_config: agency-specific configuration parameters for the backend (backend_config), e.g., time periods, speed range, percentile list, additional files, etc. (Although there are two config files (frontend_config and backend_config), this attribute storing backend_config data is called “config” for simplicity, because frontend_config is only used in the backend to retrieve redValues as described above, and all other reference to “config” in the backend is using backend_config.)

date_list: List[datetime]: List of dates of the given date_type in the given month and year of the agency.

get_iso3166_code()

get_backend_config(fpath: str)

get_frontend_config(fpath: str)

get_transitFileProp_or_vizFileProp(name: str, fconfig: Dict, this_sub_dict: Dict)

generate_date_list() → List[datetime]

Generate a list of dates of date_type between the start_date and end_date, or in the given month and year. For example, if the user specifies “MBTA”, “02”, “2021”, “Workday” as the agency, month, year and date_type and does not specify a start_date or end_date, this method returns a list of datetime objects for the workdays (no weekends or holidays) in Feb 2021 in the state/country represented by the iso3166_code. If a start_date and end_date are specified, this method returns a list of datetime objects between the start date and end date.

A “workalendarPath” object must exist in the backend_config JSON file so that the method knows which state/region to look up the holiday calendar for. Its value must be the workalendar class for the region that the agency operates in. E.g. for the MBTA, this name-value pair is specified in the backend_config file: “workalendarPath”: “workalendar.usa.massachusetts.Massachusetts”. For details on how to find the correct workalendar class for your region, refer to https://workalendar.github.io/workalendar/iso-registry.html.

Raises:: KeyError – no workalendarPath is found in config
Returns:: list of dates
Return type:: List[datetime.datetime]
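The month-and-year case of the date list can be sketched with the standard library alone. This is not the ROVE implementation: the function name `dates_in_month` is hypothetical, and the holiday set is passed in by hand instead of being looked up through workalendar. Excluding holidays only for “Workday” is also an assumption.

```python
import calendar
from datetime import date, datetime
from typing import List, Set

# Python weekday numbers: Monday is 0, Sunday is 6.
WEEKDAYS_BY_DATE_TYPE = {
    "Workday": {0, 1, 2, 3, 4},
    "Saturday": {5},
    "Sunday": {6},
}


def dates_in_month(year: str, month: str, date_type: str,
                   holidays: Set[date]) -> List[datetime]:
    """List the dates of date_type in the given month (hypothetical helper)."""
    wanted = WEEKDAYS_BY_DATE_TYPE[date_type]
    y, m = int(year), int(month)
    n_days = calendar.monthrange(y, m)[1]
    result = []
    for day in range(1, n_days + 1):
        current = date(y, m, day)
        if current.weekday() not in wanted:
            continue
        # Assumption: holidays are removed from workdays only.
        if date_type == "Workday" and current in holidays:
            continue
        result.append(datetime(y, m, day))
    return result


# Presidents' Day 2021 (Feb 15) supplied by hand as the only holiday.
workdays = dates_in_month("2021", "02", "Workday", {date(2021, 2, 15)})
saturdays = dates_in_month("2021", "02", "Saturday", set())
```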

gtfs module

class backend.data_class.gtfs.GTFS(rove_params: ROVE_params, mode: str = 'bus', shape_gen=True)

Bases: object

Store a validated GTFS stop records table. Add timepoint and branchpoint data to the records table. Also generate and store a dict of route patterns (patterns_dict).

Parameters:

rove_params (ROVE_params) – a rove_params object that stores information needed throughout the backend
mode (str) – the mode of transit that the GTFS data is for, defaults to ‘bus’. For example, if mode is ‘bus’, then the list of route type values for ‘bus’ as specified in backend_config will be used to query the corresponding GTFS trips. The current implementation (metrics, shapes, etc.) is developed around bus (or bus-like) mode only. Support for other transit modes may be added in the future.

REQUIRED_DATA_SPEC = {'routes': {'route_id': 'string', 'route_type': 'int64'}, 'stop_times': {'arrival_time': 'int64', 'departure_time': 'int64', 'stop_id': 'string', 'stop_sequence': 'int64', 'trip_id': 'string'}, 'stops': {'stop_id': 'string', 'stop_lat': 'float64', 'stop_lon': 'float64', 'stop_name': 'string'}, 'trips': {'direction_id': 'int64', 'route_id': 'string', 'service_id': 'string', 'trip_id': 'string'}}: Required tables and columns in GTFS static data. Note that “direction_id” is not a required field in the GTFS specification, but is required by ROVE.

OPTIONAL_DATA_SPEC = {'shapes': {'shape_id': 'string', 'shape_pt_lat': 'float64', 'shape_pt_lon': 'float64', 'shape_pt_sequence': 'int64'}}: Optional tables and columns that ideally are present in GTFS static data

mode: str: Analyzed transit mode, see parameter definition.

alias: str: Alias of the data class, defined as ‘gtfs’.

rove_params: ROVE_params: ROVE_params for the backend, see parameter definition.

raw_data: Dict[str, pandas.DataFrame]: Raw data read from the given path, see GTFS.load_data() for details.

validated_data: Dict[str, pandas.DataFrame]: Validated data, see GTFS.validate_data() for details.

records: pandas.DataFrame: GTFS records table that contains all stop events info and trips info, see GTFS.get_gtfs_records() for details.

patterns_dict: A dict of improved patterns, see GTFS.improve_pattern_with_shapes() for details.

load_data(path: str) → Dict[str, pandas.DataFrame]

Load GTFS data from a zip file, and retrieve the data for the dates in date_list (as stored in rove_params) and for the route_type (as stored in config). Enforce that required tables are present and not empty, and log (without enforcing) when optional tables are missing from the feed or empty. Enforce that all spec columns exist for tables in both the required and optional specs. Store the retrieved raw data tables in a dict.

Parameters:: path (str) – path to the raw data
Returns:: a dict containing raw GTFS data. Key: name of GTFS table; value: DataFrames of required and optional GTFS tables.
Return type:: Dict[str, pd.DataFrame]
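The required/optional table check described for load_data() can be sketched as follows. This is an illustration only: `check_feed_tables` and `make_feed` are hypothetical names, the feed is passed as bytes rather than a path so the example is self-contained, “empty” is approximated as a zero-byte file, and raising ValueError is an assumption about how the requirement is enforced.

```python
import io
import logging
import zipfile
from typing import Iterable, List

REQUIRED_TABLES = ("routes", "stop_times", "stops", "trips")
OPTIONAL_TABLES = ("shapes",)


def check_feed_tables(feed_bytes: bytes) -> List[str]:
    """Raise if a required table is missing; return missing optional tables."""
    with zipfile.ZipFile(io.BytesIO(feed_bytes)) as feed:
        present = {
            name[: -len(".txt")]
            for name in feed.namelist()
            if name.endswith(".txt") and feed.getinfo(name).file_size > 0
        }
    missing_required = [t for t in REQUIRED_TABLES if t not in present]
    if missing_required:
        raise ValueError(f"required GTFS tables missing or empty: {missing_required}")
    missing_optional = [t for t in OPTIONAL_TABLES if t not in present]
    for table in missing_optional:
        # Optional tables are only logged, not enforced.
        logging.warning("optional GTFS table missing or empty: %s", table)
    return missing_optional


def make_feed(table_names: Iterable[str]) -> bytes:
    """Build a tiny in-memory zip with one placeholder file per table."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as feed:
        for name in table_names:
            feed.writestr(f"{name}.txt", "placeholder_column\n")
    return buffer.getvalue()


missing = check_feed_tables(make_feed(REQUIRED_TABLES))
```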

validate_data(): Clean up raw data by converting column types to those listed in the spec. :return: a dict containing cleaned-up GTFS data. Key: name of GTFS table; value: GTFS table stored as DataFrame. :rtype: Dict[str, pd.DataFrame]

get_gtfs_records() → pandas.DataFrame

Return a dataframe that is the validated GTFS stop_times table left-joined with the validated GTFS trips table. Values are sorted by [route_id, trip_id, stop_sequence]. Additional columns are added for the convenience of downstream calculations:

‘hour’: the hour that the arrival time is in;

‘trip_start_time’: start time of the trip that this stop event record is associated with;

‘trip_end_time’: end time of the trip that this stop event record is associated with.

Returns:: the merged dataframe and additional columns
Return type:: pd.DataFrame
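The join, sort and derived columns of get_gtfs_records() can be sketched in pandas as below. The sample tables are made up. Two assumptions are not confirmed by this documentation: arrival_time is taken to be integer seconds, and the trip start and end times are taken to be the minimum and maximum arrival time of each trip.

```python
import pandas as pd

# Made-up validated tables with only the columns needed for the sketch.
stop_times = pd.DataFrame({
    "trip_id": ["t2", "t1", "t1", "t2"],
    "stop_id": ["B", "A", "B", "A"],
    "stop_sequence": [2, 1, 2, 1],
    "arrival_time": [30600, 25200, 25800, 30000],  # assumed seconds
})
trips = pd.DataFrame({
    "trip_id": ["t1", "t2"],
    "route_id": ["1", "1"],
    "direction_id": [0, 0],
    "service_id": ["wk", "wk"],
})

# Left join keeps every stop event, even if its trip is missing from trips.
records = stop_times.merge(trips, on="trip_id", how="left")
records = records.sort_values(
    ["route_id", "trip_id", "stop_sequence"]
).reset_index(drop=True)

# Derived columns (assumed definitions, see the note above).
records["hour"] = records["arrival_time"] // 3600
arrivals_by_trip = records.groupby("trip_id")["arrival_time"]
records["trip_start_time"] = arrivals_by_trip.transform("min")
records["trip_end_time"] = arrivals_by_trip.transform("max")
```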

add_timepoints(): Add, or repopulate, the ‘timepoint’ column in the GTFS records table (created from get_gtfs_records()). ‘timepoint’ is an optional column in GTFS standards, but we require the identification of timepoints in each trip for timepoint-level metric calculations. Therefore, each agency must either supply the ‘timepoint’ info in the ‘timepoint’ column of the ‘stop_times’ table in GTFS data, or provide additional data source and extend the standard GTFS class and overwrite this method to populate the ‘timepoint’ column in the GTFS records table. Otherwise, every stop in the stop_times will be labeled as a timepoint.

add_branchpoints(): Add the ‘branchpoint’ and ‘tp_bp’ columns in the GTFS records table. ‘branchpoint’ is defined as stops where routes converge or diverge between two timepoints. The ‘tp_bp’ column marks stops that are either a timepoint or a branchpoint. The ‘tp_bp’ stop pairs are the basis of aggregation for ‘timepoint’ and ‘timepoint-aggregated’ metrics.

generate_patterns() → Dict[str, Dict]

Generate a dict of patterns from validated GTFS data. Add a “pattern” column to the trips table.

Raises:: ValueError – the number of unique trip hashes does not match the number of unique stop sequences
Returns:: pattern dict - key: pattern (route_id - direction_id - hash count); value: Segment dict (a segment is a section of road between two transit stops). Segment dict - key: tuple of first and last stops of the segment; value: list of coordinates defining the segment.
Return type:: Dict[str, Dict]
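The shape of the pattern dict can be illustrated with a minimal sketch that groups trips by their stop sequence. The function name `build_patterns` and its inputs are hypothetical, the pattern ID format “route_id-direction_id-count” with a count starting at 0 is an assumption, and the real method works from the validated GTFS tables and trip hashes rather than plain dicts.

```python
from typing import Dict, List, Tuple

Coord = Tuple[float, float]
SegmentDict = Dict[Tuple[str, str], List[Coord]]


def build_patterns(trip_stops: Dict[str, List[str]],
                   trip_info: Dict[str, Tuple[str, int]],
                   stop_coords: Dict[str, Coord]) -> Dict[str, SegmentDict]:
    """Create one pattern per unique (route, direction, stop sequence)."""
    patterns: Dict[str, SegmentDict] = {}
    seen = set()   # (route_id, direction_id, stop sequence) already handled
    counts = {}    # (route_id, direction_id) -> patterns created so far
    for trip_id, stops in trip_stops.items():
        route_id, direction_id = trip_info[trip_id]
        key = (route_id, direction_id, tuple(stops))
        if key in seen:
            continue
        seen.add(key)
        index = counts.get((route_id, direction_id), 0)
        counts[(route_id, direction_id)] = index + 1
        # Each segment starts with only the coordinates of its two stops.
        patterns[f"{route_id}-{direction_id}-{index}"] = {
            (first, second): [stop_coords[first], stop_coords[second]]
            for first, second in zip(stops, stops[1:])
        }
    return patterns


# Made-up data: two trips share a stop sequence, one skips stop B.
coords = {"A": (42.35, -71.06), "B": (42.36, -71.07), "C": (42.37, -71.08)}
patterns = build_patterns(
    {"t1": ["A", "B", "C"], "t2": ["A", "B", "C"], "t3": ["A", "C"]},
    {"t1": ("1", 0), "t2": ("1", 0), "t3": ("1", 0)},
    coords,
)
```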

improve_pattern_with_shapes(patterns: Dict, records: pandas.DataFrame, gtfs: Dict) → Dict[str, Dict]

Improve the coordinates of each segment in each pattern by supplementing the stop coordinates with coordinates found in the GTFS shapes table, i.e. in addition to the two stop coordinates at both ends of the segment, add the intermediate coordinates given by GTFS shapes to enrich the segment profile.

Parameters:

patterns (Dict) – dict of patterns
records (pd.DataFrame) – table of validated GTFS stop_times records
gtfs (Dict) – dict of validated GTFS tables

Returns:: dict of patterns, where the list of coordinates of each segment is supplemented by the GTFS shapes table
Return type:: Dict[str, Dict]

generate_timepoints_output(): Save to a JSON file a lookup of timepoint pairs. Each key is the segment ID, i.e. string concatenation of “route_id - first stop - second stop” of the stop pair, and value is a tuple (first stop_id, second stop_id) of the timepoint pair that this stop pair belongs to.
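The structure of the timepoint lookup can be sketched as below. The function name `timepoint_lookup`, the “-” separator in the segment ID, and the requirement that the first and last stops are timepoints are assumptions made for the illustration. Note that JSON has no tuple type, so the tuple values are written as two-element arrays.

```python
import json
from typing import Dict, List, Tuple


def timepoint_lookup(route_id: str,
                     stops: List[Tuple[str, bool]]) -> Dict[str, Tuple[str, str]]:
    """Map each consecutive stop pair to the timepoint pair enclosing it.

    stops is an ordered list of (stop_id, is_timepoint). The first and last
    stops are assumed to be timepoints.
    """
    lookup = {}
    for i in range(len(stops) - 1):
        first, second = stops[i][0], stops[i + 1][0]
        # Nearest timepoint at or before the first stop of the pair.
        start = next(s for s, is_tp in reversed(stops[: i + 1]) if is_tp)
        # Nearest timepoint at or after the second stop of the pair.
        end = next(s for s, is_tp in stops[i + 1:] if is_tp)
        lookup[f"{route_id}-{first}-{second}"] = (start, end)
    return lookup


lookup = timepoint_lookup(
    "1", [("A", True), ("B", False), ("C", True), ("D", True)]
)
lookup_json = json.dumps(lookup)
```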

generate_stop_name_output(): Save to a JSON file a lookup of stop names. Each key is the stop ID, and each value is the dict {“stop_name” : <name of the stop>}, with an additional name-value pair for “municipality” if that field exists in the table.

avl module

class backend.data_class.avl.AVL(rove_params: ROVE_params, bus_gtfs: GTFS)

Bases: object

Stores a validated AVL data records table with passenger on, off and load values corrected.

Parameters:: rove_params (ROVE_params) – a rove_params object that stores information needed throughout the backend

REQUIRED_COL_SPEC = {'dwell_time': 'float64', 'passenger_load': 'int64', 'passenger_off': 'int64', 'passenger_on': 'int64', 'route': 'string', 'seat_capacity': 'int64', 'stop_id': 'string', 'stop_sequence': 'int64', 'stop_time': 'int', 'trip_id': 'string'}: Required columns and the data types that each column will be converted to in AVL data

OPTIONAL_COL_SPEC = {}: Optional columns in AVL data

rove_params: ROVE_params: ROVE_params for the backend, see parameter definition.

gtfs: GTFS: GTFS records table

validated_data: pandas.DataFrame: Validated data, see AVL.validate_data() for details.

records: pandas.DataFrame: AVL records table, see AVL.get_avl_records() for details.

load_data(path: str) → pandas.DataFrame

Load in AVL data from the given path.

Parameters:: path (str) – file path to raw AVL data
Raises:: ValueError – raw AVL data file is empty
Returns:: dataframe of AVL data with all required columns
Return type:: pd.DataFrame

validate_data() → pandas.DataFrame

Clean up raw data by converting column types to those listed in the spec. Convert dwell_time and stop_time columns to integer seconds if necessary. Filter to keep only AVL records of dates in the date_list in ROVE_params.

Returns:: a dataframe of validated AVL data
Return type:: pd.DataFrame

check_avl_gtfs_ids_match()

convert_dwell_time(data: pandas.Series) → pandas.Series

Convert dwell times to integer seconds.

Parameters:: data (pd.Series) – the column of dwell_time data
Returns:: column of dwell times in integer seconds
Return type:: pd.Series

convert_stop_time(data: pandas.Series) → Tuple[pandas.Series, pandas.Series]

Convert stop times to integer seconds since the beginning of service (defined in config). Also return a column of service date (e.g. 01:30 am on March 4 may correspond to the service date of March 3 if service span is from 5 am to 3 am the next day).

Parameters:: data (pd.Series) – the column of stop_time (time of arrival at a stop) data
Returns:: column of stop times in integer seconds, and column of service dates
Return type:: Tuple[pd.Series, pd.Series]
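The service-date logic of convert_stop_time() can be sketched for a single timestamp using the standard library. The function name `to_service_time` and the `service_start_hour` parameter are hypothetical stand-ins for the value defined in config, and the real method operates on a whole pandas Series.

```python
from datetime import date, datetime, time, timedelta
from typing import Tuple


def to_service_time(timestamp: datetime,
                    service_start_hour: int = 5) -> Tuple[int, date]:
    """Return (seconds since the start of service, service date)."""
    service_date = timestamp.date()
    # Events before the start of service belong to the previous service date.
    if timestamp.hour < service_start_hour:
        service_date -= timedelta(days=1)
    service_start = datetime.combine(service_date, time(hour=service_start_hour))
    seconds = int((timestamp - service_start).total_seconds())
    return seconds, service_date


# 01:30 am on March 4 falls on the March 3 service date when service starts at 5 am.
late_night = to_service_time(datetime(2022, 3, 4, 1, 30))
first_trip = to_service_time(datetime(2022, 3, 4, 5, 0))
```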

get_avl_records() → pandas.DataFrame

Return a dataframe that is the validated AVL table. Values are sorted by [‘svc_date’, ‘route_id’, ‘trip_id’, ‘stop_sequence’], and only unique rows of each combination of [‘svc_date’, ‘route_id’, ‘trip_id’, ‘stop_sequence’] columns are kept.

Returns:: dataframe containing validated and sorted AVL data
Return type:: pd.DataFrame
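The sort and de-duplication described for get_avl_records() can be sketched in pandas as below. The sample data is made up, and keeping the first of each set of duplicate rows is an assumption, since the documentation does not say which duplicate survives.

```python
import pandas as pd

KEYS = ["svc_date", "route_id", "trip_id", "stop_sequence"]

# Made-up AVL rows with one duplicated stop event (stop_sequence 2).
avl = pd.DataFrame({
    "svc_date": ["2022-03-03"] * 4,
    "route_id": ["1"] * 4,
    "trip_id": ["t1"] * 4,
    "stop_sequence": [2, 1, 2, 3],
    "passenger_load": [7, 5, 8, 4],
})

# Sort by the key columns, then keep one row per key combination.
avl_records = (
    avl.sort_values(KEYS)
    .drop_duplicates(subset=KEYS, keep="first")
    .reset_index(drop=True)
)
```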

correct_passenger_load(): Enforce that no one alights at the first stop or boards at the last stop, and make sure the passenger_on, passenger_off and passenger_load values of each trip add up.
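One possible way to apply these rules to a single trip is sketched below. It is not the ROVE implementation: the function name `correct_trip` is hypothetical, and capping alightings at the current load and recomputing the load as a running total are assumptions about how the values are made to add up.

```python
from typing import List, Sequence, Tuple


def correct_trip(ons: Sequence[int],
                 offs: Sequence[int]) -> Tuple[List[int], List[int], List[int]]:
    """Return corrected (passenger_on, passenger_off, passenger_load) lists."""
    ons, offs = list(ons), list(offs)
    offs[0] = 0   # no one alights at the first stop
    ons[-1] = 0   # no one boards at the last stop
    loads = []
    load = 0
    for i in range(len(ons)):
        # Assumption: no more passengers can alight than are on board.
        offs[i] = min(offs[i], load)
        load = load - offs[i] + ons[i]
        loads.append(load)
    return ons, offs, loads


# Made-up raw counts for a three-stop trip.
fixed_on, fixed_off, fixed_load = correct_trip([5, 3, 2], [1, 4, 6])
```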