mobts.imputation package

Submodules

mobts.imputation.donors module

Module concerned with the donor-based imputations

This module contains:

  • determining the minimum overlap period for the scaled median imputation method, based on the project’s temporal frequency

  • building pivot tables for further operations, with timestamps as the index, counters as columns, and counts as values

  • creating a correlation matrix of counters, based on the Pearson correlation between counts

  • scaled medians imputation

  • regression imputation

mobts.imputation.donors._build_pivots(df: DataFrame, cols: ColumnsConfig = ColumnsConfig(counter='counter', timestamp='timestamp', count='count', weekday='weekday', week_num='week_num', how='how', hour='hour', date='date'), stl_cfg: STLConfig = STLConfig(rolling_median_window=2, rolling_median_min_valid=1)) DataFrame

Builds pivots of the data, with timestamps as the index, counters as columns, and counts as values

  • df: full network DataFrame

  • cols: columns config

  • stl_cfg: STL config

  • pivot_raw: pivot built from the raw observed counts

  • pivot_ts: pivot built from the smoothed time series (STL trend + seasonality)
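The pivot layout described above (timestamps as index, counters as columns, counts as values) can be illustrated with a minimal pandas sketch. The toy data and the helper name `build_raw_pivot` are hypothetical and not part of mobts; the smoothed STL-based pivot is not shown.

```python
import pandas as pd

def build_raw_pivot(df, counter="counter", timestamp="timestamp", count="count"):
    """Hypothetical helper: reshape long data into one column per counter."""
    # DataFrame.pivot requires each (timestamp, counter) pair to be unique.
    return df.pivot(index=timestamp, columns=counter, values=count)

# Toy long-format data; counter B has one missing count.
toy = pd.DataFrame({
    "counter": ["A", "A", "B", "B"],
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02"]
    ),
    "count": [10.0, 12.0, 5.0, None],
})

pivot_raw = build_raw_pivot(toy)
print(pivot_raw)
```

Missing counts stay as NaN cells in the pivot, which is what lets later steps locate holes per counter.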

mobts.imputation.donors._corralation_matrix_donors(pivot_for_corr: DataFrame) DataFrame

Builds the correlation matrix of counters, based on the Pearson correlation between their counts over shared timestamps

  • pivot_for_corr: the ‘_build_pivots’ function’s output, which is a pivot of counts

  • the correlation matrix of counters
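A counter-by-counter Pearson correlation matrix can be computed from such a pivot along the following lines. This is a sketch with made-up data, assuming plain `DataFrame.corr`; the actual function may apply extra filtering that is not documented here.

```python
import numpy as np
import pandas as pd

# Toy pivot: timestamps as index, counters as columns, counts as values.
toy_pivot = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0, 4.0],
        "B": [2.0, 4.0, 6.0, 8.0],
        "C": [4.0, 3.0, np.nan, 1.0],
    },
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# pandas computes each pairwise correlation on the rows where both
# counters are observed, so the hole in C does not discard whole rows.
corr_matrix = toy_pivot.corr(method="pearson")
print(corr_matrix.round(3))
```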

mobts.imputation.donors._get_min_overlap_period_sm(freq: str, donors_cfg: DonorsConfig = DonorsConfig()) int

Determines the minimum overlap period necessary for scaled medians imputation

  • freq: temporal frequency of the project

  • donors_cfg: donors’ config

  • integer corresponding to the minimum necessary overlap period

mobts.imputation.donors.impute_regression(df: DataFrame, pivot: DataFrame, freq: str, donor_map: dict[str, list[str]], counters=None, cols: ColumnsConfig = ColumnsConfig(counter='counter', timestamp='timestamp', count='count', weekday='weekday', week_num='week_num', how='how', hour='hour', date='date'), donors_cfg: DonorsConfig = DonorsConfig(), stl_cfg: STLConfig = STLConfig(rolling_median_window=2, rolling_median_min_valid=1), out_cfg: OutputConfig = OutputConfig(col_sm_imputed='count_sm_imputed', col_reg_imputed='count_reg_imputed', col_final='count_imputed', col_method_used='imputation_method', stl_method='STL', sm_method='M7', reg_method='M8')) DataFrame

Fills missing values using regression prediction of donors (M8)

  • df: the complete network dataset

  • pivot: pivoted dataset of counters

  • donor_map: dictionary map of donors

  • freq: temporal frequency of the project

  • counters: counters to be operated on. If None, all counters will be processed

  • cols: columns config

  • donors_cfg: donors’ config

  • out_cfg: output config

  • stl_cfg: STL config

  • Imputed DataFrame using regression method (M8)

  • the ‘counters’ argument lets the pipeline skip counters that have no data holes, so that only counters with holes are processed
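To make the idea of donor regression concrete, here is a minimal sketch that fits an ordinary least-squares model of the target on its donors and predicts the target's holes. It is an assumption about the general technique, not the mobts implementation: the helper `regression_fill`, the use of `numpy.linalg.lstsq`, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def regression_fill(pivot, target, donors):
    """Hypothetical sketch: fill holes in `target` from a linear fit on donors."""
    filled = pivot[target].copy()
    # Train on rows where the target and every donor are observed.
    train = pivot[[target] + donors].dropna()
    if len(train) <= len(donors):
        return filled  # too few complete rows to fit intercept + slopes
    X = np.column_stack([np.ones(len(train)), train[donors].to_numpy()])
    coef, *_ = np.linalg.lstsq(X, train[target].to_numpy(), rcond=None)
    # Predict only where the target is missing and all donors are present.
    holes = pivot[target].isna() & pivot[donors].notna().all(axis=1)
    X_new = np.column_stack(
        [np.ones(int(holes.sum())), pivot.loc[holes, donors].to_numpy()]
    )
    filled.loc[holes] = X_new @ coef
    return filled

toy_reg = pd.DataFrame(
    {"T": [3.0, 5.0, np.nan, 9.0, 11.0], "D": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)
reg_filled = regression_fill(toy_reg, "T", ["D"])
print(reg_filled)
```

In the toy data the target follows T = 2·D + 1 exactly, so the hole is filled with a value very close to 7.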

mobts.imputation.donors.impute_scaled_median(df: DataFrame, pivot: DataFrame, donor_map: dict[str, list[str]], freq: str, counters=None, cols: ColumnsConfig = ColumnsConfig(counter='counter', timestamp='timestamp', count='count', weekday='weekday', week_num='week_num', how='how', hour='hour', date='date'), donors_cfg: DonorsConfig = DonorsConfig(), out_cfg: OutputConfig = OutputConfig(col_sm_imputed='count_sm_imputed', col_reg_imputed='count_reg_imputed', col_final='count_imputed', col_method_used='imputation_method', stl_method='STL', sm_method='M7', reg_method='M8')) DataFrame

Fills missing values using scaled median of donors (M7)

  • df: the complete network dataset

  • pivot: pivoted dataset of counters

  • donor_map: dictionary map of donors

  • freq: temporal frequency of the project

  • counters: counters to be operated on. If None, all counters will be processed

  • cols: columns config

  • donors_cfg: donors’ config

  • out_cfg: output config

  • Imputed DataFrame using scaled medians method (M7)

  • the ‘counters’ argument lets the pipeline skip counters that have no data holes, so that only counters with holes are processed
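The documentation does not spell out the scaled median computation, so the following is only one plausible reading: each donor is rescaled by the median target-to-donor ratio over their overlap, and the hole is filled with the median of the rescaled donors. The helper `scaled_median_fill`, its `min_overlap` parameter, and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def scaled_median_fill(pivot, target, donors, min_overlap=3):
    """Hypothetical sketch of a scaled-median fill for one target counter."""
    filled = pivot[target].copy()
    scaled = {}
    for donor in donors:
        # Rows where both series are observed and the donor is non-zero.
        both = pivot[[target, donor]].dropna()
        both = both[both[donor] > 0]
        if len(both) < min_overlap:
            continue  # not enough overlap to trust this donor
        scale = (both[target] / both[donor]).median()
        scaled[donor] = pivot[donor] * scale
    if not scaled:
        return filled
    # Median across the rescaled donors, timestamp by timestamp.
    estimate = pd.DataFrame(scaled).median(axis=1)
    return filled.fillna(estimate)

toy_sm = pd.DataFrame(
    {
        "T": [10.0, 20.0, np.nan, 40.0, 50.0],
        "D1": [5.0, 10.0, 15.0, 20.0, 25.0],
        "D2": [20.0, 40.0, 60.0, 80.0, 100.0],
    },
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)
sm_filled = scaled_median_fill(toy_sm, "T", ["D1", "D2"])
print(sm_filled)
```

Here `min_overlap` plays the role that the minimum overlap period plays in the module: a donor with too short a shared history is ignored.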

mobts.imputation.pipeline module

The pipeline for imputation subpackage

This module contains the ‘impute’ class. Its ‘run’ method includes:

  • formatting and verifying the temporal elements of the input dataset

  • identifying counters with holes

  • applying STL, scaled medians imputation, and donor regression imputation

class mobts.imputation.pipeline.impute(cols: ColumnsConfig = ColumnsConfig(counter='counter', timestamp='timestamp', count='count', weekday='weekday', week_num='week_num', how='how', hour='hour', date='date'), stl_cfg: STLConfig = STLConfig(rolling_median_window=2, rolling_median_min_valid=1), donors_cfg: DonorsConfig = DonorsConfig(), out_cfg: OutputConfig = OutputConfig(col_sm_imputed='count_sm_imputed', col_reg_imputed='count_reg_imputed', col_final='count_imputed', col_method_used='imputation_method', stl_method='STL', sm_method='M7', reg_method='M8'), cfg_spr: SparsityConfig = SparsityConfig(drop_sparse_counters=True, sparse_threshold=0.5), suppress_runtime_warnings: bool = True)

Bases: object

End-to-end pipeline of input data to imputed data

  • df: input dataset

  • cols: columns config

  • stl_cfg: STL config

  • donors_cfg: donors config

  • out_cfg: output config

  • suppress_runtime_warnings: boolean for suppressing warnings

report(print_output: bool = True, save: bool = False, filepath: str = 'preprocess_report.txt') dict

Returns a dictionary containing summary information from the latest run.

  • print_output : boolean for printing the operation info

  • save : boolean for saving the info in a text file

  • filepath : path of the text file to save, default='preprocess_report.txt'

  • Dictionary with summary information from the latest pipeline run.

run(df: DataFrame, counter_col: str, timestamp_col: str, count_col: str, metadata_cols: list | None = None) DataFrame

mobts.imputation.selector module

Mixed utility module, concerned with selections for the donor-methods

This module contains:

  • identifying counters with missing counts

  • determining the minimum mutual period of donors from the config, based on the temporal frequency of the project

  • determining the minimum prediction period used in regression from the config, based on the temporal frequency of the project

  • function for determining if the counter is eligible to be filled in using the scaled medians method

  • function for selecting donor stations for the regression method

  • function for determining if the counter is eligible to be filled in using the regression method

  • determining the eligible imputation method for each counter

mobts.imputation.selector._counter_method_choice(target: str, pivot: DataFrame, donor_map: dict[str, list], freq: str, donors_cfg: DonorsConfig = DonorsConfig(), out_cfg: OutputConfig = OutputConfig(col_sm_imputed='count_sm_imputed', col_reg_imputed='count_reg_imputed', col_final='count_imputed', col_method_used='imputation_method', stl_method='STL', sm_method='M7', reg_method='M8')) str

Picks the best eligible method for the target counter, in order of preference: M8, then M7, then STL

  • target: the counter that is the target of the function

  • pivot: pivoted form of the data (timestamp index, counter columns, count values)

  • donor_map: dictionary map of donors

  • freq: temporal frequency of the project

  • donors_cfg: donor config

  • out_cfg: output config

  • string indicating the best eligible method for the target counter
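The preference order stated for this function (M8, then M7, then STL) amounts to a simple fallback chain. The sketch below assumes the two eligibility checks have already been evaluated; the function name `choose_method` is hypothetical, and the labels are the defaults shown in OutputConfig.

```python
def choose_method(eligible_for_regression: bool, eligible_for_scaled_median: bool) -> str:
    """Hypothetical sketch of the fallback order M8 -> M7 -> STL."""
    if eligible_for_regression:
        return "M8"   # regression on donors
    if eligible_for_scaled_median:
        return "M7"   # scaled median of donors
    return "STL"      # fallback that needs no donors

print(choose_method(True, True))    # M8
print(choose_method(False, True))   # M7
print(choose_method(False, False))  # STL
```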

mobts.imputation.selector._find_counters_with_holes(df: DataFrame, count_col: str, counter_col: str) list

Finds counters with missing values

  • df: preprocessed network DataFrame

  • count_col: count column

  • counter_col: counter column

  • list of counters that have missing counts
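Finding counters with holes can be done with a grouped missing-value check. The following sketch mirrors the documented arguments, but the helper and the toy data are illustrative assumptions rather than the library's code.

```python
import pandas as pd

def find_counters_with_holes(df, count_col, counter_col):
    """Hypothetical sketch: counters with at least one missing count."""
    # Flag missing counts, then ask per counter whether any flag is set.
    has_hole = df[count_col].isna().groupby(df[counter_col]).any()
    return has_hole[has_hole].index.tolist()

toy_long = pd.DataFrame({
    "counter": ["A", "A", "B", "B", "C"],
    "count": [1.0, 2.0, None, 4.0, 5.0],
})
counters_with_holes = find_counters_with_holes(toy_long, "count", "counter")
print(counters_with_holes)
```

A list like this is what the ‘counters’ argument of the imputation functions could be fed with, so that complete counters are skipped.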

mobts.imputation.selector._get_min_mutual_period(freq: str, donors_cfg: DonorsConfig = DonorsConfig()) int

Determines the minimum mutual period for donors from config

  • freq: temporal frequency of the project

  • donors_cfg: donor config

  • minimum mutual period

mobts.imputation.selector._get_min_prediction_period(freq: str, donors_cfg: DonorsConfig = DonorsConfig()) int

Determines the minimum prediction period for donors from config

  • freq: temporal frequency of the project

  • donors_cfg: donor config

  • minimum prediction period needed for regression

mobts.imputation.selector._is_eligible_for_regression(target: str, pivot: DataFrame, freq: str, donors: list, donors_cfg: DonorsConfig = DonorsConfig()) bool

Determines if the counter is eligible for regression imputation method

  • target: the counter that is the target of the function

  • pivot: pivoted form of the data (timestamp index, counter columns, count values)

  • freq: temporal frequency of the project

  • donors: list of donors retrieved from the donor map

  • donors_cfg: donor config

  • boolean indicating if the counter is eligible for regression imputation method

mobts.imputation.selector._is_eligible_for_scaled_median(target: str, pivot: DataFrame, freq: str, donors: list[str], donors_cfg: DonorsConfig = DonorsConfig()) bool

Determines if the counter is eligible for scaled median method

  • target: the counter that is the target of the function

  • pivot: pivoted form of the data (timestamp index, counter columns, count values)

  • freq: temporal frequency of the project

  • donors: list of donors retrieved from the donor map

  • donors_cfg: donor config

  • boolean indicating if the counter is eligible for scaled median imputation method

mobts.imputation.selector._select_regression_donors(target: str, pivot: DataFrame, freq: str, donors: list, donors_cfg: DonorsConfig = DonorsConfig()) list

Selects donors for regression

  • target: the counter that is the target of the function

  • pivot: pivoted form of the data (timestamp index, counter columns, count values)

  • freq: temporal frequency of the project

  • donors: list of donors retrieved from the donor map

  • donors_cfg: donor config

  • list of eligible donors for the regression imputation

mobts.imputation.stl module

STL imputation, prerequisite for donor imputation

This module contains:

  • setting the ‘period’ argument based on temporal frequency, to be used in STL functions

  • determining the temporal column, based on temporal frequency, on which STL will operate

  • a linear interpolation function for initiating the STL function

  • a rolling median function used to calculate the rolling median of STL residuals

  • a function applying the initial interpolation for STL

  • application of the STL function on one counter (method with adjustment for long holes)

  • application of STL on the entire network

mobts.imputation.stl._get_grouping_column_for_stl(freq: str) str

Determines the temporal column for the STL function to operate on

  • freq: temporal frequency of the project

  • string indicating the temporal column. “weekday” for daily data, “how” (hour of week) for hourly data

mobts.imputation.stl._get_stl_period(freq: str, stl_cfg: STLConfig = STLConfig(rolling_median_window=2, rolling_median_min_valid=1)) int

Determines the ‘period’ argument for the STL function

  • freq: temporal frequency of the project

  • stl_cfg: config for STL

  • Integer for STL period: 7 for daily data, and 168 for hourly data
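The frequency-dependent settings documented for the two helpers above (period 7 and column “weekday” for daily data, period 168 and column “how” for hourly data) can be summarised as a lookup. The frequency labels "daily" and "hourly" and the name `stl_settings` are assumptions for illustration; the values actually accepted by `freq` are not documented here.

```python
# Hypothetical frequency labels mapped to the documented STL settings.
STL_SETTINGS = {
    "daily": {"period": 7, "grouping_column": "weekday"},   # 7 days per week
    "hourly": {"period": 168, "grouping_column": "how"},    # 24 * 7 hours per week
}

def stl_settings(freq: str) -> dict:
    """Hypothetical sketch: return STL period and grouping column for `freq`."""
    try:
        return STL_SETTINGS[freq]
    except KeyError:
        raise ValueError(f"unsupported frequency: {freq!r}") from None

print(stl_settings("daily"))
print(stl_settings("hourly"))
```

Both periods describe one week of observations, which matches a weekly seasonal cycle in the counts.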

mobts.imputation.stl._initial_interpolate_for_stl(df: DataFrame, cols: ColumnsConfig, out_cfg: OutputConfig) DataFrame

Applies the preliminary interpolation necessary for STL

  • df: full dataset

  • cols: columns config

  • out_cfg: config for output columns’ names

  • DataFrame with interpolated time-series

This interpolation allows us to preserve the trend for the missing periods.

mobts.imputation.stl._interpolate_linear(s: Series) Series

Basic linear interpolation

  • s: time series corresponding to one single counter

  • the interpolated time series
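Linear interpolation of a single counter's series can be sketched with `Series.interpolate`. This assumes evenly spaced observations and default pandas behaviour; any extra options used by the library are not known from this page.

```python
import numpy as np
import pandas as pd

# Toy series for one counter with a two-step hole in the middle.
s = pd.Series(
    [1.0, np.nan, np.nan, 4.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# method="linear" treats observations as equally spaced and draws a
# straight line between the valid values on either side of the hole.
interpolated = s.interpolate(method="linear")
print(interpolated)
```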

mobts.imputation.stl._rolling_median_week_window(series: Series, freq: str, stl_cfg: STLConfig = STLConfig(rolling_median_window=2, rolling_median_min_valid=1)) Series

Calculates a rolling median of a time series

  • series: time series corresponding to one single counter

  • freq: temporal frequency of the project

  • stl_cfg: config for STL

  • time series of rolling medians for the input series
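A rolling median that tolerates missing values can be sketched with pandas as follows. The parameters mirror the STLConfig defaults shown in the signature (window 2, minimum of 1 valid value), but here the window simply counts observations; how the library converts its window into a number of observations per week is an assumption left open.

```python
import numpy as np
import pandas as pd

def rolling_median(series, window=2, min_valid=1):
    """Hypothetical sketch: trailing rolling median over `window` observations."""
    # min_periods lets a window with some NaNs still produce a value,
    # as long as at least `min_valid` observations are present.
    return series.rolling(window=window, min_periods=min_valid).median()

toy_series = pd.Series([1.0, np.nan, 3.0, 5.0])
rolled = rolling_median(toy_series)
print(rolled)
```

With the toy input, the second value is computed from the single valid observation in its window, and the last value is the median of 3 and 5.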

mobts.imputation.stl._stl_on_counter_hole_adjusted(g: DataFrame, freq: str, cols: ColumnsConfig, stl_cfg: STLConfig, out_cfg: OutputConfig) DataFrame

Applies STL on one counter

  • g: DataFrame for a single counter

  • freq: temporal frequency of the project

  • cols: columns config

  • stl_cfg: config for STL

  • out_cfg: config for output columns’ names

  • DataFrame with imputed missing values for one counter, using STL

mobts.imputation.stl.impute_stl(df: DataFrame, cols: ColumnsConfig = ColumnsConfig(counter='counter', timestamp='timestamp', count='count', weekday='weekday', week_num='week_num', how='how', hour='hour', date='date'), stl_cfg: STLConfig = STLConfig(rolling_median_window=2, rolling_median_min_valid=1), out_cfg: OutputConfig = OutputConfig(col_sm_imputed='count_sm_imputed', col_reg_imputed='count_reg_imputed', col_final='count_imputed', col_method_used='imputation_method', stl_method='STL', sm_method='M7', reg_method='M8')) DataFrame

Applies STL on all counters

  • df: full dataset

  • cols: columns config

  • stl_cfg: config for STL

  • out_cfg: config for output columns’ names

  • DataFrame with imputed missing values, using STL

Module contents