mobts.imputation package

Submodules

mobts.imputation.donors module

Module concerned with the donor-based imputations

This module contains:
- determining the minimum overlap period for the scaled median imputation method, based on the project’s temporal frequency
- building pivot tables for further operations, with timestamps as the index, counters as columns, and counts as values
- creating a correlation matrix of counters based on the Pearson correlation between counts
- scaled medians imputation
- regression imputation

mobts.imputation.donors._build_pivots(df: DataFrame, cols: ColumnsConfig = ColumnsConfig(counter='counter', timestamp='timestamp', count='count', weekday='weekday', week_num='week_num', how='how', hour='hour', date='date'), stl_cfg: STLConfig = STLConfig(rolling_median_window=2, rolling_median_min_valid=1)) → DataFrame

Builds pivots of the data, with timestamps as the index, counters as columns, and counts as values

df: full network DataFrame
cols: columns config
stl_cfg: STL config

pivot_raw: pivot built from the raw observed counts
pivot_ts: pivot built from the smoothed time series (STL trend + seasonality)
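As a rough illustration of the pivot layout described above (not the package's actual implementation), the sketch below reshapes a small, made-up long-format table with plain pandas. The column names follow the ColumnsConfig defaults; the data values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: one row per (counter, timestamp) pair.
df = pd.DataFrame({
    "counter": ["A", "A", "A", "B", "B", "B"],
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),
    "count": [10.0, np.nan, 12.0, 20.0, 22.0, 24.0],
})

# Timestamps become the index, counters the columns, counts the values.
# Missing observations stay as NaN so later steps can detect the holes.
pivot_raw = df.pivot(index="timestamp", columns="counter", values="count")
print(pivot_raw)
```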

mobts.imputation.donors._corralation_matrix_donors(pivot_for_corr: DataFrame) → DataFrame

Builds the correlation matrix of counters, based on the Pearson correlation between their counts

pivot_for_corr: the ‘_build_pivots’ function’s output, which is a pivot of counts

the correlation matrix of counters
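A minimal sketch of a counter-by-counter Pearson correlation matrix, assuming a pivot with one column per counter. It uses pandas' built-in `DataFrame.corr`, which may or may not be what the package calls internally; the data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical pivot: timestamps as index, one column per counter.
idx = pd.date_range("2024-01-01", periods=5, freq="D")
pivot = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0, 4.0, 5.0],
        "B": [2.0, 4.0, 6.0, 8.0, 10.0],
        "C": [5.0, 4.0, np.nan, 2.0, 1.0],
    },
    index=idx,
)

# Pairwise Pearson correlation between counter columns.
# Rows where either counter is NaN are ignored for that pair.
corr = pivot.corr(method="pearson")
print(corr.round(3))
```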

mobts.imputation.donors._get_min_overlap_period_sm(freq: str, donors_cfg: DonorsConfig = DonorsConfig()) → int

Determines the minimum overlap period necessary for scaled medians imputation

freq: temporal frequency of the project
donors_cfg: donors’ config

Integer corresponding to the minimum necessary overlap period

mobts.imputation.donors.impute_regression(df: DataFrame, pivot: DataFrame, freq: str, donor_map: dict[str, list[str]], counters=None, cols: ColumnsConfig = ColumnsConfig(counter='counter', timestamp='timestamp', count='count', weekday='weekday', week_num='week_num', how='how', hour='hour', date='date'), donors_cfg: DonorsConfig = DonorsConfig(), stl_cfg: STLConfig = STLConfig(rolling_median_window=2, rolling_median_min_valid=1), out_cfg: OutputConfig = OutputConfig(col_sm_imputed='count_sm_imputed', col_reg_imputed='count_reg_imputed', col_final='count_imputed', col_method_used='imputation_method', stl_method='STL', sm_method='M7', reg_method='M8')) → DataFrame

Fills missing values using regression prediction of donors (M8)

df: the complete network dataset
pivot: pivoted dataset of counters
donor_map: dictionary map of donors
freq: temporal frequency of the project
counters: counters to be operated on. If None, all counters will be processed
cols: columns config
donors_cfg: donors’ config
out_cfg: output config
stl_cfg: STL config

Imputed DataFrame using regression method (M8)

The ‘counters’ argument exists so that the pipeline can skip counters that have no data holes and process only the counters with holes.
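The source does not spell out the regression model behind M8. The sketch below shows one plausible reading, stated as an assumption: an ordinary least-squares fit of the target on its donors over timestamps where all are observed, used to predict the target where it is missing. The function name `regression_fill` and the data are hypothetical.

```python
import numpy as np
import pandas as pd

def regression_fill(pivot: pd.DataFrame, target: str, donors: list) -> pd.Series:
    """Fill NaNs in pivot[target] with OLS predictions from the donor columns."""
    y = pivot[target]
    X = pivot[donors]
    donors_ok = X.notna().all(axis=1)
    train = y.notna() & donors_ok      # rows usable for fitting
    predict = y.isna() & donors_ok     # holes that the donors can cover

    # Design matrix with an intercept column, solved by least squares.
    A = np.column_stack([np.ones(int(train.sum())), X[train].to_numpy()])
    coef, *_ = np.linalg.lstsq(A, y[train].to_numpy(), rcond=None)

    out = y.copy()
    if predict.any():
        P = np.column_stack([np.ones(int(predict.sum())), X[predict].to_numpy()])
        out.loc[predict] = P @ coef
    return out

# Hypothetical data: the target follows 3 + 2 * donor, with one hole.
pivot = pd.DataFrame({
    "D": [1.0, 2.0, 3.0, 4.0, 5.0],
    "T": [5.0, 7.0, np.nan, 11.0, 13.0],
})
filled = regression_fill(pivot, "T", ["D"])
print(filled)
```

A real implementation would also need the eligibility checks (minimum mutual period, minimum prediction period) described in the selector module.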

mobts.imputation.donors.impute_scaled_median(df: DataFrame, pivot: DataFrame, donor_map: dict[str, list[str]], freq: str, counters=None, cols: ColumnsConfig = ColumnsConfig(counter='counter', timestamp='timestamp', count='count', weekday='weekday', week_num='week_num', how='how', hour='hour', date='date'), donors_cfg: DonorsConfig = DonorsConfig(), out_cfg: OutputConfig = OutputConfig(col_sm_imputed='count_sm_imputed', col_reg_imputed='count_reg_imputed', col_final='count_imputed', col_method_used='imputation_method', stl_method='STL', sm_method='M7', reg_method='M8')) → DataFrame

Fills missing values using scaled median of donors (M7)

df: the complete network dataset
pivot: pivoted dataset of counters
donor_map: dictionary map of donors
freq: temporal frequency of the project
counters: counters to be operated on. If None, all counters will be processed
cols: columns config
donors_cfg: donors’ config
out_cfg: output config

Imputed DataFrame using scaled medians method (M7)

The ‘counters’ argument exists so that the pipeline can skip counters that have no data holes and process only the counters with holes.
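The exact M7 formula is not given here. As a hedged sketch of one common interpretation of "scaled median", each donor is rescaled by the median target-to-donor ratio over their overlap, and the hole is filled with the median of the rescaled donors. The function name `scaled_median_fill`, the `min_overlap` parameter, and the data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def scaled_median_fill(pivot: pd.DataFrame, target: str, donors: list,
                       min_overlap: int = 3) -> pd.Series:
    """Fill NaNs in pivot[target] using donors rescaled to the target's level."""
    estimates = {}
    for d in donors:
        # Timestamps where both series are observed (and the donor is non-zero).
        both = pivot[target].notna() & pivot[d].notna() & (pivot[d] != 0)
        if both.sum() < min_overlap:
            continue  # not enough shared history to trust the scale
        scale = (pivot.loc[both, target] / pivot.loc[both, d]).median()
        estimates[d] = pivot[d] * scale
    if not estimates:
        return pivot[target].copy()
    # Median across the rescaled donors, ignoring donors that are missing.
    fill = pd.DataFrame(estimates).median(axis=1)
    return pivot[target].fillna(fill)

# Hypothetical data: D1 runs at half the target's level, D2 at double.
pivot = pd.DataFrame({
    "T": [10.0, 20.0, np.nan, 40.0],
    "D1": [5.0, 10.0, 15.0, 20.0],
    "D2": [20.0, 40.0, 60.0, 80.0],
})
filled = scaled_median_fill(pivot, "T", ["D1", "D2"])
print(filled)
```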

mobts.imputation.pipeline module

The pipeline for imputation subpackage

This module contains the ‘impute’ class. Its ‘run’ method includes:
- formatting and verifying the temporal elements of the input dataset
- identifying counters with holes
- applying STL, scaled medians imputation, and donor regression imputation

class mobts.imputation.pipeline.impute(cols: ColumnsConfig = ColumnsConfig(counter='counter', timestamp='timestamp', count='count', weekday='weekday', week_num='week_num', how='how', hour='hour', date='date'), stl_cfg: STLConfig = STLConfig(rolling_median_window=2, rolling_median_min_valid=1), donors_cfg: DonorsConfig = DonorsConfig(), out_cfg: OutputConfig = OutputConfig(col_sm_imputed='count_sm_imputed', col_reg_imputed='count_reg_imputed', col_final='count_imputed', col_method_used='imputation_method', stl_method='STL', sm_method='M7', reg_method='M8'), cfg_spr: SparsityConfig = SparsityConfig(drop_sparse_counters=True, sparse_threshold=0.5), suppress_runtime_warnings: bool = True)

Bases: object

End-to-end pipeline of input data to imputed data

df: input dataset
cols: columns config
stl_cfg: STL config
donors_cfg: donors config
out_cfg: output config
suppress_runtime_warnings: boolean for suppressing warnings

report(print_output: bool = True, save: bool = False, filepath: str = 'preprocess_report.txt') → dict

Returns a dictionary containing summary information from the latest run.

print_output: boolean for printing the operation info
save: boolean for saving the info in a text file
filepath: path of the text file to save, default ‘preprocess_report.txt’

Dictionary with summary information from the latest pipeline run.

run(df: DataFrame, counter_col: str, timestamp_col: str, count_col: str, metadata_cols: list | None = None) → DataFrame

mobts.imputation.selector module

Mixed utility module, concerned with selections for the donor-methods

This module contains:
- identifying counters with missing counts
- determining the minimum mutual period of donors from the config, based on the temporal frequency of the project
- determining the minimum prediction period used in regression from the config, based on the temporal frequency of the project
- determining whether a counter is eligible to be filled in using the scaled medians method
- selecting donor stations for the regression method
- determining whether a counter is eligible to be filled in using the regression method
- determining the eligible imputation method for each counter

mobts.imputation.selector._counter_method_choice(target: str, pivot: DataFrame, donor_map: dict[str, list], freq: str, donors_cfg: DonorsConfig = DonorsConfig(), out_cfg: OutputConfig = OutputConfig(col_sm_imputed='count_sm_imputed', col_reg_imputed='count_reg_imputed', col_final='count_imputed', col_method_used='imputation_method', stl_method='STL', sm_method='M7', reg_method='M8')) → str

Picks the best eligible method for each counter, trying M8 first, then M7, then STL

target: the counter that is the target of the function
pivot: pivoted form of the data (timestamp index, counter columns, count values)
donor_map: dictionary map of donors
freq: temporal frequency of the project
donors_cfg: donor config
out_cfg: output config

string indicating the best eligible method for the target counter
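The fallback order described above (M8, then M7, then STL) can be sketched as a simple priority chain. The helper `choose_method` and its boolean inputs are hypothetical; the method labels come from the OutputConfig defaults shown in the signature.

```python
def choose_method(regression_ok: bool, scaled_median_ok: bool) -> str:
    """Return the highest-priority method whose eligibility check passed."""
    if regression_ok:
        return "M8"   # donor regression
    if scaled_median_ok:
        return "M7"   # scaled median of donors
    return "STL"      # always-available fallback

print(choose_method(True, True))
print(choose_method(False, True))
print(choose_method(False, False))
```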

mobts.imputation.selector._find_counters_with_holes(df: DataFrame, count_col: str, counter_col: str) → list

Finds counters with missing values

df: preprocessed network DataFrame
count_col: count column
counter_col: counter column

list of counters that have missing counts
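A minimal pandas sketch of the idea, assuming a long-format table where a missing count is stored as NaN. The function name mirrors the documented one without the leading underscore and is not the package's code.

```python
import numpy as np
import pandas as pd

def find_counters_with_holes(df: pd.DataFrame, count_col: str, counter_col: str) -> list:
    """Return the counters that have at least one missing count."""
    has_hole = df[count_col].isna().groupby(df[counter_col]).any()
    return has_hole[has_hole].index.tolist()

# Hypothetical data: only counter "A" has a missing count.
df = pd.DataFrame({
    "counter": ["A", "A", "B", "B"],
    "count": [10.0, np.nan, 20.0, 22.0],
})
holes = find_counters_with_holes(df, count_col="count", counter_col="counter")
print(holes)
```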

mobts.imputation.selector._get_min_mutual_period(freq: str, donors_cfg: DonorsConfig = DonorsConfig()) → int

Determines the minimum mutual period for donors from the config

freq: temporal frequency of the project
donors_cfg: donor config

minimum mutual period

mobts.imputation.selector._get_min_prediction_period(freq: str, donors_cfg: DonorsConfig = DonorsConfig()) → int

Determines the minimum prediction period for donors from the config

freq: temporal frequency of the project
donors_cfg: donor config

minimum prediction period needed for regression

mobts.imputation.selector._is_eligible_for_regression(target: str, pivot: DataFrame, freq: str, donors: list, donors_cfg: DonorsConfig = DonorsConfig()) → bool

Determines if the counter is eligible for regression imputation method

target: the counter that is the target of the function
pivot: pivoted form of the data (timestamp index, counter columns, count values)
freq: temporal frequency of the project
donors: list of donors retrieved from the donor map
donors_cfg: donor config

boolean indicating if the counter is eligible for regression imputation method

mobts.imputation.selector._is_eligible_for_scaled_median(target: str, pivot: DataFrame, freq: str, donors: list[str], donors_cfg: DonorsConfig = DonorsConfig()) → bool

Determines if the counter is eligible for scaled median method

target: the counter that is the target of the function
pivot: pivoted form of the data (timestamp index, counter columns, count values)
freq: temporal frequency of the project
donors: list of donors retrieved from the donor map
donors_cfg: donor config

boolean indicating if the counter is eligible for scaled median imputation method

mobts.imputation.selector._select_regression_donors(target: str, pivot: DataFrame, freq: str, donors: list, donors_cfg: DonorsConfig = DonorsConfig()) → list

Selects donors for regression

target: the counter that is the target of the function
pivot: pivoted form of the data (timestamp index, counter columns, count values)
freq: temporal frequency of the project
donors: list of donors retrieved from the donor map
donors_cfg: donor config

list of eligible donors for the regression imputation

mobts.imputation.stl module

STL imputation, prerequisite for donor imputation

This module contains:
- setting the ‘period’ argument based on temporal frequency, to be used in STL functions
- determining the temporal column on which STL will operate, based on temporal frequency
- a linear interpolation function for initiating the STL function
- a rolling median function used for calculating the rolling median of STL residuals
- a function applying the initial interpolation for STL
- application of the STL function on one counter (method with adjustment for long holes)
- application of STL on the entire network

mobts.imputation.stl._get_grouping_column_for_stl(freq: str) → str

Determines the temporal column for the STL function to operate on

freq: temporal frequency of the project

string indicating the temporal column. “weekday” for daily data, “how” (hour of week) for hourly data

mobts.imputation.stl._get_stl_period(freq: str, stl_cfg: STLConfig = STLConfig(rolling_median_window=2, rolling_median_min_valid=1)) → int

Determines the ‘period’ argument for the STL function

freq: temporal frequency of the project
stl_cfg: config for STL

Integer for the STL period: 7 for daily data, 168 for hourly data
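The two frequency lookups documented above (grouping column and STL period) amount to small mappings. The sketch below is illustrative only: the accepted `freq` strings are not stated in this reference, so "daily" and "hourly" are assumed labels, and the helper names are hypothetical.

```python
# Assumed frequency labels; the real package may use different strings.
STL_PERIOD = {"daily": 7, "hourly": 168}           # points per weekly cycle
GROUPING_COLUMN = {"daily": "weekday", "hourly": "how"}

def stl_period(freq: str) -> int:
    """Number of observations in one weekly seasonal cycle."""
    if freq not in STL_PERIOD:
        raise ValueError(f"unsupported frequency: {freq!r}")
    return STL_PERIOD[freq]

def grouping_column(freq: str) -> str:
    """Temporal column that identifies a position within the weekly cycle."""
    if freq not in GROUPING_COLUMN:
        raise ValueError(f"unsupported frequency: {freq!r}")
    return GROUPING_COLUMN[freq]

print(stl_period("hourly"), grouping_column("hourly"))
```

The hourly period of 168 is simply 24 hours times 7 days, which matches the "how" (hour of week) grouping.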

mobts.imputation.stl._initial_interpolate_for_stl(df: DataFrame, cols: ColumnsConfig, out_cfg: OutputConfig) → DataFrame

Applies the preliminary interpolation necessary for STL

df: full dataset
cols: columns config
out_cfg: config for output columns’ names

DataFrame with interpolated time-series

interpolation. This allows us to preserve the trend for the missing periods.

mobts.imputation.stl._interpolate_linear(s: Series) → Series

Basic linear interpolation

s: time series corresponding to one single counter

the interpolated time series
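For reference, linear interpolation of a single counter's series can be done with pandas' `Series.interpolate`. This is a sketch of the general technique on made-up values, not necessarily the exact call the package makes.

```python
import numpy as np
import pandas as pd

# Hypothetical series for one counter, with interior gaps.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

# Each gap is filled along the straight line between its valid neighbours.
filled = s.interpolate(method="linear")
print(filled.tolist())
```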

mobts.imputation.stl._rolling_median_week_window(series: Series, freq: str, stl_cfg: STLConfig = STLConfig(rolling_median_window=2, rolling_median_min_valid=1)) → Series

Calculates a rolling median of time-series

series: time series corresponding to one single counter
freq: temporal frequency of the project
stl_cfg: config for STL

time series of rolling medians for the input series
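A hedged sketch of a week-based rolling median with pandas. It assumes that `rolling_median_window` counts weeks and `rolling_median_min_valid` is the minimum number of non-missing points per window; neither is confirmed by this reference. The helper name and its `points_per_week` parameter are hypothetical.

```python
import numpy as np
import pandas as pd

def rolling_median_weeks(series: pd.Series, points_per_week: int,
                         window_weeks: int = 2, min_valid: int = 1) -> pd.Series:
    """Trailing rolling median over a window expressed in weeks."""
    window = window_weeks * points_per_week
    # min_periods counts non-NaN observations, so sparse windows still
    # produce a value as long as min_valid points are present.
    return series.rolling(window=window, min_periods=min_valid).median()

# Hypothetical daily residuals over three weeks, with a gap and an outlier.
resid = pd.Series(np.arange(21, dtype=float))
resid.iloc[5] = np.nan
resid.iloc[10] = 500.0
smoothed = rolling_median_weeks(resid, points_per_week=7)
print(smoothed.tail())
```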

mobts.imputation.stl._stl_on_counter_hole_adjusted(g: DataFrame, freq: str, cols: ColumnsConfig, stl_cfg: STLConfig, out_cfg: OutputConfig) → DataFrame

Applies STL on one counter

g: DataFrame for a single counter
freq: temporal frequency of the project
cols: columns config
stl_cfg: config for STL
out_cfg: config for output columns’ names

DataFrame with imputed missing values for one counter, using STL

mobts.imputation.stl.impute_stl(df: DataFrame, cols: ColumnsConfig = ColumnsConfig(counter='counter', timestamp='timestamp', count='count', weekday='weekday', week_num='week_num', how='how', hour='hour', date='date'), stl_cfg: STLConfig = STLConfig(rolling_median_window=2, rolling_median_min_valid=1), out_cfg: OutputConfig = OutputConfig(col_sm_imputed='count_sm_imputed', col_reg_imputed='count_reg_imputed', col_final='count_imputed', col_method_used='imputation_method', stl_method='STL', sm_method='M7', reg_method='M8')) → DataFrame

Applies STL on all counters

df: full dataset
cols: columns config
stl_cfg: config for STL
out_cfg: config for output columns’ names

DataFrame with imputed missing values, using STL