mobts.imputation package
Submodules
mobts.imputation.donors module
Module concerned with the donor-based imputations
This module contains:
  - determining the minimum overlap period for the scaled median imputation method, based on the project’s temporal frequency
  - building pivot tables for further operations, with timestamps as the index, counters as columns, and counts as values
  - creating a correlation matrix of counters, based on the Pearson correlation between their counts
  - scaled median imputation
  - regression imputation
- mobts.imputation.donors._build_pivots(df: DataFrame, cols: ColumnsConfig = ColumnsConfig(counter='counter', timestamp='timestamp', count='count', weekday='weekday', week_num='week_num', how='how', hour='hour', date='date'), stl_cfg: STLConfig = STLConfig(rolling_median_window=2, rolling_median_min_valid=1)) DataFrame
Builds pivots of the data, with timestamps as the index, counters as columns, and counts as values
df: full network DataFrame
cols: columns config
stl_cfg: STL config
pivot_raw: pivot based on the raw observed counts
pivot_ts: pivot based on the smoothed time series (STL trend + seasonality)
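As a rough illustration of the pivot layout described above (timestamp index, counter columns, count values), the following sketch uses plain pandas. The helper name `build_raw_pivot` and the sample data are assumptions made for illustration; this is not the package's actual implementation.

```python
import pandas as pd


def build_raw_pivot(df, counter="counter", timestamp="timestamp", count="count"):
    """Hypothetical helper: reshape long data into one column per counter."""
    # DataFrame.pivot requires each (timestamp, counter) pair to be unique
    return df.pivot(index=timestamp, columns=counter, values=count)


# Toy long-format data: two counters, two days, one missing count
long_df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"]
    ),
    "counter": ["A", "B", "A", "B"],
    "count": [10.0, 20.0, 12.0, None],
})

pivot_raw = build_raw_pivot(long_df)
```

Missing counts stay as NaN in the pivot, which is what lets later steps locate the holes per counter.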
- mobts.imputation.donors._corralation_matrix_donors(pivot_for_corr: DataFrame) DataFrame
Builds the correlation matrix of counters, based on the Pearson correlation between their counts
pivot_for_corr: the ‘_build_pivots’ function’s output, which is a pivot of counts
the correlation matrix of counters
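A counter-by-counter Pearson correlation matrix can be obtained directly from a pivot with pandas. The sketch below runs on toy data; the `min_periods` value is an assumption, and the package may apply additional filtering that is not shown here.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=6, freq="D")
pivot = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "B": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],   # perfectly follows A
        "C": [6.0, 5.0, np.nan, 3.0, 2.0, 1.0],  # mirrors A, with one hole
    },
    index=idx,
)

# Pairwise Pearson correlation between counter columns; rows where either
# counter is missing are ignored for that pair
corr = pivot.corr(method="pearson", min_periods=3)
```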
- mobts.imputation.donors._get_min_overlap_period_sm(freq: str, donors_cfg: DonorsConfig = DonorsConfig()) int
Determines the minimum overlap period necessary for scaled medians imputation
freq: temporal frequency of the project
donors_cfg: donors’ config
Integer corresponding to the minimum necessary overlap period
- mobts.imputation.donors.impute_regression(df: DataFrame, pivot: DataFrame, freq: str, donor_map: dict[str, list[str]], counters=None, cols: ColumnsConfig = ColumnsConfig(counter='counter', timestamp='timestamp', count='count', weekday='weekday', week_num='week_num', how='how', hour='hour', date='date'), donors_cfg: DonorsConfig = DonorsConfig(), stl_cfg: STLConfig = STLConfig(rolling_median_window=2, rolling_median_min_valid=1), out_cfg: OutputConfig = OutputConfig(col_sm_imputed='count_sm_imputed', col_reg_imputed='count_reg_imputed', col_final='count_imputed', col_method_used='imputation_method', stl_method='STL', sm_method='M7', reg_method='M8')) DataFrame
Fills missing values using regression prediction of donors (M8)
df: the complete network dataset
pivot: pivoted dataset of counters
donor_map: dictionary map of donors
freq: temporal frequency of the project
counters: counters to operate on. If None, all counters are processed
cols: columns config
donors_cfg: donors’ config
out_cfg: output config
stl_cfg: STL config
Imputed DataFrame using regression method (M8)
The ‘counters’ argument lets the pipeline skip counters that have no data holes, so that only counters with holes are processed.
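To give an idea of what a donor regression does, here is a minimal sketch that fits an ordinary least-squares model of the target counter on its donors and predicts the missing values. The helper `impute_by_donor_regression` is hypothetical, and the package's own model, donor selection and eligibility checks may differ.

```python
import numpy as np
import pandas as pd


def impute_by_donor_regression(pivot, target, donors):
    """Hypothetical helper: OLS of the target on its donors (with intercept)."""
    X = pivot[donors]
    y = pivot[target]
    donors_ok = X.notna().all(axis=1)
    train = y.notna() & donors_ok      # rows used to fit the model
    predict = y.isna() & donors_ok     # holes that the donors can fill

    # Design matrix with a leading column of ones for the intercept
    A = np.column_stack([np.ones(int(train.sum())), X[train].to_numpy()])
    coef, *_ = np.linalg.lstsq(A, y[train].to_numpy(), rcond=None)

    filled = y.copy()
    if predict.any():
        P = np.column_stack([np.ones(int(predict.sum())), X[predict].to_numpy()])
        filled.loc[predict] = P @ coef
    return filled


idx = pd.date_range("2024-01-01", periods=6, freq="D")
pivot = pd.DataFrame(
    {"T": [3.0, 5.0, np.nan, 9.0, 11.0, np.nan],   # toy target: T = 2 * D1 + 1
     "D1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]},
    index=idx,
)
filled = impute_by_donor_regression(pivot, "T", ["D1"])
```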
- mobts.imputation.donors.impute_scaled_median(df: DataFrame, pivot: DataFrame, donor_map: dict[str, list[str]], freq: str, counters=None, cols: ColumnsConfig = ColumnsConfig(counter='counter', timestamp='timestamp', count='count', weekday='weekday', week_num='week_num', how='how', hour='hour', date='date'), donors_cfg: DonorsConfig = DonorsConfig(), out_cfg: OutputConfig = OutputConfig(col_sm_imputed='count_sm_imputed', col_reg_imputed='count_reg_imputed', col_final='count_imputed', col_method_used='imputation_method', stl_method='STL', sm_method='M7', reg_method='M8')) DataFrame
Fills missing values using scaled median of donors (M7)
df: the complete network dataset
pivot: pivoted dataset of counters
donor_map: dictionary map of donors
freq: temporal frequency of the project
counters: counters to operate on. If None, all counters are processed
cols: columns config
donors_cfg: donors’ config
out_cfg: output config
Imputed DataFrame using scaled medians method (M7)
The ‘counters’ argument lets the pipeline skip counters that have no data holes, so that only counters with holes are processed.
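One plausible reading of a scaled median imputation is sketched below: each donor is rescaled by the ratio of medians over the period it shares with the target, and the holes are filled with the median of the rescaled donors. The helper name, the `min_overlap` parameter and the exact scaling rule are assumptions; the package's M7 method may be defined differently.

```python
import numpy as np
import pandas as pd


def impute_by_scaled_median(pivot, target, donors, min_overlap=3):
    """Hypothetical helper: fill holes with the median of rescaled donors."""
    y = pivot[target]
    scaled = {}
    for d in donors:
        overlap = y.notna() & pivot[d].notna()
        if overlap.sum() < min_overlap:
            continue  # not enough shared observations to estimate a scale
        donor_median = pivot.loc[overlap, d].median()
        if donor_median == 0:
            continue  # avoid dividing by zero
        # Bring the donor to the target's level using the ratio of medians
        scaled[d] = pivot[d] * (y[overlap].median() / donor_median)
    if not scaled:
        return y.copy()
    estimate = pd.DataFrame(scaled).median(axis=1)
    return y.fillna(estimate)


idx = pd.date_range("2024-01-01", periods=5, freq="D")
pivot = pd.DataFrame(
    {"T": [10.0, 20.0, np.nan, 40.0, 50.0],
     "D1": [1.0, 2.0, 3.0, 4.0, 5.0],
     "D2": [100.0, 200.0, 300.0, 400.0, 500.0]},
    index=idx,
)
filled = impute_by_scaled_median(pivot, "T", ["D1", "D2"])
```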
mobts.imputation.pipeline module
The pipeline for imputation subpackage
This module contains the ‘impute’ class. Its ‘run’ method includes:
  - formatting and verifying the temporal elements of the input dataset
  - identifying counters with holes
  - applying STL, scaled median imputation, and donor regression imputation
- class mobts.imputation.pipeline.impute(cols: ColumnsConfig = ColumnsConfig(counter='counter', timestamp='timestamp', count='count', weekday='weekday', week_num='week_num', how='how', hour='hour', date='date'), stl_cfg: STLConfig = STLConfig(rolling_median_window=2, rolling_median_min_valid=1), donors_cfg: DonorsConfig = DonorsConfig(), out_cfg: OutputConfig = OutputConfig(col_sm_imputed='count_sm_imputed', col_reg_imputed='count_reg_imputed', col_final='count_imputed', col_method_used='imputation_method', stl_method='STL', sm_method='M7', reg_method='M8'), cfg_spr: SparsityConfig = SparsityConfig(drop_sparse_counters=True, sparse_threshold=0.5), suppress_runtime_warnings: bool = True)
Bases: object
End-to-end pipeline from input data to imputed data
df: input dataset
cols: columns config
stl_cfg: STL config
donors_cfg: donors config
out_cfg: output config
suppress_runtime_warnings: boolean for suppressing warnings
- report(print_output: bool = True, save: bool = False, filepath: str = 'preprocess_report.txt') dict
Returns a dictionary containing summary information from the latest run.
print_output : boolean for printing the operation info
save : boolean for saving the info in a text file
filepath : path of the text file to save, default='preprocess_report.txt'
Dictionary with summary information from the latest pipeline run.
- run(df: DataFrame, counter_col: str, timestamp_col: str, count_col: str, metadata_cols: list | None = None) DataFrame
mobts.imputation.selector module
Mixed utility module, concerned with selections for the donor methods
This module contains:
  - identifying counters with missing counts
  - determining the minimum mutual period of donors from the config, based on the temporal frequency of the project
  - determining the minimum prediction period used in regression from the config, based on the temporal frequency of the project
  - a function for determining whether a counter is eligible to be filled in using the scaled median method
  - a function for selecting donor stations for the regression method
  - a function for determining whether a counter is eligible to be filled in using the regression method
  - determining the eligible imputation method for each counter
- mobts.imputation.selector._counter_method_choice(target: str, pivot: DataFrame, donor_map: dict[str, list], freq: str, donors_cfg: DonorsConfig = DonorsConfig(), out_cfg: OutputConfig = OutputConfig(col_sm_imputed='count_sm_imputed', col_reg_imputed='count_reg_imputed', col_final='count_imputed', col_method_used='imputation_method', stl_method='STL', sm_method='M7', reg_method='M8')) str
Picks the best eligible method for the target counter, in order of preference: M8 first, then M7, then STL
target: the counter that is the target of the function
pivot: pivoted form of the data (timestamp index, counter columns, count values)
donor_map: dictionary map of donors
freq: temporal frequency of the project
donors_cfg: donor config
out_cfg: output config
string indicating the best eligible method for the target counter
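The preference order described above amounts to a simple fallback chain. In the sketch below, the function `choose_method` and its boolean inputs are hypothetical stand-ins for the eligibility checks; the method labels are the `OutputConfig` defaults shown in the signature.

```python
def choose_method(reg_ok: bool, sm_ok: bool,
                  reg_method: str = "M8",
                  sm_method: str = "M7",
                  stl_method: str = "STL") -> str:
    """Hypothetical helper: regression first, then scaled median, then STL."""
    if reg_ok:
        return reg_method
    if sm_ok:
        return sm_method
    return stl_method  # STL is the fallback when no donor method is eligible


choice_both = choose_method(reg_ok=True, sm_ok=True)
choice_sm_only = choose_method(reg_ok=False, sm_ok=True)
choice_none = choose_method(reg_ok=False, sm_ok=False)
```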
- mobts.imputation.selector._find_counters_with_holes(df: DataFrame, count_col: str, counter_col: str) list
Finds counters with missing values
df: preprocessed network DataFrame
count_col: count column
counter_col: counter column
list of counters that have missing counts
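Finding the counters that have at least one missing count is a short group-by in pandas. The sketch below assumes long-format data; the helper name `find_counters_with_holes` is illustrative and not the package's private function.

```python
import numpy as np
import pandas as pd


def find_counters_with_holes(df, count_col, counter_col):
    """Hypothetical helper: counters with at least one missing count."""
    # Flag missing counts, then check per counter whether any flag is set
    has_hole = df[count_col].isna().groupby(df[counter_col]).any()
    return has_hole[has_hole].index.tolist()


df = pd.DataFrame({
    "counter": ["A", "A", "B", "B"],
    "count": [1.0, np.nan, 2.0, 3.0],
})
holes = find_counters_with_holes(df, count_col="count", counter_col="counter")
```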
- mobts.imputation.selector._get_min_mutual_period(freq: str, donors_cfg: DonorsConfig = DonorsConfig()) int
Determines the minimum mutual period for donors from the config
freq: temporal frequency of the project
donors_cfg: donor config
minimum mutual period
- mobts.imputation.selector._get_min_prediction_period(freq: str, donors_cfg: DonorsConfig = DonorsConfig()) int
Determines the minimum prediction period for donors from the config
freq: temporal frequency of the project
donors_cfg: donor config
minimum prediction period needed for regression
- mobts.imputation.selector._is_eligible_for_regression(target: str, pivot: DataFrame, freq: str, donors: list, donors_cfg: DonorsConfig = DonorsConfig()) bool
Determines if the counter is eligible for regression imputation method
target: the counter that is the target of the function
pivot: pivoted form of the data (timestamp index, counter columns, count values)
freq: temporal frequency of the project
donors: list of donors retrieved from the donor map
donors_cfg: donor config
boolean indicating if the counter is eligible for regression imputation method
- mobts.imputation.selector._is_eligible_for_scaled_median(target: str, pivot: DataFrame, freq: str, donors: list[str], donors_cfg: DonorsConfig = DonorsConfig()) bool
Determines if the counter is eligible for scaled median method
target: the counter that is the target of the function
pivot: pivoted form of the data (timestamp index, counter columns, count values)
freq: temporal frequency of the project
donors: list of donors retrieved from the donor map
donors_cfg: donor config
boolean indicating if the counter is eligible for scaled median imputation method
- mobts.imputation.selector._select_regression_donors(target: str, pivot: DataFrame, freq: str, donors: list, donors_cfg: DonorsConfig = DonorsConfig()) list
Selects donors for regression
target: the counter that is the target of the function
pivot: pivoted form of the data (timestamp index, counter columns, count values)
freq: temporal frequency of the project
donors: list of donors retrieved from the donor map
donors_cfg: donor config
list of eligible donors for the regression imputation
mobts.imputation.stl module
STL imputation, prerequisite for donor imputation
This module contains:
  - setting the ‘period’ argument for the STL functions, based on the temporal frequency
  - determining the temporal column on which STL will operate, based on the temporal frequency
  - a linear interpolation function for initialising the STL function
  - a rolling median function, used to calculate the rolling median of the STL residuals
  - a function that applies the initial interpolation for STL
  - applying STL to a single counter (method with an adjustment for long holes)
  - applying STL to the entire network
- mobts.imputation.stl._get_grouping_column_for_stl(freq: str) str
Determines the temporal column for the STL function to operate on
freq: temporal frequency of the project
string indicating the temporal column. “weekday” for daily data, “how” (hour of week) for hourly data
- mobts.imputation.stl._get_stl_period(freq: str, stl_cfg: STLConfig = STLConfig(rolling_median_window=2, rolling_median_min_valid=1)) int
Determines the ‘period’ argument for the STL function
freq: temporal frequency of the project
stl_cfg: config for STL
Integer for the STL period: 7 for daily data, and 168 for hourly data
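The two frequency-dependent lookups in this module (the grouping column and the STL period) can be pictured as simple mappings. The frequency labels "D" and "H" and both helper names below are assumptions; the values 7 and 168 and the column names "weekday" and "how" come from the descriptions above.

```python
# Assumed frequency labels: "D" for daily data, "H" for hourly data
_PERIODS = {"D": 7, "H": 168}            # 7 days per week, 168 hours per week
_GROUP_COLS = {"D": "weekday", "H": "how"}


def stl_period(freq: str) -> int:
    """Hypothetical helper: length of one weekly cycle in observations."""
    if freq not in _PERIODS:
        raise ValueError(f"Unsupported frequency: {freq!r}")
    return _PERIODS[freq]


def stl_grouping_column(freq: str) -> str:
    """Hypothetical helper: column that identifies the position in the week."""
    if freq not in _GROUP_COLS:
        raise ValueError(f"Unsupported frequency: {freq!r}")
    return _GROUP_COLS[freq]


daily_period = stl_period("D")
hourly_column = stl_grouping_column("H")
```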
- mobts.imputation.stl._initial_interpolate_for_stl(df: DataFrame, cols: ColumnsConfig, out_cfg: OutputConfig) DataFrame
Applies the preliminary interpolation necessary for STL
df: full dataset
cols: columns config
out_cfg: config for output columns’ names
DataFrame with interpolated time-series
interpolation. This allows us to preserve the trend for the missing periods.
- mobts.imputation.stl._interpolate_linear(s: Series) Series
Basic linear interpolation
s: time series corresponding to a single counter
the interpolated time series
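A basic linear interpolation of a single counter's series can be done with `Series.interpolate`. The sketch below uses toy data and does not claim to reproduce the options the package passes internally.

```python
import numpy as np
import pandas as pd

s = pd.Series(
    [10.0, np.nan, np.nan, 16.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# method="linear" treats observations as equally spaced and draws a
# straight line between the known values on either side of a hole
interpolated = s.interpolate(method="linear")
```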
- mobts.imputation.stl._rolling_median_week_window(series: Series, freq: str, stl_cfg: STLConfig = STLConfig(rolling_median_window=2, rolling_median_min_valid=1)) Series
Calculates a rolling median of a time series
series: time series corresponding to one single counter
freq: temporal frequency of the project
stl_cfg: config for STL
time series of rolling medians for the input time series
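A rolling median with a minimum number of valid observations can be sketched with pandas as follows. The helper `rolling_median` takes the window in observations; how the package converts `rolling_median_window` and the frequency into a window length is not stated above, so that conversion is deliberately left out.

```python
import numpy as np
import pandas as pd


def rolling_median(series, window_steps, min_valid):
    """Hypothetical helper: trailing rolling median that tolerates holes."""
    # min_periods is the number of non-missing values a window needs
    # before a median is produced for it
    return series.rolling(window=window_steps, min_periods=min_valid).median()


series = pd.Series([1.0, 3.0, np.nan, 7.0, 9.0])
medians = rolling_median(series, window_steps=2, min_valid=1)
```

With `min_valid=1`, a window that contains one observed value and one hole still yields a median, which keeps the result defined around short gaps.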
- mobts.imputation.stl._stl_on_counter_hole_adjusted(g: DataFrame, freq: str, cols: ColumnsConfig, stl_cfg: STLConfig, out_cfg: OutputConfig) DataFrame
Applies STL on one counter
g: DataFrame for a single counter
freq: temporal frequency of the project
cols: columns config
stl_cfg: config for STL
out_cfg: config for output columns’ names
DataFrame with imputed missing values for one counter, using STL
- mobts.imputation.stl.impute_stl(df: DataFrame, cols: ColumnsConfig = ColumnsConfig(counter='counter', timestamp='timestamp', count='count', weekday='weekday', week_num='week_num', how='how', hour='hour', date='date'), stl_cfg: STLConfig = STLConfig(rolling_median_window=2, rolling_median_min_valid=1), out_cfg: OutputConfig = OutputConfig(col_sm_imputed='count_sm_imputed', col_reg_imputed='count_reg_imputed', col_final='count_imputed', col_method_used='imputation_method', stl_method='STL', sm_method='M7', reg_method='M8')) DataFrame
Applies STL on all counters
df: full dataset
cols: columns config
stl_cfg: config for STL
out_cfg: config for output columns’ names
DataFrame with imputed missing values, using STL