ckanapi_harvesters.harvesters.data_cleaner package

Submodules

ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_abc module

Functions to clean data before upload.

class ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_abc.CkanDataCleanerABC

Bases: ABC

Data cleaner abstract base class.

A table is defined by a list of fields, each with a data type. Each row can specify a value for all or only some of the fields. When a value is nested (a dictionary or a list), the functions iterate recursively over its elements. These elements are called sub-values.
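The recursive traversal described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the function name `clean_nested`, the callback signature, and the path notation (`a.b` for dictionary keys, `a[0]` for list items) are all assumptions made for the example.

```python
from typing import Any, Callable


def clean_nested(value: Any, clean_leaf: Callable[[Any, str, int], Any],
                 path: str = "", level: int = 0) -> Any:
    """Apply clean_leaf to every sub-value of a nested cell (hypothetical helper)."""
    if isinstance(value, dict):
        # Descend into each dictionary entry, extending the path with the key.
        return {key: clean_nested(sub, clean_leaf,
                                  f"{path}.{key}" if path else str(key), level + 1)
                for key, sub in value.items()}
    if isinstance(value, list):
        # Descend into each list item, extending the path with the index.
        return [clean_nested(sub, clean_leaf, f"{path}[{i}]", level + 1)
                for i, sub in enumerate(value)]
    # Leaf reached: this is a sub-value (or the cell itself at level 0).
    return clean_leaf(value, path, level)


visited = []


def strip_strings(subvalue: Any, path: str, level: int) -> Any:
    """Example leaf cleaner: trims whitespace and records where it was called."""
    visited.append((path, level))
    return subvalue.strip() if isinstance(subvalue, str) else subvalue


cell = {"name": " Paris ", "tags": [" a", "b "], "pos": {"lat": 48.85}}
cleaned = clean_nested(cell, strip_strings)
```

The `path` and `level` arguments mirror the parameters of `_clean_subvalue`, but the exact path format used by the package is not documented here.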

_add_field_from_path(path: str, data_type: str, new_field_name: str = None, suggest_index: bool = True, notes: str = None) None

Auxiliary method to define a new column from a nested object.

abstractmethod _clean_final_steps(records: List[dict] | DataFrame, fields: OrderedDict[str, CkanField] | None, known_fields: OrderedDict[str, CkanField] | None) List[dict] | DataFrame

Method called at the end of clean_records.

abstractmethod _clean_subvalue(subvalue: Any, field: CkanField, path: str, level: int, *, field_data_type: str) Any

Cleaning of a subvalue. A subvalue is a value within a nested cell.

_detect_non_standard_field(field_name: str, values: Any | Series) CkanField

Auxiliary function of create_new_field that detects the field type when the default criteria do not match any specific case.

_detect_standard_field_bypass(field_name: str, values: Any | Series) CkanField | None

Auxiliary function of create_new_field that detects the field type in order to bypass the default criteria.

_extra_checks(records: List[dict] | DataFrame, fields: OrderedDict[str, CkanField] | None) None

Method called at the end of _clean_final_steps.

_replace_non_standard_subvalue(subvalue: Any, field: CkanField, path: str, level: int, *, field_data_type: str) Any

Auxiliary function of _clean_subvalue that performs type casts and checks when none of the default criteria are met.

_replace_non_standard_value(value: Any, field: CkanField, *, field_data_type: str) Any

Auxiliary function of clean_value_field that performs type casts and checks when none of the default criteria are met.

_replace_standard_subvalue_bypass(subvalue: Any, field: CkanField, path: str, level: int, *, field_data_type: str) Tuple[Any, bool]

Auxiliary function of _clean_subvalue that performs type casts and checks in order to bypass the default criteria.

_replace_standard_value_bypass(value: Any, field: CkanField, *, field_data_type: str) Tuple[Any, bool]

Auxiliary function of clean_value_field that performs type casts and checks in order to bypass the default criteria.
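The relationship between the bypass hooks, the default criteria, and the non-standard fallbacks can be sketched as below. This is a simplified stand-in, not the package's code: the class `HookOrderSketch`, the method `clean_value`, and the default rules are invented for illustration, the `field: CkanField` argument is omitted, and reading the returned bool as "the bypass handled this value" is an assumption drawn only from the `Tuple[Any, bool]` return type.

```python
from typing import Any, Tuple


class HookOrderSketch:
    """Stand-in class; not part of ckanapi_harvesters."""

    def _replace_standard_value_bypass(self, value: Any, *,
                                       field_data_type: str) -> Tuple[Any, bool]:
        # Assumption: the bool tells the caller whether the bypass applied.
        if field_data_type == "text" and value == "N/A":
            return None, True
        return value, False

    def _replace_non_standard_value(self, value: Any, *,
                                    field_data_type: str) -> Any:
        # Fallback used only when no default criterion matched.
        return str(value)

    def clean_value(self, value: Any, field_data_type: str) -> Any:
        # 1. Give the bypass hook the first chance to handle the value.
        new_value, bypassed = self._replace_standard_value_bypass(
            value, field_data_type=field_data_type)
        if bypassed:
            return new_value
        # 2. Default criteria (illustrative rules only).
        if field_data_type == "numeric" and isinstance(value, str):
            return float(value)
        if field_data_type == "text" and isinstance(value, str):
            return value
        # 3. Nothing matched: defer to the non-standard hook.
        return self._replace_non_standard_value(
            value, field_data_type=field_data_type)


cleaner = HookOrderSketch()
```

A subclass would override only the two hooks, leaving the default criteria in place. The sub-value hooks presumably follow the same order, with the additional `path` and `level` arguments.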

apply_new_fields_request(ckan, resource_id: str)

This method patches the fields if a new field was detected. Call it before upsert.

abstractmethod clean_records(records: List[dict] | DataFrame, known_fields: OrderedDict[str, CkanField] | None, *, inplace: bool = False) List[dict] | DataFrame

Main function to clean a list of records.

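Parameters:
  • records (List[dict] | DataFrame)

  • known_fields (OrderedDict[str, CkanField] | None)

  • inplace (bool, keyword-only, default False)

Returns: List[dict] | DataFrame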

abstractmethod clean_value_field(value: Any, field: CkanField) Any

Cleaning of a value. A value is directly the value of a cell.

clear_all_outputs()

Clears all outputs. The cleaner is stateful: some values must not be cleared for each DataFrame upload and are cleared only by this method.

clear_outputs_new_dataframe()

abstractmethod copy(dest=None)

abstractmethod create_new_field(field_name: str, values: Any | Series) CkanField

This method adds a new field definition.

abstractmethod detect_field_types_and_subs(records: List[dict] | DataFrame, known_fields: OrderedDict[str, CkanField] = None) OrderedDict[str, str]

This function detects the initial fields and the field renamings that are needed.

abstractmethod static get_class_keyword() str

Returns the name of the class, according to data_cleaner_dict defined in data_cleaner_init.py. This name is used to set up the data cleaner for a resource builder.

merge_field_changes(fields: List[dict] = None) List[dict]

This method merges the fields argument of a datastore_create with the fields detected by the data cleaner. Fields already defined in the fields argument are not overwritten.
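The merge rule stated above (explicit definitions win over detected ones) can be sketched like this. It is a minimal illustration under assumptions: the function `merge_fields` is hypothetical, the detected fields are passed in explicitly rather than read from the cleaner's state, and fields are represented as plain dictionaries keyed by `id`, as in the `fields` argument of CKAN's datastore_create.

```python
from typing import Dict, List, Optional


def merge_fields(fields: Optional[List[dict]],
                 detected: Dict[str, dict]) -> List[dict]:
    """Merge detected fields into an explicit field list (hypothetical helper)."""
    # Copy the explicit definitions so the caller's list is left untouched.
    merged = [dict(field) for field in (fields or [])]
    known_ids = {field["id"] for field in merged}
    for field_id, field in detected.items():
        # An explicit definition takes precedence: never overwrite it.
        if field_id not in known_ids:
            merged.append(dict(field))
    return merged


explicit = [{"id": "temperature", "type": "numeric"}]
detected = {
    "temperature": {"id": "temperature", "type": "text"},
    "station": {"id": "station", "type": "text"},
}
merged = merge_fields(explicit, detected)
```

Here `temperature` keeps its explicit `numeric` type, and only `station` is taken from the detected fields.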

class ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_abc.DataCleanerNone

Bases: CkanDataCleanerABC

Implementation which does nothing. Placeholder that states explicitly that no data cleaner must be used.

clean_records(records: List[dict] | DataFrame, known_fields: OrderedDict[str, CkanField] | None, *, inplace: bool = False) List[dict] | DataFrame

Main function to clean a list of records.

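Parameters:
  • records (List[dict] | DataFrame)

  • known_fields (OrderedDict[str, CkanField] | None)

  • inplace (bool, keyword-only, default False)

Returns: List[dict] | DataFrame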

clean_value_field(value: Any, field: CkanField) Any

Cleaning of a value. A value is directly the value of a cell.

copy(dest=None) DataCleanerNone

create_new_field(field_name: str, values: Any | Series) CkanField

This method adds a new field definition.

detect_field_types_and_subs(records: List[dict] | DataFrame, known_fields: OrderedDict[str, CkanField] = None) OrderedDict[str, str]

This function detects the initial fields and the field renamings that are needed.

static get_class_keyword() str

Returns the name of the class, according to data_cleaner_dict defined in data_cleaner_init.py. This name is used to set up the data cleaner for a resource builder.

ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_errors module

Exception classes for the data cleaner.

exception ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_errors.CleanError

Bases: Exception

exception ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_errors.CleanerRequirementError(requirement: str, data_type: str)

Bases: RequirementError

exception ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_errors.FormatError(data: str, data_type: str)

Bases: Exception

exception ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_errors.UnexpectedGeometryError(found_type: str, expected_type: str)

Bases: Exception

ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_init module

File format keyword selection

ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_init.init_data_cleaner(data_cleaner_string: str | None) CkanDataCleanerABC | None
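The keyword-based selection that get_class_keyword and data_cleaner_dict suggest can be sketched as a small registry. Everything here is a stand-in: the classes, the keywords `basic-sketch` and `geom-sketch`, and the function `init_cleaner_sketch` are invented for illustration, and the real keywords are not listed in this reference. Mapping `None` to `None` is an assumption based only on the `str | None` and `CkanDataCleanerABC | None` annotations.

```python
from typing import Dict, Optional, Type


class CleanerSketch:
    """Stand-in base class; not part of ckanapi_harvesters."""

    @staticmethod
    def get_class_keyword() -> str:
        raise NotImplementedError


class BasicSketch(CleanerSketch):
    @staticmethod
    def get_class_keyword() -> str:
        return "basic-sketch"  # hypothetical keyword


class GeomSketch(BasicSketch):
    @staticmethod
    def get_class_keyword() -> str:
        return "geom-sketch"  # hypothetical keyword


# Registry built from the keyword each class reports about itself.
cleaner_registry: Dict[str, Type[CleanerSketch]] = {
    cls.get_class_keyword(): cls for cls in (BasicSketch, GeomSketch)
}


def init_cleaner_sketch(keyword: Optional[str]) -> Optional[CleanerSketch]:
    """Instantiate the cleaner registered under keyword, or return None."""
    if keyword is None:
        return None
    # An unknown keyword raises KeyError in this sketch.
    return cleaner_registry[keyword]()


geom_cleaner = init_cleaner_sketch("geom-sketch")
```

Because each subclass overrides get_class_keyword, adding a new cleaner to such a registry only requires listing the class once.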

ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_upload module

Alias

ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_upload_1_basic module

Functions to clean data before upload.

class ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_upload_1_basic.CkanDataCleanerUploadBasic

Bases: CkanDataCleanerABC

Data cleaner for basic data types.

clean_records(records: List[dict] | DataFrame, known_fields: OrderedDict[str, CkanField] | OrderedDict[str, dict] | List[dict | CkanField] | None, *, inplace: bool = False) List[dict] | DataFrame

Main function to clean a list of records.

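Parameters:
  • records (List[dict] | DataFrame)

  • known_fields (OrderedDict[str, CkanField] | OrderedDict[str, dict] | List[dict | CkanField] | None)

  • inplace (bool, keyword-only, default False)

Returns: List[dict] | DataFrame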

clean_value_field(value: Any, field: CkanField) Any

Cleaning of a value. A value is directly the value of a cell.

copy(dest=None) CkanDataCleanerUploadBasic

create_new_field(field_name: str, values: Any | Series) CkanField

This method adds a new field definition.

detect_field_types_and_subs(records: List[dict] | DataFrame, known_fields: OrderedDict[str, CkanField] = None) OrderedDict[str, CkanField]

This function detects the initial fields and the field renamings that are needed.

static get_class_keyword() str

Returns the name of the class, according to data_cleaner_dict defined in data_cleaner_init.py. This name is used to set up the data cleaner for a resource builder.

class ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_upload_1_basic.CkanDataCleanerUploadDigitalColumns

Bases: CkanDataCleanerUploadBasic

static get_class_keyword() str

Returns the name of the class, according to data_cleaner_dict defined in data_cleaner_init.py. This name is used to set up the data cleaner for a resource builder.

ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_upload_1_basic._pd_series_type_instance_detect(values: Series, test_type: Type)

This function checks that every row of a pandas Series which is not NaN/None/NA matches test_type.
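The check described above can be approximated without pandas as follows. This is a dependency-free stand-in, not the package's function: the names `is_missing` and `all_present_values_match` are hypothetical, the input is any iterable rather than a Series, and only `None` and float NaN are treated as missing (pandas' `NA` is not handled).

```python
import math
from typing import Any, Iterable, Type


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing (simplification of NaN/None/NA)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def all_present_values_match(values: Iterable[Any], test_type: Type) -> bool:
    """Return True if every non-missing value is an instance of test_type."""
    return all(isinstance(v, test_type) for v in values if not is_missing(v))


ints_with_gaps = all_present_values_match([1, None, 3, float("nan")], int)
mixed = all_present_values_match([1, "2", None], int)
```

Note that a column containing only missing values passes for any type in this sketch. Whether the package's function behaves the same way is not stated in this reference.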

ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_upload_1_basic.default_cleaner() CkanDataCleanerABC

ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_upload_2_geom module

Adding support for geometries

class ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_upload_2_geom.CkanDataCleanerUploadGeom

Bases: CkanDataCleanerUploadBasic

static get_class_keyword() str

Returns the name of the class, according to data_cleaner_dict defined in data_cleaner_init.py. This name is used to set up the data cleaner for a resource builder.

ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_upload_2_geom.has_invalid_coordinates(shape: None) Tuple[bool, bool]

ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_upload_2_geom.shapely_geometry_from_value(value: Any) None

ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_upload_3_assist module

Functions to clean data before upload.

class ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_upload_3_assist.CkanDataCleanerUploadAssist

Bases: CkanDataCleanerUploadGeom

Implementation which raises an exception if a data change is recommended by the data cleaner and assists in field typing.

clean_value_field(value: Any, field: CkanField) Any

Cleaning of a value. A value is directly the value of a cell.

static get_class_keyword() str

Returns the name of the class, according to data_cleaner_dict defined in data_cleaner_init.py. This name is used to set up the data cleaner for a resource builder.

ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_upload_3_check module

Functions to clean data before upload.

class ckanapi_harvesters.harvesters.data_cleaner.data_cleaner_upload_3_check.CkanDataCleanerUploadCheckOnly

Bases: CkanDataCleanerUploadGeom

Implementation which raises an exception if a data change is recommended by the data cleaner.
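A check-only policy of this kind can be sketched as a wrapper around any cleaning function. This is an illustration under assumptions: `check_only` and `CleanErrorSketch` are hypothetical names, and this reference does not say which exception the real class raises or how it decides that a change is recommended.

```python
from typing import Any, Callable


class CleanErrorSketch(Exception):
    """Stand-in exception; the exception raised by the package may differ."""


def check_only(value: Any, clean: Callable[[Any], Any]) -> Any:
    """Return value unchanged, or raise if cleaning would modify it."""
    cleaned = clean(value)
    # A change of type counts as a modification even if the values compare equal.
    if type(cleaned) is not type(value) or cleaned != value:
        raise CleanErrorSketch(
            f"value {value!r} would be changed to {cleaned!r}")
    return value


unchanged = check_only("abc", str.strip)
```

With this approach the data is never altered silently: the upload either proceeds with the original values or stops at the first value that would need a correction.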

clean_value_field(value: Any, field: CkanField) Any

Cleaning of a value. A value is directly the value of a cell.

static get_class_keyword() str

Returns the name of the class, according to data_cleaner_dict defined in data_cleaner_init.py. This name is used to set up the data cleaner for a resource builder.

Module contents

Section of the package dedicated to the conversion of records to a CKAN-compatible format. This is linked to the data harvesters.