ckanapi_harvesters.builder package

Subpackages

Submodules

ckanapi_harvesters.builder.builder_aux module

Auxiliary functions

ckanapi_harvesters.builder.builder_aux.positive_end_index(end_index: int | None, total: int) int

Return the stop index for a loop, following the Python slice convention (the last index processed is end_index - 1). If end_index is negative, it is counted from the end of the sequence: end_index = -1 stops just before the last element.
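The slice convention described above can be sketched in a few lines. This is an illustrative stand-in, not the library's implementation: the function name positive_end_index_sketch is hypothetical, and the handling of None (no limit) and of out-of-range values (clamped, as slices do) are assumptions.

```python
from typing import Optional


def positive_end_index_sketch(end_index: Optional[int], total: int) -> int:
    """Hypothetical stand-in that mimics the stop index of seq[:end_index]."""
    if end_index is None:
        # Assumption: None means "no limit", so the loop runs to the end.
        return total
    if end_index < 0:
        # Negative values count from the end, as in seq[:end_index].
        return max(total + end_index, 0)
    # Assumption: values past the end are clamped, as slices do.
    return min(end_index, total)


items = list("abcde")
stop = positive_end_index_sketch(-1, len(items))
# Looping over range(stop) visits the same elements as items[:-1].
visited = [items[i] for i in range(stop)]
```

The built-in slice(None, end_index).indices(total) returns the same stop value as its second element, which is a convenient cross-check.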

ckanapi_harvesters.builder.builder_ckan module

Code to upload metadata to the CKAN server to create a package or update an existing one. The metadata is defined by the user in an Excel worksheet. This file implements the CKAN connection definition.

class ckanapi_harvesters.builder.builder_ckan.BuilderCkan(url: str = None, apikey_file: str = None, proxy: ProxyConfig = None)

Bases: object

_get_builder_df(base_dir: str) DataFrame

Converts the result of method _to_dict() into a DataFrame

Returns:

_load_from_df(ckan_df: DataFrame, base_dir: str, proxies: dict, error_not_found: bool = True) None

Function to load builder parameters from a DataFrame, usually from an Excel worksheet

Parameters:

ckan_df

Returns:

_to_dict(base_dir: str) dict

Function to export builder parameters to an Excel worksheet, using the same fields as the input format

See:

_load_from_df

See:

to_xls

Returns:

copy() BuilderCkan
from_ckan(ckan: CkanApiManage) None

Initialize fields from a CKAN instance.

init_ckan(base_dir: str, ckan: CkanApiManage = None, default_proxies: dict = None, proxies: str | dict | ProxyConfig = None) CkanApiManage

Initialize a CKAN instance, following the parameters of the Excel workbook. The parameters from Excel have precedence on the values already contained in the CKAN object. However, the Excel workbook might not contain sufficient information.

Parameters:
  • base_dir

  • ckan

  • default_proxies

  • proxies

Returns:

property policy: CkanPackageDataFormatPolicy
property policy_file: str
property proxies: dict
property proxy_string: str
set_policy_file(policy_file: str, *, ckan: CkanApiManage = None, base_dir: str = None, proxies: dict = None, error_not_found: bool = True, load_error: bool = True) None

ckanapi_harvesters.builder.builder_errors module

Data model to represent a CKAN database architecture

exception ckanapi_harvesters.builder.builder_errors.EmptyPackageNameException

Bases: RuntimeError

exception ckanapi_harvesters.builder.builder_errors.GroupByError

Bases: Exception

exception ckanapi_harvesters.builder.builder_errors.MissingDataStoreColumnsSheet(resource_name: str, columns_sheet_name: str)

Bases: Exception

exception ckanapi_harvesters.builder.builder_errors.MissingDataStoreInfoError

Bases: Exception

exception ckanapi_harvesters.builder.builder_errors.RequiredDataFrameFieldsError(missing_fields: Iterable[str])

Bases: Exception

exception ckanapi_harvesters.builder.builder_errors.ResourceFileNotExistMessage(resource_name: str, error_level: ErrorLevel, specific_message: str)

Bases: ContextErrorLevelMessage

exception ckanapi_harvesters.builder.builder_errors.UnsupportedBuilderVersionError(file_version)

Bases: Exception

ckanapi_harvesters.builder.builder_field module

Code to upload metadata to the CKAN server to create a package or update an existing one. The metadata is defined by the user in an Excel worksheet. This file implements the field definition.

class ckanapi_harvesters.builder.builder_field.BuilderField(*, name: str = None, type_override: CkanFieldType = None, description: str = None, label: str = None)

Bases: object

copy(*, dest=None)
static from_df_row(row: Series) BuilderField
update_from_ckan(field_info: CkanField) None
update_missing(other: BuilderField) None

ckanapi_harvesters.builder.builder_package module

Alias to most complete BuilderPackage implementation

ckanapi_harvesters.builder.builder_package_1_basic module

Code to upload metadata to the CKAN server to create a package or update an existing one. The metadata is defined by the user in an Excel worksheet. This file implements the package definition.

class ckanapi_harvesters.builder.builder_package_1_basic.BuilderPackageBasic(package_name: str = None, *, package_id: str = None, title: str = None, description: str = None, private: bool = None, state: CkanState = None, version: str = None, url: str = None, tags: List[str] = None, organization_name: str = None, license_name: str = None, src=None)

Bases: object

Class to store an image of a CKAN package defined by an Excel worksheet

__NB__: There are several paths to distinguish:

  • the path of the Excel worksheet

  • base_dir: the base directory for relative paths

  • resources_base_dir: the base directory for resources (for upload), which is generally defined relative to base_dir

  • out_dir: the output directory, for download, absolute or relative to the cwd (current working directory)
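The relationship between these paths can be sketched with os.path. This is a minimal illustration under assumptions: the helper resolve_dir and the example paths are hypothetical and are not part of the package.

```python
import os


def resolve_dir(path: str, relative_to: str) -> str:
    """Hypothetical helper: keep absolute paths, join relative ones to relative_to."""
    if os.path.isabs(path):
        return os.path.normpath(path)
    return os.path.normpath(os.path.join(relative_to, path))


# Hypothetical layout: the workbook location provides the base_dir.
workbook_path = os.path.join(os.sep, "data", "project", "builder.xlsx")
base_dir = os.path.dirname(workbook_path)

# resources_base_dir is generally defined relative to base_dir.
resources_base_dir = resolve_dir("resources", base_dir)

# out_dir is absolute or relative to the current working directory.
out_dir = resolve_dir("downloads", os.getcwd())
```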

__NB__: A builder can refer to the following external files:

  • CKAN API key file (.txt)

  • Proxy authentication file (.txt)

  • CKAN CA certificate file (.pem)

  • CA certificate for external connections (.pem)

  • Data format policy file (.json)

  • External Python module (.py) containing DataFrame modification functions for upload/download of a DataStore

_apply_out_dir_src(base_dir: str, not_exist_error: bool = False)

The default download directory is specified in a field of the Excel workbook. This function resolves the directory name, based on the location of the Excel file or the base_dir, if provided.

Parameters:

base_dir

Returns:

_apply_resources_base_dir_src(base_dir: str)

The resources base directory is specified in a field of the Excel workbook. This function resolves the directory name, based on the location of the Excel file or the base_dir, if provided.

Parameters:

base_dir

Returns:

_get_builder_df(base_dir: str = None, include_id: bool = True) Tuple[DataFrame, DataFrame]

Converts the result of method _to_dict() into a DataFrame

Returns:

_get_datastores_df() Dict[str, DataFrame]

Calls the method _get_fields_df() on all resources which are DataStores and returns a DataFrame per DataStore listing the fields of the DataStore with their metadata

Returns:

_get_datastores_dict() Dict[str, dict]

Calls the method _get_fields_dict() on all resources which are DataStores and returns a dictionary per DataStore listing the fields of the DataStore with their metadata

Returns:

_get_mono_resource_names()

List resource names of mono-resource builders.

Returns:

_get_mono_resource_used_files(resources_base_dir: str, ckan: CkanApiManage)

List files used by mono-resource builders

Parameters:

resources_base_dir

Returns:

_get_resources_df(include_id: bool = True) DataFrame

Calls the method _to_dict() on all resources and returns the DataFrame listing the resources of the package

Returns:

_load_from_df(info_df: DataFrame, package_df: DataFrame, base_dir: str = None) None

Function to load builder parameters from a DataFrame, usually from an Excel worksheet

Parameters:

package_df

Returns:

_to_dict(base_dir: str = None, include_id: bool = True) Tuple[dict, dict]

Function to export builder parameters to an Excel worksheet, using the same fields as the input format

See:

_load_from_df

See:

to_xls

Returns:

clear_ids()

Clear all known ids from package and resource builders.

clear_secrets_and_disconnect() None
copy(dest=None) BuilderPackageBasic
property default_out_dir: str
default_sample_title_suffix: str = ' - Sample'
default_sample_url_suffix: str = '-sample'
default_to_json_reduced_size: bool = False
download_request_full(ckan: CkanApiManage, out_dir: str = None, enforce_none_out_dir: bool = False, resource_name: str = None, full_download: bool = False, threads: int = None, skip_existing: bool = True, progress_callback: Callable = None, force: bool = False, rm_dir: bool = False) None

Downloads the full package resources into out_dir.

Parameters:
  • ckan

  • out_dir – download directory

  • rm_dir – remove directory if exists before downloading

  • skip_existing – skip download of existing resources

  • enforce_none_out_dir – if no out_dir is provided, True: files will not be saved after download, False: default output dir will be used, if defined

  • resource_name

  • full_download – option to fully download the resources. If False, only a partial download is made.

  • threads

  • progress_callback

  • force – option to bypass the enable_download attribute of resources

Returns:
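The interaction between out_dir and enforce_none_out_dir described above can be read as a three-way choice. The sketch below is an interpretation of that description, not the library's code; choose_out_dir and its default_out_dir argument are hypothetical names.

```python
from typing import Optional


def choose_out_dir(out_dir: Optional[str],
                   default_out_dir: Optional[str],
                   enforce_none: bool = False) -> Optional[str]:
    """Hypothetical helper returning the directory to save to, or None."""
    if out_dir is not None:
        # An explicit out_dir always wins.
        return out_dir
    if enforce_none:
        # No out_dir and enforce_none=True: nothing is saved after download.
        return None
    # Otherwise fall back to the default output directory, if defined.
    return default_out_dir


explicit = choose_out_dir("exports", "downloads")
fallback = choose_out_dir(None, "downloads")
not_saved = choose_out_dir(None, "downloads", enforce_none=True)
```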

download_resource(ckan: CkanApiManage, resource_name: str, full_download: bool = False, **kwargs) bytes

Proxy for download_sample for a resource

download_resource_df(ckan: CkanApiManage, resource_name: str, search_all: bool = False, **kwargs) DataFrame

Proxy for download_sample_df for a DataStore

download_sample(ckan: CkanApiManage, resource_name: str = None, *, datastores_as_df: bool = True, download_url_resources: bool = False, include_files: bool = True, empty_files: bool = False, search_all: bool = False, **kwargs) Dict[str, bytes | DataFrame]

Download samples for all resources. Resources which are not DataStores are downloaded entirely as bytes.

Parameters:
  • ckan

  • resource_name – option to restrict to a single resource

  • datastores_as_df – Download DataStores as DataFrames (do not convert to bytes)

  • download_url_resources – Option to download resources pointing to an external URL.

  • include_files – Option to include resources which are files as bytes.

  • empty_files – Option to force file contents to an empty file.

  • search_all – Option to search all resources before downloading (only applies to DataStores).

  • kwargs – applies to download_sample_df

Returns:

a dictionary with a sample for each resource

download_sample_df(ckan: CkanApiManage, resource_name: str = None, *, search_all: bool = False, **kwargs) Dict[str, DataFrame]

Download a sample DataFrame for the DataStore type resources.

Parameters:
  • ckan

  • resource_name

Returns:

static from_ckan(ckan: CkanApiMap, package_info: CkanPackageInfo | str, *, base_dir: str = None, error_duplicates: bool = True) BuilderPackageBasic

Function to initialize a BuilderPackageBasic from information requested by the CKAN API

Parameters:
  • ckan

  • package_info – The package to import or the package name

Returns:

static from_dict(d: dict, base_dir: str = None, *, proxies: dict = None) BuilderPackageBasic

Load package definition from a dictionary. In this case, the base directory used to specify the resources locations must be given manually. This is usually the directory of the file where the dictionary comes from.

Parameters:
  • d

  • base_dir

  • proxies

Returns:

static from_excel(path_or_stream, *, proxies: dict = None, engine: str = None, **kwargs) BuilderPackageBasic

Load package definition from an Excel workbook.

Parameters:
  • path_or_stream – path to the Excel workbook

  • engine – Engine used by pandas.read_excel(). Supported engines: xlrd, openpyxl, odf, pyxlsb, calamine. openpyxl is part of this package’s optional requirements.

static from_json(json_file, *, proxies: dict = None) BuilderPackageBasic
static from_jsons(stream: str, *, source_file: str = None, proxies: dict = None) BuilderPackageBasic
get_all_df(base_dir: str = None, include_id: bool = True) Dict[str, DataFrame]

Returns all the dataframes used to define the object and components

Returns:

get_base_dir(base_dir: str = None) str

Returns the default base_dir if not specified. The base_dir is the location of the Excel workbook. If this was initialized from a dictionary, the current working directory will be used (cwd).

Returns:

get_default_out_dir(out_dir: str, enforce_none: bool = False) str

This returns the default download directory.

Parameters:

out_dir

Returns:

get_license_id(ckan: CkanApiMap) str

Returns the license for the package. The license can be specified by its title or id

Parameters:

ckan

Returns:

get_license_info(ckan: CkanApiMap) CkanLicenseInfo
get_license_name(ckan: CkanApiMap) str
get_or_query_package_id(ckan: CkanApiManage) str
get_or_query_resource_id(ckan: CkanApiManage, resource_name: str, error_not_found: bool = True) str
get_owner_org(ckan: CkanApiMap) str

Returns the owner organization for the package. The owner organization can be specified by its name, title or id

Parameters:

ckan

Returns:
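Resolving an organization that may be given by name, title or id can be sketched as a lookup over the known organizations. This is only an illustration: the record layout, the matching order (id, then name, then title) and the function resolve_organization_id are assumptions, not the package's behavior.

```python
from typing import Dict, List


def resolve_organization_id(reference: str, organizations: List[Dict[str, str]]) -> str:
    """Hypothetical lookup: match reference against id, then name, then title."""
    for field in ("id", "name", "title"):
        for org in organizations:
            if org.get(field) == reference:
                return org["id"]
    raise KeyError(f"organization not found: {reference!r}")


# Made-up records standing in for what a CKAN map might hold.
known_organizations = [
    {"id": "org-1", "name": "env-agency", "title": "Environment Agency"},
    {"id": "org-2", "name": "transport", "title": "Transport Office"},
]

by_name = resolve_organization_id("env-agency", known_organizations)
by_title = resolve_organization_id("Transport Office", known_organizations)
by_id = resolve_organization_id("org-2", known_organizations)
```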

get_package_page_url(ckan: CkanApiManage, *, error_not_found: bool = True, default_url: bool = False) str
get_resources_base_dir(resources_base_dir: str) str

This returns the base directory for the resource files. It is distinct from the base_dir and can be defined relative to the base_dir in the Excel workbook (see comment at the top of the class).

Parameters:

resources_base_dir

Returns:

info_request_full(ckan: CkanApiManage) Tuple[CkanPackageInfo, List[CkanResourceInfo]]
info_request_package(ckan: CkanApiManage) CkanPackageInfo
init_ckan(ckan: CkanApiManage = None, *, base_dir: str = None, set_owner_org: bool = False, default_proxies: dict = None, proxies: str | dict | ProxyConfig = None) CkanApiManage

Initialize the CKAN instance from the parameters defined in the “ckan” tab of the Excel workbook.

Parameters:
  • ckan

  • base_dir

  • default_proxies

  • set_owner_org – Option to set the owner_org of the CKAN instance. This can be problematic because it requires some requests at a point where the proxies are not set. It can be omitted because it has no influence on the patch_request_package function.

init_resources_options_and_metadata(ckan: CkanApiManage, *, base_dir: str = None) None

Update the CKAN options in resource_builders. Call before any operation on resources.

list_resource_ids(ckan: CkanApiManage) List[str]

List resource ids on CKAN server, following the order of the package builder

Parameters:

ckan

Returns:

local_policy_check(policy: CkanPackageDataFormatPolicy = None, *, buffer: Dict[str, List[DataPolicyError]] = None, raise_error: bool = False, verbose: bool = True) bool

Check if the package builder respects a data format policy (only on local definition).

Returns:

map_resources(ckan: CkanApiMap, *, error_not_found: bool = True, cancel_if_exists: bool = True, datastore_info: bool = True) CkanPackageInfo | None

Proxy call to ckan.map_resources. Returns the package information from CKAN.

Parameters:
  • ckan

  • error_not_found

  • cancel_if_exists

Returns:

package_delete_resources(ckan: CkanApiManage, *, bypass_admin: bool = False)
property package_name: str
package_resource_reorder(ckan: CkanApiManage) None

Apply the order of the resources defined in the Excel workbook.

Parameters:

ckan

Returns:

patch_request_final(ckan: CkanApiManage)
patch_request_full(ckan: CkanApiManage, *, reupload: bool = False, override_ckan: bool = False, resources_base_dir: str = None, create_default_view: bool = True, clear_all_resources: bool = False, progress_callback: CkanProgressCallbackABC | Callable = None, sample_df_dict: Dict[str, bytes | DataFrame] = None, inhibit_datastore_patch_indexes: bool = False) Tuple[CkanPackageInfo, Dict[str, CkanResourceInfo]]

Perform necessary requests to initiate/reupload the package and resources metadata on the CKAN server. For folder resources, this only uploads the first file of the resource.

Parameters:
  • ckan

  • reupload – Reupload files, even if present on CKAN server. For DataStores, this resets the DataStores to an initial state.

  • override_ckan – Option to ignore metadata from CKAN server. Only metadata from Excel or data sources will be applied.

  • resources_base_dir – Override for resources directory. Location specified in Excel sheet is used by default.

  • progress_callback – Specific progress bar

  • create_default_view – Option to create default view for each resource.

  • clear_all_resources – Option to clear all resources in package before uploading.

  • sample_df_dict – default DataFrames/bytes for each resource

  • inhibit_datastore_patch_indexes – Option to ignore primary_key and indexes for DataStores if they already exist. In certain cases, running without this option can lead to impossible updates (recomputing indexes on large tables can be costly).

Returns:

patch_request_package(ckan: CkanApiManage) CkanPackageInfo

Function to perform all the necessary requests to initiate/reupload the package on the CKAN server. This function does not upload the package resources.

Note

The organization must be provided, especially if the package is private

Parameters:

ckan

Returns:

remote_policy_check(ckan: CkanApiManage, policy: CkanPackageDataFormatPolicy = None, *, buffer: Dict[str, List[DataPolicyError]] = None, raise_error: bool = False, verbose: bool = None) bool

Check the package defined by this builder against a data format policy, based on the information from the API.

Parameters:
  • ckan

  • policy

  • buffer

  • raise_error

  • verbose

Returns:

property resources_base_dir: str
set_default_out_dir(value: str, base_dir: str = None)
set_resources_base_dir(value: str, base_dir: str = None)
static setup_auto_draft_state(mode_auto: bool = None, *, draft_state_by_default: bool = None) None

By default, packages are created in Draft state. This function disables this feature. Call before instantiating any package builder (BuilderPackage).

Parameters:
  • mode_auto – Set to True/False to configure at once both the package state during upload and the default package state, which is applied at the end of the upload if not specified by the user in the Excel workbook.

  • draft_state_by_default – specific setting for the default package state (applied at the end of the upload).

setup_sample_package(ckan: CkanApiManage, package_name: str = None, *, sample_url_suffix: str = None, sample_title_suffix: str = None, sample_df_dict: Dict[str, bytes | DataFrame] = None, return_sample: bool = False, **kwargs) BuilderPackageBasic | Tuple[BuilderPackageBasic, Dict[str, bytes | DataFrame]]

Returns a package builder configured to represent a sample of the current package builder. Limitation: the current package builder must be created from CKAN.

Parameters:
  • ckan

  • package_name – If specified, derives the package metadata from the specified package name. By default, the current package builder will be used.

  • sample_url_suffix – Suffix to add to the package_name (default is “-sample”)

  • sample_title_suffix – Suffix to add to the package title (default is “ - Sample”)

  • sample_df_dict – Option to transmit the data of each resource to the output of function.

  • return_sample – Option to return the data of each resource.

  • kwargs – Optional arguments to pass to the download_sample function.

Returns:

a package builder configured to represent a sample of the current package builder. Optionally, the dictionary of resources to transmit
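How the two suffixes apply can be sketched as follows. The default values are taken from the class attributes default_sample_url_suffix and default_sample_title_suffix listed above; the helper sample_names and the rule that None selects the default are assumptions made for illustration.

```python
from typing import Optional, Tuple

# Defaults as documented in the class attributes.
DEFAULT_SAMPLE_URL_SUFFIX = "-sample"
DEFAULT_SAMPLE_TITLE_SUFFIX = " - Sample"


def sample_names(package_name: str, title: str,
                 sample_url_suffix: Optional[str] = None,
                 sample_title_suffix: Optional[str] = None) -> Tuple[str, str]:
    """Hypothetical helper deriving the sample package name and title."""
    # Assumption: None means "use the documented default suffix".
    url_suffix = DEFAULT_SAMPLE_URL_SUFFIX if sample_url_suffix is None else sample_url_suffix
    title_suffix = DEFAULT_SAMPLE_TITLE_SUFFIX if sample_title_suffix is None else sample_title_suffix
    return package_name + url_suffix, title + title_suffix


name, title = sample_names("air-quality", "Air Quality")
custom_name, custom_title = sample_names("air-quality", "Air Quality",
                                         sample_url_suffix="-demo",
                                         sample_title_suffix=" (demo)")
```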

to_ckan_package_info(*, check_id: bool = True) CkanPackageInfo

Function to insert the information coming from the builder into the CKAN map. Requires the IDs of the package and resources to be known. This makes it possible to use the stored IDs instead of querying the CKAN API for them.

Returns:

to_dict(base_dir: str = None, include_id: bool = True, separate_field_builders: bool = False) dict

Call this function to export the builder parameters to a dictionary

Returns:

to_excel(path_or_buffer, *, engine: str = None, include_id: bool = True, include_help: bool = True, **kwargs) None

Call this function to export the builder parameters to an Excel worksheet

Parameters:
  • path_or_buffer

  • engine

Returns:

to_json(json_file: str, *, include_id: bool = True, reduced_size: bool = None) None
to_jsons(*, base_dir: str = None, include_id: bool = True, reduced_size: bool = None) str
static unlock_external_code_execution(value: bool = True)

This function enables external code execution for the PythonUserCode class. It is necessary to load builders which specify an Auxiliary functions file.

__Warning__: only run code if you trust the source!

Returns:

static unlock_external_url_resource_download(value: bool = True)

This function enables the download of resources external from the CKAN server.

static unlock_no_ca(value: bool = True)

This function enables you to disable the CA verification of the CKAN server.

__Warning__: Only allow in a local environment!

update_ckan_map(ckan: CkanApiMap, *, warn_msg: bool = True) CkanPackageInfo

This function updates the CKAN map from the information contained in this builder. For this to work, the package and resource ids must be known. This is not the case if the package was not initialized. Use it if the builder was initialized from CKAN; otherwise, use with caution.

Warning

This function bypasses the ids which should normally be obtained through the API. Use at your own risk.

Parameters:

ckan

Returns:

update_from_ckan(ckan: CkanApiMap, *, error_not_found: bool = True) None

Update IDs from CKAN mapped objects. Objects must be mapped first.

update_package_name_in_resources()

Update the package_name attribute in resource_builders. Call before any operation on resources. This function is marked as deprecated. It double-checks the reciprocal link between the package and its resources.

upload_file_checks(resource_name: str | List[str] = None, *, resources_base_dir: str = None, messages: Dict[str, ContextErrorLevelMessage] = None, verbose: bool = True, raise_error: bool = False, ckan: CkanApiManage = None, **kwargs) bool

Method to check the presence of all needed files before uploading or patching resources.

Parameters:
  • resources_base_dir

  • ckan – Optional CkanApi object used to parameterize the requests to test the presence of resources defined by an url.

  • kwargs – keyword arguments to specify connection parameters for querying the urls.

Returns:
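A pre-upload presence check of this kind can be sketched with the standard library. The sketch only covers local files (not URLs), and the function missing_files, its arguments and its message format are hypothetical rather than the package's API.

```python
import os
import tempfile
from typing import Dict


def missing_files(resources_base_dir: str, file_names: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical check: map resource name -> message for each absent file."""
    messages: Dict[str, str] = {}
    for resource_name, file_name in file_names.items():
        path = os.path.join(resources_base_dir, file_name)
        if not os.path.isfile(path):
            messages[resource_name] = f"file not found: {path}"
    return messages


with tempfile.TemporaryDirectory() as tmp:
    # Create one of the two files expected by the made-up resource list.
    with open(os.path.join(tmp, "present.csv"), "w") as f:
        f.write("a,b\n1,2\n")
    report = missing_files(tmp, {"ok": "present.csv", "bad": "absent.csv"})
    all_present = not report  # True only when no message was collected
```

Collecting one message per resource, instead of raising at the first missing file, lets the caller report every problem before any upload starts.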

upload_large_datasets(ckan: CkanApiManage, *, resources_base_dir: str = None, threads: int = None, progress_callback: CkanProgressCallbackABC | Callable = None, only_missing: bool = False, from_line_count: bool = False, allow_chunks: bool = True, inhibit_datastore_patch_indexes: bool = False) None

Method to upload the large datasets of the package. Call it after patch_request_full has been called at least once to initiate the resources: that call uploads the first part of each DataStore, and this method sends the remaining lines. If a primary key was defined, the lines are upserted, so the method can be called multiple times, even if the transfer was interrupted. Otherwise, the lines are inserted: unless the resource is reset with the option reupload=True, a second call to upload_large_datasets could lead to duplicate lines.

See:

patch_request_full

Parameters:
  • ckan

  • resources_base_dir

  • threads

  • progress_callback

  • only_missing – upsert only missing rows for DataStores and only missing files for MultiFile

  • from_line_count – count the lines on the CKAN DataStore and ignore the first n lines of your data source

  • allow_chunks – read DataStore files by chunks, when available

  • inhibit_datastore_patch_indexes – Option to ignore primary_key and indexes for DataStores if they already exist. In certain cases, running without this option can lead to impossible updates (recomputing indexes on large tables can be costly).

Returns:
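The difference between upserting (primary key defined) and inserting (no primary key) explains why a repeated call is safe in one case and produces duplicates in the other. The sketch below models a DataStore as an in-memory list; push_rows is a hypothetical helper and does not call CKAN.

```python
from typing import Dict, List, Optional


def push_rows(table: List[Dict], rows: List[Dict], primary_key: Optional[str]) -> None:
    """Hypothetical model: upsert when a primary key exists, insert otherwise."""
    if primary_key is None:
        # Plain insert: every call appends the rows again.
        table.extend(dict(row) for row in rows)
        return
    # Upsert: replace the row with the same key, or append a new one.
    index = {row[primary_key]: i for i, row in enumerate(table)}
    for row in rows:
        key = row[primary_key]
        if key in index:
            table[index[key]] = dict(row)
        else:
            index[key] = len(table)
            table.append(dict(row))


rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]

with_key: List[Dict] = []
push_rows(with_key, rows, "id")
push_rows(with_key, rows, "id")  # repeated call, e.g. after an interruption

without_key: List[Dict] = []
push_rows(without_key, rows, None)
push_rows(without_key, rows, None)  # repeated call duplicates the rows
```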

ckanapi_harvesters.builder.builder_package_1_basic.excel_name_of_builder(resource_builder: BuilderResourceABC) str
ckanapi_harvesters.builder.builder_package_1_basic.excel_name_of_sheet(resource_name: str) str
ckanapi_harvesters.builder.builder_package_1_basic.load_help_page_df(*, engine: str = None) DataFrame

ckanapi_harvesters.builder.builder_package_2_harvesters module

Code to initiate a package builder from a Dataset harvester

class ckanapi_harvesters.builder.builder_package_2_harvesters.BuilderPackageWithHarvesters(package_name: str = None, *, package_id: str = None, title: str = None, description: str = None, private: bool = None, state: CkanState = None, version: str = None, url: str = None, tags: List[str] = None, organization_name: str = None, license_name: str = None, src=None)

Bases: BuilderPackageBasic

copy(dest=None) BuilderPackageWithHarvesters
static init_from_harvester(dataset_harvester: DatasetHarvesterABC) BuilderPackageWithHarvesters

ckanapi_harvesters.builder.builder_package_3_multi_threaded module

Code to upload metadata to the CKAN server, with one thread per resource

class ckanapi_harvesters.builder.builder_package_3_multi_threaded.BuilderPackageMultiThreaded(package_name: str = None, *, package_id: str = None, title: str = None, description: str = None, private: bool = None, state: CkanState = None, version: str = None, url: str = None, tags: List[str] = None, organization_name: str = None, license_name: str = None)

Bases: BuilderPackageWithHarvesters, BuilderMultiABC

copy(dest=None) BuilderPackageWithHarvesters

ckanapi_harvesters.builder.builder_resource module

Code to upload metadata to the CKAN server to create/update an existing package The metadata is defined by the user in an Excel worksheet This file implements the basic resources. See builder_datastore for specific functions to initiate datastores.

class ckanapi_harvesters.builder.builder_resource.BuilderFileABC(*, parent: BuilderPackageWithHarvesters, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, file_name: str = None)

Bases: BuilderResourceABC, ABC

Abstract class defining the behavior for a resource represented by a file (not a DataStore)

copy(*, dest=None, parent=None)
download_request(ckan: CkanApiManage, out_dir: str, *, full_download: bool = True, threads: int = 1, force: bool = False, return_data: bool = False, **kwargs) None

Download the resource and save it to the location given by out_dir. In most implementations, this calls the download_resource_bytes method.

Parameters:
  • ckan

  • out_dir

  • full_download – Some resources like URLs are not downloaded by default. Large datasets are treated with a multi-threaded approach.

  • threads

  • force – option to bypass the enable_download attribute of resources

Returns:

download_resource_bytes(ckan: CkanApiManage, full_download: bool = True, search_all: bool = True, **kwargs) bytes | None

Download the resource and return the data as bytes.

Parameters:
  • ckan

  • out_dir

  • full_download – Some resources like URLs are not downloaded by default. Large datasets are also limited to one request for this function by default.

  • threads

Returns:

patch_request(ckan: CkanApiManage, package_id: str, *, reupload: bool = None, override_ckan: bool = False, resources_base_dir: str = None, payload: bytes | BufferedIOBase = None, inhibit_datastore_patch_indexes: bool = False) CkanResourceInfo

Perform a patch of the resource on the CKAN server. A patch is a full update of the metadata of the resource, and of the DataStore if appropriate. The source file of the resource is also uploaded (or a first file for large DataStores).

Parameters:
  • ckan

  • package_id

  • reupload

  • resources_base_dir

  • payload

Returns:

class ckanapi_harvesters.builder.builder_resource.BuilderFileBinary(*, parent: BuilderPackageWithHarvesters, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, file_name: str = None)

Bases: BuilderFileABC

Concrete implementation for a binary file.

copy(*, dest=None, parent=None)
get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None) str

Function returning the local resource file name for the sample file.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

load_sample_data(resources_base_dir: str) bytes

Function returning the data from the indicated resources.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

static resource_mode_str() str
static sample_file_path_is_url() bool
upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) None | ContextErrorLevelMessage

Test the presence of the files/urls used in the upload/patch requests.

Parameters:

resources_base_dir

Returns:

None if success, error message otherwise

class ckanapi_harvesters.builder.builder_resource.BuilderResourceABC(*, parent: BuilderPackageWithHarvesters, name: str = None, format: str = None, description: str = None, state: CkanState = None, enable_download: bool = True, resource_id: str = None, download_url: str = None, options_string: str = None)

Bases: ABC

_merge_resource_attributes(*, override_ckan: bool = False)

Merge resource attributes into self.resource_attributes in the following priority order:

  1. Existing metadata from the CKAN server. This can be ignored using the override_ckan argument.

  2. Metadata provided by the user in the Excel worksheet.

  3. Metadata found automatically from the data source (e.g. in the file header or database).
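A layered merge of this kind can be sketched with collections.ChainMap, where the first mapping that defines a key wins. The sketch assumes that the order listed above runs from highest to lowest priority and that None values count as "not provided"; both points, and the function merge_attributes itself, are assumptions rather than the library's documented behavior.

```python
from collections import ChainMap
from typing import Any, Dict


def merge_attributes(ckan_meta: Dict[str, Any],
                     excel_meta: Dict[str, Any],
                     source_meta: Dict[str, Any],
                     override_ckan: bool = False) -> Dict[str, Any]:
    """Hypothetical merge: earlier layers take precedence over later ones."""
    layers = [excel_meta, source_meta]
    if not override_ckan:
        # Assumption: CKAN metadata comes first unless it is overridden.
        layers.insert(0, ckan_meta)
    # Assumption: None means "not provided", so a lower layer may fill it.
    cleaned = [{k: v for k, v in layer.items() if v is not None} for layer in layers]
    return dict(ChainMap(*cleaned))


ckan_meta = {"description": "from CKAN", "format": None}
excel_meta = {"description": "from Excel", "format": "CSV"}
source_meta = {"format": "TSV", "encoding": "utf-8"}

merged = merge_attributes(ckan_meta, excel_meta, source_meta)
merged_override = merge_attributes(ckan_meta, excel_meta, source_meta, override_ckan=True)
```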

_to_ckan_resource_info(package_id: str, check_id: bool = True) CkanResourceInfo

Return a resource info object from the information of the Excel workbook. No requests are made, but the ID and name of the resource are mandatory to use this data in the ckan object.

Parameters:
  • package_id

  • check_id

Returns:

_update_metadata(ckan: CkanApiManage, *, base_dir: str = None) None

Function to initialize metadata from the data source. The attribute self.known_resource_info must be queried before this call. Examples for a DataStore:

  • List of fields

  • Detect field types from example DataFrame

  • Add descriptions from data source

clear_secrets_and_disconnect() None
abstractmethod copy(*, dest: BuilderResourceABC = None, parent: BuilderPackageWithHarvesters = None)
delete_request(ckan: CkanApiManage, *, error_not_found: bool = False)

Delete the resource from the CKAN server.

Returns:

abstractmethod download_request(ckan: CkanApiManage, out_dir: str, *, full_download: bool = True, force: bool = False, threads: int = 1, return_data: bool = False) Any

Download the resource and save it to the location given by out_dir. In most implementations, this calls the download_resource_bytes method.

Parameters:
  • ckan

  • out_dir

  • full_download – Some resources like URLs are not downloaded by default. Large datasets are treated with a multi-threaded approach.

  • threads

  • force – option to bypass the enable_download attribute of resources

Returns:

abstractmethod download_resource_bytes(ckan: CkanApiManage, full_download: bool = True, **kwargs) bytes

Download the resource and return the data as bytes.

Parameters:
  • ckan

  • out_dir

  • full_download – Some resources like URLs are not downloaded by default. Large datasets are also limited to one request for this function by default.

  • threads

Returns:

download_sample_df(ckan: CkanApiManage, *, limit: int = 100, search_all: bool = False, download_alter: bool = False, **kwargs) DataFrame | None
get_or_query_package_id(ckan: CkanApiManage) str

Obtain package ID from the package name. This can lead to a request to the API.

get_or_query_resource_id(ckan: CkanApiManage, cancel_if_present: bool = True, error_not_found: bool = True) str

Store/retrieve resource ID in the class attributes.

abstractmethod get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None) str | None

Function returning the local resource file name for the sample file.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

init_options_from_ckan(ckan: CkanApiManage, *, base_dir: str = None) None

Function to initialize some parameters from the ckan object

initialize_from_options_string(base_dir: str, *, options_string: str = None, parser: ArgumentParser = None) None
abstractmethod load_sample_data(resources_base_dir: str) bytes | None

Function returning the data from the indicated resources.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

property package_name

Returns the package name of the parent package. You cannot assign the package name through this property: setting it only performs a check. This will be removed in future releases. To change the package name, change the package_name attribute of the parent_package.

abstractmethod patch_request(ckan: CkanApiManage, package_id: str, *, reupload: bool = None, override_ckan: bool = False, resources_base_dir: str = None, inhibit_datastore_patch_indexes: bool = False) CkanResourceInfo

Function to perform all the necessary requests to initiate/reupload the resource on the CKAN server.

Parameters:
  • resources_base_dir

  • ckan

  • reupload – option to reupload the resource

Returns:

resource_info_request(ckan: CkanApiManage, error_not_found: bool = True) CkanResourceInfo | None
abstractmethod static resource_mode_str() str
abstractmethod static sample_file_path_is_url() bool
abstractmethod upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) None | ContextErrorLevelMessage

Test the presence of the files/urls used in the upload/patch requests.

Parameters:

resources_base_dir

Returns:

None if success, error message otherwise

upload_request(resources_base_dir: str, ckan: CkanApiManage, package_id: str)
upload_request_final(ckan: CkanApiManage, *, force: bool = False) None
class ckanapi_harvesters.builder.builder_resource.BuilderResourceUnmanaged(*, parent: BuilderPackageWithHarvesters, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None)

Bases: BuilderFileABC

Class to manage the metadata of a resource without specifying its contents during the upload process.

copy(*, dest=None, parent=None)
get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None) str | None

Function returning the local resource file name for the sample file.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

load_sample_data(resources_base_dir: str) bytes | None

Function returning the data from the indicated resources.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

patch_request(ckan: CkanApiManage, package_id: str, *, reupload: bool = None, override_ckan: bool = False, resources_base_dir: str = None, payload: bytes | BufferedIOBase = None, inhibit_datastore_patch_indexes: bool = False) CkanResourceInfo

Perform a patch of the resource on the CKAN server. A patch is a full update of the metadata of the resource, and of the DataStore if appropriate. The source file of the resource is also uploaded (or a first file for large DataStores).

Parameters:
  • ckan

  • package_id

  • reupload

  • resources_base_dir

  • payload

Returns:

static resource_mode_str() str
static sample_file_path_is_url() bool
upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) ContextErrorLevelMessage | None

Test the presence of the files/urls used in the upload/patch requests.

Parameters:

resources_base_dir

Returns:

None if success, error message otherwise

class ckanapi_harvesters.builder.builder_resource.BuilderUrl(*, parent: BuilderPackageWithHarvesters, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, url: str = None)

Bases: BuilderUrlABC

Class for a resource defined by an external URL.

copy(*, dest=None, parent=None)
get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None) str

Function returning the local resource file name for the sample file.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

load_sample_data(resources_base_dir: str, *, ckan: CkanApiManage = None, proxies: dict = None, headers: dict = None) bytes

Function returning the data from the indicated resources.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

patch_request(ckan: CkanApiManage, package_id: str, *, reupload: bool = None, override_ckan: bool = False, resources_base_dir: str = None, payload: bytes | BufferedIOBase = None, inhibit_datastore_patch_indexes: bool = False) CkanResourceInfo

Perform a patch of the resource on the CKAN server. A patch is a full update of the metadata of the resource, and of the DataStore if appropriate. The source file of the resource is also uploaded (or a first file for large DataStores).

Parameters:
  • ckan

  • package_id

  • reupload

  • resources_base_dir

  • payload

Returns:

static resource_mode_str() str
static sample_file_path_is_url() bool
class ckanapi_harvesters.builder.builder_resource.BuilderUrlABC(*, parent: BuilderPackageWithHarvesters, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, url: str = None)

Bases: BuilderFileABC, ABC

Abstract behavior for a resource defined by an external URL.

copy(*, dest=None, parent=None)
download_request(ckan: CkanApiManage, out_dir: str, *, full_download: bool = False, threads: int = 1, force: bool = False, return_data: bool = False, **kwargs) None

Download the resource and save it to a file at the location given by out_dir. In most implementations, this calls the download_resource_bytes method.

Parameters:
  • ckan

  • out_dir

  • full_download – Some resources like URLs are not downloaded by default. Large datasets are treated with a multi-threaded approach.

  • threads

  • force – option to bypass the enable_download attribute of resources

Returns:

upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) None | ContextErrorLevelMessage

Test the presence of the files/urls used in the upload/patch requests.

Parameters:

resources_base_dir

Returns:

None if success, error message otherwise

ckanapi_harvesters.builder.builder_resource_datastore module

Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements functions to initiate a DataStore.

class ckanapi_harvesters.builder.builder_resource_datastore.BuilderDataStoreABC(*, parent, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, options_string: str = None, base_dir: str = None)

Bases: BuilderResourceABC, ABC

The base class for DataStore resources. A DataStore resource can be updated with multiple requests and holds metadata for fields.

Parameters:
  • field_builders – Merged metadata for fields (used in requests)

  • field_builders_user – Field metadata specified by the user (metadata from CKAN, if it exists, takes priority)

  • field_builders_data_source – Field metadata which could be obtained from the builder data source

  • primary_key – primary key to transmit to CKAN (cannot be obtained through API)

  • indexes – indexes to transmit to CKAN (cannot be obtained through API)

  • aliases – Resource id aliases for requests (API cannot delete existing aliases)

  • aux_upload_fun_name – Name of the function used to edit DataFrames before uploading

  • aux_download_fun_name – Name of the function used to edit DataFrames after downloading

  • aux_read_fun_name – Name of the function used to read file contents (defines local_file_format as a UserFileFormat)

  • aux_write_fun_name – Name of the function used to write file contents (defines local_file_format as a UserFileFormat)

  • local_file_format – Class used to read/write files

  • df_mapper – DataFrame mapper function. This object adds certain indexes and applies the upload/download functions. It is responsible for mapping DataStore queries to file outputs.

  • data_cleaner_upload – Data sanitizer used to automate certain tasks and replace invalid values (default is None)

_check_necessary_fields(current_fields: Set[str] = None, empty_datastore: bool = False, raise_error: bool = True) Set[str]

Auxiliary function to list the required fields:

  • fields needed by df_mapper to determine the file names and associated requests, and to recognize the last inserted row of a document

  • fields needed to initialize the DataStore with the columns for the primary key and indexes

The required fields are compared to current_fields, if provided.
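The comparison described here amounts to a set difference. The following is a minimal sketch of that idea only, not the library's implementation: the helper names `required_fields` and `missing_fields`, and the `mapper_fields` argument, are hypothetical.

```python
from typing import Iterable, Optional, Set


def required_fields(primary_key: Iterable[str], indexes: Iterable[str],
                    mapper_fields: Iterable[str] = ()) -> Set[str]:
    """Collect every column that must exist (hypothetical helper)."""
    return set(primary_key) | set(indexes) | set(mapper_fields)


def missing_fields(required: Set[str], current_fields: Optional[Set[str]],
                   raise_error: bool = True) -> Set[str]:
    """Compare the required columns to the ones currently available."""
    if current_fields is None:
        return set()  # nothing to compare against
    missing = required - current_fields
    if missing and raise_error:
        raise KeyError(f"Missing required fields: {sorted(missing)}")
    return missing


required = required_fields(["station", "timestamp"], ["station"])
missing = missing_fields(required, {"station", "value"}, raise_error=False)
```

With `raise_error=False`, the caller receives the missing column names and can decide how to report them.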

_get_fields_update(ckan: CkanApiManage, *, current_df_fields: Set[str] | None, data_cleaner_fields: List[dict] | None, reupload: bool, override_ckan: bool) OrderedDict[str, CkanField]

Merge field builders in the following order of priority:

  1. Existing metadata from CKAN (can be ignored with option override_ckan)

  2. Metadata specified by the user in the Excel worksheet

  3. Metadata found automatically from the data source (e.g. in file header or database)

  4. Metadata found automatically by the data cleaner, especially for field typing
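One way to picture this priority order is a merge in which earlier sources win and later sources only fill gaps. The sketch below uses plain dictionaries in place of the library's field objects; `merge_field_metadata` and the sample values are illustrative assumptions, not the actual implementation.

```python
from typing import Dict, Optional

FieldMeta = Dict[str, Optional[str]]


def merge_field_metadata(*sources: Dict[str, FieldMeta]) -> Dict[str, FieldMeta]:
    """Merge per-field metadata: earlier sources win, later ones fill gaps."""
    merged: Dict[str, FieldMeta] = {}
    for source in sources:
        for field_name, meta in source.items():
            target = merged.setdefault(field_name, {})
            for key, value in meta.items():
                # Only fill attributes that no higher-priority source has set.
                if value is not None and target.get(key) is None:
                    target[key] = value
    return merged


ckan_meta = {"temperature": {"type": "numeric", "description": None}}
user_meta = {"temperature": {"type": "text", "description": "Air temperature"}}
source_meta = {"station": {"type": "text", "description": None}}
cleaner_meta = {"station": {"type": "text"}, "temperature": {"type": "float8"}}

override_ckan = False  # when True, the CKAN metadata is left out of the merge
ordered = ([] if override_ckan else [ckan_meta]) + [user_meta, source_meta, cleaner_meta]
merged = merge_field_metadata(*ordered)
```

Here the type of `temperature` comes from CKAN while its description, empty on the CKAN side, is filled from the user metadata.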

_merge_resource_attributes_from_file() None

This function merges metadata which could have been extracted from a file reading function into the attributes from data source. Call after self.local_file_format.read_file()

apply_one_frame_per_primary_key(group_by_argument: str | List[str] = None)

Enables mode --one-frame-per-primary-key and applies option --group-by.

In this mode, the upload process expects one DataFrame per primary key combination (except the last field of the primary key, which could be an index in the file). Upload update checks are performed using this assumption (do not read files by chunks). Downloads fill files according to unique combinations of the first columns of the primary key.
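To illustrate what "one frame per primary key combination" means, the sketch below groups rows by every primary key column except the last. It uses lists of dictionaries instead of DataFrames to stay dependency-free, and `split_by_leading_key` is a hypothetical helper, not part of the package.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def split_by_leading_key(records: List[Record],
                         primary_key: List[str]) -> Dict[Tuple[Any, ...], List[Record]]:
    """Group rows by all primary key columns except the last one."""
    leading = primary_key[:-1]
    frames: Dict[Tuple[Any, ...], List[Record]] = defaultdict(list)
    for row in records:
        frames[tuple(row[column] for column in leading)].append(row)
    return dict(frames)


records = [
    {"station": "A", "day": "2024-01-01", "line": 0, "value": 1.5},
    {"station": "A", "day": "2024-01-01", "line": 1, "value": 1.7},
    {"station": "B", "day": "2024-01-01", "line": 0, "value": 0.3},
]
frames = split_by_leading_key(records, ["station", "day", "line"])
```

Each key of `frames` corresponds to one expected frame; the last primary key column (`line` in this example) only distinguishes rows within a frame.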

copy(*, dest=None, parent=None)
download_resource_bytes(ckan: CkanApiManage, full_download: bool = True, **kwargs) bytes

Download the resource and return the data as bytes.

Parameters:
  • ckan

  • out_dir

  • full_download – Some resources like URLs are not downloaded by default. Large datasets are also limited to one request for this function by default.

  • threads

Returns:

download_resource_df(ckan: CkanApiManage, search_all: bool = True, download_alter: bool = True, **kwargs) DataFrame | None

Download the resource and return it as a DataFrame. This is the DataFrame equivalent for download_resource_bytes.

Parameters:
  • ckan

  • search_all

  • download_alter

  • kwargs

Returns:

download_sample_df(ckan: CkanApiManage, *, limit: int = 100, search_all: bool = False, download_alter: bool = False, pop_id: bool = True, **kwargs) DataFrame | None

Download the first lines of a DataStore. Extra options apply to datastore_dump API.

get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None) None

Function returning the local resource file name for the sample file.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

init_options_from_ckan(ckan: CkanApiManage, *, base_dir: str = None) None

Function to initialize some parameters from the ckan object

initialize_extra_options_string(extra_options_string: str, base_dir: str) None
initialize_from_options_string(base_dir: str, *, options_string: str = None, parser: ArgumentParser = None) None
load_sample_data(resources_base_dir: str) bytes

Function returning the data from the indicated resources.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

abstractmethod load_sample_df(resources_base_dir: str, *, upload_alter: bool = True) ListRecords | DataFrame

Function returning the data from the indicated resources as a pandas DataFrame. This is the DataFrame equivalent for load_sample_data.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

patch_request(ckan: CkanApiManage, package_id: str, *, df_upload: DataFrame = None, reupload: bool = None, override_ckan: bool = False, resources_base_dir: str = None, inhibit_datastore_patch_indexes: bool = False) CkanResourceInfo

Function to perform all the necessary requests to initiate/reupload the resource on the CKAN server.

Parameters:
  • resources_base_dir

  • ckan

  • reupload – option to reupload the resource

Returns:

print_help_cli(display: bool = True) str
static sample_file_path_is_url() bool
setup_default_file_mapper(*, primary_key: List[str] = None, file_query_list: Collection[Tuple[str, dict]] = None) None
upsert_request_df(ckan: CkanApiManage, df_upload: DataFrame, *, total_lines_read: int, file_name: str, method: UpsertChoice = UpsertChoice.Upsert, apply_last_condition: bool = None, always_last_condition: bool = None) Tuple[DataFrame, DataFrame]

Call to ckan datastore_upsert. Before sending the DataFrame, a call to df_upload_alter is made. This method is overridden in BuilderDataStoreMultiABC and BuilderDataStoreFolder.

Parameters:
  • ckan

  • df_upload

  • method

Returns:

upsert_request_final(ckan: CkanApiManage, *, force: bool = False) None

Final steps after the last upsert query. These steps are automatically done for a DataStore defined by one file.

Parameters:
  • ckan

  • force – perform the request anyway

Returns:

class ckanapi_harvesters.builder.builder_resource_datastore.BuilderResourceIgnored(*, parent, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, file_url: str = None, options_string: str = None, base_dir: str = None)

Bases: BuilderDataStoreABC

Class that keeps a line in the resource builders list but performs no action. It can still hold field metadata.

copy(*, dest=None, parent=None)
download_request(ckan: CkanApiManage, out_dir: str, *, full_download: bool = True, force: bool = False, threads: int = 1, return_data: bool = False) Any

Download the resource and save it to a file at the location given by out_dir. In most implementations, this calls the download_resource_bytes method.

Parameters:
  • ckan

  • out_dir

  • full_download – Some resources like URLs are not downloaded by default. Large datasets are treated with a multi-threaded approach.

  • threads

  • force – option to bypass the enable_download attribute of resources

Returns:

download_resource_bytes(ckan: CkanApiManage, full_download: bool = True, **kwargs) bytes

Download the resource and return the data as bytes.

Parameters:
  • ckan

  • out_dir

  • full_download – Some resources like URLs are not downloaded by default. Large datasets are also limited to one request for this function by default.

  • threads

Returns:

get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None) str | None

Function returning the local resource file name for the sample file.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

load_sample_data(resources_base_dir: str) bytes | None

Function returning the data from the indicated resources.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

load_sample_df(resources_base_dir: str, *, upload_alter: bool = True) None

Function returning the data from the indicated resources as a pandas DataFrame. This is the DataFrame equivalent for load_sample_data.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

patch_request(ckan: CkanApiManage, package_id: str, *, reupload: bool = None, override_ckan: bool = False, resources_base_dir: str = None, payload: bytes | BufferedIOBase = None, inhibit_datastore_patch_indexes: bool = False) None

Function to perform all the necessary requests to initiate/reupload the resource on the CKAN server.

Parameters:
  • resources_base_dir

  • ckan

  • reupload – option to reupload the resource

Returns:

static resource_mode_str() str
static sample_file_path_is_url() bool
upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) ContextErrorLevelMessage | None

Test the presence of the files/urls used in the upload/patch requests.

Parameters:

resources_base_dir

Returns:

None if success, error message otherwise

ckanapi_harvesters.builder.builder_resource_datastore_file module

Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements functions to initiate a DataStore.

class ckanapi_harvesters.builder.builder_resource_datastore_file.BuilderDataStoreFile(*, parent, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, file_name: str = None, options_string: str = None, base_dir: str = None)

Bases: BuilderDataStoreFolder

Implementation supporting the reading of a file by chunks

copy(*, dest=None, parent=None)
download_request(ckan: CkanApiManage, out_dir: str, *, full_download: bool = True, force: bool = False, threads: int = 1, return_data: bool = False) DataFrame | None

Download the resource and save it to a file at the location given by out_dir. In most implementations, this calls the download_resource_bytes method.

Parameters:
  • ckan

  • out_dir

  • full_download – Some resources like URLs are not downloaded by default. Large datasets are treated with a multi-threaded approach.

  • threads

  • force – option to bypass the enable_download attribute of resources

Returns:

download_request_full(ckan: CkanApiManage, out_dir: str, threads: int = 1, external_stop_event=None, start_index: int = 0, end_index: int = None, force: bool = False) None
get_local_file_count() int

Get the number of parts of the upload.

get_local_file_offset(file_chunk: FileChunkDataFrame) int

Get the position of the current data in the overall upload.

get_local_file_size_units()
get_local_file_total_size() int

Get the overall size of the upload, normally in bytes or line count.

get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None, file_index: int = 0) str

Function returning the local resource file name for the sample file.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

list_local_files(resources_base_dir: str, ckan: CkanApiManage, cancel_if_present: bool = True) List[str]
static resource_mode_str() str
static sample_file_path_is_url() bool
to_builder_datastore_folder(*, dir_name: str = None, primary_key: List[str] = None, file_query_list: Collection[Tuple[str, dict]] = None) BuilderDataStoreFolder
upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) None | ContextErrorLevelMessage

Test the presence of the files/urls used in the upload/patch requests.

Parameters:

resources_base_dir

Returns:

None if success, error message otherwise

ckanapi_harvesters.builder.builder_resource_datastore_multi_abc module

Code to initiate a DataStore defined by a large number of files to concatenate into one table

class ckanapi_harvesters.builder.builder_resource_datastore_multi_abc.BuilderDataStoreMultiABC(*, parent, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, options_string: str = None, base_dir: str = None)

Bases: BuilderDataStoreABC, BuilderMultiABC, ABC

Generic class to manage a large DataStore divided into files/parts. This abstract class is intended to be subclassed in order to generate data from the workspace, without using CSV files.

_update_metadata(ckan: CkanApiManage, *, base_dir: str = None) None

In certain implementations, the resource & field metadata can be derived from the data source. Normally, the metadata is defined by the user in an Excel worksheet. When a description is left empty, the value left on the CKAN server is left unchanged. The objective here is to propose values that override the Excel worksheet when the description is empty on the CKAN side (still leave CKAN values unchanged, if present).

Parameters:
  • ckan – CkanApi instance

  • override_ckan – when True, override the values from the CKAN server, if present

copy(*, dest=None, parent=None)
download_file_query_generator(ckan: CkanApiManage, file_query: dict) Generator[DataFrame, Any, None]

Download the DataFrame with the file_query arguments

download_request_full(ckan: CkanApiManage, out_dir: str, threads: int = 1, external_stop_event=None, start_index: int = 0, end_index: int = None, force: bool = False) None
download_request_generator(ckan: CkanApiManage, out_dir: str) Generator[Tuple[Any, DataFrame], Any, None]

Iterator on file_queries.

download_resource_bytes(ckan: CkanApiManage, full_download: bool = False, **kwargs) bytes

Download the resource and return the data as bytes.

Parameters:
  • ckan

  • out_dir

  • full_download – Some resources like URLs are not downloaded by default. Large datasets are also limited to one request for this function by default.

  • threads

Returns:

download_resource_df(ckan: CkanApiManage, search_all: bool = False, **kwargs) DataFrame

Download the resource and return it as a DataFrame. This is the DataFrame equivalent for download_resource_bytes.

Parameters:
  • ckan

  • search_all

  • download_alter

  • kwargs

Returns:

get_datastore_len(ckan: CkanApiManage) int
setup_default_file_mapper(*, primary_key: List[str] = None, file_query_list: Collection[Tuple[str, dict]] = None) None

This function enables the user to define the primary key and initializes the default file mapper.

Parameters:

primary_key – manually specify the primary key

Returns:

upload_request_final(ckan: CkanApiManage, *, force: bool = False) None
upload_request_full(ckan: CkanApiManage, resources_base_dir: str, *, method: UpsertChoice = None, threads: int = 1, external_stop_event=None, allow_chunks: bool = True, only_missing: bool = False, from_line_count: bool = False, start_index: int = 0, end_index: int = None, inhibit_datastore_patch_indexes: bool = False, **kwargs) None

Perform all the upload requests.

Parameters:
  • ckan

  • resources_base_dir

  • threads

  • external_stop_event

  • only_missing

  • start_index

  • end_index

Returns:

upsert_request_df_no_return(ckan: CkanApiManage, df_upload: DataFrame, *, total_lines_read: int, file_name: str, method: UpsertChoice = UpsertChoice.Upsert, apply_last_condition: bool = None, always_last_condition: bool = None) None

Calls upsert_request_df but does not return anything

Returns:

upsert_request_final(ckan: CkanApiManage, *, force: bool = False) None

Final steps after the last upsert query. This call is mandatory at the end of all requests if the user called upsert_request_df for a multi-part DataStore manually.

Parameters:
  • ckan

  • force – perform the request anyway

Returns:
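The calling pattern implied above (several upserts followed by one mandatory final call) can be sketched as follows. `RecordingBuilder` is a hypothetical stand-in that only records calls and does not contact CKAN; the value passed as `total_lines_read` (a running row count) is an assumption made for illustration.

```python
from typing import Any, List, Tuple


class RecordingBuilder:
    """Hypothetical stand-in for a multi-part DataStore builder."""

    def __init__(self) -> None:
        self.calls: List[Tuple[Any, ...]] = []

    def upsert_request_df(self, ckan: Any, df_upload: Any, *,
                          total_lines_read: int, file_name: str) -> None:
        self.calls.append(("upsert", file_name, total_lines_read))

    def upsert_request_final(self, ckan: Any, *, force: bool = False) -> None:
        self.calls.append(("final", force))


def upload_parts(builder: RecordingBuilder, ckan: Any,
                 parts: List[Tuple[str, list]]) -> None:
    """Send each part, then issue the final call exactly once."""
    total = 0
    for file_name, rows in parts:
        total += len(rows)  # assumed meaning of total_lines_read
        builder.upsert_request_df(ckan, rows, total_lines_read=total,
                                  file_name=file_name)
    builder.upsert_request_final(ckan)


builder = RecordingBuilder()
upload_parts(builder, None, [("part_0.csv", [1, 2, 3]), ("part_1.csv", [4, 5])])
```

The point of the sketch is the ordering: the final call comes once, after the last upsert, rather than after each part.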

ckanapi_harvesters.builder.builder_resource_datastore_multi_ckan module

Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements functions to initiate a DataStore.

class ckanapi_harvesters.builder.builder_resource_datastore_multi_ckan.BuilderDataStoreCkan(*, parent, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, file_name: str = None, options_string: str = None, base_dir: str = None)

Bases: BuilderDataStoreFolder

Merge of existing CKAN DataStores (on the same server) into a single DataStore

copy(*, dest=None, parent=None)
get_local_df_chunk_generator(resources_base_dir: str, ckan: CkanApiManage, allow_chunks: bool = True, **kwargs) Generator[FileChunkDataFrame, None, None]

Returns an iterator over the data to upload and a position in the current file.

get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None, file_index: int = 0) str

Function returning the local resource file name for the sample file.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

list_local_files(resources_base_dir: str, ckan: CkanApiManage, cancel_if_present: bool = True) List[str]
static resource_mode_str() str
static sample_file_path_is_url() bool
upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) None | ContextErrorLevelMessage

Test the presence of the files/urls used in the upload/patch requests.

Parameters:

resources_base_dir

Returns:

None if success, error message otherwise

ckanapi_harvesters.builder.builder_resource_datastore_multi_folder module

Code to initiate a DataStore defined by a large number of files to concatenate into one table. This concrete implementation is linked to the file system.

class ckanapi_harvesters.builder.builder_resource_datastore_multi_folder.BuilderDataStoreFolder(*, parent, file_query_list: List[Tuple[str, dict]] = None, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, dir_name: str = None, options_string: str = None, base_dir: str = None)

Bases: BuilderDataStoreMultiABC

copy(*, dest=None, parent=None)
download_file_query(ckan: CkanApiManage, out_dir: str, file_name: str, file_query: dict, *, return_df: bool = False) str | None | Tuple[str | None, DataFrame | None]
download_file_query_item(ckan: CkanApiManage, out_dir: str, file_query_item: Tuple[str, dict]) str

Download the file_query item with its arguments.

download_file_query_list(ckan: CkanApiManage, cancel_if_present: bool = True) List[Tuple[str, dict]]
download_request(ckan: CkanApiManage, out_dir: str, *, full_download: bool = False, force: bool = False, threads: int = 1, return_data: bool = False) None

Download the resource and save it to a file at the location given by out_dir. In most implementations, this calls the download_resource_bytes method.

Parameters:
  • ckan

  • out_dir

  • full_download – Some resources like URLs are not downloaded by default. Large datasets are treated with a multi-threaded approach.

  • threads

  • force – option to bypass the enable_download attribute of resources

Returns:

get_file_query_count() int

Returns the total number of file_queries.

get_file_query_generator() Generator[Tuple[str, dict], Any, None]

Returns an iterator on all the file_queries.

get_local_df_chunk_generator(resources_base_dir: str, ckan: CkanApiManage, allow_chunks: bool = True, **kwargs) Generator[FileChunkDataFrame, None, None]

Returns an iterator over the data to upload and a position in the current file.

get_local_file_count() int

Get the number of parts of the upload.

get_local_file_offset(file_chunk: FileChunkDataFrame) int

Get the position of the current data in the overall upload.

get_local_file_size_units()
get_local_file_total_size() int

Get the overall size of the upload, normally in bytes or line count.

get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage = None, file_index: int = 0) str | None

Function returning the local resource file name for the sample file.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

init_download_file_query_list(ckan: CkanApiManage, out_dir: str, cancel_if_present: bool = True, **kwargs) List[Any]

Determine the list of queries to download to reconstruct the uploaded parts. By default, the unique combinations of the first columns of the primary key are used.
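As a rough illustration of the default behaviour described here, the sketch below derives one (file name, filter) pair per unique combination of the leading primary key columns, matching the `List[Tuple[str, dict]]` shape used for file_query_list. The helper `build_file_queries` and the file naming scheme are hypothetical; the real method queries the DataStore rather than a local list.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def build_file_queries(records: List[Record], primary_key: List[str],
                       extension: str = ".csv") -> List[Tuple[str, dict]]:
    """Return one (file_name, filters) pair per leading-key combination."""
    leading = primary_key[:-1]
    queries: Dict[Tuple[Any, ...], Tuple[str, dict]] = {}
    for row in records:
        combo = tuple(row[column] for column in leading)
        if combo not in queries:
            # Assumed naming scheme: key values joined by underscores.
            file_name = "_".join(str(value) for value in combo) + extension
            queries[combo] = (file_name, dict(zip(leading, combo)))
    return list(queries.values())


rows = [
    {"station": "A", "year": 2023, "line": 0},
    {"station": "A", "year": 2023, "line": 1},
    {"station": "A", "year": 2024, "line": 0},
]
file_queries = build_file_queries(rows, ["station", "year", "line"])
```

Each filter dictionary selects the rows belonging to one output file, so downloading the queries one by one reconstructs the uploaded parts.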

init_local_files_list(resources_base_dir: str, ckan: CkanApiManage, cancel_if_present: bool = True, **kwargs) List[str]

Behavior to list parts of an upload.

list_local_files(resources_base_dir: str, ckan: CkanApiManage, cancel_if_present: bool = True) List[str]
load_sample_df(resources_base_dir: str, *, upload_alter: bool = True, file_index: int = 0, allow_chunks: bool = True, **kwargs) ListRecords | DataFrame

Function returning the data from the indicated resources as a pandas DataFrame. This is the DataFrame equivalent for load_sample_data.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

static resource_mode_str() str
setup_download_file_query_list(file_query_list: List[Tuple[str, dict]]) None
upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) None | ContextErrorLevelMessage

Test the presence of the files/urls used in the upload/patch requests.

Parameters:

resources_base_dir

Returns:

None if success, error message otherwise

upsert_request_df(ckan: CkanApiManage, df_upload: DataFrame, *, total_lines_read: int, file_name: str, method: UpsertChoice = UpsertChoice.Upsert, apply_last_condition: bool = None, always_last_condition: bool = None) Tuple[DataFrame, DataFrame]

Call to ckan datastore_upsert. Before sending the DataFrame, a call to df_upload_alter is made. This implementation optionally checks for the last line of the DataFrame based on the first columns of the primary key.

Parameters:
  • ckan

  • df_upload

  • method

Returns:

ckanapi_harvesters.builder.builder_resource_datastore_multi_harvester module

Code to initiate a DataStore defined by a large number of files to concatenate into one table. This concrete implementation is linked to the file system.

class ckanapi_harvesters.builder.builder_resource_datastore_multi_harvester.BuilderDataStoreHarvester(*, parent, file_query_list: List[Tuple[str, dict]] = None, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, dir_name: str = None, file_url_attr: str = None, options_string: str = None, base_dir: str = None)

Bases: BuilderDataStoreFolder

clear_secrets_and_disconnect() None
copy(*, dest=None, parent=None)
static from_file_datastore(resource_file: BuilderDataStoreFile, *, dir_name: str = None, primary_key: List[str] = None, file_query_list: Collection[Tuple[str, dict]] = None) BuilderDataStoreHarvester

Do not initialize a BuilderDataStoreHarvester with this method. Instead, create a new instance of the class.

Raises:

NotImplementedError

get_local_df_chunk_generator(resources_base_dir: str, ckan: CkanApiManage, **kwargs) Generator[FileChunkDataFrame, None, None]

Returns an iterator over the data to upload and a position in the current file.

get_local_file_count() int

Get the number of parts of the upload.

get_local_file_size_units()
get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None, file_index: int = 0) Any | None

Function returning the local resource file name for the sample file.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

property harvester: TableHarvesterABC | None
init_local_files_list(resources_base_dir: str, ckan: CkanApiManage, cancel_if_present: bool = True, **kwargs) List[str]

Behavior to list parts of an upload.

init_options_from_ckan(ckan: CkanApiManage, *, base_dir: str = None) None

Function to initialize some parameters from the ckan object

initialize_extra_options_string(extra_options_string: str, base_dir: str) None
list_local_files(resources_base_dir: str, ckan: CkanApiManage | None, cancel_if_present: bool = True) List[Any]
static resource_mode_str() str
upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) None | ContextErrorLevelMessage

Test the presence of the files/urls used in the upload/patch requests.

Parameters:

resources_base_dir

Returns:

None if success, error message otherwise

upsert_request_df(ckan: CkanApiManage, df_upload: DataFrame, *, total_lines_read: int, file_name: str, method: UpsertChoice = UpsertChoice.Upsert, apply_last_condition: bool = None, always_last_condition: bool = None) Tuple[DataFrame, DataFrame]

Call to ckan datastore_upsert. Before sending the DataFrame, a call to df_upload_alter is made. This implementation optionally checks for the last line of the DataFrame based on the first columns of the primary key.

Parameters:
  • ckan

  • df_upload

  • method

Returns:

ckanapi_harvesters.builder.builder_resource_datastore_unmanaged module

Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements functions to initiate a DataStore without uploading any data.

class ckanapi_harvesters.builder.builder_resource_datastore_unmanaged.BuilderDataStoreUnmanaged(*, parent, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, options_string: str = None, base_dir: str = None)

Bases: BuilderDataStoreFile

Class representing a DataStore (resource metadata and fields metadata) without managing its contents during the upload process.

copy(*, dest=None, parent=None)
get_local_df_chunk_generator(resources_base_dir: str, ckan: CkanApiManage, **kwargs) Generator[Tuple[ListRecords | DataFrame, int], None, None]

Returns an iterator over the data to upload and a position in the current file.

get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None, file_index: int = 0) None

Function returning the local resource file name for the sample file.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

init_local_files_list(resources_base_dir: str, cancel_if_present: bool = True, **kwargs) List[str]

Behavior to list parts of an upload.

load_sample_df(resources_base_dir: str, *, upload_alter: bool = True, file_index: int = 0, allow_chunks: bool = True, **kwargs) DataFrame | None

Function returning the data from the indicated resources as a pandas DataFrame. This is the DataFrame equivalent for load_sample_data.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

patch_request(ckan: CkanApiManage, package_id: str, *, df_upload: DataFrame = None, reupload: bool = None, override_ckan: bool = False, resources_base_dir: str = None, inhibit_datastore_patch_indexes: bool = False) CkanResourceInfo

Specific implementation of patch_request which does not upload any data and only updates the fields currently present in the database

Parameters:
  • resources_base_dir

  • ckan

  • package_id

  • reupload

Returns:

static resource_mode_str() str
upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) None | ContextErrorLevelMessage

Test the presence of the files/urls used in the upload/patch requests.

Parameters:

resources_base_dir

Returns:

None if success, error message otherwise

ckanapi_harvesters.builder.builder_resource_datastore_url module

Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements functions to initiate a DataStore without uploading any data.

class ckanapi_harvesters.builder.builder_resource_datastore_url.BuilderDataStoreUrl(*, parent, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, url: str = None, options_string: str = None, base_dir: str = None)

Bases: BuilderDataStoreFile

Class representing a DataStore (resource metadata and fields metadata) defined by a url.

copy(*, dest=None, parent=None)
get_local_df_chunk_generator(resources_base_dir: str, ckan: CkanApiManage, **kwargs) Generator[FileChunkDataFrame, None, None]

Returns an iterator over the data to upload and a position in the current file.

get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None, file_index: int = 0) str

Function returning the local resource file name for the sample file.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

init_local_files_list(resources_base_dir: str, cancel_if_present: bool = True, **kwargs) List[str]

Behavior to list parts of an upload.

load_sample_data(resources_base_dir: str, *, ckan: CkanApiManage = None, proxies: dict = None, headers: dict = None) bytes

Function returning the data from the indicated resources.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

patch_request(ckan: CkanApiManage, package_id: str, *, df_upload: DataFrame = None, payload: bytes | BufferedIOBase = None, reupload: bool = None, override_ckan: bool = False, resources_base_dir: str = None, inhibit_datastore_patch_indexes: bool = False) CkanResourceInfo

Specific implementation of patch_request which does not upload any data and only updates the fields currently present in the database

Parameters:
  • resources_base_dir

  • ckan

  • package_id

  • reupload

Returns:

static resource_mode_str() str
static sample_file_path_is_url() bool
upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) None | ContextErrorLevelMessage

Test the presence of the files/urls used in the upload/patch requests.

Parameters:

resources_base_dir

Returns:

None if success, error message otherwise

upload_request_full(ckan: CkanApiManage, resources_base_dir: str, *, threads: int = 1, external_stop_event=None, start_index: int = 0, end_index: int = None, inhibit_datastore_patch_indexes: bool = False, **kwargs) None

Perform all the upload requests.

Parameters:
  • ckan

  • resources_base_dir

  • threads

  • external_stop_event

  • only_missing

  • start_index

  • end_index

Returns:

ckanapi_harvesters.builder.builder_resource_init module

Code to initialize a resource builder from a row

ckanapi_harvesters.builder.builder_resource_init.init_resource_from_ckan(ckan: CkanApiMap, resource_info: CkanResourceInfo, parent) BuilderResourceABC

Function initiating a resource builder based on information provided by the CKAN API.

Returns:

ckanapi_harvesters.builder.builder_resource_init.init_resource_from_df(row: Series, parent, base_dir: str = None) BuilderResourceABC | None

Function mapping keywords to a resource builder type.

Parameters:

row

Returns:

ckanapi_harvesters.builder.builder_resource_multi_abc module

Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements the basic resources. See builder_datastore for specific functions to initiate datastores.

class ckanapi_harvesters.builder.builder_resource_multi_abc.BuilderMultiABC

Bases: ABC

_call_progress_callback(position: int, total: int, *, info: Any = None, context: str = None, file_index: int = 0, file_count: int = None, lines_chunk: int = None, total_lines_read: int = None, canceled_request: bool = False, end_message: bool = False, level: int = 0) None

Progress callback function. Use it to implement a progress indication for the user.

Parameters:
  • position – the position within the resource (usually, in bytes or line count)

  • total – the total size of the resource

  • info – an object from which more information can be extracted, typically, the DataFrame itself, with an indication of the data origin.

  • context – the context of the call (ckan instance, upload/download, single/multi-threaded)

  • file_index – the index of the file in the list

  • file_count – the number of files in the list

  • lines_chunk – the number of lines in the chunk currently being processed

  • total_lines_read – the total number of lines read, including the current chunk

  • canceled_request – this callback is also called when a line is ignored

  • end_message – boolean indicating the end of the work in progress

  • level – the level of the progress callback (1: package/dataset, 2: resource builder, 3: used for multi-file resources)
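As an illustration of how these arguments fit together, the following minimal sketch builds a one-line progress message from a subset of them. The function name format_progress is hypothetical and not part of ckanapi_harvesters; only the parameter names are taken from the list above, and the message layout is an assumption.

```python
from typing import Optional


def format_progress(position: int, total: int, *,
                    context: Optional[str] = None, file_index: int = 0,
                    file_count: Optional[int] = None,
                    end_message: bool = False, level: int = 0) -> str:
    """Hypothetical helper: build a one-line progress message."""
    # Guard against a zero or unknown total to avoid a division by zero.
    percent = 100.0 * position / total if total else 0.0
    parts = ["  " * level]  # indent nested levels (assumed display choice)
    if context:
        parts.append(f"[{context}] ")
    if file_count is not None:
        # file_index is zero-based, so display it one-based.
        parts.append(f"file {file_index + 1}/{file_count} ")
    parts.append(f"{position}/{total} ({percent:.1f}%)")
    if end_message:
        parts.append(" done")
    return "".join(parts)
```

A class overriding the callback could forward its arguments to such a helper and print or log the result.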

abstractmethod _unit_download_apply(ckan: CkanApiManage, file_query_item: Any, out_dir: str, index: int, start_index: int, end_index: int, total: int, **kwargs) Any

Unitary function deciding whether to perform download and making the steps for the request.

_unit_upload_apply(*, ckan: CkanApiManage, file_chunk: FileChunkDataFrame, upload_alter: bool = True, overall_chunk_index: int, file_count: int, start_index: int, end_index: int, **kwargs) Any

Unitary function deciding whether to perform upload and making the steps for the upload.

copy(*, dest=None)
abstractmethod download_file_query_item(ckan: CkanApiManage, out_dir: str, file_query_item: Any) Any

Download the file_query item with its arguments.

download_file_query_item_graceful(ckan: CkanApiManage, out_dir: str, file_query_item: Any, index: int, external_stop_event=None, start_index: int = 0, end_index: int = None, **kwargs) None

Implementation of download_file_query_item with checks for a multi-threaded download.

download_request_full(ckan: CkanApiManage, out_dir: str, threads: int = 1, external_stop_event=None, start_index: int = 0, end_index: int = None, force: bool = False, **kwargs) None
download_request_full_multi_threaded(ckan: CkanApiManage, out_dir: str, threads: int = None, external_stop_event=None, start_index: int = 0, end_index: int = -1, **kwargs) None

Multi-threaded implementation of download_request_full using ThreadPoolExecutor.
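The general pattern of a thread pool combined with a stop event can be sketched as follows. This is not the library's implementation: run_items_multi_threaded and its work argument are hypothetical names, and the only elements taken from the signature above are threads, external_stop_event, start_index and end_index. The sketch assumes that the stop event is a threading.Event and that the indexes select a slice of the items.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


def run_items_multi_threaded(items: Sequence[str],
                             work: Callable[[str], str],
                             threads: int = 2,
                             external_stop_event: Optional[threading.Event] = None,
                             start_index: int = 0,
                             end_index: Optional[int] = None) -> List[Optional[str]]:
    """Hypothetical sketch: apply `work` to items[start_index:end_index] in a pool."""
    selected = list(items[start_index:end_index])

    def graceful(item: str) -> Optional[str]:
        # Skip the item if a stop was requested before this task started.
        if external_stop_event is not None and external_stop_event.is_set():
            return None
        return work(item)

    with ThreadPoolExecutor(max_workers=threads) as executor:
        # executor.map returns the results in the order of the inputs.
        return list(executor.map(graceful, selected))
```

Checking the event at the start of each task lets queued tasks finish quickly after a stop request, without interrupting a request that is already running.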

abstractmethod download_request_generator(ckan: CkanApiManage, out_dir: str) Generator[Any, Any, None]

Generator to apply treatments after each request (single-threaded).

Parameters:
  • ckan

  • out_dir

Returns:

abstractmethod get_file_query_count() int

Returns the total number of file_queries.

abstractmethod get_file_query_generator() Generator[Any, Any, None]

Returns an iterator on all the file_queries.

abstractmethod get_local_df_chunk_generator(resources_base_dir: str, ckan: CkanApiManage, **kwargs) Generator[FileChunkDataFrame, None, None]

Returns an iterator over the data to upload and a position in the current file.

abstractmethod get_local_file_count() int

Get the number of parts of the upload.

abstractmethod get_local_file_offset(file_chunk: FileChunkDataFrame) int

Get the position of the current data in the overall upload.

abstractmethod get_local_file_size_units() CkanProgressUnits
abstractmethod get_local_file_total_size() int

Get the overall size of the upload, normally in bytes or line count.

abstractmethod init_download_file_query_list(ckan: CkanApiManage, out_dir: str, cancel_if_present: bool = True, **kwargs) List[Any]

Determine the list of queries to download to reconstruct the uploaded parts. By default, the unique combinations of the first columns of the primary key are used.

abstractmethod init_local_files_list(resources_base_dir: str, ckan: CkanApiManage, cancel_if_present: bool = True, **kwargs) List[str]

Behavior to list parts of an upload.

upload_request_final(ckan: CkanApiManage, *, force: bool = False) None
upload_request_full(ckan: CkanApiManage, resources_base_dir: str, *, threads: int = 1, external_stop_event=None, from_line_count: bool = False, allow_chunks: bool = True, start_index: int = 0, end_index: int = None, inhibit_datastore_patch_indexes: bool = False, **kwargs) None

Perform all the upload requests.

Parameters:
  • ckan

  • resources_base_dir

  • threads

  • external_stop_event

  • only_missing

  • start_index

  • end_index

Returns:
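The positive_end_index helper of builder_aux is documented as following the Python slice convention for stop indexes. Assuming start_index and end_index use the same convention here, the resolution of end_index could look like the sketch below. The name resolve_end_index is hypothetical, and the clamping to the range [0, total] is an assumption rather than documented behavior.

```python
from typing import Optional


def resolve_end_index(end_index: Optional[int], total: int) -> int:
    """Hypothetical re-statement of the slice-style stop index convention."""
    if end_index is None:
        return total                      # no limit: process every part
    if end_index < 0:
        return max(total + end_index, 0)  # count from the end, like a slice
    return min(end_index, total)          # never run past the last part
```

With this convention, range(start_index, resolve_end_index(end_index, total)) visits the same parts as the slice parts[start_index:end_index].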

upload_request_full_multi_threaded(ckan: CkanApiManage, resources_base_dir: str, threads: int = 1, external_stop_event=None, allow_chunks: bool = True, start_index: int = 0, end_index: int = None, **kwargs)

Multi-threaded implementation of upload_request_full, using ThreadPoolExecutor.

upload_request_graceful(ckan: CkanApiManage, file_chunk: FileChunkDataFrame, *, overall_chunk_index: int, external_stop_event=None, start_index: int = 0, end_index: int = None, **kwargs) None

Calls upload_file with checks specific to multi-threading.

Returns:

ckanapi_harvesters.builder.builder_resource_multi_datastore module

Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements the basic resources. See builder_datastore for specific functions to initiate datastores.

class ckanapi_harvesters.builder.builder_resource_multi_datastore.BuilderMultiDataStore(*, parent, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None)

Bases: BuilderMultiFile, BuilderDataStoreABC

copy(*, dest=None, parent=None)
download_file_query_item(ckan: CkanApiManage, out_dir: str, file_query_item: str, full_download: bool = True) Tuple[str | None, Response | None]

Download the file_query item with its arguments.

download_file_query_item_df(ckan: CkanApiManage, out_dir: str, file_query_item: str, full_download: bool = True) Tuple[str, DataFrame]
download_request_generator_df(ckan: CkanApiManage, out_dir: str, excluded_resource_names: Set[str] = None) Generator[Tuple[str | None, DataFrame | None], Any, None]
get_local_df_chunk_generator(resources_base_dir: str, ckan: CkanApiManage, excluded_files: Set[str] = None, allow_chunks: bool = True, **kwargs) Generator[FileChunkDataFrame, None, None]

Returns an iterator over the data to upload and a position in the current file.

load_sample_df(resources_base_dir: str, *, upload_alter: bool = True, file_index: int = 0, allow_chunks: bool = True, **kwargs) ListRecords | DataFrame

Function returning the data from the indicated resources as a pandas DataFrame. This is the DataFrame equivalent for load_sample_data.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

static resource_mode_str() str
upload_file_chunk(ckan: CkanApiManage, package_id: str, file_chunk: FileChunkDataFrame, *, reupload: bool = False, override_ckan: bool = False, cancel_if_present: bool = True, inhibit_datastore_patch_indexes: bool = False) CkanResourceInfo

Upload a file, using its name as resource name

ckanapi_harvesters.builder.builder_resource_multi_file module

Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements the basic resources. See builder_datastore for specific functions to initiate datastores.

class ckanapi_harvesters.builder.builder_resource_multi_file.BuilderMultiFile(*, parent, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, dir_name: str = None)

Bases: BuilderResourceABC, BuilderMultiABC

Class to manage a set of files to upload as separate resources

copy(*, dest=None, parent=None)
download_file_query_item(ckan: CkanApiManage, out_dir: str, file_query_item: str) Tuple[str | None, Response | None]

Download the file_query item with its arguments.

download_request(ckan: CkanApiManage, out_dir: str, *, full_download: bool = True, threads: int = 1, force: bool = False, excluded_resource_names: Set[str] = None, return_data: bool = False, **kwargs) None

Download the resource and save it to a file under out_dir. In most implementations, this calls the download_resource_bytes method.

Parameters:
  • ckan

  • out_dir

  • full_download – Some resources like URLs are not downloaded by default. Large datasets are treated with a multi-threaded approach.

  • threads

  • force – option to bypass the enable_download attribute of resources

Returns:

download_request_full(ckan: CkanApiManage, out_dir: str, threads: int = 1, external_stop_event=None, start_index: int = 0, end_index: int = None, force: bool = False, excluded_resource_names: Set[str] = None) None
download_request_generator(ckan: CkanApiManage, out_dir: str, excluded_resource_names: Set[str] = None) Generator[Tuple[str | None, Response | None], Any, None]

Generator to apply treatments after each request (single-threaded).

Parameters:
  • ckan

  • out_dir

Returns:

download_resource_bytes(ckan: CkanApiManage, full_download: bool = True, **kwargs) bytes | None

Download the resource and return the data as bytes.

Parameters:
  • ckan

  • out_dir

  • full_download – Some resources like URLs are not downloaded by default. Large datasets are also limited to one request for this function by default.

  • threads

Returns:

get_file_query_count() int

Returns the total number of file_queries.

get_file_query_generator() Generator[str, Any, None]

Returns an iterator on all the file_queries.

get_local_df_chunk_generator(resources_base_dir: str, ckan: CkanApiManage, excluded_files: Set[str] = None, **kwargs) Generator[FileChunkDataFrame, None, None]

Returns an iterator over the data to upload and a position in the current file.

get_local_file_count() int

Get the number of parts of the upload.

get_local_file_generator(resources_base_dir: str, excluded_files: Set[str] = None, **kwargs) Generator[str, None, None]
get_local_file_offset(file_chunk: FileChunkDataFrame) int

Get the position of the current data in the overall upload.

get_local_file_size_units()
get_local_file_total_size() int

Get the overall size of the upload, normally in bytes or line count.

get_or_query_resource_id(ckan: CkanApiManage, cancel_if_present: bool = True, error_not_found: bool = True) None | str

Store/retrieve resource ID in the class attributes.

get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None, file_index: int = 0) str | None

Function returning the local resource file name for the sample file.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

init_download_file_query_list(ckan: CkanApiManage, out_dir: str = None, cancel_if_present: bool = True, excluded_resource_names: Set[str] = None, **kwargs) List[str]

Determine the list of queries to download to reconstruct the uploaded parts. By default, the unique combinations of the first columns of the primary key are used.

init_local_files_list(resources_base_dir: str, cancel_if_present: bool = True, excluded_files: Set[str] = None, **kwargs) List[str]

Behavior to list parts of an upload.

list_local_files(resources_base_dir: str, cancel_if_present: bool = True, excluded_files: Set[str] = None) List[str] | None

List the files that correspond to the multi-file resource configuration and are not used in mono-resources.

Parameters:
  • resources_base_dir

  • cancel_if_present

  • excluded_files – files from mono-resources

Returns:
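The idea of listing candidate files while excluding those already claimed by mono-resources can be sketched with the standard library. The function list_matching_files and its pattern argument are hypothetical; how the multi-file configuration actually selects files is not documented here. The sketch also assumes that excluded_files holds bare file names rather than full paths.

```python
import fnmatch
import os
from typing import List, Optional, Set


def list_matching_files(directory: str, pattern: str = "*.csv",
                        excluded_files: Optional[Set[str]] = None) -> List[str]:
    """Hypothetical sketch: sorted paths of files matching a pattern, minus exclusions."""
    excluded = excluded_files or set()
    names = sorted(
        name for name in os.listdir(directory)
        if fnmatch.fnmatch(name, pattern)
        and name not in excluded
        and os.path.isfile(os.path.join(directory, name))
    )
    return [os.path.join(directory, name) for name in names]
```

Sorting the names keeps the order of the parts stable between runs, which matters when an upload is resumed by index.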

list_remote_resource_ids(ckan: CkanApiManage, *, excluded_resource_names: Set[str] = None, cancel_if_present: bool = True) List[str]
list_remote_resources(ckan: CkanApiManage, *, excluded_resource_names: Set[str] = None, cancel_if_present: bool = True) List[str]

Defines the list of resources to download that correspond to the definition and are not used in mono-resources.

Parameters:
  • ckan

  • excluded_resource_names – resource names of mono-resources

  • cancel_if_present

Returns:

load_sample_data(resources_base_dir: str, file_index: int = 0) bytes | None

Function returning the data from the indicated resources.

Parameters:

resources_base_dir – base directory to find the resources on the local machine

Returns:

patch_request(ckan: CkanApiManage, package_id: str, *, reupload: bool = None, override_ckan: bool = False, resources_base_dir: str = None, payload: bytes | BufferedIOBase = None, inhibit_datastore_patch_indexes: bool = False) None | CkanResourceInfo

Function to perform all the necessary requests to initiate/reupload the resource on the CKAN server.

Parameters:
  • resources_base_dir

  • ckan

  • reupload – option to reupload the resource

Returns:

resource_info_request(ckan: CkanApiManage, error_not_found: bool = True) CkanResourceInfo | None
static resource_mode_str() str
static sample_file_path_is_url() bool
upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, excluded_files: Set[str] = None, **kwargs) None | ContextErrorLevelMessage

Test the presence of the files/urls used in the upload/patch requests.

Parameters:

resources_base_dir

Returns:

None if success, error message otherwise

upload_file_chunk(ckan: CkanApiManage, package_id: str, file_chunk: FileChunkDataFrame, *, reupload: bool = False, override_ckan: bool = False, cancel_if_present: bool = True, inhibit_datastore_patch_indexes: bool = False) CkanResourceInfo

Upload a file, using its name as resource name

upload_request_final(ckan: CkanApiManage, *, force: bool = False) None
upload_request_full(ckan: CkanApiManage, resources_base_dir: str, *, threads: int = 1, external_stop_event=None, start_index: int = 0, end_index: int = None, allow_chunks: bool = True, reupload: bool = False, only_missing: bool = False, from_line_count: bool = False, excluded_files: Set[str] = None, inhibit_datastore_patch_indexes: bool = False) None

Perform all the upload requests.

Parameters:
  • ckan

  • resources_base_dir

  • threads

  • external_stop_event

  • only_missing

  • start_index

  • end_index

Returns:

ckanapi_harvesters.builder.mapper_datastore module

Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements functions to convert formats between database and local files.

class ckanapi_harvesters.builder.mapper_datastore.DataSchemeConversion(*, df_upload_fun: Callable[[ListRecords | DataFrame, Any], ListRecords | DataFrame] = None, df_download_fun: Callable[[ListRecords | DataFrame, Any], ListRecords | DataFrame] = None)

Bases: object

__init__(*, df_upload_fun: Callable[[ListRecords | DataFrame, Any], ListRecords | DataFrame] = None, df_download_fun: Callable[[ListRecords | DataFrame, Any], ListRecords | DataFrame] = None)

Class to convert between local data formats and database formats

Parameters:
  • df_upload_fun

  • df_download_fun

copy()
df_download_alter(df_database: DataFrame | List[dict] | Any, file_query: dict = None, fields: Dict[str, CkanField] = None, mapper_kwargs: dict = None, **kwargs) DataFrame | ListRecords

Apply the user-defined df_download_fun if present. df_download_fun should be the reverse function of df_upload_fun.

Parameters:

df_database – the downloaded dataframe from the database

Returns:

the dataframe ready to save, converted in the local format
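A pair of conversion functions that reverse each other could look like the sketch below. It works on lists of records rather than DataFrames to stay dependency-free. The column names depth_m and depth_cm and the function names upload_fun and download_fun are hypothetical. The extra keyword arguments are accepted and ignored, on the assumption that the caller passes context such as fields or file_query.

```python
from typing import Any, Dict, List

Records = List[Dict[str, Any]]


def upload_fun(records: Records, **kwargs: Any) -> Records:
    """Hypothetical df_upload_fun: local format -> database format."""
    # Assumption: local files store an integer depth in metres ("depth_m"),
    # while the database column "depth_cm" stores centimetres.
    return [{"station": r["station"], "depth_cm": r["depth_m"] * 100}
            for r in records]


def download_fun(records: Records, **kwargs: Any) -> Records:
    """Hypothetical df_download_fun: the reverse of upload_fun."""
    return [{"station": r["station"], "depth_m": r["depth_cm"] // 100}
            for r in records]
```

Integer arithmetic is used so that download_fun(upload_fun(x)) returns x exactly; a floating-point conversion would need a tolerance when comparing round trips.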

df_upload_alter(df_local: DataFrame | List[dict] | Any, *, total_lines_read: int, fields: Dict[str, CkanField], file_query: str, mapper_kwargs: dict = None, **kwargs) DataFrame | ListRecords

Apply the user-defined df_upload_fun if present.

Parameters:
  • df_local – the DataFrame to upload

  • total_lines_read – total number of lines read, including the current DataFrame

  • fields – the known fields metadata.

  • file_query – the name of the file the data originates from (or query)

  • mapper_kwargs – extra arguments passed to df_upload_fun

Returns:

the DataFrame ready for upload, converted in the format of the database

get_necessary_fields() Set[str]

ckanapi_harvesters.builder.mapper_datastore_multi module

Code to define the link between a file and a database query in the context of a large DataStore defined by the concatenation of multiple files.

class ckanapi_harvesters.builder.mapper_datastore_multi.RequestFileMapperABC(*, df_upload_fun: Callable[[DataFrame], Any] = None, df_download_fun: Callable[[DataFrame], Any] = None)

Bases: RequestMapperABC, ABC

Class to define how to reconstruct a file from the full dataset. This abstract class is oriented to treating files in the file system.

get_file_name_of_query(file_query: dict) str
class ckanapi_harvesters.builder.mapper_datastore_multi.RequestFileMapperIndexKeys(group_by_keys: List[str], sort_by_keys: List[str] = None, *, df_upload_fun: Callable[[DataFrame], Any] = None, df_download_fun: Callable[[DataFrame], Any] = None)

Bases: RequestFileMapperABC

In this implementation, a file is defined by a combination of file_keys values. It is optionally ordered by index_keys, which makes it possible to restart a transfer after an interruption. By default, index_keys is the last field of the primary key and file_keys are the fields preceding it in the primary key.
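The default split of a primary key into file keys and an index key can be sketched on plain records. The helpers split_primary_key and group_into_files are hypothetical names used for illustration only; the class itself receives these lists through its group_by_keys and sort_by_keys arguments.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Records = List[Dict[str, Any]]


def split_primary_key(primary_key: List[str]) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: (file keys, index keys) from a primary key."""
    # Default described above: the last field orders rows within a file,
    # the preceding fields identify the file.
    return primary_key[:-1], primary_key[-1:]


def group_into_files(records: Records,
                     primary_key: List[str]) -> Dict[Tuple[Any, ...], Records]:
    """Group records by file-key values and sort each group by the index key."""
    file_keys, index_keys = split_primary_key(primary_key)
    groups: Dict[Tuple[Any, ...], Records] = defaultdict(list)
    for record in records:
        groups[tuple(record[k] for k in file_keys)].append(record)
    for rows in groups.values():
        rows.sort(key=lambda r: tuple(r[k] for k in index_keys))
    return dict(groups)
```

For a primary key such as ["sensor", "t"], each distinct sensor value yields one file, and rows inside that file are ordered by t.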

df_upload_alter(df_local: DataFrame | List[dict] | Any, *, total_lines_read: int, fields: Dict[str, CkanField], file_query: str, mapper_kwargs: dict = None, **kwargs) DataFrame

Apply the user-defined df_upload_fun if present.

Parameters:
  • df_local – the DataFrame to upload

  • total_lines_read – total number of lines read, including the current DataFrame

  • fields – the known fields metadata.

  • file_query – the name of the file the data originates from (or query)

  • mapper_kwargs – extra arguments passed to df_upload_fun

Returns:

the DataFrame ready for upload, converted in the format of the database

download_file_query_list(ckan: CkanApiManage, resource_id: str) List[dict]

Function to list the {key: value} combinations present in the CKAN datastore to reconstruct the file database before downloading.

Parameters:
  • ckan

  • resource_id

Returns:

a list of query arguments defining each file

get_file_name_of_query(file_query: dict) str
get_file_query_of_df(df_upload: DataFrame) dict | None

Return the dict of {field: value} combinations representing the arguments of the query to reconstruct a file

Parameters:

df_upload – the DataFrame representing the file

Returns:

get_necessary_fields() Set[str]
last_inserted_index_request(ckan: CkanApiManage, resource_id: str, file_query: dict, df_upload: DataFrame) Tuple[int, bool, int, DataFrame]

Knowing the data which needs to be uploaded, this function compares the last known row(s) to the dataframe and returns the index to restart the upload process.

Parameters:
  • ckan

  • resource_id

  • file_query – a dict of {field: value} combinations representing the arguments of the query to reconstruct a file

  • df_upload – the known data corresponding to the file_query to be sent

Returns:

a tuple (i_restart, upload_needed, row_count, df_last_row):
  • i_restart – the last known index in the dataframe

  • upload_needed – a boolean indicating if an update is necessary

  • row_count – the number of rows corresponding to the file_query

  • df_last_row – the last found row in the dataframe
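The comparison between the last row known to the database and the local data can be sketched as follows, using plain records instead of a DataFrame. The function find_restart_index is hypothetical and returns only the first two elements of the tuple. It assumes that the local rows are sorted by a single index key, and that -1 signals that no local row is in the database yet; neither convention is confirmed by the documentation.

```python
from typing import Any, Dict, List, Optional, Tuple

Records = List[Dict[str, Any]]


def find_restart_index(local_rows: Records,
                       last_db_row: Optional[Dict[str, Any]],
                       index_key: str) -> Tuple[int, bool]:
    """Hypothetical sketch: (i_restart, upload_needed) for a resumed upload."""
    if last_db_row is None:
        # Nothing was uploaded yet (assumed convention: -1).
        return -1, len(local_rows) > 0
    last_value = last_db_row[index_key]
    i_restart = -1
    for i, row in enumerate(local_rows):
        if row[index_key] <= last_value:
            i_restart = i  # this row is already in the database
        else:
            break          # rows are sorted: everything after is new
    return i_restart, i_restart < len(local_rows) - 1
```

Under these assumptions, the rows still to be sent are local_rows[i_restart + 1:].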

last_inserted_row_request(ckan: CkanApiManage, resource_id: str, file_query: dict) DataFrame | None

Request in CKAN the last inserted row(s) corresponding to a given file_query

Parameters:
  • ckan

  • resource_id

  • file_query – a dict of {field: value} combinations representing the arguments of the query to reconstruct a file

Returns:

The last row(s) in the database or None (if no specific method was defined)

last_rows_limit = 1
class ckanapi_harvesters.builder.mapper_datastore_multi.RequestFileMapperLimit(limit: int = None, *, df_upload_fun: Callable[[DataFrame], Any] = None, df_download_fun: Callable[[DataFrame], Any] = None)

Bases: RequestFileMapperABC

In this implementation, a file is defined by a certain amount of rows
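Splitting a table into files of a fixed number of rows amounts to generating one offset/limit pair per file. The sketch below is only an illustration: limit_file_queries is a hypothetical name, and the use of "offset" and "limit" as keys of the query dict is an assumption about its layout, chosen because the CKAN datastore_search action accepts parameters with those names.

```python
from typing import Dict, List


def limit_file_queries(row_count: int, limit: int = 10000) -> List[Dict[str, int]]:
    """Hypothetical sketch: one {offset, limit} query per file of `limit` rows."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # The last query may return fewer than `limit` rows.
    return [{"offset": offset, "limit": limit}
            for offset in range(0, row_count, limit)]
```

For example, a table of 25000 rows with the default limit yields three queries, starting at offsets 0, 10000 and 20000.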

default_limit = 10000
download_file_query(ckan: CkanApiManage, resource_id: str, file_query: dict, *, progress_callback: CkanProgressCallbackABC) Generator[DataFrame, Any, None]
download_file_query_list(ckan: CkanApiManage, resource_id: str) List[dict]

Function to list the {key: value} combinations present in the CKAN datastore to reconstruct the file database before downloading.

Parameters:
  • ckan

  • resource_id

Returns:

a list of query arguments defining each file

get_file_name_of_query(file_query: dict) str
limit: int
class ckanapi_harvesters.builder.mapper_datastore_multi.RequestFileMapperUser(file_query_list: Iterable[Tuple[str, dict]], *, df_upload_fun: Callable[[DataFrame], Any] = None, df_download_fun: Callable[[DataFrame], Any] = None)

Bases: RequestFileMapperABC

Use this basic implementation if the file query list is provided by the user or if the builder is only used to upload files.
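A user-provided file query list has the type Iterable[Tuple[str, dict]]. Assuming each tuple pairs a file name with a dict of {field: value} filters, the selection of rows per file can be sketched as below. The helpers rows_for_query and split_by_file_queries are hypothetical and operate on plain records rather than on a CKAN DataStore.

```python
from typing import Any, Dict, Iterable, List, Tuple

Records = List[Dict[str, Any]]


def rows_for_query(records: Records, file_query: Dict[str, Any]) -> Records:
    """Keep the records whose fields equal every value in file_query."""
    return [r for r in records
            if all(r.get(field) == value for field, value in file_query.items())]


def split_by_file_queries(records: Records,
                          file_query_list: Iterable[Tuple[str, Dict[str, Any]]]
                          ) -> Dict[str, Records]:
    """Map each (assumed) file name to the records selected by its query."""
    return {file_name: rows_for_query(records, query)
            for file_name, query in file_query_list}
```

An empty query dict selects every record, so a single entry with {} would reproduce the whole table in one file.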

download_file_query_list(ckan: CkanApiManage, resource_id: str) List[dict]

Function to list the {key: value} combinations present in the CKAN datastore to reconstruct the file database before downloading.

Parameters:
  • ckan

  • resource_id

Returns:

a list of query arguments defining each file

class ckanapi_harvesters.builder.mapper_datastore_multi.RequestMapperABC(*, df_upload_fun: Callable[[DataFrame], Any] = None, df_download_fun: Callable[[DataFrame], Any] = None)

Bases: DataSchemeConversion, ABC

Class to define how to reconstruct a file from the full dataset. This class overloads some data scheme conversion class functions. This abstract class can be derived to specify custom data treatments.

download_file_query(ckan: CkanApiManage, resource_id: str, file_query: dict, *, progress_callback: CkanProgressCallbackABC) Generator[DataFrame, Any, None]
download_file_query_generator(ckan: CkanApiManage, resource_id: str) Generator[dict, Any, None]

Generator for download_file_query_list which can be customized

Parameters:
  • ckan

  • resource_id

Returns:

abstractmethod download_file_query_list(ckan: CkanApiManage, resource_id: str) List[dict]

Function to list the {key: value} combinations present in the CKAN datastore to reconstruct the file database before downloading.

Parameters:
  • ckan

  • resource_id

Returns:

a list of query arguments defining each file

get_file_query_of_df(df_upload: DataFrame) dict | None

Return the dict of {field: value} combinations representing the arguments of the query to reconstruct a file

Parameters:

df_upload – the DataFrame representing the file

Returns:

last_inserted_index_request(ckan: CkanApiManage, resource_id: str, file_query: dict, df_upload: DataFrame) Tuple[int, bool, int, DataFrame | None]

Knowing the data which needs to be uploaded, this function compares the last known row(s) to the dataframe and returns the index to restart the upload process.

Parameters:
  • ckan

  • resource_id

  • file_query – a dict of {field: value} combinations representing the arguments of the query to reconstruct a file

  • df_upload – the known data corresponding to the file_query to be sent

Returns:

a tuple (i_restart, upload_needed, row_count, df_last_row):
  • i_restart – the last known index in the dataframe

  • upload_needed – a boolean indicating if an update is necessary

  • row_count – the number of rows corresponding to the file_query

  • df_last_row – the last found row in the dataframe

last_inserted_row_request(ckan: CkanApiManage, resource_id: str, file_query: dict) DataFrame | None

Request in CKAN the last inserted row(s) corresponding to a given file_query

Parameters:
  • ckan

  • resource_id

  • file_query – a dict of {field: value} combinations representing the arguments of the query to reconstruct a file

Returns:

The last row(s) in the database or None (if no specific method was defined)

upsert_only_missing_rows: bool
ckanapi_harvesters.builder.mapper_datastore_multi.default_file_mapper_from_primary_key(primary_key: List[str] = None, file_query_list: Iterable[Tuple[str, dict]] = None) RequestFileMapperABC

ckanapi_harvesters.builder.mapper_datastore_prototypes module

Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements functions to convert formats between database and local files.

ckanapi_harvesters.builder.mapper_datastore_prototypes.download_function_example(df_download: DataFrame, *, fields: Dict[str, CkanField] = None, file_query: str = None, **kwargs) DataFrame | List[dict]
ckanapi_harvesters.builder.mapper_datastore_prototypes.replace_empty_str(df_local: DataFrame | List[dict], *, fields: Dict[str, CkanField] = None, file_query: str = None, total_lines_read: int = None, **kwargs) DataFrame | List[dict]
ckanapi_harvesters.builder.mapper_datastore_prototypes.simple_upload_fun(df_local: DataFrame | List[dict], *, fields: Dict[str, CkanField] = None, file_query: str = None, total_lines_read: int = None, **kwargs) DataFrame | List[dict]
ckanapi_harvesters.builder.mapper_datastore_prototypes.upload_function_example(df_local: DataFrame | List[dict], *, fields: Dict[str, CkanField] = None, file_query: str = None, total_lines_read: int = None, **kwargs) DataFrame | List[dict]

ckanapi_harvesters.builder.specific_builder_abc module

Abstract class to implement specific builders from code

class ckanapi_harvesters.builder.specific_builder_abc.SpecificBuilderABC(ckan: CkanApiManage, package_name: str, organization_name: str, *, title: str = None, description: str = None, private: bool = None, state: CkanState = None, version: str = None, url: str = None, tags: List[str] = None, license_name: str = None)

Bases: BuilderPackageWithHarvesters, ABC

Module contents

Section of the package dedicated to the initialization of a CKAN package