ckanapi_harvesters.builder package
Subpackages
- ckanapi_harvesters.builder.example package
- Submodules
- ckanapi_harvesters.builder.example.builder_example module
- ckanapi_harvesters.builder.example.builder_example_aux_fun module
- ckanapi_harvesters.builder.example.builder_example_download module
- ckanapi_harvesters.builder.example.builder_example_generate_data module
- ckanapi_harvesters.builder.example.builder_example_patch_upload module
- ckanapi_harvesters.builder.example.builder_example_policy module
- ckanapi_harvesters.builder.example.builder_example_sample_dataset module
- ckanapi_harvesters.builder.example.builder_example_test_sql module
- ckanapi_harvesters.builder.example.builder_example_tests module
- ckanapi_harvesters.builder.example.builder_example_tests_dev module
- ckanapi_harvesters.builder.example.builder_example_tests_offline module
- Module contents
- ckanapi_harvesters.builder.specific package
Submodules
ckanapi_harvesters.builder.builder_aux module
Auxiliary functions
- ckanapi_harvesters.builder.builder_aux.positive_end_index(end_index: int | None, total: int) int
Return the stop index for a loop, following the Python slice convention (the last index processed is end_index - 1). If end_index is negative, it is counted from the end of the sequence; end_index = -1 stops just before the last element.
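The slice convention described here can be illustrated with Python's built-in slice.indices(). This is a minimal sketch and not the library's implementation: the helper name stop_index_sketch is hypothetical, and treating end_index=None as "run to the end" is an assumption.

```python
from typing import Optional


def stop_index_sketch(end_index: Optional[int], total: int) -> int:
    """Hypothetical helper: resolve a slice-style end index to a non-negative stop index."""
    # slice.indices() applies Python's standard slice rules: negative values
    # count from the end, and out-of-range values are clamped to [0, total].
    _, stop, _ = slice(None, end_index).indices(total)
    return stop


print(stop_index_sketch(-1, 10))    # 9: stop just before the last element
print(stop_index_sketch(4, 10))     # 4: the last index processed is 3
print(stop_index_sketch(None, 10))  # 10: assumption, None means "no limit"
```

The returned value can be passed directly to range(stop) to drive a loop.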
ckanapi_harvesters.builder.builder_ckan module
Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements the CKAN connection definition.
- class ckanapi_harvesters.builder.builder_ckan.BuilderCkan(url: str = None, apikey_file: str = None, proxy: ProxyConfig = None)
Bases:
object
- _get_builder_df(base_dir: str) DataFrame
Converts the result of method _to_dict() into a DataFrame
- Returns:
- _load_from_df(ckan_df: DataFrame, base_dir: str, proxies: dict, error_not_found: bool = True) None
Function to load builder parameters from a DataFrame, usually from an Excel worksheet
- Parameters:
ckan_df
- Returns:
- _to_dict(base_dir: str) dict
Function to export builder parameters to an Excel worksheet, using the same fields as the input format
- See:
_load_from_df
- See:
to_xls
- Returns:
- copy() BuilderCkan
- from_ckan(ckan: CkanApiManage) None
Initialize fields from a CKAN instance.
- init_ckan(base_dir: str, ckan: CkanApiManage = None, default_proxies: dict = None, proxies: str | dict | ProxyConfig = None) CkanApiManage
Initialize a CKAN instance, following the parameters of the Excel workbook. The parameters from Excel have precedence on the values already contained in the CKAN object. However, the Excel workbook might not contain sufficient information.
- Parameters:
base_dir
ckan
default_proxies
proxies
- Returns:
- property policy: CkanPackageDataFormatPolicy
- property policy_file: str
- property proxies: dict
- property proxy_string: str
- set_policy_file(policy_file: str, *, ckan: CkanApiManage = None, base_dir: str = None, proxies: dict = None, error_not_found: bool = True, load_error: bool = True) None
ckanapi_harvesters.builder.builder_errors module
Data model to represent a CKAN database architecture
- exception ckanapi_harvesters.builder.builder_errors.EmptyPackageNameException
Bases:
RuntimeError
- exception ckanapi_harvesters.builder.builder_errors.GroupByError
Bases:
Exception
- exception ckanapi_harvesters.builder.builder_errors.MissingDataStoreColumnsSheet(resource_name: str, columns_sheet_name: str)
Bases:
Exception
- exception ckanapi_harvesters.builder.builder_errors.MissingDataStoreInfoError
Bases:
Exception
- exception ckanapi_harvesters.builder.builder_errors.RequiredDataFrameFieldsError(missing_fields: Iterable[str])
Bases:
Exception
- exception ckanapi_harvesters.builder.builder_errors.ResourceFileNotExistMessage(resource_name: str, error_level: ErrorLevel, specific_message: str)
Bases:
ContextErrorLevelMessage
- exception ckanapi_harvesters.builder.builder_errors.UnsupportedBuilderVersionError(file_version)
Bases:
Exception
ckanapi_harvesters.builder.builder_field module
Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements the field definition.
- class ckanapi_harvesters.builder.builder_field.BuilderField(*, name: str = None, type_override: CkanFieldType = None, description: str = None, label: str = None)
Bases:
object
- copy(*, dest=None)
- static from_df_row(row: Series) BuilderField
- update_missing(other: BuilderField) None
ckanapi_harvesters.builder.builder_package module
Alias to most complete BuilderPackage implementation
ckanapi_harvesters.builder.builder_package_1_basic module
Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements the package definition.
- class ckanapi_harvesters.builder.builder_package_1_basic.BuilderPackageBasic(package_name: str = None, *, package_id: str = None, title: str = None, description: str = None, private: bool = None, state: CkanState = None, version: str = None, url: str = None, tags: List[str] = None, organization_name: str = None, license_name: str = None, src=None)
Bases:
object
Class to store an image of a CKAN package defined by an Excel worksheet
__NB__: There are several paths to distinguish:
the path of the Excel worksheet
base_dir: the base directory for relative paths
resources_base_dir: the base directory for resources (for upload), which is generally defined relative to base_dir
out_dir: the output directory, for download, absolute or relative to the cwd (current working directory)
__NB__: A builder can refer to the following external files:
CKAN API key file (.txt)
Proxy authentication file (.txt)
CKAN CA certificate file (.pem)
CA certificate for external connections (.pem)
Data format policy file (.json)
External Python module (.py) containing DataFrame modification functions for upload/download of a DataStore
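The relationship between the directories listed above can be sketched with pathlib. This is only an illustration under stated assumptions: the helper resolve_paths_sketch and the example file names are hypothetical, and base_dir is assumed to be the folder containing the workbook.

```python
from pathlib import Path


def resolve_paths_sketch(workbook_path: str, resources_base_dir: str, out_dir: str) -> dict:
    """Hypothetical helper illustrating how the directories relate to each other."""
    # base_dir: assumed here to be the folder containing the Excel workbook.
    base_dir = Path(workbook_path).resolve().parent
    # resources_base_dir: joined to base_dir unless it is already absolute
    # (the "/" operator keeps an absolute right-hand side as-is).
    resources_dir = base_dir / resources_base_dir
    # out_dir: absolute, or relative to the current working directory.
    download_dir = Path.cwd() / out_dir
    return {
        "base_dir": base_dir,
        "resources_base_dir": resources_dir,
        "out_dir": download_dir,
    }


paths = resolve_paths_sketch("project/package.xlsx", "data", "downloads")
for name, value in paths.items():
    print(f"{name}: {value}")
```

Upload paths follow the workbook, while download paths follow the working directory, so moving the workbook changes where resources are looked up but not where downloads are written.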
- _apply_out_dir_src(base_dir: str, not_exist_error: bool = False)
The default download directory is specified in a field of the Excel workbook. This function resolves the directory name, based on the location of the Excel file or the base_dir, if provided.
- Parameters:
base_dir
- Returns:
- _apply_resources_base_dir_src(base_dir: str)
The resources base directory is specified in a field of the Excel workbook. This function resolves the directory name, based on the location of the Excel file or the base_dir, if provided.
- Parameters:
base_dir
- Returns:
- _get_builder_df(base_dir: str = None, include_id: bool = True) Tuple[DataFrame, DataFrame]
Converts the result of method _to_dict() into a DataFrame
- Returns:
- _get_datastores_df() Dict[str, DataFrame]
Calls the method _get_fields_df() on all resources which are DataStores and returns a DataFrame per DataStore listing the fields of the DataStore with their metadata
- Returns:
- _get_datastores_dict() Dict[str, dict]
Calls the method _get_fields_dict() on all resources which are DataStores and returns a dictionary per DataStore listing the fields of the DataStore with their metadata
- Returns:
- _get_mono_resource_names()
List resource names of mono-resource builders.
- Returns:
- _get_mono_resource_used_files(resources_base_dir: str, ckan: CkanApiManage)
List files used by mono-resource builders
- Parameters:
resources_base_dir
- Returns:
- _get_resources_df(include_id: bool = True) DataFrame
Calls the method _to_dict() on all resources and returns the DataFrame listing the resources of the package
- Returns:
- _load_from_df(info_df: DataFrame, package_df: DataFrame, base_dir: str = None) None
Function to load builder parameters from a DataFrame, usually from an Excel worksheet
- Parameters:
package_df
- Returns:
- _to_dict(base_dir: str = None, include_id: bool = True) Tuple[dict, dict]
Function to export builder parameters to an Excel worksheet, using the same fields as the input format
- See:
_load_from_df
- See:
to_xls
- Returns:
- clear_ids()
Clear all known ids from package and resource builders.
- clear_secrets_and_disconnect() None
- copy(dest=None) BuilderPackageBasic
- property default_out_dir: str
- default_sample_title_suffix: str = ' - Sample'
- default_sample_url_suffix: str = '-sample'
- download_request_full(ckan: CkanApiManage, out_dir: str = None, enforce_none_out_dir: bool = False, resource_name: str = None, full_download: bool = False, threads: int = None, skip_existing: bool = True, progress_callback: Callable = None, force: bool = False, rm_dir: bool = False) None
Downloads the full package resources into out_dir.
- Parameters:
ckan
out_dir – download directory
rm_dir – remove directory if exists before downloading
skip_existing – skip download of existing resources
enforce_none_out_dir – applies when no out_dir is provided. If True, files are not saved after download; if False, the default output directory is used, if defined.
resource_name
full_download – option to fully download the resources. If False, only a partial download is made.
threads
progress_callback
force – option to bypass the enable_download attribute of resources
- Returns:
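The interaction between out_dir and enforce_none_out_dir can be sketched as a small decision function. This is a reading of the parameter descriptions above, not the library's code; choose_out_dir_sketch and its default_out_dir argument are hypothetical names.

```python
from typing import Optional


def choose_out_dir_sketch(out_dir: Optional[str],
                          default_out_dir: Optional[str],
                          enforce_none_out_dir: bool = False) -> Optional[str]:
    """Hypothetical helper: pick where downloads are saved (None means 'do not save')."""
    if out_dir is not None:
        return out_dir  # an explicit out_dir is used as given
    if enforce_none_out_dir:
        return None  # no out_dir and the caller insists: nothing is written to disk
    return default_out_dir  # fall back to the default, which may itself be undefined


print(choose_out_dir_sketch("exports", "downloads"))
print(choose_out_dir_sketch(None, "downloads"))
print(choose_out_dir_sketch(None, "downloads", enforce_none_out_dir=True))
```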
- download_resource(ckan: CkanApiManage, resource_name: str, full_download: bool = False, **kwargs) bytes
Proxy for download_sample for a resource
- download_resource_df(ckan: CkanApiManage, resource_name: str, search_all: bool = False, **kwargs) DataFrame
Proxy for download_sample_df for a DataStore
- download_sample(ckan: CkanApiManage, resource_name: str = None, *, datastores_as_df: bool = True, download_url_resources: bool = False, include_files: bool = True, empty_files: bool = False, search_all: bool = False, **kwargs) Dict[str, bytes | DataFrame]
Download samples for all resources. Resources which are not DataStores are downloaded entirely as bytes.
- Parameters:
ckan
resource_name – option to restrict to a single resource
datastores_as_df – Download DataStores as DataFrames (do not convert to bytes)
download_url_resources – Option to download resources aiming for an external URL.
include_files – Option to include resources which are files as bytes.
empty_files – Option to force file contents to an empty file.
search_all – Option to search all resources before downloading (only applies to DataStores).
kwargs – applies to download_sample_df
- Returns:
a dictionary with a sample for each resource
- download_sample_df(ckan: CkanApiManage, resource_name: str = None, *, search_all: bool = False, **kwargs) Dict[str, DataFrame]
Download a sample DataFrame for the DataStore type resources.
- Parameters:
ckan
resource_name
- Returns:
- static from_ckan(ckan: CkanApiMap, package_info: CkanPackageInfo | str, *, base_dir: str = None, error_duplicates: bool = True) BuilderPackageBasic
Function to initialize a BuilderPackageBasic from information requested by the CKAN API
- Parameters:
ckan
package_info – The package to import or the package name
- Returns:
- static from_dict(d: dict, base_dir: str = None, *, proxies: dict = None) BuilderPackageBasic
Load package definition from a dictionary. In this case, the base directory used to specify the resources locations must be given manually. This is usually the directory of the file where the dictionary comes from.
- Parameters:
d
base_dir
proxies
- Returns:
- static from_excel(path_or_stream, *, proxies: dict = None, engine: str = None, **kwargs) BuilderPackageBasic
Load package definition from an Excel workbook.
- Parameters:
path_or_stream – path to the Excel workbook
engine – Engine used by pandas.read_excel(). Supported engines: xlrd, openpyxl, odf, pyxlsb, calamine.
openpyxl is part of this package’s optional requirements.
- static from_json(json_file, *, proxies: dict = None) BuilderPackageBasic
- static from_jsons(stream: str, *, source_file: str = None, proxies: dict = None) BuilderPackageBasic
- get_all_df(base_dir: str = None, include_id: bool = True) Dict[str, DataFrame]
Returns all the dataframes used to define the object and components
- Returns:
- get_base_dir(base_dir: str = None) str
Returns the default base_dir if not specified. The base_dir is the location of the Excel workbook. If this was initialized from a dictionary, the current working directory will be used (cwd).
- Returns:
- get_default_out_dir(out_dir: str, enforce_none: bool = False) str
This returns the default download directory.
- Parameters:
out_dir
- Returns:
- get_license_id(ckan: CkanApiMap) str
Returns the license for the package. The license can be specified by its title or id
- Parameters:
ckan
- Returns:
- get_license_info(ckan: CkanApiMap) CkanLicenseInfo
- get_license_name(ckan: CkanApiMap) str
- get_or_query_package_id(ckan: CkanApiManage) str
- get_or_query_resource_id(ckan: CkanApiManage, resource_name: str, error_not_found: bool = True) str
- get_owner_org(ckan: CkanApiMap) str
Returns the owner organization for the package. The owner organization can be specified by its name, title or id
- Parameters:
ckan
- Returns:
- get_package_page_url(ckan: CkanApiManage, *, error_not_found: bool = True, default_url: bool = False) str
- get_resources_base_dir(resources_base_dir: str) str
This returns the base directory for the resource files. It is distinct from the base_dir and can be defined relative to the base_dir in the Excel workbook (see comment at the top of the class).
- Parameters:
resources_base_dir
- Returns:
- info_request_full(ckan: CkanApiManage) Tuple[CkanPackageInfo, List[CkanResourceInfo]]
- info_request_package(ckan: CkanApiManage) CkanPackageInfo
- init_ckan(ckan: CkanApiManage = None, *, base_dir: str = None, set_owner_org: bool = False, default_proxies: dict = None, proxies: str | dict | ProxyConfig = None) CkanApiManage
Initialize the CKAN instance from the parameters defined in the “ckan” tab of the Excel workbook.
- Parameters:
ckan
base_dir
default_proxies
set_owner_org – Option to set the owner_org of the CKAN instance.
This can be problematic because it requires requests to be made while the proxies are not yet set. It can be omitted because it has no influence on the patch_request_package function.
- init_resources_options_and_metadata(ckan: CkanApiManage, *, base_dir: str = None) None
Update ckan options in resource_builders. Call before any operation on resources.
- list_resource_ids(ckan: CkanApiManage) List[str]
List resource ids on CKAN server, following the order of the package builder
- Parameters:
ckan
- Returns:
- local_policy_check(policy: CkanPackageDataFormatPolicy = None, *, buffer: Dict[str, List[DataPolicyError]] = None, raise_error: bool = False, verbose: bool = True) bool
Check if the package builder respects a data format policy (only on local definition).
- Returns:
- map_resources(ckan: CkanApiMap, *, error_not_found: bool = True, cancel_if_exists: bool = True, datastore_info: bool = True) CkanPackageInfo | None
Proxy call to ckan.map_resources; returns package information from CKAN.
- Parameters:
ckan
error_not_found
cancel_if_exists
- Returns:
- package_delete_resources(ckan: CkanApiManage, *, bypass_admin: bool = False)
- property package_name: str
- package_resource_reorder(ckan: CkanApiManage) None
Apply the order of the resources defined in the Excel workbook.
- Parameters:
ckan
- Returns:
- patch_request_final(ckan: CkanApiManage)
- patch_request_full(ckan: CkanApiManage, *, reupload: bool = False, override_ckan: bool = False, resources_base_dir: str = None, create_default_view: bool = True, clear_all_resources: bool = False, progress_callback: CkanProgressCallbackABC | Callable = None, sample_df_dict: Dict[str, bytes | DataFrame] = None, inhibit_datastore_patch_indexes: bool = False) Tuple[CkanPackageInfo, Dict[str, CkanResourceInfo]]
Perform necessary requests to initiate/reupload the package and resources metadata on the CKAN server. For folder resources, this only uploads the first file of the resource.
- Parameters:
ckan
reupload – Reupload files, even if present on CKAN server. For DataStores, this resets the DataStores to an initial state.
override_ckan – Option to ignore metadata from CKAN server. Only metadata from Excel or data sources will be applied.
resources_base_dir – Override for resources directory. Location specified in Excel sheet is used by default.
progress_callback – Specific progress bar
create_default_view – Option to create default view for each resource.
clear_all_resources – Option to clear all resources in package before uploading.
sample_df_dict – default DataFrames/bytes for each resource
inhibit_datastore_patch_indexes – option to ignore primary_key and indexes for DataStores if they already exist. In certain cases, running without this option can lead to impossible updates (recomputing indexes on large tables can be costly).
- Returns:
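The call order implied by the descriptions of patch_request_full and upload_large_datasets can be sketched as follows. The function upload_package_sketch and its builder_cls argument are hypothetical; only the method names and keyword arguments are taken from the signatures documented on this page. Passing the builder class as an argument keeps the sketch independent of the import path, which is not confirmed here.

```python
def upload_package_sketch(builder_cls, workbook_path: str, *, reupload: bool = False):
    """Hypothetical driver: create or update a package, then send the remaining data."""
    # Load the package definition from the Excel workbook.
    builder = builder_cls.from_excel(workbook_path)
    # Build the CKAN connection from the "ckan" tab of the workbook.
    ckan = builder.init_ckan()
    # Create/update package and resource metadata; for large resources this
    # only sends a first part (first file or first chunk of a DataStore).
    package_info, resources_info = builder.patch_request_full(ckan, reupload=reupload)
    # Send the remaining rows/files of large resources.
    builder.upload_large_datasets(ckan)
    return package_info, resources_info
```

In practice builder_cls would be one of the package builder classes described on this page, and a call to upload_file_checks before patching is a natural addition.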
- patch_request_package(ckan: CkanApiManage) CkanPackageInfo
Function to perform all the necessary requests to initiate/reupload the package on the CKAN server. This function does not upload the package resources.
Note
The organization must be provided, especially if the package is private
- Parameters:
ckan
- Returns:
- remote_policy_check(ckan: CkanApiManage, policy: CkanPackageDataFormatPolicy = None, *, buffer: Dict[str, List[DataPolicyError]] = None, raise_error: bool = False, verbose: bool = None) bool
Check the package defined by this builder against a data format policy, based on the information from the API.
- Parameters:
ckan
policy
buffer
raise_error
verbose
- Returns:
- property resources_base_dir: str
- set_default_out_dir(value: str, base_dir: str = None)
- set_resources_base_dir(value: str, base_dir: str = None)
- static setup_auto_draft_state(mode_auto: bool = None, *, draft_state_by_default: bool = None) None
By default, packages are created in Draft state. This function disables this feature. Call before instantiating any package builder (BuilderPackage).
- Parameters:
mode_auto – set to True/False to setup at the same time the package state during upload and the default package state, applied at the end of the upload, if not specified by the user in the Excel workbook.
draft_state_by_default – specific setting for the default package state (applied at the end of the upload).
- setup_sample_package(ckan: CkanApiManage, package_name: str = None, *, sample_url_suffix: str = None, sample_title_suffix: str = None, sample_df_dict: Dict[str, bytes | DataFrame] = None, return_sample: bool = False, **kwargs) BuilderPackageBasic | Tuple[BuilderPackageBasic, Dict[str, bytes | DataFrame]]
Returns a package builder configured to represent a sample of the current package builder. Limitation: the current package builder must be created from CKAN.
- Parameters:
ckan
package_name – If specified, derives the package metadata from the specified package name. By default, the current package builder will be used.
sample_url_suffix – Suffix to add to the package_name (default is “-sample”)
sample_title_suffix – Suffix to add to the package title (default is “ - Sample”)
sample_df_dict – Option to transmit the data of each resource to the output of function.
return_sample – Option to return the data of each resource.
kwargs – Optional arguments to pass to the download_sample function.
- Returns:
a package builder configured to represent a sample of the current package builder. Optionally, the dictionary of resources to transmit
- to_ckan_package_info(*, check_id: bool = True) CkanPackageInfo
Function to insert the information coming from the builder into the CKAN map. Requires the IDs of the package and resources to be known. This makes it possible to use the stored IDs instead of querying the CKAN API for these IDs.
- Returns:
- to_dict(base_dir: str = None, include_id: bool = True, separate_field_builders: bool = False) dict
Call this function to export the builder parameters to a dictionary
- Returns:
- to_excel(path_or_buffer, *, engine: str = None, include_id: bool = True, include_help: bool = True, **kwargs) None
Call this function to export the builder parameters to an Excel worksheet
- Parameters:
path_or_buffer
engine
- Returns:
- static unlock_external_code_execution(value: bool = True)
This function enables external code execution for the PythonUserCode class. It is necessary to load builders which specify an Auxiliary functions file.
__Warning__: only run code if you trust the source!
- Returns:
- static unlock_external_url_resource_download(value: bool = True)
This function enables the download of resources external from the CKAN server.
- static unlock_no_ca(value: bool = True)
This function enables you to disable the CA verification of the CKAN server.
__Warning__: Only allow in a local environment!
- update_ckan_map(ckan: CkanApiMap, *, warn_msg: bool = True) CkanPackageInfo
This function updates the CKAN map from the information contained in this builder. For this to work, the package and resource ids must be known. This is not the case if the package was not initialized. Use it if the builder was initialized from CKAN; otherwise, use it with caution.
Warning
This function bypasses the ids which should normally be obtained through the API. Use at your own risk.
- Parameters:
ckan
- Returns:
- update_from_ckan(ckan: CkanApiMap, *, error_not_found: bool = True) None
Update IDs from CKAN mapped objects. Objects must be mapped first.
- update_package_name_in_resources()
Update package_name attribute in resource_builders. Call before any operation on resources. This function is marked as deprecated. It double-checks the reciprocal link between the package and its resources.
- upload_file_checks(resource_name: str | List[str] = None, *, resources_base_dir: str = None, messages: Dict[str, ContextErrorLevelMessage] = None, verbose: bool = True, raise_error: bool = False, ckan: CkanApiManage = None, **kwargs) bool
Method to check the presence of all needed files before uploading or patching resources.
- Parameters:
resources_base_dir
ckan – Optional CkanApi object used to parameterize the requests to test the presence of resources defined by an url.
kwargs – keyword arguments to specify connection parameters for querying the urls.
- Returns:
- upload_large_datasets(ckan: CkanApiManage, *, resources_base_dir: str = None, threads: int = None, progress_callback: CkanProgressCallbackABC | Callable = None, only_missing: bool = False, from_line_count: bool = False, allow_chunks: bool = True, inhibit_datastore_patch_indexes: bool = False) None
Method to upload large datasets of the package. Call this method after patch_request_full has been called at least once to initiate the resources: that call uploads the first part of each DataStore, and this method sends the remaining lines. If a primary key was defined, the lines are upserted, so the method can be called multiple times, even if the transfer was interrupted. Otherwise, the lines are inserted; if the resource is not reset with the option reupload=True, a second call to upload_large_datasets could lead to duplicate lines.
- See:
patch_request_full
- Parameters:
ckan
resources_base_dir
threads
progress_callback
only_missing – upsert only missing rows for DataStores and only missing files for MultiFile
from_line_count – count the lines on the CKAN DataStore and ignore the first n lines of your data source
allow_chunks – read DataStore files by chunks, when available
inhibit_datastore_patch_indexes – option to ignore primary_key and indexes for DataStores if they already exist. In certain cases, running without this option can lead to impossible updates (recomputing indexes on large tables can be costly).
- Returns:
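The difference between upserting with a primary key and plain insertion, as described for upload_large_datasets, can be illustrated with an in-memory stand-in for a DataStore. This sketch makes no CKAN requests; send_rows_sketch and the sample rows are hypothetical.

```python
from typing import Dict, List, Optional


def send_rows_sketch(table: List[Dict], rows: List[Dict],
                     primary_key: Optional[str] = None) -> None:
    """Hypothetical in-memory stand-in for writing rows to a DataStore."""
    if primary_key is None:
        table.extend(rows)  # plain insert: every call appends the rows again
        return
    # Map each existing key value to its position in the table.
    index = {row[primary_key]: pos for pos, row in enumerate(table)}
    for row in rows:
        key = row[primary_key]
        if key in index:
            table[index[key]] = row  # update the existing row
        else:
            index[key] = len(table)
            table.append(row)  # insert a new row


rows = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]

with_key: List[Dict] = []
send_rows_sketch(with_key, rows, primary_key="id")
send_rows_sketch(with_key, rows, primary_key="id")  # repeated call, e.g. after an interruption
print(len(with_key))  # 2

without_key: List[Dict] = []
send_rows_sketch(without_key, rows)
send_rows_sketch(without_key, rows)  # repeated call without a primary key
print(len(without_key))  # 4
```

With a primary key the second call leaves the table unchanged, which is why a repeated upload is safe; without one, the same rows are stored twice.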
- ckanapi_harvesters.builder.builder_package_1_basic.excel_name_of_builder(resource_builder: BuilderResourceABC) str
- ckanapi_harvesters.builder.builder_package_1_basic.excel_name_of_sheet(resource_name: str) str
- ckanapi_harvesters.builder.builder_package_1_basic.load_help_page_df(*, engine: str = None) DataFrame
ckanapi_harvesters.builder.builder_package_2_harvesters module
Code to initiate a package builder from a Dataset harvester
- class ckanapi_harvesters.builder.builder_package_2_harvesters.BuilderPackageWithHarvesters(package_name: str = None, *, package_id: str = None, title: str = None, description: str = None, private: bool = None, state: CkanState = None, version: str = None, url: str = None, tags: List[str] = None, organization_name: str = None, license_name: str = None, src=None)
Bases:
BuilderPackageBasic
- copy(dest=None) BuilderPackageWithHarvesters
- static init_from_harvester(dataset_harvester: DatasetHarvesterABC) BuilderPackageWithHarvesters
ckanapi_harvesters.builder.builder_package_3_multi_threaded module
Code to upload metadata to the CKAN server, with one thread per resource
- class ckanapi_harvesters.builder.builder_package_3_multi_threaded.BuilderPackageMultiThreaded(package_name: str = None, *, package_id: str = None, title: str = None, description: str = None, private: bool = None, state: CkanState = None, version: str = None, url: str = None, tags: List[str] = None, organization_name: str = None, license_name: str = None)
Bases:
BuilderPackageWithHarvesters, BuilderMultiABC
- copy(dest=None) BuilderPackageWithHarvesters
ckanapi_harvesters.builder.builder_resource module
Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements the basic resources. See builder_datastore for specific functions to initiate datastores.
- class ckanapi_harvesters.builder.builder_resource.BuilderFileABC(*, parent: BuilderPackageWithHarvesters, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, file_name: str = None)
Bases:
BuilderResourceABC, ABC
Abstract class defining the behavior for a resource represented by a file (not a DataStore)
- copy(*, dest=None, parent=None)
- download_request(ckan: CkanApiManage, out_dir: str, *, full_download: bool = True, threads: int = 1, force: bool = False, return_data: bool = False, **kwargs) None
Download the resource and save it to a file in the directory given by out_dir. In most implementations, this calls the download_resource_bytes method.
- Parameters:
ckan
out_dir
full_download – Some resources like URLs are not downloaded by default. Large datasets are treated with a multi-threaded approach.
threads
force – option to bypass the enable_download attribute of resources
- Returns:
- download_resource_bytes(ckan: CkanApiManage, full_download: bool = True, search_all: bool = True, **kwargs) bytes | None
Download the resource and return the data as bytes.
- Parameters:
ckan
out_dir
full_download – Some resources like URLs are not downloaded by default. Large datasets are also limited to one request for this function by default.
threads
- Returns:
- patch_request(ckan: CkanApiManage, package_id: str, *, reupload: bool = None, override_ckan: bool = False, resources_base_dir: str = None, payload: bytes | BufferedIOBase = None, inhibit_datastore_patch_indexes: bool = False) CkanResourceInfo
Perform a patch of the resource on the CKAN server. A patch is a full update of the metadata of the resource, and of the DataStore if appropriate. The source file of the resource is also uploaded (or a first file for large DataStores).
- Parameters:
ckan
package_id
reupload
resources_base_dir
payload
- Returns:
- class ckanapi_harvesters.builder.builder_resource.BuilderFileBinary(*, parent: BuilderPackageWithHarvesters, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, file_name: str = None)
Bases:
BuilderFileABC
Concrete implementation for a binary file.
- copy(*, dest=None, parent=None)
- get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None) str
Function returning the local resource file name for the sample file.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- load_sample_data(resources_base_dir: str) bytes
Function returning the data from the indicated resources.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- static resource_mode_str() str
- upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) None | ContextErrorLevelMessage
Test the presence of the files/urls used in the upload/patch requests.
- Parameters:
resources_base_dir
- Returns:
None if success, error message otherwise
- class ckanapi_harvesters.builder.builder_resource.BuilderResourceABC(*, parent: BuilderPackageWithHarvesters, name: str = None, format: str = None, description: str = None, state: CkanState = None, enable_download: bool = True, resource_id: str = None, download_url: str = None, options_string: str = None)
Bases:
ABC
- _merge_resource_attributes(*, override_ckan: bool = False)
Merge resource attributes into self.resource_attributes in the following priority order:
1. Existing metadata from the CKAN server. This can be ignored using the override_ckan argument.
2. Metadata provided by the user in the Excel worksheet.
3. Metadata found automatically from the data source (e.g. in a file header or database).
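The priority order above can be sketched as a merge of three dictionaries. This is an illustration only: merge_attributes_sketch and the sample keys are hypothetical, and the sketch assumes that item 1 has the highest priority and that a value of None means "not provided".

```python
from typing import Dict, Optional


def merge_attributes_sketch(ckan_meta: Dict[str, Optional[str]],
                            excel_meta: Dict[str, Optional[str]],
                            source_meta: Dict[str, Optional[str]],
                            *, override_ckan: bool = False) -> Dict[str, str]:
    """Hypothetical merge: for each key, the first source with a non-None value wins."""
    # Sources are listed from highest to lowest priority (assumption).
    sources = [excel_meta, source_meta] if override_ckan else [ckan_meta, excel_meta, source_meta]
    merged: Dict[str, str] = {}
    for source in sources:
        for key, value in source.items():
            if value is not None and key not in merged:
                merged[key] = value
    return merged


ckan_meta = {"description": "from CKAN"}
excel_meta = {"description": "from Excel", "format": "CSV"}
source_meta = {"format": "TSV", "name": "data"}

print(merge_attributes_sketch(ckan_meta, excel_meta, source_meta))
print(merge_attributes_sketch(ckan_meta, excel_meta, source_meta, override_ckan=True))
```

With override_ckan=True the CKAN values are skipped entirely, so the description falls back to the Excel value.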
- _to_ckan_resource_info(package_id: str, check_id: bool = True) CkanResourceInfo
Return resource info object from the information of the Excel workbook. No requests are made but to use this data in the ckan object, the ID and name of the resource are mandatory.
- Parameters:
package_id
check_id
- Returns:
- _update_metadata(ckan: CkanApiManage, *, base_dir: str = None) None
Function to initialize metadata from the data source. The attribute self.known_resource_info must be queried before this call. Examples for a DataStore:
List of fields
Detect field types from example DataFrame
Add descriptions from data source
- clear_secrets_and_disconnect() None
- abstractmethod copy(*, dest: BuilderResourceABC = None, parent: BuilderPackageWithHarvesters = None)
- delete_request(ckan: CkanApiManage, *, error_not_found: bool = False)
Delete the resource from the CKAN server.
- Returns:
- abstractmethod download_request(ckan: CkanApiManage, out_dir: str, *, full_download: bool = True, force: bool = False, threads: int = 1, return_data: bool = False) Any
Download the resource and save it to a file in the directory given by out_dir. In most implementations, this calls the download_resource_bytes method.
- Parameters:
ckan
out_dir
full_download – Some resources like URLs are not downloaded by default. Large datasets are treated with a multi-threaded approach.
threads
force – option to bypass the enable_download attribute of resources
- Returns:
- abstractmethod download_resource_bytes(ckan: CkanApiManage, full_download: bool = True, **kwargs) bytes
Download the resource and return the data as bytes.
- Parameters:
ckan
out_dir
full_download – Some resources like URLs are not downloaded by default. Large datasets are also limited to one request for this function by default.
threads
- Returns:
- download_sample_df(ckan: CkanApiManage, *, limit: int = 100, search_all: bool = False, download_alter: bool = False, **kwargs) DataFrame | None
- get_or_query_package_id(ckan: CkanApiManage) str
Obtain package ID from the package name. This can lead to a request to the API.
- get_or_query_resource_id(ckan: CkanApiManage, cancel_if_present: bool = True, error_not_found: bool = True) str
Store/retrieve resource ID in the class attributes.
- abstractmethod get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None) str | None
Function returning the local resource file name for the sample file.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- init_options_from_ckan(ckan: CkanApiManage, *, base_dir: str = None) None
Function to initialize some parameters from the ckan object
- initialize_from_options_string(base_dir: str, *, options_string: str = None, parser: ArgumentParser = None) None
- abstractmethod load_sample_data(resources_base_dir: str) bytes | None
Function returning the data from the indicated resources.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- property package_name
Returns the package name of the parent package. You cannot assign the package name through this property: setting it only performs a check. This will be removed in future releases. To change the package name, change the package_name attribute of the parent_package.
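The read-only behavior described above can be illustrated with a minimal sketch. `Package` and `Resource` are hypothetical stand-ins, not the library's classes, and raising `ValueError` on a mismatch is an assumption of this example.

```python
class Package:
    """Hypothetical stand-in for the parent package builder."""

    def __init__(self, package_name: str) -> None:
        self.package_name = package_name


class Resource:
    """Hypothetical stand-in for a resource builder."""

    def __init__(self, parent: Package) -> None:
        self.parent_package = parent

    @property
    def package_name(self) -> str:
        # The name is always read from the parent package.
        return self.parent_package.package_name

    @package_name.setter
    def package_name(self, value: str) -> None:
        # Setting assigns nothing: it only checks consistency.
        # Raising ValueError here is an assumption of this sketch.
        if value != self.parent_package.package_name:
            raise ValueError("change package_name on the parent package instead")


package = Package("weather-data")
resource = Resource(package)
resource.package_name = "weather-data"  # passes the check, assigns nothing
package.package_name = "climate-data"   # renaming goes through the parent
print(resource.package_name)
```

Keeping a single source of truth on the parent avoids resources of the same package drifting apart.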
- abstractmethod patch_request(ckan: CkanApiManage, package_id: str, *, reupload: bool = None, override_ckan: bool = False, resources_base_dir: str = None, inhibit_datastore_patch_indexes: bool = False) CkanResourceInfo
Function to perform all the necessary requests to initiate/reupload the resource on the CKAN server.
- Parameters:
resources_base_dir
ckan
reupload – option to reupload the resource
- Returns:
- resource_info_request(ckan: CkanApiManage, error_not_found: bool = True) CkanResourceInfo | None
- abstractmethod static resource_mode_str() str
- abstractmethod upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) None | ContextErrorLevelMessage
Test the presence of the files/urls used in the upload/patch requests.
- Parameters:
resources_base_dir
- Returns:
None if success, error message otherwise
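The "None on success, message otherwise" convention can be sketched as follows. This is only an illustration: `check_files_present` is a hypothetical helper, and a plain string stands in for the library's ContextErrorLevelMessage.

```python
import os
from typing import List, Optional


def check_files_present(resources_base_dir: str, file_names: List[str]) -> Optional[str]:
    """Hypothetical check: return None on success, an error message otherwise."""
    for file_name in file_names:
        path = os.path.join(resources_base_dir, file_name)
        if not os.path.isfile(path):
            return f"Missing file: {path}"
    return None


# The caller tests the result against None before starting an upload.
message = check_files_present(os.getcwd(), ["this_file_should_not_exist.csv"])
if message is None:
    print("All files found, upload can proceed")
else:
    print(f"Upload blocked: {message}")
```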
- upload_request(resources_base_dir: str, ckan: CkanApiManage, package_id: str)
- upload_request_final(ckan: CkanApiManage, *, force: bool = False) None
- class ckanapi_harvesters.builder.builder_resource.BuilderResourceUnmanaged(*, parent: BuilderPackageWithHarvesters, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None)
Bases:
BuilderFileABC
Class to manage resource metadata without specifying its contents during the upload process.
- copy(*, dest=None, parent=None)
- get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None) str | None
Function returning the local resource file name for the sample file.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- load_sample_data(resources_base_dir: str) bytes | None
Function returning the data from the indicated resources.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- patch_request(ckan: CkanApiManage, package_id: str, *, reupload: bool = None, override_ckan: bool = False, resources_base_dir: str = None, payload: bytes | BufferedIOBase = None, inhibit_datastore_patch_indexes: bool = False) CkanResourceInfo
Perform a patch of the resource on the CKAN server. A patch is a full update of the metadata of the resource, and of the DataStore if appropriate. The source file of the resource is also uploaded (or a first file for large DataStores).
- Parameters:
ckan
package_id
reupload
resources_base_dir
payload
- Returns:
- static resource_mode_str() str
- upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) ContextErrorLevelMessage | None
Test the presence of the files/urls used in the upload/patch requests.
- Parameters:
resources_base_dir
- Returns:
None if success, error message otherwise
- class ckanapi_harvesters.builder.builder_resource.BuilderUrl(*, parent: BuilderPackageWithHarvesters, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, url: str = None)
Bases:
BuilderUrlABC
Class for a resource defined by an external URL.
- copy(*, dest=None, parent=None)
- get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None) str
Function returning the local resource file name for the sample file.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- load_sample_data(resources_base_dir: str, *, ckan: CkanApiManage = None, proxies: dict = None, headers: dict = None) bytes
Function returning the data from the indicated resources.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- patch_request(ckan: CkanApiManage, package_id: str, *, reupload: bool = None, override_ckan: bool = False, resources_base_dir: str = None, payload: bytes | BufferedIOBase = None, inhibit_datastore_patch_indexes: bool = False) CkanResourceInfo
Perform a patch of the resource on the CKAN server. A patch is a full update of the metadata of the resource, and of the DataStore if appropriate. The source file of the resource is also uploaded (or a first file for large DataStores).
- Parameters:
ckan
package_id
reupload
resources_base_dir
payload
- Returns:
- static resource_mode_str() str
- class ckanapi_harvesters.builder.builder_resource.BuilderUrlABC(*, parent: BuilderPackageWithHarvesters, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, url: str = None)
Bases:
BuilderFileABC, ABC
Abstract behavior for a resource defined by an external URL.
- copy(*, dest=None, parent=None)
- download_request(ckan: CkanApiManage, out_dir: str, *, full_download: bool = False, threads: int = 1, force: bool = False, return_data: bool = False, **kwargs) None
Download the resource and save it to a file in out_dir. In most implementations, this calls the download_resource_bytes method.
- Parameters:
ckan
out_dir
full_download – Some resources like URLs are not downloaded by default. Large datasets are treated with a multi-threaded approach.
threads
force – option to bypass the enable_download attribute of resources
- Returns:
- upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) None | ContextErrorLevelMessage
Test the presence of the files/urls used in the upload/patch requests.
- Parameters:
resources_base_dir
- Returns:
None if success, error message otherwise
ckanapi_harvesters.builder.builder_resource_datastore module
Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements functions to initiate a DataStore.
- class ckanapi_harvesters.builder.builder_resource_datastore.BuilderDataStoreABC(*, parent, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, options_string: str = None, base_dir: str = None)
Bases:
BuilderResourceABC, ABC
The base class for DataStore resources. A DataStore resource can be updated with multiple requests and holds metadata for fields.
- Parameters:
field_builders – Merged metadata for fields (used in requests)
field_builders_user – Field metadata specified by the user (metadata already on CKAN, if any, takes priority)
field_builders_data_source – Field metadata which could be obtained from the builder data source
primary_key – primary key to transmit to CKAN (cannot be obtained through API)
indexes – indexes to transmit to CKAN (cannot be obtained through API)
aliases – Resource id aliases for requests (API cannot delete existing aliases)
aux_upload_fun_name – Name of the function used to edit DataFrames before uploading
aux_download_fun_name – Name of the function used to edit DataFrames after downloading
aux_read_fun_name – Name of the function used to read file contents (defines local_file_format as a UserFileFormat)
aux_write_fun_name – Name of the function used to write file contents (defines local_file_format as a UserFileFormat)
local_file_format – Class used to read/write files
df_mapper – DataFrame mapper function. This object adds certain indexes and applies the upload/download functions. It is responsible for mapping DataStore queries to file outputs.
data_cleaner_upload – Data sanitizer used to automate certain tasks and replace invalid values (default is None)
- _check_necessary_fields(current_fields: Set[str] = None, empty_datastore: bool = False, raise_error: bool = True) Set[str]
Auxiliary function to list the fields which are required:
- for df_mapper to determine the file names, associated requests, and recognize the last inserted row of a document
- to initialize the DataStore with the columns for the primary key and indexes
The required fields are compared to current_fields, if provided.
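The comparison with current_fields can be sketched as a set difference. `check_necessary_fields` below is a simplified stand-in, not the library's implementation: the choice of `KeyError` and the returned value are assumptions, and only the primary key and indexes are considered.

```python
from typing import List, Optional, Set


def check_necessary_fields(primary_key: List[str], indexes: List[str],
                           current_fields: Optional[Set[str]] = None,
                           raise_error: bool = True) -> Set[str]:
    """Simplified sketch: list required fields and compare them to current_fields."""
    required = set(primary_key) | set(indexes)
    if current_fields is not None:
        missing = required - current_fields
        if missing and raise_error:
            # The error type is an assumption of this sketch.
            raise KeyError(f"Missing required fields: {sorted(missing)}")
    return required


required = check_necessary_fields(
    primary_key=["station", "timestamp"],
    indexes=["timestamp"],
    current_fields={"station", "timestamp", "value"},
)
print(sorted(required))
```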
- _get_fields_update(ckan: CkanApiManage, *, current_df_fields: Set[str] | None, data_cleaner_fields: List[dict] | None, reupload: bool, override_ckan: bool) OrderedDict[str, CkanField]
Merge field builders in the following order of priority:
1. Existing metadata from CKAN (can be ignored with option override_ckan)
2. Metadata specified by the user in the Excel worksheet
3. Metadata found automatically from the data source (e.g. in file header or database)
4. Metadata found automatically by the data cleaner, especially for field typing
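The priority order can be illustrated with plain dictionaries. This is a minimal sketch, not the library's code: `merge_field` and the attribute names (`type`, `description`, `label`) are hypothetical, and a value of None is taken to mean "not specified".

```python
from typing import Dict, Optional

FieldMeta = Dict[str, Optional[str]]  # attribute name -> value (None = not specified)


def merge_field(ckan: FieldMeta, user: FieldMeta, source: FieldMeta,
                cleaner: FieldMeta, override_ckan: bool = False) -> FieldMeta:
    """Keep, for each attribute, the first non-empty value in priority order."""
    layers = [user, source, cleaner]
    if not override_ckan:
        layers.insert(0, ckan)  # CKAN metadata comes first unless ignored
    merged: FieldMeta = {}
    for layer in layers:  # highest priority first
        for key, value in layer.items():
            if value is not None and merged.get(key) is None:
                merged[key] = value
    return merged


ckan_meta = {"type": "text", "description": None}
user_meta = {"type": "numeric", "description": "Air temperature"}
source_meta = {"label": "Temperature"}
cleaner_meta = {"type": "float"}

default_merge = merge_field(ckan_meta, user_meta, source_meta, cleaner_meta)
forced_merge = merge_field(ckan_meta, user_meta, source_meta, cleaner_meta,
                           override_ckan=True)
print(default_merge)
print(forced_merge)
```

In the first call the type found on CKAN wins, while the empty CKAN description is filled from the user's worksheet. In the second call the CKAN layer is skipped entirely.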
- _merge_resource_attributes_from_file() None
This function merges metadata which could have been extracted from a file reading function into the attributes from data source. Call after self.local_file_format.read_file()
- apply_one_frame_per_primary_key(group_by_argument: str | List[str] = None)
Enables mode --one-frame-per-primary-key and applies option --group-by
In this mode, the upload process expects one DataFrame per primary key combination (except the last field of the primary key, which could be an index in the file). Upload update checks are performed using this assumption (do not read files by chunks). Downloads fill files according to unique combinations of the first columns of the primary key.
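To make the grouping rule concrete, here is a small sketch using plain dictionaries in place of DataFrames. `split_by_leading_key` and the column names are hypothetical and only illustrate "one frame per combination of all primary key fields except the last".

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Dict[str, object]


def split_by_leading_key(rows: List[Row], primary_key: List[str]) -> Dict[Tuple, List[Row]]:
    """Group rows by every primary key field except the last one."""
    leading = primary_key[:-1]
    frames = defaultdict(list)
    for row in rows:
        frames[tuple(row[name] for name in leading)].append(row)
    return dict(frames)


rows = [
    {"station": "A", "day": "2024-01-01", "line": 0, "value": 1.5},
    {"station": "A", "day": "2024-01-01", "line": 1, "value": 1.7},
    {"station": "A", "day": "2024-01-02", "line": 0, "value": 2.1},
    {"station": "B", "day": "2024-01-01", "line": 0, "value": 0.4},
]
frames = split_by_leading_key(rows, primary_key=["station", "day", "line"])
for key, frame in frames.items():
    print(key, len(frame))
```

Here `line` plays the role of the in-file index, so each (station, day) pair corresponds to one frame.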
- copy(*, dest=None, parent=None)
- download_resource_bytes(ckan: CkanApiManage, full_download: bool = True, **kwargs) bytes
Download the resource and return the data as bytes.
- Parameters:
ckan
out_dir
full_download – Some resources like URLs are not downloaded by default. Large datasets are also limited to one request for this function by default.
threads
- Returns:
- download_resource_df(ckan: CkanApiManage, search_all: bool = True, download_alter: bool = True, **kwargs) DataFrame | None
Download the resource and return it as a DataFrame. This is the DataFrame equivalent for download_resource_bytes.
- Parameters:
ckan
search_all
download_alter
kwargs
- Returns:
- download_sample_df(ckan: CkanApiManage, *, limit: int = 100, search_all: bool = False, download_alter: bool = False, pop_id: bool = True, **kwargs) DataFrame | None
Download the first lines of a DataStore. Extra options apply to datastore_dump API.
- get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None) None
Function returning the local resource file name for the sample file.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- init_options_from_ckan(ckan: CkanApiManage, *, base_dir: str = None) None
Function to initialize some parameters from the ckan object
- initialize_extra_options_string(extra_options_string: str, base_dir: str) None
- initialize_from_options_string(base_dir: str, *, options_string: str = None, parser: ArgumentParser = None) None
- load_sample_data(resources_base_dir: str) bytes
Function returning the data from the indicated resources.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- abstractmethod load_sample_df(resources_base_dir: str, *, upload_alter: bool = True) ListRecords | DataFrame
Function returning the data from the indicated resources as a pandas DataFrame. This is the DataFrame equivalent for load_sample_data.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- patch_request(ckan: CkanApiManage, package_id: str, *, df_upload: DataFrame = None, reupload: bool = None, override_ckan: bool = False, resources_base_dir: str = None, inhibit_datastore_patch_indexes: bool = False) CkanResourceInfo
Function to perform all the necessary requests to initiate/reupload the resource on the CKAN server.
- Parameters:
resources_base_dir
ckan
reupload – option to reupload the resource
- Returns:
- setup_default_file_mapper(*, primary_key: List[str] = None, file_query_list: Collection[Tuple[str, dict]] = None) None
- upsert_request_df(ckan: CkanApiManage, df_upload: DataFrame, *, total_lines_read: int, file_name: str, method: UpsertChoice = UpsertChoice.Upsert, apply_last_condition: bool = None, always_last_condition: bool = None) Tuple[DataFrame, DataFrame]
Call to ckan datastore_upsert. Before sending the DataFrame, a call to df_upload_alter is made. This method is overridden in BuilderDataStoreMultiABC and BuilderDataStoreFolder.
- Parameters:
ckan
df_upload
method
- Returns:
- upsert_request_final(ckan: CkanApiManage, *, force: bool = False) None
Final steps after the last upsert query. These steps are automatically done for a DataStore defined by one file.
- Parameters:
ckan
force – perform the request anyway
- Returns:
- class ckanapi_harvesters.builder.builder_resource_datastore.BuilderResourceIgnored(*, parent, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, file_url: str = None, options_string: str = None, base_dir: str = None)
Bases:
BuilderDataStoreABC
Class that keeps a line in the resource builders list but performs no action. It can still hold field metadata.
- copy(*, dest=None, parent=None)
- download_request(ckan: CkanApiManage, out_dir: str, *, full_download: bool = True, force: bool = False, threads: int = 1, return_data: bool = False) Any
Download the resource and save it to a file in out_dir. In most implementations, this calls the download_resource_bytes method.
- Parameters:
ckan
out_dir
full_download – Some resources like URLs are not downloaded by default. Large datasets are treated with a multi-threaded approach.
threads
force – option to bypass the enable_download attribute of resources
- Returns:
- download_resource_bytes(ckan: CkanApiManage, full_download: bool = True, **kwargs) bytes
Download the resource and return the data as bytes.
- Parameters:
ckan
out_dir
full_download – Some resources like URLs are not downloaded by default. Large datasets are also limited to one request for this function by default.
threads
- Returns:
- get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None) str | None
Function returning the local resource file name for the sample file.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- load_sample_data(resources_base_dir: str) bytes | None
Function returning the data from the indicated resources.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- load_sample_df(resources_base_dir: str, *, upload_alter: bool = True) None
Function returning the data from the indicated resources as a pandas DataFrame. This is the DataFrame equivalent for load_sample_data.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- patch_request(ckan: CkanApiManage, package_id: str, *, reupload: bool = None, override_ckan: bool = False, resources_base_dir: str = None, payload: bytes | BufferedIOBase = None, inhibit_datastore_patch_indexes: bool = False) None
Function to perform all the necessary requests to initiate/reupload the resource on the CKAN server.
- Parameters:
resources_base_dir
ckan
reupload – option to reupload the resource
- Returns:
- static resource_mode_str() str
- upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) ContextErrorLevelMessage | None
Test the presence of the files/urls used in the upload/patch requests.
- Parameters:
resources_base_dir
- Returns:
None if success, error message otherwise
ckanapi_harvesters.builder.builder_resource_datastore_file module
Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements functions to initiate a DataStore.
- class ckanapi_harvesters.builder.builder_resource_datastore_file.BuilderDataStoreFile(*, parent, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, file_name: str = None, options_string: str = None, base_dir: str = None)
Bases:
BuilderDataStoreFolder
Implementation supporting the reading of a file by chunks.
- copy(*, dest=None, parent=None)
- download_request(ckan: CkanApiManage, out_dir: str, *, full_download: bool = True, force: bool = False, threads: int = 1, return_data: bool = False) DataFrame | None
Download the resource and save it to a file in out_dir. In most implementations, this calls the download_resource_bytes method.
- Parameters:
ckan
out_dir
full_download – Some resources like URLs are not downloaded by default. Large datasets are treated with a multi-threaded approach.
threads
force – option to bypass the enable_download attribute of resources
- Returns:
- download_request_full(ckan: CkanApiManage, out_dir: str, threads: int = 1, external_stop_event=None, start_index: int = 0, end_index: int = None, force: bool = False) None
- get_local_file_offset(file_chunk: FileChunkDataFrame) int
Get the position of the current data in the overall upload.
- get_local_file_size_units()
- get_local_file_total_size() int
Get the overall size of the upload, normally in bytes or line count.
- get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None, file_index: int = 0) str
Function returning the local resource file name for the sample file.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- list_local_files(resources_base_dir: str, ckan: CkanApiManage, cancel_if_present: bool = True) List[str]
- static resource_mode_str() str
- to_builder_datastore_folder(*, dir_name: str = None, primary_key: List[str] = None, file_query_list: Collection[Tuple[str, dict]] = None) BuilderDataStoreFolder
- upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) None | ContextErrorLevelMessage
Test the presence of the files/urls used in the upload/patch requests.
- Parameters:
resources_base_dir
- Returns:
None if success, error message otherwise
ckanapi_harvesters.builder.builder_resource_datastore_multi_abc module
Code to initiate a DataStore defined by a large number of files to concatenate into one table
- class ckanapi_harvesters.builder.builder_resource_datastore_multi_abc.BuilderDataStoreMultiABC(*, parent, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, options_string: str = None, base_dir: str = None)
Bases:
BuilderDataStoreABC, BuilderMultiABC, ABC
Generic class to manage a large DataStore divided into files/parts. This abstract class is intended to be subclassed in order to generate data from the workspace without using CSV files.
- _update_metadata(ckan: CkanApiManage, *, base_dir: str = None) None
In certain implementations, the resource & field metadata can be derived from the data source. Normally, the metadata is defined by the user in an Excel worksheet. When a description is left empty, the value left on the CKAN server is left unchanged. The objective here is to propose values that override the Excel worksheet when the description is empty on the CKAN side (still leave CKAN values unchanged, if present).
- Parameters:
ckan – CkanApi instance
override_ckan – when True, override the values from the CKAN server, if present
- copy(*, dest=None, parent=None)
- download_file_query_generator(ckan: CkanApiManage, file_query: dict) Generator[DataFrame, Any, None]
Download the DataFrame with the file_query arguments
- download_request_full(ckan: CkanApiManage, out_dir: str, threads: int = 1, external_stop_event=None, start_index: int = 0, end_index: int = None, force: bool = False) None
- download_request_generator(ckan: CkanApiManage, out_dir: str) Generator[Tuple[Any, DataFrame], Any, None]
Iterator on file_queries.
- download_resource_bytes(ckan: CkanApiManage, full_download: bool = False, **kwargs) bytes
Download the resource and return the data as bytes.
- Parameters:
ckan
out_dir
full_download – Some resources like URLs are not downloaded by default. Large datasets are also limited to one request for this function by default.
threads
- Returns:
- download_resource_df(ckan: CkanApiManage, search_all: bool = False, **kwargs) DataFrame
Download the resource and return it as a DataFrame. This is the DataFrame equivalent for download_resource_bytes.
- Parameters:
ckan
search_all
download_alter
kwargs
- Returns:
- get_datastore_len(ckan: CkanApiManage) int
- setup_default_file_mapper(*, primary_key: List[str] = None, file_query_list: Collection[Tuple[str, dict]] = None) None
This function enables the user to define the primary key and initializes the default file mapper.
- Parameters:
primary_key – manually specify the primary key
- Returns:
- upload_request_final(ckan: CkanApiManage, *, force: bool = False) None
- upload_request_full(ckan: CkanApiManage, resources_base_dir: str, *, method: UpsertChoice = None, threads: int = 1, external_stop_event=None, allow_chunks: bool = True, only_missing: bool = False, from_line_count: bool = False, start_index: int = 0, end_index: int = None, inhibit_datastore_patch_indexes: bool = False, **kwargs) None
Perform all the upload requests.
- Parameters:
ckan
resources_base_dir
threads
external_stop_event
only_missing
start_index
end_index
- Returns:
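The start_index and end_index parameters are not described here. Assuming they follow the slice convention documented for positive_end_index in builder_aux (a negative end_index counts from the end, and the last part treated is the one just before the stop index), selecting a subset of parts could look like the sketch below. `stop_index` and `select_parts` are hypothetical helpers, not the library's source.

```python
from typing import List, Optional


def stop_index(end_index: Optional[int], total: int) -> int:
    """Sketch of a slice-style stop index (not the library's source code)."""
    if end_index is None:
        return total
    if end_index < 0:
        return max(total + end_index, 0)
    return min(end_index, total)


def select_parts(parts: List[str], start_index: int = 0,
                 end_index: Optional[int] = None) -> List[str]:
    """Return the parts that a loop from start_index to the stop index would treat."""
    stop = stop_index(end_index, len(parts))
    return [parts[i] for i in range(start_index, stop)]


parts = [f"part_{i}.csv" for i in range(5)]
selected = select_parts(parts, start_index=1, end_index=-1)
print(selected)
```

With five parts, `start_index=1` and `end_index=-1` skip the first and the last part.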
- upsert_request_df_no_return(ckan: CkanApiManage, df_upload: DataFrame, *, total_lines_read: int, file_name: str, method: UpsertChoice = UpsertChoice.Upsert, apply_last_condition: bool = None, always_last_condition: bool = None) None
Calls upsert_request_df but does not return anything
- Returns:
- upsert_request_final(ckan: CkanApiManage, *, force: bool = False) None
Final steps after the last upsert query. This call is mandatory at the end of all requests if the user called upsert_request_df for a multi-part DataStore manually.
- Parameters:
ckan
force – perform the request anyway
- Returns:
ckanapi_harvesters.builder.builder_resource_datastore_multi_ckan module
Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements functions to initiate a DataStore.
- class ckanapi_harvesters.builder.builder_resource_datastore_multi_ckan.BuilderDataStoreCkan(*, parent, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, file_name: str = None, options_string: str = None, base_dir: str = None)
Bases:
BuilderDataStoreFolder
Merge of existing CKAN DataStores (on the same server) into a single DataStore.
- copy(*, dest=None, parent=None)
- get_local_df_chunk_generator(resources_base_dir: str, ckan: CkanApiManage, allow_chunks: bool = True, **kwargs) Generator[FileChunkDataFrame, None, None]
Returns an iterator over the data to upload and a position in the current file.
- get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None, file_index: int = 0) str
Function returning the local resource file name for the sample file.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- list_local_files(resources_base_dir: str, ckan: CkanApiManage, cancel_if_present: bool = True) List[str]
- static resource_mode_str() str
- upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) None | ContextErrorLevelMessage
Test the presence of the files/urls used in the upload/patch requests.
- Parameters:
resources_base_dir
- Returns:
None if success, error message otherwise
ckanapi_harvesters.builder.builder_resource_datastore_multi_folder module
Code to initiate a DataStore defined by a large number of files to concatenate into one table. This concrete implementation is linked to the file system.
- class ckanapi_harvesters.builder.builder_resource_datastore_multi_folder.BuilderDataStoreFolder(*, parent, file_query_list: List[Tuple[str, dict]] = None, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, dir_name: str = None, options_string: str = None, base_dir: str = None)
Bases:
BuilderDataStoreMultiABC
- copy(*, dest=None, parent=None)
- download_file_query(ckan: CkanApiManage, out_dir: str, file_name: str, file_query: dict, *, return_df: bool = False) str | None | Tuple[str | None, DataFrame | None]
- download_file_query_item(ckan: CkanApiManage, out_dir: str, file_query_item: Tuple[str, dict]) str
Download the file_query item with its arguments
- download_file_query_list(ckan: CkanApiManage, cancel_if_present: bool = True) List[Tuple[str, dict]]
- download_request(ckan: CkanApiManage, out_dir: str, *, full_download: bool = False, force: bool = False, threads: int = 1, return_data: bool = False) None
Download the resource and save it to a file in out_dir. In most implementations, this calls the download_resource_bytes method.
- Parameters:
ckan
out_dir
full_download – Some resources like URLs are not downloaded by default. Large datasets are treated with a multi-threaded approach.
threads
force – option to bypass the enable_download attribute of resources
- Returns:
- get_file_query_generator() Generator[Tuple[str, dict], Any, None]
Returns an iterator on all the file_queries.
- get_local_df_chunk_generator(resources_base_dir: str, ckan: CkanApiManage, allow_chunks: bool = True, **kwargs) Generator[FileChunkDataFrame, None, None]
Returns an iterator over the data to upload and a position in the current file.
- get_local_file_offset(file_chunk: FileChunkDataFrame) int
Get the position of the current data in the overall upload.
- get_local_file_size_units()
- get_local_file_total_size() int
Get the overall size of the upload, normally in bytes or line count.
- get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage = None, file_index: int = 0) str | None
Function returning the local resource file name for the sample file.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- init_download_file_query_list(ckan: CkanApiManage, out_dir: str, cancel_if_present: bool = True, **kwargs) List[Any]
Determine the list of queries to download to reconstruct the uploaded parts. By default, the unique combinations of the first columns of the primary key are used.
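As an illustration of the default rule, the sketch below derives one query per unique combination of the leading primary key columns. It works on plain dictionaries rather than on a DataStore, `build_file_query_list` is a hypothetical helper, and the file naming scheme is an assumption. The (file name, query) tuple shape mirrors the `List[Tuple[str, dict]]` type of file_query_list.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def build_file_query_list(rows: List[Row], primary_key: List[str]) -> List[Tuple[str, dict]]:
    """One (file name, query) pair per unique combination of the leading key columns."""
    leading = primary_key[:-1]
    queries: List[dict] = []
    for row in rows:
        query = {name: row[name] for name in leading}
        if query not in queries:  # keep first-seen order, drop duplicates
            queries.append(query)
    # The file naming scheme below is an assumption of this sketch.
    return [("_".join(str(v) for v in q.values()) + ".csv", q) for q in queries]


rows = [
    {"station": "A", "day": "2024-01-01", "line": 0},
    {"station": "A", "day": "2024-01-01", "line": 1},
    {"station": "B", "day": "2024-01-01", "line": 0},
]
file_query_list = build_file_query_list(rows, primary_key=["station", "day", "line"])
for file_name, file_query in file_query_list:
    print(file_name, file_query)
```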
- init_local_files_list(resources_base_dir: str, ckan: CkanApiManage, cancel_if_present: bool = True, **kwargs) List[str]
Behavior to list parts of an upload.
- list_local_files(resources_base_dir: str, ckan: CkanApiManage, cancel_if_present: bool = True) List[str]
- load_sample_df(resources_base_dir: str, *, upload_alter: bool = True, file_index: int = 0, allow_chunks: bool = True, **kwargs) ListRecords | DataFrame
Function returning the data from the indicated resources as a pandas DataFrame. This is the DataFrame equivalent for load_sample_data.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- static resource_mode_str() str
- setup_download_file_query_list(file_query_list: List[Tuple[str, dict]]) None
- upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) None | ContextErrorLevelMessage
Test the presence of the files/urls used in the upload/patch requests.
- Parameters:
resources_base_dir
- Returns:
None if success, error message otherwise
- upsert_request_df(ckan: CkanApiManage, df_upload: DataFrame, *, total_lines_read: int, file_name: str, method: UpsertChoice = UpsertChoice.Upsert, apply_last_condition: bool = None, always_last_condition: bool = None) Tuple[DataFrame, DataFrame]
Call to ckan datastore_upsert. Before sending the DataFrame, a call to df_upload_alter is made. This implementation optionally checks for the last line of the DataFrame based on the first columns of the primary key.
- Parameters:
ckan
df_upload
method
- Returns:
ckanapi_harvesters.builder.builder_resource_datastore_multi_harvester module
Code to initiate a DataStore defined by a large number of files to concatenate into one table. This concrete implementation is linked to the file system.
- class ckanapi_harvesters.builder.builder_resource_datastore_multi_harvester.BuilderDataStoreHarvester(*, parent, file_query_list: List[Tuple[str, dict]] = None, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, dir_name: str = None, file_url_attr: str = None, options_string: str = None, base_dir: str = None)
Bases:
BuilderDataStoreFolder
- clear_secrets_and_disconnect() None
- copy(*, dest=None, parent=None)
- static from_file_datastore(resource_file: BuilderDataStoreFile, *, dir_name: str = None, primary_key: List[str] = None, file_query_list: Collection[Tuple[str, dict]] = None) BuilderDataStoreHarvester
Do not initialize a BuilderDataStoreHarvester with this method. Rather initialize a new instance of the class.
- Raises:
NotImplementedError
- get_local_df_chunk_generator(resources_base_dir: str, ckan: CkanApiManage, **kwargs) Generator[FileChunkDataFrame, None, None]
Returns an iterator over the data to upload and a position in the current file.
- get_local_file_size_units()
- get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None, file_index: int = 0) Any | None
Function returning the local resource file name for the sample file.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- property harvester: TableHarvesterABC | None
- init_local_files_list(resources_base_dir: str, ckan: CkanApiManage, cancel_if_present: bool = True, **kwargs) List[str]
Behavior to list parts of an upload.
- init_options_from_ckan(ckan: CkanApiManage, *, base_dir: str = None) None
Function to initialize some parameters from the ckan object
- initialize_extra_options_string(extra_options_string: str, base_dir: str) None
- list_local_files(resources_base_dir: str, ckan: CkanApiManage | None, cancel_if_present: bool = True) List[Any]
- static resource_mode_str() str
- upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) None | ContextErrorLevelMessage
Test the presence of the files/urls used in the upload/patch requests.
- Parameters:
resources_base_dir
- Returns:
None if success, error message otherwise
- upsert_request_df(ckan: CkanApiManage, df_upload: DataFrame, *, total_lines_read: int, file_name: str, method: UpsertChoice = UpsertChoice.Upsert, apply_last_condition: bool = None, always_last_condition: bool = None) Tuple[DataFrame, DataFrame]
Call to ckan datastore_upsert. Before sending the DataFrame, a call to df_upload_alter is made. This implementation optionally checks for the last line of the DataFrame based on the first columns of the primary key.
- Parameters:
ckan
df_upload
method
- Returns:
ckanapi_harvesters.builder.builder_resource_datastore_unmanaged module
Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements functions to initiate a DataStore without uploading any data.
- class ckanapi_harvesters.builder.builder_resource_datastore_unmanaged.BuilderDataStoreUnmanaged(*, parent, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, options_string: str = None, base_dir: str = None)
Bases:
BuilderDataStoreFile
Class representing a DataStore (resource metadata and fields metadata) without managing its contents during the upload process.
- copy(*, dest=None, parent=None)
- get_local_df_chunk_generator(resources_base_dir: str, ckan: CkanApiManage, **kwargs) Generator[Tuple[ListRecords | DataFrame, int], None, None]
Returns an iterator over the data to upload and a position in the current file.
- get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None, file_index: int = 0) None
Function returning the local resource file name for the sample file.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- init_local_files_list(resources_base_dir: str, cancel_if_present: bool = True, **kwargs) List[str]
Behavior to list parts of an upload.
- load_sample_df(resources_base_dir: str, *, upload_alter: bool = True, file_index: int = 0, allow_chunks: bool = True, **kwargs) DataFrame | None
Function returning the data from the indicated resources as a pandas DataFrame. This is the DataFrame equivalent for load_sample_data.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- patch_request(ckan: CkanApiManage, package_id: str, *, df_upload: DataFrame = None, reupload: bool = None, override_ckan: bool = False, resources_base_dir: str = None, inhibit_datastore_patch_indexes: bool = False) CkanResourceInfo
Specific implementation of patch_request which does not upload any data and only updates the fields currently present in the database
- Parameters:
resources_base_dir
ckan
package_id
reupload
- Returns:
- static resource_mode_str() str
- upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) None | ContextErrorLevelMessage
Test the presence of the files/urls used in the upload/patch requests.
- Parameters:
resources_base_dir
- Returns:
None if success, error message otherwise
ckanapi_harvesters.builder.builder_resource_datastore_url module
Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements functions to initiate a DataStore without uploading any data.
- class ckanapi_harvesters.builder.builder_resource_datastore_url.BuilderDataStoreUrl(*, parent, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, url: str = None, options_string: str = None, base_dir: str = None)
Bases:
BuilderDataStoreFile
Class representing a DataStore (resource metadata and fields metadata) defined by a url.
- copy(*, dest=None, parent=None)
- get_local_df_chunk_generator(resources_base_dir: str, ckan: CkanApiManage, **kwargs) Generator[FileChunkDataFrame, None, None]
Returns an iterator over the data to upload and a position in the current file.
- get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None, file_index: int = 0) str
Function returning the local resource file name for the sample file.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- init_local_files_list(resources_base_dir: str, cancel_if_present: bool = True, **kwargs) List[str]
Behavior to list parts of an upload.
- load_sample_data(resources_base_dir: str, *, ckan: CkanApiManage = None, proxies: dict = None, headers: dict = None) bytes
Function returning the data from the indicated resources.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- patch_request(ckan: CkanApiManage, package_id: str, *, df_upload: DataFrame = None, payload: bytes | BufferedIOBase = None, reupload: bool = None, override_ckan: bool = False, resources_base_dir: str = None, inhibit_datastore_patch_indexes: bool = False) CkanResourceInfo
Specific implementation of patch_request which does not upload any data and only updates the fields currently present in the database
- Parameters:
resources_base_dir
ckan
package_id
reupload
- Returns:
- static resource_mode_str() str
- upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, **kwargs) None | ContextErrorLevelMessage
Test the presence of the files/urls used in the upload/patch requests.
- Parameters:
resources_base_dir
- Returns:
None if success, error message otherwise
- upload_request_full(ckan: CkanApiManage, resources_base_dir: str, *, threads: int = 1, external_stop_event=None, start_index: int = 0, end_index: int = None, inhibit_datastore_patch_indexes: bool = False, **kwargs) None
Perform all the upload requests.
- Parameters:
ckan
resources_base_dir
threads
external_stop_event
only_missing
start_index
end_index
- Returns:
ckanapi_harvesters.builder.builder_resource_init module
Code to initialize a resource builder from a row
- ckanapi_harvesters.builder.builder_resource_init.init_resource_from_ckan(ckan: CkanApiMap, resource_info: CkanResourceInfo, parent) BuilderResourceABC
Function initiating a resource builder based on information provided by the CKAN API.
- Returns:
- ckanapi_harvesters.builder.builder_resource_init.init_resource_from_df(row: Series, parent, base_dir: str = None) BuilderResourceABC | None
Function mapping keywords to a resource builder type.
- Parameters:
row
- Returns:
ckanapi_harvesters.builder.builder_resource_multi_abc module
Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements the basic resources. See builder_datastore for specific functions to initiate datastores.
- class ckanapi_harvesters.builder.builder_resource_multi_abc.BuilderMultiABC
Bases:
ABC
- _call_progress_callback(position: int, total: int, *, info: Any = None, context: str = None, file_index: int = 0, file_count: int = None, lines_chunk: int = None, total_lines_read: int = None, canceled_request: bool = False, end_message: bool = False, level: int = 0) None
Progress callback function. Use to implement a progress indication for the user.
- Parameters:
position – the position within the resource (usually, in bytes or line count)
total – the total size of the resource
info – an object from which more information can be extracted, typically, the DataFrame itself, with an indication of the data origin.
context – the context of the call (ckan instance, upload/download, single/multi-threaded)
file_index – the index of the file in the list
file_count – the number of files in the list
lines_chunk – the number of lines in the chunk currently being processed
total_lines_read – the total number of lines read, including the current chunk
canceled_request – this callback is also called when a line is ignored
end_message – boolean indicating the end of the work in progress
level – the level of the progress callback (1: package/dataset, 2: resource builder, 3: used for multi-file resources)
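As a rough illustration of how these arguments could drive a progress display, the sketch below formats them into a single status line. The function name `format_progress` and the message layout are assumptions made for this example; they are not part of the package, and the real callback may use the arguments differently.

```python
from typing import Any, Optional


def format_progress(position: int, total: int, *, context: Optional[str] = None,
                    file_index: int = 0, file_count: Optional[int] = None,
                    end_message: bool = False, level: int = 0,
                    **_unused: Any) -> str:
    """Hypothetical helper: build a one-line progress message.

    Extra keyword arguments (info, lines_chunk, ...) are accepted and ignored
    so the function tolerates the full set of callback arguments.
    """
    indent = "  " * level  # deeper levels are indented further
    percent = 100.0 * position / total if total else 0.0  # avoid division by zero
    status = "done" if end_message else f"{percent:.1f}%"
    files = f" file {file_index + 1}/{file_count}" if file_count else ""
    prefix = f"[{context}] " if context else ""
    return f"{indent}{prefix}{status}{files} ({position}/{total})"


if __name__ == "__main__":
    print(format_progress(50, 200, context="upload", file_index=0, file_count=4, level=1))
```

Returning a string rather than printing keeps the helper easy to redirect to a logger or a GUI widget.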
- abstractmethod _unit_download_apply(ckan: CkanApiManage, file_query_item: Any, out_dir: str, index: int, start_index: int, end_index: int, total: int, **kwargs) Any
Unitary function deciding whether to perform download and making the steps for the request.
- _unit_upload_apply(*, ckan: CkanApiManage, file_chunk: FileChunkDataFrame, upload_alter: bool = True, overall_chunk_index: int, file_count: int, start_index: int, end_index: int, **kwargs) Any
Unitary function deciding whether to perform upload and making the steps for the upload.
- copy(*, dest=None)
- abstractmethod download_file_query_item(ckan: CkanApiManage, out_dir: str, file_query_item: Any) Any
Download the file_query item with its arguments
- download_file_query_item_graceful(ckan: CkanApiManage, out_dir: str, file_query_item: Any, index: int, external_stop_event=None, start_index: int = 0, end_index: int = None, **kwargs) None
Implementation of download_file_query_item with checks for a multi-threaded download.
- download_request_full(ckan: CkanApiManage, out_dir: str, threads: int = 1, external_stop_event=None, start_index: int = 0, end_index: int = None, force: bool = False, **kwargs) None
- download_request_full_multi_threaded(ckan: CkanApiManage, out_dir: str, threads: int = None, external_stop_event=None, start_index: int = 0, end_index: int = -1, **kwargs) None
Multi-threaded implementation of download_request_full using ThreadPoolExecutor.
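The general pattern of a thread pool combined with a stop event and a start/end index range can be sketched as follows. This is not the package's implementation: `run_items_threaded` and its `worker` argument are hypothetical names, and the sketch only assumes that the index range follows Python slice semantics.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


def run_items_threaded(items: Sequence[str], worker: Callable[[str], str], *,
                       threads: int = 4,
                       stop_event: Optional[threading.Event] = None,
                       start_index: int = 0,
                       end_index: Optional[int] = None) -> List[Optional[str]]:
    """Hypothetical sketch: apply worker to items[start_index:end_index] in a pool.

    Items picked up after stop_event has been set are skipped and yield None.
    """
    event = stop_event if stop_event is not None else threading.Event()
    selected = items[start_index:end_index]  # slice semantics: end_index is exclusive

    def graceful(item: str) -> Optional[str]:
        if event.is_set():  # cooperative cancellation, checked before each item
            return None
        return worker(item)

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() returns results in input order, whatever the completion order
        return list(pool.map(graceful, selected))


if __name__ == "__main__":
    print(run_items_threaded(["a", "b", "c", "d"], str.upper, threads=2))
```

Checking the event inside each task lets already running requests finish while preventing new ones from starting.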
- abstractmethod download_request_generator(ckan: CkanApiManage, out_dir: str) Generator[Any, Any, None]
Generator to apply treatments after each request (single-threaded).
- Parameters:
ckan
out_dir
- Returns:
- abstractmethod get_file_query_generator() Generator[Any, Any, None]
Returns an iterator on all the file_queries.
- abstractmethod get_local_df_chunk_generator(resources_base_dir: str, ckan: CkanApiManage, **kwargs) Generator[FileChunkDataFrame, None, None]
Returns an iterator over the data to upload and a position in the current file.
- abstractmethod get_local_file_offset(file_chunk: FileChunkDataFrame) int
Get the position of the current data in the overall upload.
- abstractmethod get_local_file_size_units() CkanProgressUnits
- abstractmethod get_local_file_total_size() int
Get the overall size of the upload, normally in bytes or line count.
- abstractmethod init_download_file_query_list(ckan: CkanApiManage, out_dir: str, cancel_if_present: bool = True, **kwargs) List[Any]
Determine the list of queries to download to reconstruct the uploaded parts. By default, the unique combinations of the first columns of the primary key are used.
- abstractmethod init_local_files_list(resources_base_dir: str, ckan: CkanApiManage, cancel_if_present: bool = True, **kwargs) List[str]
Behavior to list parts of an upload.
- upload_request_final(ckan: CkanApiManage, *, force: bool = False) None
- upload_request_full(ckan: CkanApiManage, resources_base_dir: str, *, threads: int = 1, external_stop_event=None, from_line_count: bool = False, allow_chunks: bool = True, start_index: int = 0, end_index: int = None, inhibit_datastore_patch_indexes: bool = False, **kwargs) None
Perform all the upload requests.
- Parameters:
ckan
resources_base_dir
threads
external_stop_event
only_missing
start_index
end_index
- Returns:
- upload_request_full_multi_threaded(ckan: CkanApiManage, resources_base_dir: str, threads: int = 1, external_stop_event=None, allow_chunks: bool = True, start_index: int = 0, end_index: int = None, **kwargs)
Multi-threaded implementation of upload_request_full, using ThreadPoolExecutor.
- upload_request_graceful(ckan: CkanApiManage, file_chunk: FileChunkDataFrame, *, overall_chunk_index: int, external_stop_event=None, start_index: int = 0, end_index: int = None, **kwargs) None
Calls upload_file with checks specific to multi-threading.
- Returns:
ckanapi_harvesters.builder.builder_resource_multi_datastore module
Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements the basic resources. See builder_datastore for specific functions to initiate datastores.
- class ckanapi_harvesters.builder.builder_resource_multi_datastore.BuilderMultiDataStore(*, parent, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None)
Bases:
BuilderMultiFile,BuilderDataStoreABC
- copy(*, dest=None, parent=None)
- download_file_query_item(ckan: CkanApiManage, out_dir: str, file_query_item: str, full_download: bool = True) Tuple[str | None, Response | None]
Download the file_query item with its arguments
- download_file_query_item_df(ckan: CkanApiManage, out_dir: str, file_query_item: str, full_download: bool = True) Tuple[str, DataFrame]
- download_request_generator_df(ckan: CkanApiManage, out_dir: str, excluded_resource_names: Set[str] = None) Generator[Tuple[str | None, DataFrame | None], Any, None]
- get_local_df_chunk_generator(resources_base_dir: str, ckan: CkanApiManage, excluded_files: Set[str] = None, allow_chunks: bool = True, **kwargs) Generator[FileChunkDataFrame, None, None]
Returns an iterator over the data to upload and a position in the current file.
- load_sample_df(resources_base_dir: str, *, upload_alter: bool = True, file_index: int = 0, allow_chunks: bool = True, **kwargs) ListRecords | DataFrame
Function returning the data from the indicated resources as a pandas DataFrame. This is the DataFrame equivalent for load_sample_data.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- static resource_mode_str() str
- upload_file_chunk(ckan: CkanApiManage, package_id: str, file_chunk: FileChunkDataFrame, *, reupload: bool = False, override_ckan: bool = False, cancel_if_present: bool = True, inhibit_datastore_patch_indexes: bool = False) CkanResourceInfo
Upload a file, using its name as resource name
ckanapi_harvesters.builder.builder_resource_multi_file module
Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements the basic resources. See builder_datastore for specific functions to initiate datastores.
- class ckanapi_harvesters.builder.builder_resource_multi_file.BuilderMultiFile(*, parent, name: str = None, format: str = None, description: str = None, resource_id: str = None, download_url: str = None, dir_name: str = None)
Bases:
BuilderResourceABC,BuilderMultiABC
Class to manage a set of files to upload as separate resources
- copy(*, dest=None, parent=None)
- download_file_query_item(ckan: CkanApiManage, out_dir: str, file_query_item: str) Tuple[str | None, Response | None]
Download the file_query item with its arguments
- download_request(ckan: CkanApiManage, out_dir: str, *, full_download: bool = True, threads: int = 1, force: bool = False, excluded_resource_names: Set[str] = None, return_data: bool = False, **kwargs) None
Download the resource and save it to the location given by out_dir. In most implementations, this calls the download_resource_bytes method.
- Parameters:
ckan
out_dir
full_download – Some resources like URLs are not downloaded by default. Large datasets are treated with a multi-threaded approach.
threads
force – option to bypass the enable_download attribute of resources
- Returns:
- download_request_full(ckan: CkanApiManage, out_dir: str, threads: int = 1, external_stop_event=None, start_index: int = 0, end_index: int = None, force: bool = False, excluded_resource_names: Set[str] = None) None
- download_request_generator(ckan: CkanApiManage, out_dir: str, excluded_resource_names: Set[str] = None) Generator[Tuple[str | None, Response | None], Any, None]
Generator to apply treatments after each request (single-threaded).
- Parameters:
ckan
out_dir
- Returns:
- download_resource_bytes(ckan: CkanApiManage, full_download: bool = True, **kwargs) bytes | None
Download the resource and return the data as bytes.
- Parameters:
ckan
out_dir
full_download – Some resources like URLs are not downloaded by default. Large datasets are also limited to one request for this function by default.
threads
- Returns:
- get_file_query_generator() Generator[str, Any, None]
Returns an iterator on all the file_queries.
- get_local_df_chunk_generator(resources_base_dir: str, ckan: CkanApiManage, excluded_files: Set[str] = None, **kwargs) Generator[FileChunkDataFrame, None, None]
Returns an iterator over the data to upload and a position in the current file.
- get_local_file_generator(resources_base_dir: str, excluded_files: Set[str] = None, **kwargs) Generator[str, None, None]
- get_local_file_offset(file_chunk: FileChunkDataFrame) int
Get the position of the current data in the overall upload.
- get_local_file_size_units()
- get_local_file_total_size() int
Get the overall size of the upload, normally in bytes or line count.
- get_or_query_resource_id(ckan: CkanApiManage, cancel_if_present: bool = True, error_not_found: bool = True) None | str
Store/retrieve resource ID in the class attributes.
- get_sample_file_path(resources_base_dir: str, ckan: CkanApiManage | None = None, file_index: int = 0) str | None
Function returning the local resource file name for the sample file.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- init_download_file_query_list(ckan: CkanApiManage, out_dir: str = None, cancel_if_present: bool = True, excluded_resource_names: Set[str] = None, **kwargs) List[str]
Determine the list of queries to download to reconstruct the uploaded parts. By default, the unique combinations of the first columns of the primary key are used.
- init_local_files_list(resources_base_dir: str, cancel_if_present: bool = True, excluded_files: Set[str] = None, **kwargs) List[str]
Behavior to list parts of an upload.
- list_local_files(resources_base_dir: str, cancel_if_present: bool = True, excluded_files: Set[str] = None) List[str] | None
List the files that correspond to the multi-file resource configuration and are not used in mono-resources
- Parameters:
resources_base_dir
cancel_if_present
excluded_files – files from mono-resources
- Returns:
- list_remote_resource_ids(ckan: CkanApiManage, *, excluded_resource_names: Set[str] = None, cancel_if_present: bool = True) List[str]
- list_remote_resources(ckan: CkanApiManage, *, excluded_resource_names: Set[str] = None, cancel_if_present: bool = True) List[str]
Defines the list of resources to download that correspond to the definition and are not used in mono-resources.
- Parameters:
ckan
excluded_resource_names – resource names of mono-resources
cancel_if_present
- Returns:
- load_sample_data(resources_base_dir: str, file_index: int = 0) bytes | None
Function returning the data from the indicated resources.
- Parameters:
resources_base_dir – base directory to find the resources on the local machine
- Returns:
- patch_request(ckan: CkanApiManage, package_id: str, *, reupload: bool = None, override_ckan: bool = False, resources_base_dir: str = None, payload: bytes | BufferedIOBase = None, inhibit_datastore_patch_indexes: bool = False) None | CkanResourceInfo
Function to perform all the necessary requests to initiate/reupload the resource on the CKAN server.
- Parameters:
resources_base_dir
ckan
reupload – option to reupload the resource
- Returns:
- resource_info_request(ckan: CkanApiManage, error_not_found: bool = True) CkanResourceInfo | None
- static resource_mode_str() str
- upload_file_checks(*, resources_base_dir: str = None, ckan: CkanApiManage = None, excluded_files: Set[str] = None, **kwargs) None | ContextErrorLevelMessage
Test the presence of the files/urls used in the upload/patch requests.
- Parameters:
resources_base_dir
- Returns:
None if success, error message otherwise
- upload_file_chunk(ckan: CkanApiManage, package_id: str, file_chunk: FileChunkDataFrame, *, reupload: bool = False, override_ckan: bool = False, cancel_if_present: bool = True, inhibit_datastore_patch_indexes: bool = False) CkanResourceInfo
Upload a file, using its name as resource name
- upload_request_final(ckan: CkanApiManage, *, force: bool = False) None
- upload_request_full(ckan: CkanApiManage, resources_base_dir: str, *, threads: int = 1, external_stop_event=None, start_index: int = 0, end_index: int = None, allow_chunks: bool = True, reupload: bool = False, only_missing: bool = False, from_line_count: bool = False, excluded_files: Set[str] = None, inhibit_datastore_patch_indexes: bool = False) None
Perform all the upload requests.
- Parameters:
ckan
resources_base_dir
threads
external_stop_event
only_missing
start_index
end_index
- Returns:
ckanapi_harvesters.builder.mapper_datastore module
Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements functions to convert formats between database and local files.
- class ckanapi_harvesters.builder.mapper_datastore.DataSchemeConversion(*, df_upload_fun: Callable[[ListRecords | DataFrame, Any], ListRecords | DataFrame] = None, df_download_fun: Callable[[ListRecords | DataFrame, Any], ListRecords | DataFrame] = None)
Bases:
object
- __init__(*, df_upload_fun: Callable[[ListRecords | DataFrame, Any], ListRecords | DataFrame] = None, df_download_fun: Callable[[ListRecords | DataFrame, Any], ListRecords | DataFrame] = None)
Class to convert between local data formats and database formats
- Parameters:
df_upload_fun
df_download_fun
- copy()
- df_download_alter(df_database: DataFrame | List[dict] | Any, file_query: dict = None, fields: Dict[str, CkanField] = None, mapper_kwargs: dict = None, **kwargs) DataFrame | ListRecords
Apply the user-defined df_download_fun if present. df_download_fun should be the reverse function of df_upload_fun.
- Parameters:
df_database – the downloaded dataframe from the database
- Returns:
the dataframe ready to save, converted in the local format
- df_upload_alter(df_local: DataFrame | List[dict] | Any, *, total_lines_read: int, fields: Dict[str, CkanField], file_query: str, mapper_kwargs: dict = None, **kwargs) DataFrame | ListRecords
Apply the user-defined df_upload_fun if present
- Parameters:
df_local – the DataFrame to upload
total_lines_read – total number of lines read, including the current DataFrame
fields – the known fields metadata.
file_query – the name of the file the data originates from (or query)
mapper_kwargs – extra arguments passed to df_upload_fun
- Returns:
the DataFrame ready for upload, converted in the format of the database
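To illustrate what a pair of mutually inverse conversion functions might look like, here is a minimal sketch operating on lists of records (plain dicts) rather than DataFrames. The names `upload_fun` and `download_fun` are hypothetical, the empty-string convention is an assumption chosen for the example, and the round trip only holds when the local data contains no None values.

```python
from typing import Any, Dict, List

Records = List[Dict[str, Any]]


def upload_fun(records: Records, **kwargs: Any) -> Records:
    """Hypothetical local -> database conversion: empty strings become None."""
    return [{key: (None if value == "" else value) for key, value in row.items()}
            for row in records]


def download_fun(records: Records, **kwargs: Any) -> Records:
    """Hypothetical database -> local conversion: None becomes an empty string."""
    return [{key: ("" if value is None else value) for key, value in row.items()}
            for row in records]


if __name__ == "__main__":
    local = [{"id": 1, "name": ""}, {"id": 2, "name": "b"}]
    in_database = upload_fun(local, total_lines_read=2)
    # The download function undoes the upload function on this data
    print(download_fun(in_database) == local)
```

Both functions accept arbitrary keyword arguments so that extra context (fields, file_query, and so on) can be passed without changing their signatures.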
- get_necessary_fields() Set[str]
ckanapi_harvesters.builder.mapper_datastore_multi module
Code to define the link between a file and a database query in the context of a large DataStore defined by the concatenation of multiple files.
- class ckanapi_harvesters.builder.mapper_datastore_multi.RequestFileMapperABC(*, df_upload_fun: Callable[[DataFrame], Any] = None, df_download_fun: Callable[[DataFrame], Any] = None)
Bases:
RequestMapperABC,ABC
Class to define how to reconstruct a file from the full dataset. This abstract class is oriented to treating files in the file system.
- get_file_name_of_query(file_query: dict) str
- class ckanapi_harvesters.builder.mapper_datastore_multi.RequestFileMapperIndexKeys(group_by_keys: List[str], sort_by_keys: List[str] = None, *, df_upload_fun: Callable[[DataFrame], Any] = None, df_download_fun: Callable[[DataFrame], Any] = None)
Bases:
RequestFileMapperABC
In this implementation, a file is defined by a combination of file_keys values. It is optionally ordered by an index_keys field, which makes it possible to restart a transfer after an interruption. By default, index_keys is the last field of the primary key and file_keys are the fields preceding it in the primary key.
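The default split of a primary key into file keys and an index key can be pictured with the sketch below, which groups records by the leading key fields and orders each group by the last one. `split_into_files` is a hypothetical helper written for this illustration and works on lists of dicts, not on the package's own data structures.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence, Tuple

Record = Dict[str, Any]


def split_into_files(records: List[Record],
                     primary_key: Sequence[str]) -> Dict[Tuple[Any, ...], List[Record]]:
    """Hypothetical sketch: one 'file' per combination of the leading key fields."""
    # The last field of the primary key orders rows; the preceding ones define files
    *file_keys, index_key = primary_key
    groups: Dict[Tuple[Any, ...], List[Record]] = defaultdict(list)
    for row in records:
        groups[tuple(row[key] for key in file_keys)].append(row)
    # Sorting by the index key gives a stable position from which to resume
    return {combo: sorted(rows, key=lambda r: r[index_key])
            for combo, rows in groups.items()}


if __name__ == "__main__":
    rows = [{"station": "A", "t": 2}, {"station": "B", "t": 1}, {"station": "A", "t": 1}]
    for combo, file_rows in split_into_files(rows, ["station", "t"]).items():
        print(combo, file_rows)
```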
- df_upload_alter(df_local: DataFrame | List[dict] | Any, *, total_lines_read: int, fields: Dict[str, CkanField], file_query: str, mapper_kwargs: dict = None, **kwargs) DataFrame
Apply the user-defined df_upload_fun if present
- Parameters:
df_local – the DataFrame to upload
total_lines_read – total number of lines read, including the current DataFrame
fields – the known fields metadata.
file_query – the name of the file the data originates from (or query)
mapper_kwargs – extra arguments passed to df_upload_fun
- Returns:
the DataFrame ready for upload, converted in the format of the database
- download_file_query_list(ckan: CkanApiManage, resource_id: str) List[dict]
Function to list the {key: value} combinations present in the CKAN datastore to reconstruct the file database before downloading.
- Parameters:
ckan
resource_id
- Returns:
a list of query arguments defining each file
- get_file_name_of_query(file_query: dict) str
- get_file_query_of_df(df_upload: DataFrame) dict | None
Return the dict of {field: value} combinations representing the arguments of the query to reconstruct a file
- Parameters:
df_upload – the DataFrame representing the file
- Returns:
- get_necessary_fields() Set[str]
- last_inserted_index_request(ckan: CkanApiManage, resource_id: str, file_query: dict, df_upload: DataFrame) Tuple[int, bool, int, DataFrame]
Knowing the data which needs to be uploaded, this function compares the last known row(s) to the dataframe and returns the index to restart the upload process.
- Parameters:
ckan
resource_id
file_query – a dict of {field: value} combinations representing the arguments of the query to reconstruct a file
df_upload – the known data corresponding to the file_query to be sent
- Returns:
a tuple (i_restart, upload_needed, row_count, df_last_row):
- i_restart: the last known index in the dataframe
- upload_needed: a boolean indicating if an update is necessary
- row_count: the number of rows corresponding to the file_query
- df_last_row: the last found row in the dataframe
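One possible reading of the restart logic is sketched below with plain Python records. The helper `restart_index` is hypothetical; it assumes the local rows are sorted by a single index field, and it returns the position of the first row not yet present in the database, which may differ from the exact convention used by the package for i_restart.

```python
from typing import Any, Dict, List, Optional, Tuple

Record = Dict[str, Any]


def restart_index(local_rows: List[Record], last_db_row: Optional[Record],
                  index_key: str) -> Tuple[int, bool]:
    """Hypothetical sketch: return (i_restart, upload_needed).

    local_rows must be sorted by index_key in ascending order.
    """
    if last_db_row is None:
        # Nothing uploaded yet: start from the beginning if there is any data
        return 0, len(local_rows) > 0
    last_value = last_db_row[index_key]
    for i, row in enumerate(local_rows):
        if row[index_key] > last_value:
            return i, True  # first row newer than what the database holds
    return len(local_rows), False  # everything is already uploaded


if __name__ == "__main__":
    rows = [{"t": value} for value in range(1, 6)]
    print(restart_index(rows, {"t": 3}, "t"))
```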
- last_inserted_row_request(ckan: CkanApiManage, resource_id: str, file_query: dict) DataFrame | None
Request in CKAN the last inserted row(s) corresponding to a given file_query
- Parameters:
ckan
resource_id
file_query – a dict of {field: value} combinations representing the arguments of the query to reconstruct a file
- Returns:
The last row(s) in the database or None (if no specific method was defined)
- last_rows_limit = 1
- class ckanapi_harvesters.builder.mapper_datastore_multi.RequestFileMapperLimit(limit: int = None, *, df_upload_fun: Callable[[DataFrame], Any] = None, df_download_fun: Callable[[DataFrame], Any] = None)
Bases:
RequestFileMapperABC
In this implementation, a file is defined by a certain amount of rows
- default_limit = 10000
- download_file_query(ckan: CkanApiManage, resource_id: str, file_query: dict, *, progress_callback: CkanProgressCallbackABC) Generator[DataFrame, Any, None]
- download_file_query_list(ckan: CkanApiManage, resource_id: str) List[dict]
Function to list the {key: value} combinations present in the CKAN datastore to reconstruct the file database before downloading.
- Parameters:
ckan
resource_id
- Returns:
a list of query arguments defining each file
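When a file is defined by a fixed number of rows, the list of queries can be thought of as a sequence of row windows. The sketch below builds such a list; the function `limit_queries` and the dict keys `offset` and `limit` are assumptions made for illustration and are not confirmed to match the query arguments the package actually produces.

```python
from typing import Dict, List


def limit_queries(total_rows: int, limit: int = 10000) -> List[Dict[str, int]]:
    """Hypothetical sketch: one query per window of at most `limit` rows."""
    return [{"offset": offset, "limit": min(limit, total_rows - offset)}
            for offset in range(0, total_rows, limit)]


if __name__ == "__main__":
    for query in limit_queries(25000):
        print(query)
```

The last window is truncated to the rows that remain, so the windows cover the table exactly once.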
- get_file_name_of_query(file_query: dict) str
- class ckanapi_harvesters.builder.mapper_datastore_multi.RequestFileMapperUser(file_query_list: Iterable[Tuple[str, dict]], *, df_upload_fun: Callable[[DataFrame], Any] = None, df_download_fun: Callable[[DataFrame], Any] = None)
Bases:
RequestFileMapperABC
Use this basic implementation if the file query list is provided by the user or if the builder is only used to upload files.
- download_file_query_list(ckan: CkanApiManage, resource_id: str) List[dict]
Function to list the {key: value} combinations present in the CKAN datastore to reconstruct the file database before downloading.
- Parameters:
ckan
resource_id
- Returns:
a list of query arguments defining each file
- class ckanapi_harvesters.builder.mapper_datastore_multi.RequestMapperABC(*, df_upload_fun: Callable[[DataFrame], Any] = None, df_download_fun: Callable[[DataFrame], Any] = None)
Bases:
DataSchemeConversion,ABC
Class to define how to reconstruct a file from the full dataset. This class overloads some data scheme conversion class functions. This abstract class can be derived to specify custom data treatments.
- download_file_query(ckan: CkanApiManage, resource_id: str, file_query: dict, *, progress_callback: CkanProgressCallbackABC) Generator[DataFrame, Any, None]
- download_file_query_generator(ckan: CkanApiManage, resource_id: str) Generator[dict, Any, None]
Generator for download_file_query_list which can be customized
- Parameters:
ckan
resource_id
- Returns:
- abstractmethod download_file_query_list(ckan: CkanApiManage, resource_id: str) List[dict]
Function to list the {key: value} combinations present in the CKAN datastore to reconstruct the file database before downloading.
- Parameters:
ckan
resource_id
- Returns:
a list of query arguments defining each file
- get_file_query_of_df(df_upload: DataFrame) dict | None
Return the dict of {field: value} combinations representing the arguments of the query to reconstruct a file
- Parameters:
df_upload – the DataFrame representing the file
- Returns:
- last_inserted_index_request(ckan: CkanApiManage, resource_id: str, file_query: dict, df_upload: DataFrame) Tuple[int, bool, int, DataFrame | None]
Knowing the data which needs to be uploaded, this function compares the last known row(s) to the dataframe and returns the index to restart the upload process.
- Parameters:
ckan
resource_id
file_query – a dict of {field: value} combinations representing the arguments of the query to reconstruct a file
df_upload – the known data corresponding to the file_query to be sent
- Returns:
a tuple (i_restart, upload_needed, row_count, df_last_row):
- i_restart: the last known index in the dataframe
- upload_needed: a boolean indicating if an update is necessary
- row_count: the number of rows corresponding to the file_query
- df_last_row: the last found row in the dataframe
- last_inserted_row_request(ckan: CkanApiManage, resource_id: str, file_query: dict) DataFrame | None
Request in CKAN the last inserted row(s) corresponding to a given file_query
- Parameters:
ckan
resource_id
file_query – a dict of {field: value} combinations representing the arguments of the query to reconstruct a file
- Returns:
The last row(s) in the database or None (if no specific method was defined)
- ckanapi_harvesters.builder.mapper_datastore_multi.default_file_mapper_from_primary_key(primary_key: List[str] = None, file_query_list: Iterable[Tuple[str, dict]] = None) RequestFileMapperABC
ckanapi_harvesters.builder.mapper_datastore_prototypes module
Code to upload metadata to the CKAN server to create/update an existing package. The metadata is defined by the user in an Excel worksheet. This file implements functions to convert formats between database and local files.
- ckanapi_harvesters.builder.mapper_datastore_prototypes.download_function_example(df_download: DataFrame, *, fields: Dict[str, CkanField] = None, file_query: str = None, **kwargs) DataFrame | List[dict]
- ckanapi_harvesters.builder.mapper_datastore_prototypes.replace_empty_str(df_local: DataFrame | List[dict], *, fields: Dict[str, CkanField] = None, file_query: str = None, total_lines_read: int = None, **kwargs) DataFrame | List[dict]
ckanapi_harvesters.builder.specific_builder_abc module
Abstract class to implement specific builders from code
- class ckanapi_harvesters.builder.specific_builder_abc.SpecificBuilderABC(ckan: CkanApiManage, package_name: str, organization_name: str, *, title: str = None, description: str = None, private: bool = None, state: CkanState = None, version: str = None, url: str = None, tags: List[str] = None, license_name: str = None)
Bases:
BuilderPackageWithHarvesters,ABC
Module contents
Section of the package dedicated to the initialization of a CKAN package