ckanapi_harvesters.ckan_api package

Subpackages

Submodules

ckanapi_harvesters.ckan_api.ckan_api module

Alias to most complete CkanApi implementation

ckanapi_harvesters.ckan_api.ckan_api_0_base module

class ckanapi_harvesters.ckan_api.ckan_api_0_base.CkanApiABC: Bases: ABC

class ckanapi_harvesters.ckan_api.ckan_api_0_base.CkanApiBase(url: str = None, *, proxies: str | dict | ProxyConfig = None, apikey: str | CkanApiKey = None, apikey_file: str = None, owner_org: str = None, params: CkanApiParamsBasic = None, identifier=None)

Bases: CkanApiABC

CKAN Database API interface to CKAN server with helper functions using pandas DataFrames. This class implements the basic parameters and request functions.

CKAN_URL_ENVIRON = 'CKAN_URL'

__init__(url: str = None, *, proxies: str | dict | ProxyConfig = None, apikey: str | CkanApiKey = None, apikey_file: str = None, owner_org: str = None, params: CkanApiParamsBasic = None, identifier=None)

CKAN Database API interface to CKAN server with helper functions using pandas DataFrames.

Parameters:

url – url of the CKAN server
proxies – proxies to use for requests
apikey – way to provide the API key directly (optional)
apikey_file – path to a file containing a valid API key in the first line of text (optional)
owner_org – name of the organization to limit package_search (optional)
params – other connection/behavior parameters
identifier – identifier of the ckan client

__str__() → str

String representation of the instance, for debugging purposes.

Returns:: URL representing the CKAN server

_api_action_request(action: str, *, method: RequestType, params: dict = None, headers: dict = None, data: dict | str | bytes = None, json: dict = None, files: List[tuple] = None, timeout: float = None, _attempt_counts: int = 0, _attempt_traceback: List[str] = None) → CkanActionResponse

Send API action request and return response.

Parameters:

action – action name
method – GET / POST
params – params to set in the url
data – information to encode in the request body (only for POST method)
json – information to encode as JSON in the request body (only for POST method)
files – files to upload in the request (only for POST method)
headers – headers for the request (authentication tokens are added by the function)
timeout – request timeout in seconds
_attempt_counts – internal argument in case of re-post of the request to count retries
_attempt_traceback – internal argument in case of re-post of the request to list error history

Returns:

_ckan_url_request(path: str, *, method: RequestType, params: dict = None, headers: dict = None, data: dict = None, json: dict = None, files: List[tuple] = None, timeout: float = None) → Response

Send request to server and return response.

Parameters:

path – relative path to server url
method – GET / POST
params – params to set in the url
data – information to encode in the request body (only for POST method)
headers – headers for the request (authentication tokens are added by the function)

Returns:

_cli_ckan_args_apply(args: Namespace, *, base_dir: str = None, error_not_found: bool = True, default_proxies: dict = None, proxy_headers: dict = None) → None

Apply the arguments parsed by the argument parser defined by _setup_cli_ckan_parser

Parameters:

args
base_dir – base directory to find the CKAN API key file, if a relative path is provided (recommended: leave None to use cwd)
error_not_found – option to raise an exception if the CKAN API key file is not found
default_proxies – proxies used if proxies="default"
proxy_headers – headers used to access the proxies, generally for authentication

Returns:

_get_api_url(category: str = None)

Returns the base API url and appends the category

Parameters:: category – usually, “action”
Returns:

_init_session(*, internal: bool = False)

Initialize the session objects which are used to perform requests with this CKAN instance. This method can be overloaded to fit your needs (proxies, certificates, cookies, headers, etc.).

Parameters:: internal
Returns:

_prepare_headers(headers: dict = None, include_ckan_auth: bool = False) → dict

Prepare headers for a request. If the request is destined to the CKAN server, include authentication headers, if API key was provided.

Parameters:

headers – initial headers
include_ckan_auth – boolean to include CKAN authentication headers

Returns:
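
As an illustration of the behavior described for _prepare_headers, here is a minimal standalone sketch. It is not the library's implementation: the function name prepare_headers, the explicit apikey argument and the choice of the Authorization header (the header documented by CKAN for API tokens) are assumptions made for this example.

```python
from typing import Optional


def prepare_headers(headers: Optional[dict] = None,
                    include_ckan_auth: bool = False,
                    apikey: Optional[str] = None) -> dict:
    """Hypothetical helper: return a new headers dict, optionally with the CKAN token."""
    # Copy so that the caller's dict is never mutated.
    prepared = dict(headers) if headers else {}
    # Attach the token only for requests destined to the CKAN server,
    # and only when an API key was actually provided.
    if include_ckan_auth and apikey:
        prepared["Authorization"] = apikey
    return prepared


base = {"Accept": "application/json"}
internal = prepare_headers(base, include_ckan_auth=True, apikey="my-secret-token")
external = prepare_headers(base, include_ckan_auth=False, apikey="my-secret-token")
```

Keeping the authentication step behind an explicit flag means a request to a third-party host never carries the CKAN token by accident.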

_request_all_results_df(api_fun: Callable, *, params: dict = None, list_attrs: bool = True, limit_per_request: int = None, offset: int = 0, total_limit: int = None, requests_limit: int = None, search_all: bool = True, progress_callback: CkanProgressCallbackABC = None, **kwargs) → DataFrame

Repeats a limited-length request, advancing the offset parameter each time, until no more data is returned. The DataFrame implementation returns the concatenation of the DataFrames returned by the individual function calls.

Parameters:

api_fun – function to call, typically a unitary request function
params – api_fun must accept params argument in order to transmit other values and enforce the offset parameter
limit_per_request – api_fun must accept limit argument in order to update the limit value
offset – api_fun must accept offset argument in order to update the offset value
search_all – if False, only the first request is performed
list_attrs – option to aggregate the DataFrame attrs field into lists (the value False has not been tested)
kwargs – additional keyword arguments to pass to api_fun

Returns:

_request_all_results_list(api_fun: Callable, *, params: dict = None, limit_per_request: int = None, offset: int = 0, total_limit: int = None, requests_limit: int = None, search_all: bool = True, progress_callback: CkanProgressCallbackABC = None, **kwargs) → List[CkanActionResponse] | list

Repeats a limited-length request, advancing the offset parameter each time, until no more data is returned. The list implementation returns the list of the return values of the individual function calls.

Parameters:

api_fun – function to call, typically a unitary request function
params – api_fun must accept params argument in order to transmit other values and enforce the offset parameter
limit_per_request – api_fun must accept limit argument in order to update the limit value
offset – api_fun must accept offset argument in order to update the offset value
search_all – if False, only the first request is performed
kwargs – additional keyword arguments to pass to api_fun

Returns:

_request_all_results_page_generator(api_fun: Callable, *, params: dict = None, limit_per_request: int = None, offset: int = 0, requests_limit: int = None, total_limit: int = None, search_all: bool = True, default_limit_per_request: bool = True, progress_callback: CkanProgressCallbackABC = None, **kwargs) → Generator[Any, Any, None]

Repeats a limited-length request, advancing the offset parameter each time, until no more data is returned. Lazy auxiliary function which yields one result per request.

Parameters:

api_fun – function to call, typically a unitary request function
params – api_fun must accept params argument in order to transmit other values and enforce the offset parameter
limit_per_request – api_fun must accept limit argument in order to update the limit value. The limit represents the maximum number of records per call of the api_fun (also referred to as a request or page).
offset – api_fun must accept offset argument in order to update the offset value
total_limit – strict limitation of the total number of records received (the result is truncated if it exceeds the limit). The count starts after the initial offset.
requests_limit – limit the number of requests (pages) of the api_fun.
search_all – if False, only the first request is performed
kwargs – additional keyword arguments to pass to api_fun

Returns:
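
The interplay between limit_per_request, offset, total_limit, requests_limit and search_all can be sketched with a small standalone generator. This is only an illustration under assumptions, not the library's code: page_generator and fake_api are hypothetical names, api_fun is assumed to return a plain list, and the loop is assumed to stop on the first empty page.

```python
from typing import Any, Callable, Generator, List, Optional


def page_generator(api_fun: Callable[..., List[Any]], *,
                   limit_per_request: int = 100, offset: int = 0,
                   total_limit: Optional[int] = None,
                   requests_limit: Optional[int] = None,
                   search_all: bool = True) -> Generator[List[Any], None, None]:
    """Hypothetical sketch: yield one page per call of api_fun."""
    received = 0  # records yielded so far, counted after the initial offset
    requests = 0  # number of pages requested so far
    while True:
        if requests_limit is not None and requests >= requests_limit:
            return
        limit = limit_per_request
        if total_limit is not None:
            remaining = total_limit - received
            if remaining <= 0:
                return
            # Shrink the last page so the total never exceeds total_limit.
            limit = min(limit, remaining)
        page = api_fun(limit=limit, offset=offset + received)
        requests += 1
        if not page:  # no more data transmitted
            return
        yield page
        received += len(page)
        if not search_all:  # only the first request is performed
            return


RECORDS = list(range(25))


def fake_api(*, limit: int, offset: int) -> list:
    """Stand-in for a unitary request function."""
    return RECORDS[offset:offset + limit]


all_pages = list(page_generator(fake_api, limit_per_request=10))
capped = list(page_generator(fake_api, limit_per_request=10, total_limit=12))
first_only = list(page_generator(fake_api, limit_per_request=10, search_all=False))
two_requests = list(page_generator(fake_api, limit_per_request=10, requests_limit=2))
```

The list and DataFrame variants described above can be seen as consumers of such a generator: one collects the pages, the other concatenates them.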

_setup_cli_ckan_parser(parser: ArgumentParser = None) → ArgumentParser

Define or add CLI arguments to initialize a CKAN API connection. Parser help message:

CKAN API connection parameters initialization

options:

-h, --help: show this help message and exit
--ckan-url CKAN_URL: CKAN URL
--apikey APIKEY: CKAN API key
--apikey-file APIKEY_FILE: Path to a file containing the CKAN API key (first line)
--policy-file POLICY_FILE: Path to a file containing the CKAN data format policy (json format)
--owner-org OWNER_ORG: CKAN Owner Organization
--default-limit DEFAULT_LIMIT: Default number of rows per request
--verbose VERBOSE: Option to set verbosity

Parameters:: parser – option to provide an existing parser to add the specific fields needed to initialize a CKAN API connection
Returns:
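
A parser exposing the options of the help message above could be assembled with argparse roughly as follows. This is a sketch, not the library's source: the function name setup_cli_ckan_parser_sketch is hypothetical, the flag names and help strings are copied from the help message, and the int type of --default-limit is an assumption.

```python
import argparse
from typing import Optional


def setup_cli_ckan_parser_sketch(
        parser: Optional[argparse.ArgumentParser] = None) -> argparse.ArgumentParser:
    """Hypothetical sketch: create a parser, or add the CKAN options to an existing one."""
    if parser is None:
        parser = argparse.ArgumentParser(
            description="CKAN API connection parameters initialization")
    parser.add_argument("--ckan-url", help="CKAN URL")
    parser.add_argument("--apikey", help="CKAN API key")
    parser.add_argument("--apikey-file",
                        help="Path to a file containing the CKAN API key (first line)")
    parser.add_argument("--policy-file",
                        help="Path to a file containing the CKAN data format policy (json format)")
    parser.add_argument("--owner-org", help="CKAN Owner Organization")
    # Assumption: the limit is parsed as an integer.
    parser.add_argument("--default-limit", type=int,
                        help="Default number of rows per request")
    parser.add_argument("--verbose", help="Option to set verbosity")
    return parser


# Standalone use.
args = setup_cli_ckan_parser_sketch().parse_args(
    ["--ckan-url", "https://ckan.example.org", "--default-limit", "500"])

# Extending an existing application parser with the CKAN options.
app_parser = argparse.ArgumentParser(prog="harvest")
app_parser.add_argument("--dry-run", action="store_true")
app_args = setup_cli_ckan_parser_sketch(app_parser).parse_args(
    ["--dry-run", "--owner-org", "my-org"])
```

Accepting an optional existing parser lets an application merge the CKAN connection options with its own arguments in a single command line.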

api_action_call(action: str, *, method: RequestType, params: dict = None, headers: dict = None, data: dict = None, json: dict = None, files: List[tuple] = None) → CkanActionResponse

api_help_show(action_name: str, *, print_output: bool = True) → str

API help command on a given action.

Parameters:

action_name
print_output – Option to print the output in the command line

Returns:

property apikey: CkanApiKey

connect()

copy(new_identifier: str = None, *, dest=None): Returns a copy of the current instance. Useful to use an initialized ckan object in a multithreaded context. Each thread would have its own copy. It is recommended to purge the last response before doing a copy (with purge_map=False)

disconnect()

download_url_proxy(url: str, *, method: str = None, auth_if_ckan: bool = None, proxies: dict = None, headers: dict = None, auth: AuthBase | Tuple[str, str] = None, verify: bool | str | None = None, stream: bool = False, timeout: float = None) → Response

Download a URL using the CKAN parameters (proxy, authentication etc.)

Parameters:

url
proxies
headers

Returns:

download_url_proxy_test_head(url: str, *, raise_error: bool = False, auth_if_ckan: bool = None, proxies: dict = None, headers: dict = None, auth: AuthBase | Tuple[str, str] = None, verify: bool | str | None = None, context: str = None, timeout: float = None) → None | ContextErrorLevelMessage

This sends a HEAD request to the url using the CKAN connection parameters via download_url_proxy. The resource is not downloaded but the headers indicate if the url is valid.

Returns:: None if successful

full_unlock(unlock: bool = True, *, no_ca: bool = None, external_url_resource_download: bool = None) → None

Function to unlock full capabilities of the CKAN API

Parameters:: unlock
Returns:

init_from_environ(*, init_api_key: bool = True, error_not_found: bool = False) → None

Initialize CKAN from environment variables.

CKAN_URL for the url of the CKAN server.

And optionally:

- CKAN_API_KEY: for the raw API key (it is not recommended to store the API key in an environment variable)
- CKAN_API_KEY_FILE: path to a file containing a valid API key in the first line of text

Parameters:: error_not_found – raise an error if the API key file was not found
Returns:
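
The three environment variables named above can be read as in the following sketch. The helper read_ckan_environ is hypothetical and only shows the lookup. It takes the environment as a mapping so that the example does not depend on the variables actually set on the machine.

```python
import os
from typing import Mapping, Optional, Tuple


def read_ckan_environ(
        environ: Mapping[str, str] = os.environ
) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Hypothetical helper: return (url, apikey, apikey_file), None when a variable is unset."""
    url = environ.get("CKAN_URL")
    apikey = environ.get("CKAN_API_KEY")            # raw key: not recommended
    apikey_file = environ.get("CKAN_API_KEY_FILE")  # preferred: path to a key file
    return url, apikey, apikey_file


example_env = {
    "CKAN_URL": "https://ckan.example.org",
    "CKAN_API_KEY_FILE": "/home/user/.ckan/apikey.txt",
}
url, apikey, apikey_file = read_ckan_environ(example_env)
```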

initialize_from_cli_args(*, args: Sequence[str] = None, base_dir: str = None, error_not_found: bool = True, parser: ArgumentParser = None, default_proxies: dict = None, proxy_headers: dict = None) → None

Initialize the CKAN API connection from command line arguments.

Parameters:: args – Option to provide arguments from another source.
Returns:

initialize_from_options_string(options_string: str = None, base_dir: str = None, error_not_found: bool = True, parser: ArgumentParser = None, default_proxies: dict = None, proxy_headers: dict = None) → None

input_cli_args(*, base_dir: str = None, error_not_found: bool = True, only_if_necessary: bool = False, default_proxies: dict = None, proxy_headers: dict = None)

Prompt for the initialization parameters, in command-line format, in the console window.

Returns:

input_missing_info(*, base_dir: str = None, input_args: bool = False, input_args_if_necessary: bool = False, input_apikey: bool = True, error_not_found: bool = True)

Ask the user for information in the console window.

Returns:

is_url_internal(url: str) → bool

Tests whether a url points to the same server as the CKAN url.

Parameters:: url
Returns:
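
One possible way to decide whether a url points to the same server is to compare host names and effective ports. The sketch below relies only on urllib.parse. The function points_to_same_server and the comparison rule (scheme ignored, default ports filled in) are assumptions made for illustration and may differ from what is_url_internal actually does.

```python
from urllib.parse import urlsplit

_DEFAULT_PORTS = {"http": 80, "https": 443}


def _host_and_port(url: str):
    """Return (lower-cased host, effective port) of a URL."""
    parts = urlsplit(url)
    return parts.hostname, parts.port or _DEFAULT_PORTS.get(parts.scheme)


def points_to_same_server(url: str, ckan_url: str) -> bool:
    """Hypothetical check: True if url targets the same host and port as ckan_url."""
    return _host_and_port(url) == _host_and_port(ckan_url)


ckan = "https://ckan.example.org"
same = points_to_same_server("https://CKAN.example.org:443/dataset/rainfall", ckan)
other = points_to_same_server("https://data.other.org/files/rainfall.csv", ckan)
```

A check of this kind is what makes it possible to attach authentication headers only to requests that stay on the CKAN server.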

load_apikey(apikey_file: str = None, base_dir: str = None, error_not_found: bool = True)

Load the CKAN API key from file. The file should contain a valid API key in the first line of text.

Parameters:

apikey_file – API key file (optional if specified at the creation of the object)
base_dir – base directory, if the apikey_file is a relative path

Returns:
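
The file convention described above (a valid API key on the first line of text) can be illustrated with a short standalone reader. The function read_apikey_file is hypothetical. Its handling of base_dir and error_not_found mirrors the parameter descriptions but is an assumption about the behavior, not the library's code.

```python
import os
import tempfile
from typing import Optional


def read_apikey_file(apikey_file: str, base_dir: Optional[str] = None,
                     error_not_found: bool = True) -> Optional[str]:
    """Hypothetical reader: return the first line of the key file, stripped."""
    path = apikey_file
    # A relative path is resolved against base_dir when one is given.
    if base_dir is not None and not os.path.isabs(path):
        path = os.path.join(base_dir, path)
    if not os.path.isfile(path):
        if error_not_found:
            raise FileNotFoundError(path)
        return None
    with open(path, "r", encoding="utf-8") as f:
        return f.readline().strip()


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "apikey.txt"), "w", encoding="utf-8") as f:
        f.write("my-secret-token\nanything after the first line is ignored\n")
    key = read_apikey_file("apikey.txt", base_dir=tmp)
    missing = read_apikey_file("absent.txt", base_dir=tmp, error_not_found=False)
```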

prepare_arguments_for_url_download_request(url: str, *, auth_if_ckan: bool = None, headers: dict = None, verify: bool | str | None = None) → Tuple[bool, dict]

Include CKAN authentication headers only if the URL points to the CKAN server.

Parameters:

url – target URL
headers – initial headers
auth_if_ckan – option to include CKAN authentication headers if the url is recognized as part of the CKAN server.

Returns:

prepare_for_multithreading(mode_reduced: bool = True) → None

This method disables unnecessary writes to this object. It is recommended to enable the reduced writes mode in a multithreaded context. Do not forget to reset sessions at the beginning of each thread.

Parameters:: mode_reduced
Returns:

print_help_cli(display: bool = True) → str

purge() → None: Erase temporary data stored in this object

set_limits(limit_read: int | None) → None

set_limits_per_request(limit_read: int | None) → None

Set default query limits. If only one argument is provided, it applies to both limits.

Parameters:: limit_read – default limit for read requests
Returns:

set_proxies(proxies: str | dict | ProxyConfig, *, default_proxies: dict = None, proxy_headers: dict = None) → None

Set up the proxy configuration

Parameters:: proxies – string or proxies dict or ProxyConfig object.

If a string is provided, it must be an url to a proxy or one of the following values:

“environ”: use the proxies specified in the environment variables “http_proxy” and “https_proxy”
“noproxy”: do not use any proxies
“unspecified”: do not specify the proxies
“default”: use value provided by default_proxies

Parameters:

default_proxies – proxies used if proxies="default"
proxy_headers – headers used to access the proxies, generally for authentication

Returns:
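
The string values accepted for proxies can be read as a small dispatch table. The sketch below shows one way to resolve them into a scheme-to-URL dict. The function resolve_proxies is hypothetical, ProxyConfig objects are left out, and the concrete return values chosen for "noproxy" and "unspecified" are assumptions for illustration only.

```python
import os
from typing import Mapping, Optional, Union


def resolve_proxies(proxies: Union[str, dict], *,
                    default_proxies: Optional[dict] = None,
                    environ: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Hypothetical resolver for the string values listed above."""
    if isinstance(proxies, dict):
        return dict(proxies)
    if proxies == "environ":
        candidates = {"http": environ.get("http_proxy"),
                      "https": environ.get("https_proxy")}
        return {scheme: url for scheme, url in candidates.items() if url}
    if proxies == "noproxy":
        # Assumption: an explicit "no proxy" is represented by None per scheme.
        return {"http": None, "https": None}
    if proxies == "unspecified":
        # Assumption: None means "leave the decision to the HTTP library".
        return None
    if proxies == "default":
        if default_proxies is None:
            raise ValueError('proxies="default" requires default_proxies')
        return dict(default_proxies)
    # Any other string is treated as the URL of a proxy for both schemes.
    return {"http": proxies, "https": proxies}


env = {"http_proxy": "http://proxy.local:3128"}
from_env = resolve_proxies("environ", environ=env)
from_default = resolve_proxies("default", default_proxies={"https": "http://corp:8080"})
from_url = resolve_proxies("http://proxy.example.org:8080")
```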

set_requests_delay(time_between_requests: int) → None

Set delay between requests in seconds.

Parameters:: time_between_requests – delay between requests in seconds
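
A delay between requests is commonly enforced by remembering when the previous request started and sleeping for the remainder. The class below is a generic sketch of that idea, not the library's mechanism. RequestThrottle and FakeClock are hypothetical names, and the clock and sleep functions are injectable so that the example runs without actually waiting.

```python
import time
from typing import Callable, Optional


class RequestThrottle:
    """Hypothetical helper enforcing a minimum delay between successive requests."""

    def __init__(self, time_between_requests: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._delay = time_between_requests
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> None:
        """Call right before each request; sleeps only if the delay has not elapsed."""
        now = self._clock()
        if self._last is not None:
            remaining = self._delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


class FakeClock:
    """Deterministic clock used to demonstrate the throttle without real waiting."""

    def __init__(self) -> None:
        self.now = 0.0
        self.slept = []

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.slept.append(seconds)
        self.now += seconds


fake = FakeClock()
throttle = RequestThrottle(2, clock=fake.time, sleep=fake.sleep)
throttle.wait()   # first request: nothing to wait for
fake.now += 0.5   # the request itself took 0.5 s
throttle.wait()   # sleeps for the remaining 1.5 s
fake.now += 3.0   # a slow request, longer than the delay
throttle.wait()   # no sleep needed
```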

set_requests_timeout(requests_timeout: float, multi_requests_timeout=None) → None

Set timeout for requests.

Parameters:

requests_timeout – timeout for each request (seconds)
multi_requests_timeout – timeout for grouped request (seconds)

Returns:

set_verbosity(verbosity: bool = True, verbose_extra: bool = None) → None

Enable/disable full verbose output

Parameters:: verbosity – boolean. Cannot be None
Returns:

test_ckan_url_reachable(raise_error: bool = False) → bool: Test if the CKAN URL is reachable with a HEAD request. This does not check it is really a CKAN server and does not check authentication.

static unlock_external_url_resource_download(value: bool = True): This function enables the download of resources external from the CKAN server.

static unlock_no_ca(value: bool = True)

This function enables you to disable the CA verification of the CKAN server.

__Warning__: Only allow in a local environment!

property url: str

ckanapi_harvesters.ckan_api.ckan_api_1_map module

class ckanapi_harvesters.ckan_api.ckan_api_1_map.CkanApiMap(url: str = None, *, proxies: str | dict | ProxyConfig = None, apikey: str | CkanApiKey = None, apikey_file: str = None, owner_org: str = None, params: CkanApiParamsBasic = None, map: CkanMap = None, identifier=None)

Bases: CkanApiBase

CKAN Database API interface to CKAN server with helper functions using pandas DataFrames. This class implements the resource mapping capabilities to obtain resource ids necessary for the requests.

For API reference:

- Basic API: https://docs.ckan.org/en/latest/api/
- DataStore extension API: https://docs.ckan.org/en/latest/maintaining/datastore.html

__init__(url: str = None, *, proxies: str | dict | ProxyConfig = None, apikey: str | CkanApiKey = None, apikey_file: str = None, owner_org: str = None, params: CkanApiParamsBasic = None, map: CkanMap = None, identifier=None)

CKAN Database API interface to CKAN server with helper functions using pandas DataFrames.

Parameters:

url – url of the CKAN server
proxies – proxies to use for requests
apikey – way to provide the API key directly (optional)
apikey_file – path to a file containing a valid API key in the first line of text (optional)
owner_org – name of the organization to limit package_search (optional)
params – other connection/behavior parameters
map – map of known resources
identifier – identifier of the ckan client

_api_datastore_info(resource_id: str, *, params: dict = None, display_request_not_found: bool = True) → CkanDataStoreInfo

API call to datastore_info. Returns the information on the DataStore. Used to know the number of rows in a DataStore.

Parameters:

resource_id – resource id.
params – N/A
display_request_not_found – whether to display the request in the command window, in case of a CkanNotFoundError. This option is recommended if you are testing whether the resource has a DataStore or not.

Returns:

_api_group_list(*, limit_per_request: int = None, offset: int = 0, groups: List[str] = None, all_fields: bool = True, include_users: bool = True, params: dict = None) → List[CkanGroupInfo] | List[str]

API call to group_list.

Parameters:: params
Returns:

_api_group_list_all(*, all_fields: bool = True, include_users: bool = True, params: dict = None, limit_per_request: int = None, offset: int = None) → List[CkanUserInfo] | List[str]

API call to group_list until an empty list is received.

See:: _api_group_list()
Parameters:: params
Returns:

_api_license_list(*, params: dict = None) → List[CkanLicenseInfo]

API call to license_list.

Parameters:: params
Returns:

_api_organization_list(*, params: dict = None, all_fields: bool = True, include_users: bool = False, limit_per_request: int = None, offset: int = None) → List[CkanOrganizationInfo] | List[str]

API call to organization_list.

Parameters:

params – typically, the request can be limited to an organization with the owner_org parameter
all_fields – whether to return full information or only the organization names in a list

Returns:

_api_organization_list_all(*, params: dict = None, all_fields: bool = True, include_users: bool = False, limit_per_request: int = None, offset: int = None) → List[CkanOrganizationInfo] | List[str]

API call to organization_list until an empty list is received.

See:: _api_organization_list()
Parameters:: params
Returns:

_api_organization_show(id: str, *, params: dict = None) → CkanOrganizationInfo

API call to organization_show.

Parameters:

id – organization id or name.
params – typically, the request can be limited to an organization with the owner_org parameter

Returns:

_api_package_collaborator_list(package_id: str, *, params: dict = None, cancel_if_present: bool = False) → Dict[str, CkanCollaboration]

API call to package_collaborator_list.

Parameters:: params
Returns:

_api_package_search(*, params: dict = None, owner_org: str = None, filter: dict = None, q: str = None, include_private: bool = True, include_drafts: bool = True, sort: str = None, facet: bool = None, limit_per_request: int = None, offset: int = None) → List[CkanPackageInfo]

API call to package_search.

Parameters:

owner_org – ability to filter packages by owner_org
filter – dict of filters to apply, which translates to the API fq argument. From the fq documentation: any filter queries to apply. Note: +site_id:{ckan_site_id} is added to this string prior to the query being executed.
q – the solr query. Optional. Default is ‘*:*’
include_private – if True, private datasets will be included in the results. Only private datasets from the user’s organizations will be returned and sysadmins will be returned all private datasets. Optional, the default is False in the API
include_drafts – if True, draft datasets will be included in the results. A user will only be returned their own draft datasets, and a sysadmin will be returned all draft datasets. Optional, the default is False.
sort – sorting of the search results. Optional. Default: ‘score desc, metadata_modified desc’. As per the solr documentation, this is a comma-separated string of field names and sort-orderings.
facet – whether to enable faceted results. Default: True in API.
limit_per_request – maximum number of results to return. Translates to the API rows argument.
offset – the offset in the complete result for where the set of returned datasets should begin. Translates to the API start argument.
params – other parameters to pass to package_search

Returns:
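
The translation from the keyword arguments above to the package_search API arguments (filter to fq, limit_per_request to rows, offset to start) can be sketched as a plain dict builder. The function build_package_search_params is hypothetical. In particular, joining the filter entries as space-separated key:value terms is an assumption, and owner_org is left out of the sketch.

```python
from typing import Optional


def build_package_search_params(*, filters: Optional[dict] = None,
                                q: Optional[str] = None,
                                include_private: bool = True,
                                include_drafts: bool = True,
                                sort: Optional[str] = None,
                                facet: Optional[bool] = None,
                                limit_per_request: Optional[int] = None,
                                offset: Optional[int] = None,
                                params: Optional[dict] = None) -> dict:
    """Hypothetical builder for the query arguments of a package_search call."""
    request = dict(params) if params else {}
    if filters:
        # Assumption: each entry becomes a "key:value" term of the fq string.
        request["fq"] = " ".join(f"{key}:{value}" for key, value in filters.items())
    request["include_private"] = include_private
    request["include_drafts"] = include_drafts
    # Arguments left to None are omitted so that the API defaults apply.
    optional = {"q": q, "sort": sort, "facet": facet,
                "rows": limit_per_request, "start": offset}
    request.update({key: value for key, value in optional.items() if value is not None})
    return request


search = build_package_search_params(
    filters={"res_format": "CSV", "tags": "economy"},
    q="water", limit_per_request=50, offset=100)
```

Omitting unset arguments keeps the API defaults (for instance the default q and sort) in effect instead of overriding them with empty values.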

_api_package_search_all(*, params: dict = None, owner_org: str = None, filter: dict = None, q: str = None, include_private: bool = True, include_drafts: bool = True, sort: str = None, facet: bool = None, limit_per_request: int = None, offset: int = None, search_all: bool = True) → List[CkanPackageInfo]

API call to package_search until an empty list is received.

See:

_api_package_search()

Parameters:

owner_org – ability to filter packages by owner_org
filter – dict of filters to apply, which translates to the API fq argument. From the fq documentation: any filter queries to apply. Note: +site_id:{ckan_site_id} is added to this string prior to the query being executed.
q – the solr query. Optional. Default is ‘*:*’
include_private – if True, private datasets will be included in the results. Only private datasets from the user’s organizations will be returned and sysadmins will be returned all private datasets. Optional, the default is False in the API
include_drafts – if True, draft datasets will be included in the results. A user will only be returned their own draft datasets, and a sysadmin will be returned all draft datasets. Optional, the default is False.
sort – sorting of the search results. Optional. Default: ‘score desc, metadata_modified desc’. As per the solr documentation, this is a comma-separated string of field names and sort-orderings.
facet – whether to enable faceted results. Default: True in API.
limit_per_request – maximum number of results to return. Translates to the API rows argument.
offset – the offset in the complete result for where the set of returned datasets should begin. Translates to the API start argument.
params – other parameters to pass to package_search

Returns:

_api_package_show(package_id, *, params: dict = None) → CkanPackageInfo

API call to package_show. Returns the information on the package and the resources contained in the package. Not recommended for outer use because this method does not return information about the DataStores. Prefer the map_resources method.

See:

map_resources()

Parameters:

package_id – package id.
params – See API documentation.

Returns:

_api_resource_show(resource_id, *, params: dict = None) → CkanResourceInfo

API call to resource_show. Returns the metadata on a resource.

Parameters:

resource_id – resource id.
params – See API documentation.

Returns:

_api_resource_view_list(resource_id: str, *, params: dict = None) → List[CkanViewInfo]

API call to resource_view_list.

Parameters:: params – typically, the request can be limited to an organization with the owner_org parameter
Returns:

_api_status_show(*, params: dict = None) → CkanStatus

API call to status_show. Returns information on the CKAN installation (version, extensions).

Returns:

_api_user_list(*, q: str = None, email: str = None, params: dict = None) → List[CkanUserInfo]

API call to user_list.

Parameters:: params
Returns:

_api_user_show(*, params: dict = None) → CkanUserInfo | None

API call to user_show. With no params, returns the name of the currently logged-in user.

Returns:: dict with information on the current user

_enrich_resource_info(resource_info: CkanResourceInfo, *, datastore_info: bool = False, resource_view_list: bool = False) → None

Perform additional optional queries to add more information on a resource.

Parameters:

resource_info
datastore_info – option to query datastore_info
resource_view_list – option to query resource_view_list

Returns:

check_package_name_arg(*, package_name: str, package_id: str, raise_error: bool = True) → bool

Check package name argument against ID which was found by API

Parameters:

package_name – package name, ID or title
package_id – package ID known by the API
raise_error – Option to raise an error

Returns:

complete_package_list(package_list: str | List[str] = None, *, owner_org: str = None, include_private: bool = True, include_drafts: bool = True, params: dict = None) → List[str]: This function lists all packages of a CKAN server or of an organization, or keeps the provided list as is. It is an auxiliary function to initialize a package_list argument.

connect()

copy(new_identifier: str = None, *, dest=None): Returns a copy of the current instance. Useful to use an initialized ckan object in a multithreaded context. Each thread would have its own copy. It is recommended to purge the last response before doing a copy (with purge_map=False)

datastore_info(resource_id: str, *, error_not_found: bool = True, params: dict = None, display_request_not_found: bool = True) → CkanDataStoreInfo | None

get_datastore_fields_or_request_of_id(resource_id: str, *, request_missing: bool = True, error_not_mapped: bool = False, error_not_found: bool = True, return_list: bool = False) → List[dict] | OrderedDict[str, CkanField] | None

get_datastore_info_or_request(resource_name: str, package_name: str = None, *, request_missing: bool = True, error_not_mapped: bool = False, error_not_found: bool = True) → CkanDataStoreInfo | None

Get information on a DataStore if present in the map or perform request.

Parameters:

resource_name – resource name or id
package_name – package name or id (required if the resource name is provided)
request_missing – confirm to perform the request if the information is missing
error_not_mapped – raise error if the resource is not mapped

Returns:

get_datastore_info_or_request_of_id(resource_id: str, *, request_missing: bool = True, error_not_mapped: bool = False, error_not_found: bool = True) → CkanDataStoreInfo | None

Get information on a DataStore if present in the map or perform request.

Parameters:

resource_id – resource id
request_missing – confirm to perform the request if the information is missing
error_not_mapped – raise error if the resource is not mapped

Returns:

get_organization_info_or_request(organization_name: str, *, request_missing: bool = True, error_not_mapped: bool = False, error_not_found: bool = True) → CkanOrganizationInfo | None

Get information on an organization if present in the map or perform request.

Parameters:

organization_name – organization name or id
request_missing – confirm to perform the request if the information is missing
error_not_mapped – raise error if the organization is not mapped

Returns:

get_package_info_or_request(package_name: str, *, request_missing: bool = True, error_not_mapped: bool = False, error_not_found: bool = True, datastore_info: bool = None, resource_view_list: bool = None, organization_info: bool = None, license_list: bool = None) → CkanPackageInfo | None

Get information on a Package if present in the map or perform request.

Parameters:

package_name – package name or id
request_missing – confirm to perform the request if the information is missing
error_not_mapped – raise error if the resource is not mapped

Returns:

get_package_page_url(package_name: str, *, error_not_found: bool = False, default_url: bool = True) → str

Get URL of package presentation page in CKAN (landing page).

Parameters:

package_name
error_not_found
default_url – return url based on package name, even if it was not found.

Returns:

get_resource_id_or_request(resource_name: str, package_name: str = None, *, request_missing: bool = True, error_not_mapped: bool = False, error_not_found: bool = True, map_package: bool = False) → str | None

Get resource ID if present in the map or perform request.

Parameters:

resource_name – resource name or id
package_name – package name (needs to be specified if resource_name is not an ID)
request_missing – confirm to perform the request if the information is missing
error_not_mapped – raise error if the resource was not previously mapped
error_not_found – if False, return None if resource was not found, otherwise raise CkanNotFoundError

Returns:
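
The "present in the map or perform request" pattern shared by the get_*_or_request methods can be sketched with a dict acting as the map. Everything here is hypothetical: the function get_id_or_request, the two exception classes standing in for the library's errors, and the order in which the flags are evaluated.

```python
from typing import Callable, Dict, Optional


class NotMappedError(LookupError):
    """Stand-in for the error raised when an entry was not previously mapped."""


class NotFoundError(LookupError):
    """Stand-in for CkanNotFoundError."""


def get_id_or_request(name: str, known: Dict[str, str],
                      request_fun: Callable[[str], Optional[str]], *,
                      request_missing: bool = True,
                      error_not_mapped: bool = False,
                      error_not_found: bool = True) -> Optional[str]:
    """Hypothetical lookup: use the map first, query the server only if needed."""
    if name in known:
        return known[name]
    if error_not_mapped:
        raise NotMappedError(name)
    if not request_missing:
        return None
    found = request_fun(name)
    if found is None:
        if error_not_found:
            raise NotFoundError(name)
        return None
    known[name] = found  # remember the answer for later calls
    return found


server = {"rainfall.csv": "id-123"}
calls = []


def fake_request(name: str) -> Optional[str]:
    """Stand-in for an API request; records each call."""
    calls.append(name)
    return server.get(name)


known_ids: Dict[str, str] = {}
first = get_id_or_request("rainfall.csv", known_ids, fake_request)
second = get_id_or_request("rainfall.csv", known_ids, fake_request)  # served from the map
absent = get_id_or_request("unknown.csv", known_ids, fake_request, error_not_found=False)
```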

get_resource_ids_of_package_list(package_list: List[str] | str = None, *, return_package_dict: bool = False, only_missing: bool = True) → List[str] | Tuple[List[str], Dict[str, CkanPackageInfo]]: Returns a list of resource ids corresponding to the package list. Order is not preserved.

get_resource_info_or_request(resource_name: str, package_name: str = None, *, request_missing: bool = True, error_not_mapped: bool = False, error_not_found: bool = True, datastore_info: bool = False, map_package: bool = False) → CkanResourceInfo | None

get_resource_info_or_request_of_id(resource_id: str, *, request_missing: bool = True, error_not_mapped: bool = False, error_not_found: bool = True, datastore_info: bool = False) → CkanResourceInfo | None

Get information on a resource if present in the map or perform request. For this usage, self.map.get_resource_info() is recommended instead, because resource information is already returned when calling package_info during the mapping process.

Parameters:

resource_id – resource id
request_missing – confirm to perform the request if the information is missing
error_not_mapped – raise error if the resource was not previously mapped
error_not_found – if False, return None if resource was not found, otherwise raise CkanNotFoundError

Returns:

get_resource_page_url(resource_name: str, package_name: str = None, *, error_not_mapped: bool = True) → str

Get URL of resource presentation page in CKAN (landing page).

Parameters:: package_name
Returns:

get_resource_view_list_or_request(resource_id: str, error_not_found: bool = True) → List[CkanViewInfo] | None

Returns either the resource view list which was already received or emits a new query for this information.

Parameters:

resource_id
error_not_found

Returns:

group_list(*, limit_per_request: int = None, offset: int = 0, groups: List[str] = None, all_fields: bool = True, include_users: bool = True, params: dict = None) → List[CkanGroupInfo]

group_list_all(*, all_fields: bool = True, include_users: bool = True, cancel_if_present: bool = False, params: dict = None, limit_per_request: int = None, offset: int = None) → List[CkanGroupInfo] | List[str]

API call to group_list. The call can be canceled if the list is already present (not recommended, rather use get_organization_info_or_request).

Parameters:

params
cancel_if_present – option to cancel when list is already present.

Returns:

input_missing_info(*, base_dir: str = None, input_args: bool = False, input_args_if_necessary: bool = False, input_apikey: bool = True, error_not_found: bool = True, input_owner_org: bool = False)

Ask user information in the console window.

Parameters:: input_owner_org – option to ask for the owner organization.
Returns:

license_list(*, cancel_if_present: bool = True, params: dict = None) → List[CkanLicenseInfo]

API call to license_list. The call can be canceled if the list is already present.

Parameters:

params
cancel_if_present – option to cancel when list is already present.

Returns:

map_resources(package_list: str | List[str] = None, *, params: dict = None, datastore_info: bool = None, resource_view_list: bool = None, organization_info: bool = None, license_list: bool = None, only_missing: bool = True, error_not_found: bool = True, owner_org: str = None, progress_callback: CkanProgressCallbackABC = None) → CkanMap

Map the resources of a given package to obtain resource IDs associated with the package name and its resources.

Parameters:

package_list – List of packages to request. If not provided, the result of package_search is used.
params – Additional parameters to pass to all API calls (not recommended).
datastore_info – If True, enables the request of the API datastore_info to return information about DataStore fields, aliases, and row count. Required to search a DataStore by alias.
resource_view_list – If True, enables the request of the view_list API for each resource.
organization_info – If True, enables the request of the organization_list API before other requests.
license_list – If True, enables the request of the license_list API.
only_missing – If True, skips requesting already-mapped packages.
error_not_found – If True, packages not found by the API are ignored (no error is raised).
owner_org – Filters packages by a specific organization (only if package_search is used).

Returns:

A mapping of resources for the specified package(s).

Note

Packages were previously referred to as DataSets in earlier CKAN implementations.
A single name can be shared across multiple resources within a package. In such cases, the first occurrence is used as a reference, and a warning is issued.
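The first-occurrence rule from the note above can be illustrated with a small name index. This is a sketch under assumptions: index_resources_by_name and the (name, id) tuples are hypothetical and do not reflect how CkanMap is actually built.

```python
import warnings
from typing import Dict, List, Tuple


def index_resources_by_name(resources: List[Tuple[str, str]]) -> Dict[str, str]:
    """Map resource name -> resource id, keeping the first id seen per name."""
    index: Dict[str, str] = {}
    for name, resource_id in resources:
        if name in index:
            # Duplicate name: keep the earlier id and tell the caller.
            warnings.warn(
                f"Resource name {name!r} is not unique; keeping {index[name]!r}"
            )
            continue
        index[name] = resource_id
    return index


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    index = index_resources_by_name(
        [("data.csv", "id-1"), ("notes.txt", "id-2"), ("data.csv", "id-3")]
    )
```

Here "data.csv" resolves to "id-1", and one warning is recorded for the second resource with the same name.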

map_user_rights(*, cancel_if_present: bool = True, progress_callback: CkanProgressCallbackABC = None) → CkanMap: Map user and group access rights to the packages currently mapped by CKAN.

organization_list_all(*, cancel_if_present: bool = False, params: dict = None, all_fields: bool = True, include_users: bool = False, limit_per_request: int = None, offset: int = None) → List[CkanOrganizationInfo] | List[str]

API call to organization_list. The call can be canceled if the list is already present (not recommended, rather use get_organization_info_or_request).

Parameters:

params
cancel_if_present – option to cancel when list is already present.

Returns:

organization_show(id: str, *, params: dict = None) → CkanOrganizationInfo

package_collaborator_list(package_id: str, *, params: dict = None, cancel_if_present: bool = False) → Dict[str, CkanCollaboration]

package_search_all(*, params: dict = None, owner_org: str = None, filter: dict = None, q: str = None, include_private: bool = True, include_drafts: bool = True, sort: str = None, facet: bool = None, limit_per_request: int = None, offset: int = None, search_all: bool = True) → List[CkanPackageInfo]

API call to package_search until an empty list is received.

See:

_api_package_search()

Parameters:

owner_org – ability to filter packages by owner_org
filter – dict of filters to apply, which translate to the API fq argument. Example to filter for a given group "GROUP": filter={"groups": "GROUP"}. Example to filter for a given organization "ORG": filter={"organization": "ORG"}. Example to filter for a given author "NAME": filter={"author": "NAME"}. From the fq documentation: any filter queries to apply. Note: +site_id:{ckan_site_id} is added to this string prior to the query being executed.
q – the solr query. Optional. Default is "*:*".
include_private – if True, private datasets will be included in the results. Only private datasets from the user’s organizations will be returned and sysadmins will be returned all private datasets. Optional, the default is False in the API
include_drafts – if True, draft datasets will be included in the results. A user will only be returned their own draft datasets, and a sysadmin will be returned all draft datasets. Optional, the default is False.
sort – sorting of the search results. Optional. Default: ‘score desc, metadata_modified desc’. As per the solr documentation, this is a comma-separated string of field names and sort-orderings.
facet – whether to enable faceted results. Default: True in API.
limit_per_request – maximum number of results to return per request. Translates to the API rows argument.
offset – the offset in the complete result for where the set of returned datasets should begin. Translates to the API start argument.
params – other parameters to pass to package_search
search_all – if True, the request is renewed until an empty list is received.
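To make the filter argument more concrete, here is one way a dict of filters could be turned into a Solr fq string. This is a sketch only: build_fq is a hypothetical helper, and the exact string that ckanapi_harvesters sends to the API may differ.

```python
from typing import Dict


def build_fq(filters: Dict[str, str]) -> str:
    """Turn {"field": "value"} pairs into a Solr filter-query string."""
    terms = []
    for field, value in filters.items():
        # Escape backslashes and double quotes so the value stays one phrase.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        # The "+" prefix marks the term as required, like +site_id above.
        terms.append(f'+{field}:"{escaped}"')
    return " ".join(terms)


fq_group = build_fq({"groups": "GROUP"})
fq_multi = build_fq({"organization": "ORG", "author": "NAME"})
```

Quoting each value keeps names containing spaces or colons from being parsed as separate Solr terms.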

Returns:

package_show(package_id, *, params: dict = None, cancel_if_exists: bool = True) → CkanPackageInfo

purge(purge_map: bool = False) → None

Erase temporary data stored in this object

Parameters:: purge_map – whether to purge the map created with map_resources

query_current_user(*, verbose: bool = None, error_not_found: bool = False) → CkanUserInfo | None

remap_resources(*, params=None, purge: bool = True, datastore_info: bool = None, resource_view_list: bool = None, organization_info: bool = None, license_list: bool = None)

Perform a new request on previously mapped packages.

Parameters:

params
purge – option to reset the map before remapping.
datastore_info – enforce the request of api_datastore_info
resource_view_list – enforce the request of view_list API for each resource
license_list – enforce the request of license_list API

Returns:

resource_is_datastore(resource_id: str) → bool

Basic test to know whether a resource is DataStore.

Parameters:: resource_id
Returns:

resource_show(resource_id, *, params: dict = None) → CkanResourceInfo

resource_view_list(resource_id: str, *, params: dict = None) → List[CkanViewInfo]

set_default_map_mode(datastore_info: bool = None, resource_view_list: bool = None, organization_info: bool = None, license_list: bool = None) → None

Set up the optional queries orchestrated by the map_resources function

Parameters:

datastore_info
resource_view_list
organization_info
license_list

Returns:

set_owner_org(owner_org: str, *, error_not_found: bool = True) → None

Set the default owner organization.

Parameters:: owner_org – owner organization name, title or id.
Returns:

status_show(*, params: dict = None, cancel_if_present: bool = True) → CkanStatus

API call to status_show. Returns information on the CKAN installation (version, extensions).

Returns:

test_ckan_connection(raise_error: bool = False) → bool: Test whether the CKAN URL points to a CKAN server by calling the package_search API. This does not check authentication.

test_ckan_login(*, raise_error: bool = False, verbose: bool = None, empty_key_connected: bool = False) → bool

Test if your login leads to a user account.

Parameters:

raise_error – option to raise an error if no account was detected
verbose – option to display username in console
empty_key_connected – option to ignore the test if the API key is empty

property url: str

user_list(*, cancel_if_present: bool = False, q: str = None, email: str = None, params: dict = None) → List[CkanUserInfo]

API call to user_list. The call can be canceled if the list is already present.

Parameters:

params
cancel_if_present – option to cancel when list is already present.

Returns:

ckanapi_harvesters.ckan_api.ckan_api_2_readonly module

class ckanapi_harvesters.ckan_api.ckan_api_2_readonly.CkanApiReadOnly(url: str = None, *, proxies: str | dict | ProxyConfig = None, apikey: str | CkanApiKey = None, apikey_file: str = None, owner_org: str = None, params: CkanApiReadOnlyParams = None, map: CkanMap = None, identifier=None)

Bases: CkanApiMap

CKAN Database API interface to CKAN server with helper functions using pandas DataFrames. This class implements requests to read data from the CKAN server resources / DataStores.

__init__(url: str = None, *, proxies: str | dict | ProxyConfig = None, apikey: str | CkanApiKey = None, apikey_file: str = None, owner_org: str = None, params: CkanApiReadOnlyParams = None, map: CkanMap = None, identifier=None)

CKAN Database API interface to CKAN server with helper functions using pandas DataFrames.

Parameters:

url – url of the CKAN server
proxies – proxies to use for requests
apikey – way to provide the API key directly (optional)
apikey_file – path to a file containing a valid API key in the first line of text (optional)
owner_org – name of the organization to limit package_search (optional)
params – other connection/behavior parameters
map – map of known resources
identifier – identifier of the ckan client

_api_datastore_dump_all(resource_id: str, *, filters: dict = None, q: str = None, fields: List[str] = None, sort: str = None, limit_per_request: int = None, offset: int = 0, format: str = None, bom: bool = None, total_limit: int = None, requests_limit: int = None, params: dict = None, search_all: bool = True, return_df: bool = True, progress_callback: CkanProgressCallbackABC = None) → DataFrame | Response

Successive calls to _api_datastore_dump_df until an empty list is received.

See:

_api_datastore_dump()

Parameters:

resource_id – resource id.
filters – The base argument to filter values in a table (optional)
q – Full text query (optional)
fields – The base argument to filter columns (optional)
format – The return format in the returned response (default=csv, tsv, json, xml) (optional)
params – Additional parameters such as filters, q, sort and fields can be given. See DataStore API documentation.
search_all – if False, only the first request is operated

Returns:

_api_datastore_dump_all_page_generator(resource_id: str, *, filters: dict = None, q: str = None, fields: List[str] = None, sort: str = None, limit_per_request: int = None, offset: int = 0, total_limit: int = None, requests_limit: int = None, progress_callback: CkanProgressCallbackABC = None, format: str = None, bom: bool = None, params: dict = None, search_all: bool = True, return_df: bool = True) → Generator[DataFrame, Any, None] | Generator[Response, Any, None]

Successive calls to _api_datastore_dump until an empty list is received. Generator implementation which yields one DataFrame per request.

See:

_api_datastore_dump()

Parameters:

resource_id – resource id.
filters – The base argument to filter values in a table (optional)
q – Full text query (optional)
fields – The base argument to filter columns (optional)
format – The return format in the returned response (default=csv, tsv, json, xml) (optional)
params – Additional parameters such as filters, q, sort and fields can be given. See DataStore API documentation.
search_all – if False, only the first request is operated

Returns:

_api_datastore_dump_df(resource_id: str, *, filters: dict = None, q: str = None, fields: List[str] = None, sort: str = None, limit_per_request: int = None, offset: int = 0, format: str = None, bom: bool = None, params: dict = None) → DataFrame: Convert output of _api_datastore_dump_raw to pandas DataFrame.

_api_datastore_dump_raw(resource_id: str, *, filters: dict = None, q: str = None, fields: List[str] = None, sort: str = None, limit_per_request: int = None, offset: int = 0, format: str = None, bom: bool = None, params: dict = None, compute_len: bool = False) → Response

URL call to datastore/dump URL. Dumps successive lines in the DataStore.

Parameters:

resource_id – resource id.
filters – The base argument to filter values in a table (optional)
q – Full text query (optional)
fields – The base argument to filter columns (optional)
format – The return format in the returned response (default=csv, tsv, json, xml) (optional)
params – Additional parameters such as filters, q, sort and fields can be given. See DataStore API documentation.

Returns:

raw response

_api_datastore_search_all(resource_id: str, *, filters: dict = None, q: str = None, fields: List[str] = None, distinct: bool = None, sort: str = None, limit_per_request: int = None, offset: int = 0, format: str = None, total_limit: int = None, requests_limit: int = None, progress_callback: CkanProgressCallbackABC = None, search_all: bool = True, params: dict = None, return_df: bool = True, compute_len: bool = False) → DataFrame | ListRecords | Any

Successive calls to _api_datastore_search_df until an empty list is received.

See:

_api_datastore_search()

Parameters:

resource_id – resource id.
filters – The base argument to filter values in a table (optional)
q – Full text query (optional)
fields – The base argument to filter columns (optional)
distinct – return only distinct rows (optional, default: false) e.g. to return distinct ids: fields="id", distinct=True
sort – Argument to sort results e.g. sort="index, quantity desc" or sort="index asc"
limit_per_request – Limit the number of records per request
offset – Offset in the returned records
format – The return format in the returned response (default=objects, csv, tsv, lists) (optional)
params – Additional parameters such as filters, q, sort and fields can be given. See DataStore API documentation.
search_all – if False, only the first request is operated

Returns:

_api_datastore_search_all_page_generator(resource_id: str, *, filters: dict = None, q: str = None, fields: List[str] = None, distinct: bool = None, sort: str = None, limit_per_request: int = None, offset: int = 0, format: str = None, search_all: bool = True, total_limit: int = None, requests_limit: int = None, progress_callback: CkanProgressCallbackABC = None, params: dict = None, return_df: bool = True) → Generator[DataFrame, Any, None] | Generator[CkanActionResponse, Any, None]

Successive calls to _api_datastore_search_df until an empty list is received. Generator implementation which yields one DataFrame per request.

See:

_api_datastore_search()

Parameters:

resource_id – resource id.
filters – The base argument to filter values in a table (optional)
q – Full text query (optional)
fields – The base argument to filter columns (optional)
distinct – return only distinct rows (optional, default: false) e.g. to return distinct ids: fields="id", distinct=True
sort – Argument to sort results e.g. sort="index, quantity desc" or sort="index asc"
limit_per_request – Limit the number of records per request
offset – Offset in the returned records
format – The return format in the returned response (default=objects, csv, tsv, lists) (optional)
params – Additional parameters such as filters, q, sort and fields can be given. See DataStore API documentation.
search_all – if False, only the first request is operated

Returns:

_api_datastore_search_df(resource_id: str, *, filters: dict = None, q: str = None, fields: List[str] = None, distinct: bool = None, sort: str = None, limit_per_request: int = None, offset: int = 0, format: str = None, params: dict = None, compute_len: bool = True) → DataFrame: Convert output of _api_datastore_search_raw to pandas DataFrame.

_api_datastore_search_raw(resource_id: str, *, filters: dict = None, q: str = None, fields: List[str] = None, distinct: bool = None, sort: str = None, limit_per_request: int = None, offset: int = 0, format: str = None, params: dict = None, compute_len: bool = False) → CkanActionResponse

API call to datastore_search. Performs queries on the DataStore.

Parameters:

resource_id – resource id.
filters – The base argument to filter values in a table (optional)
q – Full text query (optional)
fields – The base argument to filter columns (optional)
distinct – return only distinct rows (optional, default: false) e.g. to return distinct ids: fields="id", distinct=True
sort – Argument to sort results e.g. sort="index, quantity desc" or sort="index asc"
limit_per_request – Limit the number of records per request
offset – Offset in the returned records
format – The return format in the returned response (default=objects, csv, tsv, lists) (optional)
params – Additional parameters such as filters, q, sort and fields can be given. See DataStore API documentation.

Returns:

_api_datastore_search_sql_all(sql: str, *, params: dict = None, search_all: bool = True, limit_per_request: int = None, offset: int = None, total_limit: int = None, requests_limit: int = None, progress_callback: CkanProgressCallbackABC = None, return_df: bool = True) → DataFrame | ListRecords

Successive calls to _api_datastore_search_sql until an empty list is received.

See:

_api_datastore_search_sql()

Parameters:

sql – SQL query e.g. f'SELECT * FROM "{resource_id}" WHERE "USER_ID" < 0'
limit_per_request – Limit the number of records per request
offset – Offset in the returned records
total_limit – Strictly limit the number of records to return, counting from the initial offset
requests_limit – Limit the number of requests
params – N/A
search_all – if False, only the first request is operated

Returns:

_api_datastore_search_sql_all_page_generator(sql: str, *, params: dict = None, search_all: bool = True, limit_per_request: int = None, offset: int = 0, total_limit: int = None, requests_limit: int = None, progress_callback: CkanProgressCallbackABC = None, return_df: bool = True) → Generator[DataFrame, Any, None] | Generator[CkanActionResponse, Any, None]

Successive calls to _api_datastore_search_sql until an empty list is received. Generator implementation which yields one DataFrame per request.

See:

_api_datastore_search_sql()

Parameters:

sql – SQL query e.g. f'SELECT * FROM "{resource_id}" WHERE "USER_ID" < 0'
limit_per_request – Limit the number of records per request. This parameter applies if there is no LIMIT statement in the sql query. Incompatible usage raises a CkanSqlLimitOffsetError.
offset – Offset in the returned records. This parameter applies if there is no OFFSET statement in the sql query. Incompatible usage raises a CkanSqlLimitOffsetError.
total_limit – Strictly limit the number of records to return, counting from the initial offset
requests_limit – Limit the number of requests
params – N/A
search_all – if False, only the first request is operated

Returns:

_api_datastore_search_sql_df(sql: str, *, params: dict = None, limit_per_request: int = None, offset: int = None) → DataFrame: Convert output of _api_datastore_search_sql_raw to pandas DataFrame.

_api_datastore_search_sql_raw(sql: str, *, params: dict = None, limit_per_request: int = None, offset: int = None) → CkanActionResponse

API call to datastore_search_sql. Performs SQL queries on the DataStore. These queries can be more complex than with datastore_search. DataStores are referenced by their resource_id, surrounded by double quotes. Field names are referenced by their name in upper case, surrounded by double quotes. NB: This action is not available when ckan.datastore.sqlsearch.enabled is set to false.

Parameters:

sql – SQL query e.g. f'SELECT * FROM "{resource_id}" WHERE "USER_ID" < 0'
limit_per_request – Limit the number of records per request
offset – Offset in the returned records
params – N/A
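Because the resource id and field names are double-quoted identifiers, the sql string is easiest to build with a small quoting helper. The sketch below uses only string formatting; quote_identifier, build_query and the sample resource id are hypothetical and not part of the library.

```python
def quote_identifier(name: str) -> str:
    """Wrap an SQL identifier in double quotes, doubling embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_query(resource_id: str, field: str, threshold: int) -> str:
    """Build a SELECT statement over one DataStore table (illustrative)."""
    table = quote_identifier(resource_id)
    column = quote_identifier(field)
    # int() guards against injecting arbitrary text through the threshold.
    return f"SELECT * FROM {table} WHERE {column} < {int(threshold)}"


# "3f2a-example-id" is a made-up resource id used only for illustration.
sql = build_query("3f2a-example-id", "USER_ID", 0)
```

The resulting string could then be passed as the sql argument; whether the server accepts it still depends on the SQL search action being enabled.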

Returns:

static _get_default_bom_option_read(bom: bool = None, format: str = None, search_method: bool = False, apply_defaults: bool = True) → bool | None: API datastore_dump includes an option to return the BOM (Byte Order Mark) for requests in CSV/TSV format. The BOM helps text-processing tools and applications determine the encoding of the file e.g. to distinguish between UTF-8 and UTF-16.

Note

To correctly handle BOM characters in pandas.read_csv, specify the encoding="utf-8-sig" parameter. This is taken into account in the decoding function.
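The effect of the utf-8-sig codec can be seen without pandas. This standard-library sketch decodes the same made-up CSV payload twice; it illustrates the codec only and is not the library's decoding function.

```python
import csv
import io

# A CSV payload as a server might send it with the BOM option enabled.
payload = "\ufeffid,name\n1,alpha\n".encode("utf-8")

# Plain UTF-8 keeps the BOM as part of the first header name.
header_plain = next(csv.reader(io.StringIO(payload.decode("utf-8"))))

# utf-8-sig strips a leading BOM if present (and is harmless otherwise).
header_sig = next(csv.reader(io.StringIO(payload.decode("utf-8-sig"))))
```

With plain utf-8 the first column name carries an invisible U+FEFF prefix, which typically shows up later as a failed lookup on the "id" column.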

static _get_default_format_read(format: str = None, search_method: bool = False, return_df: bool = True) → str | None: Configure default format for best interpretation when reading results

_rx_records_df_clean(df: DataFrame) → None

Auxiliary function for cleaning dataframe from DataStore requests

Parameters:: df
Returns:

datastore_search(resource_id: str, *, filters: dict = None, q: str = None, fields: List[str] = None, distinct: bool = None, sort: str = None, limit_per_request: int = None, offset: int = 0, total_limit: int = None, requests_limit: int = None, search_all: bool = False, search_method: bool = True, params: dict = None, limit: int = None, progress_callback: CkanProgressCallbackABC = None, format: str = None, bom: bool = None, return_df: bool = True) → DataFrame | ListRecords | Any | List[CkanActionResponse]

Preferred entry-point for a DataStore read request. Uses the API datastore_search

Parameters:

resource_id – resource id.
filters – The base argument to filter values in a table (optional)
q – Full text query (optional)
fields – The base argument to filter columns (optional)
distinct – return only distinct rows (optional, default: false) e.g. to return distinct ids: fields="id", distinct=True
sort – Argument to sort results e.g. sort="index, quantity desc" or sort="index asc"
limit_per_request – Limit the number of records per request
offset – Offset in the returned records
total_limit – Strictly limit the number of records to return, counting from the initial offset
requests_limit – Limit the number of requests
limit – previously limit_per_request, now stands for total_limit. This parameter is deprecated and will be removed in a future release.
progress_callback – Progress callback function
params – Additional parameters such as filters, q, sort and fields can be given. See DataStore API documentation.
search_all – Option to renew the request until there are no more records.
search_method – API method selection (True=datastore_search, False=datastore_dump)
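The interplay of limit_per_request, offset, total_limit, requests_limit and search_all can be sketched as a paging loop. This is an illustration under assumptions: paginate and the fetch_page callable are hypothetical, and the library's own loop may differ in detail.

```python
from typing import Callable, Iterator, List, Optional


def paginate(
    fetch_page: Callable[[int, int], List[int]],
    *,
    limit_per_request: int,
    offset: int = 0,
    total_limit: Optional[int] = None,
    requests_limit: Optional[int] = None,
    search_all: bool = True,
) -> Iterator[List[int]]:
    """Yield pages until an empty page or one of the limits is reached."""
    returned = 0  # records yielded so far, counted from the initial offset
    requests = 0  # requests sent so far
    while True:
        if requests_limit is not None and requests >= requests_limit:
            return
        size = limit_per_request
        if total_limit is not None:
            remaining = total_limit - returned
            if remaining <= 0:
                return
            size = min(size, remaining)  # never ask for more than allowed
        page = fetch_page(offset + returned, size)
        requests += 1
        if not page:
            return  # empty page: no more records
        yield page
        returned += len(page)
        if not search_all:
            return  # single request requested


rows = list(range(25))  # stand-in for a 25-row DataStore


def fetch(start: int, count: int) -> List[int]:
    return rows[start:start + count]


page_sizes = [len(p) for p in paginate(fetch, limit_per_request=10)]
capped = [r for p in paginate(fetch, limit_per_request=10, offset=5, total_limit=12) for r in p]
single = list(paginate(fetch, limit_per_request=10, requests_limit=1))
```

In this sketch total_limit shrinks the last request so that no surplus records are fetched, which matches the "strictly limit" wording above.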

Returns:

datastore_search_cursor(resource_id: str, *, filters: dict = None, q: str = None, fields: List[str] = None, distinct: bool = None, sort: str = None, limit_per_request: int = None, offset: int = 0, total_limit: int = None, requests_limit: int = None, progress_callback: CkanProgressCallbackABC = None, params: dict = None, search_all: bool = True, search_method: bool = True, format: str = None, bom: bool = None, return_df: bool = False, limit: int = None) → Generator[Series | dict | list | str, Any, None]

Cursor on rows of datastore_search

Parameters:

resource_id – resource id.
filters – The base argument to filter values in a table (optional)
q – Full text query (optional)
fields – The base argument to filter columns (optional)
distinct – return only distinct rows (optional, default: false) e.g. to return distinct ids: fields="id", distinct=True
sort – Argument to sort results e.g. sort="index, quantity desc" or sort="index asc"
limit_per_request – Limit the number of records per request
offset – Offset in the returned records
total_limit – Strictly limit the number of records to return, counting from the initial offset
requests_limit – Limit the number of requests
limit – previously limit_per_request, now stands for total_limit. This parameter is deprecated and will be removed in a future release.
progress_callback – Progress callback function
params – Additional parameters such as filters, q, sort and fields can be given. See DataStore API documentation.
search_all – Option to renew the request until there are no more records.
search_method – API method selection (True=datastore_search, False=datastore_dump)
return_df – Return pandas Series (True) or dict (False)
format – Format of the data requested through the API. This does not change the output if return_df is True.
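A row cursor can be thought of as a page generator flattened into single records. The sketch below shows that relationship with hypothetical names (iter_rows, fake_pages); it does not call the library.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, object]


def iter_rows(pages: Iterable[List[Record]]) -> Iterator[Record]:
    """Yield records one at a time from an iterable of pages."""
    for page in pages:
        # Each page is fetched only when the previous one is exhausted.
        yield from page


def fake_pages() -> Iterator[List[Record]]:
    # Stand-in for successive API responses.
    yield [{"id": 1}, {"id": 2}]
    yield [{"id": 3}]


ids = [row["id"] for row in iter_rows(fake_pages())]
```

Because both layers are generators, stopping the iteration early also avoids sending the remaining requests.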

Returns:

datastore_search_fields_type_dict(resource_id: str, *, filters: dict = None, q: str = None, distinct: bool = None, fields: List[str] = None, request_missing: bool = True, error_not_mapped: bool = False, error_not_found: bool = True) → OrderedDict

datastore_search_find_one(resource_id: str, *, filters: dict = None, q: str = None, distinct: bool = None, fields: List[str] = None, offset: int = 0, return_df: bool = True) → DataFrame | ListRecords | Any | List[CkanActionResponse]

Request first result of a query

Parameters:

resource_id – resource id.
filters – The base argument to filter values in a table (optional)
q – Full text query (optional)
fields – The base argument to filter columns (optional)
distinct – return only distinct rows (optional, default: false) e.g. to return distinct ids: fields="id", distinct=True
offset – Offset in the returned records
return_df – Return pandas Series (True) or dict (False)

Returns:

datastore_search_page_generator(resource_id: str, *, filters: dict = None, q: str = None, fields: List[str] = None, distinct: bool = None, sort: str = None, limit_per_request: int = None, offset: int = 0, total_limit: int = None, requests_limit: int = None, progress_callback: CkanProgressCallbackABC = None, params: dict = None, search_all: bool = True, search_method: bool = True, format: str = None, bom: bool = None, return_df: bool = True, limit: int = None) → Generator[DataFrame, Any, None] | Generator[CkanActionResponse, Any, None] | Generator[Response, Any, None]

Preferred entry-point for a DataStore read request. Uses the API datastore_search

Parameters:

resource_id – resource id.
filters – The base argument to filter values in a table (optional)
q – Full text query (optional)
fields – The base argument to filter columns (optional)
distinct – return only distinct rows (optional, default: false) e.g. to return distinct ids: fields="id", distinct=True
sort – Argument to sort results e.g. sort="index, quantity desc" or sort="index asc"
limit_per_request – Limit the number of records per request
offset – Offset in the returned records
total_limit – Strictly limit the number of records to return, counting from the initial offset
requests_limit – Limit the number of requests
limit – previously limit_per_request, now stands for total_limit. This parameter is deprecated and will be removed in a future release.
progress_callback – Progress callback function
params – Additional parameters such as filters, q, sort and fields can be given. See DataStore API documentation.
search_all – Option to renew the request until there are no more records.
search_method – API method selection (True=datastore_search, False=datastore_dump)
return_df – Return pandas DataFrame (True) or dict (False)

Returns:

datastore_search_row_count(resource_id: str, *, filters: dict = None, q: str = None, distinct: bool = None, fields: List[str] = None) → int

Request the number of rows in a DataStore

Parameters:

resource_id – resource id.
filters – The base argument to filter values in a table (optional)
q – Full text query (optional)
fields – The base argument to filter columns (optional)
distinct – return only distinct rows (optional, default: false) e.g. to return distinct ids: fields="id", distinct=True

Returns:

datastore_search_sql(sql: str, *, params: dict = None, search_all: bool = False, limit_per_request: int = None, offset: int = None, total_limit: int = None, requests_limit: int = None, progress_callback: CkanProgressCallbackABC = None, return_df: bool = True, limit: int = None) → DataFrame | Tuple[ListRecords, dict]

Preferred entry-point for a DataStore SQL request.

See:

_api_datastore_search_sql()

NB: This action is not available when ckan.datastore.sqlsearch.enabled is set to false.

Parameters:

sql – SQL query e.g. f'SELECT * FROM "{resource_id}" WHERE "USER_ID" < 0'
limit_per_request – Limit the number of records per request. This parameter applies if there is no LIMIT statement in the sql query. Incompatible usage raises a CkanSqlLimitOffsetError.
offset – Offset in the returned records. This parameter applies if there is no OFFSET statement in the sql query. Incompatible usage raises a CkanSqlLimitOffsetError.
total_limit – Strictly limit the number of records to return, counting from the initial offset
requests_limit – Limit the number of requests
limit – previously limit_per_request, now stands for total_limit. This parameter is deprecated and will be removed in a future release.
progress_callback – Progress callback function
params – N/A
search_all – Option to renew the request until there are no more records.
return_df – Return pandas DataFrame (True) or dict (False)
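The rule that limit_per_request and offset only apply when the query has no LIMIT or OFFSET of its own can be sketched as follows. This reads "incompatible usage" as passing the argument while the keyword is already in the query, which is an assumption; SqlLimitOffsetError is a local stand-in for CkanSqlLimitOffsetError, and apply_limit_offset is hypothetical.

```python
import re
from typing import Optional


class SqlLimitOffsetError(ValueError):
    """Local stand-in for the library's CkanSqlLimitOffsetError."""


def apply_limit_offset(
    sql: str, *, limit: Optional[int] = None, offset: Optional[int] = None
) -> str:
    """Append LIMIT/OFFSET clauses unless the query already has them."""
    # Naive keyword detection: a real parser would ignore string literals
    # and subqueries, which this sketch does not attempt.
    has_limit = re.search(r"\bLIMIT\b", sql, flags=re.IGNORECASE) is not None
    has_offset = re.search(r"\bOFFSET\b", sql, flags=re.IGNORECASE) is not None
    if (limit is not None and has_limit) or (offset is not None and has_offset):
        raise SqlLimitOffsetError(
            "LIMIT/OFFSET given both in the query and as arguments"
        )
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    if offset is not None:
        sql += f" OFFSET {int(offset)}"
    return sql


paged = apply_limit_offset('SELECT * FROM "my-resource-id"', limit=10, offset=20)

try:
    apply_limit_offset('SELECT * FROM "my-resource-id" LIMIT 5', limit=10)
    conflict_raised = False
except SqlLimitOffsetError:
    conflict_raised = True
```

Raising early, rather than silently preferring one of the two values, keeps paging predictable when the request is repeated with increasing offsets.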

Returns:

datastore_search_sql_cursor(sql: str, *, params: dict = None, search_all: bool = True, limit_per_request: int = None, offset: int = None, total_limit: int = None, requests_limit: int = None, progress_callback: CkanProgressCallbackABC = None, return_df: bool = False, limit: int = None) → Generator[Series | dict, Any, None]

Preferred entry-point for a DataStore SQL request, to iterate over records.

See:

_api_datastore_search_sql()

NB: This action is not available when ckan.datastore.sqlsearch.enabled is set to false.

Parameters:

sql – SQL query e.g. f'SELECT * FROM "{resource_id}" WHERE "USER_ID" < 0'
limit_per_request – Limit the number of records per request. This parameter applies if there is no LIMIT statement in the sql query. Incompatible usage raises a CkanSqlLimitOffsetError.
offset – Offset in the returned records. This parameter applies if there is no OFFSET statement in the sql query. Incompatible usage raises a CkanSqlLimitOffsetError.
total_limit – Strictly limit the number of records to return, counting from the initial offset
requests_limit – Limit the number of requests
limit – previously limit_per_request, now stands for total_limit. This parameter is deprecated and will be removed in a future release.
progress_callback – Progress callback function
params – N/A
search_all – Option to renew the request until there are no more records.
return_df – Return pandas Series (True) or dict (False)

Returns:

datastore_search_sql_fields_type_dict(sql: str, *, params: dict = None) → OrderedDict

datastore_search_sql_find_one(sql: str, *, params: dict = None, offset: int = 0, return_df: bool = True) → DataFrame | Tuple[ListRecords, dict]

First element of an SQL request

Parameters:

sql – SQL query e.g. f'SELECT * FROM "{resource_id}" WHERE "USER_ID" < 0'
offset – Offset in the returned records. This parameter applies if there is no OFFSET statement in the sql query. Incompatible usage raises a CkanSqlLimitOffsetError.
params – N/A
return_df – Return pandas Series (True) or dict (False)

datastore_search_sql_page_generator(sql: str, *, params: dict = None, search_all: bool = True, limit_per_request: int = None, offset: int = None, total_limit: int = None, requests_limit: int = None, progress_callback: CkanProgressCallbackABC = None, return_df: bool = True, limit: int = None) → Generator[DataFrame, Any, None] | Generator[CkanActionResponse, Any, None]

Preferred entry-point for a DataStore SQL request, yielding one page of results per request.

See:

_api_datastore_search_sql()

NB: This action is not available when ckan.datastore.sqlsearch.enabled is set to false.

Parameters:

sql – SQL query e.g. f'SELECT * FROM "{resource_id}" WHERE "USER_ID" < 0'
limit_per_request – Limit the number of records per request. This parameter applies if there is no LIMIT statement in the sql query. Incompatible usage raises a CkanSqlLimitOffsetError.
offset – Offset in the returned records. This parameter applies if there is no OFFSET statement in the sql query. Incompatible usage raises a CkanSqlLimitOffsetError.
total_limit – Strictly limit the number of records to return, counting from the initial offset
requests_limit – Limit the number of requests
limit – previously limit_per_request, now stands for total_limit. This parameter is deprecated and will be removed in a future release.
progress_callback – Progress callback function
params – N/A
search_all – Option to renew the request until there are no more records.
return_df – Return pandas DataFrame (True) or dict (False)

Returns:

static from_dict_df_args(fields_type_dict: OrderedDict) → dict

get_datastore_search_url(resource_id: str, *, filters: dict = None, q: str = None, fields: List[str] = None, distinct: bool = None, sort: str = None, limit_per_request: int = None, offset: int = None, format: str = None, bom: bool = None, params: dict = None, default_limit_offset: bool = False, search_method: bool = True): Obtain the datastore search URL used for the datastore_search query

get_resource_download_url(resource_id: str, package_name: str = None)

list_datastore_aliases() → List[CkanAliasInfo]

map_file_resource_sizes(resource_list: List[str] = None, *, package_list: List[str] = None, cancel_if_present: bool = True, progress_callback: CkanProgressCallbackABC = None) → None

map_resources(package_list: str | List[str] = None, *, params: dict = None, datastore_info: bool = None, resource_view_list: bool = None, organization_info: bool = None, license_list: bool = None, only_missing: bool = True, error_not_found: bool = True, owner_org: str = None, progress_callback: CkanProgressCallbackABC = None) → CkanMap

Map the resources of a given package to obtain resource IDs associated with the package name and its resources.

Parameters:

package_list – List of packages to request. If not provided, the result of package_search is used.
params – Additional parameters to pass to all API calls (not recommended).
datastore_info – If True, enables the request of the API datastore_info to return information about DataStore fields, aliases, and row count. Required to search a DataStore by alias.
resource_view_list – If True, enables the request of the view_list API for each resource.
organization_info – If True, enables the request of the organization_list API before other requests.
license_list – If True, enables the request of the license_list API.
only_missing – If True, skips requesting already-mapped packages.
error_not_found – If True, packages not found by the API are ignored (no error is raised).
owner_org – Filters packages by a specific organization (only if package_search is used).

Returns:

A mapping of resources for the specified package(s).

Note

Packages were previously referred to as DataSets in earlier CKAN implementations.
A single name can be shared across multiple resources within a package. In such cases, the first occurrence is used as a reference, and a warning is issued.
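
The rule described in the note (first occurrence wins, with a warning) can be sketched as follows. This is an illustration only: index_resources_by_name is a hypothetical helper, and the resource dictionaries are reduced to the id and name keys.

```python
import warnings


def index_resources_by_name(resources):
    """Map resource name -> resource id, keeping the first occurrence of each name."""
    by_name = {}
    for resource in resources:
        name, resource_id = resource["name"], resource["id"]
        if name in by_name:
            # A later resource with the same name is skipped, with a warning
            warnings.warn(
                f"Duplicate resource name {name!r}: keeping {by_name[name]}, ignoring {resource_id}"
            )
            continue
        by_name[name] = resource_id
    return by_name


resources = [
    {"id": "id-1", "name": "data.csv"},
    {"id": "id-2", "name": "notes.txt"},
    {"id": "id-3", "name": "data.csv"},
]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    name_to_id = index_resources_by_name(resources)
```

A resource hidden behind a duplicate name remains reachable through its resource id.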

static read_fields_df_args(fields_type_dict: OrderedDict) → dict

static read_fields_type_dict(fields_list_dict: List[dict]) → OrderedDict

resource_download(resource_id: str, *, method: str = None, proxies: dict = None, headers: dict = None, auth: AuthBase | Tuple[str, str] = None, verify: bool | str | None = None, stream: bool = False) → Tuple[CkanResourceInfo, Response | None]

Uses the link provided in resource_show to download a resource.

Parameters:: resource_id – resource id
Returns:

resource_download_df(resource_id: str, *, method: str = None, proxies: dict = None, headers: dict = None, auth: AuthBase | Tuple[str, str] = None, verify: bool | str | None = None) → Tuple[CkanResourceInfo, DataFrame | None]

Uses the link provided in resource_show to download a resource and interprets it as a DataFrame.

Parameters:: resource_id – resource id
Returns:

resource_download_test_head(resource_id: str, *, raise_error: bool = False, proxies: dict = None, headers: dict = None, auth: AuthBase | Tuple[str, str] = None, verify: bool | str | None = None) → None | ContextErrorLevelMessage

This sends a HEAD request to the resource download url using the CKAN connection parameters via resource_download. The resource is not downloaded but the headers indicate if the url is valid.

Returns:: None if successful

test_sql_capabilities(*, raise_error: bool = False) → bool

Test the availability of the API datastore_search_sql

Returns:

class ckanapi_harvesters.ckan_api.ckan_api_2_readonly.CkanApiReadOnlyParams(*, proxies: str | dict | ProxyConfig = None, ckan_headers: dict = None, http_headers: dict = None)

Bases: CkanApiParamsBasic

apply_default_limit_to_sql_when_search_all: bool = True

copy(new_identifier: str = None, *, dest=None)

default_df_download_id_field_treatment: CkanIdFieldTreatment = 1

map_all_aliases: bool = True

ckanapi_harvesters.ckan_api.ckan_api_3_policy module

class ckanapi_harvesters.ckan_api.ckan_api_3_policy.CkanApiPolicy(url: str = None, *, proxies: str | dict | ProxyConfig = None, apikey: str | CkanApiKey = None, apikey_file: str = None, owner_org: str = None, params: CkanApiPolicyParams = None, map: CkanMap = None, policy: CkanPackageDataFormatPolicy = None, policy_file: str = None, identifier=None)

Bases: CkanApiReadOnly

__init__(url: str = None, *, proxies: str | dict | ProxyConfig = None, apikey: str | CkanApiKey = None, apikey_file: str = None, owner_org: str = None, params: CkanApiPolicyParams = None, map: CkanMap = None, policy: CkanPackageDataFormatPolicy = None, policy_file: str = None, identifier=None)

CKAN Database API interface to CKAN server with helper functions using pandas DataFrames.

Parameters:

url – url of the CKAN server
proxies – proxies to use for requests
apikey – way to provide the API key directly (optional)
apikey_file – path to a file containing a valid API key in the first line of text (optional)
policy – data format policy to use with policy_check function
policy_file – path to a JSON file containing the data format policy to use with policy_check function
owner_org – name of the organization to limit package_search (optional)
params – other connection/behavior parameters
map – map of known resources
identifier – identifier of the ckan client

copy(new_identifier: str = None, *, dest=None): Returns a copy of the current instance. Useful for sharing an initialized ckan object in a multithreaded context, where each thread has its own copy. It is recommended to purge the last response before copying (with purge_map=False).

load_default_policy(*, error_not_found: bool = False, load_error: bool = True, cancel_if_present: bool = False, force: bool = False) → CkanPackageDataFormatPolicy | None

Function to load the default data format policy from the CKAN server. The default policy is defined in ckan_configuration

Parameters:

error_not_found
cancel_if_present
force

Returns:

load_policy(policy_file: str, base_dir: str = None, proxies: dict = None, headers: dict = None, error_not_found: bool = True, load_error: bool = True) → CkanPackageDataFormatPolicy

Load the CKAN data format policy from file (JSON format).

Parameters:

policy_file – path to the policy file
base_dir – base directory, if the policy_file is a relative path

Returns:

map_resources(package_list: str | List[str] = None, *, params: dict = None, datastore_info: bool = None, resource_view_list: bool = None, organization_info: bool = None, license_list: bool = None, only_missing: bool = True, error_not_found: bool = True, owner_org: str = None, load_policy: bool = None, progress_callback: CkanProgressCallbackABC = None) → CkanMap

Map the resources of a given package to obtain resource IDs associated with the package name and its resources.

Parameters:

package_list – List of packages to request. If not provided, the result of package_search is used.
params – Additional parameters to pass to all API calls (not recommended).
datastore_info – If True, enables the request of the API datastore_info to return information about DataStore fields, aliases, and row count. Required to search a DataStore by alias.
resource_view_list – If True, enables the request of the view_list API for each resource.
organization_info – If True, enables the request of the organization_list API before other requests.
license_list – If True, enables the request of the license_list API.
only_missing – If True, skips requesting already-mapped packages.
error_not_found – If True, packages not found by the API are ignored (no error is raised).
owner_org – Filters packages by a specific organization (only if package_search is used).

Returns:

A mapping of resources for the specified package(s).

Note

Packages were previously referred to as DataSets in earlier CKAN implementations.
A single name can be shared across multiple resources within a package. In such cases, the first occurrence is used as a reference, and a warning is issued.

policy_check(package_list: str | List[str] = None, policy: CkanPackageDataFormatPolicy = None, *, buffer: Dict[str, PackagePolicyReport] = None, raise_error: bool = False, verbose: bool = None, auto_update: bool = None, date_report: datetime = None, progress_callback: CkanProgressCallbackABC = None) → bool

Enforce policy on mapped packages

Parameters:: policy
Returns:

query_default_policy(*, error_not_found: bool = False, load_error: bool = True) → CkanPackageDataFormatPolicy | None

Download default policy and return it without loading it in the policy attribute.

Parameters:: error_not_found
Returns:

set_default_map_mode(datastore_info: bool = None, resource_view_list: bool = None, organization_info: bool = None, license_list: bool = None, load_policy: bool = None) → None

Set up the optional queries orchestrated by the map_resources function

Parameters:

datastore_info
resource_view_list
organization_info
license_list

Returns:

set_verbosity(verbosity: bool = True, verbose_extra: bool = None) → None

Enable/disable full verbose output

Parameters:: verbosity – boolean. Cannot be None
Returns:

class ckanapi_harvesters.ckan_api.ckan_api_3_policy.CkanApiPolicyParams(*, proxies: str | dict | ProxyConfig = None, ckan_headers: dict = None, http_headers: dict = None)

Bases: CkanApiReadOnlyParams

copy(new_identifier: str = None, *, dest=None)

ckanapi_harvesters.ckan_api.ckan_api_4_readwrite module

class ckanapi_harvesters.ckan_api.ckan_api_4_readwrite.CkanApiReadWrite(url: str = None, *, proxies: str | dict | ProxyConfig = None, apikey: str | CkanApiKey = None, apikey_file: str = None, owner_org: str = None, params: CkanApiPolicyParams = None, map: CkanMap = None, policy: CkanPackageDataFormatPolicy = None, policy_file: str = None, data_cleaner_upload: CkanDataCleanerABC = None, identifier=None)

Bases: CkanApiPolicy

CKAN Database API interface to CKAN server with helper functions using pandas DataFrames. This class implements requests to write data to the CKAN server resources / DataStores.

__init__(url: str = None, *, proxies: str | dict | ProxyConfig = None, apikey: str | CkanApiKey = None, apikey_file: str = None, owner_org: str = None, params: CkanApiPolicyParams = None, map: CkanMap = None, policy: CkanPackageDataFormatPolicy = None, policy_file: str = None, data_cleaner_upload: CkanDataCleanerABC = None, identifier=None)

CKAN Database API interface to CKAN server with helper functions using pandas DataFrames.

Parameters:

url – url of the CKAN server
proxies – proxies to use for requests
apikey – way to provide the API key directly (optional)
apikey_file – path to a file containing a valid API key in the first line of text (optional)
policy – data format policy to use with policy_check function
policy_file – path to a JSON file containing the data format policy to use with policy_check function
owner_org – name of the organization to limit package_search (optional)
params – other connection/behavior parameters
map – map of known resources
data_cleaner_upload – data cleaner object to use before uploading to a CKAN DataStore.
identifier – identifier of the ckan client

_api_datapusher_submit(resource_id: str, *, params: dict = None) → bool

Call to API action datapusher_submit. This triggers the normally asynchronous DataPusher service for a given resource.

Parameters:

resource_id – resource id
params

Returns:

_api_datastore_upsert_raw(records: dict | List[dict] | DataFrame, resource_id: str, *, method: UpsertChoice | str, params: dict = None, force: bool = None, dry_run: bool = False, last_insertion: bool = True) → CkanActionResponse

API call to datastore_upsert.

Parameters:

records – records, preferably in a pandas DataFrame - they will be converted to a list of dictionaries.
resource_id – destination resource id
method – see UpsertChoice (insert, update or upsert)
force – set to True to edit a read-only resource. If not provided, this is overridden by self.default_force
params – additional parameters
dry_run – set to True to abort transaction instead of committing, e.g. to check for validation or type errors
last_insertion – trigger for calculate_record_count (API doc: updates the stored count of records, used to optimize datastore_search in combination with the total_estimation_threshold parameter. If doing a series of requests to change a resource, you only need to set this to True on the last request.)

Returns:

the inserted records as a pandas DataFrame, from the server response

_api_resource_patch(resource_id: str, *, name: str = None, format: str = None, description: str = None, title: str = None, state: CkanState = None, df: DataFrame = None, file_path: str = None, url: str = None, files=None, payload: bytes | BufferedIOBase = None, payload_name: str = None, params: dict = None) → CkanResourceInfo

Call to resource_patch API. This call can be used to change the resource parameters via params (cf. API documentation) or to reupload the resource file into FileStore. The latter action replaces the current resource. If it is a DataStore, it is reset to the new contents of the file. The file can be transmitted either as a URL, a file path or a pandas DataFrame. The files argument can pass through these arguments to the requests.post function. A call to datapusher_submit() may be required for the new file to be taken into account immediately.

See:

_api_resource_create

See:

resource_create

Parameters:

resource_id – resource id
url – url of the resource to replace resource
params – parameters such as name, format, resource_type can be changed

For file uploads, the following parameters are considered, in order of priority. See upload_prepare_requests_files_arg for an example of formatting.

Parameters:

files – files pass through argument to the requests.post function. Use to send other data formats.
payload – bytes to upload as a file
payload_name – name of the payload to use (associated with the payload argument) - this determines the format recognized in CKAN viewers.
file_path – path of the file to transmit (binary and text files are supported here)
df – pandas DataFrame to replace resource

Returns:
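
The priority order of the upload parameters (files, payload, file_path, df) can be sketched as a simple cascade. This is an assumption-laden illustration rather than the library's code: select_upload and default_name are hypothetical, the df branch only assumes a pandas-like object exposing to_csv(), and the upload form field name is the one CKAN's resource_create and resource_patch actions expect.

```python
from pathlib import Path


def select_upload(files=None, payload=None, payload_name=None,
                  file_path=None, df=None, default_name="data.csv"):
    """Return a requests-style ``files`` argument built from the first source provided."""
    if files is not None:
        return files  # highest priority: passed through unchanged
    if payload is not None:
        return [("upload", (payload_name or default_name, payload))]
    if file_path is not None:
        path = Path(file_path)
        return [("upload", (path.name, path.read_bytes()))]
    if df is not None:
        # Assumes a pandas-like object exposing to_csv()
        csv_bytes = df.to_csv(index=False).encode("utf-8")
        return [("upload", (default_name, csv_bytes))]
    return None  # nothing to upload


# payload takes priority over file_path, so the path is never opened here
chosen = select_upload(payload=b"a,b\n1,2\n", payload_name="table.csv",
                       file_path="ignored.csv")
```

The file name attached to the payload matters because, as noted for payload_name, it determines the format recognized in CKAN viewers.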

_datastore_upsert_df(records: dict | List[dict] | DataFrame, resource_id: str, *, dry_run: bool = False, limit_per_request: int = None, offset: int = 0, total_limit: int = None, requests_limit: int = None, force: bool = None, method: UpsertChoice | str = UpsertChoice.Upsert, apply_last_condition: bool = True, always_last_condition: bool = None, return_df: bool = None, data_cleaner: CkanDataCleanerABC = None, progress_callback: CkanProgressCallbackABC = None, params: dict = None, return_documents: bool = True, return_counters: bool = False) → DataFrame | List[dict] | Tuple[DataFrame | List[dict], LinesRequestCounter] | LinesRequestCounter | None

Wrapper around _api_datastore_upsert that splits a given DataFrame / list of dicts into requests with a limited number of rows.

See:

_api_datastore_upsert()

Parameters:

records – records, preferably in a pandas DataFrame - they will be converted to a list of dictionaries.
resource_id – destination resource id
method – by default, set to Upsert
force – set to True to edit a read-only resource. If not provided, this is overridden by self.default_force
limit_per_request – number of records per transaction
offset – number of records to skip - use to restart the transfer
total_limit – maximum number of lines to transmit, counting from the initial offset
requests_limit – maximum number of requests
params – additional parameters
dry_run – set to True to abort transaction instead of committing, e.g. to check for validation or type errors
apply_last_condition – if True, the last upsert request applies the last insert operations (calculate_record_count and force_indexing).
always_last_condition – if True, each request applies the last insert operations - default is False
data_cleaner – data cleaner instance. A data cleaner detects and changes invalid values before upload.
progress_callback – progress callback function
return_df – if True, inserted documents are returned as a pandas DataFrame or else, a list of dictionaries.
return_documents – option to accumulate and return inserted documents
return_counters – if True, return a dict of request counters in addition to the received records

Returns:

rows_inserted, counters:

rows_inserted – the documents inserted (DataFrame or list of dictionaries depending on return_df)
counters – number of inserted records

This is the order of the return values with return_documents=True and return_counters=True. The presence of each return value is controlled by the corresponding argument.
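
How limit_per_request, offset, total_limit and requests_limit could combine to slice the records into requests is sketched below. This is a minimal illustration under assumptions, not the library's code: iter_upsert_chunks is a hypothetical helper, and the is_last flag stands for the request on which the last insert operations would be applied.

```python
def iter_upsert_chunks(records, limit_per_request, offset=0,
                       total_limit=None, requests_limit=None):
    """Yield (chunk, is_last) pairs; is_last marks the final request of the series."""
    stop = len(records)
    if total_limit is not None:
        # total_limit counts from the initial offset
        stop = min(stop, offset + total_limit)
    starts = list(range(offset, stop, limit_per_request))
    if requests_limit is not None:
        starts = starts[:requests_limit]
    for i, start in enumerate(starts):
        chunk = records[start:min(start + limit_per_request, stop)]
        yield chunk, i == len(starts) - 1


records = [{"id": i} for i in range(10)]
plan = [(len(chunk), is_last)
        for chunk, is_last in iter_upsert_chunks(records, 4, offset=1, total_limit=7)]
```

With 10 records, an offset of 1 and a total limit of 7, rows 1 to 7 are sent in two requests of 4 and 3 rows, and only the second one is flagged as last.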

_datastore_upsert_generator(records_generator: Iterable[ListRecords | DataFrame], resource_id: str, *, dry_run: bool = False, limit_per_request: int = None, offset: int = 0, request_threshold: int = None, total_limit: int = None, requests_limit: int = None, force: bool = None, method: UpsertChoice | str = UpsertChoice.Upsert, apply_last_condition: bool = True, always_last_condition: bool = None, return_df: bool = None, return_documents: bool = False, return_counters: bool = True, data_cleaner: CkanDataCleanerABC = None, progress_callback: CkanProgressCallbackABC = None, params: dict = None) → DataFrame | List[dict] | Tuple[DataFrame | List[dict], LinesRequestCounter] | LinesRequestCounter | None

Encapsulation of datastore_upsert to send the rows by chunks provided by records_generator.

Parameters:

records_generator – generator of records, e.g. chunks from a CSV file generated with pandas.read_csv(.., chunksize=1000)
resource_id – destination resource id
method – by default, set to Upsert
force – set to True to edit a read-only resource. If not provided, this is overridden by self.default_force
request_threshold – number of records to accumulate before sending a request (argument specific to this method). If not specified, chunks are sent to _datastore_upsert_df at each new iteration.
limit_per_request – number of records per transaction
offset – number of records to skip - use to restart the transfer
total_limit – maximum number of lines to transmit, counting from the initial offset
requests_limit – maximum number of requests
params – additional parameters
dry_run – set to True to abort transaction instead of committing, e.g. to check for validation or type errors
apply_last_condition – if True, the last upsert request applies the last insert operations (calculate_record_count and force_indexing).
always_last_condition – if True, each request applies the last insert operations - default is False
data_cleaner – data cleaner instance. A data cleaner detects and changes invalid values before upload.
progress_callback – progress callback function
return_df – if True, inserted documents are returned as a pandas DataFrame or else, a list of dictionaries.
return_documents – option to accumulate and return inserted documents
return_counters – if True, return a dict of request counters in addition to the received records

Returns:

rows_inserted, counters:

rows_inserted – the documents inserted (DataFrame or list of dictionaries depending on return_df)
counters – number of inserted records

This is the order of the return values with return_documents=True and return_counters=True. The presence of each return value is controlled by the corresponding argument.
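
The effect of request_threshold on chunks coming from a generator can be sketched as a buffering step. This is an illustration under assumptions: accumulate_chunks and read_chunks are hypothetical, and plain lists stand in for the DataFrame chunks that pandas.read_csv(.., chunksize=...) would produce.

```python
def accumulate_chunks(chunks, request_threshold=None):
    """Group incoming chunks so that each batch holds at least request_threshold records.

    Without a threshold, each chunk is forwarded as soon as it arrives.
    """
    buffer = []
    for chunk in chunks:
        if request_threshold is None:
            yield list(chunk)
            continue
        buffer.extend(chunk)
        if len(buffer) >= request_threshold:
            yield buffer
            buffer = []
    if buffer:
        yield buffer  # flush the remainder after the generator is exhausted


def read_chunks():
    """Stand-in for a chunked reader."""
    yield [1, 2]
    yield [3]
    yield [4, 5, 6]
    yield [7]


batches = list(accumulate_chunks(read_chunks(), request_threshold=3))
```

Each batch would then be handed to the per-request splitting stage, which applies limit_per_request.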

copy(new_identifier: str = None, *, dest=None): Returns a copy of the current instance. Useful for sharing an initialized ckan object in a multithreaded context, where each thread has its own copy. It is recommended to purge the last response before copying (with purge_map=False).

datastore_submit(resource_id: str, *, apply_delay: bool = True, error_timeout: bool = True, params: dict = None) → bool

Submit file to re-initiate DataStore, using the preferred method. Current method is datapusher_submit. This encapsulation includes a call to datastore_wait.

Parameters:

resource_id
apply_delay – Keep true to wait until the datastore is ready (a datastore_search query is performed as a test)
params

Returns:

datastore_upsert(records_generator: DataFrame | List[dict] | Iterable[ListRecords | DataFrame], resource_id: str, *, dry_run: bool = False, limit_per_request: int = None, offset: int = 0, request_threshold: int = None, total_limit: int = None, requests_limit: int = None, force: bool = None, method: UpsertChoice | str = UpsertChoice.Upsert, apply_last_condition: bool = True, always_last_condition: bool = None, return_df: bool = None, return_documents: bool = False, return_counters: bool = True, data_cleaner: CkanDataCleanerABC = None, progress_callback: CkanProgressCallbackABC = None, params: dict = None, limit: int = None, exclude_generator_mode: bool = False, records: DataFrame | List[dict] = None) → DataFrame | List[dict] | Tuple[DataFrame | List[dict], LinesRequestCounter] | LinesRequestCounter | None

Main entry point for datastore_upsert accepting generators or DataFrames. The appropriate function is selected based on the type of the records_generator argument.

See:

datastore_upsert_generator(), datastore_upsert()

Parameters:

records_generator – records or generator of records, e.g. chunks from a CSV file generated with pandas.read_csv(.., chunksize=1000)
records – keyword alias for records_generator (for previous versions compatibility). If used, records_generator must remain None.
exclude_generator_mode – option to raise an error if the generator mode is detected
resource_id – destination resource id
method – by default, set to Upsert
force – set to True to edit a read-only resource. If not provided, this is overridden by self.default_force
request_threshold – number of records to accumulate before sending a request, in case of the use of a DataFrame generator. If not specified, chunks are sent to _datastore_upsert_df at each new iteration.
limit_per_request – number of records per transaction
offset – number of records to skip - use to restart the transfer
total_limit – maximum number of lines to transmit, counting from the initial offset
requests_limit – maximum number of requests
limit – previously limit_per_request, now stands for total_limit. This parameter is deprecated and will be removed in a future release.
params – additional parameters
dry_run – set to True to abort transaction instead of committing, e.g. to check for validation or type errors
apply_last_condition – if True, the last upsert request applies the last insert operations (calculate_record_count and force_indexing).
always_last_condition – if True, each request applies the last insert operations - default is False
data_cleaner – data cleaner instance. A data cleaner detects and changes invalid values before upload.
progress_callback – progress callback function
return_df – if True, inserted documents are returned as a pandas DataFrame or else, a list of dictionaries.
return_documents – option to accumulate and return inserted documents
return_counters – if True, return a dict of request counters in addition to the received records

Returns:

the number of records inserted
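
Dispatching on the type of the first argument, as described for this entry point, might look like the sketch below. The detection rules are assumptions and not taken from the library: select_upsert_mode is hypothetical, and a DataFrame is recognized by duck typing so that the example runs without pandas.

```python
from collections.abc import Iterable


def select_upsert_mode(records, exclude_generator_mode=False):
    """Return 'records' for in-memory data and 'generator' for lazily produced chunks."""
    if isinstance(records, (list, dict)):
        return "records"
    # A DataFrame also belongs to the 'records' branch (detected here by duck typing)
    if hasattr(records, "to_dict") and hasattr(records, "columns"):
        return "records"
    if isinstance(records, Iterable):
        if exclude_generator_mode:
            raise TypeError("generator mode detected but excluded by the caller")
        return "generator"
    raise TypeError(f"unsupported records type: {type(records).__name__}")


mode_list = select_upsert_mode([{"a": 1}])
mode_generator = select_upsert_mode(iter([[{"a": 1}], [{"a": 2}]]))
```

An option such as exclude_generator_mode is useful when the caller expects in-memory records and wants an unexpected generator to fail loudly.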

datastore_upsert_last_line(resource_id: str): Apply last line treatments to a resource.

datastore_wait(resource_id: str, *, apply_delay: bool = True, error_timeout: bool = True) → Tuple[int, float]

Wait until a DataStore has at least one row. The delay between requests that check for the presence of the DataStore is given by the class attribute submit_delay. If the loop exceeds submit_timeout, an exception is raised.

Parameters:

resource_id
apply_delay
error_timeout – option to raise an exception in case of timeout

Returns:
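
The waiting loop described above (poll, sleep for a delay, give up after a timeout) can be sketched generically. This is not the library's implementation: wait_until and datastore_has_rows are hypothetical, and interpreting the returned tuple as (number of attempts, elapsed seconds) is an assumption based only on the Tuple[int, float] annotation.

```python
import time


def wait_until(check, delay=0.01, timeout=1.0, error_timeout=True):
    """Call check() repeatedly until it returns True; return (attempts, elapsed seconds)."""
    start = time.monotonic()
    attempts = 0
    while True:
        attempts += 1
        if check():
            return attempts, time.monotonic() - start
        if time.monotonic() - start >= timeout:
            if error_timeout:
                raise TimeoutError(f"condition not met after {attempts} attempts")
            return attempts, time.monotonic() - start
        time.sleep(delay)


state = {"calls": 0}


def datastore_has_rows():
    """Stand-in for a datastore_search test query; succeeds on the third call."""
    state["calls"] += 1
    return state["calls"] >= 3


attempts, elapsed = wait_until(datastore_has_rows, delay=0.01, timeout=1.0)
```

With error_timeout=False, the sketch returns instead of raising, leaving the caller to decide what to do with a DataStore that is not ready.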

full_unlock(unlock: bool = True, *, no_ca: bool = None, external_url_resource_download: bool = None) → None

Function to unlock full capabilities of the CKAN API

Parameters:: unlock
Returns:

resource_patch(resource_id: str, *, name: str = None, format: str = None, description: str = None, title: str = None, state: CkanState = None, df: DataFrame = None, file_path: str = None, url: str = None, files=None, payload: bytes | BufferedIOBase = None, payload_name: str = None, params: dict = None) → CkanResourceInfo

set_limits(limit_read: int | None, limit_write: int = None) → None

set_limits_per_request(limit_read: int | None, limit_write: int = None) → None

Set default query limits. If only one argument is provided, it applies to both limits.

Parameters:

limit_read – default limit for read requests
limit_write – default limit for upsert (write) requests

Returns:

set_submit_timeout(submit_timeout: float, submit_delay: float = None) → None

Set timeout for the datastore_wait method. This is called after datastore_submit.

Parameters:

submit_timeout – timeout after which a TimeoutError is raised (seconds)
submit_delay – delay between requests that check for DataStore initialization (datastore_wait) (seconds)

Returns:

class ckanapi_harvesters.ckan_api.ckan_api_4_readwrite.CkanApiReadWriteParams(*, proxies: str | dict | ProxyConfig = None, ckan_headers: dict = None, http_headers: dict = None)

Bases: CkanApiPolicyParams

copy(new_identifier: str = None, *, dest=None)

default_readonly: bool = False

upsert_limit_reached_warning: bool = False

ckanapi_harvesters.ckan_api.ckan_api_5_manage module

class ckanapi_harvesters.ckan_api.ckan_api_5_manage.CkanApiExtendedParams(*, proxies: str | dict | ProxyConfig = None, ckan_headers: dict = None, http_headers: dict = None)

Bases: CkanApiManageParams

copy(new_identifier: str = None, *, dest=None)

class ckanapi_harvesters.ckan_api.ckan_api_5_manage.CkanApiManage(url: str = None, *, proxies: str | dict | ProxyConfig = None, apikey: str | CkanApiKey = None, apikey_file: str = None, owner_org: str = None, params: CkanApiExtendedParams = None, map: CkanMap = None, policy: CkanPackageDataFormatPolicy = None, policy_file: str = None, data_cleaner_upload: CkanDataCleanerABC = None, identifier=None)

Bases: CkanApiReadWrite

CKAN Database API interface to CKAN server with helper functions using pandas DataFrames. This class implements more advanced requests to manage packages, resources and DataStores on the CKAN server.

__init__(url: str = None, *, proxies: str | dict | ProxyConfig = None, apikey: str | CkanApiKey = None, apikey_file: str = None, owner_org: str = None, params: CkanApiExtendedParams = None, map: CkanMap = None, policy: CkanPackageDataFormatPolicy = None, policy_file: str = None, data_cleaner_upload: CkanDataCleanerABC = None, identifier=None)

CKAN Database API interface to CKAN server with helper functions using pandas DataFrames.

Parameters:

url – url of the CKAN server
proxies – proxies to use for requests
apikey – way to provide the API key directly (optional)
apikey_file – path to a file containing a valid API key in the first line of text (optional)
policy – data format policy to use with policy_check function
policy_file – path to a JSON file containing the data format policy to use with policy_check function
owner_org – name of the organization to limit package_search (optional)
params – other connection/behavior parameters
map – map of known resources
data_cleaner_upload – data cleaner object to use before uploading to a CKAN DataStore.
identifier – identifier of the ckan client

_api_dataset_purge(package_id: str, *, params: dict = None) → dict

API call to dataset_purge. This fully removes the package. This action is not reversible. It requires an admin account.

Parameters:

package_id
params

Returns:

_api_datastore_create(resource_id: str, *, records: dict | List[dict] | DataFrame = None, fields: List[dict | CkanField] = None, delete_fields: bool = None, primary_key: str | List[str] = None, indexes: str | List[str] = None, aliases: str | List[str] = None, calculate_record_count: bool = None, params: dict = None, force: bool = None) → dict

API call to datastore_create. This endpoint also supports altering tables, aliases and indexes and bulk insertion.

Parameters:

resource_id – resource id
records
fields
primary_key
indexes
params
force

Returns:

_api_datastore_delete(resource_id: str, *, params: dict = None, force: bool = None) → dict

Function to delete rows from a DataStore using the datastore_delete API. If no filter is given, the whole database will be erased. This function is private and should not be called directly.

Parameters:

resource_id
params
force – set to True to edit a read-only resource. If not provided, this is overridden by self.default_force

Returns:

_api_datastore_records_delete(resource_id: str, *, params: dict = None, force: bool = None) → dict

Function to delete rows from a DataStore using the datastore_records_delete API. This API will never remove the table itself. Available in CKAN version >= 2.11.

Parameters:

resource_id
params
force – set to True to edit a read-only resource. If not provided, this is overridden by self.default_force

Returns:

_api_package_create(name: str, private: bool, *, title: str = None, notes: str = None, owner_org: str = None, state: CkanState | str = None, license_id: str = None, tags: List[str] = None, tags_list_dict: List[Dict[str, str]] = None, url: str = None, version: str = None, custom_fields: dict = None, author: str = None, author_email: str = None, maintainer: str = None, maintainer_email: str = None, params: dict = None) → CkanPackageInfo

API call to package_create.

Parameters:

name
private
title
notes
owner_org
state
license_id
tags
params

Returns:

_api_package_delete(package_id: str, *, params: dict = None) → dict

API call to package_delete. This marks the package as deleted and does not remove data.

Parameters:

package_id
params

Returns:

_api_package_patch(package_id: str, package_name: str = None, private: bool = None, *, title: str = None, notes: str = None, owner_org: str = None, state: CkanState | str = None, license_id: str = None, tags: List[str] = None, tags_list_dict: List[Dict[str, str]] = None, url: str = None, version: str = None, custom_fields_update: dict = None, custom_fields: dict = None, author: str = None, author_email: str = None, maintainer: str = None, maintainer_email: str = None, params: dict = None) → CkanPackageInfo

API call to package_patch. Use to change the properties of a package. This method is preferred to package_update, which requires resending the full package configuration. (API doc for package_update: It is recommended to call ckan.logic.action.get.package_show(), make the desired changes to the result, and then call package_update() with it.)

Parameters:

package_id
package_name
private
title
notes
owner_org
state
license_id
params

Returns:

_api_package_resource_reorder(package_id: str, resource_ids: List[str], *, params: dict = None) → dict

API call to package_resource_reorder. Reorders resources within a package. If only some of the resource ids are supplied, these are placed first and the other resources stay in their original order.

Parameters:

package_id – the id or name of the package to update
resource_ids – a list of resource ids in the order needed
params

Returns:
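
The ordering rule for a partial list of resource ids can be illustrated locally. The helper below only computes the order that the description above implies and performs no API call; reordered is a hypothetical name.

```python
def reordered(current_ids, requested_ids):
    """Requested ids come first; the remaining ids keep their original relative order."""
    requested = list(requested_ids)
    requested_set = set(requested)
    remaining = [rid for rid in current_ids if rid not in requested_set]
    return requested + remaining


new_order = reordered(["a", "b", "c", "d"], ["c", "a"])
```

Here the resources c and a move to the front in the requested order, while b and d follow in their original order.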

_api_resource_create(package_id: str, name: str, *, format: str = None, description: str = None, state: CkanState = None, df: DataFrame = None, file_path: str = None, url: str = None, files=None, payload: bytes | BufferedIOBase = None, payload_name: str = None, params: dict = None) → CkanResourceInfo

API call to resource_create.

See:

_api_resource_patch

See:

resource_create

Parameters:

package_id
name
format
url – url of the resource to replace resource
params – additional parameters such as resource_type can be set

Note

For file uploads, the following parameters are considered, in order of priority. See upload_prepare_requests_files_arg for an example of formatting.

Parameters:

files – files pass through argument to the requests.post function. Use to send other data formats.
payload – bytes to upload as a file
payload_name – name of the payload to use (associated with the payload argument) - this determines the format recognized in CKAN viewers.
file_path – path of the file to transmit (binary and text files are supported here)
df – pandas DataFrame to replace resource

Returns:

_api_resource_delete(resource_id: str, *, params: dict = None, force: bool = None, bypass_admin: bool = False) → dict

Function to delete a resource. This permanently removes the resource and cannot be undone. Requires enable_admin=True.

Parameters:

resource_id
params
force – set to True to edit a read-only resource. If not provided, this is overridden by self.default_force
bypass_admin – option to bypass check of enable_admin

Returns:

_api_resource_view_create(resource_id: str, title: str | List[str] = None, *, view_type: str | List[str] = None, params: dict = None) → List[CkanViewInfo]

API call to resource_view_create.

title and view_type must have same length if specified as lists.
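The length constraint can be checked before any request is sent. The sketch below is an illustration only: pair_titles_and_types is a hypothetical helper, and treating a single string as a one-item list is an assumption.

```python
from typing import List, Tuple, Union

StrOrList = Union[str, List[str]]


def pair_titles_and_types(title: StrOrList, view_type: StrOrList) -> List[Tuple[str, str]]:
    """Pair each view title with its view type, enforcing equal lengths."""
    titles = [title] if isinstance(title, str) else list(title)
    types = [view_type] if isinstance(view_type, str) else list(view_type)
    if len(titles) != len(types):
        raise ValueError("title and view_type must have the same length")
    return list(zip(titles, types))


pairs = pair_titles_and_types(["Table", "Grid"], ["recline_view", "recline_grid_view"])
single = pair_titles_and_types("Table", "recline_view")
```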

Parameters:

resource_id – resource id
title – title of the view
view_type – Type of view, typically recline_view for Data Explorer
params

Returns:

static _datastore_fields_dict_merge(fields: List[dict | CkanField] | OrderedDict[str, CkanField | dict] = None, fields_merge: List[dict | CkanField] | OrderedDict[str, CkanField | dict] = None, fields_update: List[dict | CkanField] | OrderedDict[str, CkanField | dict] = None, *, fields_type_override: Dict[str, str] = None, fields_description: Dict[str, str] = None, fields_label: Dict[str, str] = None, return_list: bool = False) → Dict[str, CkanField] | List[dict]

Initialization of the fields parameter for datastore_create. Only the parts used by this package are present. To complete the field dictionaries, refer to datastore_field_patch_dict.

Parameters:

fields – first source of field information, usually the fields from the DataStore
fields_merge – second source. Values from this dictionary will overwrite fields
fields_update – third source. Values from this dictionary take priority over all other values.
fields_type_override
fields_description
fields_label
return_list

Returns:

dict if return_list is False, list if return_list is True.

The dict can be converted to a list with: `fields = list(fields_update.values())`
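The precedence between the three sources can be sketched with plain dictionaries. This is an illustration under stated assumptions, not the library's code: merge_field_sources and index_by_id are hypothetical helpers, fields are represented as plain dicts keyed by "id", and the per-field merge is shallow.

```python
from collections import OrderedDict
from typing import List


def index_by_id(fields: List[dict]) -> "OrderedDict[str, dict]":
    """Index a list of field dictionaries by their id, keeping the order."""
    return OrderedDict((field["id"], dict(field)) for field in fields)


def merge_field_sources(fields: List[dict], fields_merge: List[dict],
                        fields_update: List[dict]) -> "OrderedDict[str, dict]":
    """Merge three sources; later sources overwrite earlier ones."""
    merged = index_by_id(fields)
    for source in (fields_merge, fields_update):
        for field in source:
            merged.setdefault(field["id"], {"id": field["id"]}).update(field)
    return merged


from_datastore = [{"id": "station", "type": "text"}, {"id": "value", "type": "text"}]
from_config = [{"id": "value", "type": "numeric"}]
from_caller = [{"id": "value", "type": "float8"}, {"id": "ts", "type": "timestamp"}]

merged = merge_field_sources(from_datastore, from_config, from_caller)
as_list = list(merged.values())  # list form, as with return_list=True
```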

_datastore_fields_patch_dict(fields_merge: List[dict | CkanField] | OrderedDict[str, CkanField | dict] = None, fields_update: List[dict | CkanField] | OrderedDict[str, CkanField | dict] = None, *, fields_type_override: Dict[str, str] = None, fields_description: Dict[str, str] = None, fields_label: Dict[str, str] = None, return_list: bool = False, datastore_merge: bool = True, resource_id: str = None, error_not_found: bool = True) → Tuple[bool | None, Dict[str, CkanField] | List[dict]]

Calls datastore_field_dict and merges attributes with those found in datastore_info if datastore_merge=True.

Parameters:

fields_update
fields_type_override
fields_description
fields_label
return_list
datastore_merge
resource_id – required if datastore_merge=True

Returns:

copy(new_identifier: str = None, *, dest=None): Returns a copy of the current instance. Useful for sharing an initialized ckan object in a multithreaded context, where each thread has its own copy. It is recommended to purge the last response before making a copy (with purge_map=False).

datastore_clear(resource_id: str, *, error_not_found: bool = True, params: dict = None, force: bool = None, bypass_admin: bool = False) → dict | None

Function to clear data in a DataStore using _api_datastore_delete. Requires enable_admin=True. This implementation adds the option error_not_found. If set to False, no error is raised if the resource is found but the DataStore is not.

See:

_api_datastore_delete()

Parameters:

resource_id
error_not_found – if False, does not raise an exception if the resource exists but there is no DataStore
params
force – set to True to edit a read-only resource. If not provided, this is overridden by self.default_force
bypass_admin – option to bypass check of enable_admin

Returns:

datastore_create(resource_id: str, *, delete_previous: bool = False, bypass_admin: bool = False, records: dict | List[dict] | DataFrame = None, fields: List[dict | CkanField] = None, primary_key: str | List[str] = None, indexes: str | List[str] = None, aliases: str | List[str] = None, params: dict = None, force: bool = None, data_cleaner: CkanDataCleanerABC = None, inhibit_datastore_patch_indexes: bool = False, progress_callback: CkanProgressCallbackABC = None) → dict

Encapsulation of the datastore_create API call. This function can optionally clear the DataStore before creating it.

Parameters:

resource_id
delete_previous – option to delete the previous DataStore, if it exists (default: False)
records
fields
primary_key
indexes
params
force
inhibit_datastore_patch_indexes – option to ignore primary_key and indexes in case the DataStore already exists. In certain cases, running without this option can lead to impossible updates (recomputing indexes on large tables can be costly).

Returns:

datastore_default_alias(resource_name: str, package_name: str, *, query_names: bool = True, error_not_found: bool = True) → str

static datastore_default_alias_of_info(resource_info: CkanResourceInfo, package_info: CkanPackageInfo) → str

static datastore_default_alias_of_names(resource_name: str, package_name: str) → str

datastore_fields(resource_id: str, *, error_not_found: bool = True) → dict[str, CkanField] | None: Obtain the fields composing a DataStore

datastore_fields_delete(resource_id: str, fields_delete: str | List[str], *, params: dict = None, bypass_admin: bool = True) → dict

Request to delete fields from a DataStore. Available in CKAN 2.11 and later.

Parameters:

resource_id – resource id
fields_delete – list of fields to delete or string with names separated by commas (,)
bypass_admin – option to bypass admin state check (locally)
params – additional parameters to pass to the CKAN API datastore_create
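Since fields_delete accepts either a list or a comma-separated string, both forms can be brought to a single list before use. normalize_fields_delete is a hypothetical helper written for illustration; stripping surrounding whitespace and skipping empty names are assumptions of this sketch.

```python
from typing import List, Union


def normalize_fields_delete(fields_delete: Union[str, List[str]]) -> List[str]:
    """Return a list of field names from a list or a comma-separated string."""
    if isinstance(fields_delete, str):
        fields_delete = fields_delete.split(",")
    # Assumption: surrounding whitespace is not significant and empty names are skipped.
    return [name.strip() for name in fields_delete if name.strip()]


from_string = normalize_fields_delete("temperature, humidity,pressure")
from_list = normalize_fields_delete(["temperature", "humidity"])
```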

Helper function calling the API datastore_create in order to update the parameters of some fields. The initial field configuration is taken from the mapped information, or requested from the server. Typically, this can be used to enforce a data type on a field. In this case, the resource data must be resubmitted with the API resource_patch. The field_update argument would be e.g. field_update={"id": {"info": {"type_override": "text"}}}. This is equivalent to the option field_type_override={"id": "text"}.

Note

It is not possible to rename a field after creation through the API. To do this, the change must be done in the database.

Parameters:

resource_id – resource id
fields_update – dictionary of field id and properties to change. The update of the property dictionary is recursive, ensuring only the fields appearing in the update are changed. This field can be overridden by the values given in field_type_override, field_description, or field_label.
fields_type_override – argument to simplify editing the info.type_override value for each field id
field_description – argument to simplify editing the info.notes value for each field id
fields_label – argument to simplify editing the info.label value for each field id
only_if_needed – Cancels the request if the changes do not affect the current configuration

Returns:

a tuple (update_needed, fields_new, update_dict)
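The recursive update and the type-override shorthand described above can be sketched as follows. Both helpers are hypothetical and only illustrate the documented behavior: nested dictionaries are updated key by key, so properties absent from the update are left untouched.

```python
import copy


def recursive_update(target: dict, changes: dict) -> dict:
    """Update target in place, descending into nested dictionaries."""
    for key, value in changes.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            recursive_update(target[key], value)
        else:
            target[key] = copy.deepcopy(value)
    return target


def expand_type_override(type_override: dict) -> dict:
    """Expand {"id": "text"} into {"id": {"info": {"type_override": "text"}}}."""
    return {field_id: {"info": {"type_override": new_type}}
            for field_id, new_type in type_override.items()}


current = {"id": {"id": "id", "type": "numeric",
                  "info": {"label": "Identifier", "notes": "Primary key"}}}

# Explicit form and shorthand form applied to separate copies of the configuration.
explicit = recursive_update(copy.deepcopy(current), {"id": {"info": {"type_override": "text"}}})
shorthand = recursive_update(copy.deepcopy(current), expand_type_override({"id": "text"}))
```

Both forms give the same result, and the existing label and notes survive the update.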

datastore_records_delete(resource_id: str, filters: dict, *, params: dict = None, force: bool = None, calculate_record_count: bool = True) → dict

Function to delete certain rows of a DataStore using _api_datastore_delete. The filters are mandatory here: if they were not given, the whole DataStore would be erased. To erase everything, prefer datastore_clear.
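A guard of this kind can be sketched as below. build_records_delete_payload is a hypothetical helper that only assembles the request body for the CKAN datastore_delete action; the resource id is a placeholder and nothing is sent to a server.

```python
def build_records_delete_payload(resource_id: str, filters: dict, *,
                                 calculate_record_count: bool = True) -> dict:
    """Build a datastore_delete request body, refusing an empty filter."""
    if not filters:
        raise ValueError("filters must not be empty: an unfiltered delete would erase every record")
    return {
        "resource_id": resource_id,
        "filters": dict(filters),
        "calculate_record_count": calculate_record_count,
    }


# Placeholder resource id, for illustration only.
body = build_records_delete_payload("00000000-0000-0000-0000-000000000000", {"station": "A12"})
```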

See:

_api_datastore_delete()

Parameters:

resource_id
filters
params
force – set to True to edit a read-only resource. If not provided, this is overridden by self.default_force
calculate_record_count

Returns:

static default_resource_view(resource_format: str, is_datastore: bool = True) → Tuple[str, str]

Definition of the default resource view based on the resource format.

Parameters:

resource_format

Returns:

full_unlock(unlock: bool = True, *, no_ca: bool = None, external_url_resource_download: bool = None) → None

Function to unlock full capabilities of the CKAN API

Parameters:

unlock

Returns:

package_create(package_name: str, private: bool = True, *, title: str = None, notes: str = None, owner_org: str = None, state: CkanState | str = None, license_id: str = None, tags: List[str] = None, tags_list_dict: List[Dict[str, str]] = None, url: str = None, version: str = None, custom_fields_update: dict = None, custom_fields: dict = None, author: str = None, author_email: str = None, maintainer: str = None, maintainer_email: str = None, params: dict = None, cancel_if_exists: bool = True, update_if_exists=True, clear_if_deleted_state: bool = None) → CkanPackageInfo

Helper function to create a new package. This first checks if the package already exists.

See:

_api_package_create()

Parameters:

package_name
private
title
notes
owner_org
license_id
state
params
cancel_if_exists
update_if_exists
clear_if_deleted_state – Option to clear the resources of a package if it was found in Deleted state. Default behavior is set in params.

Returns:

package_delete(package_id: str, definitive_delete: bool = False, *, params: dict = None) → dict

Alias function for package removal. Calls either the API package_delete, which only marks the package for deletion, or dataset_purge, which permanently deletes it.
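The choice between the two actions can be sketched as a simple dispatch. The helper names and the server URL below are hypothetical; only the action names and the /api/3/action/ path come from the CKAN action API.

```python
def deletion_action(definitive_delete: bool = False) -> str:
    """Name of the CKAN action used for each deletion mode."""
    return "dataset_purge" if definitive_delete else "package_delete"


def action_url(ckan_url: str, action: str) -> str:
    """URL of a CKAN action on a given server."""
    return f"{ckan_url.rstrip('/')}/api/3/action/{action}"


# Example server URL, for illustration only.
soft_delete_url = action_url("https://ckan.example.org/", deletion_action(False))
purge_url = action_url("https://ckan.example.org/", deletion_action(True))
```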

Parameters:

package_id
definitive_delete – True: calls dataset_purge (action not reversible), False: calls API package_delete.
params

Returns:

package_delete_resources(package_name: str, *, bypass_admin: bool = False)

Permanently delete all resources associated with the package.

Parameters:

package_name

Returns:

package_patch(package_id: str, package_name: str = None, private: bool = None, *, title: str = None, notes: str = None, owner_org: str = None, state: CkanState | str = None, license_id: str = None, tags: List[str] = None, tags_list_dict: List[Dict[str, str]] = None, url: str = None, version: str = None, custom_fields_update: dict = None, custom_fields: dict = None, author: str = None, author_email: str = None, maintainer: str = None, maintainer_email: str = None, params: dict = None) → CkanPackageInfo

package_resource_reorder(package_id: str, resource_ids: List[str], *, params: dict = None) → dict

API call to package_resource_reorder. Reorders the resources within a package (dataset). If only some of the resource ids are supplied, these are placed first and the other resources keep their original order.

Parameters:

package_id – the id or name of the package to update
resource_ids – a list of resource ids in the order needed
params

Returns:

package_state_change(package_id: str, state: CkanState) → CkanPackageInfo

Change package state using the package_patch API.

Parameters:

package_id
state

Returns:

resource_create(package_id: str, name: str, *, format: str = None, description: str = None, state: CkanState = None, params: dict = None, url: str = None, files=None, file_path: str = None, df: DataFrame = None, payload: bytes | BufferedIOBase = None, payload_name: str = None, cancel_if_exists: bool = True, update_if_exists: bool = False, reupload: bool = False, create_default_view: bool = True, auto_submit: bool = False, error_submit_timeout: bool = True, datastore_create: bool = False, records: dict | List[dict] | DataFrame = None, fields: List[dict] = None, primary_key: str | List[str] = None, indexes: str | List[str] = None, aliases: str | List[str] = None, inhibit_datastore_patch_indexes: bool = False, data_cleaner: CkanDataCleanerABC = None, records_to_file: DataStoreReprFormat = None, progress_callback: CkanProgressCallbackABC = None) → CkanResourceInfo

Proxy to the API call resource_create, which first checks whether a resource with the same name already exists and then adds the default view.

Parameters:

package_id
name
format
params
cancel_if_exists – check whether a resource with the same name already exists in the package on the CKAN server. If it does, the info for this resource is returned.
update_if_exists – If a resource with the same name already exists (and cancel_if_exists=True), a call to resource_patch is performed.
reupload – re-upload the resource if a resource with the same name already exists and cancel_if_exists=True and update_if_exists=True
create_default_view
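The interaction between cancel_if_exists, update_if_exists and reupload can be summarized as a small decision function. This is a reading of the parameter descriptions above, not the library's code: resource_create_plan and its return labels are hypothetical, and the behavior with cancel_if_exists=False (no existence check, so a creation is attempted) is an assumption.

```python
def resource_create_plan(exists: bool, *, cancel_if_exists: bool = True,
                         update_if_exists: bool = False, reupload: bool = False) -> str:
    """Return a label for the action implied by the flags (illustration only)."""
    # Assumption: without the existence check, the resource is always created.
    if not cancel_if_exists or not exists:
        return "create"
    if not update_if_exists:
        return "return_existing"
    return "patch_and_reupload" if reupload else "patch"


default_plan = resource_create_plan(exists=True)
update_plan = resource_create_plan(exists=True, update_if_exists=True)
reupload_plan = resource_create_plan(exists=True, update_if_exists=True, reupload=True)
```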

Note

For file uploads, the following parameters are considered, in order of priority. See upload_prepare_requests_files_arg for an example of formatting.

Parameters:

files – pass-through argument for the files parameter of the requests.post function. Use it to send other data formats.
payload – bytes to upload as a file
payload_name – name of the payload to use (associated with the payload argument) - this determines the format recognized in CKAN viewers.
file_path – path of the file to transmit (binary and text files are supported here)
df – pandas DataFrame to upload as the resource content

Returns:

resource_delete(resource_id: str, *, params: dict = None, force: bool = None, bypass_admin: bool = False) → dict

resource_view_create(resource_id: str, title: str | List[str] = None, *, view_type: str | List[str] = None, params: dict = None, error_no_default_view_type: bool = False, cancel_if_exists: bool = True, is_datastore: bool = True) → List[CkanViewInfo]

Encapsulation of the API resource_view_create. If no resource view is provided to create (None), the function looks up the default view defined in default_resource_view. This function also looks at the existing views and cancels the creation of those which have the same title. If provided as a list, title and view_type must have same length.
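The title-based check can be sketched as a filter over the requested views. views_to_create is a hypothetical helper; it assumes the titles of the existing views have already been fetched and performs no request itself.

```python
from typing import Iterable, List, Tuple


def views_to_create(requested: List[Tuple[str, str]], existing_titles: Iterable[str],
                    cancel_if_exists: bool = True) -> List[Tuple[str, str]]:
    """Drop requested (title, view_type) pairs whose title already exists."""
    if not cancel_if_exists:
        return list(requested)
    existing = set(existing_titles)
    return [(title, view_type) for title, view_type in requested if title not in existing]


requested = [("Table", "recline_view"), ("Grid", "recline_grid_view")]
remaining = views_to_create(requested, existing_titles=["Table"])
```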

Parameters:

resource_id
title
view_type
params
error_no_default_view_type
cancel_if_exists – option to cancel the creation of a view if a view with the same title already exists

Returns:

static verify_field_name_format(field_name: str, *, raise_error: bool = True, display_warnings: bool = True) → bool: Verifies that the field name format is correct.

static verify_package_name_format(package_name: str, *, raise_error: bool = True) → bool: Verifies that the package name format is correct.
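For reference, the CKAN documentation for package_create states that a dataset name must be between 2 and 100 characters long and contain only lowercase alphanumeric characters, - and _. The sketch below encodes that rule; whether verify_package_name_format applies exactly this rule is an assumption, and is_valid_package_name is a hypothetical stand-in.

```python
import re

# Rule from the CKAN package_create documentation: 2 to 100 characters,
# lowercase alphanumeric characters, "-" and "_" only.
PACKAGE_NAME_PATTERN = re.compile(r"[a-z0-9_-]{2,100}")


def is_valid_package_name(name: str) -> bool:
    """Check a dataset name against the CKAN naming rule."""
    return PACKAGE_NAME_PATTERN.fullmatch(name) is not None


valid = is_valid_package_name("air-quality_2024")
invalid_case = is_valid_package_name("Air Quality")
invalid_length = is_valid_package_name("a")
```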

class ckanapi_harvesters.ckan_api.ckan_api_5_manage.CkanApiManageParams(*, proxies: str | dict | ProxyConfig = None, ckan_headers: dict = None, http_headers: dict = None)

Bases: CkanApiReadWriteParams

copy(new_identifier: str = None, *, dest=None)

default_alias_enforce: bool = False

default_enable_admin: bool = False

get_num_rows_datastore_create_partial(limit_per_request: int = None) → int

package_create_default_clear_if_deleted_state: bool = True

ckanapi_harvesters.ckan_api.ckan_api_5_manage.clean_table_name(variable_name: str) → str: Replace unwanted characters and spaces to generate a valid table name

ckanapi_harvesters.ckan_api.ckan_api_params module

Basic parameters for the CkanApi class

class ckanapi_harvesters.ckan_api.ckan_api_params.CkanApiDebug

Bases: object

ckan_request_counter: int

extern_request_counter: int

last_response: Response | None

last_response_request_count: int

multi_requests_last_counters: LinesRequestCounter | None

multi_requests_last_successful_offset: int

class ckanapi_harvesters.ckan_api.ckan_api_params.CkanApiParamsBasic(*, proxies: str | dict | ProxyConfig = None, ckan_headers: dict = None, http_headers: dict = None)

Bases: object

__init__(*, proxies: str | dict | ProxyConfig = None, ckan_headers: dict = None, http_headers: dict = None)

Parameters:

proxies – proxies to use for requests
ckan_headers – headers to use for requests, only to the CKAN server
http_headers – headers to use for requests, for all requests, including external requests and to the CKAN server
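The scope of the two header sets can be sketched as follows. headers_for_request is a hypothetical helper, the URLs and header names are examples, and letting ckan_headers override http_headers on conflict is an assumption of this sketch.

```python
from typing import Optional


def headers_for_request(url: str, ckan_url: str, *, http_headers: Optional[dict] = None,
                        ckan_headers: Optional[dict] = None) -> dict:
    """http_headers apply to every request; ckan_headers only to the CKAN server."""
    headers = dict(http_headers or {})
    base = ckan_url.rstrip("/")
    if url == base or url.startswith(base + "/"):
        # Assumption: CKAN-specific headers take precedence on conflict.
        headers.update(ckan_headers or {})
    return headers


common = {"User-Agent": "demo-client"}
ckan_only = {"X-Example-Token": "secret"}  # hypothetical header name

ckan_request = headers_for_request("https://ckan.example.org/api/3/action/package_show",
                                   "https://ckan.example.org",
                                   http_headers=common, ckan_headers=ckan_only)
external_request = headers_for_request("https://data.example.net/file.csv",
                                       "https://ckan.example.org",
                                       http_headers=common, ckan_headers=ckan_only)
```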

_cli_ckan_args_apply(args: Namespace, *, base_dir: str = None, error_not_found: bool = True, default_proxies: dict = None, proxy_headers: dict = None) → None

Apply the arguments parsed by the argument parser defined by _setup_cli_ckan_parser

Parameters:

args
base_dir – base directory to find the CKAN API key file, if a relative path is provided (recommended: leave None to use cwd)
error_not_found – option to raise an exception if the CKAN API key file is not found
default_proxies – proxies used if proxies="default"
proxy_headers – headers used to access the proxies, generally for authentication

Returns:

static _setup_cli_ckan_parser__params(parser: ArgumentParser = None) → ArgumentParser

Define or add CLI arguments to initialize a CKAN API connection. Parser help message:

CKAN API connection parameters initialization

Parameters:

parser – option to provide an existing parser to add the specific fields needed to initialize a CKAN API connection

Returns:

action_requests_retry_always: bool

property ckan_ca: bool | str | None

ckan_headers: dict

copy(*, dest=None)

default_limit_list_per_request: int | None

default_limit_read_per_request: int | None

dry_run: bool

property extern_ca: bool | str | None

http_headers: dict

max_requests_attempts: int

max_requests_count: int

multi_requests_limit_reached_warning: bool = False

multi_requests_time_between_requests: float

multi_requests_timeout: float

property proxies: dict

property proxy_auth: AuthBase | Tuple[str, str]

property proxy_string: str

requests_timeout: float | None

response_time_wait_threshold: None | float

store_last_response: bool

store_last_response_debug_info: bool

time_between_attempts: float

user_agent: str | None

verbose_extra: bool

verbose_multi_requests: bool

verbose_request: bool

verbose_request_error: bool

Module contents

Package with helper functions for CKAN requests using pandas DataFrames.