Example: Saving dataset metadata to Excel from CKAN API
The Python package ckanapi_harvesters.builder implements functions to download metadata from an existing CKAN dataset (previously known as a package). This notebook illustrates the usages which can be made from this feature among which:
Saving existing metadata from CKAN to an Excel file
Updating the dataset metadata policy scores
Creating a sample dataset from the current dataset
Initialisation
The Python package and its extra dependencies can be installed with the following command:
> pip install ckanapi-harvesters[extras]
The following cell refers to the code present in the Git directory with the option use_git_package.
# initial checks
import sys
print(f"Python version: {sys.version} in {sys.executable}")
# optionally, use the ckanapi_harvesters package present in the Git directory
use_git_package = False
if use_git_package:
import os
cwd = os.getcwd()
if not os.path.isdir(os.path.join(cwd, "ckanapi_harvesters")):
# we assume we are in the examples directory
cwd = os.path.join(cwd, r"../../src") # aim for src directory
assert(os.path.isdir(os.path.join(cwd, "ckanapi_harvesters")))
os.chdir(cwd)
print("CWD changed to: " + os.path.abspath(""))
import os
from ckanapi_harvesters import CkanApi, BuilderPackage, CkanCallbackLevel
from ckanapi_harvesters import __version__ as ckanapi_harvesters_version
from ckanapi_harvesters import package_dir as ckanapi_package_dir
print(f"ckanapi-harvesters version: {ckanapi_harvesters_version} in {ckanapi_package_dir}")
Script configuration
CKAN URL
Package name
API key file
Proxies
ckan_url = "https://demo.ckan.org/"
ckan_url = None # use this line if the CKAN URL is specified in the Excel workbook / user input
package_name = "builder-example-py" # Example dataset from ckanapi-harvesters
apikey_file = os.path.expanduser(os.path.join("~", ".config", "__CKAN_API_KEY__.txt")) # default location: ~/.config/__CKAN_API_KEY__.txt
apikey_file = None # if not specified, the package will look in the different locations and environment variables
print("API key file: " + str(apikey_file))
if apikey_file is not None and not os.path.exists(apikey_file):
print("API key file not found !!!")
# proxy configuration
proxies = {"http": "http://myproxy", "https": "http://myproxy"} # example
proxies = {"http": "", "https": ""} # no proxies
proxies = None # use the system configuration, defined in your environment variables
print("proxies = " + str(proxies))
Connecting to CKAN
ckan = CkanApi(ckan_url, proxies=proxies, apikey_file=apikey_file)
ckan.load_apikey()
ckan.input_missing_info(input_args_if_necessary=True, input_owner_org=True, error_not_found=False) # request user input to configure CKAN
ckan.set_limits_per_request(10000) # reduce if server hangs up
ckan.set_requests_delay(0.1) # increase if server errors 502
ckan.set_verbosity(True) # this displays all the steps performed by the script
ckan.test_ckan_login(raise_error=True, verbose=True) # test if you are correctly logged in
Loading dataset metadata from CKAN API
mdl = BuilderPackage.from_ckan(ckan, package_name)
print(f"Downloaded metadata from CKAN dataset {mdl.package_name}")
print(f"Source dataset URL: {mdl.get_package_page_url(ckan)}")
Displaying the dataset model
df_dict = mdl.get_all_df()
for tab, df in df_dict.items():
display(f"Tab {tab}:")
display(df)
Updating metadata policy scores
mdl.remote_policy_check(ckan, verbose=True)
Saving the current package metadata to an Excel file
This file can be used as an archive to restore metadata on CKAN.
excel_file_out = os.path.abspath("downloaded_metadata.xlsx")
mdl.to_excel(excel_file_out)
print(f"Metadata extracted from CKAN was saved to {excel_file_out}")
Example: Sample dataset creation
This section depicts how to create a sample dataset from the current dataset. A sample dataset is an extract of an existing dataset meant to reflect the contents of its source without exposing personal data.
The string “Sample” is appended to the package name.
In the second cell, samples of the original data are initialized
sample_mdl = mdl.setup_sample_package(ckan, sample_url_suffix="-sample", sample_title_suffix=" - Sample")
print(f"The sample dataset will have the following URL: {sample_mdl.get_package_page_url(ckan)}")
Requesting original data to extract samples
Adapt the code to your use case. By default, the first 10 lines of each DataStore are downloaded. The aim of a sample dataset is to not expose any personal data.
Resources containing files are fully downloaded. To remove them, use option empty_files=True when calling download_sample.
sample_df_dict = mdl.download_sample(ckan, total_limit=10) # prendre les 10 premières lignes de chaque ressource
# Requêtes particulières pour certains jeux de données:
sample_df_dict["users.csv"] = mdl.resource_builders["users.csv"].download_sample_df(ckan, total_limit=2, search_all=False)
sample_df_dict["traces.csv"] = mdl.resource_builders["traces.csv"].download_sample_df(ckan, total_limit=50, search_all=False)
print("Sample data initialized")
Uploading sample data
Please check the result on CKAN.
sample_mdl.patch_request_full(ckan, reupload=True, sample_df_dict=sample_df_dict)
sample_mdl.patch_request_final(ckan)
Updating sample dataset metadata policy scores
sample_mdl.remote_policy_check(ckan, verbose=True)
Database queries on sample dataset
Querying the dataset as a database table. Check the format of the first few lines with this method.
users_id = sample_mdl.get_or_query_resource_id(ckan, "users.csv")
traces_id = sample_mdl.get_or_query_resource_id(ckan, "traces.csv")
Simple requests
Using API datastore_search.
cursor = ckan.datastore_search_cursor(users_id, total_limit=1)
document = next(cursor)
user_id = document["user_id"]
cursor = ckan.datastore_search_cursor(traces_id, filters={"user_id": int(user_id)}, total_limit=10)
for document in cursor:
print(document)
SQL queries
Example of an SQL query joining two tables using API datastore_search_sql.
query = f"""
SELECT t.*, u.* FROM "{traces_id}" t
JOIN "{users_id}" u ON t.user_id = u.user_id
WHERE t.user_id = {user_id}
LIMIT 10
"""
cursor = ckan.datastore_search_sql_cursor(query)
for document in cursor:
print(document)
Restoring original files for sample dataset
This function downloads all the resources of a dataset to CSV files. The multi-threaded implementation is reserved to download large datasets.
# define the destination directory
sample_package_download_dir = os.path.abspath("sample_package_download")
print("Dataset will be downloaded in: " + sample_package_download_dir)
threads = 3 # > 1: number of threads to download large datasets
sample_mdl.download_request_full(ckan, sample_package_download_dir, full_download=True, threads=threads, skip_existing=False)