Example: Saving dataset metadata to Excel from CKAN API

The Python package ckanapi_harvesters.builder implements functions to download metadata from an existing CKAN dataset (previously known as a package). This notebook illustrates several uses of this feature:

Saving existing metadata from CKAN to an Excel file
Updating the dataset metadata policy scores
Creating a sample dataset from the current dataset

Initialisation

The Python package and its extra dependencies can be installed with the following command:

> pip install ckanapi-harvesters[extras]

The following cell can optionally use the code present in the Git directory instead of the installed package, through the option use_git_package.

# initial checks
import sys
print(f"Python version: {sys.version} in {sys.executable}")

# optionally, use the ckanapi_harvesters package present in the Git directory
use_git_package = False
if use_git_package:
    import os
    cwd = os.getcwd()
    if not os.path.isdir(os.path.join(cwd, "ckanapi_harvesters")):
        # we assume we are in the examples directory
        cwd = os.path.join(cwd, r"../../src")  # aim for src directory
        assert os.path.isdir(os.path.join(cwd, "ckanapi_harvesters"))
        os.chdir(cwd)
        print("CWD changed to: " + os.path.abspath(""))

import os
from ckanapi_harvesters import CkanApi, BuilderPackage, CkanCallbackLevel

from ckanapi_harvesters import __version__ as ckanapi_harvesters_version
from ckanapi_harvesters import package_dir as ckanapi_package_dir
print(f"ckanapi-harvesters version: {ckanapi_harvesters_version} in {ckanapi_package_dir}")

Script configuration

CKAN URL
Package name
API key file
Proxies

ckan_url = "https://demo.ckan.org/"
ckan_url = None  # use this line if the CKAN URL is specified in the Excel workbook / user input

package_name = "builder-example-py"  # Example dataset from ckanapi-harvesters

apikey_file = os.path.expanduser(os.path.join("~", ".config", "__CKAN_API_KEY__.txt"))  # default location: ~/.config/__CKAN_API_KEY__.txt
apikey_file = None  # if not specified, the package will look in the different locations and environment variables
print("API key file: " + str(apikey_file))
if apikey_file is not None and not os.path.exists(apikey_file):
    print("API key file not found !!!")

# proxy configuration
proxies = {"http": "http://myproxy", "https": "http://myproxy"}  # example
proxies = {"http": "", "https": ""}  # no proxies
proxies = None  # use the system configuration, defined in your environment variables
print("proxies = " + str(proxies))
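The three proxy settings above can be compared with a small sketch. It is not part of ckanapi_harvesters: the helper describe_proxies is a hypothetical name, and it only assumes that None means "use the system configuration" and that empty strings disable proxies, as the comments in the cell above state. The system configuration is read with the standard library function urllib.request.getproxies.

```python
import urllib.request


def describe_proxies(proxies):
    """Return the proxy mapping that would effectively apply (hypothetical helper).

    None -> fall back to the system configuration (environment variables
            such as HTTP_PROXY / HTTPS_PROXY, read by urllib)
    dict -> use the mapping as given; empty strings mean "no proxy"
    """
    if proxies is None:
        return dict(urllib.request.getproxies())
    return {scheme: url for scheme, url in proxies.items() if url}


print(describe_proxies(None))  # depends on your environment
print(describe_proxies({"http": "", "https": ""}))
print(describe_proxies({"http": "http://myproxy", "https": "http://myproxy"}))
```

Running this before connecting can help confirm which proxy, if any, your environment defines.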

Connecting to CKAN

ckan = CkanApi(ckan_url, proxies=proxies, apikey_file=apikey_file)
ckan.load_apikey()
ckan.input_missing_info(input_args_if_necessary=True, input_owner_org=True, error_not_found=False)  # request user input to configure CKAN
ckan.set_limits_per_request(10000)  # reduce if server hangs up
ckan.set_requests_delay(0.1)  # increase if server errors 502
ckan.set_verbosity(True)  # this displays all the steps performed by the script
ckan.test_ckan_login(raise_error=True, verbose=True)  # test if you are correctly logged in

Loading dataset metadata from CKAN API

mdl = BuilderPackage.from_ckan(ckan, package_name)
print(f"Downloaded metadata from CKAN dataset {mdl.package_name}")
print(f"Source dataset URL: {mdl.get_package_page_url(ckan)}")

Displaying the dataset model

df_dict = mdl.get_all_df()
for tab, df in df_dict.items():
    display(f"Tab {tab}:")
    display(df)

Updating metadata policy scores

mdl.remote_policy_check(ckan, verbose=True)

Saving the current package metadata to an Excel file

This file can be used as an archive to restore metadata on CKAN.

excel_file_out = os.path.abspath("downloaded_metadata.xlsx")
mdl.to_excel(excel_file_out)
print(f"Metadata extracted from CKAN was saved to {excel_file_out}")

Example: Sample dataset creation

This section shows how to create a sample dataset from the current dataset. A sample dataset is an extract of an existing dataset meant to reflect the contents of its source without exposing personal data.

In the first cell, the suffix "-sample" is appended to the package name and " - Sample" to the package title.

In the second cell, samples of the original data are initialized.

sample_mdl = mdl.setup_sample_package(ckan, sample_url_suffix="-sample", sample_title_suffix=" - Sample")
print(f"The sample dataset will have the following URL: {sample_mdl.get_package_page_url(ckan)}")

Requesting original data to extract samples

Adapt the code to your use case. In this example, the first 10 rows of each DataStore resource are downloaded. A sample dataset must not expose any personal data.

Resources containing files are fully downloaded. To remove them, use option empty_files=True when calling download_sample.

sample_df_dict = mdl.download_sample(ckan, total_limit=10)  # take the first 10 rows of each resource

# Specific requests for some resources:
sample_df_dict["users.csv"] = mdl.resource_builders["users.csv"].download_sample_df(ckan, total_limit=2, search_all=False)
sample_df_dict["traces.csv"] = mdl.resource_builders["traces.csv"].download_sample_df(ckan, total_limit=50, search_all=False)

print("Sample data initialized")
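Independently of CKAN, the idea of a sample (a few leading rows, with personal columns hidden) can be sketched with the standard library only. The helpers head_rows and mask_columns and the column name "name" are hypothetical and only serve the illustration; they are not functions of ckanapi_harvesters.

```python
import csv
import io
from itertools import islice


def head_rows(csv_text, limit=10):
    """Return at most `limit` data rows of a CSV document as dictionaries."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(islice(reader, limit))


def mask_columns(rows, columns, placeholder="***"):
    """Replace the values of the given columns (e.g. names) with a placeholder."""
    return [
        {key: (placeholder if key in columns else value) for key, value in row.items()}
        for row in rows
    ]


# Made-up data standing in for a resource such as users.csv
csv_text = "user_id,name\n1,Alice\n2,Bob\n3,Carol\n"
sample = mask_columns(head_rows(csv_text, limit=2), columns={"name"})
print(sample)
```

Whether masking is needed depends on your data; limiting the number of rows alone does not remove personal information.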

Uploading sample data

Please check the result on CKAN.

sample_mdl.patch_request_full(ckan, reupload=True, sample_df_dict=sample_df_dict)
sample_mdl.patch_request_final(ckan)

Updating sample dataset metadata policy scores

sample_mdl.remote_policy_check(ckan, verbose=True)

Database queries on sample dataset

The DataStore resources of the dataset can be queried like database tables. This is a convenient way to check the format of the first few rows.

users_id = sample_mdl.get_or_query_resource_id(ckan, "users.csv")
traces_id = sample_mdl.get_or_query_resource_id(ckan, "traces.csv")

Simple requests

These requests use the CKAN API action datastore_search.

cursor = ckan.datastore_search_cursor(users_id, total_limit=1)
document = next(cursor)
user_id = document["user_id"]

cursor = ckan.datastore_search_cursor(traces_id, filters={"user_id": int(user_id)}, total_limit=10)
for document in cursor:
    print(document)
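For reference, the request behind such a call can be sketched as a plain URL of the CKAN action API. This is a minimal illustration that only builds the URL and sends nothing; build_datastore_search_url is a hypothetical helper, the resource id is a placeholder, and the way ckanapi_harvesters actually issues and paginates its requests may differ.

```python
import json
from urllib.parse import urlencode


def build_datastore_search_url(ckan_url, resource_id, filters=None, limit=10, offset=0):
    """Build a GET URL for the CKAN action datastore_search (hypothetical helper)."""
    params = {"resource_id": resource_id, "limit": limit, "offset": offset}
    if filters:
        # filters is passed as a JSON object of exact-match conditions
        params["filters"] = json.dumps(filters)
    base = ckan_url.rstrip("/") + "/api/3/action/datastore_search"
    return base + "?" + urlencode(params)


url = build_datastore_search_url(
    "https://demo.ckan.org/",
    "00000000-0000-0000-0000-000000000000",  # placeholder resource id
    filters={"user_id": 1},
    limit=10,
)
print(url)
```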

SQL queries

Example of an SQL query joining two tables using API datastore_search_sql.

query = f"""
SELECT t.*, u.* FROM "{traces_id}" t
JOIN "{users_id}" u ON t.user_id = u.user_id
WHERE t.user_id = {int(user_id)}
LIMIT 10
"""

cursor = ckan.datastore_search_sql_cursor(query)
for document in cursor:
    print(document)
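The shape of this join can be tried offline with an in-memory SQLite database. This is only a stand-in: the CKAN DataStore runs on PostgreSQL, the table names below are placeholders for resource ids, and the columns name and ts are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can be converted to dictionaries

# Placeholder identifiers standing in for CKAN resource ids
users_id, traces_id = "users-resource-id", "traces-resource-id"
conn.execute(f'CREATE TABLE "{users_id}" (user_id INTEGER, name TEXT)')
conn.execute(f'CREATE TABLE "{traces_id}" (user_id INTEGER, ts TEXT)')
conn.executemany(f'INSERT INTO "{users_id}" VALUES (?, ?)', [(1, "u1"), (2, "u2")])
conn.executemany(
    f'INSERT INTO "{traces_id}" VALUES (?, ?)', [(1, "t0"), (1, "t1"), (2, "t2")]
)

user_id = 1
query = f"""
SELECT t.*, u.name FROM "{traces_id}" t
JOIN "{users_id}" u ON t.user_id = u.user_id
WHERE t.user_id = ?
ORDER BY t.ts
LIMIT 10
"""
rows = [dict(row) for row in conn.execute(query, (user_id,))]
for row in rows:
    print(row)
```

The sketch binds user_id as a query parameter, which SQLite supports. The notebook cell above instead sends a complete SQL string, so the value should be validated before it is interpolated.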

Restoring original files for sample dataset

This function downloads all the resources of a dataset to CSV files. The multi-threaded implementation is intended for downloading large datasets.

# define the destination directory
sample_package_download_dir = os.path.abspath("sample_package_download")
print("Dataset will be downloaded in: " + sample_package_download_dir)

threads = 3  # use a value > 1 to download large datasets with several threads
sample_mdl.download_request_full(ckan, sample_package_download_dir, full_download=True, threads=threads, skip_existing=False)
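As an illustration of the general pattern only, one possible way to save several resources in parallel uses a thread pool from the standard library. This sketch does not reproduce the implementation of download_request_full: fetch_rows is a hypothetical stand-in that returns made-up rows instead of calling CKAN, and save_resource is likewise hypothetical.

```python
import csv
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def fetch_rows(resource_name):
    """Hypothetical stand-in for a per-resource download: header plus two rows."""
    return [["resource", "row"], [resource_name, 0], [resource_name, 1]]


def save_resource(resource_name, out_dir, skip_existing=False):
    """Write one resource to a CSV file and return its path."""
    path = os.path.join(out_dir, resource_name)
    if skip_existing and os.path.exists(path):
        return path  # keep the file already on disk
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(fetch_rows(resource_name))
    return path


out_dir = tempfile.mkdtemp(prefix="sample_package_download_")
resources = ["users.csv", "traces.csv"]

# Each resource is handled by a worker thread; results keep the input order
with ThreadPoolExecutor(max_workers=3) as pool:
    paths = list(pool.map(lambda name: save_resource(name, out_dir), resources))
print(paths)
```

Threads help here because the work is dominated by network and disk waits rather than computation.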