{ "cells": [ { "cell_type": "markdown", "id": "3c7855c9-6d7c-407c-91a3-47c5a2f63f8a", "metadata": {}, "source": [ "# Example: Saving dataset metadata to Excel from CKAN API\n", "\n", "The Python package `ckanapi_harvesters.builder` implements functions to download metadata from an existing CKAN dataset (previously known as a package). This notebook illustrates the usages which can be made from this feature among which:\n", "\n", "- Saving existing metadata from CKAN to an Excel file\n", "- Updating the dataset metadata policy scores\n", "- Creating a sample dataset from the current dataset" ] }, { "cell_type": "markdown", "id": "5dd831cd-402d-4522-9c07-1d6de39925e0", "metadata": {}, "source": [ "## Initialisation\n", "The Python package and its extra dependencies can be installed with the following command:\n", "```sh\n", "> pip install ckanapi-harvesters[extras]\n", "```\n", "\n", "The following cell refers to the code present in the Git directory with the option `use_git_package`." ] }, { "metadata": {}, "cell_type": "code", "source": [ "# initial checks\n", "import sys\n", "print(f\"Python version: {sys.version} in {sys.executable}\")" ], "id": "63359e92a1ea3ad7", "outputs": [], "execution_count": null }, { "cell_type": "code", "id": "f94124e60fe21f5e", "metadata": {}, "source": [ "# optionally, use the ckanapi_harvesters package present in the Git directory\n", "use_git_package = False\n", "if use_git_package:\n", " import os\n", " cwd = os.getcwd()\n", " if not os.path.isdir(os.path.join(cwd, \"ckanapi_harvesters\")):\n", " # we assume we are in the examples directory\n", " cwd = os.path.join(cwd, r\"../../src\") # aim for src directory\n", " assert(os.path.isdir(os.path.join(cwd, \"ckanapi_harvesters\")))\n", " os.chdir(cwd)\n", " print(\"CWD changed to: \" + os.path.abspath(\"\"))" ], "outputs": [], "execution_count": null }, { "cell_type": "code", "id": "ae3320c0c64822ec", "metadata": {}, "source": [ "import os\n", "from ckanapi_harvesters import CkanApi, BuilderPackage, CkanCallbackLevel\n", "\n", "from ckanapi_harvesters import __version__ as ckanapi_harvesters_version\n", "from ckanapi_harvesters import package_dir as ckanapi_package_dir\n", "print(f\"ckanapi-harvesters version: {ckanapi_harvesters_version} in {ckanapi_package_dir}\")" ], "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": [ "### Script configuration\n", "\n", "- CKAN URL\n", "- Package name\n", "- API key file\n", "- Proxies" ], "id": "9eadf83fcdfe6b6a" }, { "metadata": {}, "cell_type": "code", "source": [ "ckan_url = \"https://demo.ckan.org/\"\n", "ckan_url = None # use this line if the CKAN URL is specified in the Excel workbook / user input" ], "id": "4546b4a1784eca8e", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": "package_name = \"builder-example-py\" # Example dataset from ckanapi-harvesters", "id": "a13aaebca063e73b", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": [ "apikey_file = os.path.expanduser(os.path.join(\"~\", \".config\", \"__CKAN_API_KEY__.txt\")) # default location: ~/.config/__CKAN_API_KEY__.txt\n", "apikey_file = None # if not specified, the package will look in the different locations and environment variables\n", "print(\"API key file: \" + str(apikey_file))\n", "if apikey_file is not None and not os.path.exists(apikey_file):\n", " print(\"API key file not found !!!\")" ], "id": "ac5f335f49492e34", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": [ "# proxy configuration\n", "proxies = {\"http\": \"http://myproxy\", \"https\": \"http://myproxy\"} # example\n", "proxies = {\"http\": \"\", \"https\": \"\"} # no proxies\n", "proxies = None # use the system configuration, defined in your environment variables\n", "print(\"proxies = \" + str(proxies))" ], "id": "8d49f95efd1eb7cd", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": "#### Connecting to CKAN", "id": "92a7988d186518b5" }, { "cell_type": "code", "id": "8bd9e9fa-0343-40dc-8f1f-588c70e92e5d", "metadata": {}, "source": [ "ckan = CkanApi(ckan_url, proxies=proxies, apikey_file=apikey_file)\n", "ckan.load_apikey()\n", "ckan.input_missing_info(input_args_if_necessary=True, input_owner_org=True, error_not_found=False) # request user input to configure CKAN\n", "ckan.set_limits_per_request(10000) # reduce if server hangs up\n", "ckan.set_requests_delay(0.1) # increase if server errors 502\n", "ckan.set_verbosity(True) # this displays all the steps performed by the script\n", "ckan.test_ckan_login(raise_error=True, verbose=True) # test if you are correctly logged in" ], "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": "## Loading dataset metadata from CKAN API", "id": "8dac92da8f6c3147" }, { "metadata": {}, "cell_type": "code", "source": [ "mdl = BuilderPackage.from_ckan(ckan, package_name)\n", "print(f\"Downloaded metadata from CKAN dataset {mdl.package_name}\")\n", "print(f\"Source dataset URL: {mdl.get_package_page_url(ckan)}\")" ], "id": "737728bc19b76391", "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "4477553b-0d83-4645-846a-1a1f53c176af", "metadata": {}, "source": "### Displaying the dataset model" }, { "cell_type": "code", "id": "59c19cfa-abb9-42b2-801c-3981101fe15b", "metadata": {}, "source": [ "df_dict = mdl.get_all_df()\n", "for tab, df in df_dict.items():\n", " display(f\"Tab {tab}:\")\n", " display(df)" ], "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": "## Updating metadata policy scores\n", "id": "7b65fe8393273816" }, { "metadata": {}, "cell_type": "code", "source": "mdl.remote_policy_check(ckan, verbose=True)", "id": "d2bb2e45e45bb254", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": [ "## Saving the current package metadata to an Excel file\n", "\n", "This file can be used as an archive to restore metadata on CKAN." ], "id": "8108fe10a4f338f7" }, { "metadata": {}, "cell_type": "code", "source": [ "excel_file_out = os.path.abspath(\"downloaded_metadata.xlsx\")\n", "mdl.to_excel(excel_file_out)\n", "print(f\"Metadata extracted from CKAN was saved to {excel_file_out}\")" ], "id": "9b920881bf050d83", "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "efe8e84f-0c98-43ea-a1a0-bb2f8800e393", "metadata": {}, "source": [ "\n", "\n", "# Example: Sample dataset creation\n", "\n", "This section depicts how to create a sample dataset from the current dataset. A sample dataset is an extract of an existing dataset meant to reflect the contents of its source without exposing personal data.\n", "\n", "The string \"Sample\" is appended to the package name.\n", "\n", "In the second cell, samples of the original data are initialized" ] }, { "cell_type": "code", "id": "f41c0a6d-fc98-48a4-a964-f673ef20790e", "metadata": {}, "source": [ "sample_mdl = mdl.setup_sample_package(ckan, sample_url_suffix=\"-sample\", sample_title_suffix=\" - Sample\")\n", "print(f\"The sample dataset will have the following URL: {sample_mdl.get_package_page_url(ckan)}\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "8275c0f4-1fd9-4787-b0e4-cd3441008c2a", "metadata": {}, "source": [ "### Requesting original data to extract samples\n", "\n", "Adapt the code to your use case. By default, the first 10 lines of each DataStore are downloaded. The aim of a sample dataset is to not expose any personal data.\n", "\n", "Resources containing files are fully downloaded. To remove them, use option `empty_files=True` when calling `download_sample`." ] }, { "cell_type": "code", "id": "e2cde8b9-6fb9-40ea-bacc-df14ff366939", "metadata": {}, "source": [ "sample_df_dict = mdl.download_sample(ckan, total_limit=10) # prendre les 10 premières lignes de chaque ressource\n", "\n", "# Requêtes particulières pour certains jeux de données:\n", "sample_df_dict[\"users.csv\"] = mdl.resource_builders[\"users.csv\"].download_sample_df(ckan, total_limit=2, search_all=False)\n", "sample_df_dict[\"traces.csv\"] = mdl.resource_builders[\"traces.csv\"].download_sample_df(ckan, total_limit=50, search_all=False)\n", "\n", "print(\"Sample data initialized\")" ], "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": [ "### Uploading sample data\n", "\n", "Please check the result on CKAN." ], "id": "a8e0f6a54f9b1d9f" }, { "metadata": {}, "cell_type": "code", "source": [ "sample_mdl.patch_request_full(ckan, reupload=True, sample_df_dict=sample_df_dict)\n", "sample_mdl.patch_request_final(ckan)" ], "id": "9a505d0057536980", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": "### Updating sample dataset metadata policy scores", "id": "2f18a37c9be48867" }, { "metadata": {}, "cell_type": "code", "source": "sample_mdl.remote_policy_check(ckan, verbose=True)", "id": "259487300b86fdaf", "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "06919a09-60b8-41c5-9460-a6b9f1a4e281", "metadata": {}, "source": [ "## Database queries on sample dataset\n", "Querying the dataset as a database table. Check the format of the first few lines with this method." ] }, { "metadata": {}, "cell_type": "code", "source": [ "users_id = sample_mdl.get_or_query_resource_id(ckan, \"users.csv\")\n", "traces_id = sample_mdl.get_or_query_resource_id(ckan, \"traces.csv\")" ], "id": "e800761e6a11bfdf", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": [ "#### Simple requests\n", "Using API `datastore_search`." ], "id": "6a60eeab372f4331" }, { "metadata": {}, "cell_type": "code", "source": [ "cursor = ckan.datastore_search_cursor(users_id, total_limit=1)\n", "document = next(cursor)\n", "user_id = document[\"user_id\"]\n", "\n", "cursor = ckan.datastore_search_cursor(traces_id, filters={\"user_id\": int(user_id)}, total_limit=10)\n", "for document in cursor:\n", " print(document)" ], "id": "139f361fab71635b", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": [ "#### SQL queries\n", "Example of an SQL query joining two tables using API `datastore_search_sql`." ], "id": "f9bf3b6058d2d9a" }, { "metadata": {}, "cell_type": "code", "source": [ "query = f\"\"\"\n", "SELECT t.*, u.* FROM \"{traces_id}\" t\n", "JOIN \"{users_id}\" u ON t.user_id = u.user_id\n", "WHERE t.user_id = {user_id}\n", "LIMIT 10\n", "\"\"\"" ], "id": "9e16f61297b24105", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": [ "cursor = ckan.datastore_search_sql_cursor(query)\n", "for document in cursor:\n", " print(document)" ], "id": "821e1cd961ef6880", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": [ "## Restoring original files for sample dataset\n", "This function downloads all the resources of a dataset to CSV files. The multi-threaded implementation is reserved to download large datasets." ], "id": "67f34c3917f53735" }, { "cell_type": "code", "id": "713ca54a01990fc4", "metadata": {}, "source": [ "# define the destination directory\n", "sample_package_download_dir = os.path.abspath(\"sample_package_download\")\n", "print(\"Dataset will be downloaded in: \" + sample_package_download_dir)" ], "outputs": [], "execution_count": null }, { "cell_type": "code", "id": "fd69d2e6ab81fe1", "metadata": {}, "source": [ "threads = 3 # > 1: number of threads to download large datasets\n", "sample_mdl.download_request_full(ckan, sample_package_download_dir, full_download=True, threads=threads, skip_existing=False)" ], "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": "", "id": "20dfac85bb6dc077", "outputs": [], "execution_count": null } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.1" } }, "nbformat": 4, "nbformat_minor": 5 }