{ "cells": [ { "cell_type": "markdown", "id": "3c7855c9-6d7c-407c-91a3-47c5a2f63f8a", "metadata": {}, "source": [ "# Example: Building a dataset from an Excel file\n", "The Python package `ckanapi_harvesters.builder` implements functions to manage a CKAN dataset (previously known as a package) with the help of an Excel workbook. This Excel file specifies the dataset metadata and information to upload/download the resources of the dataset. An illustration of these tasks is given in this notebook." ] }, { "cell_type": "markdown", "id": "5dd831cd-402d-4522-9c07-1d6de39925e0", "metadata": {}, "source": [ "## Initialisation\n", "The Python package and its extra dependencies can be installed with the following command:\n", "```sh\n", "> pip install ckanapi-harvesters[extras]\n", "```\n", "\n", "The following cell refers to the code present in the Git directory with the option `use_git_package`." ] }, { "metadata": {}, "cell_type": "code", "source": [ "# initial checks\n", "import sys\n", "print(f\"Python version: {sys.version} in {sys.executable}\")" ], "id": "63359e92a1ea3ad7", "outputs": [], "execution_count": null }, { "cell_type": "code", "id": "f94124e60fe21f5e", "metadata": {}, "source": [ "# optionally, use the ckanapi_harvesters package present in the Git directory\n", "use_git_package = False\n", "if use_git_package:\n", " import os\n", " cwd = os.getcwd()\n", " if not os.path.isdir(os.path.join(cwd, \"ckanapi_harvesters\")):\n", " # we assume we are in the examples directory\n", " cwd = os.path.join(cwd, r\"../../src\") # aim for src directory\n", " assert(os.path.isdir(os.path.join(cwd, \"ckanapi_harvesters\")))\n", " os.chdir(cwd)\n", " print(\"CWD changed to: \" + os.path.abspath(\"\"))" ], "outputs": [], "execution_count": null }, { "cell_type": "code", "id": "ae3320c0c64822ec", "metadata": {}, "source": [ "import os\n", "from ckanapi_harvesters import CkanApi, BuilderPackage, CkanCallbackLevel\n", "from ckanapi_harvesters.builder.example import example_package_xls\n", "\n", "from ckanapi_harvesters import __version__ as ckanapi_harvesters_version\n", "from ckanapi_harvesters import package_dir as ckanapi_package_dir\n", "print(f\"ckanapi-harvesters version: {ckanapi_harvesters_version} in {ckanapi_package_dir}\")" ], "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": [ "### Script configuration\n", "\n", "- CKAN URL\n", "- Owner organisation\n", "- Excel file\n", "- API key file\n", "- Proxies" ], "id": "9eadf83fcdfe6b6a" }, { "metadata": {}, "cell_type": "code", "source": [ "ckan_url = \"https://demo.ckan.org/\"\n", "ckan_url = None # use this line if the CKAN URL is specified in the Excel workbook / user input" ], "id": "4546b4a1784eca8e", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": [ "# Owner organization for the new dataset\n", "owner_org = \"test\"\n", "owner_org = None # use this line if the owner organisation is specified in the Excel workbook / user input" ], "id": "18c2ab54b050ace3", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": [ "xls_file = example_package_xls # Excel file\n", "print(\"Excel file: \" + xls_file)\n", "if not os.path.exists(xls_file):\n", " print(\"Excel file not found !!!\")" ], "id": "a13aaebca063e73b", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": [ "apikey_file = os.path.expanduser(os.path.join(\"~\", \".config\", \"__CKAN_API_KEY__.txt\")) # default location: ~/.config/__CKAN_API_KEY__.txt\n", "apikey_file = None # if not specified, the package will look in the different locations and environment variables\n", "print(\"API key file: \" + str(apikey_file))\n", "if apikey_file is not None and not os.path.exists(apikey_file):\n", " print(\"API key file not found !!!\")" ], "id": "ac5f335f49492e34", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": [ "# proxy configuration\n", "proxies = {\"http\": \"http://myproxy\", \"https\": \"http://myproxy\"} # example\n", "proxies = {\"http\": \"\", \"https\": \"\"} # no proxies\n", "proxies = None # use the system configuration, defined in your environment variables\n", "print(\"proxies = \" + str(proxies))" ], "id": "8d49f95efd1eb7cd", "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "b9a847c0-3ab2-49fe-af4e-ee7437848bf3", "metadata": {}, "source": [ "### Loading dataset metadata from Excel file\n", "The Excel workbook given in the example refers to an external code module for DataFrame upload/download functions. To activate this feature, a call to `unlock_external_code_execution` must be done.\n", "\n", "__Warning:__ Only enable this feature for code which comes from trusted sources as this executes the module referred in the Excel workbook (`Auxiliary functions file` field)." ] }, { "cell_type": "code", "id": "500e2647-9f72-45d0-9090-09c2db0f91ce", "metadata": {}, "source": [ "BuilderPackage.unlock_external_code_execution()\n", "mdl = BuilderPackage.from_excel(xls_file)\n", "print(f\"Excel file loaded for dataset '{mdl.package_attributes.title}'\")\n", "print(f\"The URL will be: {mdl.get_package_page_url(ckan)}\")" ], "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": "#### Connecting to CKAN", "id": "92a7988d186518b5" }, { "cell_type": "code", "id": "8bd9e9fa-0343-40dc-8f1f-588c70e92e5d", "metadata": {}, "source": [ "ckan = CkanApi(ckan_url, proxies=proxies, apikey_file=apikey_file, owner_org=owner_org)\n", "# you can specify the CKAN URL, owner organization, API key here or in the Excel workbook\n", "ckan = mdl.init_ckan(ckan)\n", "ckan.input_missing_info(input_args_if_necessary=True, input_owner_org=True, error_not_found=False) # request user input to configure CKAN\n", "ckan.set_limits_per_request(10000) # reduce if server hangs up\n", "ckan.set_requests_delay(0.1) # increase if server errors 502\n", "ckan.set_verbosity(True) # this displays all the steps performed by the script\n", "ckan.test_ckan_login(raise_error=True, verbose=True) # test if you are correctly logged in" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "4477553b-0d83-4645-846a-1a1f53c176af", "metadata": {}, "source": "### Displaying the dataset model" }, { "cell_type": "code", "id": "59c19cfa-abb9-42b2-801c-3981101fe15b", "metadata": {}, "source": [ "df_dict = mdl.get_all_df()\n", "for tab, df in df_dict.items():\n", " display(f\"Tab {tab}:\")\n", " display(df)" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "efe8e84f-0c98-43ea-a1a0-bb2f8800e393", "metadata": {}, "source": [ "## Initiating the dataset\n", "This call creates the dataset if no other dataset with the same name exists. If the dataset already exists, it is updated with information from the Excel workbook. The resources are initialized. Optionally, the resources data can be fully reuploaded (even if the resources already exist) to ensure the server side of the dataset represents the information specified in the Excel workbook. However, if there are large datasets, this resets them." ] }, { "cell_type": "code", "id": "f41c0a6d-fc98-48a4-a964-f673ef20790e", "metadata": {}, "source": [ "reupload = True # True: reuploads all documents and resets large datasets to the first document (not recommended if there is a large dataset)\n", "mdl.patch_request_full(ckan, reupload=reupload)" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "8275c0f4-1fd9-4787-b0e4-cd3441008c2a", "metadata": {}, "source": [ "### Uploading large datasets\n", "Large datasets are defined locally by a directory containing multiple CSV files. The first file found (in alphabetic order) is used to initialize the dataset. This function automates the concatenation of other files using the API `datastore_upsert` in a multi-threaded implementation. It can be executed multiple times without affecting the final result, as long as all the data has been transferred.\n", "The number of threads should be adjusted if there are too many request errors." ] }, { "cell_type": "code", "id": "e2cde8b9-6fb9-40ea-bacc-df14ff366939", "metadata": {}, "source": [ "threads = 3 # > 1: multi-threading mode - reduce if HTTP 502 errors\n", "mdl.upload_large_datasets(ckan, threads=threads, only_missing=True)" ], "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": "##### Final requests", "id": "347268c6f5ccf81a" }, { "metadata": {}, "cell_type": "code", "source": "mdl.patch_request_final(ckan) # sets dataset state, reorders resources, updates dataset policy fields - also called at the end of upload_large_datasets", "id": "3a505cb4f8c05821", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": "### Updating metadata policy scores", "id": "2f18a37c9be48867" }, { "metadata": {}, "cell_type": "code", "source": "mdl.remote_policy_check(ckan, verbose=True)", "id": "259487300b86fdaf", "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "06919a09-60b8-41c5-9460-a6b9f1a4e281", "metadata": {}, "source": [ "## Database queries\n", "Querying the dataset as a database table. Check the format of the first few lines with this method." ] }, { "metadata": {}, "cell_type": "code", "source": [ "users_id = mdl.get_or_query_resource_id(ckan, \"users.csv\")\n", "traces_id = mdl.get_or_query_resource_id(ckan, \"traces.csv\")" ], "id": "e800761e6a11bfdf", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": [ "#### Simple requests\n", "Using API `datastore_search`." ], "id": "6a60eeab372f4331" }, { "metadata": {}, "cell_type": "code", "source": [ "cursor = ckan.datastore_search_cursor(users_id, total_limit=1)\n", "document = next(cursor)\n", "user_id = document[\"user_id\"]\n", "\n", "cursor = ckan.datastore_search_cursor(traces_id, filters={\"user_id\": int(user_id)}, total_limit=10)\n", "for document in cursor:\n", " print(document)" ], "id": "139f361fab71635b", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": [ "#### SQL queries\n", "Example of an SQL query joining two tables using API `datastore_search_sql`." ], "id": "f9bf3b6058d2d9a" }, { "metadata": {}, "cell_type": "code", "source": [ "query = f\"\"\"\n", "SELECT t.*, u.* FROM \"{traces_id}\" t\n", "JOIN \"{users_id}\" u ON t.user_id = u.user_id\n", "WHERE t.user_id = {user_id}\n", "LIMIT 10\n", "\"\"\"" ], "id": "9e16f61297b24105", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": [ "cursor = ckan.datastore_search_sql_cursor(query)\n", "for document in cursor:\n", " print(document)" ], "id": "821e1cd961ef6880", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": [ "## Restoring original files\n", "This function downloads all the resources of a dataset to CSV files. The multi-threaded implementation is reserved to download large datasets." ], "id": "67f34c3917f53735" }, { "cell_type": "code", "id": "713ca54a01990fc4", "metadata": {}, "source": [ "# define the destination directory\n", "example_package_download_dir = os.path.abspath(\"package_download\")\n", "print(\"Dataset will be downloaded in: \" + example_package_download_dir)" ], "outputs": [], "execution_count": null }, { "cell_type": "code", "id": "fd69d2e6ab81fe1", "metadata": {}, "source": [ "threads = 3 # > 1: number of threads to download large datasets\n", "mdl.download_request_full(ckan, example_package_download_dir, full_download=True, threads=threads, skip_existing=False)" ], "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": "", "id": "20dfac85bb6dc077", "outputs": [], "execution_count": null } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.1" } }, "nbformat": 4, "nbformat_minor": 5 }