Resource builder specific options & database harvesters
This page documents the specific options
that can be used in the Options column of the resources sheet of the Excel builder workbook.
These options use a UNIX-style CLI format, e.g.
--option value.
Resource specific options
DataStore resource specific options
options:
--data-cleaner DATA_CLEANER
Data cleaner to call before uploading data
--one-frame-per-primary-key
Enabling this option makes the upload process expect
one DataFrame per primary key combination (except the
last field of the primary key, which could be an index
in the file).
This option should be combined with the file format
option --no-chunks to ensure each file is processed at once
--group-by GROUP_BY Fields of the primary key defining the request to
reconstruct a file, in --one-frame-per-primary-key
mode, separated by a comma (no spaces). By default,
the first columns of the primary key, except the last
one, are used. At least one field of the primary key
must be left unused here.
--include-source-file
Add a column for the file name of the original data
(named py_source_file)
--no-upload-index Disable the generation of an upload index column in
case no primary key was given (named py_upload_index)
Examples:
- Selecting a Data Cleaner: --data-cleaner GeoJSON
- Processing one DataFrame per primary key combination
(first columns of the primary key, except the last one): --one-frame-per-primary-key --no-chunks
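To illustrate what --one-frame-per-primary-key and --group-by imply, the sketch below (table and field names are hypothetical) splits a table into one DataFrame per combination of the leading primary key fields:

```python
import pandas as pd

# Hypothetical table with primary key (station, date, index);
# --group-by station,date would leave "index" unused, as required.
df = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "date": ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02"],
    "index": [0, 1, 0, 0],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# One DataFrame per (station, date) combination: this mirrors the
# upload process expecting one frame per primary key combination.
frames = {key: group for key, group in df.groupby(["station", "date"])}
```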
File harvester options
The base file reader functions can be customized using the following arguments:
File format reader arguments
options:
--chunk-size CHUNK_SIZE
Chunk size for reading files by chunks (number of
records). The number of lines sent per request is the
minimum of the chunk size and the CKAN parameter
ckan.params.default_limit_write. Enabling this option
activates reading by chunks (if supported by the file
format)
--no-chunks Option to disable reading files by chunks
--allow-chunks Option to enable reading files by chunks (useful for
file formats that do not enable this feature by default)
--read-kwargs [READ_KWARGS ...]
Keyword arguments for the read function in key=value
format
--write-kwargs [WRITE_KWARGS ...]
Keyword arguments for the write function in key=value
format
Examples:
- Changing chunk size: --chunk-size 10000
- Disabling reading files by chunks: --no-chunks
- Additional arguments for pandas.read_csv
for a CSV file: --read-kwargs compression=gzip header=10
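As a sketch of how the key=value pairs reach the underlying reader (the package's exact parsing may differ), each pair is split on the first = and forwarded as a keyword argument:

```python
import io
import pandas as pd

def parse_cli_kwargs(pairs):
    # Split each key=value string on the first '='; values remain
    # strings, so arguments expecting other types may need conversion.
    return dict(pair.split("=", 1) for pair in pairs)

# e.g. --read-kwargs sep=;
kwargs = parse_cli_kwargs(["sep=;"])
df = pd.read_csv(io.StringIO("a;b\n1;2\n"), **kwargs)
```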
Pre-defined file format I/O functions
The following section lists the pre-defined file format I/O functions. The function is selected by the Format attribute. The Options attribute described above can be used to customize the functions' behavior according to the documentation of the underlying functions.
CSV: delimited text file format
This method reads delimited text (tabular) file formats. The underlying function is
pandas.read_csv.
The default read parameters are dtype=str, keep_default_na=False, sep=None, engine='python'.
For the write parameters, index=False is imposed.
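With those defaults, reading and writing a CSV is equivalent to the following pandas calls (a sketch; sep=None with engine='python' auto-detects the delimiter):

```python
import io
import pandas as pd

# Documented CSV read defaults: values kept as strings, empty cells
# not converted to NaN, delimiter sniffed by the python engine.
data = io.StringIO("a,b\n1,2\n3,4\n")
df = pd.read_csv(data, dtype=str, keep_default_na=False, sep=None, engine="python")

# index=False is imposed on write
buf = io.StringIO()
df.to_csv(buf, index=False)
```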
SHP: shape file format
This method can read geographic file formats managed by geopandas. The underlying function is
geopandas.read_file.
The default arguments for read/write are encoding='utf-8'.
XLS: Excel/ODS file format
This method reads the following file formats: xls, xlsx, xlsm, xlsb, odf, ods and odt.
The underlying function is
pandas.read_excel.
It accepts a special CLI argument to specify the sheet name: --sheet-name.
This argument can also be set using the --read-kwargs argument such as in --read-kwargs sheet_name=YourSheetName.
The default arguments of the read/write functions are those of the pandas documentation.
JSON: JSON file format
This method enables reading JSON files. It relies on
pandas.read_json.
The I/O functions are configured by default to write one line per record.
This mode enables reading the file by chunks and appending lines to an existing file.
The corresponding default read/write arguments are orient="records", lines=True.
These can be changed with the CLI arguments --read-kwargs and --write-kwargs.
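A sketch of these JSON-lines defaults using pandas directly:

```python
import io
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# Default write mode: one JSON object per line, which makes it easy
# to append records to an existing file.
buf = io.StringIO()
df.to_json(buf, orient="records", lines=True)

# lines=True also permits chunked reading (pass chunksize= to read_json)
df_back = pd.read_json(io.StringIO(buf.getvalue()), orient="records", lines=True)
```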
User-defined file format I/O functions
In addition to these parameters, users can specify their own read/write functions with the Read function / Write function columns in the resources sheet. If one of the two functions is defined, its counterpart must also be defined if it is used (there is no fallback to the default file format functions).
Extra argument to enable append mode
An extra CLI argument enables writing a file in append mode. In this case, the DataFrames can be written to disk as they are received.
The option is called --allow-append.
Basic I/O prototypes
The function prototypes should be as follows.
The positional arguments (before the asterisk *) are mandatory, as is the **kwargs argument, which keeps the functions compatible with future versions of the Python package.
The parameters defined above also apply to user-defined functions. In the example below, the read function returns a DataFrame.
from typing import Union, Dict, List, Generator
import io
import pandas as pd
from ckanapi_harvesters.auxiliary.ckan_model import CkanField
from ckanapi_harvesters.harvesters.file_formats.user_format import UserFileFormat
...
def read_function_example_df(file_path_or_stream: Union[str, io.IOBase], *,
                             fields: Union[Dict[str, CkanField], None],
                             allow_chunks: bool = True, params: UserFileFormat = None,
                             **kwargs) -> Union[pd.DataFrame, List[dict]]:
    return pd.read_csv(file_path_or_stream)

def write_function_example(df: Union[pd.DataFrame, List[dict]],
                           file_path_or_stream: Union[str, io.IOBase], *,
                           fields: Union[Dict[str, CkanField], None], append: bool = False,
                           params: UserFileFormat = None, **kwargs) -> None:
    # Append mode corresponds to the --allow-append option
    mode = 'a' if append else 'w'
    df.to_csv(file_path_or_stream, mode=mode, index=False)
Reading a file by chunks
Returning a DataFrame generator requires implementing a context manager, as in the example below.
from contextlib import contextmanager
...
@contextmanager
def read_function_example_by_chunks(file_path_or_stream: Union[str, io.IOBase], *,
                                    fields: Union[Dict[str, CkanField], None],
                                    allow_chunks: bool = True, params: UserFileFormat = None,
                                    **kwargs) -> Generator:
    # This example assumes a file path; the handle stays open while
    # the generator is consumed and is closed on exit.
    file_handle = open(file_path_or_stream, 'r')
    try:
        yield read_function_example_by_chunks_generator(file_handle)
    finally:
        file_handle.close()
def read_function_example_by_chunks_generator(file_handle) -> Generator[Union[pd.DataFrame, List[dict]], None, None]:
# function called by read_function_example_by_chunks
for df_chunk in pd.read_csv(file_handle, chunksize=100):
yield df_chunk
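Consuming such a chunked reader follows the usual context-manager pattern. A self-contained sketch (the in-memory stream and chunk size are illustrative):

```python
import io
from contextlib import contextmanager
import pandas as pd

@contextmanager
def read_by_chunks(stream, chunk_size=2):
    # Yield a generator of DataFrames; the stream remains usable
    # until the with-block exits.
    yield pd.read_csv(stream, chunksize=chunk_size)

data = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
chunk_sizes = []
with read_by_chunks(data) as reader:
    for df_chunk in reader:
        chunk_sizes.append(len(df_chunk))
```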
Database harvester options
These options are used to define the connection method (login, URL) and the database/dataset/table to harvest.
PostgreSQL
Examples
Example of arguments enabling a connection to a database without a client SSL certificate:
--harvester Postgre --login-file postgre_login.txt
--ca server_ca.pem --host my-postgre-server.com --port 5432
--database My_Database --schema My_Schema
--limit 1000
The argument --ca can be used to pass a user-specific signing certificate to authenticate the server.
You can disable certificate verification by passing --ca False.
This is not recommended, especially if the database server is reachable from the public internet,
because, under certain conditions, any server could impersonate the target server.
A warning is issued in this case.
In the Excel resources worksheet, the table name is specified in the File/URL column. A PostgreSQL schema is comparable to a CKAN dataset (a set of tables).
If you need to set up a secure connection to the database, you can set up an SSH tunnel outside of the Python package; this feature is not handled by the package itself. Warning: please verify that the SSL options are correctly transferred when the connection is created in the harvester code. These features were not tested in the package, and an error is raised inviting you to contribute to this package.
Documentation
Harvester parameters
options:
--harvester HARVESTER
Type of harvester to use
--proxy PROXY Proxy for HTTP and HTTPS
--http-proxy HTTP_PROXY
HTTP proxy
--https-proxy HTTPS_PROXY
HTTPS proxy
--no-proxy NO_PROXY Proxy exceptions
--proxy-auth-file PROXY_AUTH_FILE
Path to a proxy authentication file with 3 lines
(authentication method, username, password)
--ca CA Server CA certificate location (.pem file)
--ssl-certfile SSL_CERTFILE
Client certificate location (.pem file). This file
sometimes includes the client private key.
--ssl-keyfile SSL_KEYFILE
Client private key location (.pem/.key file)
--ssl Enable SSL encryption
--no-ssl Disable SSL encryption (when enabled by default, not
recommended)
--allow-invalid-hostnames
Allow invalid host names (not recommended)
--timeout TIMEOUT Server timeout (seconds)
--host HOST Host for queries
--port PORT Port for queries
--auth-url-suffix AUTH_URL_SUFFIX
URL suffix used to authenticate user
--auth-url AUTH_URL URL to authenticate user
--url URL Base URL for queries
--apikey APIKEY API key
--apikey-file APIKEY_FILE
Path to a file containing the API key (first line)
--login-file LOGIN_FILE
Path to a text file containing login credentials for
authentication (user, password)
--login-user LOGIN_USER
Login user name (prefer using a file)
--login-password LOGIN_PASSWORD
Login user password (prefer using a file)
-v, --verbose Option to set verbosity
--database DATABASE Database name
--ckan-postgis Option to use CKAN with PostGIS geometric types
--ckan-epsg CKAN_EPSG
Default EPSG for CKAN
-l LIMIT, --limit LIMIT
Number of rows per request
--once Option to perform only one request with the default
limit. This will limit the size of the data.
--dataset DATASET Dataset name
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
Output directory of download, relative to the download
directory (normally provided by File/URL attribute)
--no-download NO_DOWNLOAD
Option to disable download
--resource-url RESOURCE_URL
URL of resource
--table TABLE Table name
--query QUERY Query to restrict the lines of the table
--schema SCHEMA PostgreSQL schema name
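The credential files passed to --login-file are plain text. A hypothetical reader for the two-line format (the package's actual parsing may differ):

```python
import os
import tempfile

def read_login_file(path):
    # Hypothetical parser: first line is the user name,
    # second line is the password.
    with open(path, encoding="utf-8") as fh:
        lines = [line.strip() for line in fh]
    return lines[0], lines[1]

# Demonstrate with a temporary file standing in for postgre_login.txt
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write("alice\ns3cret\n")
user, password = read_login_file(fh.name)
os.unlink(fh.name)
```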
MongoDB
Examples
Example of arguments enabling a connection to a database without a client SSL certificate:
--harvester MongoDB --login-file mongodb_login.txt
--ca server_ca.pem --host mongodb://my-mongodb-server.com:27017/admin
--url mongodb://my-mongodb-server.com:27017
--dataset My_Schema
--limit 1000
The argument --ca can be used to pass a user-specific signing certificate to authenticate the server.
You can disable certificate verification by passing --ca False.
This is not recommended, especially if the database server is reachable from the public internet,
because, under certain conditions, any server could impersonate the target server.
A warning is issued in this case.
In the Excel resources worksheet, the table name is specified in the File/URL column of the resource sheet. In MongoDB, tables are called collections.
If you need to set up a secure connection to the database, you can set up an SSH tunnel outside of the Python package; this feature is not handled by the package itself. Warning: please verify that the SSL options are correctly transferred when the connection is created in the harvester code. These features were not tested in the package, and an error is raised inviting you to contribute to this package.
Documentation
Harvester parameters
options:
--harvester HARVESTER
Type of harvester to use
--proxy PROXY Proxy for HTTP and HTTPS
--http-proxy HTTP_PROXY
HTTP proxy
--https-proxy HTTPS_PROXY
HTTPS proxy
--no-proxy NO_PROXY Proxy exceptions
--proxy-auth-file PROXY_AUTH_FILE
Path to a proxy authentication file with 3 lines
(authentication method, username, password)
--ca CA Server CA certificate location (.pem file)
--ssl-certfile SSL_CERTFILE
Client certificate location (.pem file). This file
sometimes includes the client private key.
--ssl-keyfile SSL_KEYFILE
Client private key location (.pem/.key file)
--ssl Enable SSL encryption
--no-ssl Disable SSL encryption (when enabled by default, not
recommended)
--allow-invalid-hostnames
Allow invalid host names (not recommended)
--timeout TIMEOUT Server timeout (seconds)
--host HOST Host for queries
--port PORT Port for queries
--auth-url-suffix AUTH_URL_SUFFIX
URL suffix used to authenticate user
--auth-url AUTH_URL URL to authenticate user
--url URL Base URL for queries
--apikey APIKEY API key
--apikey-file APIKEY_FILE
Path to a file containing the API key (first line)
--login-file LOGIN_FILE
Path to a text file containing login credentials for
authentication (user, password)
--login-user LOGIN_USER
Login user name (prefer using a file)
--login-password LOGIN_PASSWORD
Login user password (prefer using a file)
-v, --verbose Option to set verbosity
--database DATABASE Database name
--ckan-postgis Option to use CKAN with PostGIS geometric types
--ckan-epsg CKAN_EPSG
Default EPSG for CKAN
-l LIMIT, --limit LIMIT
Number of rows per request
--once Option to perform only one request with the default
limit. This will limit the size of the data.
--dataset DATASET Dataset name
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
Output directory of download, relative to the download
directory (normally provided by File/URL attribute)
--no-download NO_DOWNLOAD
Option to disable download
--resource-url RESOURCE_URL
URL of resource
--table TABLE Table name
--query QUERY Query to restrict the lines of the table
--collection COLLECTION
MongoDB collection name
--dbref-expand Option to expand DBRefs
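To illustrate what --dbref-expand does, the pure-Python sketch below replaces DBRef-style values with the referenced documents (the collection contents here are illustrative; the package resolves references against the live database):

```python
def expand_dbrefs(value, collections):
    # Recursively replace {"$ref": <collection>, "$id": <key>} dicts
    # with the referenced document from the `collections` mapping.
    if isinstance(value, dict):
        if "$ref" in value and "$id" in value:
            return collections[value["$ref"]][value["$id"]]
        return {k: expand_dbrefs(v, collections) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_dbrefs(v, collections) for v in value]
    return value

# Illustrative data: a record referencing a document in "stations"
collections = {"stations": {1: {"name": "A", "lat": 48.1}}}
record = {"value": 3.5, "station": {"$ref": "stations", "$id": 1}}
expanded = expand_dbrefs(record, collections)
```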