Resource builder specific options & database harvesters

This page documents the specific options that can be used in the Options column of the resources sheet of the Excel builder workbook. These options follow a UNIX CLI format, e.g. --option value.

Resource specific options

DataStore resource specific options

options:
  --data-cleaner DATA_CLEANER
                        Data cleaner to call before uploading data
  --one-frame-per-primary-key
                        Enabling this option makes the upload process expect
                        one DataFrame per primary key combination (except the
                        last field of the primary key, which could be an index
                        in the file). 
                        This option should be combined with the file format 
                        option --no-chunks to ensure each file is processed at
                        once
  --group-by GROUP_BY   Fields of the primary key defining the request to
                        reconstruct a file, in --one-frame-per-primary-key
                        mode, separated by commas (no spaces). By default,
                        the leading columns of the primary key (all but the
                        last one) are used. At least one field of the primary
                        key must remain unused here.
  --include-source-file
                        Add a column for the file name of the original data
                        (named py_source_file)
  --no-upload-index     Disable the generation of an upload index column in
                        case no primary key was given (named py_upload_index)

Examples: 
- Selecting a Data Cleaner: --data-cleaner GeoJSON 
- Processing one file per primary key combination 
(leading columns of the primary key, all but the last one): --one-frame-per-primary-key --no-chunks

File harvester options

The base file reader functions can be customized using the following arguments:

File format reader arguments

options:
  --chunk-size CHUNK_SIZE
                        Chunk size for reading files by chunks (number of
                        records). The number of lines sent per request is the
                        minimum of the chunk size and the CKAN parameter
                        ckan.params.default_limit_write. Setting this option
                        activates reading by chunks (if supported by the file
                        format)
  --no-chunks           Option to disable reading files by chunks
  --allow-chunks        Option to enable reading files by chunks (useful for
                        file formats not enabling this feature by default)
  --read-kwargs [READ_KWARGS ...]
                        Keyword arguments for the read function in key=value
                        format
  --write-kwargs [WRITE_KWARGS ...]
                        Keyword arguments for the write function in key=value
                        format

Examples: 
- Changing chunk size: --chunk-size 10000 
- Disabling reading files by chunks: --no-chunks 
- Additional arguments for pandas.read_csv 
for a CSV file: --read-kwargs compression=gzip header=10
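The key=value pairs passed through --read-kwargs are forwarded to the underlying read function as keyword arguments. A minimal sketch of how such pairs could be turned into a keyword dictionary (the parse_kwargs helper is a hypothetical illustration, not part of the package):

```python
def parse_kwargs(pairs):
    # Hypothetical helper: split each "key=value" pair into a dict entry.
    # Values stay as strings; a real parser would coerce types where needed
    # (e.g. header=10 must become an int before reaching pandas.read_csv).
    out = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        out[key] = value
    return out

kwargs = parse_kwargs(["compression=gzip", "header=10"])
kwargs["header"] = int(kwargs["header"])  # coerce before forwarding
# pd.read_csv("data.csv.gz", **kwargs) would then use these arguments
```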

Pre-defined file format I/O functions

The following section lists the pre-defined file format I/O functions. The choice of function is determined by the Format attribute. The Options attribute described above can be used to customize the functions' behavior according to the documentation of the underlying functions.

CSV: delimited text file format

This method reads delimited text (tabular) file formats. The underlying function is pandas.read_csv. The default read parameters are dtype=str, keep_default_na=False, sep=None, engine='python'. For the write parameters, index=False is imposed.
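The stated defaults can be reproduced directly with pandas (the sample data below is illustrative):

```python
import io

import pandas as pd

csv_text = "id;name\n1;alpha\n2;beta\n"
# sep=None with engine='python' lets pandas sniff the ';' delimiter;
# dtype=str keeps every value as text; keep_default_na=False leaves
# empty cells as "" instead of NaN.
df = pd.read_csv(io.StringIO(csv_text), dtype=str,
                 keep_default_na=False, sep=None, engine="python")
buf = io.StringIO()
df.to_csv(buf, index=False)  # index=False is imposed on write
```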

SHP: shape file format

This method reads geographic file formats managed by geopandas. The underlying function is geopandas.read_file. The default argument for both read and write is encoding='utf-8'.

XLS: Excel/ODS file format

This method reads the following file formats: xls, xlsx, xlsm, xlsb, odf, ods and odt. The underlying function is pandas.read_excel. It accepts a special CLI argument to specify the sheet name: --sheet-name. This argument can also be set using the --read-kwargs argument such as in --read-kwargs sheet_name=YourSheetName. The default arguments of the read/write functions are those of the pandas documentation.

JSON: JSON file format

This method enables reading JSON files. It relies on pandas.read_json. The I/O functions are configured by default to write one line per record. This mode enables reading the file by chunks and appending lines to an existing file. The corresponding default read/write arguments are orient="records", lines=True. These can be changed with the CLI arguments --read-kwargs and --write-kwargs.
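With plain pandas, the documented defaults orient="records", lines=True write one JSON object per line, which is what makes chunked reads possible (the sample data below is illustrative):

```python
import io

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["alpha", "beta"]})
buf = io.StringIO()
df.to_json(buf, orient="records", lines=True)  # one record per line
buf.seek(0)
# lines=True is what enables reading by chunks with pandas.read_json
chunks = list(pd.read_json(buf, orient="records", lines=True, chunksize=1))
```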

User-defined file format I/O functions

In addition to these parameters, users can specify their own read/write functions with the Read function / Write function columns in the resources sheet. If one of the two functions is defined, the reciprocal function must also be defined whenever it is used: there is no fallback to the default file format function.

Extra argument to enable append mode

An extra CLI argument enables writing a file in append mode. In this case, the DataFrames can be written to disk as they are received. The option is called --allow-append.

Basic I/O prototypes

The function prototypes should be as follows. The positional arguments (before the asterisk *) are mandatory, as is the **kwargs argument, which keeps the functions compatible with future versions of the Python package. The parameters defined above also apply to user-defined functions. The example below returns a DataFrame from the read function.

from typing import Union, Dict, List, Generator
import io
import pandas as pd
from ckanapi_harvesters.auxiliary.ckan_model import CkanField
from ckanapi_harvesters.harvesters.file_formats.user_format import UserFileFormat

...

def read_function_example_df(file_path_or_stream: Union[str, io.IOBase], *,
                             fields: Union[Dict[str, CkanField], None],
                             allow_chunks: bool = True,
                             params: UserFileFormat = None,
                             **kwargs) -> Union[pd.DataFrame, List[dict]]:
    return pd.read_csv(file_path_or_stream)

def write_function_example(df: Union[pd.DataFrame, List[dict]],
                           file_path_or_stream: Union[str, io.IOBase], *,
                           fields: Union[Dict[str, CkanField], None],
                           append: bool = False,
                           params: UserFileFormat = None,
                           **kwargs) -> None:
    mode = 'a' if append else 'w'
    df.to_csv(file_path_or_stream, mode=mode, index=False)

Reading a file by chunks

Returning a DataFrame generator requires implementing a context manager, as in the example below.

from contextlib import contextmanager

...

@contextmanager
def read_function_example_by_chunks(file_path_or_stream: Union[str, io.IOBase], *,
                                    fields: Union[Dict[str, CkanField], None],
                                    allow_chunks: bool = True,
                                    params: UserFileFormat = None,
                                    **kwargs) -> Generator:
    # The context manager owns the file handle; the generator it yields
    # produces one DataFrame (or list of records) per chunk
    file_handle = open(file_path_or_stream, 'r')
    try:
        yield read_function_example_by_chunks_generator(file_handle)
    finally:
        file_handle.close()

def read_function_example_by_chunks_generator(file_handle) -> Generator[Union[pd.DataFrame, List[dict]], None, None]:
    # generator consumed by read_function_example_by_chunks
    for df_chunk in pd.read_csv(file_handle, chunksize=100):
        yield df_chunk
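A self-contained sketch of the same pattern, showing how such a chunked reader would be consumed (the file name and chunk size are illustrative):

```python
from contextlib import contextmanager

import pandas as pd

@contextmanager
def read_by_chunks(file_path, *, chunksize=100, **kwargs):
    # Same pattern as above: the context manager owns the file handle
    # and yields an iterator of DataFrame chunks read from it.
    file_handle = open(file_path, "r")
    try:
        yield pd.read_csv(file_handle, chunksize=chunksize)
    finally:
        file_handle.close()

# Usage: iterate over the chunks inside the with-block
with open("sample.csv", "w") as f:
    f.write("id,name\n1,alpha\n2,beta\n3,gamma\n")
with read_by_chunks("sample.csv", chunksize=2) as chunks:
    total = sum(len(df) for df in chunks)  # 3 rows across 2 chunks
```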

Database harvester options

These options are used to define the connection method (login, URL) and the database/dataset/table to harvest.

PostgreSQL

Examples

Example of arguments enabling a connection to a database:

--harvester Postgre --login-file postgre_login.txt 
--ca server_ca.pem --host my-postgre-server.com --port 5432 
--database My_Database --schema My_Schema 
--limit 1000

The argument --ca can be used to pass a user-specific signing certificate to authenticate the server. You can disable certificate verification by passing --ca False. This is not recommended, especially if the database server is reachable from the public internet, because under certain conditions any server could then impersonate the target server. A warning is generated in this case.

In the Excel resources worksheet, the table name is specified in the File/URL column of the resource sheet. A PostgreSQL schema is comparable to a CKAN dataset (a set of tables).

If you need to set up a secure connection to the database, you can set up an SSH tunnel outside of the Python package; this feature is not handled by the present package. Warning: please verify that the SSL options are correctly transferred when the connection is created in the harvester code. These features were not tested in the package, and an error is raised inviting you to contribute to this package.

Documentation

Harvester parameters

options:
  --harvester HARVESTER
                        Type of harvester to use
  --proxy PROXY         Proxy for HTTP and HTTPS
  --http-proxy HTTP_PROXY
                        HTTP proxy
  --https-proxy HTTPS_PROXY
                        HTTPS proxy
  --no-proxy NO_PROXY   Proxy exceptions
  --proxy-auth-file PROXY_AUTH_FILE
                        Path to a proxy authentication file with 3 lines
                        (authentication method, username, password)
  --ca CA               Server CA certificate location (.pem file)
  --ssl-certfile SSL_CERTFILE
                        Client certificate location (.pem file). This file
                        sometimes includes the client private key.
  --ssl-keyfile SSL_KEYFILE
                        Client private key location (.pem/.key file)
  --ssl                 Enable SSL encryption
  --no-ssl              Disable SSL encryption (when enabled by default, not
                        recommended)
  --allow-invalid-hostnames
                        Allow invalid host names (not recommended)
  --timeout TIMEOUT     Server timeout (seconds)
  --host HOST           Host for queries
  --port PORT           Port for queries
  --auth-url-suffix AUTH_URL_SUFFIX
                        URL suffix used to authenticate user
  --auth-url AUTH_URL   URL to authenticate user
  --url URL             Base URL for queries
  --apikey APIKEY       API key
  --apikey-file APIKEY_FILE
                        Path to a file containing the API key (first line)
  --login-file LOGIN_FILE
                        Path to a text file containing login credentials for
                        authentication (user, password)
  --login-user LOGIN_USER
                        Login user name (prefer using a file)
  --login-password LOGIN_PASSWORD
                        Login user password (prefer using a file)
  -v, --verbose         Option to set verbosity
  --database DATABASE   Database name
  --ckan-postgis        Option to use CKAN with PostGIS geometric types
  --ckan-epsg CKAN_EPSG
                        Default EPSG for CKAN
  -l LIMIT, --limit LIMIT
                        Number of rows per request
  --once                Option to perform only one request with the default
                        limit. This limits the amount of data retrieved.
  --dataset DATASET     Dataset name
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Output directory of download, relative to the download
                        directory (normally provided by File/URL attribute)
  --no-download NO_DOWNLOAD
                        Option to disable download
  --resource-url RESOURCE_URL
                        URL of resource
  --table TABLE         Table name
  --query QUERY         Query to restrict the lines of the table
  --schema SCHEMA       PostgreSQL schema name

MongoDB

Examples

Example of arguments enabling a connection to a database:

--harvester MongoDB --login-file mongodb_login.txt 
--ca server_ca.pem --host mongodb://my-mongodb-server.com:27017/admin 
--url mongodb://my-mongodb-server.com:27017 
--dataset My_Schema 
--limit 1000

The argument --ca can be used to pass a user-specific signing certificate to authenticate the server. You can disable certificate verification by passing --ca False. This is not recommended, especially if the database server is reachable from the public internet, because under certain conditions any server could then impersonate the target server. A warning is generated in this case.

In the Excel resources worksheet, the table name is specified in the File/URL column of the resource sheet. In MongoDB, tables are called collections.

If you need to set up a secure connection to the database, you can set up an SSH tunnel outside of the Python package; this feature is not handled by the present package. Warning: please verify that the SSL options are correctly transferred when the connection is created in the harvester code. These features were not tested in the package, and an error is raised inviting you to contribute to this package.

Documentation

Harvester parameters

options:
  --harvester HARVESTER
                        Type of harvester to use
  --proxy PROXY         Proxy for HTTP and HTTPS
  --http-proxy HTTP_PROXY
                        HTTP proxy
  --https-proxy HTTPS_PROXY
                        HTTPS proxy
  --no-proxy NO_PROXY   Proxy exceptions
  --proxy-auth-file PROXY_AUTH_FILE
                        Path to a proxy authentication file with 3 lines
                        (authentication method, username, password)
  --ca CA               Server CA certificate location (.pem file)
  --ssl-certfile SSL_CERTFILE
                        Client certificate location (.pem file). This file
                        sometimes includes the client private key.
  --ssl-keyfile SSL_KEYFILE
                        Client private key location (.pem/.key file)
  --ssl                 Enable SSL encryption
  --no-ssl              Disable SSL encryption (when enabled by default, not
                        recommended)
  --allow-invalid-hostnames
                        Allow invalid host names (not recommended)
  --timeout TIMEOUT     Server timeout (seconds)
  --host HOST           Host for queries
  --port PORT           Port for queries
  --auth-url-suffix AUTH_URL_SUFFIX
                        URL suffix used to authenticate user
  --auth-url AUTH_URL   URL to authenticate user
  --url URL             Base URL for queries
  --apikey APIKEY       API key
  --apikey-file APIKEY_FILE
                        Path to a file containing the API key (first line)
  --login-file LOGIN_FILE
                        Path to a text file containing login credentials for
                        authentication (user, password)
  --login-user LOGIN_USER
                        Login user name (prefer using a file)
  --login-password LOGIN_PASSWORD
                        Login user password (prefer using a file)
  -v, --verbose         Option to set verbosity
  --database DATABASE   Database name
  --ckan-postgis        Option to use CKAN with PostGIS geometric types
  --ckan-epsg CKAN_EPSG
                        Default EPSG for CKAN
  -l LIMIT, --limit LIMIT
                        Number of rows per request
  --once                Option to perform only one request with the default
                        limit. This limits the amount of data retrieved.
  --dataset DATASET     Dataset name
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Output directory of download, relative to the download
                        directory (normally provided by File/URL attribute)
  --no-download NO_DOWNLOAD
                        Option to disable download
  --resource-url RESOURCE_URL
                        URL of resource
  --table TABLE         Table name
  --query QUERY         Query to restrict the lines of the table
  --collection COLLECTION
                        MongoDB collection name
  --dbref-expand        Option to expand DBRefs