User Manual
===========

Laserfarm provides a set of tools to process massive LiDAR point-cloud data sets. In order to tackle volumes of data such as the ones produced by airborne laser scanning, Laserfarm makes use of a *divide-and-conquer* strategy. First, the raw data is split into tiles whose size is manageable by standard computing infrastructures. Then, point-cloud properties ("features") are extracted from the re-tiled data using the cells of a second, user-defined grid as ensembles for the statistical averaging. The mesh size of this second, finer grid ultimately defines the resolution (i.e. the pixel size) of the raster data. Finally, the properties computed for each sub-cell are exported in a raster format and all tile contributions are merged to form a single output GeoTIFF file, ideally covering the same area as the initial raw data.

The full processing pipeline can thus be subdivided into three steps: the raw-data re-tiling, the point-cloud data processing, and the raster-data merging and export. Each of these tasks can be run in a Python script that imports the dedicated Laserfarm module or through the command-line tool ``laserfarm``. The following sections describe the main features of each of these pipelines and their general use. Users interested in more advanced features and in how these tasks can be implemented for large-scale macro-ecology calculations can have a look at the Jupyter notebooks provided (see :ref:`Examples`).

.. _Retiling:

Raw Data Re-tiling
------------------

Point-cloud data from airborne laser scanning are typically provided as compressed files in LAZ format. The following script illustrates a minimal example in which a data set provided as a LAZ file (``point_cloud.LAZ``) is re-tiled according to a regular grid:

.. code-block:: python

    from laserfarm import Retiler

    pipeline = Retiler(input_file="point_cloud.LAZ")

    input_dict = {
        'set_grid': {
            'min_x': -113107.81,
            'max_x': 398892.19,
            'min_y': 214783.87,
            'max_y': 726783.87,
            'n_tiles_side': 256
        },
        'split_and_redistribute': {},
        'validate': {}
    }

    pipeline.config(input_dict)
    pipeline.run()

The re-tiling pipeline is configured using a dictionary whose keys identify the pipeline sub-tasks that need to be run and whose values provide the corresponding arguments.

.. NOTE::
    The input dictionary elements can be provided in any order. The order in which the tasks are executed is defined in a dedicated list (the ``pipeline`` attribute of the ``Retiler`` class), which can be inspected using the command-line tool: ``laserfarm retiling pipeline``.

Alternatively, if the input dictionary is serialized to JSON format and stored in a file, the pipeline can be directly configured in the following way:

.. code-block:: python

    pipeline.config(from_file="retiling_config.json")

The example above entails three sub-tasks: setting up the grid, running the splitting and validating the result. In the first sub-task, the grid is defined using its bounding box and the number of tiles in each direction (a square re-tiling scheme is assumed). In the following sub-task (which takes no input argument) the splitting is carried out and the sub-sets are saved in LAZ format. The output is organized in subdirectories that are labeled with two integer numbers corresponding to the ``X`` and ``Y`` indices in the chosen tiling scheme (each index runs from ``0`` to ``n_tiles_side - 1``). Note that the `PDAL library`_ is employed for the splitting, thus taking advantage of a low-level (C++) implementation. Finally, the result of the splitting is validated by checking that the generated LAZ files contain the same number of points as the parent file.

.. _PDAL library: https://pdal.io
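To make the tiling scheme of the example more concrete, the following sketch computes the (``X``, ``Y``) indices of the tile a point would fall into. It is purely illustrative: the function below is not part of Laserfarm, and the exact indexing convention used internally is an assumption. It does show, however, how the bounding box and ``n_tiles_side`` determine a 2000 m tile size for the grid above:

.. code-block:: python

    # Illustrative sketch only: compute the tile indices of a point for the
    # square grid of the re-tiling example (not part of the Laserfarm API).
    min_x, max_x = -113107.81, 398892.19
    min_y, max_y = 214783.87, 726783.87
    n_tiles_side = 256
    tile_size = (max_x - min_x) / n_tiles_side  # 2000 m, same along X and Y

    def tile_indices(x, y):
        """Return the (X, Y) tile indices, each running from 0 to n_tiles_side - 1."""
        index_x = min(int((x - min_x) // tile_size), n_tiles_side - 1)
        index_y = min(int((y - min_y) // tile_size), n_tiles_side - 1)
        return index_x, index_y

    print(tile_indices(0.0, 500000.0))  # (56, 142)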
The same calculation can be run using the command-line tool in the following way:

.. code-block:: shell

    laserfarm retiling --input_file=point_cloud.LAZ - set_grid --min_x=-113107.81 --max_x=398892.19 --min_y=214783.87 --max_y=726783.87 --n_tiles_side=256 - split_and_redistribute - validate

Note that the various tasks with the corresponding arguments are chained using hyphens. If the tasks and arguments are provided as a JSON configuration file, the pipeline can be executed in the following way:

.. code-block:: shell

    laserfarm retiling --input_file=point_cloud.LAZ - config --from_file=retiling_config.json - run

.. _DataProcessing:

Point-Cloud Data Processing
---------------------------

Once the raw data is split into tiles whose volume can be handled by the infrastructure available to the user, point-cloud-based properties can be extracted. Laserfarm implements a wrapper around `laserchicken`_, which is the engine employed to parse and process point-cloud data. The following example Python script processes a LAZ file that contains the point-cloud subset corresponding to the ``(X=0, Y=0)`` tile in the chosen tiling scheme:

.. code-block:: python

    from laserfarm import DataProcessing

    pipeline = DataProcessing(input="tile.LAZ", tile_index=(0, 0))

    input_dict = {
        'load': {},
        'normalize': {'cell_size': 1},
        'generate_targets': {
            'min_x': -113107.81,
            'max_x': 398892.19,
            'min_y': 214783.87,
            'max_y': 726783.87,
            'n_tiles_side': 256,
            'tile_mesh_size': 10.,
            'validate': True,
        },
        'extract_features': {
            'volume_type': 'cell',
            'volume_size': 10.,
            'feature_names': ['point_density']
        },
        'export_targets': {}
    }

    pipeline.config(input_dict)
    pipeline.run()

.. _laserchicken: https://github.com/eEcoLiDAR/laserchicken

Here, too, a dictionary is employed to configure the pipeline (a JSON file could be used exactly as in :ref:`Retiling`). The command-line tool can also be used to run the data-processing pipeline (the ``data_processing`` command is issued here):

.. code-block:: shell

    laserfarm data_processing --input=tile.LAZ --tile_index=[0,0] - load - generate_targets --min_x=-113107.81 --max_x=398892.19 --min_y=214783.87 --max_y=726783.87 --n_tiles_side=256 --tile_mesh_size=10. --validate - extract_features --feature_names=[point_density] - export_targets

or, if the configuration dictionary is serialized in the ``data_processing.json`` file:

.. code-block:: shell

    laserfarm data_processing --input=tile.LAZ --tile_index=[0,0] - config --from_file=data_processing.json - run

The full (ordered) list of tasks that can be executed within the data-processing pipeline can be inspected from the command line:

.. code-block:: shell

    laserfarm data_processing pipeline

The example pipeline above entails five steps. First, the point-cloud data is loaded into memory. Note that the input path provided can point to either a file or a directory, in which case all files in a point-cloud format that is known to ``laserchicken`` are considered. In order to reduce the memory requirements, one can load only the attributes that are necessary for further data processing from the input LAZ file(s). These attributes can be provided using the optional argument ``attributes`` of the ``load`` method of ``DataProcessing``:

.. code-block:: python

    input_dict = {
        ...
        'load': {'attributes': ['intensity', 'gps_time']}
        ...
    }

If no attribute other than the (X, Y, Z) coordinates of the points is required, one can assign an empty list to ``attributes``.
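The configuration dictionaries shown above can be written to JSON with the Python standard library and then passed to ``config(from_file=...)`` or to the command-line ``config`` task. The following is a minimal sketch (the abridged dictionary content is only for illustration):

.. code-block:: python

    import json

    # Abridged configuration dictionary, serialized to the "data_processing.json"
    # file referenced in the command-line example above.
    input_dict = {
        'load': {},
        'normalize': {'cell_size': 1},
        'extract_features': {
            'volume_type': 'cell',
            'volume_size': 10.,
            'feature_names': ['point_density']
        },
        'export_targets': {}
    }

    with open("data_processing.json", "w") as f:
        json.dump(input_dict, f, indent=4)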
The second step of the pipeline consists in the normalization of the point-cloud heights, which is required for the extraction of some of the features (see the ``laserchicken`` `manual`_). Square cells are employed for this purpose, and the length of the cell sides (in meters) is set with the ``cell_size`` argument.

In order to extract statistical properties from the data, the point cloud must be subdivided into partitions that represent the ensembles over which the properties are calculated. Such partitions (the 'neighborhoods') can be defined using contiguous square cells, and the properties computed over each neighborhood are assigned to the cell centroids (see also the ``laserchicken`` `manual`_). For a given tile, the full set of centroids, i.e. the target points, is generated by the ``generate_targets`` method, which requires information about the tiling scheme and the desired mesh size of the target grid (``tile_mesh_size``, in meters). Note that ``tile_mesh_size`` ultimately sets the desired resolution of the raster maps, since it corresponds to the pixel size in the final GeoTIFFs. If ``validate`` is set to true, the points belonging to the input point cloud are checked to lie within the boundaries of the tile for which target points are generated (recommended).

Once the target point set is generated, the desired properties of the input point cloud can be computed. The example above will calculate a single feature, i.e. ``point_density``, but multiple features can be extracted in a single run. ``volume_type`` and ``volume_size`` define the neighborhoods employed for the extraction of properties: by assigning them with ``cell`` and the value employed for ``tile_mesh_size``, respectively, the neighborhoods are defined as the cells whose centroids are the generated target points. Statistical properties can be computed over a subset of points in each neighborhood (for instance, to mimic data sets with lower point densities). This is achieved by specifying the ``sample_size`` argument to the ``extract_features`` method, which defines the number of randomly selected points considered in each cell (all points are considered for cells that include :math:`N\leq` ``sample_size`` points).

Finally, the target points and the associated properties are written to disk. By default, the polygon (PLY) format is employed, with one output file including all extracted features. However, single-feature files can also be exported by setting the ``multi_band_files`` argument to false.

Additional steps that can be optionally included in the data-processing pipeline allow the user to generate parametrized features using the extractors available in ``laserchicken`` (see the `manual`_) and to select a subset of the input point cloud for the feature extraction. Specific information on the required arguments can be obtained from the corresponding command-line helpers:

.. code-block:: shell

    laserfarm data_processing add_custom_feature --help

and:

.. code-block:: shell

    laserfarm data_processing apply_filter --help

.. _manual: https://laserchicken.readthedocs.io/en/latest
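As an illustration of the ``extract_features`` options discussed above, the following sketch updates the dictionary of the example above to extract multiple features over a random subset of points per cell. The feature names other than ``point_density`` are given as examples only; the full list of available features is documented in the ``laserchicken`` `manual`_:

.. code-block:: python

    # Illustrative configuration (feature names beyond 'point_density' are
    # assumptions; check the laserchicken manual for the available extractors).
    input_dict['extract_features'] = {
        'volume_type': 'cell',
        'volume_size': 10.,
        'feature_names': ['point_density', 'mean_z', 'max_z'],
        'sample_size': 1000  # use at most 1000 randomly selected points per cell
    }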
.. NOTE::
    ``laserchicken`` computes and caches the k-d tree of the point cloud in order to efficiently query points in the filtering (with polygons), normalization and feature-extraction tasks. The cache can be cleared using the ``clear_cache`` method of the point-cloud data-processing pipeline, e.g. by setting:

    .. code-block:: python

        input_dict = {
            ...
            'clear_cache': {},
            ...
        }

GeoTIFF Export
--------------

In the last step of the full processing pipeline, the properties extracted from the raw input point cloud in a tile-wise fashion are tiled back together and exported as raster maps. The following example illustrates how to generate a single-band GeoTIFF file for the ``point_density`` feature from a set of PLY files containing the target points for all the tiles into which an initial LAZ file has been split:

.. code-block:: python

    from laserfarm import GeotiffWriter

    pipeline = GeotiffWriter(input_dir="/path/to/PLY/files", bands='point_density')

    input_dict = {
        'parse_point_cloud': {},
        'data_split': {'xSub': 1, 'ySub': 1},
        'create_subregion_geotiffs': {'output_handle': 'geotiff'}
    }

    pipeline.config(input_dict)
    pipeline.run()

Similarly to the re-tiling and point-cloud data-processing pipelines, the ``config`` and ``run`` methods are employed to configure and run the pipeline, respectively. The same pipeline can be run via the command line as:

.. code-block:: shell

    laserfarm geotiff_writer --input_dir=/path/to/PLY/files --bands=point_density - parse_point_cloud - data_split --xSub=1 --ySub=1 - create_subregion_geotiffs --output_handle=geotiff

As for the other pipelines, JSON files can be used to configure the pipeline as well.

This example pipeline entails the following steps. First, the list of PLY files to be parsed is constructed and a representative file is parsed in order to obtain information on the number of target points per tile and the spacing between target points.

.. NOTE::
    All tiles are assumed to be square and to include the same number of target points with the same target mesh size.

For data sets with a large lateral extent or a very high resolution (i.e. very fine target meshes), a single GeoTIFF file could be difficult to handle with standard GIS tools. It is thus possible to partition the area covered by the tiles into (``xSub`` :math:`\times` ``ySub``) sub-regions and to generate a GeoTIFF for each of the sub-regions. In the example above, ``xSub = ySub = 1`` results in a single GeoTIFF file covering all tiles.

.. NOTE::
    The sub-region dimensions should be multiples of the corresponding tile dimensions.

Finally, Laserfarm generates the GeoTIFF file(s) using `GDAL`_ (``output_handle`` is employed as the file-name handle).

.. _GDAL: https://gdal.org

Point Classification
--------------------

Laserfarm allows the user to classify the points belonging to a point-cloud data set using (multi-)polygons defined in a set of files in shapefile format (``.shp``). For macro-ecology applications, this can be useful, for instance, to classify points as part of water bodies, buildings, vegetation, etc. In this example, the target points in the PLY file ``tile.ply`` are classified using the shapefiles provided at a given path:

.. code-block:: python

    from laserfarm import Classification

    pipeline = Classification(input_file="tile.ply")

    input_dict = {
        'locate_shp': {'shp_dir': '/path/to/dir/with/shp/files'},
        'classification': {'ground_type': 1},
        'export_point_cloud': {}
    }

    pipeline.config(input_dict)
    pipeline.run()

To run the same pipeline using the command-line tool:

.. code-block:: shell

    laserfarm classification --input_file=tile.ply - locate_shp --shp_dir=/path/to/dir/with/shp/files - classification --ground_type=1 - export_point_cloud

As for all the other pipelines, JSON files can be used to configure the pipeline as well.

The first task in the pipeline consists in identifying which of the provided shapefiles are relevant for the given point set (this is determined by checking whether any of the polygons intersect the point-cloud bounding box). Then, the points are classified: for the points falling within the polygons, the feature ``ground_type`` is updated to ``1`` (the feature is added if not already present). Finally, the point-cloud data set is written to disk.
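The two geometric checks described above can be illustrated with ``shapely``. This sketch is not Laserfarm's implementation (shapefile parsing is omitted, and the polygon and bounding box are made up for the example); it only shows the bounding-box intersection and point-in-polygon tests:

.. code-block:: python

    from shapely.geometry import Point, Polygon, box

    # Illustration only: a polygon representing, e.g., a water body, and the
    # bounding box of a point-cloud tile.
    water_body = Polygon([(0, 0), (100, 0), (100, 50), (0, 50)])
    tile_bbox = box(minx=-20, miny=-20, maxx=80, maxy=80)

    # Step 1: is the polygon relevant for this tile?
    print(water_body.intersects(tile_bbox))   # True

    # Step 2: does a given point fall within the polygon?
    print(water_body.contains(Point(10., 10.)))   # True -> 'ground_type' set to 1
    print(water_body.contains(Point(-10., 10.)))  # False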
Pipelines with Remote Data
--------------------------

LiDAR-based macro-ecology studies can easily involve several TBs of raw point-cloud data. Such data volumes are difficult to handle on standard local machines. In addition, the data should also be accessible to the infrastructure(s) where the processing takes place (e.g. to all the nodes of a compute cluster). In order to avoid data duplication and to limit the disk-space requirements of the processing unit(s), a remote storage infrastructure can be used to store the raw data and the results of the pipeline calculations. The raw-data re-tiling, point-cloud data-processing and GeoTIFF-writing pipelines implement methods to retrieve input from and upload output to storage services using the WebDAV protocol. The following example shows how the example in :ref:`Retiling` can be modified to retrieve ``point_cloud.LAZ`` from the storage facility with hostname ``https://webdav.hostname.com`` (connecting to port ``8888``), using the specified credentials to log in:

.. code-block:: python
    :emphasize-lines: 4-9,11,12,22,23

    from laserfarm import Retiler

    pipeline = Retiler(input_file="point_cloud.LAZ")
    webdav_options = {
        'webdav_hostname': 'https://webdav.hostname.com:8888',
        'webdav_login': 'username',
        'webdav_password': 'password'
    }
    pipeline.set_wdclient(webdav_options)
    input_dict = {
        'setup_local_fs': {'tmp_folder': '/path/to/local/tmp/dir'},
        'pullremote': '/remote/path/to/input',
        'set_grid': {
            'min_x': -113107.81,
            'max_x': 398892.19,
            'min_y': 214783.87,
            'max_y': 726783.87,
            'n_tiles_side': 256
        },
        'split_and_redistribute': {},
        'validate': {},
        'pushremote': '/remote/path/to/output',
        'cleanlocalfs': {}
    }
    pipeline.config(input_dict)
    pipeline.run()

Laserfarm will create two directories for input and output as sub-folders of ``tmp_folder``, download the input file ``point_cloud.LAZ`` from the path ``/remote/path/to/input`` on the WebDAV server to the input folder, perform the re-tiling as described in :ref:`Retiling`, upload the results from the output folder to the remote path ``/remote/path/to/output`` on the WebDAV server and, finally, delete the local input and output folders. It is also possible to set arbitrary paths for the input and output folders:

.. code-block:: python

    input_dict = {
        ...
        'setup_local_fs': {
            'input_folder': '/path/to/local/input/folder',
            'output_folder': '/path/to/local/output/folder'
        }
        ...
    }

The point-cloud data-processing pipeline and the GeoTIFF-exporting pipeline can be configured to retrieve input files (or directories) from a storage service with WebDAV support in the very same way.
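For instance, a data-processing pipeline with remote input and output could be configured as in the following sketch, which simply combines the tasks of :ref:`DataProcessing` with the remote-data tasks shown above (the hostname, credentials, file names and remote paths are placeholders):

.. code-block:: python

    from laserfarm import DataProcessing

    pipeline = DataProcessing(input="tile.LAZ", tile_index=(0, 0))
    webdav_options = {
        'webdav_hostname': 'https://webdav.hostname.com:8888',
        'webdav_login': 'username',
        'webdav_password': 'password'
    }
    pipeline.set_wdclient(webdav_options)

    input_dict = {
        'setup_local_fs': {'tmp_folder': '/path/to/local/tmp/dir'},
        'pullremote': '/remote/path/to/tiles',
        'load': {},
        'normalize': {'cell_size': 1},
        'generate_targets': {
            'min_x': -113107.81, 'max_x': 398892.19,
            'min_y': 214783.87, 'max_y': 726783.87,
            'n_tiles_side': 256, 'tile_mesh_size': 10., 'validate': True
        },
        'extract_features': {'volume_type': 'cell', 'volume_size': 10.,
                             'feature_names': ['point_density']},
        'export_targets': {},
        'pushremote': '/remote/path/to/targets',
        'cleanlocalfs': {}
    }

    pipeline.config(input_dict)
    pipeline.run()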
Macro-Pipelines
---------------

For a macro-ecology study where the point-cloud data is stored in multiple LAZ files, the re-tiling of all input files, the feature extraction for all tiles into which the raw data is split, and the generation of GeoTIFFs for all desired features are embarrassingly parallel tasks. The following example shows how the example in :ref:`Retiling` can be modified to perform the re-tiling of 10 point-cloud files (``point_cloud_X.LAZ``, where ``X`` ranges from 0 to 9), exploiting the parallelization over input files:

.. code-block:: python

    from laserfarm import Retiler, MacroPipeline

    macro = MacroPipeline()

    input_dict = {
        'set_grid': {
            'min_x': -113107.81,
            'max_x': 398892.19,
            'min_y': 214783.87,
            'max_y': 726783.87,
            'n_tiles_side': 256
        },
        'split_and_redistribute': {},
        'validate': {}
    }

    filenames = ['point_cloud_{}.LAZ'.format(n) for n in range(10)]
    macro.tasks = [Retiler(input_file=f, label=f).config(input_dict)
                   for f in filenames]

    macro.setup_cluster(mode='local', processes=True, n_workers=2,
                        threads_per_worker=1)
    macro.run()
    macro.print_outcome(to_file='results.txt')

The parallelization is achieved using `Dask`_, which is employed to deploy the cluster and to distribute the tasks. In the example above, the computing cluster consists of two local processes (two 'workers') spawning one thread each (recommended for all pipelines, and required for the feature-extraction tasks that involve ``laserchicken``). Each of the workers takes care of the execution of one task at a time until all tasks are completed.

.. NOTE::
    When performing macro-pipeline calculations that include ``DataProcessing`` pipelines (see :ref:`DataProcessing`), it is important to include the ``clear_cache`` task in the input to prevent the cache from filling up the workers' memory.

In order to distribute tasks to a cluster deployed over compute nodes using SSH, the script above can be modified in the following way:

.. code-block:: python

    ...
    macro.setup_cluster(mode='ssh',
                        hosts=['172.17.0.1', '172.17.0.1', '172.17.0.2'],
                        connect_options={'known_hosts': None,
                                         'username': 'username',
                                         'client_keys': '.ssh/id_rsa'},
                        worker_options={'nthreads': 1, 'nprocs': 2},
                        scheduler_options={'dashboard_address': '8787'})
    ...

The first address or hostname in the host list is employed for the scheduler, while all the other addresses/hostnames are used for the workers. The ``nprocs`` and ``nthreads`` arguments set the number of workers running on each host and the number of threads spawned by each worker, respectively. For further information we refer to the `Dask documentation`_.

Any other deployed Dask cluster can be used to distribute tasks within ``MacroPipeline`` if passed as an argument to the ``setup_cluster`` method, for instance:

.. code-block:: python

    from dask_jobqueue import SLURMCluster
    ...
    cluster = SLURMCluster(...)
    macro.setup_cluster(cluster=cluster)
    macro.run()

.. _Dask: https://dask.org
.. _Dask documentation: https://docs.dask.org/en/latest/setup/ssh.html

.. NOTE::
    No command-line support is provided in Laserfarm for macro-pipeline calculations.
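The feature extraction for the re-tiled data can be distributed in the same way, with one ``DataProcessing`` task per tile. The following sketch illustrates this; the tile file names (``tile_X_Y.LAZ``) and the selection of tiles are assumptions made for the example, and in practice only the tiles actually produced by the re-tiling step would be included:

.. code-block:: python

    from laserfarm import DataProcessing, MacroPipeline

    # Grid parameters of the re-tiling example above.
    grid = {'min_x': -113107.81, 'max_x': 398892.19,
            'min_y': 214783.87, 'max_y': 726783.87, 'n_tiles_side': 256}

    input_dict = {
        'load': {},
        'normalize': {'cell_size': 1},
        'generate_targets': {**grid, 'tile_mesh_size': 10., 'validate': True},
        'extract_features': {'volume_type': 'cell', 'volume_size': 10.,
                             'feature_names': ['point_density']},
        'export_targets': {},
        'clear_cache': {}  # see the note above
    }

    # Hypothetical file-naming convention and tile selection for this sketch.
    tiles = [(1, 1), (1, 2), (2, 1), (2, 2)]
    macro = MacroPipeline()
    macro.tasks = [DataProcessing(input='tile_{}_{}.LAZ'.format(nx, ny),
                                  tile_index=(nx, ny)).config(input_dict)
                   for (nx, ny) in tiles]

    macro.setup_cluster(mode='local', processes=True, n_workers=2,
                        threads_per_worker=1)
    macro.run()
    macro.print_outcome(to_file='results.txt')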
.. _Examples:

Examples
--------

The GitHub `repository`_ of Laserfarm includes a tutorial structured as a Jupyter notebook (``tutorial.ipynb``). The notebook illustrates how to use Laserfarm to process a subset of the *Actueel Hoogtebestand Nederland* (`AHN3`_) data set, from the retrieval of an example point-cloud data file in LAZ format to the export of the extracted features to a GeoTIFF file.

.. _repository: https://github.com/eEcoLiDAR/Laserfarm
.. _AHN3: https://www.pdok.nl/introductie/-/article/actueel-hoogtebestand-nederland-ahn3-

A second notebook (``workflow.ipynb``) shows the workflow employed to process the full AHN3 data set. The notebook illustrates how the re-tiling, point-cloud data-processing and GeoTIFF-exporting tasks have been configured and distributed over the nodes of a compute cluster.

Finally, Python scripts and pipeline configuration files that have been used to test the various pipelines either on local machines or on a virtual `docker-container-based cluster`_ can be found `here`_.

.. _docker-container-based cluster: https://github.com/eEcoLiDAR/dockerTestCluster
.. _here: https://github.com/eEcoLiDAR/Laserfarm/tree/master/examples

Current Limitations
-------------------

This package has been tested on data provided in a metric-based 2D-projected Cartesian coordinate system. While some of the tools of Laserfarm could be applied to data in an ellipsoidal latitude/longitude coordinate system as well, this has not been tested and it is generally expected to fail.