User Manual

Laserfarm provides a set of tools to process massive LiDAR point-cloud data sets. In order to tackle volumes of data such as the ones produced by airborne laser scanning, Laserfarm makes use of a divide-et-impera strategy. First, the raw data is split into tiles whose size is manageable by standard computing infrastructures. Then, point-cloud properties (“features”) are extracted from the re-tiled data using the cells of a second user-defined grid as ensembles for the statistical averaging. The mesh size of this second finer grid ultimately defines the resolution (i.e. the pixel size) of the raster data. Finally, the properties computed for each sub-cell are exported in a raster format and all tile contributions merged to form a single output GeoTIFF file, ideally covering the same area as the initial raw data.

The full processing pipeline can thus be subdivided into three steps: the raw-data re-tiling, the point-cloud data processing, and the raster-data merging and export. Each of these tasks can be run in a Python script that imports the dedicated Laserfarm module or through the command-line tool laserfarm. The following sections describe the main features of each of these pipelines and their general use. Users interested in more advanced features, and in how these tasks can be implemented for large-scale macro-ecology calculations, can have a look at the Jupyter notebooks provided (see Examples).

Raw Data Re-tiling

Point-cloud data from airborne laser scanning are typically provided as compressed files in LAZ format. The following script illustrates a minimal example in which a data set provided as a LAZ file (point_cloud.LAZ) is re-tiled according to a regular grid:

from laserfarm import Retiler

pipeline = Retiler(input_file="point_cloud.LAZ")
input_dict = {
    'set_grid': {
        'min_x': -113107.81,
        'max_x': 398892.19,
        'min_y': 214783.87,
        'max_y': 726783.87,
        'n_tiles_side': 256
    },
    'split_and_redistribute': {},
    'validate': {}
}
pipeline.config(input_dict)
pipeline.run()

The re-tiling pipeline is configured using a dictionary whose keys identify the pipeline sub-tasks to be run and whose values provide the corresponding arguments.
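
Conceptually, this configuration corresponds to running the named sub-tasks with the given keyword arguments (an illustrative sketch of the equivalence, not the actual call sequence used internally):

# Illustrative only: each dictionary key names a sub-task, each value its arguments
pipeline = Retiler(input_file="point_cloud.LAZ")
pipeline.set_grid(min_x=-113107.81, max_x=398892.19,
                  min_y=214783.87, max_y=726783.87, n_tiles_side=256)
pipeline.split_and_redistribute()
pipeline.validate()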

Note

The input dictionary elements can be provided in any order. The order in which the tasks are executed is defined in a dedicated list (the pipeline attribute of the Retiler class), which can be inspected using the command-line tool: laserfarm retiling pipeline.
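
The same ordered list can also be inspected from Python, e.g. (a minimal sketch, assuming the pipeline attribute is a plain sequence of task names):

from laserfarm import Retiler

# Print the ordered sequence of sub-tasks defined for the re-tiling pipeline
print(Retiler(input_file="point_cloud.LAZ").pipeline)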

Alternatively, if the input dictionary is serialized to JSON format and stored in a file, the pipeline can be directly configured in the following way:

pipeline.config(from_file="retiling_config.json")
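
The configuration file can be produced, for instance, by dumping the dictionary defined above with the standard json module (a minimal sketch; the file name retiling_config.json is only an example):

import json

# Serialize the configuration dictionary from the previous example to JSON
with open("retiling_config.json", "w") as f:
    json.dump(input_dict, f, indent=4)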

The example above entails three sub-tasks: setting up the grid, running the splitting, and validating the result. In the first sub-task, the grid is defined by its bounding box and the number of tiles along each direction (a square re-tiling scheme is assumed). In the following sub-task (which takes no input argument) the splitting is carried out and the subsets are saved in LAZ format. The output is organized in subdirectories labeled with two integer numbers corresponding to the X and Y indices in the chosen tiling scheme (each index runs from 0 to n_tiles_side - 1). Note that the PDAL library is employed for the splitting, thus taking advantage of a low-level (C++) implementation. Finally, the result of the splitting is validated by checking that the generated LAZ files contain the same number of points as the parent file.
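
To illustrate the tiling scheme, the indices of the tile containing a given point can be computed from the grid parameters as follows (a sketch, assuming tiles are indexed from the lower-left corner of the bounding box; the exact convention adopted by Laserfarm may differ):

# Grid parameters from the example above
min_x, max_x = -113107.81, 398892.19
min_y, max_y = 214783.87, 726783.87
n_tiles_side = 256

# Side length of a single (square) tile: 2000 m for this grid
tile_side = (max_x - min_x) / n_tiles_side

# Tile indices of an arbitrary example point (x, y)
x, y = 85000.0, 450000.0
tile_x = int((x - min_x) // tile_side)
tile_y = int((y - min_y) // tile_side)
print(tile_x, tile_y)  # each index runs from 0 to n_tiles_side - 1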

The same calculation can be run using the command-line tool in the following way:

laserfarm retiling --input_file=point_cloud.LAZ - set_grid --min_x=-113107.81 --max_x=398892.19 --min_y=214783.87 --max_y=726783.87 --n_tiles_side=256 - split_and_redistribute - validate

Note that the various tasks with the corresponding arguments are chained using hyphens. If the tasks and arguments are provided as a JSON configuration file, the pipeline can be executed in the following way:

laserfarm retiling --input_file=point_cloud.LAZ - config --from_file=retiling_config.json - run

Point-Cloud Data Processing

Once the raw data is split into tiles whose volume can be handled by the infrastructure available to the user, point-cloud-based properties can be extracted. Laserfarm implements a wrapper to laserchicken, which is the engine employed to parse and process point-cloud data. The following example Python script processes a LAZ file that contains the point-cloud subset corresponding to the (X=0, Y=0) tile in the chosen tiling scheme:

from laserfarm import DataProcessing

pipeline = DataProcessing(input="tile.LAZ", tile_index=(0, 0))
input_dict = {
    'load': {},
    'normalize': {'cell_size': 1},
    'generate_targets': {
        'min_x': -113107.81,
        'max_x': 398892.19,
        'min_y': 214783.87,
        'max_y': 726783.87,
        'n_tiles_side': 256,
        'tile_mesh_size': 10.,
        'validate': True,
    },
    'extract_features': {
        'volume_type': 'cell',
        'volume_size': 10.,
        'feature_names': ['point_density']
    },
    'export_targets': {}
}
pipeline.config(input_dict)
pipeline.run()

Here, too, a dictionary is employed to configure the pipeline (a JSON file could be used exactly as in Raw Data Re-tiling). The command-line tool can also be used to run the data-processing pipeline (the data_processing command is issued here):

laserfarm data_processing --input=tile.LAZ --tile_index=[0,0] - load - generate_targets --min_x=-113107.81 --max_x=398892.19 --min_y=214783.87 --max_y=726783.87 --n_tiles_side=256 --tile_mesh_size=10. --validate - extract_features --feature_names=[point_density] - export_targets

or, if the configuration dictionary is serialized in the data_processing.json file:

laserfarm data_processing --input=tile.LAZ --tile_index=[0,0] - config --from_file=data_processing.json - run

The full (ordered) list of tasks that can be executed within the data processing pipeline can be inspected from the command line:

laserfarm data_processing pipeline

The example pipeline above entails five steps. First, the point-cloud data is loaded into memory. Note that the input path provided can point either to a file or to a directory, in which case all files in a point-cloud format known to laserchicken are considered. In order to reduce the memory requirements, one can load from the input LAZ file(s) only the attributes that are necessary for further data processing. These attributes can be provided using the optional argument attributes of the load method of DataProcessing:

input_dict = {
    ...
    'load': {'attributes': ['intensity', 'gps_time']}
    ...
}

If no attribute other than the (X, Y, Z) coordinates of the points is required, one can assign an empty list to attributes.
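
For instance (a minimal sketch of the corresponding configuration element, mirroring the snippet above):

input_dict = {
    ...
    'load': {'attributes': []},  # load only the X, Y, Z coordinates
    ...
}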

The second step of the pipeline consists of normalizing the point-cloud heights, which is required for the extraction of some of the features (see the laserchicken manual). Square cells are employed for this purpose, and the length of the cell sides (in meters) is set with the cell_size argument.

In order to extract statistical properties from the data, the point cloud must be subdivided into partitions that represent the ensembles over which the properties are calculated. Such partitions (the ‘neighborhoods’) can be defined using contiguous square cells, and the properties computed over each neighborhood assigned to the cells’ centroids (see also the laserchicken manual). For a given tile the full set of centroids, i.e. the target points, is generated by the generate_targets method, which requires information about the tiling scheme and the desired mesh size of the target grid (tile_mesh_size, in meters). Note that tile_mesh_size ultimately sets the resolution of the raster maps, since it corresponds to the pixel size in the final GeoTIFFs. If validate is set to True, the points belonging to the input point cloud are checked to lie within the boundaries of the tile for which the target points are generated (recommended).
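
As a concrete illustration of how the tiling scheme and tile_mesh_size determine the raster resolution, consider the grid used in the examples above (a sketch; the numbers follow directly from the configuration values):

# Grid and target-mesh parameters from the example above
min_x, max_x = -113107.81, 398892.19
n_tiles_side = 256
tile_mesh_size = 10.

# Each (square) tile spans 2000 m x 2000 m
tile_side = (max_x - min_x) / n_tiles_side

# 200 target points (i.e. 200 raster pixels of 10 m) per tile side
n_targets_per_side = int(tile_side / tile_mesh_size)
print(tile_side, n_targets_per_side)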

Once the target point set is generated, the desired properties of the input point cloud can be computed. The example above calculates a single feature, i.e. point_density, but multiple features can be extracted in a single run. volume_type and volume_size define the neighborhoods employed for the extraction of properties: by setting them to cell and to the value employed for tile_mesh_size, respectively, the neighborhoods are defined as the cells whose centroids are the generated target points. Statistical properties can be computed over a subset of points in each neighborhood (for instance, to mimic data sets with lower point densities). This is achieved by specifying the sample_size argument to the extract_features method, which defines the number of randomly-selected points considered in each cell (all points are considered for cells that include \(N\leq\) sample_size points).
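
For instance, the feature-extraction element could be modified as follows to consider at most 100 randomly-selected points per cell (a sketch; the value chosen for sample_size is arbitrary):

input_dict = {
    ...
    'extract_features': {
        'volume_type': 'cell',
        'volume_size': 10.,
        'sample_size': 100,
        'feature_names': ['point_density']
    },
    ...
}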

Finally, the target points and the associated properties are written to disk. By default, the polygon (PLY) format is employed, with one output file including all extracted features. However, single-feature files can also be exported by setting the multi_band_files argument to false.
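
A sketch of the corresponding configuration element for single-feature output files:

input_dict = {
    ...
    'export_targets': {'multi_band_files': False},
    ...
}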

Additional steps that can be optionally included in the data-processing pipeline allow the user to generate parametrized features using the extractors available in laserchicken (see the manual) and to select a subset of the input point cloud for the feature extraction. Specific information on the required arguments can be obtained from the corresponding command-line helpers:

laserfarm data_processing add_custom_feature --help

and:

laserfarm data_processing apply_filter --help

Note

laserchicken computes and caches the k-d tree of the point cloud in order to efficiently query point-cloud points in the filtering (with polygons), normalization and feature-extraction tasks. The cache can be cleared using the clear_cache method in the point-cloud data-processing pipeline, e.g. by setting:

input_dict = {
    ...
    'clear_cache': {},
    ...
}

GeoTIFF Export

In the last step of the full processing pipeline the properties extracted from the raw input point cloud in a tile-wise fashion are tiled back together and exported as raster maps. The following example illustrates how to generate a single-band GeoTIFF file for the point_density feature from a set of PLY files containing the target points for all the tiles in which an initial LAZ file has been split:

from laserfarm import GeotiffWriter

pipeline = GeotiffWriter(input_dir="/path/to/PLY/files", bands='point_density')
input_dict = {
    'parse_point_cloud': {},
    'data_split': {'xSub': 1, 'ySub': 1},
    'create_subregion_geotiffs': {'output_handle': 'geotiff'}
}
pipeline.config(input_dict)
pipeline.run()

Similarly to the re-tiling and point-cloud data-processing pipelines, the config and run methods are employed to configure and run the pipeline, respectively. The same pipeline can be run via the command line as:

laserfarm geotiff_writer --input_dir=/path/to/PLY/files --bands=point_density - parse_point_cloud - data_split --xSub=1 --ySub=1 - create_subregion_geotiffs --output_handle=geotiff

As for the other pipelines, JSON files can be used to configure the pipeline as well. This example pipeline entails the following steps. First, the list of PLY files to be parsed is constructed, and a representative file is parsed in order to obtain information on the number of target points per tile and the spacing between target points.

Note

All tiles are assumed to be square and to include the same number of target points with the same target mesh size.

For data sets with large lateral extent or very high resolution (i.e. very fine target meshes), a single GeoTIFF file could be difficult to handle with standard GIS tools. It is thus possible to partition the area covered by the tiles into (xSub \(\times\) ySub) sub-regions and to generate a GeoTIFF for each sub-region. In the example above, xSub = ySub = 1 produces a single GeoTIFF file covering all tiles.
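
For instance, the output could be split into a 2 \(\times\) 2 grid of sub-regions, i.e. four GeoTIFF files (a sketch based on the example configuration above):

input_dict = {
    'parse_point_cloud': {},
    'data_split': {'xSub': 2, 'ySub': 2},
    'create_subregion_geotiffs': {'output_handle': 'geotiff'}
}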

Note

The sub-region dimensions should be multiples of the corresponding tile dimensions.

Finally, Laserfarm generates the GeoTIFF file(s) using GDAL (output_handle is employed as file-name handle).

Point Classification

Laserfarm allows the user to classify the points of a point-cloud data set using (multi-)polygons defined in a set of files in shapefile format (.shp). For macro-ecology applications this can be useful, for instance, to classify points as part of water bodies, buildings, vegetation, etc. In this example, the target points in the PLY file tile.ply are classified using the shapefiles provided at a given path:

from laserfarm import Classification

pipeline = Classification(input_file="tile.ply")
input_dict = {
    'locate_shp': {'shp_dir': '/path/to/dir/with/shp/files'},
    'classification': {'ground_type': 1},
    'export_point_cloud' : {}
}
pipeline.config(input_dict)
pipeline.run()

To run the same pipeline using the command-line tool:

laserfarm classification --input_file=tile.ply - locate_shp --shp_dir=/path/to/dir/with/shp/files - classification --ground_type=1 - export_point_cloud

As for all the other pipelines, JSON files can be used to configure the pipeline as well. The first task in the pipeline consists of identifying which of the provided shapefiles are relevant for the given point set (this is determined by checking whether any of the polygons intersects the point-cloud bounding box). Then, the points are classified: for the points falling within the polygons, the feature ground_type is updated to 1 (the feature is added if not already present). Finally, the point-cloud data set is written to disk.

Pipelines with Remote Data

LiDAR-based macro-ecology studies can easily involve several TBs of raw point-cloud data. Such data volumes are difficult to handle on standard local machines. In addition, the data should also be accessible to the infrastructure(s) where the processing takes place (e.g. to all the nodes of a compute cluster). In order to avoid data duplication and to limit the disk-space requirements of the processing unit(s), a remote storage infrastructure can be used to store the raw data and the results of the pipeline calculations. The raw-data re-tiling, point-cloud data-processing and GeoTIFF-writing pipelines implement methods to retrieve input from and push output to storage services using the WebDAV protocol.

The following example shows how the example in Raw Data Re-tiling can be modified to retrieve point_cloud.LAZ from the storage facility with hostname https://webdav.hostname.com (connecting to port 8888) using the specified credentials to log in:

from laserfarm import Retiler

pipeline = Retiler(input_file="point_cloud.LAZ")
webdav_options = {
    'webdav_hostname': 'https://webdav.hostname.com:8888',
    'webdav_login': 'username',
    'webdav_password': 'password'
}
pipeline.set_wdclient(webdav_options)
input_dict = {
    'setup_local_fs': {'tmp_folder': '/path/to/local/tmp/dir'},
    'pullremote': '/remote/path/to/input',
    'set_grid': {
        'min_x': -113107.81,
        'max_x': 398892.19,
        'min_y': 214783.87,
        'max_y': 726783.87,
        'n_tiles_side': 256
    },
    'split_and_redistribute': {},
    'validate': {},
    'pushremote': '/remote/path/to/output',
    'cleanlocalfs': {}
}
pipeline.config(input_dict)
pipeline.run()

Laserfarm will create two directories for input and output as sub-folders of tmp_folder, download the input file point_cloud.LAZ from the path /remote/path/to/input on the WebDAV server to the input folder, perform the re-tiling as described in Raw Data Re-tiling, upload the results from the output folder to the remote path /remote/path/to/output on the WebDAV server and delete the local input and output folders.

It is also possible to set arbitrary paths for the input and output folders:

input_dict = {
    ...
    'setup_local_fs': {
        'input_folder': '/path/to/local/input/folder',
        'output_folder': '/path/to/local/output/folder'
    }
    ...
}

The point-cloud data-processing pipeline and the GeoTIFF-exporting pipeline can be configured to retrieve input files (or directories) from a storage service with WebDAV support in the very same way.
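
For instance, the data-processing pipeline could be configured as follows (a sketch that assumes the same WebDAV-related methods and task names as in the re-tiling example; webdav_options is the dictionary defined above, and the remote paths are placeholders):

from laserfarm import DataProcessing

pipeline = DataProcessing(input="tile.LAZ", tile_index=(0, 0))
pipeline.set_wdclient(webdav_options)
input_dict = {
    'setup_local_fs': {'tmp_folder': '/path/to/local/tmp/dir'},
    'pullremote': '/remote/path/to/retiled/data',
    'load': {},
    ...
    'export_targets': {},
    'pushremote': '/remote/path/to/targets',
    'cleanlocalfs': {}
}
pipeline.config(input_dict)
pipeline.run()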

Macro-Pipelines

For a macro-ecology study where the point-cloud data is stored in multiple LAZ files, the re-tiling of all input files, the feature extraction for all tiles into which the raw data is split, and the generation of GeoTIFFs for all desired features are embarrassingly parallel tasks. The following example shows how the example in Raw Data Re-tiling can be modified to perform the re-tiling of 10 point-cloud files (point_cloud_X.LAZ, where X ranges from 0 to 9) exploiting the parallelization over input files:

from laserfarm import Retiler, MacroPipeline

macro = MacroPipeline()
input_dict = {
    'set_grid': {
        'min_x': -113107.81,
        'max_x': 398892.19,
        'min_y': 214783.87,
        'max_y': 726783.87,
        'n_tiles_side': 256
    },
    'split_and_redistribute': {},
    'validate': {}
}
filenames = ['point_cloud_{}.LAZ'.format(n) for n in range(10)]
macro.tasks = [Retiler(input_file=f, label=f).config(input_dict) for f in filenames]
macro.setup_cluster(mode='local', processes=True, n_workers=2, threads_per_worker=1)
macro.run()
macro.print_outcome(to_file='results.txt')

The parallelization is achieved using Dask, which is employed to deploy the cluster and to distribute the tasks. In the example above, the computing cluster consists of two local processes (two ‘workers’) spawning one thread each (recommended for all pipelines, and required for the feature extraction tasks that involve laserchicken). Each of the workers takes care of the execution of one task at a time until all tasks are completed.

Note

When performing macro-pipeline calculations that include DataProcessing pipelines (see Point-Cloud Data Processing), it is important to include the clear_cache task in the input to prevent the cache from filling up the memory of the workers.
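
A sketch of a DataProcessing configuration including the cache-clearing task (the remaining tasks are elided; as noted in Raw Data Re-tiling, the position of an element in the dictionary does not determine the execution order):

input_dict = {
    'load': {},
    ...
    'export_targets': {},
    'clear_cache': {}
}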

In order to distribute tasks to a cluster deployed over compute nodes using SSH, the script above can be modified in the following way:

...
macro.setup_cluster(mode='ssh',
                    hosts=['172.17.0.1', '172.17.0.1', '172.17.0.2'],
                    connect_options={'known_hosts': None,
                                     'username': 'username',
                                     'client_keys': '.ssh/id_rsa'},
                    worker_options={'nthreads': 1, 'nprocs': 2},
                    scheduler_options={'dashboard_address': '8787'})
...

The first address or hostname in the host list is employed for the scheduler, while all the other addresses/hostnames are used for the workers. The nprocs and nthreads arguments set the number of workers running on each host and the number of threads spawned by each worker, respectively. For further information we refer to the Dask documentation.

Any other deployed Dask cluster can be used to distribute tasks within MacroPipeline if passed as an argument to the setup_cluster method, for instance:

from dask_jobqueue import SLURMCluster
...
cluster = SLURMCluster(...)
macro.setup_cluster(cluster=cluster)
macro.run()

Note

No command line support is provided in Laserfarm for macro-pipeline calculations.

Examples

The GitHub repository of Laserfarm includes a tutorial structured as a Jupyter notebook (tutorial.ipynb). The notebook illustrates how to use Laserfarm to process a subset of the Actueel Hoogtebestand Nederland (AHN3) data set, from the retrieval of an example point-cloud data file in LAZ format to the export of the extracted features to a GeoTIFF file.

A second notebook (workflow.ipynb) shows the workflow employed to process the full AHN3 data set. The notebook illustrates how the re-tiling, point-cloud data-processing and GeoTIFF-exporting tasks have been configured and distributed over the nodes of a compute cluster.

Finally, Python scripts and pipeline configuration files that have been used to test the various pipelines, either on local machines or on a virtual Docker-container-based cluster, can be found here.

Current Limitations

This package has been tested on data provided in a metric-based 2D-projected Cartesian coordinate system. While some of the tools of Laserfarm could be applied to data in an ellipsoidal latitude/longitude coordinate system as well, this has not been tested and it is generally expected to fail.