User Manual¶
Laserfarm provides a set of tools to process massive LiDAR point-cloud data sets. In order to tackle volumes of data such as the ones produced by airborne laser scanning, Laserfarm makes use of a divide-et-impera strategy. First, the raw data is split into tiles whose size is manageable by standard computing infrastructures. Then, point-cloud properties (“features”) are extracted from the re-tiled data using the cells of a second user-defined grid as ensembles for the statistical averaging. The mesh size of this second finer grid ultimately defines the resolution (i.e. the pixel size) of the raster data. Finally, the properties computed for each sub-cell are exported in a raster format and all tile contributions merged to form a single output GeoTIFF file, ideally covering the same area as the initial raw data.
The full processing pipeline can be thus subdivided into three pipeline steps: the raw data re-tiling, the point-cloud
data processing and the raster-data merging and export. Each of these tasks can be run in a Python script that imports
the dedicated Laserfarm module or through the command line tool laserfarm
. The following sections
describes the main features of each of these pipelines and their general use. User interested in more advanced features
and in how these tasks can be implemented for large-scale macro-ecology calculations can have a look at the Jupyter
notebooks provided (see Examples).
Raw Data Re-tiling¶
Point-cloud data from airborne laser scanning are typically provided as compressed files in LAZ format. The following
script illustrates a minimal example in which a data set provided as a LAZ file (point_cloud.LAZ
) is re-tiled
according to a regular grid:
from laserfarm import Retiler
pipeline = Retiler(input_file="point_cloud.LAZ")
input_dict = {
'set_grid': {
'min_x': -113107.81,
'max_x': 398892.19,
'min_y': 214783.87,
'max_y': 726783.87,
'n_tiles_side': 256
},
'split_and_redistribute': {},
'validate': {}
}
pipeline.config(input_dict)
pipeline.run()
The re-tiling pipeline is configured using a dictionary, whose keys identify the pipeline sub-tasks that need to be run and the values the corresponding arguments.
Note
The input dictionary elements can be provided in any order. The order in which the tasks are executed is defined in
a dedicated list (the pipeline
attribute of the Retiler
class), which can be inspected using the
command-line tool: laserfarm retiling pipeline
.
Alternatively, if the input dictionary is serialized to JSON format and stored in a file, the pipeline can be directly configured in the following way:
pipeline.config(from_file="retiling_config.json")
The example above entails three sub-tasks: setting up the grid, running the splitting and validating the result.
In the first sub-task, the grid is defined using its bounding box and the number of tiles in each direction (a square
re-tiling scheme is assumed). In the following sub-task (which takes no input argument) the splitting is carried out
and the sub-sets are saved in LAZ format. The output is organized in subdirectories that are labeled with two integer
numbers corresponding to the X
and Y
indices in the chosen tiling scheme (each index run from 0
to
n_tiles_side - 1
). Note that the PDAL library is employed for the splitting, thus taking advantage of a low-level
(C++) implementation. Finally, the result of the splitting is validated by checking that the generated LAZ files contain
the same number of points as the parent file.
The same calculation can be run using the command-line tool in the following way:
laserfarm retiling --input_file=point_cloud.LAZ - set_grid --min_x=-113107.81 --max_x=398892.19 --min_y=214783.87 --max_y=726783.87 --n_tiles_side=256 - split_and_redistribute - validate
Note that the various tasks with the corresponding arguments are chained using hyphens. If the tasks and arguments are provided as a JSON configuration file, the pipeline can be executed in the following way:
laserfarm retiling --input_file=point_cloud.LAZ - config --from_file=retiling_config.json - run
Point-Cloud Data Processing¶
Once the raw data is split into tiles whose volume can be handled by the infrastructure available to the user,
point-cloud-based properties can be extracted. Laserfarm implements a wrapper to laserchicken, which is the
engine employed to parse and process point-cloud data. The following example Python script processes a LAZ file that
contains the point-cloud subset corresponding to the (X=0, Y=0)
tile in the chosen tiling scheme:
from laserfarm import DataProcessing
pipeline = DataProcessing(input="tile.LAZ", tile_index=(0, 0))
input_dict = {
'load': {},
'normalize': {'cell_size': 1},
'generate_targets': {
'min_x': -113107.81,
'max_x': 398892.19,
'min_y': 214783.87,
'max_y': 726783.87,
'n_tiles_side': 256,
'tile_mesh_size' : 10.,
'validate' : True,
},
'extract_features': {
'volume_type': 'cell',
'volume_size': 10.,
'feature_names': ['point_density']
},
'export_targets': {}
}
pipeline.config(input_dict)
pipeline.run()
Also here a dictionary is employed to configure the pipeline (a JSON file could be used exactly as in Raw Data Re-tiling).
The command-line tool can also be used to run the data processing pipeline (the data_processing
command is issued
here):
laserfarm data_processing --input=tile.LAZ --tile_index=[0,0] - load - generate_targets --min_x=-113107.81 --max_x=398892.19 --min_y=214783.87 --max_y=726783.87 --n_tiles_side=256 --tile_mesh_size=10. --validate - extract_features --feature_names=[point_density] - export_targets
or, if the configuration dictionary is serialized in the data_processing.json
file:
laserfarm data_processing --input=tile.LAZ --tile_index=[0,0] - config --from_file=data_processing.json - run
The full (ordered) list of tasks that can be executed within the data processing pipeline can be inspected from the command line:
laserfarm data_processing pipeline
The example pipeline above entails five steps. First, the point-cloud data is loaded into memory. Note that the input
path provided can point to either a file or a directory, in which case all files in a point-cloud format that is known
to laserchicken
are considered. In order to reduce the memory requirements, one can load only the attributes that are
necessary for further data processing from the input LAZ file(s). These attributes can be provided using the optional
argument attributes
of the DataProcessing
’s load
method:
input_dict = {
...
'load': {'attributes': ['intensity', 'gps_time']}
...
}
If no attribute other than the (X, Y, Z) coordinates of the points is required, one can assign attributes
with an
empty list.
The second step of the pipeline consists in the point-cloud heights’ normalization, which is required for the extraction
of some of the features (see the laserchicken
manual. Square cells are employed for this purpose, and the length
of the cell sides (in meters) is set with the cell_size
argument.
In order to extract statistical properties from the data, the point cloud must be subdivided into partitions that
represent the ensembles over which the properties are calculated. Such partitions (the ‘neighborhoods’) can be defined
using contiguous square cells, and the properties computed over each neighborhood assigned to the cells’ centroids
(see also the laserchicken
manual). For a given tile the full set of centroids, i.e. the target points, is
generated by the generate_targets
method, which requires information about the tiling scheme and the desired mesh
size of the target grid (tile_mesh_size
, in meters). Note that tile_mesh_size
ultimately sets the desired
resolution of the raster maps, since it corresponds to the pixel size in the final GeoTIFFs. If validate
is set to
true, the points belonging to the input point cloud are checked to lie within the boundaries of the tile for which
target points are generated (recommended).
Once the target point set is generated, the desired properties of the input point cloud can be computed. The example
above will calculate a single feature, i.e. point_density
, but multiple features can be extracted in a single run.
volume_type
and volume_size
define the neighborhoods employed for the extraction of properties: by assigning
them with cell
and the value employed for tile_mesh_size
, respectively, the neighborhoods are defined as the
cells the centroids of which are the generated target points.
Statistical properties can be computed over a subset of points in each neighborhoods (for instance, to mimic data
sets with lower point densities). This is achieved by specifying the sample_size
argument to the extract_features
method, which defines the number of randomly-selected points considered in each cell (all points are considered for
cells that include \(N\leq\) sample_size
points).
Finally, the target points and the associated properties are written to disk. By default, the polygon (PLY) format
is employed, with one output file including all extracted features. However, single-feature files can also be exported
by setting the multi_band_files
argument to false.
Additional steps that can be optionally included in the data-processing pipeline allows the user to generate
parametrized features using the extractors available in laserchicken
(see the manual) and to select a subset of
the input point cloud for the feature extraction. Specific information on the required arguments can be obtained from
the corresponding command line helpers:
laserfarm data_processing add_custom_feature --help
and:
laserfarm data_processing apply_filter --help
Note
laserchicken
computes and caches the k-d tree of the point cloud in order to efficiently querying point-cloud
points in the filter (with polygons), normalization and feature extraction tasks. The cache can be cleared using the
clear_cache
method in the point-cloud data processing pipeline, e.g. by setting:
input_dict = {
...
'clear_cache: {},
...
}
GeoTIFF Export¶
In the last step of the full processing pipeline the properties extracted from the raw input point cloud in a tile-wise
fashion are tiled back together and exported as raster maps. The following example illustrates how to generate
a single-band GeoTIFF file for the point_density
feature from a set of PLY files containing the target points for
all the tiles in which an initial LAZ file has been split:
from laserfarm import GeotiffWriter
pipeline = GeotiffWriter(input_dir="/path/to/PLY/files", bands='point_density')
input_dict = {
'parse_point_cloud': {},
'data_split': {'xSub': 1, 'ySub': 1},
'create_subregion_geotiffs': {'output_handle': 'geotiff'}
}
pipeline.config(input_dict)
pipeline.run()
Similarly to the re-tiling and point-cloud data-processing pipelines, the config
and run
methods are employed
to configure and run the pipeline, respectively. The same pipeline can be run via the command line as:
laserfarm geotiff_writer --input_dir=/path/to/PLY/files --bands=point_density - parse_point_cloud - data_split --xSub=1 --ySub=1 - create_subregion_geotiffs --output_handle=geotiff
As for the other pipelines, JSON files can be used to configure the pipeline as well. This example pipeline entails the following steps. First, the list of PLY files to be parsed is constructed and a representative file is parsed in order to obtain information on the number or target points per tile and the spacing between target points.
Note
All tiles are assumed to be square and to include the same number of target points with the same target mesh size.
For data sets with large lateral extend or very large resolution (i.e. very fine target meshes), a single GeoTIFF file
could be difficult to handle with standard GIS tools. It is thus possible to partition the area covered by the tiles
into (xSub
\(\times\) ySub
) sub-regions and to generate a GeoTIFF for each of the sub-regions. In the example above,
xSub = ySub = 1
sets a single GeoTIFF file to cover all tiles.
Note
The sub-region dimensions should be multiple of the corresponding tile dimensions.
Finally, Laserfarm generates the GeoTIFF file(s) using GDAL (output_handle
is employed as file-name
handle).
Point Classification¶
Laserfarm allows to classify the points belonging to a point-cloud data set using (multi-)polygons defined in
a set of files in shapefile format (.shp
). For macro-ecology applications, this can be useful, for instance, to
classify points as part of water-bodies, buildings, vegetation, etc. In this example, the target points in
the PLY file tile.ply
are classified using the shapefiles provided at a given path:
from laserfarm import Classification
pipeline = Classification(input_file="tile.ply")
input_dict = {
'locate_shp': {'shp_dir': '/path/to/dir/with/shp/files'},
'classification': {'ground_type': 1},
'export_point_cloud' : {}
}
pipeline.config(input_dict)
pipeline.run()
To run the same pipeline using the command-line tool:
laserfarm classification --input_file=tile.ply - locate_shp --shp_dir=/path/to/dir/with/shp/files - classification --ground_type=1 - export_point_cloud
As for all the other pipelines, JSON files can be used to configure the pipeline as well.
The first task in the pipeline consists in identifying which among all shapefiles provided are relevant for the given
point-set (this is determined by checking whether any of the polygons intersect the point-cloud bounding box). Then,
the points are classified: for the points falling within the polygons, the feature ground_type
is updated to 1
(the feature is added if not already present). Finally, the point-cloud data set is written to disk.
Pipelines with Remote Data¶
LiDAR-based macro-ecology studies could easily involve several TBs of raw point-cloud data. These data volumes are difficult to handle on standard local machines. In addition, the data should also be accessible to the infrastructure(s) where the processing takes place (e.g. to all the nodes of a compute cluster). In order to avoid data duplication and to limit the disk-space requirement of the processing unit(s), a remote storage infrastructure can be used to dump the raw data and the result of the pipeline calculations. The raw-data re-tiling, point-cloud data-processing and GeoTIFF-writing pipelines implement methods to retrieve input and drop output to storage services using the WebDAV protocol.
The following example shows how the example in Raw Data Re-tiling can be modified to retrieve point_cloud.LAZ
from the
storage facility with hostname https://webdav.hostname.com
(connecting to port 8888
) using the specified
credentials to log in:
from laserfarm import Retiler
pipeline = Retiler(input_file="point_cloud.LAZ")
webdav_options = {
'webdav_hostname': 'https://webdav.hostname.com:8888',
'webdav_login': 'username',
'webdav_password': 'password'
}
pipeline.set_wdclient(webdav_options)
input_dict = {
'setup_local_fs': {'tmp_folder': '/path/to/local/tmp/dir'},
'pullremote': '/remote/path/to/input',
'set_grid': {
'min_x': -113107.81,
'max_x': 398892.19,
'min_y': 214783.87,
'max_y': 726783.87,
'n_tiles_side': 256
},
'split_and_redistribute': {},
'validate': {},
'pushremote': '/remote/path/to/output',
'cleanlocalfs': {}
}
pipeline.config(input_dict)
pipeline.run()
Laserfarm will create two directories for input and output as sub-folders of tmp_folder
, download the
input file point_cloud.LAZ
from the path /remote/path/to/input
on the WebDAV server to the input folder,
perform the re-tiling as described in Raw Data Re-tiling, upload the results from the output folder to the remote path
/remote/path/to/output
on the WebDAV server and delete the local input and output folders.
It is also possible to set arbitrary paths for the input and output folders:
input_dict = {
...
'setup_local_fs': {
'input_folder': '/path/to/local/input/folder',
'output_folder': '/path/to/local/output/folder'
}
...
}
The point-cloud data-processing pipeline and the GeoTIFF-exporting pipeline can be configured to retrieve input files (or directories) from a storage service with WebDAV support in the very same way.
Macro-Pipelines¶
For a macro-ecology study where the point-cloud data is stored in multiple LAZ files, the re-tiling of all input files,
the feature extraction for all tiles in which the raw data is split, and the generation of GeoTIFFs for all desired
features are embarrassingly parallel tasks. The following example shows how the example in Raw Data Re-tiling can be
modified to perform the re-tiling of 10 point-cloud files (point_cloud_X.LAZ
, where X
ranges from 0 to 9)
exploiting the parallelization over input files:
from laserfarm import Retiler, MacroPipeline
macro = MacroPipeline()
input_dict = {
'set_grid': {
'min_x': -113107.81,
'max_x': 398892.19,
'min_y': 214783.87,
'max_y': 726783.87,
'n_tiles_side': 256
},
'split_and_redistribute': {},
'validate': {}
}
filenames = ['point_cloud_{}.LAZ'.format(n) for n in range(10)]
macro.tasks = [Retiler(input_file=f, label=f).config(input_dict) for f in filenames]
macro.setup_cluster(mode='local', processes=True, n_workers=2, threads_per_worker=1)
macro.run()
macro.print_outcome(to_file='results.txt')
The parallelization is achieved using Dask, which is employed to deploy the cluster and to distribute the tasks. In
the example above, the computing cluster consists of two local processes (two ‘workers’) spawning one thread each
(recommended for all pipelines, and required for the feature extraction tasks that involve laserchicken
). Each of
the workers takes care of the execution of one task at a time until all tasks are completed.
Note
When performing macro-pipeline calculations including DataProcessing
pipelines (see Point-Cloud Data Processing), it
is important to include the clear_cache
task in the input to avoid the cache to fill up the memory of the
workers.
In order to distribute tasks to a cluster deployed over compute nodes using SSH, the script above can be modified in the following way:
...
macro.setup_cluster(mode='ssh',
hosts=['172.17.0.1', '172.17.0.1', '172.17.0.2'],
connect_options={'known_hosts': None,
'username': 'username',
'client_keys': '.ssh/id_rsa'}
worker_options={'nthreads': 1, 'nprocs': 2}
scheduler_options={'dashboard_address': '8787'})
...
The first address or hostname in the host list is employed for the scheduler, all the other addresses/hostnames are
used for the workers. The nprocs
and nthreads
arguments set the number of workers running on each host and the
number of threads spawned by each worker, respectively. For further information we refer to the Dask documentation.
Any other deployed Dask cluster can be used to distribute tasks within MacroPipeline
if passed as an argument to the
setup_cluster
method, for instance:
from dask_jobqueue import SLURMCluster
...
cluster = SLURMCluster(...)
macro.setup_cluster(cluster=cluster)
macro.run()
Note
No command line support is provided in Laserfarm for macro-pipeline calculations.
Examples¶
The GitHub repository of Laserfarm includes a tutorial structured as a Jupyter notebook
(tutorial.ipynb
). The notebook illustrates how to use Laserfarm to process a subset of the
Actueel Hoogtebestand Nederland (AHN3) data set, from the retrieval of an example point-cloud data file in LAZ
format to the export of the extracted features to a GeoTIFF file.
A second notebook (workflow.ipynb
) shows the workflow employed to process the full AHN3 data set. The
notebook illustrates how the re-tiling, point-cloud data-processing and GeoTIFF-exporting tasks have been configured
and distributed over the nodes of a compute cluster.
Finally, Python scripts and pipeline configuration files that have been used to test the various pipelines either on local machines or on a virtual docker-container-based cluster can be found here.
Current Limitations¶
This package has been tested on data provided in a metric-based 2D-projected Cartesian coordinate system. While some of the tools of Laserfarm could be applied to data in an ellipsoidal latitude/longitude coordinate system as well, this has not been tested and it is generally expected to fail.