Introduction

General

This part of scikit-gstat's documentation is meant to be a user guide to the functionality offered by the module, along with a more general introduction to geostatistical concepts. The main use case is to hand this description to students learning geostatistics whenever scikit-gstat is used. But before introducing variograms, the more general question of what geostatistics actually is has to be answered.

Note

This user guide is meant to be an introduction to geostatistics. In case you are already familiar with the topic, you can skip this section.

What is geostatistics?

The basic idea of geostatistics is to describe and estimate spatial covariance, or correlation, in a set of point data. While the main tool, the semi-variogram, is quite easy to implement and use, it rests on a number of important assumptions. The typical application of geostatistics is interpolation. Therefore, although point data is used, a basic concept is to understand this point data as a sample of a (spatially) continuous variable that can be described as a random field \(rf\), or, to be more precise, in many cases a Gaussian random field. The most fundamental assumption in geostatistics is that any two values \(x_i\) and \(x_{i + h}\) are more similar the smaller \(h\) is, where \(h\) is the separating distance on the random field. In other words: close observation points will show higher covariances than distant points. If this fundamental conceptual assumption does not hold for a specific variable, geostatistics is not the correct tool to analyse and interpolate this variable.

One of the easiest approaches to interpolating point data is IDW (inverse distance weighting). This technique is implemented in almost any GIS software. The fundamental conceptual model can be described as:

\[Z_u = \frac{\sum_{i}^{N} w_i \cdot Z(i)}{\sum_{i}^{N} w_i}\]

where \(Z_u\) is the value of \(rf\) at a non-observed location with \(N\) observations around it. Each observation \(Z(i)\) is weighted by a weight \(w_i\), which can be calculated as:

\[w_i = \frac{1}{||\overrightarrow{ux_i}||}\]

where \(u\) is the unobserved point and \(x_i\) is one of the sample points. Thus, \(||\overrightarrow{ux_i}||\) is the 2-norm of the vector between the two points: the Euclidean distance in the coordinate space (which by no means has to be limited to the \(\mathbb{R}^2\) case).

This describes a concept in which a value of the random field is estimated by a distance-weighted mean of the surrounding points. As close points should have a higher impact, the inverse distance is used as the weight, hence the name inverse distance weighting.
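The following is a minimal sketch of IDW with NumPy, not code taken from scikit-gstat. The function name `idw_estimate` and the toy coordinates and values are made up for illustration. It normalizes by the sum of the weights, which is the usual convention for a weighted mean, and it returns the observed value directly when the target coincides with a sample point, which is one possible way to avoid a division by zero.

```python
import numpy as np


def idw_estimate(coords, values, target):
    """Estimate the value at `target` as an inverse-distance-weighted mean.

    Hypothetical helper for illustration; not part of scikit-gstat.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    target = np.asarray(target, dtype=float)

    # Euclidean distance (2-norm) between the target and every sample point
    dist = np.linalg.norm(coords - target, axis=1)

    # A target that coincides with a sample point would cause a division
    # by zero, so the observed value is returned directly in that case
    hit = dist == 0
    if hit.any():
        return float(values[hit][0])

    weights = 1.0 / dist
    # Normalize by the sum of the weights to obtain a weighted mean
    return float(np.sum(weights * values) / np.sum(weights))


# Toy data: three observations in a two-dimensional coordinate space
coords = [[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]]
values = [10.0, 20.0, 30.0]
estimate = idw_estimate(coords, values, target=[1.0, 1.0])
print(estimate)
```

Because the weights are normalized, the estimate always lies between the smallest and the largest observed value. The function works for coordinates of any dimension, as the distance is computed along the last axis.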

In geostatistics this basic model still holds, but it is extended. Instead of making the weights depend exclusively on the separating distance, each weight is derived from a variance over all values that are separated by a similar distance. This has the main advantage of incorporating the actual (co)variance found in the observations and basing the interpolation on this (co)variance, but it comes at the cost of some strict assumptions about the statistical properties of the sample. Elaborating and assessing these assumptions is one of the main challenges of geostatistics.
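To make "a variance over all values that are separated by a similar distance" more concrete, the sketch below groups all point pairs into distance classes and computes half the mean squared difference per class. This is the classical Matheron estimator of the semi-variance, chosen here as an assumption. The function name `empirical_semivariance`, the bin edges and the toy data are hypothetical, and the code is not scikit-gstat's own implementation.

```python
import numpy as np


def empirical_semivariance(coords, values, bin_edges):
    """Semi-variance per distance class (Matheron estimator).

    Hypothetical helper for illustration; not part of scikit-gstat.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    bin_edges = np.asarray(bin_edges, dtype=float)

    # Index pairs (i, j) with i < j, so every point pair is used once
    i, j = np.triu_indices(len(values), k=1)
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sq_diff = (values[i] - values[j]) ** 2

    gamma = np.full(len(bin_edges) - 1, np.nan)
    for k in range(len(bin_edges) - 1):
        # Pairs whose separating distance falls into the k-th class
        in_bin = (dist > bin_edges[k]) & (dist <= bin_edges[k + 1])
        if in_bin.any():
            # Half the mean squared difference of all pairs in the class
            gamma[k] = 0.5 * sq_diff[in_bin].mean()
    return gamma


# Toy data: four points on a line with linearly increasing values
coords = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
values = [0.0, 1.0, 2.0, 3.0]
gamma = empirical_semivariance(coords, values, bin_edges=[0.0, 1.5, 2.5, 3.5])
print(gamma)  # semi-variances 0.5, 2.0 and 4.5 for the lags 1, 2 and 3
```

In this toy example the semi-variance grows with the separating distance, which is the behaviour expected when close points are more similar than distant ones. Classes that contain no point pair are left as NaN.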

Geostatistical Tools

Geostatistics is a wide field that spans a variety of disciplines, like geology, biology, hydrology or geomorphology. Each discipline defines its own set of tools, and apparently its own definitions, and progress is still being made today. It is not the objective of scikit-gstat to be a comprehensive collection of all available tools. Rather, the objective is to offer some common and some more sophisticated tools for variogram analysis. Thus, when using scikit-gstat, you typically need another library for the actual application, like interpolation. In most cases that will be gstools. Nevertheless, geostatistics can be split into three main fields, each with its own tools:

  • variography: with the variogram as its main tool, variography focuses on describing, visualizing and modelling covariance structures in space and time.

  • kriging: a family of interpolation methods that utilize a variogram to estimate the kriging weights, as sketched above.

  • geostatistical simulation: aims at generating random fields that fit a given set of observations or a pre-defined variogram or covariance function.

Note

I am not planning to implement tools from all three fields. Instead, you can use one of the interfaces, like Variogram.to_gstools, to export a variogram to another library that covers kriging and spatial random field generation in great detail.

How to use this guide

The main idea behind this user guide is to introduce geostatistics using SciKit-GStat as an example. The module has a growing collection of data examples that are used throughout the documentation. They can be loaded from the data submodule. Each function returns a dictionary containing the actual sample ('sample') and a brief description of its origin ('origin').

Note

Every included data sample has an origin and an owner. While they are all distributed under open licenses, you have to check the description for data ownership, as all of the licenses used require you to attribute the owner.

In [1]: import skgstat as skg

In [2]: skg.data.aniso(N=20)
Out[2]: 
{'sample': (array([[475, 386],
         [365, 358],
         [365, 100],
         [397, 487],
         [436, 419],
         [280, 380],
         [ 28,  64],
         [311,  44],
         [131, 327],
         [305, 256],
         [247, 216],
         [485,  42],
         [ 96, 225],
         [137, 429],
         [ 43,  47],
         [432, 367],
         [ 12, 393],
         [212, 219],
         [333, 348],
         [115, 263]]),
  array([168, 163, 148, 156, 161, 176, 125,  98, 198, 182, 172, 132,  99,
         212, 146, 169, 177, 204, 146,  68], dtype=uint8)),
 'origin': 'Random field greyscale image with geometric anisotropy.\n    The anisotropy in North-East direction has a factor of 3. The random\n    field was generated using gstools.\n    Copyright Mirko Mälicke, 2020. If you use this data,\n    cite SciKit-GStat: https://doi.org/10.5281/zenodo.1345584\n    '}

Each sample is a tuple of a coordinate array and a value array.
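Given the dictionary layout shown in the output above, the two arrays can be unpacked as sketched below. To keep the sketch independent of an installed scikit-gstat, the dictionary is a hand-copied subset (the first three points and a shortened origin text) of that output. The variable names are arbitrary.

```python
import numpy as np

# Hand-copied subset of the output shown above (first three points only);
# with scikit-gstat installed, skg.data.aniso(N=20) returns the full dictionary
data = {
    "sample": (
        np.array([[475, 386], [365, 358], [365, 100]]),
        np.array([168, 163, 148], dtype=np.uint8),
    ),
    "origin": "Random field greyscale image with geometric anisotropy.",
}

# The 'sample' entry is a tuple: (coordinates, values)
coordinates, values = data["sample"]

print(coordinates.shape)  # one row per point, one column per dimension
print(values.shape)       # one value per point
print(data["origin"])     # description, including ownership information
```

The coordinate array has one row per observation, so its first dimension always matches the length of the value array.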