{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# 5 - Lag classes\nThis tutorial focuses the estimation of lag classes. It is one of the most\nimportant, maybe **the** most important step for estimating variograms.\nUsually, lag class generation, or binning, is not really focused in\ngeostatistical literature. The main reason is, that usually, the same method is\nused. A user-set amount of equidistant lag classes is formed with ``0`` as lower\nbound and ``maxlag`` as upper bound. Maxlag is often set to the median or 60%\npercentile of all pairwise separating distances. \n\nIn SciKit-GStat this is also the default behavior, but only one of dozen of\ndifferent implemented methods. Thus, we want to shed some light onto the other\nmethods here. SciKit-GStat implements methods of two different kinds. The\nfirst kind are the methods, that take a fixed `N`, the number of lag classes,\naccessible through the :func:`Variogram.n_lags <skgstat.Variogram.n_lags>`\nproperty. These methods are ``['even', 'uniform', 'kmeans', 'ward']``.\nThe other kind is often used in histogram estimation and will apply a (simple)\nrule to figure out a suitable `N` themself. Using one of these methods will\noverwrite the :func:`Variogram.n_lags <skgstat.Variogram.n_lags>` property.\nTHese methods are: ``['sturges', 'scott', 'fd', 'sqrt', 'doane']``.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import skgstat as skg\nimport numpy as np\nimport pandas as pd\nfrom imageio import imread\nimport plotly.graph_objects as go\nfrom plotly.subplots import make_subplots\n\nskg.plotting.backend('plotly')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 5.1 Sample data\nLoads a data sample and draws `n_samples` from the field.\nFor sampling the field, random samples from a gamma distribution with a fairly\nhigh scale are drawn, to ensure there are some outliers in the samle. The\nvalues are then re-scaled to the shape of the random field and the values\nare extracted from it.\nYou can use either of the next two cell to work either on the pancake or\nthe Meuse dataset.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "N = 80\npan = skg.data.pancake_field().get('sample')\ncoords, vals = skg.data.pancake(N=80, seed=1312).get('sample')\nfig = make_subplots(1,2,shared_xaxes=True, shared_yaxes=True)\nfig.add_trace(\n    go.Scatter(x=coords[:,0], y=coords[:,1], mode='markers', marker=dict(color=vals,cmin=0, cmax=255), name='samples'),\n    row=1, col=1\n)\nfig.add_trace(go.Heatmap(z=pan, name='field'), row=1, col=2)\nfig.update_layout(width=900, height=450, template='plotly_white')\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-info\"><h4>Note</h4><p>You need to comment the next cell to use the pancake dataset. This cell will\n  will overwrite the ``coords`` and ``vals`` array create in the last cell.</p></div>\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "coords, vals = skg.data.meuse().get('sample')\nvals = vals.flatten()\nfig = go.Figure(go.Scatter(x=coords[:,0], y=coords[:,1], mode='markers', marker=dict(color=vals), name='samples'))\nfig.update_layout(width=450, height=450, template='plotly_white')\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 5.2 Lag class binning - fixed ``N``\nApply different lag class binning methods and visualize their histograms.\nIn this section, the distance matrix between all point pair combinations\n``(NxN)`` is binned using each method. The plots visualize the histrogram of\nthe distance matrix of the variogram, **not** the variogram lag classes\nthemselves.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "N = 15\n\n# use a nugget\nV = skg.Variogram(coords, vals, n_lags=N, use_nugget=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 5.2.1 default :func:`'even' <skgstat.binning.even_width_lags>` lag classes\nThe default binning method will find ``N`` equidistant bins. This is the\ndefault behavior and used in almost all geostatistical publications.\nIt should not be used without a ``maxlag`` (like done in the plot below).\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# apply binning\nbins, _ = skg.binning.even_width_lags(V.distance, N, None)\n\n# get the histogram\ncount, _ = np.histogram(V.distance, bins=bins)\n\nfig = go.Figure(\n    go.Bar(x=bins, y=count),\n    layout=dict(template='plotly_white', title=r\"$\\texttt{'even'}~~binning$\")\n)\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 5.2.2 :func:`'uniform' <skgstat.binning.uniform_count_lags>` lag classes\nThe histogram of the :func:`'uniform' <skgstat.binning.uniform_count_lags>`\nmethod will adjust the lag class widths to have the same sample size for each\nlag class. This can be used, when there must not be any empty lag classes on\nsmall data samples, or comparable sample sizes are desireable for the\nsemi-variance estimator. \n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# apply binning\nbins, _ = skg.binning.uniform_count_lags(V.distance, N, None)\n\n# get the histogram\ncount, _ = np.histogram(V.distance, bins=bins)\n\nfig = go.Figure(\n    go.Bar(x=bins, y=count),\n    layout=dict(template='plotly_white', title=r\"$\\texttt{'uniform'}~~binning$\")\n)\n\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 5.2.3 :func:`'kmeans' <skgstat.binning.kmeans>` lag classes\nThe distance matrix is clustered by a K-Means algorithm.\nThe centroids are used as lag class centers. Each lag class is then formed\nby taking half the distance to each sorted neighboring centroid as a bound. \nThis will most likely result in non-equidistant lag classes.\n\nOne important note about K-Means clustering is, that it is not a\ndeterministic method, as the starting points for clustering are taken randomly.\nThus, the decision was made to seed the random start values. Therefore, the\nK-Means implementation in SciKit-GStat is deterministic and will always\nreturn the same lag classes for the same distance matrix.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# apply binning\nbins, _ = skg.binning.kmeans(V.distance, N, None)\n\n# get the histogram\ncount, _ = np.histogram(V.distance, bins=bins)\n\nfig = go.Figure(\n    go.Bar(x=bins, y=count),\n    layout=dict(template='plotly_white', title=r\"$\\texttt{'K-Means'}~~binning$\")\n)\n\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 5.2.4 :func:`'ward' <skgstat.binning.ward>` lag classes\nThe other clustering algorithm is a hierarchical clustering algorithm.\nThis algorithm groups values together based on their similarity, which is\nexpressed by Ward's criterion. \nAgglomerative algorithms work iteratively and deterministic, as at first\niteration each value forms a cluster on its own. Each cluster is then merged\nwith the most similar other cluster, one at a time, until all clusters are\nmerged, or the clustering is interrupted. \nHere, the clustering is interrupted as soon as the specified number of lag\nclasses is reached. The lag classes are then formed similar to the K-Means\nmethod, either by taking the cluster mean or median as center.\n\nWard's criterion defines the one other cluster as the closest, that results\nin the smallest intra-cluster variance for the merged clusters. \nThe main downside is the processing speed. You will see a significant\ndifference for ``'ward'`` and should not use it on medium and large datasets.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# apply binning\nbins, _ = skg.binning.ward(V.distance, N, None)\n\n# get the histogram\ncount, _ = np.histogram(V.distance, bins=bins)\n\nfig = go.Figure(\n    go.Bar(x=bins, y=count),\n    layout=dict(template='plotly_white', title=r\"$\\texttt{'ward'}~~binning$\")\n)\n\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 5.3 Lag class binning - adjustable ``N``\n\n### 5.3.1 :func:`'sturges' <skgstat.binning.auto_derived_lags>` lag classes\nSturge's rule is well known and pretty straightforward. It's the defaul\n method for histograms in R. The number of equidistant lag classes is defined like:\n\n\\begin{align}n =log_2 (x + 1)\\end{align}\n\nSturge's rule works good for small, normal distributed datasets.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# apply binning\nbins, n = skg.binning.auto_derived_lags(V.distance, 'sturges', None)\n\n# get the histogram\ncount, _ = np.histogram(V.distance, bins=bins)\n\nfig = go.Figure(\n    go.Bar(x=bins, y=count),\n    layout=dict(template='plotly_white', title=r\"$\\texttt{'sturges'}~~binning~~%d~classes$\" % n)\n)\n\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 5.3.2 :func:`'scott' <skgstat.binning.auto_derived_lags>` lag classes\nScott's rule is another quite popular approach to estimate histograms.\nThe rule is defined like:\n\n\\begin{align}h = \\sigma \\frac{24 * \\sqrt{\\pi}}{x}^{\\frac{1}{3}}\\end{align}\n\nOther than Sturge's rule, it will estimate the lag class width from the\nsample size standard deviation. Thus, it is also quite sensitive to outliers. \n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# apply binning\nbins, n = skg.binning.auto_derived_lags(V.distance, 'scott', None)\n\n# get the histogram\ncount, _ = np.histogram(V.distance, bins=bins)\n\nfig = go.Figure(\n    go.Bar(x=bins, y=count),\n    layout=dict(template='plotly_white', title=r\"$\\texttt{'scott'}~~binning~~%d~classes$\" % n)\n)\n\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 5.3.3 :func:`'sqrt' <skgstat.binning.auto_derived_lags>` lag classes\nThe only advantage of this method is its speed. The number of lag classes \nis simply defined like:\n\n\\begin{align}n = \\sqrt{x} $$\\end{align}\n\nThus, it's usually not really a good choice, unless you have a lot of samples.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# apply binning\nbins, n = skg.binning.auto_derived_lags(V.distance, 'sqrt', None)\n\n# get the histogram\ncount, _ = np.histogram(V.distance, bins=bins)\n\nfig = go.Figure(\n    go.Bar(x=bins, y=count),\n    layout=dict(template='plotly_white', title=r\"$\\texttt{'sqrt'}~~binning~~%d~classes$\" % n)\n)\n\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 5.3.4 :func:`'fd' <skgstat.binning.auto_derived_lags>` lag classes\n\nThe Freedman-Diaconis estimator can be used to derive the number of lag\nclasses again from an optimal lag class width like: \n\n\\begin{align}h = 2\\frac{IQR}{x^{1/3}}\\end{align}\n\nAs it is based on the interquartile range (IQR), it is very robust to outlier.\nThat makes it a suitable method to estimate lag classes on non-normal distance\nmatrices. On the other side it usually over-estimates the $n$ for small\ndatasets. Thus it should only be used on medium to small datasets.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# apply binning\nbins, n = skg.binning.auto_derived_lags(V.distance, 'fd', None)\n\n# get the histogram\ncount, _ = np.histogram(V.distance, bins=bins)\n\nfig = go.Figure(\n    go.Bar(x=bins, y=count),\n    layout=dict(template='plotly_white', title=r\"$\\texttt{'fd'}~~binning~~%d~classes$\" % n)\n)\n\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 5.3.5 :func:`'doane' <skgstat.binning.auto_derived_lags>` lag classes\n\nDoane's rule is an extension to Sturge's rule that takes the skewness of the\ndistance matrix into account. It was found to be a very reasonable choice on\nmost datasets where the other estimators didn't yield good results.\n\nIt is defined like:\n\n\\begin{align}\\begin{split}\n        n = 1 + \\log_{2}(s) + \\log_2\\left(1 + \\frac{|g|}{k}\\right) \\\\\n            g = E\\left[\\left(\\frac{x - \\mu_g}{\\sigma}\\right)^3\\right]\\\\\n            k = \\sqrt{\\frac{6(s - 2)}{(s + 1)(s + 3)}}\n    \\end{split}\\end{align}\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# apply binning\nbins, n = skg.binning.auto_derived_lags(V.distance, 'doane', None)\n\n# get the histogram\ncount, _ = np.histogram(V.distance, bins=bins)\n\nfig = go.Figure(\n    go.Bar(x=bins, y=count),\n    layout=dict(template='plotly_white', title=r\"$\\texttt{'doane'}~~binning~~%d~classes$\" % n)\n)\n\nfig\n\n\n# 5.4 Variograms\n# --------------\n# The following section will give an overview on the influence of the chosen\n# binning method on the resulting variogram. All parameters will be the same for\n# all variograms, so any change is due to the lag class binning. The variogram\n# will use a maximum lag of ``200`` to get rid of the very thin last bins at\n# large distances.\n# \n# The ``maxlag`` is very close to the effective range of the variogram, thus you\n# can only see differences in sill. But the variogram fitting is not at the\n# focus of this tutorial. You can also change the parameter and fit a more\n# suitable spatial model\n\n# use a exponential model\nV.set_model('spherical')\n\n# set the maxlag\nV.maxlag = 'median'"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 5.4.1 :func:`'even' <skgstat.binning.even_width_lags>` lag classes\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# set the new binning method\nV.bin_func = 'even'\n\n# plot\nfig = V.plot(show=False)\nprint(f'\"{V._bin_func_name}\" - range: {np.round(V.cof[0], 1)} sill: {np.round(V.cof[1], 1)}')\nfig.update_layout(template='plotly_white')\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 5.4.2 :func:`'uniform' <skgstat.binning.uniform_count_lags>` lag classes\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# set the new binning method\nV.bin_func = 'uniform'\n\n# plot\nfig = V.plot(show=False)\nprint(f'\"{V._bin_func_name}\" - range: {np.round(V.cof[0], 1)} sill: {np.round(V.cof[1], 1)}')\nfig.update_layout(template='plotly_white')\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 5.4.3 :func:`'kmeans' <skgstat.binning.kmeans>` lag classes\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# set the new binning method\nV.bin_func = 'kmeans'\n\n# plot\nfig = V.plot(show=False)\nprint(f'\"{V._bin_func_name}\" - range: {np.round(V.cof[0], 1)} sill: {np.round(V.cof[1], 1)}')\nfig.update_layout(template='plotly_white')\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 5.4.4 :func:`'ward' <skgstat.binning.ward>` lag classes\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# set the new binning method\nV.bin_func = 'ward'\n\n# plot\nfig = V.plot(show=False)\nprint(f'\"{V._bin_func_name}\" - range: {np.round(V.cof[0], 1)} sill: {np.round(V.cof[1], 1)}')\nfig.update_layout(template='plotly_white')\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 5.4.5 :func:`'sturges' <skgstat.binning.auto_derived_lags>` lag classes\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# set the new binning method\nV.bin_func = 'sturges'\n\n# plot\nfig = V.plot(show=False)\nprint(f'\"{V._bin_func_name}\" adjusted {V.n_lags} lag classes - range: {np.round(V.cof[0], 1)} sill: {np.round(V.cof[1], 1)}')\nfig.update_layout(template='plotly_white')\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 5.4.6 :func:`'scott' <skgstat.binning.auto_derived_lags>` lag classes\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# set the new binning method\nV.bin_func = 'scott'\n\n# plot\nfig = V.plot(show=False)\nprint(f'\"{V._bin_func_name}\" adjusted {V.n_lags} lag classes - range: {np.round(V.cof[0], 1)} sill: {np.round(V.cof[1], 1)}')\nfig.update_layout(template='plotly_white')\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 5.4.7 :func:`'fd' <skgstat.binning.auto_derived_lags>` lag classes\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# set the new binning method\nV.bin_func = 'fd'\n\n# plot\nfig = V.plot(show=False)\nprint(f'\"{V._bin_func_name}\" adjusted {V.n_lags} lag classes - range: {np.round(V.cof[0], 1)} sill: {np.round(V.cof[1], 1)}')\nfig.update_layout(template='plotly_white')\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 5.4.8 :func:`'sqrt' <skgstat.binning.auto_derived_lags>` lag classes\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# set the new binning method\nV.bin_func = 'sqrt'\n\n# plot\nfig = V.plot(show=False)\nprint(f'\"{V._bin_func_name}\" adjusted {V.n_lags} lag classes - range: {np.round(V.cof[0], 1)} sill: {np.round(V.cof[1], 1)}')\nfig.update_layout(template='plotly_white')\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 5.4.9 :func:`'doane' <skgstat.binning.auto_derived_lags>` lag classes\n\nIn[23]:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# set the new binning method\nV.bin_func = 'doane'\n\n# plot\nfig = V.plot(show=False)\nprint(f'\"{V._bin_func_name}\" adjusted {V.n_lags} lag classes - range: {np.round(V.cof[0], 1)} sill: {np.round(V.cof[1], 1)}')\nfig.update_layout(template='plotly_white')\nfig"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}