# A Fair Distance Function

When developing democratic proposition development and decision making software, the processing of user-generated geospatial data poses certain questions when it comes to the ordering of search results by location-based relevance. One aspect of location-based relevance is the distance of a geographical object from/to a given point (e.g. a search center). While we usually mean the euclidean or spheroidal surface distance [Note: The spheroidal surface distance is the shortest possible distance on the reference spheroid used to model Earth.] when we speak of “distances”, other mathematical distance functions are thinkable and useful, as will be shown in this article. Beside proposing a particular distance function, we will also consider practical aspects of database indexing when using such a distance function.

## Mathematic preliminaries

We will use the term “distance function” for a mathematical function f that maps two geographical objects G and S (each being a subset of Earth's surface E) to a non-negative real number f(G, S) ∈ ℝ+0. Within this article, however, we will only consider those functions where the second argument S is a singular point (which we will refer to as the “search center”), while the first argument G may be any other geographical shape (e.g. a point, path, or polygon). Therefore, the functions considered within this article are not “metrics” because a metric maps two objects of the same set to a non-negative real number. Furthermore, we will allow a distance function to return zero even if the two objects are not equal; e.g. the distance between a path and a point on that path may be zero even if the point and the path are two distinct (non-equal) objects.

A distance of zero will denote the highest possible relevance for an object G in regard to a certain search center S; higher numbers will denote less relevant matches.

Note that even when we write G ⊆ E, we only consider those G which are finite unions of basic geographical objects, and not, for example, fractals or non-measurable sets.

## Nearest-neighbor searches and distance functions

Distance functions are a necessity for database queries in the form of “Show me the object(s) which are closest to my location”. While the distance function is the only mandatory prerequisite for such a nearest-neighbor search (and totally sufficient for a linear search), a fast indexing system requires further support functions to speed up nearest-neighbor searches. A working set of such support functions can be found at the GiST framework that is used by PostgreSQL. [GiST] At first sight, we will focus on the distance function though, keeping in mind that a final solution will require further considerations in order to implement fast indexing techniques which are necessary for scalable applications.

## The trivial approach

The easiest function to be used for nearest-neighbor searches is to determine the shortest possible spheroidal surface distance d (or euclidean distance in case of flat maps) between a search center S and the geographical object G.

A geographical object ‘G’ of a random shape next to a point ‘S’. The shortest distance between ‘S’ and ‘G’ is ‘d’.
Figure 1: The shortest way (euclidean distance) from a geographical object ‘G’ to a search center ‘S’

## Challenge I: Object size

While, at a first glance, the trivial approach of simply determining the shortest possible distance might seem to be straight forward and fair, a closer look reveals a serious problem: objects that cover large areas have a higher chance to return a shorter distance (or even zero if they cover the search center). If these objects are part of user-generated content (e.g. created by an initiator of a LiquidFeedback initiative), then users might be tempted to create intentionally oversized geometric objects in order to optimize search results for their content. Honest users aiming to select exactly those regions or locations that are really affected by their initiatives would be at disadvantage to users entering oversized and thus incorrect geometries. Even if it is still possible that some other users penalize such attempts by giving respective bad ratings to proposals with wrong or slightly wrong geospatial metadata, it would still be possible that the overall data quality is compromised due to tactical considerations of certain users. As shown in the section “Requirements for a fair distance function” below, we can compensate the advantage of big objects in a proper way.

Two geographical objects ‘G_small’ and ‘G_big’, of which the first is smaller than the second one and contained in the second one. The shortest distance from three points ‘S_1’, ‘S_2’, and ‘S_3’ outside those two objects to ‘G_small’ is always greater than the shortest distance of these three points to ‘G_big’.
Figure 2: Big objects having advantages in comparison to small objects

## Challenge II: Relevance

Following the demand of a democratically driven proposition development process that is moderated by the users (i.e. collectively moderated), location-based relevance will not only depend on geographical properties but also on the ratings of the users (i.e. voters). Depending on the kind of search, we would also need to take the voter or supporter situation (e.g. “likes”) into account when sorting data by (location-based) relevance. One method to include user ratings for the location-based relevance of objects would be to simply divide the geographical distance by the number of votes such that an object that is 90 times as far away but has 100 times more votes than another object will appear first in the list of entries based on location-based relevance. While this may affect the efficiency of index lookups inside a database (and thus require further consideration, see section “Weighted nearest-neighbor searches” below), it is otherwise easy to convert any distance function to a function which will take the weight of an object into account (e.g. by including a simple division as explained before).

It should be noted that in many cases, the number of votes isn't suitable to be directly taken into consideration as weight. This is because clone-proofness is an important property of proposition development and decision making systems. [PLF, section 4.11] In case of competing proposals with geographical metadata, the “Harmonic Weighting” algorithm (compare [PLF, subsection 4.10.1]) might be a more suitable approach.

In the context of multi-winner elections, the counting scheme used by “Harmonic Weighting” has also been described by Thiele in 1895. [Janson] [Skowron] For the sake of weighting geographical objects, however, we require more than a selection of candidates (like in multi-winner voting systems) and even more than a ranking of all candidates (which is a natural by-product of sequentially working multi-winner voting systems): we require the system to return a weight (i.e. a real number) instead of just a rank for each candidate. See [PLF, Appendix B] for an example (column “Harmonic Weight” in table on [PLF, p.177]).

## Requirements for a fair distance function

Our aim is that increasing the size of a geographical object doesn't give a general advantage to appear earlier in nearest-neighbor searches. It should be noted that the “size” in this context doesn't necessarily mean “area” or “length”. For example, neither a huge polygon nor a set of thousand points or a long line (note that the latter two have an area of zero) should gain an advantage over a single point. When we speak of “general advantage”, we must define this in mathematical terms since changing the size or location of an object can always optimize the distance in regard to a particular search center. With “general advantage” we rather mean that the distance f(G, S) (with G being a geographical object and S being a search center point) can't be optimized in regard to all possible search centers, or by demanding that the average result of the distance function f (or the squared distance function, see Figure 3) over all possible search points is constant.

For all geographical objects G in { x subset of or equal to E | x not empty }: Surface integral (S in E) [ (f(G, S))^2 ] dA = const, where E is the set of all points on Earth's surface and dA is the infinitesimal surface area.
Figure 3: Constancy of average of square of distance

The requirement in Figure 3, however, is not sufficient for a fair distance function because f(G, S) is not bounded: raising f(G, S) to a very high value for some search centers would allow lowering f(G, S) for many other search centers.

A stricter requirement would be that the area for which the distance function yields values equal to or less than a limit L is independent of G and must only depend on L (e.g. the area must be equal to π · L2 as demanded by the formula given in Figure 4).

Surface integral (S in {x in E | f(G, x) <= L}) [ 1 ] dA = pi * L^2
Figure 4: Strict requirement for the area where the distance function yields values smaller than a certain limit

This stricter requirement is automatically fulfilled for all L ∈ ℝ+0 if we only consider singular points as geographical objects on a flat map and if f is the euclidean distance function between the two points G and S, because the area around a point G where the distance to that point is less than a limit L is exactly π · L2. [Note: The surface area of a circle with radius r is π · r2.] If we are able to fulfill this requirement also for other geographical objects G which are not singular points (e.g. polygons of any size or multiple points), no general advantage could be gained by adding locations or areas to G, because the size of the area, where a search center S could be located to return an f(G, S) lower than a certain value, would be independent of G.

Unfortunately the requirement in Figure 4 is generally infeasible for geographical objects that have an area greater than zero because it would conflict with treating all points inside the object's area equally. We can still demand the requirement depicted in Figure 4 to be true for all L ∈ ℝ+0 when the geographical object G has an area of zero (i.e. only consist of points and paths). For all other objects, we propose to violate this criterion. Notwithstanding, our proposal will still fulfill the requirement in the limiting case where small geographical objects (in terms of covered area) in relation to their distance from the search point are being considered.

Let |G| be the size of the area covered by G (e.g. |G| = 0 if G only consists of singular points), then there exists a factor c_0 in R+ such that for all geographical objects G subset of or equal to E and all all L >= c_0 * |G|: Surface integral (S in {x in E | f(G, x) <= L}) [ 1 ] dA = pi * L^2.
Figure 5: Fulfillment of formula from Figure 4 for all ‘L’ being large enough

## Proposal for a fair distance function

Considering what has been discussed above, we propose the algorithm described in Figure 6 to serve as a “fair distance function” which is parameterized with a search center point S and a geographical object G on the spheroid. An example of its application is shown in Figure 7.

Let ‘c_1’ and ‘c_2’ be two constants, where 1/2 < c_1 <= 1 and c_2 = (c_1 ^ 2) / (2 * c_1 - 1). 1. Calculate the surface area ‘|G|’ of the geographical object ‘G’ (zero if ‘G’ only consists of points and paths but not, for example, filled polygons). 2. Calculate the shortest spheroidal surface distance ‘d’ between the geographical object ‘G’ and the search center point ‘S’ (zero if ‘S’ is located inside or touching ‘G’). 3. Calculate the surface area ‘|G_extra|’ covered by all points on the spheroid which are not in ‘G’ but whose shortest spheroidal surface distance to ‘G’ is less than ‘d’. 4. Let R = min ( c_1 * |G| + c_2 * |G_extra| , A + E ). 5. The result of the fair distance function is: f(G, S) = sqrt ( R / pi ).
Figure 6: Proposed algorithm
4 black geographical objects marked as ‘G’ are surrounded by a gray margin that follows the shape of the four objects and is just thick enough to touch a point ‘S’ that is outside of ‘G’. f(G, S) = sqrt ( R / pi ). R = min ( c_1 * |G| + c_2 * |G_extra| , A + E ).
Figure 7: Example calculation where ‘G’ is a union of 4 basic geographical objects (line, triangle, two points), and where ‘S ∉ G’

The choice of c1 directly corresponds to a penalty for search center points S that are located inside or touch the geographical object G: the result of the fair distance function equals to sqrt(c1 · |G| / π) in this case (which is proportional to the radius of a circle with the same surface area than the geographical object G). In order to fulfill the demand stated in Figure 3, the second constant c2 is chosen dependent on c1 such that the penalties for search center points S inside the geographical object G are compensated on the outside (see Figure 9). An optimal value for c1 would be 1/2 because in case of a geographical object which consists of a huge number of singular points randomly scattered over a certain surface area, the statistical average for |Gextra| is half of that area.

A shape is filled with black points in a regular pattern. An additional point ‘S’ is also within that shape. Each black point is surrounded by a gray margin that is just thick enough to touch that point ‘S’.
Figure 8: For an ‘S’ randomly placed inside the outer shape, the grey area ‘Gextra’ will cover half of the area within the outer shape in the average case

Therefore, no (statistical) advantage could be gained by replacing filled areas of a geographical object with a huge number of singular points covering that surface area.

Setting c1 = 1/2, however, would cause a discontinuity because lim [c1→1/2] c2 = ∞. A reasonable compromise seems to be (c1, c2) = (2/3, 4/3). This way, search center points that are located inside the geographical object are slightly over-penalized, but the discontinuity is solved smoothly.

A graph with ‘S ordered by f(G, S)’ on the x-axis and ‘R proportional to f(G, S)^2’ on the y-axis. A dashed line marks a linear relation. A solid line is constant and equal to ‘c_1 * |G|’ for ‘S in G’. This part of the solid line is labeled with ‘treat all S in G equal’. Right of it, the solid line has a slope of ‘c_2’ until ‘c_0 * |G|’ on the x-axis. Afterwards the solid line overlaps with the dashed line. The areas between the solid line and the dashed line are labeled ‘I_1’, ‘I_2’, and ‘I_3’. ‘I_1’ is above the dashed line, while ‘I_2’ and ‘I_3’ are below the dashed line. I_1 - I_2: excess penalty for S in G. I_3: compensation for extra penalty (‘c_2’ is chosen such that ‘I_3 = I_1 - I_2’). c_0 := (c_2 - c_1) / (c_2 - 1). Surface integral (S in E) [ (f(G, S))^2 ] dA = const. The integral is independent of ‘|G|’ if ‘c_0 * |G|’ is not exceeding Earth's surface, because ‘I_1 - I_2 - I_3 = 0’ and the integral will then cover ‘I_1’ through ‘I_3’.
Figure 9: Picking ‘c2’ such that the formula in Figure 3 is fulfilled

In the limiting case where small geographical objects in relation to their distance from the search point S are being considered, we know that |Gextra| ≫ |G| from which follows that R = |G| + |Gextra|. Then R is equal to the area of all points with spheroidal surface distance to the geographical object G equal to or smaller than d (see Figure 7). The area for which the distance function yields values below a limit L is therefore equal to π · L2 because from f(G, S) ≤ sqrt(R/π) = L follows that R ≤ π · L2. Therefore, the demand stated in Figure 4 is fulfilled in this limiting case (see Figure 5 for the precise condition; factor c0 is given in Figure 9).

## Implementation as part of the PostgreSQL extension “pgLatLon”

PostgreSQL is an open-source database management system. [PostgreSQL] The fair distance function has been implemented as part of a PostgreSQL extension named “pgLatLon”, [pgLatLon] which was originally contributed as part of LiquidFeedback. The implementation uses numerical integration (similar to the Monte Carlo method) in order to determine the area |G| and extended area |Gextra| on the Earth spheroid. While the area of a polygon can also be calculated more easily and more accurately, the calculation of the extended area |Gextra| seems to be rather difficult if numerical calculation was to be avoided.

The sample points for numerical intergration on the spheroid (as calculated by the “pgl_sample_points” function) are generated by using a spiral with sample points occurring at the golden angle, similar to a pattern found in many plants, e.g. sunflowers.

t is element of { 0.5, 1.5, 2.5, 3.5, …, n-0.5 }. x = sqrt(t) * cos(alpha * t). y = sqrt(t) * sin(alpha * t). alpha = (3 - sqrt(5)) * 180 degrees = approx. 137.51 degrees. A spiral starts in the center and turns counter-clockwise. Points are placed on the spiral, and the angle between two successive points in regard to the beginning of the spiral in the center is alpha.
Figure 10: Using the golden angle to create sample points for numerical integration
Figure 11: C implementation in ‘pgLatLon’

## Completing the GiST interface with a distance estimator function

While the distance function is a mandatory prerequisite for a nearest-neighbor search, further support functions are needed to speed up nearest-neighbor searches when using database indices. PostgreSQL provides the GiST interface to enable fast nearest-neighbor searches for custom functions and operators. “pgLatLon” includes facilities to create indices on geographical objects, including support for nearest-neighbor search using (a) the spheroidal surface distance or (b) the “fair distance” function as defined above. However, the corresponding GiST distance estimator function implemented by pgLatLon uses the spheroidal surface distance function in both cases. It is still possible to do nearest-neighbor searches for the “fair distance” function because the distance estimator function of the GiST framework is always allowed to return smaller values, though the performance may be less optimal.

## Considering the voting weight

The second challenge mentioned above is to not only consider geographical properties but also the ratings of others users (i.e. voters) when performing a search. For a pair of a single search center point S and a geographical object G, this can be easily achieved by dividing the value R (as calculated in the algorithm explained above) by a number representing the strength of support by the voters. In the easiest case, this could be the total number of votes. However, in order to create a clone-proof process, vote counting mechanisms should be used where clone-proofness is ensured (e.g. restrict voters to vote only for one object or use a clone-proof counting scheme such as Harmonic Weighting).

## Weighted nearest-neighbor searches

As previously noted, the distance estimator function of the GiST framework is allowed to return distances shorter than the actual distances. While the penalty of the fair distance function only increases the returned distance (at least in case of a flat map or, by approximation, in case of short distances on the spheroid [Note: The difference between a flat map and a spheroid used to model Earth tends towards zero for small distances. “pgLatLon” version 0.10 ensures that the fair distance function never returns values smaller than the actual spheroidal surface distance (see last if-else-clause in Figure 11, for example), which introduces a tiny error (that tends towards zero for small distances) but ensures that the GiST functions behave consistently which is required for proper index operation.]), considering voting weight might decrease a returned distance. Therefore, it is no longer feasible to use the spheroidal surface distance as an estimation for distances that have been re-weighted according to the number of votes. Whenever an additional weight of an object is taken into account, the index should store such a weight in the index tree so that the corresponding GiST distance estimator can consider this weight by decreasing the returned estimation for the distance accordingly. In the same fashion, the estimated distance can be increased when a geographical object has an area |G| > 0. While the latter is not necessary (since it is always allowed to return distances shorter than the actual distance), it may speed up calculation. Both adjustments, however, will require to store additional data in the index tree.

Just storing this data in this index tree (and returning the worst case in case of a non-leaf node) doesn't allow for a fast index operation yet: the tree would also need to be organized in a way that considers the additional data. There are many possibilities to achieve this (R-trees, kd-trees, fractals, etc.), and the choice of index structure goes beyond the scope of this article. In either case, a performant implementation would not be trivial.

Nonetheless, it is possible to work around this issue in SQL by performing SELECTs with an (exponentially) increasing LIMIT clause within a custom function. This approach is less optimal than using a specialized index, but an example is given in pgLatLon's source code (version 0.10) in lines 27 thorugh 85 of the file “create_test_db.schema.sql”.

## Acknowledgements

The development of “pgLatLon”, including the fair distance function, was contributed to LiquidFeedback by FlexiGuided GmbH, Berlin and co-funded by the European Union's Horizon 2020 research and innovation programme under grant agreement n° 693514 (“WeGovNow”).

“pgLatLon” is available as an open-source software under the terms of the MIT-License and may be downloaded at the project page of the Public Software Group e.V.: http://www.public-software-group.org/pgLatLon

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 693514.

[Evolution] Jan Behrens: The Evolution of Proportional Representation in LiquidFeedback. In “The Liquid Democracy Journal on electronic participation, collective moderation, and voting systems”, Issue 1, March 20, 2014, pp. 32-41. ISSN 2198-9532. Published by Interaktive Demokratie e. V., available at http://www.liquid-democracy-journal.org/issue/1/The_Liquid_Democracy_Journal-Issue001-04-The_evolution_of_proportional_representation_in_LiquidFeedback.html
[GiST] PostgreSQL 9.6.2 Documentation, Chapter VII (Internals), Section 61 (GiST). Available at https://www.postgresql.org/docs/9.6/static/gist-intro.html (referenced at: a)
[Janson] Svante Janson: “Phragmén's and Thiele's election methods”, November 27, 2016, subsection 4.3 (Thiele's elimination method) on page 15. Available at https://arxiv.org/abs/1611.08826v1 (referenced at: a)
[pgLatLon] “pgLatLon”, a spatial database extension for PostgreSQL. Homepage at http://www.public-software-group.org/pgLatLon (referenced at: a)
[PLF] Behrens, Kistner, Nitsche, Swierczek: “The Principles of LiquidFeedback”. ISBN 978-3-00-044795-2. Published January 2014 by Interaktive Demokratie e. V., available at http://principles.liquidfeedback.org/ (referenced at: a b c d)
[PostgreSQL] The open source object-relational database system “PostgreSQL”. Homepage at https://www.postgresql.org/ (referenced at: a)
[Skowron] Skowron, Lackner, Brill, Peters, Elkind: “Proportional Rankings”, December 5, 2015, ranking rule Reverse SeqPAV on page 4. Published on arXiv.org at https://arxiv.org/abs/1612.01434v1 (referenced at: a)