The Liquid Democracy Journal

on electronic participation, collective moderation, and voting systems Issue 5

2017-05-11

on electronic participation, collective moderation, and voting systems Issue 5

2017-05-11

When developing democratic proposition development and decision making software, the processing of user-generated geospatial data poses certain questions when it comes to the ordering of search results by location-based relevance. One aspect of location-based relevance is the *distance* of a geographical object from/to a given point (e.g. a search center). While we usually mean the euclidean or spheroidal surface distance [Note: The spheroidal surface distance is the shortest possible distance on the reference spheroid used to model Earth.] when we speak of “distances”, other mathematical distance functions are thinkable and useful, as will be shown in this article. Beside proposing a particular distance function, we will also consider practical aspects of database indexing when using such a distance function.

We will use the term “distance function” for a mathematical function *f* that maps two geographical objects *G* and *S* (each being a subset of Earth's surface *E*) to a non-negative real number *f(G, S) ∈ ℝ ^{+}_{0}*. Within this article, however, we will only consider those functions where the second argument

A distance of zero will denote the highest possible relevance for an object *G* in regard to a certain search center *S*; higher numbers will denote less relevant matches.

Note that even when we write *G ⊆ E*, we only consider those *G* which are finite unions of basic geographical objects, and not, for example, fractals or non-measurable sets.

Distance functions are a necessity for database queries in the form of “Show me the object(s) which are closest to my location”. While the distance function is the only mandatory prerequisite for such a nearest-neighbor search (and totally sufficient for a linear search), a fast indexing system requires further support functions to speed up nearest-neighbor searches. A working set of such support functions can be found at the GiST framework that is used by PostgreSQL. [GiST] At first sight, we will focus on the distance function though, keeping in mind that a final solution will require further considerations in order to implement fast indexing techniques which are necessary for scalable applications.

The easiest function to be used for nearest-neighbor searches is to determine the shortest possible spheroidal surface distance *d* (or euclidean distance in case of flat maps) between a search center *S* and the geographical object *G*.

A geographical object ‘G’ of a random shape next to a point ‘S’.
The shortest distance between ‘S’ and ‘G’ is ‘d’.

Figure 1: The shortest way (euclidean distance) from a geographical object ‘G’ to a search center ‘S’

While, at a first glance, the trivial approach of simply determining the shortest possible distance might seem to be straight forward and fair, a closer look reveals a serious problem: objects that cover large areas have a higher chance to return a shorter distance (or even zero if they cover the search center). If these objects are part of user-generated content (e.g. created by an initiator of a LiquidFeedback initiative), then users might be tempted to create intentionally oversized geometric objects in order to optimize search results for their content. Honest users aiming to select exactly those regions or locations that are really affected by their initiatives would be at disadvantage to users entering oversized and thus incorrect geometries. Even if it is still possible that some other users penalize such attempts by giving respective bad ratings to proposals with wrong or slightly wrong geospatial metadata, it would still be possible that the overall data quality is compromised due to tactical considerations of certain users. As shown in the section “Requirements for a fair distance function” below, we can compensate the advantage of big objects in a proper way.

Two geographical objects ‘G_small’ and ‘G_big’, of which the first is
smaller than the second one and contained in the second one.
The shortest distance from three points ‘S_1’, ‘S_2’, and ‘S_3’ outside
those two objects to ‘G_small’ is always greater than the shortest distance
of these three points to ‘G_big’.

Figure 2: Big objects having advantages in comparison to small objects

Following the demand of a democratically driven proposition development process that is moderated by the users (i.e. collectively moderated), location-based relevance will not only depend on geographical properties but also on the ratings of the users (i.e. voters). Depending on the kind of search, we would also need to take the voter or supporter situation (e.g. “likes”) into account when sorting data by (location-based) relevance. One method to include user ratings for the location-based relevance of objects would be to simply divide the geographical distance by the number of votes such that an object that is 90 times as far away but has 100 times more votes than another object will appear first in the list of entries based on location-based relevance. While this may affect the efficiency of index lookups inside a database (and thus require further consideration, see section “Weighted nearest-neighbor searches” below), it is otherwise easy to convert any distance function to a function which will take the weight of an object into account (e.g. by including a simple division as explained before).

It should be noted that in many cases, the number of votes isn't suitable to be directly taken into consideration as weight. This is because clone-proofness is an important property of proposition development and decision making systems. [PLF, section 4.11] In case of competing proposals with geographical metadata, the “Harmonic Weighting” algorithm (compare [PLF, subsection 4.10.1]) might be a more suitable approach.

In the context of multi-winner elections, the counting scheme used by “Harmonic Weighting” has also been described by Thiele in 1895. [Janson] [Skowron] For the sake of weighting geographical objects, however, we require more than a selection of candidates (like in multi-winner voting systems) and even more than a ranking of all candidates (which is a natural by-product of sequentially working multi-winner voting systems): we require the system to return a weight (i.e. a real number) instead of just a rank for each candidate. See [PLF, Appendix B] for an example (column “Harmonic Weight” in table on [PLF, p.177]).

Our aim is that increasing the size of a geographical object doesn't give a general advantage to appear earlier in nearest-neighbor searches. It should be noted that the “size” in this context doesn't necessarily mean “area” or “length”. For example, neither a huge polygon nor a set of thousand points or a long line (note that the latter two have an area of zero) should gain an advantage over a single point. When we speak of “general advantage”, we must define this in mathematical terms since changing the size or location of an object can always optimize the distance in regard to a *particular* search center. With “general advantage” we rather mean that the distance *f(G, S)* (with *G* being a geographical object and *S* being a search center point) can't be optimized in regard to *all* possible search centers, or by demanding that the average result of the distance function *f* (or the squared distance function, see Figure 3) over all possible search points is constant.

For all geographical objects G in
{ x subset of or equal to E | x not empty }:
Surface integral (S in E) [ (f(G, S))^2 ] dA = const,
where E is the set of all points on Earth's surface
and dA is the infinitesimal surface area.

Figure 3: Constancy of average of square of distance

The requirement in Figure 3, however, is not sufficient for a fair distance function because *f(G, S)* is not bounded: raising *f(G, S)* to a very high value for some search centers would allow lowering *f(G, S)* for many other search centers.

A stricter requirement would be that the area for which the distance function yields values equal to or less than a limit *L* is independent of *G* and must only depend on *L* (e.g. the area must be equal to *π · L ^{2}* as demanded by the formula given in Figure 4).

Surface integral (S in {x in E | f(G, x) <= L}) [ 1 ] dA = pi * L^2

Figure 4: Strict requirement for the area where the distance function yields values smaller than a certain limit

This stricter requirement is automatically fulfilled for all *L ∈ ℝ ^{+}_{0}* if we only consider singular points as geographical objects on a flat map and if

Unfortunately the requirement in Figure 4 is generally infeasible for geographical objects that have an area greater than zero because it would conflict with treating all points inside the object's area equally. We can still demand the requirement depicted in Figure 4 to be true for all *L ∈ ℝ ^{+}_{0}* when the geographical object

Let |G| be the size of the area covered by G (e.g.
|G| = 0 if G only consists of singular points),
then there exists a factor c_0 in R+ such that for all
geographical objects G subset of or equal to E and all
all L >= c_0 * |G|:
Surface integral (S in {x in E | f(G, x) <= L}) [ 1 ] dA = pi * L^2.

Figure 5: Fulfillment of formula from Figure 4 for all ‘L’ being large enough

Considering what has been discussed above, we propose the algorithm described in Figure 6 to serve as a “fair distance function” which is parameterized with a search center point *S* and a geographical object *G* on the spheroid. An example of its application is shown in Figure 7.

Let ‘c_1’ and ‘c_2’ be two constants, where
1/2 < c_1 <= 1 and c_2 = (c_1 ^ 2) / (2 * c_1 - 1).
1. Calculate the surface area ‘|G|’ of the geographical object ‘G’
(zero if ‘G’ only consists of points and paths but not, for
example, filled polygons).
2. Calculate the shortest spheroidal surface distance ‘d’ between
the geographical object ‘G’ and the search center point ‘S’
(zero if ‘S’ is located inside or touching ‘G’).
3. Calculate the surface area ‘|G_extra|’ covered by all points on
the spheroid which are not in ‘G’ but whose shortest spheroidal
surface distance to ‘G’ is less than ‘d’.
4. Let R = min ( c_1 * |G| + c_2 * |G_extra| , A + E ).
5. The result of the fair distance function is:
f(G, S) = sqrt ( R / pi ).

Figure 6: Proposed algorithm

4 black geographical objects marked as ‘G’ are surrounded by a gray margin
that follows the shape of the four objects and is just thick enough to
touch a point ‘S’ that is outside of ‘G’.
f(G, S) = sqrt ( R / pi ).
R = min ( c_1 * |G| + c_2 * |G_extra| , A + E ).

Figure 7: Example calculation where ‘G’ is a union of 4 basic geographical objects (line, triangle, two points), and where ‘S ∉ G’

The choice of *c _{1}* directly corresponds to a penalty for search center points

A shape is filled with black points in a regular pattern. An additional
point ‘S’ is also within that shape. Each black point is surrounded by a
gray margin that is just thick enough to touch that point ‘S’.

Figure 8: For an ‘S’ randomly placed inside the outer shape, the grey area ‘G_{extra}’ will cover half of the area within the outer shape in the average case

Therefore, no (statistical) advantage could be gained by replacing filled areas of a geographical object with a huge number of singular points covering that surface area.

Setting *c _{1} = 1/2*, however, would cause a discontinuity because

A graph with ‘S ordered by f(G, S)’ on the x-axis and ‘R proportional to
f(G, S)^2’ on the y-axis. A dashed line marks a linear relation. A solid
line is constant and equal to ‘c_1 * |G|’ for ‘S in G’. This part of the
solid line is labeled with ‘treat all S in G equal’. Right of it, the
solid line has a slope of ‘c_2’ until ‘c_0 * |G|’ on the x-axis.
Afterwards the solid line overlaps with the dashed line.
The areas between the solid line and the dashed line are labeled ‘I_1’,
‘I_2’, and ‘I_3’. ‘I_1’ is above the dashed line, while ‘I_2’ and ‘I_3’
are below the dashed line.
I_1 - I_2: excess penalty for S in G.
I_3: compensation for extra penalty (‘c_2’ is chosen such that
‘I_3 = I_1 - I_2’).
c_0 := (c_2 - c_1) / (c_2 - 1).
Surface integral (S in E) [ (f(G, S))^2 ] dA = const.
The integral is independent of ‘|G|’ if ‘c_0 * |G|’ is not exceeding
Earth's surface, because ‘I_1 - I_2 - I_3 = 0’ and the integral will then
cover ‘I_1’ through ‘I_3’.

Figure 9: Picking ‘c_{2}’ such that the formula in Figure 3 is fulfilled

In the limiting case where small geographical objects in relation to their distance from the search point *S* are being considered, we know that *|G _{extra}| ≫ |G|* from which follows that

PostgreSQL is an open-source database management system. [PostgreSQL] The fair distance function has been implemented as part of a PostgreSQL extension named “pgLatLon”, [pgLatLon] which was originally contributed as part of LiquidFeedback. The implementation uses numerical integration (similar to the Monte Carlo method) in order to determine the area *|G|* and extended area *|G _{extra}|* on the Earth spheroid. While the area of a polygon can also be calculated more easily and more accurately, the calculation of the extended area

The sample points for numerical intergration on the spheroid (as calculated by the “pgl_sample_points” function) are generated by using a spiral with sample points occurring at the golden angle, similar to a pattern found in many plants, e.g. sunflowers.

t is element of { 0.5, 1.5, 2.5, 3.5, …, n-0.5 }.
x = sqrt(t) * cos(alpha * t).
y = sqrt(t) * sin(alpha * t).
alpha = (3 - sqrt(5)) * 180 degrees = approx. 137.51 degrees.
A spiral starts in the center and turns counter-clockwise. Points are
placed on the spiral, and the angle between two successive points in
regard to the beginning of the spiral in the center is alpha.

Figure 10: Using the golden angle to create sample points for numerical integration

/* half of (spherical) earth's surface area */
#define PGL_HALF_SURFACE (PGL_RADIUS * PGL_DIAMETER * M_PI)
/* golden angle in radians */
#define PGL_GOLDEN_ANGLE (M_PI * (sqrt(5) - 1.0))
/* create a list of sample points covering a bounding circle
and return covered area */
static double pgl_sample_points(
pgl_point *center, /* center of bounding circle */
double radius, /* radius of bounding circle */
int samples, /* number of sample points (MUST be positive!) */
pgl_point *result /* pointer to result array */
) {
double double_share = 2.0; /* double of covered share of earth's surface */
double double_share_div_samples; /* double_share divided by sample count */
int i;
double t; /* parameter of spiral laid on (spherical) earth's surface */
double x, y, z; /* normalized coordinates of point on non-rotated spiral */
double sin_phi; /* sine of sph. coordinate of point of non-rotated spiral */
double lambda; /* other sph. coordinate of point of non-rotated spiral */
double rot = (0.5 - center->lat / 180.0) * M_PI; /* needed rot. (in rad) */
double cos_rot = cos(rot); /* cosine of rotation by latitude */
double sin_rot = sin(rot); /* sine of rotation by latitude */
double x_rot, z_rot; /* normalized coordinates of point on rotated spiral */
double center_lon = center->lon; /* second rotation in degree */
/* add safety margin to bounding circle because of spherical approximation */
radius *= PGL_SPHEROID_A / PGL_RADIUS;
/* if whole earth is covered, use initialized value, otherwise calculate
share of covered area (multiplied by 2) */
if (radius < PGL_MAXDIST) double_share = 1.0 - cos(radius / PGL_RADIUS);
/* divide double_share by sample count for later calculations */
double_share_div_samples = double_share / samples;
/* generate sample points */
for (i=0; i<samples; i++) {
/* use an offset of 1/2 to avoid too dense clustering at spiral center */
t = 0.5 + i;
/* calculate normalized coordinates of point on non-rotated spiral */
z = 1.0 - double_share_div_samples * t;
sin_phi = sqrt(1.0 - z*z);
lambda = t * PGL_GOLDEN_ANGLE;
x = sin_phi * cos(lambda);
y = sin_phi * sin(lambda);
/* rotate spiral by latitude value of bounding circle */
x_rot = cos_rot * x + sin_rot * z;
z_rot = cos_rot * z - sin_rot * x;
/* set resulting sample point in result array */
/* (while performing second rotation by bounding circle longitude) */
result[i].lat = 180.0 * (atan(z_rot / fabs(x_rot)) / M_PI);
result[i].lon = center_lon + 180.0 * (atan2(y, x_rot) / M_PI);
}
/* return covered area */
return PGL_HALF_SURFACE * double_share;
}
/* fair distance between point and cluster (see README file for explanation) */
/* NOTE: sample count passed as third argument MUST be positive */
static double pgl_fair_distance(
pgl_point *point, pgl_cluster *cluster, int samples
) {
double distance; /* shortest distance from point to cluster */
pgl_point *points; /* sample points for numerical integration */
double area; /* area covered by sample points */
int i;
int inner = 0; /* number of sample points within cluster */
int outer = 0; /* number of sample points outside cluster but
within cluster enlarged by distance */
double result;
/* calculate shortest distance from point to cluster */
distance = pgl_point_cluster_distance(point, cluster);
/* if cluster consists of a single point or has no bounding circle with
positive radius, simply return distance */
if (
(cluster->nentries==1 && cluster->entries[0].entrytype==PGL_ENTRY_POINT) ||
!(cluster->bounding.radius > 0)
) return distance;
/* if cluster consists of two points which are twice as far apart, return
distance between point and cluster multiplied by square root of two */
if (
cluster->nentries == 2 &&
cluster->entries[0].entrytype == PGL_ENTRY_POINT &&
cluster->entries[1].entrytype == PGL_ENTRY_POINT &&
pgl_distance(
PGL_ENTRY_POINTS(cluster, 0)[0].lat,
PGL_ENTRY_POINTS(cluster, 0)[0].lon,
PGL_ENTRY_POINTS(cluster, 1)[0].lat,
PGL_ENTRY_POINTS(cluster, 1)[0].lon
) >= 2.0 * distance
) {
return distance * M_SQRT2;
}
/* otherwise create sample points for numerical integration and determine
area covered by sample points */
points = palloc(samples * sizeof(pgl_point));
area = pgl_sample_points(
&cluster->bounding.center,
cluster->bounding.radius + distance, /* pad bounding circle by distance */
samples,
points
);
/* perform numerical integration */
if (distance > 0) {
/* point (that was passed as argument) is outside cluster */
for (i=0; i<samples; i++) {
/* count sample points within cluster */
if (pgl_point_in_cluster(points+i, cluster, true)) inner++;
/* count sample points outside of cluster but within cluster enlarged by
distance between point (that was passed as argument) and cluster */
else if (
pgl_point_cluster_distance(points+i, cluster) < distance
) outer++;
}
} else {
/* if point is within cluster, just count sample points within cluster */
for (i=0; i<samples; i++) {
if (pgl_point_in_cluster(points+i, cluster, true)) inner++;
}
}
/* release memory for sample points needed for numerical integration */
pfree(points);
/* if enlargement was less than doubling the area, then combine inner and
outer sample point counts with different weighting */
/* (ensures fairness in such a way that the integral of the squared result
over all possible point parameters is independent of the cluster) */
if (outer < inner) result = (2*inner + 4*outer) / 3.0;
/* otherwise weigh inner and outer points the same */
else result = inner + outer;
/* convert area into distance (i.e. radius of a circle with the same area) */
result = sqrt(area * (result / samples) / M_PI);
/* return result only if it is greater than the distance between point and
cluster to avoid unexpected results because of errors due to limited
precision */
if (result > distance) return result;
/* otherwise return distance between point and cluster */
else return distance;
}

Figure 11: C implementation in ‘pgLatLon’

While the distance function is a mandatory prerequisite for a nearest-neighbor search, further support functions are needed to speed up nearest-neighbor searches when using database indices. PostgreSQL provides the GiST interface to enable fast nearest-neighbor searches for custom functions and operators. “pgLatLon” includes facilities to create indices on geographical objects, including support for nearest-neighbor search using (a) the spheroidal surface distance or (b) the “fair distance” function as defined above. However, the corresponding GiST distance estimator function implemented by pgLatLon uses the spheroidal surface distance function in both cases. It is still possible to do nearest-neighbor searches for the “fair distance” function because the distance estimator function of the GiST framework is always allowed to return smaller values, though the performance may be less optimal.

The second challenge mentioned above is to not only consider geographical properties but also the ratings of others users (i.e. voters) when performing a search. For a pair of a single search center point *S* and a geographical object *G*, this can be easily achieved by dividing the value *R* (as calculated in the algorithm explained above) by a number representing the strength of support by the voters. In the easiest case, this could be the total number of votes. However, in order to create a clone-proof process, vote counting mechanisms should be used where clone-proofness is ensured (e.g. restrict voters to vote only for one object or use a clone-proof counting scheme such as Harmonic Weighting).

As previously noted, the distance estimator function of the GiST framework is allowed to return distances shorter than the actual distances. While the penalty of the fair distance function only increases the returned distance (at least in case of a flat map or, by approximation, in case of short distances on the spheroid [Note: The difference between a flat map and a spheroid used to model Earth tends towards zero for small distances. “pgLatLon” version 0.10 ensures that the fair distance function never returns values smaller than the actual spheroidal surface distance (see last if-else-clause in Figure 11, for example), which introduces a tiny error (that tends towards zero for small distances) but ensures that the GiST functions behave consistently which is required for proper index operation.]), considering voting weight might decrease a returned distance. Therefore, it is no longer feasible to use the spheroidal surface distance as an estimation for distances that have been re-weighted according to the number of votes. Whenever an additional weight of an object is taken into account, the index should store such a weight in the index tree so that the corresponding GiST distance estimator can consider this weight by decreasing the returned estimation for the distance accordingly. In the same fashion, the estimated distance can be increased when a geographical object has an area *|G| > 0*. While the latter is not necessary (since it is always allowed to return distances shorter than the actual distance), it may speed up calculation. Both adjustments, however, will require to store additional data in the index tree.

Just storing this data in this index tree (and returning the worst case in case of a non-leaf node) doesn't allow for a fast index operation yet: the tree would also need to be organized in a way that considers the additional data. There are many possibilities to achieve this (R-trees, kd-trees, fractals, etc.), and the choice of index structure goes beyond the scope of this article. In either case, a performant implementation would not be trivial.

Nonetheless, it is possible to work around this issue in SQL by performing SELECTs with an (exponentially) increasing LIMIT clause within a custom function. This approach is less optimal than using a specialized index, but an example is given in pgLatLon's source code (version 0.10) in lines 27 thorugh 85 of the file “create_test_db.schema.sql”.

The development of “pgLatLon”, including the fair distance function, was contributed to LiquidFeedback by FlexiGuided GmbH, Berlin and co-funded by the European Union's Horizon 2020 research and innovation programme under grant agreement n° 693514 (“WeGovNow”).

“pgLatLon” is available as an open-source software under the terms of the MIT-License and may be downloaded at the project page of the Public Software Group e.V.: http://www.public-software-group.org/pgLatLon

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 693514.

[Evolution] Jan Behrens: The Evolution of Proportional Representation in LiquidFeedback. In “The Liquid Democracy Journal on electronic participation, collective moderation, and voting systems”, Issue 1, March 20, 2014, pp. 32-41. ISSN 2198-9532. Published by Interaktive Demokratie e. V., available at http://www.liquid-democracy-journal.org/issue/1/The_Liquid_Democracy_Journal-Issue001-04-The_evolution_of_proportional_representation_in_LiquidFeedback.html

[GiST] PostgreSQL 9.6.2 Documentation, Chapter VII (Internals), Section 61 (GiST). Available at https://www.postgresql.org/docs/9.6/static/gist-intro.html (referenced at: a)

[Janson] Svante Janson: “Phragmén's and Thiele's election methods”, November 27, 2016, subsection 4.3 (*Thiele's elimination method*) on page 15. Available at https://arxiv.org/abs/1611.08826v1 (referenced at: a)

[pgLatLon] “pgLatLon”, a spatial database extension for PostgreSQL. Homepage at http://www.public-software-group.org/pgLatLon (referenced at: a)

[PLF] Behrens, Kistner, Nitsche, Swierczek: “The Principles of LiquidFeedback”. ISBN 978-3-00-044795-2. Published January 2014 by Interaktive Demokratie e. V., available at http://principles.liquidfeedback.org/ (referenced at: a b c d)

[PostgreSQL] The open source object-relational database system “PostgreSQL”. Homepage at https://www.postgresql.org/ (referenced at: a)

[Skowron] Skowron, Lackner, Brill, Peters, Elkind: “Proportional Rankings”, December 5, 2015, ranking rule *Reverse SeqPAV* on page 4. Published on arXiv.org at https://arxiv.org/abs/1612.01434v1 (referenced at: a)