Improving geospatial SPARQL query times with SILK
Soil Moisture Portal
The UK must report greenhouse gas (GHG) emissions under the Kyoto Protocol, and the release of one of those gases from UK soils, N2O (nitrous oxide), depends strongly on soil moisture content.
Soil moisture is influenced by land cover, land use, agricultural practice and climatic conditions, so spatial and temporal disaggregation of soil moisture and land cover data helps to produce more accurate GHG emissions estimates and to quantify their uncertainty.
As part of the MELODIES project, we have assembled an historic in-situ soil moisture database (1970 to present day) from over 40 sites in the UK, most of which have multiple individual locations (see Figure 1). The data were rescued from numerous sources, including magnetic tape, and transferred to an Oracle relational database after extensive checks for errors, omissions and consistency, and after the creation of metadata.
Figure 1: A map of in-situ soil moisture sites (red dots) across the UK.
The site data exist as static points (with British National Grid coordinates) with associated metadata such as identification number, vegetation type, altitude etc., while another data table stores depth-specific soil moisture measurements (which exist without geometry) across the entire date range. The site identifier is the primary key. Figure 2 shows the distribution of the soil moisture measurements across time.
Figure 2: Temporal distribution of current in-situ soil moisture data.
Soil moisture data were extracted from Oracle into CSV files and then converted to RDF using Python scripts. Using a pre-determined ontology, the triples relating to a site hold information, including the geometrical attributes, as a series of objects that are tied to subjects via a defined relationship (a predicate) (Figure 3).
Figure 3: N-Triples pattern for a single soil moisture location.
By creating separate triples for a feature and for its associated geometry, one geometry can be linked to many features in SPARQL queries (simply by referring to the single geometry many times). The geometry is stored as a literal representation of an X, Y coordinate pair in Well-Known Text (WKT) to conform to OGC standards. Soil moisture measurement data do not hold geometrical data and are linked to a site via a site ID, thus gaining geometrical properties through relationships defined in the ontology. All of the N-Triples are stored in Strabon, a spatiotemporal RDF store.
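The feature/geometry split described above can be sketched in Python. This is only an illustration, not the project's conversion script: the `http://example.org/soilmoisture/` namespace, the predicate names `atSite`, `depthCm` and `volumetricWaterContent`, and the sample coordinates are all hypothetical. The GeoSPARQL terms (`hasGeometry`, `asWKT`, `wktLiteral`) and the EPSG:27700 CRS URI for British National Grid are standard, although the vocabulary actually used in Strabon may differ.

```python
# Hypothetical namespace for this sketch; not the project's real ontology.
EX = "http://example.org/soilmoisture/"
GEO = "http://www.opengis.net/ont/geosparql#"
XSD = "http://www.w3.org/2001/XMLSchema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# British National Grid expressed as an OGC CRS URI (EPSG:27700).
BNG_CRS = "http://www.opengis.net/def/crs/EPSG/0/27700"


def site_to_ntriples(site_id, easting, northing):
    """Return N-Triples lines for one site: a feature plus a separate geometry."""
    site = f"<{EX}site/{site_id}>"
    geom = f"<{EX}geometry/{site_id}>"
    wkt = f"<{BNG_CRS}> POINT({easting} {northing})"
    return [
        f"{site} <{RDF_TYPE}> <{EX}Site> .",
        f"{site} <{GEO}hasGeometry> {geom} .",
        f'{geom} <{GEO}asWKT> "{wkt}"^^<{GEO}wktLiteral> .',
    ]


def measurement_to_ntriples(measurement_id, site_id, depth_cm, value):
    """Measurements carry no geometry; they only point at their site."""
    m = f"<{EX}measurement/{measurement_id}>"
    return [
        f"{m} <{EX}atSite> <{EX}site/{site_id}> .",
        f'{m} <{EX}depthCm> "{depth_cm}"^^<{XSD}integer> .',
        f'{m} <{EX}volumetricWaterContent> "{value}"^^<{XSD}double> .',
    ]


if __name__ == "__main__":
    # Made-up site and measurement, for illustration only.
    triples = site_to_ntriples("S001", 284300, 287600)
    triples += measurement_to_ntriples("M1", "S001", 10, 0.31)
    print("\n".join(triples))
```

Because the measurement triples reference only the site, any number of measurements can share the single geometry triple written for that site.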
The soil moisture data will be made freely available as Linked Open Data (LOD) via a portal designed to query, search and display LOD. Furthermore, modelled estimates of soil moisture (e.g. by Grid-to-Grid, JULES) are converted and stored in Strabon and historical comparisons with the in-situ data can provide error estimations in emissions calculations and also assessment of Earth Observation (EO) products.
The portal will allow the user to discover soil moisture spatially and temporally by utilizing the visualization tool Sextant. Figure 4 is an example of comparing Grid-to-Grid modelled data against in-situ point data through time.
Figure 4: Comparison of modelled and in-situ soil moisture data in 1979.
Introducing more spatial datasets
As mentioned, land cover type can play an important role in soil moisture and localised changes in land cover and land use can have large effects on soil moisture values that cannot be extracted from gridded EO or modelled datasets. To allow soil moisture data (modelled and in-situ) to be searched and compared within areas of interest, or soil moisture data to be studied against land cover types, polygon data for UK river catchments and also from the Land Cover Map 2007 were converted to N-Triples in much the same way as the soil moisture sites.
Sextant can visualize spatial content by analysing the WKT literals of multiple geometries spread across SPARQL endpoints that follow GeoSPARQL or stSPARQL (a spatiotemporal extension to SPARQL used by Strabon). Functions (performed in SPARQL filters) such as ‘intersect’ and ‘within’ return a Boolean answer for two given geometries, which allows GIS-type analysis and display. Figure 5 is a simplified example of an stSPARQL query to return soil moisture locations with their associated river catchments and LCM polygons, while Figure 6 shows the results visualised in Sextant (zoomed into part of the Severn catchment in Wales).
Figure 5: stSPARQL query to return all sites with catchment and LCM data.
Figure 6: Visualisation of Figure 5, zoomed into the Plynlimon area of the Severn catchment.
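For readers without the figure, the general shape of a filter-based query can be sketched as a Python string builder. This is not the query shown in Figure 5: it uses the standard GeoSPARQL function `geof:sfIntersects` rather than Strabon's stSPARQL functions, and the `ex:` namespace and the class names `Site`, `Catchment` and `LandCoverParcel` are hypothetical.

```python
PREFIXES = """\
PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX ex:   <http://example.org/soilmoisture/>
"""


def spatial_filter_query(polygon_classes):
    """Build a query that spatially filters sites against each polygon class.

    Every class adds one geometry join and one FILTER, so the store has to
    compare the site point with every polygon of that class at query time.
    """
    select_vars = ["?site", "?siteWkt"]
    patterns = [
        "  ?site a ex:Site ; geo:hasGeometry ?siteGeom .",
        "  ?siteGeom geo:asWKT ?siteWkt .",
    ]
    for i, cls in enumerate(polygon_classes):
        feature, wkt = f"?poly{i}", f"?poly{i}Wkt"
        select_vars += [feature, wkt]
        patterns += [
            f"  {feature} a ex:{cls} ; geo:hasGeometry ?polyGeom{i} .",
            f"  ?polyGeom{i} geo:asWKT {wkt} .",
            f"  FILTER(geof:sfIntersects(?siteWkt, {wkt}))",
        ]
    body = "\n".join(patterns)
    return f"{PREFIXES}\nSELECT {' '.join(select_vars)}\nWHERE {{\n{body}\n}}"


if __name__ == "__main__":
    print(spatial_filter_query(["Catchment", "LandCoverParcel"]))
```

Each additional polygon dataset adds another geometry join and another spatial filter, which is where the cost discussed in the following paragraphs comes from.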
In N-Triple format, the geometrical WKT literals for a polygon feature are much the same as those for a point feature, except that polygon geometries are made of many coordinate pairs rather than just one, to represent every vertex of the polygon. In some cases in our data, such as complex river catchments, these coordinate pairs can number in the hundreds, while for LCM polygons they may number in the tens. Figure 7 shows an example of a point geometry in N-Triple form and also that of a reasonably simple polygon of 15 vertices.
Figure 7: Point and polygon geometry in WKT Literal format.
Assessing simple geometrical relationships such as those in Figures 5 and 6 is perhaps not a problem, as only a single point is compared with other complex geometries. However, soil moisture readings may exist many hundreds (if not thousands) of times at a given location through time, and soil moisture locations are made up of several individual sampling sites. Querying this many spatial relationships and then, crucially, displaying them all in a web browser is computationally expensive and potentially time consuming. Every soil moisture result needs a point geometry (as the data exist as a point in time and should not be displayed ‘continuously’ in Sextant) as well as the geometry of all other related spatial data, meaning complicated geometries can potentially be called many hundreds of times.
Given that the portal is meant for discovering and querying data in ‘real time’, rather than serving download requests, performance is a key issue.
SILK: SILK is a tool that pre-computes geometrical relationships in RDF triple data, creating new triples that describe the relationship between the features concerned; these triples are then ingested into Strabon. The heavy processing associated with long geometrical descriptions in the RDF data is thus done only once (before ingestion into the triple store), so user queries only look at much simpler pre-defined spatial relationships.
SILK analyses the spatial relationship of two datasets based on a given command and returns triples that reflect the relationship. In the example below (figure 8), SILK was instructed to look for intersection between the geometries of the soil moisture sites and the geometries of the river catchments.
Figure 8: New geometry triples for soil moisture sites and river catchments from SILK.
Now there is a subject-predicate-object relationship for two distinct geometrical datasets that can be ingested into Strabon and queried like any other triple, completely removing the need for spatial filtering.
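The idea of pre-computing spatial links can be illustrated with a small Python sketch. It does not use SILK's own configuration or API; it only mimics the outcome, running a point-in-polygon test once per site and catchment pair and writing a plain link triple for every hit. The `ex` namespace, the predicate `intersectsCatchment` and the toy square "catchments" are hypothetical.

```python
EX = "http://example.org/soilmoisture/"  # hypothetical namespace


def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the ring of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray running right from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def precompute_links(sites, catchments, predicate=EX + "intersectsCatchment"):
    """Run every spatial test once and return the resulting link triples.

    sites:      dict of site id -> (x, y)
    catchments: dict of catchment id -> list of (x, y) vertices
    """
    triples = []
    for site_id, (x, y) in sites.items():
        for catchment_id, ring in catchments.items():
            if point_in_polygon(x, y, ring):
                triples.append(
                    f"<{EX}site/{site_id}> <{predicate}> "
                    f"<{EX}catchment/{catchment_id}> ."
                )
    return triples


if __name__ == "__main__":
    sites = {"S1": (5, 5), "S2": (15, 5)}
    catchments = {
        "C1": [(0, 0), (10, 0), (10, 10), (0, 10)],
        "C2": [(10, 0), (20, 0), (20, 10), (10, 10)],
    }
    print("\n".join(precompute_links(sites, catchments)))
```

The cost of each test grows with the number of polygon vertices, so paying it once before ingestion, rather than on every user query, is where the saving comes from.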
UNION: When producing query results for soil moisture data, each data record is joined to a site via an ID, so each soil moisture result is given a geometry from the site (which in turn is spatially filtered against more spatial datasets). Instead of performing so many geometrical comparisons, it would be beneficial to use the UNION keyword of the SPARQL language. UNION is a disjunctive operation that returns results matching any of two (or more) graph patterns given in the query.
This could be very useful if we want to return many soil moisture records in relation to their site ID and also run disjunctive queries that return the spatial polygons that intersect only one instance of the site (instead of every single soil moisture record). Every further spatial dataset analysed would be a new disjunctive query in SPARQL, meaning only one instance of that geometry would be called. If there are 5 sites with 500 measurements per site that only lie in one catchment, there is no point calling the catchment geometry 2,500 times; as long as the attributes required (e.g. catchment name) are attached to each measurement (computationally very cheap) then the geometry need only be called once for display purposes. Figure 9 shows the original SPARQL query, using traditional spatial filtering, converted to incorporate the new geometry triples created in SILK and also a structure utilising disjunctive queries (UNION).
Figure 9: Changing traditional stSPARQL query with filtering to SILK & UNION disjunctive query.
If the geometrical relationships generated in SILK had not been ingested into Strabon, this UNION-based query would need to perform traditional spatial filters in every disjunctive part of the query (though at a much less intensive level than the query on the left of Figure 9), possibly rendering the query too complex to be used in a quick and reliable manner.
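As a rough sketch of the UNION structure discussed above (not the query in Figure 9), the Python snippet below holds a SPARQL 1.1 query in which one branch returns measurements with cheap attributes and the other returns each linked catchment geometry once. The `ex:` terms, including the pre-computed `ex:intersectsCatchment` link, are hypothetical, and whether Strabon evaluates `FILTER EXISTS` efficiently is an assumption. The helper function reproduces the 5 sites by 500 measurements arithmetic from the text.

```python
UNION_QUERY = """\
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX ex:  <http://example.org/soilmoisture/>

SELECT ?measurement ?value ?catchmentName ?catchment ?catchmentWkt
WHERE {
  {
    # Branch 1: every measurement, with cheap attributes only.
    ?measurement ex:atSite ?site ;
                 ex:value ?value .
    ?site ex:intersectsCatchment ?c .
    ?c ex:name ?catchmentName .
  }
  UNION
  {
    # Branch 2: the geometry of each linked catchment, returned once.
    ?catchment a ex:Catchment ;
               geo:hasGeometry ?geom .
    ?geom geo:asWKT ?catchmentWkt .
    FILTER EXISTS { ?anySite ex:intersectsCatchment ?catchment }
  }
}
"""


def geometries_returned(n_sites, measurements_per_site, n_catchments):
    """Count polygon geometries sent to the browser under each query shape.

    Assumes every measurement row falls in exactly one catchment.
    """
    joined = n_sites * measurements_per_site  # one polygon copy per result row
    union = n_catchments                      # one polygon copy per catchment
    return joined, union


if __name__ == "__main__":
    joined, union = geometries_returned(5, 500, 1)
    print(f"joined query: {joined} catchment geometries, UNION query: {union}")
```

Because no spatial function appears in either branch, the query relies entirely on the pre-computed link triples.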
Rearrange: We can also try to ensure the query is constructed in a sensible order. If the initial question is driven by an area of interest, which is likely to be the most geometrically complex element, then first filter out all other complex geometries of that type and only then perform the spatial queries. For example, if you are looking for data in the Severn river catchment, filter all other catchments out.
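The effect of narrowing to the area of interest first can be shown with a simple count. The sketch below assumes a naive evaluation in which every site is compared with every remaining catchment (a store with a spatial index would do better), and the list of catchment names is illustrative only.

```python
def count_spatial_tests(n_sites, catchment_names, area_of_interest=None):
    """Number of site-versus-catchment comparisons a naive evaluation performs."""
    if area_of_interest is not None:
        # Cheap attribute filter applied before any geometry is touched.
        catchment_names = [n for n in catchment_names if n == area_of_interest]
    return n_sites * len(catchment_names)


if __name__ == "__main__":
    names = ["Severn", "Thames", "Wye", "Dee"]  # illustrative list only
    print(count_spatial_tests(9, names))            # all catchments tested
    print(count_spatial_tests(9, names, "Severn"))  # Severn only
```

With nine sites and four candidate catchments, filtering on the catchment name first cuts the comparisons from 36 to 9, and the saving grows with the number of catchments excluded.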
Figure 10 shows some stSPARQL query execution times measured in Strabon. Here, the data analysed were restricted to a set of 9 sites, the soil moisture measurements were averaged to monthly means (a crude mean for testing) and the spatial intersections performed were between site location, river catchment and LCM polygon. The number of months of data (which increases the number of results returned) was the variable adjusted. Execution times include both the time taken to perform a SPARQL query against the triple store and the time taken to load the results (including geometries) back into a webpage.
Figure 10: Query execution times in Strabon for three separate methods on the same data.
As can be seen, the traditional filtering performed very slowly, quickly becoming untenable for a portal. Data queried this way typically returned results in 50% of the total time while the other 50% was spent loading the data into the web page (due to very large complex geometries).
SILK cut up to 40% off the total execution time compared to normal filtering. However, very little time was needed to perform the query itself and most of the time (up to 90% in some cases) was required to load the data back onto the webpage, again due to complex geometries in WKT.
Finally, constructing the query using SILK and UNION methods massively reduced the query execution time. There was no dramatic rise from the minimum data returned at 1 month (2.28 seconds) to the maximum data returned at 24 months (3.11 seconds). This was largely due to the massive reduction in what needed to be loaded into the webpage: only 1 instance of each complex geometry needed to be returned.
Conclusions and further considerations
Utilising pre-computed spatial relationships in SILK can cut down query execution time dramatically, as can the proper use of SPARQL syntax such as UNION. Together, these two solutions could allow for wide-ranging data queries, both temporally and spatially, within an acceptable time in a portal such as Sextant.
The construction of the query and the construction of an ontology are also important considerations when working with larger datasets that could entail many differing, complex geometries.
A key aim of this portal is the introduction of modelled and EO soil moisture data to Strabon in N-Triple form, which creates a new dimension of filtering: temporal. Not only will in-situ soil moisture data need to be compared spatially (against a raster cell) but also temporally. Using the same time period and spatial location for both in-situ and gridded soil moisture data is necessary to make comparisons in the datasets.
This has the potential to hugely increase computational demands, as many in-situ data records are compared to various polygons/grids in both space and time (Figure 11), a sort of double filtering process that occurs in parallel for one location. Figure 12 is an example of how these comparisons may look.
Figure 11: Complex spatiotemporal relationships when comparing modelled and measured soil moisture data.
Figure 12: An example of model and measurement comparison in Sextant.
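One way to picture the double (space and time) filtering is to reduce each in-situ record to a key made of its grid cell and its month, then look that key up in the gridded data. The Python sketch below assumes a regular grid aligned to British National Grid with a 1000 m cell size and monthly time steps; both are assumptions for illustration, as are the sample values.

```python
from datetime import date

CELL_SIZE_M = 1000  # assumed grid resolution in metres


def space_time_key(easting, northing, when):
    """Map a point observation to (cell column, cell row, year, month)."""
    col = int(easting // CELL_SIZE_M)
    row = int(northing // CELL_SIZE_M)
    return (col, row, when.year, when.month)


def pair_observations(in_situ, modelled):
    """Pair each in-situ value with the modelled value for the same cell and month.

    in_situ:  iterable of (easting, northing, date, value)
    modelled: dict mapping (col, row, year, month) -> modelled value
    """
    pairs = []
    for easting, northing, when, value in in_situ:
        key = space_time_key(easting, northing, when)
        if key in modelled:
            pairs.append((key, value, modelled[key]))
    return pairs


if __name__ == "__main__":
    # Made-up values for illustration only.
    in_situ = [
        (284300, 287600, date(1979, 6, 14), 0.31),
        (284300, 287600, date(1980, 1, 1), 0.40),
    ]
    modelled = {(284, 287, 1979, 6): 0.28}
    for key, measured, model in pair_observations(in_situ, modelled):
        print(key, measured, model)
```

If such keys were pre-computed and stored as triples, in the same spirit as the SILK links, the comparison would become a plain join rather than a spatial and temporal filter at query time.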
Tools and methods such as SILK and UNION will be very useful, as will further thoughts about how the ontology is constructed (and thus the N-Triples) to allow us to study soil moisture emissions estimates through time.
NOTE: Hasan (2014) has some interesting ideas regarding predicting query execution times that could be used as a flag system when users demand too much data to be displayed in Sextant.