5. Matching different data sources

One common task when working with spatio-temporal data is to match nearby sites. For example, we may want to verify the location of an old list of stations with current stations, or we may want to match the data from different data sources. In this vignette, we will introduce the spatial and temporal matching in cubble using an example on matching river level data with precipitation in Victoria, Australia.

In cubble, spatial and temporal matching are performed using the functions match_spatial() and match_temporal(). The match_spatial() function calculates the spatial distance between observations in two cubble objects. Various distance measures are available (check sf::st_distance). Analysts can specify the number of matched groups to output using the spatial_n_group argument (default to 4 groups) and the number of matches per group using the spatial_n_group argument (default to 1, one-to-one matching). The syntax to use match_spatial() is:

match_spatial(<cubble_obj1>, <cubble_obj2>, ...)

The function match_temporal() calculates the similarity between time series within spatially matched groups. Two identifiers are required: one for separating each matched group (match_id) and one for separating the two data sources (data_id). The argument temporal_by uses the by syntax from dplyr’s *_join to specify the temporal variables to match.

The similarity score between two time series is calculated using a matching function, which can be customised by the analysts. The matching function takes two time series as a list and returns a single numerical score. This allows for flexibility in using existing time series distance calculation implementation. By default, cubble implements a simple peak matching algorithm (match_peak) that counts the number of peaks in two time series that fall within a specified temporal window. The syntax to use match_temporal() is

match_temporal(
  <cubble_obj_from_match_spatial>, 
  data_id = , match_id = , 
  temporal_by = c("..." = "...")
)

Spatial matching

Now let’s consider an example of matching water data from river gauges with precipitation. The water level data, collected by the Bureau of Meteorology, can be compared with the precipitation since rainfall can directly impact water level in river. Here is the location of available weather stations and water gauges in Victoria, Australia:

Both climate_vic and river are cubble objects, and we can obtain a summary of the 10 closest pairs between them:

(res_sp <- match_spatial(climate_vic, river, spatial_n_group = 10))
#> # A tibble: 10 × 4
#>    from        to      dist group
#>    <chr>       <chr>    [m] <int>
#>  1 ASN00088051 406213 1838.     1
#>  2 ASN00084145 222201 2185.     2
#>  3 ASN00085072 226027 3282.     3
#>  4 ASN00080015 406704 4034.     4
#>  5 ASN00085298 226027 4207.     5
#>  6 ASN00082042 405234 6153.     6
#>  7 ASN00086038 230200 6167.     7
#>  8 ASN00086282 230200 6928.     8
#>  9 ASN00085279 224217 7431.     9
#> 10 ASN00080091 406756 7460.    10

The result can also be returned as cubble objects by setting the argument return_cubble = TRUE. The output is be a list where each element is a paired cubble object. To combine all the results into a single cubble, you can use bind_rows(). In the case when a site in the second cubble (the river data here) is matched to two stations in the first cubble (climate_vic here), the binding may not be successful since cubble requires unique rows in the nested form. In the summary table above, the river station 226027 is matched to more than one weather station: ASN00085072 (group 3) and ASN00085298 (group 5). Similarly, the river station 230200 is matched in group 7 and 8). In such cases, you can either deselect one pair before combining, or work with the list output with the purrr::map syntax:

res_sp <- match_spatial(climate_vic, river, spatial_n_group = 10, return_cubble = TRUE)
str(res_sp, max.level = 0)
#> List of 10
res_sp[[1]]
#> # cubble:   key: id [2], index: date, nested form, [sf]
#> # spatial:  [144.52, -37.02, 144.54, -37.02], WGS 84 (CRS84)
#> # temporal: date [date], prcp [dbl], tmax [dbl], tmin [dbl]
#>   id      long   lat  elev name  wmo_id ts       type              geometry
#>   <chr>  <dbl> <dbl> <dbl> <chr>  <dbl> <list>   <chr>          <POINT [°]>
#> 1 ASN00…  145. -37.0   290 rede…  94859 <tibble> clim…  (144.5203 -37.0194)
#> 2 406213  145. -37.0    NA CAMP…     NA <tibble> river (144.5403 -37.01512)
#> # ℹ 2 more variables: group <int>, dist [m]
(res_sp <- res_sp[-c(5, 8)] |> bind_rows())
#> # cubble:   key: id [16], index: date, nested form, [sf]
#> # spatial:  [144.52, -38.14, 148.47, -36.13], WGS 84 (CRS84)
#> # temporal: date [date], prcp [dbl], tmax [dbl], tmin [dbl]
#>    id     long   lat  elev name  wmo_id ts       type              geometry
#>    <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <list>   <chr>          <POINT [°]>
#>  1 ASN0…  145. -37.0 290   rede…  94859 <tibble> clim…  (144.5203 -37.0194)
#>  2 4062…  145. -37.0  NA   CAMP…     NA <tibble> river (144.5403 -37.01512)
#>  3 ASN0…  148. -37.7  62.7 orbo…  95918 <tibble> clim…  (148.4667 -37.6922)
#>  4 2222…  148. -37.7  NA   SNOW…     NA <tibble> river  (148.451 -37.70739)
#>  5 ASN0…  147. -38.1   4.6 east…  94907 <tibble> clim…  (147.1322 -38.1156)
#>  6 2260…  147. -38.1  NA   LA T…     NA <tibble> river (147.1278 -38.14491)
#>  7 ASN0…  145. -36.2  96   echu…  94861 <tibble> clim…  (144.7642 -36.1647)
#>  8 4067…  145. -36.1  NA   DEAK…     NA <tibble> river (144.7693 -36.12866)
#>  9 ASN0…  146. -36.8 502   stra…  95843 <tibble> clim…  (145.7308 -36.8472)
#> 10 4052…  146. -36.9  NA   SEVE…     NA <tibble> river (145.6828 -36.88701)
#> 11 ASN0…  145. -37.7  78.4 esse…  95866 <tibble> clim…  (144.9066 -37.7276)
#> 12 2302…  145. -37.7  NA   MARI…     NA <tibble> river (144.8365 -37.72771)
#> 13 ASN0…  148. -37.9  49.4 bair…  94912 <tibble> clim…  (147.5669 -37.8817)
#> 14 2242…  148. -37.8  NA   MITC…     NA <tibble> river   (147.5722 -37.815)
#> 15 ASN0…  145. -36.3 105   kyab…  95833 <tibble> clim…   (145.0638 -36.335)
#> 16 4067…  145. -36.3  NA   MOSQ…     NA <tibble> river (144.9809 -36.32871)
#> # ℹ 2 more variables: group <int>, dist [m]

Temporal matching

For temporal matching, we match the variable Water_course_level from the river data to prcp in the weather station data. The variable group and types identify the matching group and the two datasets:

(res_tm <- res_sp |> 
  match_temporal(
    data_id = type, match_id = group,
    temporal_by = c("prcp" = "Water_course_level")))
#> # A tibble: 8 × 2
#>   group match_res
#>   <int>     <dbl>
#> 1     1        30
#> 2     2         5
#> 3     3        14
#> 4     4        20
#> 5     6        23
#> 6     7        26
#> 7     9        21
#> 8    10        14

Similarly, the cubble output can be returned using the argument return_cubble = TRUE. Here we select the four pairs with the highest number of matching peaks:

res_tm <- res_sp |> 
  match_temporal(
    data_id = type, match_id = group,
    temporal_by = c("prcp" = "Water_course_level"),
    return_cubble = TRUE)
(res_tm <- res_tm |> bind_rows() |> filter(group %in% c(1, 7, 6, 9)))
#> # cubble:   key: id [8], index: date, nested form, [sf]
#> # spatial:  [144.52, -37.88, 147.57, -36.85], WGS 84 (CRS84)
#> # temporal: date [date], matched [dbl]
#>   id         long   lat  elev name  wmo_id type              geometry group
#>   <chr>     <dbl> <dbl> <dbl> <chr>  <dbl> <chr>          <POINT [°]> <int>
#> 1 ASN00088…  145. -37.0 290   rede…  94859 clim…  (144.5203 -37.0194)     1
#> 2 406213     145. -37.0  NA   CAMP…     NA river (144.5403 -37.01512)     1
#> 3 ASN00082…  146. -36.8 502   stra…  95843 clim…  (145.7308 -36.8472)     6
#> 4 405234     146. -36.9  NA   SEVE…     NA river (145.6828 -36.88701)     6
#> 5 ASN00086…  145. -37.7  78.4 esse…  95866 clim…  (144.9066 -37.7276)     7
#> 6 230200     145. -37.7  NA   MARI…     NA river (144.8365 -37.72771)     7
#> 7 ASN00085…  148. -37.9  49.4 bair…  94912 clim…  (147.5669 -37.8817)     9
#> 8 224217     148. -37.8  NA   MITC…     NA river   (147.5722 -37.815)     9
#> # ℹ 3 more variables: dist [m], ts <list>, match_res <dbl>

And then we can visualise them in space or across time: