Semantic Similarity
20 minutes

import geopandas
import pandas
In this notebook, we will get our hands dirty with Python to unpack the “magic” that the similarity tool shows to find similar areas. At the end of the day, it is all just math.
If you’re coming here fresh, it might be worth quickly reviewing these slides and checking out the explorer tool.
Setup
For this session, we will be using the 2024 version of our embeddings data product for London LSOAs. If you have not downloaded it yet, have a look at the data section in Materials.
We’ll go ahead and read the file with geometries and the embeddings:
db = geopandas.read_file('uk_lsoa_london_embeds_2024.geojson')
Since we will be working quite a bit with the embedding part of the table only, we will set it apart:
embeddings = db.loc[:, 'A00_mean':'A63_mean']

Building semantic distances
Embeddings are compressed representations of the information in the imagery used to build them. Their dimensions (64 in this case), contain as much statistical variation as possible to describe what the pixels contain. We can think of them as coordinates in a multi-dimensional space that maps each area to a location that corresponds with the attributes and characteristics recorded in their embedding. As such, if we think of embeddings as coordinates, we can see these locations in a relational way: the areas corresponding to two nearby embeddings will be similar, while those relating to two distant embeddings will tend to be more dissimilar. Thinking about embeddings as semantic locations (where you are relates to who you are in terms of your characteristics, not to physical geography) then allows us to think of distances between coordinates as a way to quantitatively measure similarity. That is exactly the approach we’ll take here.
There are two key differences between calculating standard Euclidean distances in a two-dimensional (e.g., geographical) space and computing semantic distances using embeddings. The first is that we have more than two dimensions (X and Y) to locate a given area. In our case, we actually have 64, the number of dimensions used by the embedding model that produced our dataset. This is more complicated to understand conceptually than practically because, in practical terms, the distance equations we'll use extend naturally to more than two dimensions. The second is that the Euclidean formula is widely understood not to be the best choice here. Instead, we will rely on a different type of distance: cosine. Unpacking why is beyond the scope of today but, if you're curious, it has to do with the properties of the cosine equation, which are much more desirable for this use case.
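To get a feel for the difference, here is a minimal sketch (the vectors and their values are made up for illustration) comparing the two distances with NumPy. Cosine distance depends only on the direction of the vectors, not their magnitude:

```python
import numpy as np

# Two hypothetical 4-dimensional "embeddings"; b points in the same
# direction as a but has twice the magnitude
a = np.array([0.2, 0.5, 0.1, 0.7])
b = 2 * a

# Euclidean distance: sensitive to magnitude, so a and b look far apart
euclidean = np.sqrt(((a - b) ** 2).sum())

# Cosine distance: 1 - cos(angle between a and b); only direction
# matters, so a and b are considered identical
cosine = 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean)  # the norm of a, roughly 0.89
print(cosine)     # 0 (up to floating point error)
```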
OK, enough talk, let's code. Even though our data set is not particularly large, it is big enough that calculating all the potential pairwise distances is not a good idea. Instead, we will use an algorithm called nearest neighbours, which uses clever math to only calculate distances between observations that are closest to each other. This makes the computation much more feasible. We can easily implement it using the famous library scikit-learn and its NearestNeighbors class:
from sklearn.neighbors import NearestNeighbors

Now, we want to calculate the K nearest neighbours for every area, based on the embeddings. Practically speaking, this is equivalent to fitting the nearest neighbours algorithm to the embedding table and then using it to extract the neighbours of each area. Here's how you can do it in a couple of lines of Python:
nn = NearestNeighbors(
n_neighbors=6,
metric='cosine',
)

- 1
- Set the number of nearest neighbours we want to find.
- 2
- Explicitly set cosine as the distance metric.
The above defines the nearest neighbour method, but it does not compute anything yet. For that we have to fit it to our data:
nn.fit(embeddings)

NearestNeighbors(metric='cosine', n_neighbors=6)
With this, we can now extract the list of \(k\) (in this case, \(k=5\)) nearest neighbors for each area:
dist, neigh = nn.kneighbors(embeddings)

This produces two outputs: dist stores the distances between each area and its nearest neighbours, and neigh contains the IDs of those neighbours. For this example, we want to use the neighbour IDs rather than the distances. The output object here is an array, which is a bit more cumbersome to work with, and it is positionally indexed. Instead, we would ideally want the table indexed on the area codes. With a little bit of Python arithmetic, we can make this happen (note that we also remove the first column because it is, by construction, the area itself):
neigh_codes = (
pandas.DataFrame(neigh[:, 1:], index=db['data_zone_code'])
.map(lambda i: db['data_zone_code'][i])
)

- 1
- Create a data frame with the neighbours, excluding the area itself.
- 2
- Replace, element by element, the positional index with the area code it represents.
Visualising semantic neighbors
With this, we're good to go to explore semantic similarity. Let's pick E01035718, the area that contains Hyde Park:
db.query('data_zone_code=="E01035718"').geometry.explore()

Let's extract the nearest neighbours of this area (we turned the original array into a data frame precisely to make this kind of search easier):
neigh_codes.loc["E01035718", :]

0    E01004727
1 E01004483
2 E01003375
3 E01004223
4 E01003892
Name: E01035718, dtype: object
Now we can compare the two. To get a sense of what the area looks like, we can swap the basemap to satellite imagery1:
from IPython.display import display, HTML
m1 = (
db
.query('data_zone_code=="E01035718"')
.geometry
.explore(
tiles='esriworldimagery',
color='#03CEA3',
zoom_control=False
)
)
m2 = (
db
.query('data_zone_code=="E01004727"')
.geometry
.explore(
tiles='esriworldimagery',
color='#FF8F42',
zoom_control=False
)
)
# Side-by-side display
display(HTML(
"<div style='display:flex'>" +
f"<div style='width:50%'>{m1._repr_html_()}</div>" +
f"<div style='width:50%'>{m2._repr_html_()}</div></div>"
))

- 1
- Create map on the left
- 2
- Render web map from selected polygon
- 3
- Use satellite imagery for the basemap
- 4
- Green for the left polygon
- 5
- Remove zoom buttons
- 6
- Build map on the right in the same way as for the left one
- 7
- Stitch both maps with a (hacky) bit of HTML
Try for yourself!!!
If this has made sense to you, see if you can reproduce the process for the nearest neighbours of other areas. Here are some examples for you to play around with (if you're not familiar with them, Google their location and identify their LSOA with db[['data_zone_code', 'geometry']].explore()):
- Canary Wharf
- Euston station
- Heathrow airport
Footnotes
This is a “good enough” strategy in that we do not control when the imagery was captured. In most cases, it is a good proxy for what the embeddings represent; in cases where there is a large temporal gap between when the basemap imagery was captured and when the imagery used for the embeddings was captured, and the area changed substantially in between, the basemap imagery will not be a good approximation.↩︎