Semantic Similarity
20 minutes

import geopandas
import pandas
In this notebook, we will get our hands dirty with Python to unpack the “magic” that the similarity tool shows to find similar areas. At the end of the day, it is all just math.
If you’re coming here fresh, it might be worth quickly reviewing these slides and checking out the explorer tool.
Setup
For this session, we will be using the 2024 version of our embeddings data product for London LSOAs. If you have not downloaded it yet, have a look at the data section in Materials.
We’ll go ahead and read the file with geometries and the embeddings:
db = geopandas.read_file('uk_lsoa_london_embeds_2024.geojson')
Since we will be working quite a bit with the embedding part of the table only, we will set it apart:
embeddings = db.loc[:, 'A00_mean':'A63_mean']

Building semantic distances
Embeddings are compressed representations of the information in the imagery used to build them. Their dimensions (64 in this case), contain as much statistical variation as possible to describe what the pixels contain. We can think of them as coordinates in a multi-dimensional space that maps each area to a location that corresponds with the attributes and characteristics recorded in their embedding. As such, if we think of embeddings as coordinates, we can see these locations in a relational way: the areas corresponding to two nearby embeddings will be similar, while those relating to two distant embeddings will tend to be more dissimilar. Thinking about embeddings as semantic locations (where you are relates to who you are in terms of your characteristics, not to physical geography) then allows us to think of distances between coordinates as a way to quantitatively measure similarity. That is exactly the approach we’ll take here.
There are two key differences between calculating standard Euclidean distances in a two-dimensional (e.g., geographical) space and computing semantic distances using embeddings. The first is that we have more than two dimensions (X and Y) to locate a given area. In our case, we actually have 64, the number of dimensions used by the embedding model that produced our dataset. This is more complicated to understand conceptually than practically because, in practical terms, the distance equations we'll use extend naturally to more than two dimensions. The second is that the Euclidean formula is widely understood not to be the best choice here. Instead, we will rely on a different type of distance: cosine. Unpacking why is beyond the scope of today but, if you're curious, it has to do with the properties of the cosine equation, which are much more desirable for this use case.
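To get a feel for the difference, here is a minimal sketch (the vectors and their values are made up for illustration) comparing the two distances with NumPy. Cosine distance depends only on the direction of the vectors, not their magnitude:

```python
import numpy as np

# Two hypothetical 4-dimensional "embeddings"; b points in the same
# direction as a but has twice the magnitude
a = np.array([0.2, 0.5, 0.1, 0.7])
b = 2 * a

# Euclidean distance: sensitive to magnitude, so a and b look far apart
euclidean = np.sqrt(((a - b) ** 2).sum())

# Cosine distance: 1 - cos(angle between a and b); only direction
# matters, so a and b are considered identical
cosine = 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean)  # the norm of a, roughly 0.89
print(cosine)     # 0 (up to floating point error)
```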
OK, enough talk, let's code. Even though our data set is not particularly large, it is big enough that calculating all the potential pairwise distances is not a good idea. Instead, we will use an algorithm called nearest neighbours, which uses clever math to only calculate distances between observations that are closest to each other. This makes the computation much more feasible. We can easily implement it using the famous library scikit-learn and its NearestNeighbors class:
from sklearn.neighbors import NearestNeighbors

Now, we want to calculate the K nearest neighbours for every area, based on the embeddings. Practically speaking, this is equivalent to fitting the nearest neighbours algorithm to the embedding table and then using it to extract the neighbours of each area. Here's how you can do it in a couple of lines of Python:
nn = NearestNeighbors(
n_neighbors=6,
metric='cosine',
)

- 1
- Set the number of nearest neighbours we want to find.
- 2
- Explicitly set cosine as the distance metric.
The above defines the nearest neighbour method, but it does not compute anything yet. For that we have to fit it to our data:
nn.fit(embeddings)

NearestNeighbors(metric='cosine', n_neighbors=6)
With this, we can now extract the list of \(k\) (in this case, \(k=5\)) nearest neighbors for each area:
dist, neigh = nn.kneighbors(embeddings)

This produces two outputs: dist stores the distances between each area and its nearest neighbours, and neigh contains the IDs of those neighbours. For this example, we want to use the neighbour IDs rather than the distances. The output object here is an array, which is a bit more cumbersome to work with, and it is positionally indexed. Instead, we would ideally want the table indexed on the area codes. With a little bit of Python arithmetic, we can make this happen (note that we also remove the first column because it is, by construction, the area itself):
neigh_codes = (
pandas.DataFrame(neigh[:, 1:], index=db['data_zone_code'])
.map(lambda i: db['data_zone_code'][i])
)

- 1
- Create a data frame with the neighbours, excluding the area itself.
- 2
- Replace, element by element, the positional index with the area code it represents.
Visualising semantic neighbors
With this, we're good to go to explore semantic similarity. Let's pick E01035718, the area that contains Hyde Park:
db.query('data_zone_code=="E01035718"').geometry.explore()

Let's extract the nearest neighbours of this area (we turned the original array into a data frame precisely to make this kind of search easier):
neigh_codes.loc["E01035718", :]

0    E01004727
1 E01004483
2 E01003375
3 E01004223
4 E01003892
Name: E01035718, dtype: object
Now we can compare the two. To get a sense of what the area looks like, we can swap the basemap to satellite imagery1:
from IPython.display import display, HTML
m1 = (
db
.query('data_zone_code=="E01035718"')
.geometry
.explore(
tiles='esriworldimagery',
color='#03CEA3',
zoom_control=False
)
)
m2 = (
db
.query('data_zone_code=="E01004727"')
.geometry
.explore(
tiles='esriworldimagery',
color='#FF8F42',
zoom_control=False
)
)
# Side-by-side display
display(HTML(
"<div style='display:flex'>" +
f"<div style='width:50%'>{m1._repr_html_()}</div>" +
f"<div style='width:50%'>{m2._repr_html_()}</div></div>"
))

- 1
- Create map on the left
- 2
- Render web map from selected polygon
- 3
- Use satellite imagery for the basemap
- 4
- Green for the left polygon
- 5
- Remove zoom buttons
- 6
- Build map on the right in the same way as for the left one
- 7
- Stitch both maps with a (hacky) bit of HTML
Try for yourself!!!
If this has made sense to you, see if you can reproduce the process for the nearest neighbours of other areas. Here are some examples for you to play around with (if you're not familiar with them, Google their location and identify their LSOA with db[['data_zone_code', 'geometry']].explore()):
- Canary Wharf
- Euston station
- Heathrow airport
Footnotes
This is a “good enough” strategy in that we do not control when the imagery was captured. In most cases, it is a good proxy for what the embeddings represent; in cases where there is a large temporal gap between when the basemap imagery was captured and when the imagery used for the embeddings was captured, and the area changed substantially in between, the basemap imagery will not be a good approximation.↩︎