This lab provides a practical overview of how satellite image embeddings, compact numerical representations of imagery, can be used as a general-purpose way of describing geographic space.
The lab begins with unsupervised analysis, where embeddings are used to explore and structure geographic space without predefined labels. This demonstrates how they support exploratory analysis and the identification of place-based patterns. The same embeddings are then reused in a predictive context, showing how a single representation can support multiple analytical tasks without rebuilding the feature pipeline.
The emphasis throughout is on reusability and judgement:
how embeddings enable faster and more flexible exploration,
how they integrate into standard analytical workflows,
when they meaningfully improve predictive performance.
The lab is not about training complex models, but about understanding how and when embeddings add value in spatial analysis.
Learning outcomes
By the end of the lab, participants will be able to:
Interpret satellite image embeddings as representations of geographic space.
Develop and evaluate predictive models using embeddings.
Critically assess when embeddings improve analytical or predictive performance.
Data
Embedding data
Satellite image embeddings derived from Google imagery for London (2024).
Embeddings are aggregated to Lower-layer Super Output Area (LSOA) level using the mean value for each embedding dimension.
The final dataset contains 64 embedding variables (A00_mean to A63_mean).
Provided as a GeoPackage: uk_lsoa_london_embeds_2024.gpkg.
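The aggregation step described above can be sketched with pandas. This is a minimal illustration on toy values, not the pipeline used to build the lab dataset: the pixel-level table, its assignment of pixels to `LSOA21CD` codes, and the two-dimension embedding are all assumptions made for the example.

```python
import pandas as pd

# Toy pixel-level embeddings (hypothetical values): each row is one pixel,
# already assigned to the LSOA it falls in.
pixels = pd.DataFrame(
    {
        "LSOA21CD": ["E01000001", "E01000001", "E01000002", "E01000002"],
        "A00": [0.10, 0.30, -0.20, 0.00],
        "A01": [0.50, 0.70, 0.40, 0.20],
    }
)

# Mean of each embedding dimension within each LSOA
lsoa_means = pixels.groupby("LSOA21CD").mean()

# Rename columns to follow the A00_mean, A01_mean, ... naming pattern
lsoa_means = lsoa_means.add_suffix("_mean").reset_index()
print(lsoa_means)
```

Averaging keeps one row per LSOA and one column per embedding dimension, at the cost of discarding within-LSOA variation.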
Socio-demographic data
Index of Multiple Deprivation (IMD)
- IMD 2025 provides a relative measure of deprivation across 33,755 LSOAs (2021 boundaries) in England.
- Values are reported in deciles (10% bands from most to least deprived).
- Source: Ministry of Housing, Communities and Local Government GeoPortal: https://communitiesopendata-communities.hub.arcgis.com/
Socio-economic and population characteristics (Census 2021)
- Percentage of population aged 16+ with no educational qualifications
- Percentage reporting bad or very bad health
- Population density (persons per km²)
- Percentage of single-parent households
- Source: NOMIS: https://www.nomisweb.co.uk/sources/census_2021_bulk
All socio-demographic variables have been linked at LSOA level and are provided in Socioeconomic.csv, located in the data folder.
Import Libraries
Before working with the data, a number of Python libraries are required to support data handling, spatial analysis, and modelling tasks in this lab.
These libraries provide functionality for:
- reading and manipulating tabular and spatial data
- performing numerical and statistical operations
- visualising spatial patterns
- applying clustering and predictive models
All code in this lab assumes that these libraries are available in the working environment and have been imported prior to the analytical steps.
Code
# -----------------------------------------
# Core libraries
# -----------------------------------------
# Spatial data handling
import pandas as pd  # Tabular data manipulation
import geopandas as gpd  # Vector GIS data and geometry handling
import json  # JSON handling

# -----------------------------------------
# Local functions
# -----------------------------------------
# Import specific functions from local module
from tools import (
    filter_table,
    get_embedding_cols,
    kmeans_clustering,
    show_cluster_labels,
    plot_simple_map,
    parse_reference_points,
    make_webmap_general,
    closest_lsoas_to_cluster,
    map_closest_lsoas,
    plot_embedding_distances,
    run_rf_classifier,
    plot_feature_importance,
    classify_for_mapping,
    show_two_maps_side_by_side,
)
Load data
Import the dataset into the working environment so it can be processed and analysed.
Code
# Load the embedding data from file
file_path = "uk_lsoa_london_embeds_2024.geojson"

# Read the GeoJSON file and convert it into a GeoDataFrame
with open(file_path, "r") as f:
    gdf = gpd.GeoDataFrame.from_features(json.load(f))
Display information about data
Check that the data has loaded successfully by viewing basic information and sample records.
Code
# Display coordinate reference system (CRS)
# This confirms the spatial reference used by the dataset
print(f"The CRS for this dataset is: {gdf.crs}")

# Preview the data
# Display the first few rows to inspect structure and attributes
gdf.head()
Preprocessing
Reduce the dataset to only the embedding variables and key identifiers (e.g. LSOA codes/names), while retaining geometry. This ensures the analysis focuses on the embedding features and keeps the data suitable for spatial mapping.
Code
# Filter dataset to retain only relevant variables
# Keeps identifier fields, embedding features, and geometry
gdf = filter_table(gdf)

# Preview the filtered dataset
# Confirm that only the required columns remain
gdf.head()
The filtered dataset no longer includes unnecessary attributes (e.g. ‘dzcode’). This reduces the number of variables and ensures the analysis focuses on the embedding features and key identifiers.
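`filter_table` comes from the lab's local `tools` module, so its implementation is not shown here. As a rough idea of what such a filter could look like, the sketch below keeps identifier columns, columns matching the `A00_mean` to `A63_mean` pattern, and geometry. The function name `filter_table_sketch`, its parameters, and the toy table are hypothetical and are not the lab's actual code.

```python
import re

import pandas as pd


def filter_table_sketch(table, id_cols=("LSOA21CD", "LSOA21NM"), geometry_col="geometry"):
    """Keep identifier columns, embedding columns (Axx_mean), and geometry."""
    pattern = re.compile(r"^A\d{2}_mean$")
    embedding_cols = [c for c in table.columns if pattern.match(c)]
    keep = [c for c in id_cols if c in table.columns] + embedding_cols
    if geometry_col in table.columns:
        keep.append(geometry_col)
    return table[keep]


# Toy table with one unwanted attribute ("dzcode")
toy = pd.DataFrame(
    {
        "LSOA21CD": ["E01000001"],
        "LSOA21NM": ["City of London 001A"],
        "dzcode": ["x"],
        "A00_mean": [0.12],
        "A01_mean": [-0.05],
        "geometry": [None],
    }
)

filtered = filter_table_sketch(toy)
print(list(filtered.columns))
```

Selecting columns by pattern rather than by position keeps the filter robust if extra attributes appear in the source file.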
Predictive modelling
Predictive modelling uses existing data to train a model that estimates or classifies an outcome for new locations, allowing patterns learned from embeddings and other features to be applied beyond the observed data.
Load socioeconomic data
This dataset includes the Index of Multiple Deprivation (IMD) deciles, along with additional variables identified in the literature as influencing deprivation. IMD deciles group areas into ten categories, where lower deciles indicate higher levels of deprivation and higher deciles indicate lower levels of deprivation.
Code
# Load data
df = pd.read_csv('Socioeconomic.csv')
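To make the decile coding concrete, the sketch below derives deciles from a set of deprivation ranks, with rank 1 as the most deprived area. The ranks are hypothetical and the use of `pd.qcut` is an illustrative choice; the deciles in Socioeconomic.csv are already supplied and do not need to be recomputed.

```python
import pandas as pd

# Hypothetical deprivation ranks for 20 areas: rank 1 = most deprived
ranks = pd.Series(range(1, 21), name="IMD_rank")

# Split the ranks into ten equal-sized groups labelled 1 to 10,
# so decile 1 holds the most deprived 10% of areas
deciles = pd.qcut(ranks, q=10, labels=list(range(1, 11))).astype(int)

print(deciles.value_counts().sort_index())
```

Because deciles are defined relative to all ranked areas, a subset such as London will not necessarily contain equal numbers of areas in each decile.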
Display the data
Code
df
Add socioeconomic data to our embedding data
The two datasets are linked using the common identifier LSOA21CD, allowing the embedding variables to be combined with the socioeconomic variables for each LSOA.
Code
# Join embedding data with socioeconomic data
# Merge on the common LSOA identifier to combine both datasets
gdf2 = gdf.merge(
    df,
    on="LSOA21CD",  # Shared key between datasets
    how="left"  # Keep all rows from gdf (embeddings), add matching socioeconomic data
)
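A left join keeps every embedding row even when no socioeconomic record matches, which leaves missing values in the joined columns. The sketch below shows one way to check for this using the `validate` and `indicator` options of pandas `merge`. The two toy tables are stand-ins for the lab data, not the real files.

```python
import pandas as pd

# Toy stand-ins for the embedding table and the socioeconomic table
embeds = pd.DataFrame(
    {
        "LSOA21CD": ["E01000001", "E01000002", "E01000003"],
        "A00_mean": [0.12, -0.05, 0.30],
    }
)
socio = pd.DataFrame(
    {
        "LSOA21CD": ["E01000001", "E01000002"],
        "IMD_decile": [7, 3],
    }
)

merged = embeds.merge(
    socio,
    on="LSOA21CD",
    how="left",
    validate="one_to_one",  # Raise an error if LSOA21CD is duplicated on either side
    indicator=True,  # Add a _merge column recording where each row was found
)

# Rows present in the embedding table but missing from the socioeconomic table
unmatched = merged.loc[merged["_merge"] == "left_only", "LSOA21CD"].tolist()
print(unmatched)
```

Any unmatched LSOAs would have no `IMD_decile` value, so it is worth confirming this list is empty before modelling.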
Display updated data
Confirm that embedding and socioeconomic variables have been successfully merged.
Code
# Display updated data
gdf2
Display the IMD data
Visualising the IMD data on a map helps reveal the spatial distribution of deprivation across London and makes patterns easier to interpret than in tabular form.
Code
# -----------------------------------------
# Interactive map of IMD deciles
# -----------------------------------------
# Add reference points (Name, latitude, longitude)
# These provide spatial context and help orient the map
pois = """
"Westminster", 51.4975, -0.1357
"City of London", 51.5155, -0.0922
"Canary Wharf", 51.5054, -0.0235
"King's Cross", 51.5308, -0.1238
"Heathrow Airport", 51.4700, -0.4543
"WE ARE HERE :)", 51.4962, -0.1298
"""

# Ensure the GeoDataFrame has a coordinate reference system (CRS)
# Use EPSG:4326 only if the coordinates are already longitude/latitude
if gdf2.crs is None:
    gdf2 = gdf2.set_crs(epsg=4326)

# Create interactive map
# Visualises IMD deciles across LSOAs with tooltips and optional basemaps
m = make_webmap_general(
    focus_gdf=gdf2,
    focus_col="IMD_decile",
    focus_name="LSOA IMD Deciles",
    focus_tooltip_cols=("LSOA21CD", "LSOA21NM", "IMD_decile"),
    focus_categorical=True,
    focus_legend=True,
    focus_cmap="YlOrRd_r",
    focus_style_kwds={
        "fillOpacity": 0.6,
        "weight": 0.2,
        "color": "black",
    },
    context_gdf=None,
    pois=pois,
    fit_to="focus",
    zoom_start=10,
)

# Display map
m
Create a model to predict IMD using ONLY socioeconomic variables
The model uses variables identified in previous studies as predictors of deprivation, including:
- Percentage with no qualifications (age 16+)
- Percentage reporting bad or very bad health
- Population density (per km²)
- Percentage of lone-parent households
A Random Forest model is used, which is an ensemble of decision trees. It can capture non-linear relationships and interactions between variables without requiring assumptions such as proportional odds used in ordinal logistic regression.
Code
# -----------------------------------------
# Run Random Forest model (socioeconomic variables only)
# -----------------------------------------
# Train a classifier to predict IMD decile using selected socioeconomic predictors
results = run_rf_classifier(
    data=gdf2,
    y_col="IMD_decile",  # Target variable (deprivation decile)
    x_cols=[
        "Percent no qualifications 16 and over",
        "Percent bad and very band health",
        "Population density per km",
        "Percent lone family household",
    ],  # Predictor variables
)
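`run_rf_classifier` is defined in the lab's local `tools` module, so its internals are not visible here. The sketch below shows the general shape such a workflow could take with scikit-learn: split the data, fit a Random Forest, and score it on the held-out portion. The synthetic data, the stratified split, and the choice of accuracy as the metric are assumptions made for illustration, not a description of what the helper does.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in data: 500 areas, 4 predictors
X = rng.normal(size=(500, 4))

# A noisy score driven by the first two predictors
score = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500)

# Convert the score to classes 1..10 using its own decile cut points
cuts = np.quantile(score, np.linspace(0.1, 0.9, 9))
y = np.digitize(score, cuts) + 1

# Hold out 20% of areas for testing, preserving class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

With ten classes, a model that guesses at random would be right about 10% of the time, which is a useful baseline when reading accuracy figures for decile prediction.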
Create a model to predict IMD using ONLY embedding variables
Code
# -----------------------------------------
# Run Random Forest model (embedding features only)
# -----------------------------------------
# Define embedding feature columns (A00_mean to A63_mean)
feature_cols = [f"A{i:02d}_mean" for i in range(64)]

# Train classifier to predict IMD decile using embedding features
results = run_rf_classifier(
    data=gdf2,
    y_col="IMD_decile",  # Target variable (deprivation decile)
    x_cols=feature_cols,  # Embedding predictors
    test_size=0.2,  # Proportion of data used for testing
    random_state=42,  # Ensures reproducibility
    n_estimators=500,  # Number of trees in the forest
    class_weight="balanced",  # Handle class imbalance
)
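The `class_weight="balanced"` setting matters because London's LSOAs need not be evenly spread across the national deciles. Assuming the helper passes this value through to scikit-learn's `RandomForestClassifier`, each class is weighted by `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula with NumPy on hypothetical labels.

```python
import numpy as np

# Hypothetical decile labels for 10 areas, with decile 1 under-represented
y = np.array([1, 2, 2, 2, 3, 3, 3, 3, 3, 3])

classes, counts = np.unique(y, return_counts=True)

# "balanced" heuristic: n_samples / (n_classes * class_count)
weights = len(y) / (len(classes) * counts)

for c, w in zip(classes, weights):
    print(f"decile {c}: weight {w:.3f}")
```

Rare classes receive larger weights, so errors on them count for more during training.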
Create a model to predict IMD using both socioeconomic and embedding variables
Code
# -----------------------------------------
# Run Random Forest model (embeddings + socioeconomic variables)
# -----------------------------------------
# Define predictor variables
# Combine embedding features with selected socioeconomic indicators
feature_cols = (
    [f"A{i:02d}_mean" for i in range(64)]
    + [
        "Percent no qualifications 16 and over",
        "Percent bad and very band health",
        "Population density per km",
        "Percent lone family household",
    ]
)

# Train classifier to predict IMD decile
results = run_rf_classifier(
    data=gdf2,
    y_col="IMD_decile",  # Target variable (deprivation decile)
    x_cols=feature_cols,  # Combined predictors
    test_size=0.2,  # Proportion used for testing
    random_state=42,  # Reproducibility
    n_estimators=500,  # Number of trees
    class_weight="balanced",  # Handle class imbalance
)
Plotting the top 15 variables
Code
# -----------------------------------------
# Plot feature importance
# -----------------------------------------
# Visualise the most important predictors from the Random Forest model
plot_feature_importance(
    results["importances"],  # Feature importance scores from the model
    top_n=15,  # Display top 15 features
    title="Top 15 Features Predicting IMD Decile",
)
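The structure of `results["importances"]` is defined by the local helper and is not documented here. As a general illustration of where such scores can come from, the sketch below fits a small Random Forest on synthetic data and ranks scikit-learn's impurity-based `feature_importances_`. The data and the column names used for it are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data: only the first column carries signal about the target
feature_names = [
    "A00_mean",
    "A01_mean",
    "A02_mean",
    "Percent no qualifications 16 and over",
]
X = rng.normal(size=(400, 4))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across features
importances = pd.Series(model.feature_importances_, index=feature_names)
top = importances.sort_values(ascending=False).head(3)
print(top)
```

Importance scores describe how much the model relied on a feature, not a causal effect. When predictors are correlated, as embedding dimensions may be, importance can be shared across them.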
Compare variables - side by side
Code
# -----------------------------------------
# Compare two variables side by side on maps
# -----------------------------------------
# Visualise IMD deciles alongside a socioeconomic variable
# Both variables are displayed using 10 quantile-based classes for comparability
show_two_maps_side_by_side(
    gdf2,
    left_var="IMD_decile",  # Deprivation (target variable)
    right_var="Percent no qualifications 16 and over",  # Socioeconomic predictor
    tooltip_cols=("LSOA21CD", "LSOA21NM"),  # Information shown on hover
    scheme="quantile",  # Classification method
    k=10,  # Number of classes
)
Code
# -----------------------------------------
# Compare two variables using equal intervals
# -----------------------------------------
# Visualise IMD deciles alongside a socioeconomic variable
# Values are grouped into 10 equal-width intervals for comparison
show_two_maps_side_by_side(
    gdf2,
    left_var="IMD_decile",  # Deprivation (target variable)
    right_var="Percent no qualifications 16 and over",  # Socioeconomic predictor
    scheme="equal",  # Equal-interval classification
    k=10,  # Number of classes
    tooltip_cols=("LSOA21NM",),  # Information shown on hover
)
Code
# -----------------------------------------
# Compare two variables using natural breaks
# -----------------------------------------
# Visualise IMD deciles alongside a socioeconomic variable
# Values are grouped using natural breaks (Jenks), which highlight
# inherent groupings in the data
show_two_maps_side_by_side(
    gdf2,
    left_var="IMD_decile",  # Deprivation (target variable)
    right_var="Percent no qualifications 16 and over",  # Socioeconomic predictor
    scheme="natural",  # Natural breaks (Jenks)
    k=10,  # Number of classes
    tooltip_cols=("LSOA21NM",),  # Information shown on hover
)
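The choice of classification scheme changes how the same values are grouped on a map. The sketch below contrasts quantile and equal-interval breaks using NumPy on a small set of hypothetical skewed values. It uses four classes rather than ten to keep the output readable, and it leaves out natural breaks, which need a dedicated algorithm.

```python
import numpy as np

# Hypothetical skewed values, e.g. a percentage variable across 12 areas
values = np.array([2, 3, 3, 4, 5, 5, 6, 8, 10, 15, 25, 40], dtype=float)
k = 4  # Number of classes (the maps above use k=10)

# Quantile breaks: each class holds roughly the same number of areas
quantile_breaks = np.quantile(values, np.linspace(0, 1, k + 1))

# Equal-interval breaks: each class spans the same range of values
equal_breaks = np.linspace(values.min(), values.max(), k + 1)

# Count areas per class under each scheme
quantile_counts, _ = np.histogram(values, bins=quantile_breaks)
equal_counts, _ = np.histogram(values, bins=equal_breaks)

print("quantile:", quantile_counts)
print("equal:   ", equal_counts)
```

With these values, the quantile scheme puts three areas in every class, while the equal-interval scheme puts nine of the twelve areas in the lowest class. This is why skewed variables can look very different under the two schemes.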
YOUR TURN: Choose the most influential embedding variable and visually compare it to the IMD variable to explore their spatial relationship, and briefly interpret any patterns or contrasts observed.