pgRouting on (very) large graphs
Posted on March 23, 2023 in Dev • 4 min read
I recently performed routing and graph analysis on a very large graph built on top of some OpenData files. The underlying graph had about 40M edges, which is roughly the size of the road network mapped by OpenStreetMap in North America. Not that much, but still a bit tricky to handle.
The main issue was that the source OpenData files were geometric representations of the features (GeoJSON features, actually bunches of GeoJSON Point and LineString objects), not an actual graph network. This is very different from starting directly from OpenStreetMap data, which are directly (and correctly) noded out of the box: contrary to OSM data, where ways are built on top of nodes so that connected ways share a common node and a graph structure comes up naturally, I had a collection of LineString objects, each built upon its own local set of Point coordinates, which had to be reconciled first. See this illustration from the pgRouting doc for a comparison of these two situations.
Given these constraints:
- The usual OSM routing stack (OSRM, GraphHopper) could not be used out of the box, short of implementing a sophisticated data transform (noding the graph).
- The usual Python stack (networkx and derived libraries) could not be used either: the graph was too large to be stored in memory, and it would not give meaningful routing results until the graph was correctly noded.
So, I ended up using pgRouting. pgRouting is a perfect match for this task:
* It has a very detailed documentation.
* It has a ready-to-use function to build a graph topology out of a collection of LineString objects, which is the precise transformation I needed to be able to route on my graph.
* It comes with many ready-to-use functions to route with various algorithms and to analyze the graph (holes, connected components, etc.).
Given these features and its ease of use together with the usual PostgreSQL/PostGIS/GDAL stack, it is very straightforward to quickly get a minimal processing pipeline running on a small subset of your graph.
Minimal processing pipeline on a small subset of your graph
First, import your GeoJSON features into a PostgreSQL/PostGIS database and filter on some bbox to get a small subset of your graph that is fine (performance-wise) for tuning your processing pipeline:
ogr2ogr -f "PostgreSQL" PG:"dbname=$PGDATABASE" ./opendata.geojson -explodecollections
psql -c 'CREATE UNLOGGED TABLE opendata_partial AS SELECT * FROM opendata WHERE wkb_geometry && ST_MakeEnvelope(left,bottom,right,top, 4326);'
Note: I’m creating an UNLOGGED table here since I do not worry about data consistency after a server crash, and this comes with performance improvements.
At this point, you can easily design your pre-processing workflow on this opendata_partial table. This includes:
- Fixing holes in the data
- Filtering out broken features/geometries (see the sketch below)
- etc.
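As an illustration, here is a minimal sketch of such a cleaning step, assuming the column names created by ogr2ogr above (wkb_geometry, ogc_fid); the exact filters are hypothetical and depend on what “broken” means in your dataset:

-- Keep only non-empty, valid LineString geometries (hypothetical
-- filters: adapt them to your own data issues).
CREATE UNLOGGED TABLE opendata_partial_preprocessed AS
SELECT *
FROM opendata_partial
WHERE wkb_geometry IS NOT NULL
  AND NOT ST_IsEmpty(wkb_geometry)
  AND ST_IsValid(wkb_geometry)
  AND GeometryType(wkb_geometry) = 'LINESTRING';

-- A spatial index speeds up pgr_createTopology and the nearest-edge
-- lookups used for routing below.
CREATE INDEX ON opendata_partial_preprocessed USING gist (wkb_geometry);

This also conveniently yields the opendata_partial_preprocessed table used in the rest of this post.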
From there, creating your graph topology on top of your collection of features is as simple as running pgr_createTopology:
\set tolerance 1e-6
ALTER TABLE opendata_partial_preprocessed ADD COLUMN source integer;
ALTER TABLE opendata_partial_preprocessed ADD COLUMN target integer;
SELECT pgr_createTopology(
    'opendata_partial_preprocessed',
    :tolerance,
    'wkb_geometry',
    'ogc_fid',
    'source',
    'target'
);
assuming your features lie in an opendata_partial_preprocessed table, with a geometry column named wkb_geometry and a unique id column named ogc_fid. This call populates the newly created source and target columns, which store the actual graph (each row of the table represents an edge from node id source to node id target).
The topology is created up to the given tolerance (nodes are merged up to this distance). Beware that it is expressed in the same coordinate system as your geometries (probably EPSG:4326, where 1e-6 degree is roughly 10 cm in France).
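Before routing, it can be worth sanity-checking the resulting topology with pgr_analyzeGraph, which takes the same arguments as pgr_createTopology and reports (as notices) potential issues such as isolated segments, dead ends or gaps:

SELECT pgr_analyzeGraph(
    'opendata_partial_preprocessed',
    :tolerance,
    'wkb_geometry',
    'ogc_fid',
    'source',
    'target'
);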
You can then route on this graph (directed or undirected, depending on the value you assign to reverse_cost) and compute analyses on it (find dead ends or extract connected components).
For example, to compute routing:
\set start_lon TODO
\set start_lat TODO
\set end_lon TODO
\set end_lat TODO
WITH
-- Project the start point on the closest point of the graph
-- (source is picked in an arbitrary way here, it could have been target)
start AS (
    SELECT source
    FROM opendata_partial_preprocessed
    ORDER BY wkb_geometry <-> ST_SetSRID(ST_GeomFromText('POINT (' || :'start_lon' || ' ' || :'start_lat' || ')'), 4326)
    LIMIT 1
),
-- Project the destination point on the closest point of the graph
destination AS (
    SELECT source
    FROM opendata_partial_preprocessed
    ORDER BY wkb_geometry <-> ST_SetSRID(ST_GeomFromText('POINT (' || :'end_lon' || ' ' || :'end_lat' || ')'), 4326)
    LIMIT 1
)
-- Route with Dijkstra and join with the edges table
-- (opendata_partial_preprocessed) to get the geometry of each step
SELECT
    di.seq,
    pt.wkb_geometry
FROM pgr_dijkstra(
    'SELECT ogc_fid AS id, source, target, ST_Length(wkb_geometry::geography) AS cost FROM opendata_partial_preprocessed',
    array(SELECT source FROM start),
    array(SELECT source FROM destination),
    directed := false
) AS di
JOIN opendata_partial_preprocessed AS pt
    ON di.edge = pt.ogc_fid;
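Graph analysis works the same way. For instance, here is a quick sketch to list the connected components of the graph and their sizes (pgr_connectedComponents expects the same kind of edges SQL as pgr_dijkstra):

SELECT component, count(*) AS n_nodes
FROM pgr_connectedComponents(
    'SELECT ogc_fid AS id, source, target, ST_Length(wkb_geometry::geography) AS cost FROM opendata_partial_preprocessed'
)
GROUP BY component
ORDER BY n_nodes DESC;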
Going towards processing your entire graph
First, running the same pipeline will likely fail on a very large graph. The solution is to make partial/incremental calls to pgr_createTopology.
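By default, pgr_createTopology only processes rows whose source or target is still NULL, so it can be called repeatedly on bounding-box chunks until the whole table is noded. Here is a rough sketch of such an incremental run, assuming the full table is named opendata; the 1° × 1° grid and its extent below are made up and should be adapted to your data:

-- Hedged sketch: node the graph chunk by chunk using the rows_where
-- argument of pgr_createTopology. The grid extent is purely illustrative.
DO $$
DECLARE
    x integer;
    y integer;
BEGIN
    FOR x IN -6..9 LOOP
        FOR y IN 41..51 LOOP
            PERFORM pgr_createTopology(
                'opendata',
                1e-6,
                'wkb_geometry',
                'ogc_fid',
                'source',
                'target',
                rows_where := format(
                    'wkb_geometry && ST_MakeEnvelope(%s, %s, %s, %s, 4326)',
                    x, y, x + 1, y + 1
                )
            );
        END LOOP;
    END LOOP;
END;
$$;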
Then, computing Dijkstra on such a large graph will be really slow, since the SQL query from the previous section recomputes all the edge costs on the fly. Instead, you can precompute all costs once and for all to speed up any subsequent routing query:
ALTER TABLE opendata ADD COLUMN IF NOT EXISTS "cost" float;
UPDATE opendata SET cost=ST_Length(wkb_geometry::geography);
And then use this new column in calls to Dijkstra:
SELECT
    di.seq,
    pt.wkb_geometry
FROM pgr_dijkstra(
    'SELECT ogc_fid AS id, source, target, cost, cost AS reverse_cost FROM opendata',
    array(SELECT source FROM start),
    array(SELECT source FROM destination),
    directed := false
) AS di
JOIN opendata AS pt
    ON di.edge = pt.ogc_fid;
Techniques to efficiently process the whole graph
Here are a few additional tips for processing very large graphs:
- Topology creation can sometimes be further optimized by replacing pgr_createTopology with pgr_extractVertices. An example of creating a routing topology using pgr_extractVertices is provided in the pgRouting doc (see the sketch after this list).
- Whenever you want to run extra computations on the whole graph (pgr_connectedComponents for instance), you might run into memory errors. This is because PostgreSQL has a hard limit on how large a single data structure can be (about 1GB), and you might end up with transient data structures larger than this. A quick workaround is to run tiled computations by superimposing a grid on your network. An example for pgr_connectedComponents is provided in this pgrouting-users ML thread.
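For reference, here is a minimal sketch of the pgr_extractVertices approach, adapted from the recipe in the pgRouting doc and using the column names from above. Note that, unlike pgr_createTopology, the matching here is exact: it assumes endpoints of connected edges are strictly identical (no tolerance-based merging):

-- Build the vertices table from the edge endpoints.
CREATE TABLE opendata_vertices AS
SELECT *
FROM pgr_extractVertices(
    'SELECT ogc_fid AS id, wkb_geometry AS geom FROM opendata ORDER BY ogc_fid'
);

-- Fill source/target by matching each edge endpoint with its vertex.
UPDATE opendata AS e
SET source = v.id
FROM opendata_vertices AS v
WHERE ST_StartPoint(e.wkb_geometry) = v.geom;

UPDATE opendata AS e
SET target = v.id
FROM opendata_vertices AS v
WHERE ST_EndPoint(e.wkb_geometry) = v.geom;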