pgRouting on (very) large graphs
Posted on March 23, 2023 in Dev • 4 min read
I recently performed routing and graph analysis on a very large graph built on top of some OpenData files. The underlying graph had about 40M edges, which is roughly the size of the road network mapped by OpenStreetMap in North America. Not that much, but still a bit tricky to handle.
The main issue was that the source OpenData files were geometric representations of the features (GeoJSON features, actually bunches of GeoJSON Point and LineString objects), not an actual graph network. This is very different from starting directly from OpenStreetMap data, which are directly (and correctly) noded out of the box: contrary to OSM data, where ways are built on top of nodes so that connected ways share a common node and a graph structure comes up naturally, I had a collection of LineString objects, each built upon its own local set of Point coordinates, which had to be reconciled first. See this illustration from the pgRouting doc for a comparison of these two situations.
Given these constraints:
- The usual OSM routing stack (OSRM, GraphHopper) could not be used out of the box, short of implementing a sophisticated data transform (noding the graph).
- The usual Python stack (networkx and derived libraries) could not be used either: the graph was too large to be stored in memory, and it would not give meaningful routing results until the graph was correctly noded.
So, I ended up using pgRouting. pgRouting is a perfect match for this task:
* It has a very detailed documentation.
* It has a ready-to-use function to build a graph topology out of a collection of LineString objects, which is the precise transformation I needed to be able to route on my graph.
* It comes with many ready-to-use functions to route with various algorithms and to analyze the graph (holes, connected components, etc.).
Given these features and its ease of use together with the usual PostgreSQL/PostGIS/GDAL stack, it is very straightforward to quickly get a minimal processing pipeline running on a small subset of your graph.
Minimal processing pipeline on a small subset of your graph
First, import your GeoJSON features into a PostgreSQL/PostGIS database and filter on some bbox to get a small subset of your graph that is fine (performance-wise) for tuning your processing pipeline:
ogr2ogr -f "PostgreSQL" PG:"dbname=$PGDATABASE" ./opendata.geojson -explodecollections
psql -c 'CREATE UNLOGGED TABLE opendata_partial AS SELECT * FROM opendata WHERE wkb_geometry && ST_MakeEnvelope(left,bottom,right,top, 4326);'
Note: I’m creating an UNLOGGED table here since I do not worry about data consistency after a server crash, and this comes with performance improvements.
At this point, you can easily design your pre-processing workflow on this opendata_partial table. This includes:
- Fixing holes in the data
- Filtering out broken features/geometries (see the sketch below)
- etc.
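As an illustration, here is a minimal sketch of such a cleaning step, assuming the column names created by ogr2ogr above (wkb_geometry, ogc_fid); the exact filters are hypothetical and depend on what “broken” means in your dataset:

-- Keep only non-empty, valid LineString geometries (hypothetical
-- filters: adapt them to your own data issues).
CREATE UNLOGGED TABLE opendata_partial_preprocessed AS
SELECT *
FROM opendata_partial
WHERE wkb_geometry IS NOT NULL
  AND NOT ST_IsEmpty(wkb_geometry)
  AND ST_IsValid(wkb_geometry)
  AND GeometryType(wkb_geometry) = 'LINESTRING';

-- A spatial index speeds up pgr_createTopology and the nearest-edge
-- lookups used for routing below.
CREATE INDEX ON opendata_partial_preprocessed USING gist (wkb_geometry);

This also conveniently yields the opendata_partial_preprocessed table used in the rest of this post.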
From there, creating your graph topology on top of your collection of features is as simple as running pgr_createTopology:
\set tolerance 1e-6
ALTER TABLE opendata_partial_preprocessed ADD COLUMN source integer;
ALTER TABLE opendata_partial_preprocessed ADD COLUMN target integer;
SELECT pgr_createTopology(
    'opendata_partial_preprocessed',
    :tolerance,
    'wkb_geometry',
    'ogc_fid',
    'source',
    'target'
);
assuming your features lie in an opendata_partial_preprocessed table, with a geometry column named wkb_geometry and a unique id column named ogc_fid. This call populates the newly created source and target columns, which store the actual graph (each row of the table represents an edge from node id source to node id target).
The topology is created up to the given tolerance (nodes are merged up to this distance). Beware that it is expressed in the same coordinate system as your geometries (probably EPSG:4326, where 1e-6 degree is roughly 10 cm in France).
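Before routing, it can be worth sanity-checking the resulting topology with pgr_analyzeGraph, which takes the same arguments as pgr_createTopology and reports (as notices) potential issues such as isolated segments, dead ends or gaps:

SELECT pgr_analyzeGraph(
    'opendata_partial_preprocessed',
    :tolerance,
    'wkb_geometry',
    'ogc_fid',
    'source',
    'target'
);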
You can then route on this graph (directed or undirected, depending on the value you assign to reverse_cost) and compute analyses on it (find dead ends or extract connected components).
For example, to compute routing:
\set start_lon TODO
\set start_lat TODO
\set end_lon TODO
\set end_lat TODO
WITH
-- Project the start point on the closest point of the graph
-- (source is picked in an arbitrary way here, it could have been target)
start AS (
    SELECT source
    FROM opendata_partial_preprocessed
    ORDER BY wkb_geometry <-> ST_SetSRID(ST_GeomFromText('POINT (' || :'start_lon' || ' ' || :'start_lat' || ')'), 4326)
    LIMIT 1
),
-- Project the destination point on the closest point of the graph
destination AS (
    SELECT source
    FROM opendata_partial_preprocessed
    ORDER BY wkb_geometry <-> ST_SetSRID(ST_GeomFromText('POINT (' || :'end_lon' || ' ' || :'end_lat' || ')'), 4326)
    LIMIT 1
)
-- Route with Dijkstra and join with the edges table
-- (opendata_partial_preprocessed) to get the geometry of each step
SELECT
    di.seq,
    pt.wkb_geometry
FROM pgr_dijkstra(
    'SELECT ogc_fid AS id, source, target, ST_Length(wkb_geometry::geography) AS cost FROM opendata_partial_preprocessed',
    array(SELECT source FROM start),
    array(SELECT source FROM destination),
    directed := false
) AS di
JOIN opendata_partial_preprocessed AS pt
    ON di.edge = pt.ogc_fid;
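Graph analysis works the same way. For instance, here is a quick sketch to list the connected components of the graph and their sizes (pgr_connectedComponents expects the same kind of edges SQL as pgr_dijkstra):

SELECT component, count(*) AS n_nodes
FROM pgr_connectedComponents(
    'SELECT ogc_fid AS id, source, target, ST_Length(wkb_geometry::geography) AS cost FROM opendata_partial_preprocessed'
)
GROUP BY component
ORDER BY n_nodes DESC;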
Going towards processing your entire graph
First, running the same pipeline will likely fail on a very large graph. The solution is to make partial/incremental calls to pgr_createTopology.
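By default, pgr_createTopology only processes rows whose source or target is still NULL, so it can be called repeatedly on bounding-box chunks until the whole table is noded. Here is a rough sketch of such an incremental run, assuming the full table is named opendata; the 1° × 1° grid and its extent below are made up and should be adapted to your data:

-- Hedged sketch: node the graph chunk by chunk using the rows_where
-- argument of pgr_createTopology. The grid extent is purely illustrative.
DO $$
DECLARE
    x integer;
    y integer;
BEGIN
    FOR x IN -6..9 LOOP
        FOR y IN 41..51 LOOP
            PERFORM pgr_createTopology(
                'opendata',
                1e-6,
                'wkb_geometry',
                'ogc_fid',
                'source',
                'target',
                rows_where := format(
                    'wkb_geometry && ST_MakeEnvelope(%s, %s, %s, %s, 4326)',
                    x, y, x + 1, y + 1
                )
            );
        END LOOP;
    END LOOP;
END;
$$;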
Then, computing Dijkstra on such a large graph will be really slow, since the SQL query from the previous section recomputes all the edge costs on the fly. Instead, you can precompute all costs once and for all to speed up any subsequent routing query:
ALTER TABLE opendata ADD COLUMN IF NOT EXISTS "cost" float;
UPDATE opendata SET cost=ST_Length(wkb_geometry::geography);
And then use this new column in calls to Dijkstra:
SELECT
    di.seq,
    pt.wkb_geometry
FROM pgr_dijkstra(
    'SELECT ogc_fid AS id, source, target, cost, cost AS reverse_cost FROM opendata',
    array(SELECT source FROM start),
    array(SELECT source FROM destination),
    directed := false
) AS di
JOIN opendata AS pt
    ON di.edge = pt.ogc_fid;
Techniques to efficiently process the whole graph
Here are a few additional tips for processing very large graphs:
- Topology creation can sometimes be further optimized by replacing pgr_createTopology with pgr_extractVertices. An example of creating a routing topology using pgr_extractVertices is provided in the pgRouting doc (see the sketch after this list).
- Whenever you want to run extra computations on the whole graph (pgr_connectedComponents for instance), you might run into memory errors. This is because PostgreSQL has a hard limit on how large a single data structure can be (about 1GB), and you might end up with transient data structures larger than this. A quick workaround is to run tiled computations by superimposing a grid on your network. An example for pgr_connectedComponents is provided in this pgrouting-users ML thread.
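For reference, here is a minimal sketch of the pgr_extractVertices approach, adapted from the recipe in the pgRouting doc and using the column names from above. Note that, unlike pgr_createTopology, the matching here is exact: it assumes endpoints of connected edges are strictly identical (no tolerance-based merging):

-- Build the vertices table from the edge endpoints.
CREATE TABLE opendata_vertices AS
SELECT *
FROM pgr_extractVertices(
    'SELECT ogc_fid AS id, wkb_geometry AS geom FROM opendata ORDER BY ogc_fid'
);

-- Fill source/target by matching each edge endpoint with its vertex.
UPDATE opendata AS e
SET source = v.id
FROM opendata_vertices AS v
WHERE ST_StartPoint(e.wkb_geometry) = v.geom;

UPDATE opendata AS e
SET target = v.id
FROM opendata_vertices AS v
WHERE ST_EndPoint(e.wkb_geometry) = v.geom;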