Problem Solving at CPR: How we are building a better search experience.

by Harrison Pim, Head of Data Science, Climate Policy Radar

Climate Policy Radar exists because access to reliable climate data matters. The decision-makers who use our tools have to navigate a vast and growing pool of documents - climate legislation, strategies, and court decisions, to name but a few types - to answer difficult, multi-disciplinary questions.

At its core, Climate Policy Radar is a search engine. Search remains the primary way that users interact with our data and define their own research journeys through it.

However, there’s a key problem that every search engine faces: when users search for something and don't find it, they don't always know what they're missing. A researcher might conclude that no precedent exists for a policy they're designing. A country might overlook a proven approach that could have informed its climate strategy. The gap between what exists in our dataset and what our search engine can surface has meaningful consequences for our users' work, and therefore for the climate.

We want to proactively address that gap, but to do so we have to be able to measure it, which is very challenging!

Why we're evaluating our search engine now

While we've been focused on other parts of the system - improving data quality, developing machine-learning-based classifiers for our knowledge graph - it’s time to refocus our attention on search relevance, to ensure the progress we’ve made elsewhere doesn’t come at the cost of the quality of our core offer.

We know there's room to improve, and we have ideas about how to do it. The challenge is making sure we're making the right improvements - that each change we ship actually makes things better, without inadvertently breaking something that was already working.

Why search relevance is so difficult to measure

Measuring search relevance is a subtle art. We're not simply sorting our data into buckets of "relevant" and "not relevant" results. Our users need a ranked list of results, with the most relevant results at the top.

The search context also requires that we account for situations where there could be multiple "correct" responses, depending on details about the user's intent which we can't see. For example, we might see our users requesting results which match the search term "NZ" - we can't be sure whether that's a request for documents from/about New Zealand, or documents about net zero. We have lots of material on both topics, and need to decide which set is most likely to match our user's intent.

Balancing our users' ambiguous needs and the requirement for ranked results makes our search relevance evaluation task much more of a challenge.

Deciding what ‘good’ looks like

Rather than trying to solve this all at once, we're starting with something deliberately simple: a suite of pass/fail tests. Each test encodes a concrete expectation about search behaviour that we think any search engine implementation at CPR should meet.

When our tests run, we measure whether a given search engine implementation meets each expectation: when someone searches for "UK Climate Change Act," they should find the Climate Change Act 2008 near the top of the results list; when they search for "NZ," results should include both New Zealand documents and content about net zero targets; when they search for "adaptation strategy," results should actually be adaptation strategies.
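To make this concrete, here is a minimal sketch of what such a pass/fail test might look like. The toy corpus, the keyword-overlap `search()` function, and the `passes_top_k` helper are all illustrative assumptions, not CPR's actual implementation:

```python
# A toy corpus of document IDs mapped to their text. Illustrative only.
CORPUS = {
    "uk-cca-2008": "Climate Change Act 2008 United Kingdom legislation",
    "nz-adaptation": "New Zealand national adaptation plan",
    "net-zero-strategy": "net zero targets strategy report",
}

def search(query: str) -> list[str]:
    """Toy stand-in search: rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = [
        (sum(w in text.lower().split() for w in q_words), doc_id)
        for doc_id, text in CORPUS.items()
    ]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

def passes_top_k(query: str, expected_doc: str, k: int = 3) -> bool:
    """Pass/fail expectation: the expected document appears in the top k results."""
    return expected_doc in search(query)[:k]
```

Each expectation reduces to a single boolean, so a whole suite of them can run in seconds against any candidate implementation that exposes the same `search()` interface.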

To do this efficiently, we group our tests by user intent. We can't cover every possible search term, but we can spot patterns in how people search. If users regularly combine a topic with a geography, for example, we write a handful of tests for that pattern. If there's a recurring type of spelling variation or typo, a few well-chosen examples stand in for all the rest. By sampling across the full range of common search behaviours rather than chasing exhaustive coverage, we can be confident that a search engine which passes our tests will handle the vast majority of real searches well.

This approach might sound obvious, but that's exactly the point. When a colleague or user tells us "I searched for X but couldn't find Y," we hear a new search intention and an opportunity for a new test. This framework gives us the ability to turn that feedback into a permanent record of the expectation, which can be checked automatically, forever. The next time we make changes to the search engine, we'll know immediately if we've broken something that used to work. With that knowledge, we can then incrementally iterate on our search engine, progressively improving our ability to meet our users' expectations without risking inadvertent regressions.
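One way to sketch that feedback loop in code: a report of "I searched for X but couldn't find Y" appends a new case to a shared list of expectations, which every future candidate engine is checked against. The expectation list, `record_feedback`, and `regression_report` below are hypothetical names for illustration:

```python
# Each expectation pairs a search term with a document that must appear
# near the top of the results. Entries here are illustrative.
EXPECTATIONS = [
    ("uk climate change act", "Climate Change Act 2008"),
    ("adaptation strategy", "National Adaptation Plan"),
]

def record_feedback(query: str, missing_doc: str) -> None:
    """Turn a user report into a permanent, automatically checked test case."""
    EXPECTATIONS.append((query, missing_doc))

def regression_report(search_fn, k: int = 5) -> list[tuple[str, str]]:
    """Run every expectation against a candidate engine; return the failures.

    search_fn is any callable mapping a query string to a ranked list of
    document titles. An empty return value means no regressions.
    """
    return [
        (query, doc)
        for query, doc in EXPECTATIONS
        if doc not in search_fn(query)[:k]
    ]
```

Because the suite only grows, an expectation that passes today is re-checked on every future change, which is what turns one-off feedback into a durable guard against regressions.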

Why we prioritise offline evaluation

Testing our system offline gives us the luxury of both speed and safety. Our suite of tests runs against a new candidate search engine implementation in seconds. Evaluating search against real users, by contrast, means waiting weeks to collect enough behavioural data to draw any conclusions - and it means risking showing real users suboptimal results while we wait.

We do also keep anonymous logs of how users interact with the live search engine, and that data remains valuable. But it works best as a complement to offline evaluation, not a substitute for it. Testing offline lets us experiment with dozens of approaches in the time a single live test would take to run. The faster we can evaluate without risk, the faster we can improve.

What comes next

Pass/fail tests are a floor, not a ceiling. They tell us whether something is broken, but not how much better one approach is than another. The next step is building richer metrics that can capture that nuance. To build those, CPR’s data science and programmes teams will be working together to gather a new set of data on preferences instead of just requirements. We’ll be asking our domain experts to compare results from different search engines, scoring the ones which seem more useful, and allowing us to calculate more nuanced metrics.
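As a simple illustration of how pairwise preferences can become a graded metric, consider computing a win rate per engine from expert judgements. The judgement data and function names here are hypothetical, and a real analysis would likely use something richer (per-query breakdowns, or a model like Bradley-Terry):

```python
from collections import Counter

# Hypothetical judgements: for each query, which candidate engine ("A" or "B")
# an expert preferred. Illustrative data only.
JUDGEMENTS = [
    ("flood risk policy", "A"),
    ("net zero targets", "A"),
    ("deforestation litigation", "B"),
    ("adaptation finance", "A"),
]

def win_rate(judgements: list[tuple[str, str]], engine: str) -> float:
    """Fraction of pairwise comparisons the given engine won."""
    wins = Counter(winner for _, winner in judgements)
    return wins[engine] / len(judgements)
```

Unlike a pass/fail check, a win rate orders candidate engines on a continuous scale, so two implementations that both pass every test can still be ranked against each other.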

That said, the principle stays the same: the more quickly and safely we can evaluate our systems, the faster they can improve. When the quality of our search results shapes how researchers and policymakers engage with climate evidence, that speed matters.
