Future Sparks: What makes an excellent document summary?
In December 2025, we spent four concentrated ‘hackathon’ days taking a step back to pay attention to the questions that need answering outside of our major launches and business-as-usual work. As well as being a lot of fun, the results were a series of sparks that will ignite and illuminate our future path. This series documents those moments of clarity and creativity, exploring ways we might improve our app.
What makes an excellent document summary?
An interview with James Gorrie, Senior Software Engineer, with insights from his cross-functional team: Harrison Pim (Head of Data Science), Kyra Prins (Policy Officer) and Alan Wright (Head of Product).
Tell me about the problem you are trying to solve.
We know that summaries are extremely valuable to our users across all of our documents.
Moreover, users appear to need summaries that serve a similar function to academic paper abstracts, letting them quickly peek into a document and understand its relevance to their work.
However, while many of our documents have high-quality summaries written by human experts, a portion of our database does not. During the hackathon, we explored using GenAI to generate summaries for documents currently without them. But with such a vast spread of document types and use cases, we did not initially know how to assess a summary’s ‘usefulness’ manually across our existing 30,000+ documents.
How are summaries currently created?
When a new document is added to a dataset, external or internal knowledge partners can provide a summary. Many of our trusted data partners have robust methodologies for writing summaries, but not all partners provide them, and there is a lack of consistency across the database.
How could we use our data science tools to streamline and improve this process?
Grand vision
For users to comfortably rely on summaries to assess the usefulness of a document and therefore accelerate their research.
What came out of the hackathon?
To effectively scale our summary function while maintaining a high standard, we must first define what a good summary is.
As in many other parts of our data science work at CPR, strong evaluation loops with expert feedback are essential for developing useful models. Our solution in this case consisted of two tightly coupled services: a discriminator and a generator.
We started by asking our team of policy experts to label existing summaries (both human-written and machine-generated) as ‘good’ or ‘bad’, with space for notes on why particular summaries were useful or unhelpful.
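To make that concrete, a single labelled example might look something like the sketch below (the field names are illustrative rather than our actual schema):

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class SummaryLabel:
    """One expert judgement on a single summary (field names are illustrative)."""
    document_id: str
    summary_text: str
    source: Literal["human", "machine"]   # who wrote the summary being judged
    verdict: Literal["good", "bad"]       # the expert's label
    notes: Optional[str] = None           # free-text notes on why it was useful or not
```

The free-text notes are the valuable part here: they are what the guidelines in the next step were distilled from.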
We quickly learned that what constitutes a good summary depends on the type of document and its intended audience. We have a huge variety of documents in our database, and given the limited time available, we decided to focus on submissions to the UNFCCC, UNCCD, and CBD, for which we currently do not have summaries.
Using insights from the initial labelling exercise, we gave an LLM a minimal set of guidelines and asked it to judge the same set of summaries against those criteria. By measuring the agreement between the LLM and human experts, we iteratively refined the prompt. Through several rounds of refinement – examining cases where human and LLM assessments differed – we developed the discriminator: a prompt that could reliably judge summary quality.
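Measuring that agreement can be kept very simple. The sketch below (which assumes scikit-learn; our actual tooling isn’t shown here) compares the two sets of ‘good’/‘bad’ verdicts and surfaces the disagreements worth inspecting:

```python
from sklearn.metrics import cohen_kappa_score

def agreement(human_labels: list[str], llm_labels: list[str]) -> dict:
    """Compare expert and LLM 'good'/'bad' verdicts on the same set of summaries."""
    assert len(human_labels) == len(llm_labels)
    raw = sum(h == m for h, m in zip(human_labels, llm_labels)) / len(human_labels)
    kappa = cohen_kappa_score(human_labels, llm_labels)  # corrects raw agreement for chance
    return {"raw_agreement": raw, "cohens_kappa": kappa}

# The disagreements are the interesting cases: they drive the next round of prompt edits.
def disagreements(ids: list[str], human_labels: list[str], llm_labels: list[str]) -> list[str]:
    return [doc_id for doc_id, h, m in zip(ids, human_labels, llm_labels) if h != m]
```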
The discriminator's criteria were then embedded into the generator prompt, which created summaries for new documents based on those guidelines. These generated summaries were sent back to our policy experts for evaluation, and the cycle continued – refining both discriminator and generator – until the vast majority of summaries were labelled as useful.
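Stripped of the prompts themselves, the cycle looks roughly like the sketch below; `generate_summary`, `judge_summary` and `collect_expert_labels` are placeholders for the generator prompt, the discriminator prompt and the expert review step, and the stopping rule is illustrative:

```python
def refinement_loop(documents, generate_summary, judge_summary, collect_expert_labels,
                    max_rounds=5, target=0.9):
    """Run the generate -> judge -> expert-review cycle until most summaries are useful.

    generate_summary(doc)        -> str                    (generator prompt, built on the criteria)
    judge_summary(summary)       -> 'good' | 'bad'         (discriminator prompt)
    collect_expert_labels(items) -> list of 'good' | 'bad' (policy experts in the loop)
    """
    summaries = []
    for _ in range(max_rounds):
        summaries = [generate_summary(doc) for doc in documents]
        llm_verdicts = [judge_summary(s) for s in summaries]
        expert_verdicts = collect_expert_labels(summaries)
        useful = sum(v == "good" for v in expert_verdicts) / len(expert_verdicts)
        if useful >= target:
            break
        # Otherwise: compare llm_verdicts with expert_verdicts and the experts' notes,
        # then refine both prompts (outside this function) before the next round.
    return summaries
```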
So you are now armed with a new experimental process to generate great summaries. What next?
Firstly, we will continue to work on evaluating the ‘faithfulness’ of a summary with respect to the original document. This will take further evaluation and iteration before we achieve our ‘grand vision’.
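One way such a faithfulness check could be framed (a sketch only, not a committed design) is as the fraction of summary sentences that some judge – an LLM or an entailment model – considers supported by the source document:

```python
def faithfulness_score(summary_sentences: list[str], source_text: str, is_supported) -> float:
    """Fraction of summary sentences judged to be supported by the source document.

    is_supported(sentence, source_text) -> bool  (e.g. an LLM judge or an NLI model)
    """
    verdicts = [is_supported(sentence, source_text) for sentence in summary_sentences]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```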
A huge question that now sits before us is how to divide each document type into useful sections, so that an automated summary generator can understand what to include. Some documents have clearly defined, predictable structures that are easy to draw information from, while others do not.
For example, Nationally Determined Contributions (NDCs) submitted to the UNFCCC follow a predefined set of rules describing what chapters should be included and what topics they should cover. This makes it easier to find the parts that make a good summary for these documents, like the primary objectives, country profile, and greenhouse gas emission reduction targets. Other document types will require further investigation to help our generator understand what information should be included.
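As a rough illustration of what ‘understanding the structure’ could mean in practice, a generator might first check which of the expected sections it can actually find in a document; the section names below are examples rather than a definitive list:

```python
import re

# Illustrative section names only; a real list would come from the UNFCCC guidance
# for NDCs and from our policy experts.
EXPECTED_SECTIONS = [
    "national circumstances",   # country profile
    "mitigation",               # emission reduction targets
    "adaptation",
    "means of implementation",
]

def find_expected_sections(document_text: str) -> dict[str, bool]:
    """Check which expected sections a document mentions (naive keyword match)."""
    lowered = document_text.lower()
    return {name: bool(re.search(re.escape(name), lowered)) for name in EXPECTED_SECTIONS}
```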
Finally, once implemented, this approach would have an additional benefit for internal teams. We could collect feedback on the usefulness of a summary, for example, through a simple pop-up asking “Was this summary helpful?”, with thumbs-up or thumbs-down responses.
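A signal like this is cheap to collect and store. Something as small as the record sketched below (field names are illustrative) would be enough to flag which generated summaries need attention:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SummaryFeedback:
    """One response to 'Was this summary helpful?' (field names are illustrative)."""
    document_id: str
    helpful: bool          # True for thumbs-up, False for thumbs-down
    submitted_at: datetime

def helpfulness_rate(responses: list[SummaryFeedback]) -> float:
    """Share of responses marking a summary as helpful."""
    return sum(r.helpful for r in responses) / len(responses) if responses else 0.0
```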
Help shape what we build next. These experiments are just the beginning. Explore our public product roadmap to see what’s coming next, and join our user research programme to help shape the future of the app.