What I’m Trying to Do
I’m working on a system to automatically extract diagram images from past paper PDFs. The diagrams are usually referenced like “Fig. 1.1”, and I want to extract the figure image below that label — including any relevant labels or annotations around the diagram.
How My Extraction Works
1. Locate the “Fig. x.y” Label
I scan each page for references like "Fig. 1.1"
using regex + page.search_for
.
2. Guess a Figure Area (Bounding Box)
From the “Fig.” label position, I estimate the diagram’s bounding box by:
- Expanding a rectangle below the label
- With a max width/height (
400x250
) to keep it reasonable
- Using
dynamic_figure_box()
to fine-tune the shape
3. Avoid Overlapping Other Content
In the dynamic_figure_box()
method:
- I stop the crop early if there’s another
"Fig."
below — to avoid capturing multiple diagrams.
- I also scan nearby text blocks, and shrink the crop if it overlaps any significant body text, using a custom
is_text_block_significant()
function.
4. Render the Diagram
Once the crop region is finalized:
- I use
page.get_pixmap(clip=rect)
to render the image
- Then convert it to base64 and upload it to Supabase for storage
What Works Well
- I can detect many diagrams cleanly based on their “Fig.” reference.
- The cropping is dynamic, so I avoid grabbing too much unrelated content.
Remaining Problem
Sometimes the diagram includes important labels or annotations that are slightly outside my initial crop. I want to detect if this happens and expand the crop accordingly, but I’m still working on that logic.
What I’m Looking For
I’d appreciate any feedback or suggestions on:
- More accurate or smarter ways to estimate the figure bounding box
- Techniques to detect and include relevant labels (text near shapes/arrows)
- Efficiently preventing overlap with unrelated text
Example
An example for the input and output has been uploaded for help.
Input
Output

Thanks!
1 Like
Hi DSP_huzi,
and welcome to this place!
You provided an excellent description of your situation - I do think I understand.
The question I have:
You mentioned paper-based PDFs. Does this mean we are dealing with OCRed pages? If yes, then the OCR level only contains text - no vector graphics and no images are detectable.
If not then we are dealing with standard PDF content. This allows to use the full PyMuPDF weaponry that can extract images and vector graphics.
2 Likes
Thanks for the warm welcome!
Yes — these are native PDFs, not OCRed. I can extract text cleanly with page.get_text("dict")
, and the text has font/size info. Most diagrams are vector drawings (lines, shapes), and some include embedded raster images too. I have uploaded the whole pdf so you can check yourself 
5054_w24_qp_22 (3).pdf (290.8 KB)
That’s why I was using a dynamic box to clip around “Fig. x.y” references and expand it to catch surrounding labels. If you have better strategies to detect the full extent of diagrams or label-text relationships, I’d love your insight!
Great, thanks for the PDF. I think I managed to come up with a script that does the job.
test.py (1.5 KB)
This is its result PDF:
clusters.pdf (287.8 KB)
No regular expressions … no search … only using batteries included in PyMuPDF.
Just collating its success factors:
- The caption text makes up its own line, starts with “Fig.”, ends with a digit and the text contains exactly 2 dot (“.” ) characters.
- From caption text bottom to top coordinate of its drawing, all text is regarded to be a relevant part of the final boundary box - over the full width of the page - as long as it is horizontal text.
Thankyou so much for your solution! I tested it on another file and it almost worked on all diagrams except for which had some text like “(Not to scale)” after “Fig. 3.1” for example. Sorry for not specifying the edge case before 
Here is the file I tested it on:
5054_s23_qp_22 (1).pdf (910.1 KB)
If it only needs to cover the “(not to scale)” case, the changes are simple. However on page 6 there exists 2 different diagrams side by side. That is harder to disassemble: practically two columns are established there … let me see what can be done.
A somewhat better version is this with a trivial extension.
Or rather take this one. Then only page 6 needs to be resolved.
test.py (1.7 KB)
This one is even better …
test.py (2.1 KB)