Smart cropping of diagrams

DSP_huzi · June 26, 2025, 2:47pm

What I’m Trying to Do

I’m working on a system to automatically extract diagram images from past paper PDFs. The diagrams are usually referenced like “Fig. 1.1”, and I want to extract the figure image below that label — including any relevant labels or annotations around the diagram.

How My Extraction Works

1. Locate the “Fig. x.y” Label

I scan each page for references like "Fig. 1.1" using regex + page.search_for.

2. Guess a Figure Area (Bounding Box)

From the “Fig.” label position, I estimate the diagram’s bounding box by:

Expanding a rectangle below the label
With a max width/height (400x250) to keep it reasonable
Using dynamic_figure_box() to fine-tune the shape

3. Avoid Overlapping Other Content

In the dynamic_figure_box() method:

I stop the crop early if there’s another "Fig." below — to avoid capturing multiple diagrams.
I also scan nearby text blocks, and shrink the crop if it overlaps any significant body text, using a custom is_text_block_significant() function.

4. Render the Diagram

Once the crop region is finalized:

I use page.get_pixmap(clip=rect) to render the image
Then convert it to base64 and upload it to Supabase for storage

What Works Well

I can detect many diagrams cleanly based on their “Fig.” reference.
The cropping is dynamic, so I avoid grabbing too much unrelated content.

Remaining Problem

Sometimes the diagram includes important labels or annotations that are slightly outside my initial crop. I want to detect if this happens and expand the crop accordingly, but I’m still working on that logic.

What I’m Looking For

I’d appreciate any feedback or suggestions on:

More accurate or smarter ways to estimate the figure bounding box
Techniques to detect and include relevant labels (text near shapes/arrows)
Efficiently preventing overlap with unrelated text

Example

An example for the input and output has been uploaded for help.
Input

Output

5054_w24_qp_22.pdf_q3029_Fig.3.1

Thanks!

HaraldLieder · June 26, 2025, 2:56pm

Hi DSP_huzi,
and welcome to this place!
You provided an excellent description of your situation - I do think I understand.
The question I have:
You mentioned paper-based PDFs. Does this mean we are dealing with OCRed pages? If yes, then the OCR level only contains text - no vector graphics and no images are detectable.
If not then we are dealing with standard PDF content. This allows to use the full PyMuPDF weaponry that can extract images and vector graphics.

DSP_huzi · June 26, 2025, 3:28pm

Thanks for the warm welcome!

Yes — these are native PDFs, not OCRed. I can extract text cleanly with page.get_text("dict"), and the text has font/size info. Most diagrams are vector drawings (lines, shapes), and some include embedded raster images too. I have uploaded the whole pdf so you can check yourself
5054_w24_qp_22 (3).pdf (290.8 KB)

That’s why I was using a dynamic box to clip around “Fig. x.y” references and expand it to catch surrounding labels. If you have better strategies to detect the full extent of diagrams or label-text relationships, I’d love your insight!

HaraldLieder · June 26, 2025, 4:40pm

Great, thanks for the PDF. I think I managed to come up with a script that does the job.
test.py (1.5 KB)
This is its result PDF:
clusters.pdf (287.8 KB)

HaraldLieder · June 26, 2025, 4:41pm

No regular expressions … no search … only using batteries included in PyMuPDF.
Just collating its success factors:

The caption text makes up its own line, starts with “Fig.”, ends with a digit and the text contains exactly 2 dot (“.” ) characters.
From caption text bottom to top coordinate of its drawing, all text is regarded to be a relevant part of the final boundary box - over the full width of the page - as long as it is horizontal text.

DSP_huzi · June 26, 2025, 6:23pm

Thankyou so much for your solution! I tested it on another file and it almost worked on all diagrams except for which had some text like “(Not to scale)” after “Fig. 3.1” for example. Sorry for not specifying the edge case before
Here is the file I tested it on:
5054_s23_qp_22 (1).pdf (910.1 KB)

HaraldLieder · June 26, 2025, 8:32pm

If it only needs to cover the “(not to scale)” case, the changes are simple. However on page 6 there exists 2 different diagrams side by side. That is harder to disassemble: practically two columns are established there … let me see what can be done.
A somewhat better version is this with a trivial extension.

HaraldLieder · June 26, 2025, 8:37pm

Or rather take this one. Then only page 6 needs to be resolved.
test.py (1.7 KB)

HaraldLieder · June 26, 2025, 9:27pm

This one is even better …
test.py (2.1 KB)

Topic		Replies	Views
Why is this diagraph NOT extracted as images by pymupdf4llm.to_markdown(write_images=True) Discussions	3	17	July 18, 2025
Why is this graphic NOT extracted as images by pymupdf4llm.to_markdown(write_images=True) Discussions	5	16	July 22, 2025
Graphic wrongly placed in md file output from pymupdf4llm.to_markdown Discussions	11	13	July 22, 2025
Looking for a smart way to extract pdf pages per article Discussions	13	38	July 4, 2025
Any idea what is wrong with this PDF? Discussions	6	23	July 9, 2025