Any idea what is wrong with this PDF?

Jamie_Lemon · July 9, 2025, 1:01pm

Extracting text from it produces garbled output, e.g.:

import pymupdf

doc = pymupdf.open("anon-test.pdf")
page = doc[0]
text = page.get_text()
print(text)

I think this is a PDF problem! How can we check?

anon-test.pdf (90.4 KB)

HaraldLieder · July 9, 2025, 3:03pm

The fonts in this PDF are missing the information needed to map a visible glyph back to Unicode.
The default text extraction flags try to work around this by returning the glyph number wherever the invalid-Unicode character � (U+FFFD) would otherwise be returned.

This often helps, but sometimes it only adds confusion, as it does here.
Try page.get_text(flags=0) to confirm that you would otherwise see � characters.
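
A minimal sketch of that check, assuming the file from the first post. The helper names replacement_stats and report are made up for illustration and are not part of PyMuPDF; only pymupdf.open and page.get_text(flags=0) come from the library.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the "invalid Unicode" character


def replacement_stats(text: str) -> tuple[int, int]:
    """Return (count of U+FFFD characters, count of non-whitespace characters)."""
    visible = [ch for ch in text if not ch.isspace()]
    return sum(ch == REPLACEMENT for ch in visible), len(visible)


def report(path: str) -> None:
    """Print, for each page, how many extracted characters are U+FFFD."""
    import pymupdf  # imported here so replacement_stats works without PyMuPDF

    doc = pymupdf.open(path)
    for page in doc:
        # flags=0 turns off the default extraction flags, including the
        # substitution of glyph numbers for unknown characters
        bad, total = replacement_stats(page.get_text(flags=0))
        print(f"page {page.number + 1}: {bad} of {total} characters are U+FFFD")
```

Calling report("anon-test.pdf") should then show whether the page is affected: a non-zero first number means some glyphs have no Unicode mapping.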

In any case, there is no way to improve this situation, except by using OCR of course.
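
The OCR route could look roughly like the sketch below. It assumes Tesseract is installed and usable by PyMuPDF; the function name extract_with_ocr_fallback, the 10% threshold and the 300 dpi setting are arbitrary choices for illustration, not recommendations from this thread.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the "invalid Unicode" character


def extract_with_ocr_fallback(page, max_bad_ratio: float = 0.1, dpi: int = 300) -> str:
    """Return the page text, re-reading the page via OCR if too much of it is U+FFFD."""
    text = page.get_text(flags=0)
    visible = [ch for ch in text if not ch.isspace()]
    bad = sum(ch == REPLACEMENT for ch in visible)
    if not visible or bad / len(visible) <= max_bad_ratio:
        return text  # normal extraction looks usable (or the page has no text)
    # full=True OCRs the rendered page as a whole rather than only its images
    textpage = page.get_textpage_ocr(flags=0, dpi=dpi, full=True)
    return page.get_text(textpage=textpage)
```

OCR is much slower than normal extraction and can introduce recognition errors of its own, which is why the sketch only falls back to it when the ordinary result is clearly unusable.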

Jamie_Lemon · July 9, 2025, 3:05pm

How can you detect this? Is there a PyMuPDF command you can run that gives you this info? Something like “validate PDF” or whatever.

HaraldLieder · July 9, 2025, 3:13pm

Unfortunately, the missing information may not be “universal” in the sense of “missing for everything”: the problem may affect just one font out of many on the page or, even worse, just one or a handful of glyphs among many that are fine.

I am considering changing the default flags so that this auto-replacement is no longer attempted, which would make the situation detectable. I have already done that in PyMuPDF4LLM and probably should do it in general.
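
Because the problem can be limited to one font or a few glyphs, a per-font count is more telling than a page-wide one. The sketch below is one possible way to get it, assuming the "dict" output of page.get_text with flags=0; fffd_by_font and suspicious_fonts are hypothetical helper names, not PyMuPDF functions.

```python
from collections import defaultdict

REPLACEMENT = "\ufffd"  # U+FFFD, the "invalid Unicode" character


def fffd_by_font(page_dict: dict) -> dict:
    """Map font name -> [U+FFFD count, total character count] for one page dict."""
    stats = defaultdict(lambda: [0, 0])
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                entry = stats[span["font"]]
                entry[0] += span["text"].count(REPLACEMENT)
                entry[1] += len(span["text"])
    return dict(stats)


def suspicious_fonts(path: str, page_number: int = 0) -> dict:
    """Return only the fonts on one page that produced at least one U+FFFD."""
    import pymupdf  # imported here so fffd_by_font works without PyMuPDF

    doc = pymupdf.open(path)
    page_dict = doc[page_number].get_text("dict", flags=0)
    return {font: counts for font, counts in fffd_by_font(page_dict).items() if counts[0] > 0}
```

A font with a high U+FFFD share is probably missing its Unicode mapping entirely, while a font with only a few hits points to the "handful of glyphs" case.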

Jamie_Lemon · July 9, 2025, 3:23pm

Hmmm, when I check it in Adobe Acrobat I get this weird font name:
[Image: Screenshot 2025-07-09 at 16.20.21]

Also, if I select, copy and paste the text from Adobe (or Preview), I get garbage too.

So it seems this is just a badly made PDF, right? (Or deliberately “bad” to prevent easy text copying.)

HaraldLieder · July 9, 2025, 3:29pm

Exactly the right check. If a PDF viewer can successfully copy/paste text where we return rubbish, then, and only then, we have a problem.

Jamie_Lemon · July 9, 2025, 3:44pm

Thanks — Got it!
