Any idea what is wrong with this PDF?

Text extraction produces garbled output, e.g.:

import pymupdf

doc = pymupdf.open("anon-test.pdf")
page = doc[0]
text = page.get_text()
print(text)

I think this is a PDF problem! How can we check?

anon-test.pdf (90.4 KB)

The fonts in this PDF are missing the back-translation information that maps each visible glyph to its Unicode character.
The default text extraction flags try to work around this by returning the glyph number wherever the invalid Unicode character � (U+FFFD) would otherwise be returned.

This often helps, but sometimes it only adds to the confusion, as it does here.
Try page.get_text(flags=0) to confirm that you would otherwise see � characters.
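
For anyone who wants to turn the flags=0 check into a number, here is a minimal sketch. The helper names count_undecodable and scan_pdf are made up for illustration and are not part of PyMuPDF; the only PyMuPDF calls used are pymupdf.open and page.get_text with flags=0.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the "invalid Unicode" placeholder character


def count_undecodable(page):
    """Return (bad, total): U+FFFD count and non-whitespace character count."""
    # flags=0 switches off the glyph-number substitution, so glyphs without
    # a Unicode mapping come back as U+FFFD instead of arbitrary characters.
    text = page.get_text("text", flags=0)
    chars = [c for c in text if not c.isspace()]
    return sum(c == REPLACEMENT for c in chars), len(chars)


def scan_pdf(path):
    """Print the share of undecodable characters for each page of a PDF."""
    import pymupdf  # imported here; only needed when a real file is opened

    with pymupdf.open(path) as doc:
        for page in doc:
            bad, total = count_undecodable(page)
            share = bad / total if total else 0.0
            print(f"page {page.number + 1}: {bad}/{total} undecodable ({share:.0%})")
```

Calling scan_pdf("anon-test.pdf") would then print one line per page. A high share suggests the fonts lack Unicode mappings; what counts as "high" is a judgment call.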

In any case, there is no way to improve this situation, except by using OCR of course.
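
A rough sketch of what such an OCR fallback could look like. Assumptions: Tesseract and its language data are installed where PyMuPDF can find them, and the function name text_with_ocr_fallback and the 10% threshold are invented for this example. full=True is used so that the whole rendered page is OCRed, because here the text is present but cannot be decoded.

```python
def text_with_ocr_fallback(page, max_bad_share=0.1, language="eng", dpi=300):
    """Return the page text, using OCR when too many characters are U+FFFD.

    max_bad_share is an arbitrary illustrative threshold, not a PyMuPDF default.
    """
    text = page.get_text("text", flags=0)
    visible = [c for c in text if not c.isspace()]
    bad = sum(c == "\ufffd" for c in visible)
    if visible and bad / len(visible) <= max_bad_share:
        return text  # normal extraction looks usable
    # Too many undecodable glyphs (or no text at all): OCR the full page image.
    textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
    return page.get_text("text", textpage=textpage)
```

Typical use would be text_with_ocr_fallback(doc[0]) on an opened document. Note that pages without any extractable text are also sent to OCR in this sketch.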

How can you detect this? Is there a PyMuPDF command which you can run which gives you this info? Something like “validate PDF” or whatever. :slight_smile:

Unfortunately, the missing information need not be “universal”, i.e. absent for everything: it may affect just one font out of many on the page or, even worse, just one or a handful of glyphs among many that are fine.
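
Because the problem can be limited to a single font, a per-font tally may be more telling than a page-wide count. The following is only a sketch: undecodable_by_font is a hypothetical helper that walks the blocks / lines / spans structure returned by page.get_text("dict", flags=0) and counts U+FFFD per span font.

```python
from collections import defaultdict

REPLACEMENT = "\ufffd"  # U+FFFD, returned for glyphs without a Unicode mapping


def undecodable_by_font(page_dict):
    """Map font name -> [bad, total] character counts for one page.

    page_dict is expected to be the result of page.get_text("dict", flags=0).
    """
    stats = defaultdict(lambda: [0, 0])
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line.get("spans", []):
                entry = stats[span["font"]]
                entry[0] += span["text"].count(REPLACEMENT)
                entry[1] += len(span["text"])
    return dict(stats)


# Usage with a real file (requires PyMuPDF):
#   import pymupdf
#   page = pymupdf.open("anon-test.pdf")[0]
#   for font, (bad, total) in undecodable_by_font(page.get_text("dict", flags=0)).items():
#       print(f"{font}: {bad}/{total} undecodable")
```

A font whose counts are mostly "bad" is a likely culprit; a font with only a few bad characters matches the "handful of glyphs" case described above.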

I am considering changing the default flags so that this auto-replacement is no longer attempted, which would make the situation detectable. I have already done that in PyMuPDF4LLM and should probably do it in general.

Hmmm, when I check it in Adobe Acrobat I get this weird font name:
[Screenshot: font name as shown in Adobe Acrobat]

Also, if I select, copy and paste the text from Adobe (or Preview), I get garbage as well.

So it seems this is just a badly made PDF, right? (Or deliberately “bad” to prevent easy text copying.)

Exactly the right check. If a PDF viewer can successfully copy/paste text where we return rubbish, then, and only then, we have a problem.

Thanks — Got it! :+1: