This PDF produces garbled text when its text is extracted, e.g.:
import pymupdf
doc = pymupdf.open("anon-test.pdf")
page = doc[0]
text = page.get_text()
print(text)
I think this is a PDF problem! How can we check?
anon-test.pdf (90.4 KB)
The fonts in this PDF are missing back-translation information [visible glyph] ==> Unicode.
The default flags for text extraction try to circumvent this by returning glyph numbers wherever the invalid Unicode character � would otherwise be returned.
This often helps, but sometimes it only increases confusion, as it does here.
Try using page.get_text(flags=0) to confirm that you would normally see � characters.
In any case, there is no way to improve this situation - except using OCR of course.
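As a rough way to run that check across a whole file, here is a minimal sketch. It assumes that extraction with flags=0 yields U+FFFD for every glyph that has no Unicode mapping, as described above. The helper names replacement_ratio and page_ratios are made up for illustration and are not part of PyMuPDF.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the "invalid Unicode" marker


def replacement_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are U+FFFD."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 0.0
    return sum(c == REPLACEMENT_CHAR for c in visible) / len(visible)


def page_ratios(path: str) -> list[float]:
    """Hypothetical helper: one ratio per page, extracted with flags=0."""
    import pymupdf  # imported here so replacement_ratio also works on plain strings

    with pymupdf.open(path) as doc:
        # flags=0 switches off the default glyph-number substitution
        return [replacement_ratio(page.get_text(flags=0)) for page in doc]
```

Calling page_ratios("anon-test.pdf") would give one number per page. A value close to 1.0 suggests that almost nothing on that page can be mapped back to Unicode, while a value near 0.0 suggests the fonts are fine. Any threshold you choose between the two is a judgment call.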
How can you detect this? Is there a PyMuPDF command that gives you this information, something like “validate PDF”?
Unfortunately, the information is not necessarily missing everywhere: the problem may affect just one font out of many on the page or, even worse, just one or a handful of glyphs among many that are fine.
I am considering changing the default flags so that this auto-replacement is no longer attempted, which would make the situation detectable. I have already done that in PyMuPDF4LLM and should probably do it in general.
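Until then, one possible way to see which fonts are affected is to count U+FFFD characters per font in the "dict" output. This is only a sketch under the same assumption as before (flags=0 disables the glyph-number substitution). The function names bad_glyphs_by_font and report are invented for this example.

```python
from collections import Counter


def bad_glyphs_by_font(page_dict: dict) -> dict[str, tuple[int, int]]:
    """Map font name -> (U+FFFD count, total characters) for one page dict."""
    bad, total = Counter(), Counter()
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                font, text = span["font"], span["text"]
                total[font] += len(text)
                bad[font] += text.count("\ufffd")
    return {font: (bad[font], total[font]) for font in total}


def report(path: str) -> None:
    """Hypothetical helper: print every font that has unmapped glyphs."""
    import pymupdf  # imported here so bad_glyphs_by_font works on plain dicts

    with pymupdf.open(path) as doc:
        for page in doc:
            stats = bad_glyphs_by_font(page.get_text("dict", flags=0))
            for font, (n_bad, n_all) in stats.items():
                if n_bad:
                    print(f"page {page.number + 1}: {font}: {n_bad}/{n_all} unmapped")
```

A font that shows only a few unmapped characters out of many would match the "handful of glyphs" case, whereas a font where nearly everything is unmapped points to a missing mapping for the whole font.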
Hmmm, when I check it in Adobe Acrobat I get this weird font name:
Also, if I select, copy and paste the text from Adobe (or Preview), I get garbage there too.
So it seems this is just a badly made PDF, right? (Or one made deliberately “bad” to prevent easy text copying.)
That is exactly the right check. If a PDF viewer can successfully copy and paste text where we return rubbish, then, and only then, do we have a problem.
Thanks — Got it!