Dear Forum,
While working on a script to make changes in PDF files using PyMuPDF, I ran into the following issue:
- The EN DASH (U+2013) is rendered as “?” (question mark) when using insert_textbox() with the tiro font.
However, insert_text() renders the EN DASH as a middle dot '·', which is what I would like to see, so that the output document looks like the input document. Having a real EN DASH would of course be even better.
Can anybody explain this difference? I cannot see anything in the documentation about this difference in behaviour. This could be because I am not a PDF expert.
Ideally I would like to somehow make insert_textbox() copy the behaviour of insert_text().
Thank you in advance for any clarification!
Here is a code snippet that shows the difference:
insert_textbox() versus insert_text()
python3 -c "
# Document different behaviour of insert_textbox and insert_text
import pymupdf
orig_text = 'Fiasp skal injiceres lige inden måltidets start (0–2 minutter før), og det er muligt at injicere det op til'
print(f'Original text: {repr(orig_text)}')
print(f'EN DASH character: {repr(orig_text[50])} (U+{ord(orig_text[50]):04X})')
doc = pymupdf.open()
page = doc.new_page()
# Use TimesNewRomanPSMT -> tiro mapping as in the real script
page.insert_font(fontname='tiro')
# Try insert_textbox (as used in script)
bbox = pymupdf.Rect(50, 50, 500, 100)
result = page.insert_textbox(bbox, orig_text, fontname='/tiro', fontsize=12)
print(f'\\nInsert_textbox result: {result}')
# Read back what was rendered
rendered = page.get_text()
print(f'Rendered text: {repr(rendered)}')
# Check specifically for the character at position 50
if len(rendered) > 50:
    rendered_char = rendered[50]
    print(f'Character at position 50: {repr(rendered_char)} (U+{ord(rendered_char):04X})')
doc.close()
# TEST insert_text()
doc2 = pymupdf.open()
page2 = doc2.new_page()
# Use TimesNewRomanPSMT -> tiro mapping as in the real script
page2.insert_font(fontname='tiro')
# Using insert_text
result2 = page2.insert_text([50,50], orig_text, fontname='/tiro', fontsize=12)
print(f'\\nInsert_text result: {result2}')
# Read back what was rendered
rendered2 = page2.get_text()
print(f'Rendered text: {repr(rendered2)}')
# Check specifically for the character at position 50
if len(rendered2) > 50:
    rendered_char2 = rendered2[50]
    print(f'Character at position 50: {repr(rendered_char2)} (U+{ord(rendered_char2):04X})')
doc2.close()
"
Result of running the script:
sh test.sh
Original text: 'Fiasp skal injiceres lige inden måltidets start (0–2 minutter før), og det er muligt at injicere det op til'
EN DASH character: '–' (U+2013)
Insert_textbox result: 14.612001061439514
Rendered text: 'Fiasp skal injiceres lige inden måltidets start (0?2 minutter før), og det er muligt at injicere
det op til
'
Character at position 50: '?' (U+003F)
Insert_text result: 1
Rendered text: 'Fiasp skal injiceres lige inden måltidets start (0·2 minutter før), og det er muligt at injicere det op til
'
Character at position 50: '·' (U+00B7)
Hi @Steen_Larsen
I think the font used only supports a small set of Unicode values. You would have to use a font with a wider range of support, e.g.
font = pymupdf.Font(fontfile="my-cool-font.ttf")
fontname = "myfont"
page.insert_font(fontname=fontname, fontbuffer=font.buffer)
page.insert_text(point, text, fontname=fontname, …)
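Before switching fonts, it may help to see which characters in the input are likely to cause trouble. The following is only a rough sketch: it assumes the built-in fonts cover little beyond the Latin-1 range (U+0000 to U+00FF), which this thread suggests but does not confirm. The helper name find_non_latin1 is made up for illustration and is not part of PyMuPDF.

```python
import unicodedata

def find_non_latin1(text):
    """Return (index, char, code point, Unicode name) for every
    character above U+00FF. Hypothetical helper, not a PyMuPDF API."""
    hits = []
    for i, ch in enumerate(text):
        if ord(ch) > 0xFF:
            hits.append((i, ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return hits

orig_text = 'Fiasp skal injiceres lige inden måltidets start (0–2 minutter før), og det er muligt at injicere det op til'

# List the characters that fall outside the assumed Latin-1 coverage
for hit in find_non_latin1(orig_text):
    print(hit)
```

For the sample sentence from the question, only the EN DASH at position 50 is reported; å and ø are inside Latin-1 and pass through.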
Or I guess you could do a search for the following set in your input text:
U+2010 HYPHEN (‐)
U+2011 NON-BREAKING HYPHEN (‑)
U+2012 FIGURE DASH (‒)
U+2013 EN DASH (–)
And then replace them all with:
U+002D HYPHEN-MINUS (-)
A hassle, but then I think you could use these built-in fonts and the hyphen characters would render as you might expect.
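The search-and-replace described above could be sketched roughly like this, using plain Python string translation before the text is handed to insert_text() or insert_textbox(). The names DASH_MAP and normalize_dashes are made up for illustration; only the four code points listed above are mapped.

```python
# Map the dash-like characters listed above to U+002D HYPHEN-MINUS.
DASH_MAP = str.maketrans({
    "\u2010": "-",  # HYPHEN
    "\u2011": "-",  # NON-BREAKING HYPHEN
    "\u2012": "-",  # FIGURE DASH
    "\u2013": "-",  # EN DASH
})

def normalize_dashes(text):
    """Return text with the dash variants above replaced by '-'.
    Hypothetical helper, not a PyMuPDF API."""
    return text.translate(DASH_MAP)

print(normalize_dashes("start (0\u20132 minutter f\u00f8r)"))
```

The trade-off is that the output no longer contains a real EN DASH, so this only makes sense when staying with the built-in fonts matters more than typographic fidelity.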
Alternatively, you could use page.insert_htmlbox(). I think it generally gives better results out of the box.
Many thanks!
Apart from the unexplained difference between insert_text() and insert_textbox(), my problem was that a bug in my script made it fall back to the built-in fonts, which have a very reduced character set, instead of using the intended TrueType font.
Yeah - that unexplained difference is a strange one!