The EN DASH (U+2013) is being rendered as "?" (question mark) when using insert_textbox() with the tiro font

Dear Forum,

While working on a script to make changes in PDF files using PyMuPDF, I ran into the following issue:

  • The EN DASH (U+2013) is being rendered as “?” (question mark) when using insert_textbox() with the tiro font.

However, insert_text() renders the EN DASH as a middle dot '·', which is what I would like to see in order to get an output document that looks like the input document. Having a real EN DASH would of course be even better.

Can anybody explain this difference? I cannot see anything in the documentation about this difference in behaviour. This could be because I am not a PDF expert :joy:

Ideally, I would like to somehow make insert_textbox() copy the behaviour of insert_text().

Thank you in advance for any clarification!

Here is a code snippet that shows the difference:

insert_textbox() versus insert_text()

python3 -c "
# Document different behaviour of insert_textbox and insert_text
import pymupdf

orig_text = 'Fiasp skal injiceres lige inden måltidets start (0–2 minutter før), og det er muligt at injicere det op til'
print(f'Original text: {repr(orig_text)}')
print(f'EN DASH character: {repr(orig_text[50])} (U+{ord(orig_text[50]):04X})')

doc = pymupdf.open()
page = doc.new_page()

# Use TimesNewRomanPSMT -> tiro mapping as in the real script
page.insert_font(fontname='tiro')

# Try insert_textbox (as used in script)
bbox = pymupdf.Rect(50, 50, 500, 100)
result = page.insert_textbox(bbox, orig_text, fontname='/tiro', fontsize=12)
print(f'\\nInsert_textbox result: {result}')

# Read back what was rendered
rendered = page.get_text()
print(f'Rendered text: {repr(rendered)}')

# Check specifically for the character at position 50
if len(rendered) > 50:
    rendered_char = rendered[50]
    print(f'Character at position 50: {repr(rendered_char)} (U+{ord(rendered_char):04X})')

doc.close()

# TEST insert_text()

doc2 = pymupdf.open()
page2 = doc2.new_page()

# Use TimesNewRomanPSMT -> tiro mapping as in the real script
page2.insert_font(fontname='tiro')

# Using insert_text

result2 = page2.insert_text([50,50], orig_text, fontname='/tiro', fontsize=12)

print(f'\\nInsert_text result: {result2}')

# Read back what was rendered
rendered2 = page2.get_text()
print(f'Rendered text: {repr(rendered2)}')

# Check specifically for the character at position 50
if len(rendered2) > 50:
    rendered_char2 = rendered2[50]
    print(f'Character at position 50: {repr(rendered_char2)} (U+{ord(rendered_char2):04X})')

doc2.close()
"

Result of running the script:

sh test.sh
Original text: 'Fiasp skal injiceres lige inden måltidets start (0–2 minutter før), og det er muligt at injicere det op til'
EN DASH character: '–' (U+2013)

Insert_textbox result: 14.612001061439514
Rendered text: 'Fiasp skal injiceres lige inden måltidets start (0?2 minutter før), og det er muligt at injicere
det op til

Character at position 50: '?' (U+003F)

Insert_text result: 1
Rendered text: 'Fiasp skal injiceres lige inden måltidets start (0·2 minutter før), og det er muligt at injicere det op til

Character at position 50: '·' (U+00B7)

Hi @Steen_Larsen

I think the font used only supports a small set of Unicode values. You would have to use a font that covers a wider range of characters, e.g.

font = pymupdf.Font(fontfile="my-cool-font.ttf")
fontname = "myfont"
page.insert_font(fontname=fontname, fontbuffer=font.buffer)
page.insert_text(point, text, fontname=fontname, …)

Or I guess you could do a search for the following set in your input text:

U+2010 HYPHEN (‐)

U+2011 NON-BREAKING HYPHEN (‑)

U+2012 FIGURE DASH (‒)

U+2013 EN DASH (–)

And then replace them all with:
U+002D HYPHEN-MINUS (-)

A hassle, but then I think you could use these built-in fonts and the hyphen characters would render as you might expect.
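A minimal sketch of that search-and-replace step, using only the Python standard library. The names DASH_TABLE and normalize_dashes are made up for illustration; the four code points are the ones listed above. This only changes the text before it is passed to insert_text() or insert_textbox(), so the output will show a plain hyphen rather than a real EN DASH.

```python
# Map the four dash-like code points listed above to U+002D HYPHEN-MINUS.
# DASH_TABLE and normalize_dashes are hypothetical names for this sketch.
DASH_TABLE = str.maketrans({
    "\u2010": "-",  # HYPHEN
    "\u2011": "-",  # NON-BREAKING HYPHEN
    "\u2012": "-",  # FIGURE DASH
    "\u2013": "-",  # EN DASH
})


def normalize_dashes(text: str) -> str:
    """Return text with the four dash variants replaced by '-'."""
    return text.translate(DASH_TABLE)


sample = "start (0\u20132 minutter f\u00f8r)"
normalized = normalize_dashes(sample)
print(repr(normalized))
```

The normalized string can then be passed to the insert call in place of the original text.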

Alternatively, you could use page.insert_htmlbox(); I think it generally gives better results out of the box.

Many thanks!

Apart from the unexplained difference between insert_text() and insert_textbox(), my problem was that a bug caused my script to fall back to the built-in fonts, which have a very reduced character set, instead of using the intended TrueType font.
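A silent fallback like this could perhaps be caught earlier by scanning the text for characters the font is unlikely to cover. The sketch below is a rough pre-check using only the standard library. The threshold of 255 is an assumption (that the built-in fonts only handle single-byte code points), not something confirmed in this thread, and find_unsupported is a made-up helper name.

```python
import unicodedata


def find_unsupported(text: str, max_code: int = 255):
    """List (index, char, Unicode name) for code points above max_code.

    max_code=255 is an assumed limit for the built-in fonts, not a
    documented guarantee; adjust it for the font actually in use.
    """
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > max_code
    ]


orig_text = "Fiasp skal injiceres lige inden måltidets start (0\u20132 minutter før)"
for index, char, name in find_unsupported(orig_text):
    print(f"index {index}: {char!r} U+{ord(char):04X} {name}")
```

If the list is non-empty while a built-in font is selected, that is a hint that either the font choice or the text needs attention before inserting.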

Yeah - that unexplained difference is a strange one!