PyMuPDF works well for replacing text in a PDF with the following workflow, which has been working fine so far:
1. Find the placeholder text
2. Mark the text area (bbox)
3. Redact the placeholder text
4. Draw the new text
In step 4, insert_textbox() is used instead of insert_text() to make use of the alignment features. This works fine for Arial, but not for e.g. Avenir Next or other fonts. In these cases the height of the textbox is always too small to render the text, even if the replacement text is shorter than the placeholder! chars_written always comes back negative, meaning there is not enough space to render the text.
Here are some screenshots to visualize the issue (left: placeholder, right: replacement):
Legend: the black dot is the text origin, the blue rect is the text height/fontsize, the red rect is the bbox, and the green rect is the textbox (rounded corners for better visibility).
When measuring text height and box height, the text is always slightly taller than the box (the difference only shows in the decimal places):
Font-Height: 35.96999979019165 (bigger)
Box-Height: 35.969970703125 (smaller)
So my assumption was that increasing the box height by 0.1 should be sufficient to fit the text in. But the resulting chars_written is -2.13, meaning no text is drawn (negative value).
Does the box height need some (undocumented?) extra margin of +2.14???
When doing so (box height + 2.14), the text is rendered, but this cannot be the intended behaviour? Every change in font size, font type, etc. changes the required margin unpredictably. Just adding a fixed margin of +5 or so is not an option.
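A fixed margin can possibly be avoided by reading the return value differently. As far as I understand the PyMuPDF documentation, insert_textbox() returns a float: zero or positive means success (unused vertical space), while negative means nothing was drawn and the value is the height deficit. Under that assumption, the box could be grown by exactly the reported deficit and the call retried. The sketch below only illustrates that idea: fit_textbox, fake_draw, NEEDED and the pad value are hypothetical names and numbers, and the drawing call is passed in as a function so the logic can be tried without a PDF.

```python
from typing import Callable, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def fit_textbox(
    draw: Callable[[Rect], float],
    box: Rect,
    max_tries: int = 3,
    pad: float = 0.01,
) -> Tuple[Rect, float]:
    """Call draw(box); while it reports a deficit, grow the box and retry.

    Assumption: draw() behaves like insert_textbox(), i.e. a negative
    return value means nothing was drawn and its magnitude is the
    missing height. The box is extended downward (y1 grows).
    """
    rc = draw(box)
    tries = 0
    while rc < 0 and tries < max_tries:
        x0, y0, x1, y1 = box
        box = (x0, y0, x1, y1 - rc + pad)  # rc is negative, so this adds height
        rc = draw(box)
        tries += 1
    return box, rc


# Stand-in for the real drawing call (hypothetical): pretends the text
# needs NEEDED points of height and reports surplus or deficit.
NEEDED = 38.11


def fake_draw(rect: Rect) -> float:
    return (rect[3] - rect[1]) - NEEDED


final_box, rc = fit_textbox(fake_draw, (87.50, 484.63, 209.10, 520.60))
print(final_box, rc)

# With PyMuPDF the callable might look like this (untested sketch):
#   draw = lambda r: page.insert_textbox(fitz.Rect(r), text, fontsize=22,
#                                        fontname=name, fontfile=path, align=1)
```

This does not explain where the extra height comes from; it only replaces a guessed margin with the value the call itself reports.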
Looking closely at the last screenshot, it is noticeable that the text does not start at the origin. Might that be a hint to the culprit?
Has anyone else experienced this, or has an idea what's going on, or how to handle this?
BTW: The problem only concerns the height. If the widths of text and box match, it works (no margin needed).
This is how I calculate font-height:
font_size = span.get("size")
font_asc = span.get("ascender")
font_desc = span.get("descender")
line_height = font_asc + (-font_desc)
font_height = line_height * font_size
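Here is how the box height is determined:
def get_box_height(rect: fitz.Rect) -> float:
    return rect.y1 - rect.y0

# Wrap in fitz.Rect in case the span holds a plain tuple,
# as in the dict returned by page.get_text("dict")
bbox = fitz.Rect(span.get("bbox"))
box_height = get_box_height(bbox)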
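To make the comparison reproducible, here is a small stand-alone sketch of both calculations on a plain dict. The span values are hypothetical: size and bbox are taken from the numbers in this post, while the ascender and descender are chosen only so that the result lands near the reported 35.97. span_heights is an illustrative helper, not part of PyMuPDF.

```python
from typing import Any, Mapping, Tuple


def span_heights(span: Mapping[str, Any]) -> Tuple[float, float]:
    """Return (font_height, box_height) for a span-like dict."""
    # Font height: (ascender - descender) scaled by the font size.
    font_height = (span["ascender"] - span["descender"]) * span["size"]
    # Box height: vertical extent of the bbox tuple (x0, y0, x1, y1).
    _, y0, _, y1 = span["bbox"]
    return font_height, y1 - y0


# Hypothetical span; ascender/descender are assumed values.
span = {
    "size": 22.0,
    "ascender": 1.0,
    "descender": -0.635,
    "bbox": (87.50, 484.63, 209.10, 520.60),
}

font_height, box_height = span_heights(span)
print(font_height, box_height, font_height - box_height)
```

With these inputs both heights agree up to floating-point noise, which matches the observation that the two measured values differ only far behind the decimal point and cannot account for a deficit of about 2.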
Here is how the text is drawn; the embedded font is extracted from the PDF and reused as the font file:
chars_written = page.insert_textbox(
    replacement.box,     # FontBox: (87.50, 484.63, 209.10, 520.60)
    replacement.text,    # replaced
    color=color,         # (0,0,0,1)
    align=alignment,     # 1 (centered)
    fontsize=font.size,  # 22.0
    fontname=font.name,  # AvenirNext-Medium
    fontfile=font_path,  # /tmpfile/AvenirNext-Medium.cff
)