Underlines not handled by pymupdf4llm.to_markdown

I am using pymupdf4llm.to_markdown to extract patient leaflet information inside PDF documents from the European Medicines Agency.

The PDF I am using has underlined text which does not appear underlined in the generated md file. In one case the md file contains 9 tabs followed by the text that should have been underlined, and in other cases I simply get the text without any formatting at all. I would have expected underlined text using the <ins></ins> markup.

Snippet from page 2 in the PDF looks like this:

The markup output is this:

**2.** **QUALITATIVE AND QUANTITATIVE COMPOSITION**

1 mL of the solution contains 100 units of insulin aspart* (equivalent to 3.5 mg).

                                    Fiasp 100 units/mL FlexTouch solution for injection in pre filled pen

Each pre-filled pen contains 300 units of insulin aspart in 3 mL solution.

Fiasp 100 units/mL Penfill solution for injection in cartridge

Is this a bug or a feature? :slight_smile:

The file is : https://www.ema.europa.eu/en/documents/product-information/fiasp-epar-product-information_en.pdf

I use the demo code below.

import pathlib

import pymupdf4llm

# Convert the PDF to Markdown; write_images=True also saves page images to disk.
md_text = pymupdf4llm.to_markdown("/pleaflet/samples/fiasp-epar-product-information_en.pdf", write_images=True, force_text=False, image_size_limit=0)
pathlib.Path("noutput.md").write_bytes(md_text.encode())

I am not sure why the results italicised the "Fiasp 100 units/mL FlexTouch …" line, so I would say that is a bug.

There is no native underline tag in Markdown though, so we would need to consider extending the method to insert HTML-like syntax such as <u> if we were to support this feature.

There exists no way to render underlines in Markdown!

Yes and no. My understanding is that Markdown has a murky past where certain Perl scripts were the spec, and that some topics are grey zones and remain undefined even today.

Regarding underline:

  1. GitHub specifies <ins> for underline here:
    Basic writing and formatting syntax - GitHub Docs - I also tested it in a README.md file, where it displays with an underline on GitHub.

  2. This platform handles <ins> as this

  3. Interestingly this spec does not mention any support for underline. GitHub Flavored Markdown Spec

  4. My Visual Studio Code Markdown previewer supports <ins> and shows it as underlined

The HTML tag <u> does not work on GitHub or on this platform. It only works in my VS Code Markdown previewer.

IMHO it would make sense to support the <ins> tag for underline in Markdown.

PS.

Sadly _text_ gives italic in Markdown because the original author preferred to reserve underline for links. This made sense when Markdown was used for the web only. But today it is used for many non-web purposes.

Well, of course we could support the <ins>. But I don’t want to do that.
Reasons:

  • Support among renderers is limited because it is not part of the official syntax. LLMs may not / will not understand what they are getting.
  • Text volume may be bloated considerably by the long strings <ins> / </ins>.
  • The purpose is purely aesthetic. It plays no role in LLM workflows. If at all, we might consider outputting it as "bold".
  • There is a fundamental uncertainty with respect to detection: text that is "accidentally" above a horizontal line cannot be told apart from text that is truly underlined. Examples:
    • page separator lines (e.g. page headers separated from page body)
    • text in tables having grid lines: would cause all cell text to be underlined.
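The detection ambiguity described in the last bullet can be illustrated with plain geometry. This is only a sketch and not pymupdf4llm's code: the function name `looks_underlined`, the tuple layouts and the tolerance values are assumptions made for illustration. It treats a horizontal line as an underline when it sits just below a text box and covers most of its width, and shows that a table grid line passes the same test.

```python
def looks_underlined(text_box, line, max_gap=3.0, min_cover=0.8):
    """Hypothetical heuristic: does `line` look like an underline of `text_box`?

    text_box: (x0, y0, x1, y1), with y growing downward as in PyMuPDF page space.
    line:     (x0, y, x1), a horizontal line segment.
    max_gap / min_cover are made-up tolerances, not values from any library.
    """
    tx0, _ty0, tx1, ty1 = text_box
    lx0, ly, lx1 = line
    gap = ly - ty1  # vertical distance from the text bottom to the line
    if not (-1.0 <= gap <= max_gap):
        return False
    width = tx1 - tx0
    overlap = min(tx1, lx1) - max(tx0, lx0)  # horizontal overlap of text and line
    return width > 0 and overlap / width >= min_cover


# A genuinely underlined heading: the line matches the text width.
heading = (72, 100, 300, 112)
underline = (72, 113.5, 300)

# Text in a table cell: the grid line runs across the whole table.
cell_text = (80, 200, 150, 212)
grid_line = (72, 214, 520)

print(looks_underlined(heading, underline))    # True
print(looks_underlined(cell_text, grid_line))  # True, although this is a false positive
```

Tightening the test (for example, requiring the line to be about as wide as the text) would reject this particular grid line, but a cell whose borders happen to match its text width would still be misclassified.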

Thanks for your answer! I really appreciate you taking time for this. Even if I don’t agree fully.

As I wrote above it seems most renderers support <ins>. All 3 I use did. Should somebody not support it they will simply ignore the tag and no harm will be done.

I really don’t see that as an issue. If you really think that is an issue why don’t you see a problem supporting bold text? The difference between ** for bold and the <ins> tag for an underline is only 7 bytes. Not an issue in any LLM context IMHO. (Unless some idiot tries to individually underline every character in a document :grinning_face: )
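The 7-byte figure can be checked with a few lines of Python. This counts only the bytes of the delimiters themselves; the helper name `markup_overhead` is made up for this illustration, and how many LLM tokens each delimiter costs depends on the tokenizer and is not measured here.

```python
def markup_overhead(open_tag, close_tag):
    """Bytes added by wrapping one piece of text in the given delimiters."""
    return len(open_tag.encode("utf-8")) + len(close_tag.encode("utf-8"))


bold_cost = markup_overhead("**", "**")        # 2 + 2 = 4 bytes
ins_cost = markup_overhead("<ins>", "</ins>")  # 5 + 6 = 11 bytes

print(ins_cost - bold_cost)  # 7
```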

My use case is using LLMs to edit documents. Extracting relevant text and reformatting it. Something which replaces a lot of cumbersome manual work. For this use case the formatting is very important. I am sure other people do this too. If you don’t think aesthetics count then why are we bothering with formatting at all?

That is a good point. As a first simple step you could support the simple case where the PDF font itself is underlined. That should be pretty simple and cover 90% of use cases.
Talking about text and nearby lines, I actually saw a problem in your table handling where text inside the table was detected as strikethrough. I will look for this and describe it in a separate post.

I could see an argument for an extra parameter in to_markdown which could be “support unofficial MD syntax”, support_extended_md maybe?
If this were set to True (False by default), it could try to parse the document looking for underlines and insert <ins> tags.
But I don’t think this should be default behaviour.
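To make the opt-in idea concrete, here is a minimal sketch of how such a flag might gate the output. Note that `support_extended_md` is only the name proposed in this thread and is not an existing `to_markdown` parameter; `render_span` and its arguments are likewise hypothetical and do not reflect pymupdf4llm internals.

```python
def render_span(text, *, bold=False, underlined=False, support_extended_md=False):
    """Hypothetical span formatter: emit <ins> only when the caller opts in."""
    out = text
    if bold:
        out = f"**{out}**"  # standard Markdown, always emitted
    if underlined and support_extended_md:
        out = f"<ins>{out}</ins>"  # HTML-like extension, off by default
    return out


title = "Fiasp 100 units/mL Penfill solution for injection in cartridge"

print(render_span(title, underlined=True))
# Fiasp 100 units/mL Penfill solution for injection in cartridge

print(render_span(title, underlined=True, support_extended_md=True))
# <ins>Fiasp 100 units/mL Penfill solution for injection in cartridge</ins>
```

With the flag left at its default, the output is the same as today, so existing users would see no change.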

Regarding the detection issue: I think we would likely get quite a few false positives if text is very close to the lines of drawn tables. Additionally, as I understand it, there is no such thing as "an underlined font", so the idea of a PDF font which is underlined doesn't make sense to me here!