Looking for a smart way to extract pdf pages per article

Hi there!

I have an OCR-scanned PDF of a dictionary. The only way the article headings are marked is by their position and bold text (which Tesseract OCR doesn't capture). I want to extract the articles from this PDF, one by one. What would be a suitable approach to 1) mark the article headings and 2) make PDFs/images per article?

Hi LemonJuice,

welcome to our Forum!
This is a hard task. With few exceptions, OCR engines are not good at extracting text details like font, font size, font properties, or text color. Nor do you usually find vector graphics recognition.
While Tesseract OCR is, to our knowledge, the best free OCR product (which is why we have chosen it), it certainly has its drawbacks, like the ones mentioned.
In addition, it does not differentiate between mono-spaced and proportional fonts: its internal “GlyphLessFont” is mono-spaced. Therefore, the best you can expect is precision down to the word level. Single-character coordinates are never correctly recognized.

Keeping all that in mind, the only way to identify article breaks is to watch out for unusually large inter-line gaps.

Once you have identified the bounding box of an article, you can make a Pixmap of that area via pix = page.get_pixmap(clip=bbox), then save the pixmap to an image via e.g. pix.save("article.jpg").

It will be interesting to find the appropriate Tesseract language support: It seems you will need Scandinavian plus IPA …

Thanks for the reply. It is indeed hard. But the OCR is actually pretty decent, after some training. I think I got the character error rate down to around 3% with pretty minimal training data (equal to 18 pages of ground-truth files).

But the problem remains: how should I make a script that recognizes articles? I like your idea of looking for spacing as an identifier; unfortunately, I don't see that the spacing above article headers is any different from the spacing between paragraphs.

Maybe I could manually add some kind of marker (like “@@@”) to the OCRed text in a text file, though not directly to the PDF text, as far as I know. Then the question becomes different: how do you write a script that takes the text file with articles separated by “@@@”, maps those to the PDF, and then extracts the articles?

Are you saying that after the OCR process the bold font weights are no longer recognised? (e.g. doing json = page.get_text("json") doesn't show any different kind of font for the bold text?)

Are you saying that after the OCR process the bold font weights are no longer recognised?

Exactly. Apart from the text itself (and even that with some luck … and a fortunate choice of language support), no text metadata is recognized: no italic, bold, color, mono-spaced or serifed.
Even the font size must be indirectly deduced from the bbox height.

Tesseract uses one and only one font for all detected text: GlyphLessFont. This is a mono-spaced font with normal weight and style.
Hence even character bboxes (-> “rawdict”) are fake … at best.

I don't see how you could do that, given that you would need geometrical information for it, which you can only hope to find out after analyzing the detected text positions.
What might work is computing all line bottom coordinates (based on “dict” output) and making a statistical analysis: average, maximum, and minimum vertical line distances.
Then assume an article break before each line with an overly large distance to its predecessor.

Got it - we need OCR+ it seems! :slight_smile:

Tesseract-OCR is still the best free tool.
ABBYY is much better but expensive.

So all we can do is lower the expectations we place in OCR.
I don’t remember how often I had to repeat the mantra: “OCR always means forgetting information - never gaining any.”

With ChatGPT I wrote a script that finds spacing (or gaps) between boxes to identify article headings, but it gives a lot of false positives. On the page I sent earlier, for example, the paragraphs beginning with Ieur. were also identified by the script as article headings. I can filter out most of the false positives, but it nonetheless takes time and isn't fool-proof.

Let's say I try another manual approach, like highlighting each of the PDF's article headings. Could I then do text extraction and get PDF pages/images for each highlighted article heading? I read your article here: Advanced Text Manipulation Using PyMuPDF | by Harald Lieder | Medium. It seems like PyMuPDF can extract the highlighted words. But would it be able to extract text from highlighted word X1 to the following highlighted word X2, in other words capture the whole article?

Yes: if there are area markers like highlights, then text extraction can restrict output to those areas.
But the original problem recurs, just in a different disguise: how do you determine the coordinates of those highlights? Insert them manually?

ChatGPT gave me this snippet:

import fitz  # PyMuPDF

# assumes: doc = fitz.open(...) on the PDF with highlighted headings

# --- Gather headings and bboxes ---
headings = []
for page_num, page in enumerate(doc):
    for annot in page.annots() or []:
        if annot.type[0] == 8:  # 8 = highlight annotation
            verts = annot.vertices
            bboxes = []
            highlight_text = []
            for i in range(0, len(verts), 4):
                quad = verts[i:i + 4]
                rect = fitz.Quad(quad).rect
                text = page.get_textbox(rect).strip()
                if text:
                    highlight_text.append(text)
                    bboxes.append(list(rect))
            text = " ".join(highlight_text)
            if text:
                headings.append({
                    "headword": text,
                    "page_idx": page_num,
                    "pdf_page": page_num + 1,
                    "bboxes": bboxes,
                })

# --- Sort by PDF page and vertical position ---
headings = sorted(
    headings,
    key=lambda h: (h["page_idx"], h["bboxes"][0][1] if h["bboxes"] else 0),
)

articles = []
for i, h in enumerate(headings):
    start_page = h["page_idx"]
    start_y2 = h["bboxes"][0][3] if h["bboxes"] else 0  # bottom of heading
    if i + 1 < len(headings):
        end_page = headings[i + 1]["page_idx"]
        end_y1 = headings[i + 1]["bboxes"][0][1] if headings[i + 1]["bboxes"] else float("inf")
    else:
        end_page = len(doc) - 1
        end_y1 = float("inf")

    # --- Article text extraction ---
    article_text = ""
    for p in range(start_page, end_page + 1):
        page = doc[p]
        blocks = page.get_text("blocks")
        if p == start_page and p == end_page:
            # Single-page article: between heading bottom and next heading top
            region_blocks = [b for b in blocks if b[1] >= start_y2 and b[3] <= end_y1]
        elif p == start_page:
            # First page: below the heading
            region_blocks = [b for b in blocks if b[1] >= start_y2]
        elif p == end_page:
            # Last page: above the next heading
            region_blocks = [b for b in blocks if b[3] <= end_y1]
        else:
            region_blocks = blocks  # middle pages: take everything
        # Sort blocks by (y1, x1) for reading order
        region_blocks = sorted(region_blocks, key=lambda b: (b[1], b[0]))
        article_text += "".join(b[4] for b in region_blocks)
    articles.append({"headword": h["headword"], "text": article_text})

It works pretty well, but I need to tweak it somewhat.

Yep - ChatGPT and other LLMs like Claude are pretty adept at generating good results. If you understand some of the finer details, then after a few tweaks it is usually good to go.
Just watch out for the odd hallucination. Also, the fact that they still call PyMuPDF “Fitz” (a historical name for the library) tells me that they were perhaps trained on documentation from over a year ago. Once they start producing code which does import pymupdf instead of import fitz I will feel happier!

FYI: here is a fun blog piece about “fitz” from about a year ago now - enjoy! PyMuPDF 1.24.3 and Farewell to “Fitz” | Artifex