OCR and Scanned PDFs: How Text Recognition Works in the Browser

What is a Scanned PDF?

A scanned PDF is created when a physical document, such as a letter, a contract, or an archival record, is digitised with a scanner and saved as a PDF file. Unlike a “born-digital” PDF exported directly from Word or InDesign, a scanned PDF contains no machine-readable text, only an image of each page.

This means you cannot select, search, or copy text from the document in the usual way. For assistive technologies like screen readers, the document appears as a blank page. This is where OCR technology comes in.

What is OCR?

OCR stands for Optical Character Recognition. The technology analyses an image and attempts to identify the characters and words it shows. Modern OCR systems use machine learning and neural networks to achieve high accuracy, even with varying fonts, text orientations, and backgrounds.

OCR is not perfect. Accuracy depends on scan quality, font type, document age, and layout complexity. A clear, well-lit scan of a modern document with standard typefaces can achieve an accuracy rate of over 99%, whereas handwritten documents, faxes, and low-resolution scans can produce significantly worse results.
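
One common way to put a number on OCR accuracy is character accuracy: compare the OCR output with a hand-checked reference text and count how many single-character edits separate them. The sketch below illustrates that idea and is not necessarily how the figure above was measured. The function names are chosen for this example only.

```typescript
// Levenshtein edit distance: the minimum number of single-character
// insertions, deletions, and substitutions needed to turn `a` into `b`.
function editDistance(a: string, b: string): number {
  // Normalise to NFC so that "å" compares equal whether it is stored as one
  // code point or as "a" plus a combining ring; then split by code point.
  const s = Array.from(a.normalize("NFC"));
  const t = Array.from(b.normalize("NFC"));

  // prev[j] = distance between the first i-1 characters of s and the first j of t.
  let prev: number[] = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // delete a character from s
        curr[j - 1] + 1,    // insert a character into s
        prev[j - 1] + cost  // substitute, or keep a matching character
      );
    }
    prev = curr;
  }
  return prev[t.length];
}

// Character accuracy = 1 - (edit distance / reference length), floored at 0.
function characterAccuracy(reference: string, ocrOutput: string): number {
  const refLength = Array.from(reference.normalize("NFC")).length;
  if (refLength === 0) return ocrOutput.length === 0 ? 1 : 0;
  return Math.max(0, 1 - editDistance(reference, ocrOutput) / refLength);
}
```

For example, `characterAccuracy("rødgrød med fløde", "rodgrod med flode")` gives 14/17, about 0.82, because three of the 17 reference characters were misread.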

Open Source OCR Directly in the Browser

PDFAccess uses an open-source OCR engine that runs directly in your browser via WebAssembly. No data is sent to a server — all processing happens locally on your device.

Factors Affecting OCR Accuracy

The accuracy of OCR results depends on a number of factors that are useful to understand when working with scanned documents:

Scan resolution: A minimum of 300 DPI is recommended for good OCR. At lower resolutions each character is made up of fewer pixels, which makes it harder to recognise.
Contrast and lighting: Even lighting without shadows and good contrast between text and background gives the best results.
Font type: Standard fonts such as Times New Roman and Arial are recognised better than decorative or handwritten typefaces.
Document age and condition: Older documents with yellowed pages, ink bleed, or tears present greater challenges.
Language model: The OCR engine must use the correct language model for optimal accuracy. PDFAccess supports Danish and English.
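
The resolution factor above can be checked with simple arithmetic when only the pixel dimensions of a scan are known: DPI is the pixel width divided by the paper width in inches. The following is a minimal sketch under the assumption that the original paper size is known. The function names and the default threshold parameter are illustrative.

```typescript
const MM_PER_INCH = 25.4;

// Effective resolution of a scanned page, given its width in pixels and the
// physical width of the paper it was scanned from.
function effectiveDpi(pixelWidth: number, paperWidthMm: number): number {
  return pixelWidth / (paperWidthMm / MM_PER_INCH);
}

// True if the scan reaches the recommended minimum (300 DPI by default).
// The result is rounded because page sizes are rounded to whole pixels.
function meetsRecommendedDpi(
  pixelWidth: number,
  paperWidthMm: number,
  minDpi: number = 300
): boolean {
  return Math.round(effectiveDpi(pixelWidth, paperWidthMm)) >= minDpi;
}
```

An A4 page (210 mm wide) scanned to 2480 pixels across works out to roughly 300 DPI, while the same page at 1654 pixels across is only about 200 DPI and falls below the recommendation.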

Multilingual OCR and Danish Document Processing

Danish poses particular challenges for OCR, mainly because of the letters æ, ø, and å, which an engine without a Danish language model can misread as similar-looking characters. PDFAccess’s OCR engine includes a dedicated Danish language model trained on large amounts of Danish text and handles these special characters correctly.

PDFAccess automatically downloads the required language model the first time the OCR function is used. The model is approximately 10 MB and is needed for correct recognition of Danish text. On subsequent use, the model is cached in the browser and does not need to be downloaded again.
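
The download-once-then-cache behaviour described above follows a general cache-first pattern. The sketch below shows that pattern only and is not PDFAccess's actual code. `ModelStore`, `Downloader`, `MemoryStore`, and `loadLanguageModel` are hypothetical names, and the in-memory store stands in for whatever persistent browser storage a real implementation would use.

```typescript
// Hypothetical storage interface. In a browser it could be backed by
// persistent storage; here it is kept abstract so the logic stays testable.
interface ModelStore {
  get(key: string): Promise<Uint8Array | undefined>;
  set(key: string, data: Uint8Array): Promise<void>;
}

// Hypothetical download function, e.g. a thin wrapper around fetch().
type Downloader = (url: string) => Promise<Uint8Array>;

// Cache-first load: return the stored model if present, otherwise download
// it once, store it, and return it.
async function loadLanguageModel(
  url: string,
  store: ModelStore,
  download: Downloader
): Promise<Uint8Array> {
  const cached = await store.get(url);
  if (cached !== undefined) {
    return cached; // subsequent use: no network request
  }
  const data = await download(url); // first use: fetch the model
  await store.set(url, data);
  return data;
}

// Simple in-memory store for demonstration. It does not survive a page reload.
class MemoryStore implements ModelStore {
  private readonly items = new Map<string, Uint8Array>();

  async get(key: string): Promise<Uint8Array | undefined> {
    return this.items.get(key);
  }

  async set(key: string, data: Uint8Array): Promise<void> {
    this.items.set(key, data);
  }
}
```

Calling `loadLanguageModel` twice with the same URL and store invokes the downloader only on the first call.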

Practical Tips for Better OCR Results

If you want to improve the quality of OCR output from your scanned documents, here are the most important recommendations:

Scan at a minimum of 300 DPI, preferably 400 to 600 DPI for documents with fine print.
Use black-and-white or greyscale scanning rather than colour, unless the document contains important colour information.
Avoid skewed scans: the document should lie flat and straight on the scanner glass.
Clean the scanner glass regularly to remove dust and fingerprints.
Consider rescanning low-DPI documents at a higher resolution if OCR results are poor.
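
When only a colour scan is available and rescanning is not an option, the image can be converted to greyscale in software before recognition. The following is a minimal sketch of such a conversion, assuming the pixels are in the RGBA layout used by canvas image data. The function name is illustrative, the luma weights are the widely used Rec. 601 values, and the source does not say whether PDFAccess performs this step.

```typescript
// Convert RGBA pixel data (4 bytes per pixel: red, green, blue, alpha)
// to greyscale. Returns a new array and leaves the input unchanged.
function toGreyscale(rgba: Uint8ClampedArray): Uint8ClampedArray {
  const out = new Uint8ClampedArray(rgba.length);
  for (let i = 0; i + 3 < rgba.length; i += 4) {
    // Weighted sum of the colour channels (Rec. 601 luma weights).
    const luma = 0.299 * rgba[i] + 0.587 * rgba[i + 1] + 0.114 * rgba[i + 2];
    // Uint8ClampedArray rounds and clamps the value to 0..255 on assignment.
    out[i] = luma;
    out[i + 1] = luma;
    out[i + 2] = luma;
    out[i + 3] = rgba[i + 3]; // keep the alpha channel as it was
  }
  return out;
}
```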