OCR February 17, 2026 · 7 min read

Multilingual OCR: How to Extract Text From PDFs in Any Language

Guide to running OCR on non-English documents — Arabic, Chinese, Japanese, Russian, and more — with the best free and paid tools.


AltoUnlockPDF Team

PDF Tools Expert

The majority of documents in the world are not in English. If you’re working with German legal contracts, French medical records, Chinese business documents, or Arabic government forms, you need an OCR tool that truly handles that language.

This guide covers multilingual OCR across the major language families.


Why Language Matters for OCR

OCR engines work by matching pixel patterns against trained character models. For this to work:

  1. The engine must have training data for that language’s alphabet/characters
  2. It must understand how characters combine into words (language model)
  3. For complex scripts (Arabic, Thai, Devanagari), it needs special handling of ligatures and diacritics

Using an English OCR engine on French text: passable (mostly the same alphabet, with errors on accented characters such as é, à, and ç)

Using an English OCR engine on Arabic: completely unusable (entirely different script)
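One way to catch a wrong-language run early is to check which script the OCR output is actually written in. The sketch below is an illustration rather than part of any OCR library: `dominant_script` and `SCRIPT_PREFIXES` are hypothetical names, and the approach assumes that the first word of each character's Unicode name identifies its script, which holds for the scripts listed but is not a full script database.

```python
import unicodedata
from collections import Counter

# Script names as they appear at the start of Unicode character names.
# This list is an assumption for illustration; extend it as needed.
SCRIPT_PREFIXES = ("LATIN", "CYRILLIC", "ARABIC", "HEBREW", "CJK",
                   "HIRAGANA", "KATAKANA", "HANGUL", "THAI", "DEVANAGARI")

def dominant_script(text):
    """Return the most common script among the letters in `text`, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        for prefix in SCRIPT_PREFIXES:
            if name.startswith(prefix):
                counts[prefix] += 1
                break
    return counts.most_common(1)[0][0] if counts else None

print(dominant_script("Привет, мир"))  # CYRILLIC
```

If a document you expected to be Arabic comes back with a dominant script of LATIN, the wrong language model was probably used.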


Supported Languages by Tool

Tesseract (100+ Languages)

Tesseract is the most multilingual free OCR engine. Language packs must be installed separately:

# Install language packs (Ubuntu)
sudo apt install tesseract-ocr-deu  # German
sudo apt install tesseract-ocr-fra  # French
sudo apt install tesseract-ocr-ara  # Arabic
sudo apt install tesseract-ocr-chi-sim  # Chinese Simplified
sudo apt install tesseract-ocr-jpn  # Japanese

# macOS (Homebrew)
brew install tesseract-lang

# List installed languages
tesseract --list-langs

# Run OCR with specific language
tesseract arabic_doc.jpg output -l ara
tesseract chinese_doc.jpg output -l chi_sim
tesseract multilingual.jpg output -l fra+eng
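When these commands are scripted, it can help to confirm that every requested language pack is installed before starting a long batch. The following is a minimal sketch under a few assumptions: the helper names (`parse_list_langs`, `build_command`, `installed_languages`) are hypothetical, and the parser assumes the listing starts with a header line beginning with "List of", followed by one language code per line.

```python
import subprocess

def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    # Assumed format: a header line starting with "List of", then one code per line.
    return [ln for ln in lines if not ln.startswith("List of")]

def build_command(image_path, output_base, langs, installed):
    """Return the argv list for a tesseract run, or raise if a pack is missing."""
    missing = [code for code in langs if code not in installed]
    if missing:
        raise ValueError("missing language packs: " + ", ".join(missing))
    # Multiple languages are joined with '+', as in `-l fra+eng`.
    return ["tesseract", image_path, output_base, "-l", "+".join(langs)]

def installed_languages():
    """Query the local tesseract binary (must be on PATH)."""
    result = subprocess.run(["tesseract", "--list-langs"],
                            capture_output=True, text=True, check=True)
    # Some versions print the listing to stderr, so read both streams.
    return parse_list_langs(result.stdout + result.stderr)
```

A typical use would be `subprocess.run(build_command("multilingual.jpg", "output", ["fra", "eng"], installed_languages()), check=True)`.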

Languages by Difficulty

Latin Script Languages (Easy)

English, French, German, Spanish, Italian, Portuguese — all use the same basic alphabet with minor variations. Tesseract handles these excellently.

Cyrillic Script (Moderate)

Russian, Ukrainian, Bulgarian, Serbian: all well supported by Tesseract and most OCR tools. Key Tesseract codes: rus, ukr, bul, srp.

Arabic / Hebrew (Challenging — RTL)

Arabic and Hebrew are right-to-left scripts with complex joining rules. Dedicated models are needed:

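import pytesseract
from PIL import Image

# Load the image once and reuse it for both calls
image = Image.open('arabic.jpg')

# Arabic OCR with default settings
text = pytesseract.image_to_string(image, lang='ara')

# Same call with explicit engine and layout settings:
#   --oem 3  use the default OCR engine mode
#   --psm 6  assume a single uniform block of text
# These flags do not set text direction; that comes from the 'ara' model.
custom_config = r'--oem 3 --psm 6'
text = pytesseract.image_to_string(image, lang='ara', config=custom_config)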

For Arabic documents, the commercial ABBYY FineReader often produces noticeably more accurate results than Tesseract.

Chinese / Japanese / Korean (CJK — Complex)

CJK scripts have thousands of characters. Dedicated models are required:

  • Tesseract: chi_sim (Simplified Chinese), chi_tra (Traditional Chinese), jpn (Japanese), kor (Korean)
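  • PaddleOCR (by Baidu) is generally better for CJK than Tesseract:

from paddleocr import PaddleOCR

# Written against the PaddleOCR 2.x API; newer releases may differ
ocr = PaddleOCR(use_angle_cls=True, lang='ch')  # Chinese
result = ocr.ocr('chinese_document.jpg', cls=True)
# result[0] holds the lines of the first image; it can be None if nothing was detected
for line in result[0] or []:
    print(line[1][0])  # extracted text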
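Dense CJK pages often contain a few characters the engine is unsure about, so it can be useful to separate confident lines from doubtful ones. The sketch below assumes each entry in a page of results has the shape `[box, (text, score)]`, which is what the `line[1][0]` indexing above implies for PaddleOCR 2.x. The function name `split_by_confidence` and the 0.80 threshold are illustrative choices, not PaddleOCR defaults.

```python
def split_by_confidence(page, min_score=0.80):
    """Split one page of results into (accepted_texts, low_confidence_texts).

    `page` is assumed to be a list of [box, (text, score)] entries,
    or None when nothing was detected.
    """
    accepted, uncertain = [], []
    for _box, (text, score) in page or []:
        if score >= min_score:
            accepted.append(text)
        else:
            uncertain.append(text)
    return accepted, uncertain

# Hand-written sample data in the assumed shape, not real OCR output
sample_page = [
    [[[0, 0], [90, 0], [90, 20], [0, 20]], ("你好", 0.98)],
    [[[0, 30], [90, 30], [90, 50], [0, 50]], ("世界", 0.42)],
]
accepted, uncertain = split_by_confidence(sample_page)
```

The low-confidence lines are good candidates for a manual check or a second pass with a different engine.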

Cloud OCR APIs for Multilingual Documents

When accuracy is critical and volume justifies cost:

Google Cloud Vision API

  • Supports 50+ languages
  • Excellent for CJK and Arabic
  • $1.50 per 1,000 pages

from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open('document.jpg', 'rb') as f:
    image = vision.Image(content=f.read())

# Specify language hints
image_context = vision.ImageContext(language_hints=['zh', 'en'])
response = client.text_detection(image=image, image_context=image_context)

# The API reports failures inside the response rather than raising
if response.error.message:
    raise RuntimeError(response.error.message)
print(response.full_text_annotation.text)

AWS Textract

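  • Printed text in 6 languages (English, Spanish, German, Italian, French, Portuguese); no Arabic or CJK support
  • Best for forms and tables
  • $1.50 per 1,000 pages for basic text detection; form and table extraction is priced higher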
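To judge whether volume justifies the cost, a rough estimate from the per-page rate is usually enough. This sketch uses the $1.50 per 1,000 pages figure quoted above as its default. The function name `estimate_cost` and the optional `free_pages` parameter are assumptions for illustration, and real bills depend on the provider's current pricing tiers and on which features are used.

```python
def estimate_cost(pages, rate_per_1000=1.50, free_pages=0):
    """Estimate OCR API cost in dollars for a given number of pages."""
    billable = max(pages - free_pages, 0)  # never bill a negative page count
    return billable * rate_per_1000 / 1000

print(estimate_cost(10_000))  # 15.0
```

At that rate, a 250,000-page archive would cost about $375 before any free tier or volume discount.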

AltoUnlockPDF Language Support

Our OCR tool supports 35 languages including:

  • All major European languages
  • Russian and other Cyrillic scripts
  • Arabic and Hebrew (RTL support)
  • Chinese (Simplified and Traditional)
  • Japanese and Korean

Select your language from the dropdown before converting.


Tips for Non-Latin Script OCR

  1. Ensure correct text direction is set (RTL for Arabic/Hebrew)
  2. Avoid heavily compressed JPEG; scan or export to PNG or TIFF for sharper character edges
  3. Font clarity matters more than it does for Latin text, because many non-Latin scripts have denser, more complex strokes
  4. Post-process with a native spell checker for the specific language
  5. Use script-specific tools (PaddleOCR for CJK, dedicated Arabic OCR for Arabic business documents)
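For the first tip, a quick check on the extracted text can confirm whether a page is predominantly right-to-left before you pick a direction setting or a display mode. This is a minimal sketch using Python's standard `unicodedata` module. The function names and the 0.5 threshold are illustrative assumptions, and only strongly directional characters are counted.

```python
import unicodedata

def rtl_ratio(text):
    """Fraction of strongly directional characters that are right-to-left."""
    rtl = ltr = 0
    for ch in text:
        direction = unicodedata.bidirectional(ch)
        if direction in ("R", "AL"):  # "R" covers Hebrew, "AL" covers Arabic letters
            rtl += 1
        elif direction == "L":        # left-to-right letters such as Latin or Cyrillic
            ltr += 1
    total = rtl + ltr
    return rtl / total if total else 0.0

def is_mostly_rtl(text, threshold=0.5):
    """True when more than `threshold` of the directional characters are RTL."""
    return rtl_ratio(text) > threshold
```

Digits, spaces, and punctuation are ignored because they have no strong direction of their own.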

For mission-critical multilingual document processing, a common approach in enterprise settings is to combine a cloud OCR API such as Google Cloud Vision with human review of the output.
