OCR February 17, 2026 · 7 min read

Multilingual OCR: How to Extract Text From PDFs in Any Language

Guide to running OCR on non-English documents — Arabic, Chinese, Japanese, Russian, and more — with the best free and paid tools.

AltoUnlockPDF Team

PDF Tools Expert

The majority of documents in the world are not in English. If you’re working with German legal contracts, French medical records, Chinese business documents, or Arabic government forms, you need an OCR tool that truly handles that language.

This guide covers multilingual OCR across the major writing systems: Latin, Cyrillic, Arabic and Hebrew, and Chinese, Japanese, and Korean.

Why Language Matters for OCR

OCR engines work by matching pixel patterns against trained character models. For this to work:

The engine must have training data for that language’s alphabet/characters
It must understand how characters combine into words (language model)
For complex scripts (Arabic, Thai, Devanagari), it needs special handling of ligatures and diacritics

Using an English OCR engine on French text: passable (mostly the same alphabet, but expect errors on accented characters such as é, à, and ç)

Using an English OCR engine on Arabic: completely unusable (entirely different script)
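
One way to put a number on "passable" versus "unusable" is the character error rate (CER): the edit distance between the OCR output and a correct transcription, divided by the length of the transcription. The sketch below is a plain-Python illustration. The function names are hypothetical, and the sample strings are made up for the example rather than taken from a real OCR run.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]


def character_error_rate(truth: str, ocr_output: str) -> float:
    """Edit distance relative to the length of the ground truth."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(truth, ocr_output) / len(truth)


truth = "Le médecin a été appelé"
# Hypothetical output of an English-only model: accents dropped
english_model_guess = "Le medecin a ete appele"
print(f"CER: {character_error_rate(truth, english_model_guess):.1%}")
```

Comparing the CER of the same page under `-l eng` and `-l fra` gives a quick, if rough, indication of whether installing the right language pack is worth it.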

Supported Languages by Tool

Tesseract (100+ Languages)

Tesseract is free, open source, and ships trained models for more than 100 languages. Only English is typically installed by default, so other language packs must be installed separately:

# Install language packs (Ubuntu)
sudo apt install tesseract-ocr-deu  # German
sudo apt install tesseract-ocr-fra  # French
sudo apt install tesseract-ocr-ara  # Arabic
sudo apt install tesseract-ocr-chi-sim  # Chinese Simplified
sudo apt install tesseract-ocr-jpn  # Japanese

# macOS (Homebrew): tesseract-lang installs all language packs at once
brew install tesseract-lang

# List installed languages
tesseract --list-langs

# Run OCR with specific language
tesseract arabic_doc.jpg output -l ara
tesseract chinese_doc.jpg output -l chi_sim
tesseract multilingual.jpg output -l fra+eng
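
The same steps can be scripted. The sketch below uses only the Python standard library and the `tesseract --list-langs` and `-l` options shown above; the helper names are hypothetical, and it assumes the `tesseract` binary is on the PATH. It checks that every requested language pack is installed before running OCR.

```python
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output.

    Language codes never contain spaces, so the header line is skipped
    by ignoring any line that does.
    """
    lines = (line.strip() for line in output.splitlines())
    return {line for line in lines if line and " " not in line}


def installed_languages() -> set[str]:
    proc = subprocess.run(["tesseract", "--list-langs"],
                          capture_output=True, text=True, check=True)
    # Read both streams in case a build writes the list to stderr
    return parse_list_langs(proc.stdout + "\n" + proc.stderr)


def build_command(image: str, out_base: str, langs: list[str]) -> list[str]:
    """Join language codes with '+', as in `-l fra+eng`."""
    if not langs:
        raise ValueError("at least one language code is required")
    return ["tesseract", image, out_base, "-l", "+".join(langs)]


def run_ocr(image: str, out_base: str, langs: list[str]) -> None:
    missing = set(langs) - installed_languages()
    if missing:
        raise RuntimeError(f"missing language packs: {sorted(missing)}")
    subprocess.run(build_command(image, out_base, langs), check=True)
```

Calling `run_ocr("multilingual.jpg", "output", ["fra", "eng"])` would then be equivalent to the last command above, with an early error if a pack is missing.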

Languages by Difficulty

Latin Script Languages (Easy)

English, French, German, Spanish, Italian, Portuguese: all use the same basic alphabet with minor variations such as accents and umlauts. Tesseract handles these well on clean scans. Key Tesseract codes: eng, fra, deu, spa, ita, por.

Cyrillic Script (Moderate)

Russian, Ukrainian, Bulgarian, Serbian: well supported by Tesseract and most OCR tools. Key Tesseract codes: rus, ukr, bul, srp.

Arabic / Hebrew (Challenging — RTL)

Arabic and Hebrew are written right to left, and Arabic letters also change shape depending on their position in a word. Dedicated models are needed (Tesseract codes: ara, heb):

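import pytesseract
from PIL import Image

image = Image.open('arabic.jpg')

# Arabic OCR with default settings
text = pytesseract.image_to_string(image, lang='ara')

# Optional tuning: --oem 3 selects the default engine and --psm 6 treats the
# page as a single uniform block of text. Neither flag sets text direction;
# right-to-left handling comes from the 'ara' language data.
custom_config = r'--oem 3 --psm 6'
text = pytesseract.image_to_string(image, lang='ara', config=custom_config)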

For Arabic documents, ABBYY FineReader significantly outperforms Tesseract.
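
A cheap sanity check after Arabic or Hebrew OCR is to measure how much of the output actually consists of right-to-left letters; a low share may indicate that the wrong language model was applied. The sketch below relies on the standard library's `unicodedata.bidirectional`. The function names and the 0.6 threshold are assumptions chosen for illustration, not tuned values.

```python
import unicodedata


def rtl_ratio(text: str) -> float:
    """Share of letters whose Unicode bidi class is right-to-left.

    'R' covers Hebrew letters and 'AL' covers Arabic letters.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    rtl = sum(1 for ch in letters
              if unicodedata.bidirectional(ch) in ("R", "AL"))
    return rtl / len(letters)


def looks_like_rtl(text: str, threshold: float = 0.6) -> bool:
    """True when at least `threshold` of the letters are right-to-left."""
    return rtl_ratio(text) >= threshold
```

Running `looks_like_rtl(text)` on the string returned by `image_to_string` lets a batch job flag pages for re-processing instead of silently storing Latin-script garbage.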

Chinese / Japanese / Korean (CJK — Complex)

CJK scripts have thousands of characters. Dedicated models are required:

Tesseract: chi_sim (Simplified Chinese), chi_tra (Traditional Chinese), jpn (Japanese), kor (Korean)
PaddleOCR (by Baidu) is generally better for CJK than Tesseract:

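from paddleocr import PaddleOCR

# Written against the PaddleOCR 2.x API; lang='ch' loads the Chinese model
ocr = PaddleOCR(use_angle_cls=True, lang='ch')
result = ocr.ocr('chinese_document.jpg', cls=True)

# result[0] holds the lines of the first image and can be None if no text is found
for box, (text, confidence) in result[0] or []:
    print(text)  # extracted text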
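
The loop above prints every detected line, including low-confidence ones. A minimal sketch of filtering by confidence follows. It assumes the PaddleOCR 2.x result layout, where `result[0]` is a list of `[box, (text, score)]` entries; `extract_lines` is a hypothetical helper, and the sample data is hand-written in that layout rather than real OCR output.

```python
def extract_lines(result, min_confidence=0.5):
    """Return recognized text lines scoring at least min_confidence.

    Assumes the PaddleOCR 2.x layout: result[0] is a list of
    [box, (text, score)] entries, or None when nothing was detected.
    """
    page = result[0] if result else None
    if not page:
        return []
    return [text for _box, (text, score) in page if score >= min_confidence]


# Hand-written sample in the assumed layout (not real OCR output)
sample = [[
    [[[0, 0], [10, 0], [10, 5], [0, 5]], ("你好", 0.98)],
    [[[0, 6], [10, 6], [10, 11], [0, 11]], ("世界", 0.31)],
]]
print(extract_lines(sample))  # ['你好']
```

The 0.5 default is arbitrary; a sensible cutoff depends on scan quality and on how costly a missed line is compared with a wrong one.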

Cloud OCR APIs for Multilingual Documents

When accuracy is critical and volume justifies cost:

Google Cloud Vision API

Supports 50+ languages
Excellent for CJK and Arabic
$1.50 per 1,000 pages

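from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open('document.jpg', 'rb') as f:
    image = vision.Image(content=f.read())

# Language hints are optional; without them the API detects the language itself
image_context = vision.ImageContext(language_hints=['zh', 'en'])
response = client.text_detection(image=image, image_context=image_context)

# Errors are reported in the response body, so check before using the result
if response.error.message:
    raise RuntimeError(response.error.message)
print(response.full_text_annotation.text)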

AWS Textract

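Limited language coverage: printed text in English, Spanish, French, German, Italian, and Portuguese, with no Arabic, Cyrillic, or CJK support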
Best for forms and tables
$1.50 per 1,000 pages for plain text detection; form and table extraction is priced higher
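
To judge whether volume justifies the cost, the $1.50 per 1,000 pages rate quoted above can be turned into a quick estimate. This is a sketch under a flat-rate assumption: it ignores free tiers, volume discounts, and add-on features, and `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal


def estimate_cost(pages: int, rate_per_thousand: str = "1.50") -> Decimal:
    """Estimated OCR cost in dollars at a flat per-1,000-page rate."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    cost = Decimal(pages) * Decimal(rate_per_thousand) / Decimal(1000)
    return cost.quantize(Decimal("0.01"))  # round to whole cents


print(estimate_cost(25_000))  # 37.50
```

`Decimal` is used instead of floats so that the result rounds cleanly to cents.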


AltoUnlockPDF Language Support

Our OCR tool supports 35 languages including:

All major European languages
Russian and other Cyrillic scripts
Arabic and Hebrew (RTL support)
Chinese (Simplified and Traditional)
Japanese and Korean

Select your language from the dropdown before converting.

Tips for Non-Latin Script OCR

Ensure the correct text direction is set (RTL for Arabic and Hebrew)
Avoid heavily compressed JPEG; lossless PNG or TIFF keeps character edges sharper
Font clarity matters more than it does for Latin text, because many non-Latin scripts have denser, more complex strokes
Post-process the output with a spell checker for the specific language
Use script-specific tools (PaddleOCR for CJK, dedicated Arabic OCR for Arabic business documents)
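
Tesseract reads images rather than PDFs, so PDF pages have to be rasterized first, and picking a lossless format at that step follows the PNG/TIFF tip above. The sketch below assumes Poppler's `pdftoppm` command-line tool is installed. The function names are hypothetical, and the 300 DPI default is a commonly used value for OCR, not a figure from this guide.

```python
import subprocess
from pathlib import Path


def build_rasterize_command(pdf: str, out_dir: str, dpi: int = 300) -> list[str]:
    """Build a pdftoppm call that writes one PNG per page as page-<n>.png."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    prefix = str(Path(out_dir) / "page")
    return ["pdftoppm", "-png", "-r", str(dpi), pdf, prefix]


def rasterize(pdf: str, out_dir: str, dpi: int = 300) -> list[Path]:
    """Render every page of the PDF to PNG and return the image paths."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_rasterize_command(pdf, out_dir, dpi), check=True)
    return sorted(Path(out_dir).glob("page-*.png"))
```

Each returned path can then be passed to `tesseract` or `pytesseract.image_to_string` with the appropriate language code.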

For mission-critical multilingual document processing, a common approach in enterprise settings is to combine a cloud OCR API such as Google Cloud Vision with human review of the output.
