Implement script-detection module
It would be helpful to use Tesseract's OSD (orientation and script detection) module to detect the script and orientation of each page.
$ tesseract --psm 0 -l osd lanjingdeyanjing0000bing_0030.jp2 -
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 577
Warning. Invalid resolution 0 dpi. Using 70 instead.
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 8.27
Script: Han
Script confidence: 0.48
$ tesseract --psm 0 -l osd baghobahar_0020.jp2 -
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 291
Warning. Invalid resolution 0 dpi. Using 70 instead.
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 14.67
Script: Arabic
Script confidence: 55.56
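The output above is easy to parse into a per-page record. Below is a minimal sketch in Python, assuming the same command line as in the runs above and that `tesseract` is on the PATH. The names `OSD_FIELDS`, `parse_osd` and `detect_page` are made up for illustration and do not exist in any current code.

```python
import subprocess

# Field label as printed by Tesseract -> (key in our result, type to cast to).
# The labels are copied from the sample output above.
OSD_FIELDS = {
    "Orientation in degrees": ("orientation", int),
    "Rotate": ("rotate", int),
    "Orientation confidence": ("orientation_confidence", float),
    "Script": ("script", str),
    "Script confidence": ("script_confidence", float),
}


def parse_osd(text):
    """Extract the OSD fields from Tesseract's output, ignoring warnings."""
    result = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        label = label.strip()
        if sep and label in OSD_FIELDS:
            key, cast = OSD_FIELDS[label]
            result[key] = cast(value.strip())
    return result


def detect_page(image_path):
    """Run OSD on one page image and return the parsed fields."""
    proc = subprocess.run(
        ["tesseract", "--psm", "0", "-l", "osd", str(image_path), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Parse both streams so it does not matter which one carries the fields.
    return parse_osd(proc.stdout + "\n" + proc.stderr)
```

For the second run above, `detect_page("baghobahar_0020.jp2")` would be expected to return a dict with `script` set to `"Arabic"` and `script_confidence` set to `55.56`.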
We could run this on every page (later we could perhaps sample pages instead) and pick the most prominent scripts across the whole item, taking the reported confidence into account. How exactly we use the result as input is TBD.
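One possible way to combine the per-page results, sketched in Python: sum the script confidence per script and rank by the total, skipping pages below a threshold. The function name `rank_scripts` and the `min_confidence` cutoff are assumptions for illustration only. The right weighting and threshold are still open.

```python
from collections import defaultdict


def rank_scripts(page_results, min_confidence=1.0):
    """Rank scripts by summed confidence over all pages.

    page_results: iterable of (script, script_confidence) pairs, one per page.
    min_confidence: hypothetical cutoff; pages below it are ignored.
    Returns a list of (script, total_confidence, page_count), best first.
    """
    totals = defaultdict(float)
    pages = defaultdict(int)
    for script, confidence in page_results:
        if confidence < min_confidence:
            continue  # too uncertain to count as a vote
        totals[script] += confidence
        pages[script] += 1
    ranked = [(script, totals[script], pages[script]) for script in totals]
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked
```

With the default cutoff, a page like the first example above (Han at 0.48) would be dropped, while the second (Arabic at 55.56) would count. Whether low-confidence pages should be dropped or merely down-weighted is a design choice left open here.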
We could also write the detected script to a metadata value: ocr_detected_script
(Could this be a repeatable field, so that several scripts can be recorded?)