Tesseract issues
https://git.archive.org/www/tesseract/-/issues
2021-01-05T14:33:21Z

Perhaps use polyglot's language detection
https://git.archive.org/www/tesseract/-/issues/36
2021-01-05T14:33:21Z, Merlijn Wajer

Looks like the polyglot language detection can discern Traditional and Simplified Chinese, which have the same language code.
This can help in the case of autonomous mode where we attempt language detection with both `HanS` and `HanT`, but don't know which script to use when we just detect `zh` as language.
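
A minimal sketch of how a detected code could be mapped to one of the two script models. It assumes that polyglot reports Traditional Chinese with a `Hant` suffix (`zh_Hant` or `zh-Hant`) and plain `zh` for Simplified; that mapping, and the helper names `pick_han_script` and `detect_code`, are illustrative and not taken from the module.

```python
# Hypothetical mapping from a detected language code to a Tesseract script model.
# Assumption: plain "zh" means Simplified, "zh_Hant" means Traditional.
HAN_SCRIPT_MODELS = {
    "zh": "HanS",
    "zh_hant": "HanT",
}


def normalize_code(code):
    """Lower-case the code and unify '-' and '_' separators."""
    return code.strip().lower().replace("-", "_")


def pick_han_script(detected_code, default=None):
    """Return 'HanS' or 'HanT' for a Chinese language code, else `default`."""
    return HAN_SCRIPT_MODELS.get(normalize_code(detected_code), default)


def detect_code(text):
    """Detect the language code of `text` with polyglot (third-party, optional)."""
    from polyglot.detect import Detector  # imported lazily; needs polyglot installed

    return Detector(text).language.code
```

`pick_han_script(detect_code(page_text))` would then select the script, with `None` meaning the text was not detected as Chinese.
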
https://polyglot.readthedocs.io/en/latest/Detection.html

Potentially use three letter language ISO639 codes for ocr_detect_lang
https://git.archive.org/www/tesseract/-/issues/35
2020-11-19T22:51:37Z, Merlijn Wajer

(Autonomous) Fraktur/script detection
https://git.archive.org/www/tesseract/-/issues/34
2020-11-11T01:23:39Z, Merlijn Wajer

This collection might be useful, which seems to be 2/3 Fraktur: https://archive.org/details/pub_abendpost-sonntagpost

Perhaps use character confidence somewhere
https://git.archive.org/www/tesseract/-/issues/33
2020-11-09T22:17:45Z, Merlijn Wajer

From: https://groups.google.com/g/tesseract-ocr/c/SN8L0IA_0D4
```
Hi,
I think the confidence score is returned by the neural network itself. In my experience values below 95 are usually unusable. Above 99 is usually correct. I would set the threshold somewhere between 97.5 and 98.5 depending on your requirements.
The lowest value I have ever seen is 75 but anything below 90 is extremely rare, even below 95 is rare.
From a very very rough measurement on the data I'm using with a 97.5 score you have about 10% wrong characters on average and 2% at 99.
This is based on fine tuned models (on validation data), it partially depends on what model you are using, image quality, etc.
```

Set up tesseract repo clone and add a gitlab-ci.yml to automatically build the .deb
https://git.archive.org/www/tesseract/-/issues/29
2021-01-30T01:46:42Z, Merlijn Wajer

Fraktur testing collection
https://git.archive.org/www/tesseract/-/issues/24
2020-10-24T13:01:17Z, Merlijn Wajer

Andrea shared this collection, which has lots of Fraktur:
https://archive.org/details/ushmm

Consider disabling OpenMP altogether for Tesseract
https://git.archive.org/www/tesseract/-/issues/23
2020-10-24T13:00:41Z, Merlijn Wajer

Stefan Weil wrote:
```
For maximum throughput you might consider compiling an optimized Tesseract without OMP support. That's faster than using code with OMP support and disabling it. In a first step I suggest to try the latest Tesseract from PPA (https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr-devel?field.series_filter=focal) and compare the throughput.
```

Consider setting `tessedit_do_invert=0` for books/microfilm without inverted text
https://git.archive.org/www/tesseract/-/issues/22
2020-10-24T13:00:11Z, Merlijn Wajer

Stefan Weil wrote:
```
The processing time per page can be reduced significantly for pages without inverted text (= most pages). By default, Tesseract tries OCR twice, once for normal image and once for inverted image. This behaviour can be deactivated with parameter tessedit_do_invert.
```

Consider a testing version of tesseract (5.x) in a testing container
https://git.archive.org/www/tesseract/-/issues/20
2021-01-30T01:46:42Z, Merlijn Wajer

Stefan Weil wrote:
For maximum throughput you might consider compiling an optimized Tesseract without OMP support. That's faster than using code with OMP support and disabling it. In a first step I suggest to try the latest Tesseract from PPA (https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr-devel?field.series_filter=focal) and compare the throughput.

Merlijn Wajer

Consider writing PPI to hOCR files
https://git.archive.org/www/tesseract/-/issues/19
2020-10-19T12:03:28Z, Merlijn Wajer

PPI would be per page: http://kba.cloud/hocr-spec/1.2/#scan_res

Implement test phase on several languages to ensure that the modules work fine
https://git.archive.org/www/tesseract/-/issues/18
2020-10-19T09:25:15Z, Merlijn Wajer

Consider adding DPI info via page property 'scan_res'
https://git.archive.org/www/tesseract/-/issues/16
2020-10-15T12:33:41Z, Merlijn Wajer

http://kba.cloud/hocr-spec/1.2/#propdef-scan_res
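
A minimal sketch of writing `scan_res` into the `title` attribute value of an `ocr_page` element, assuming the property takes the horizontal and vertical resolution as two integers, as the linked spec describes. The helper name `add_scan_res` is hypothetical, and the naive split on `;` ignores quoted file names that contain a semicolon.

```python
def add_scan_res(title, x_res, y_res=None):
    """Return an hOCR title value with a `scan_res x_res y_res` property.

    Any existing scan_res property is replaced. If y_res is omitted,
    the same resolution is assumed in both directions.
    """
    if y_res is None:
        y_res = x_res
    # Split the title into its ';'-separated properties and drop empty ones.
    props = [p.strip() for p in title.split(";") if p.strip()]
    # Remove a previous scan_res entry so the property appears only once.
    props = [p for p in props if not p.startswith("scan_res")]
    props.append(f"scan_res {x_res} {y_res}")
    return "; ".join(props)
```

For example, `add_scan_res("bbox 0 0 2480 3508; ppageno 0", 300)` yields a title value ending in `scan_res 300 300`.
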
Could be useful for PDF generation -- although we can likely also get it via item info and scandata.

(perhaps) Improve hocr files: add ocr-langs, ocr-scripts
https://git.archive.org/www/tesseract/-/issues/15
2020-10-15T11:28:15Z, Merlijn Wajer

See http://kba.cloud/hocr-spec/1.2/#metadata

Implement fallback procedure in case OCR fails
https://git.archive.org/www/tesseract/-/issues/11
2020-10-14T22:13:48Z, Merlijn Wajer

If the `skipocr` item level metadata key is available, we might not want to red row if Tesseract errors out, but just insert an empty page (with a note that processing failed?)
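
The fallback idea could be sketched roughly as follows. Everything here is hypothetical: `ocr` stands for whatever callable runs Tesseract on one page, and `allow_fallback` stands for the decision derived from the item metadata, which this sketch does not model.

```python
def ocr_page_with_fallback(ocr, image_path, allow_fallback):
    """Run `ocr(image_path)`; on failure optionally return an empty page.

    Returns a (text, note) tuple. `note` is None on success and a short
    failure message when an empty page was substituted.
    """
    try:
        return ocr(image_path), None
    except Exception as exc:
        if not allow_fallback:
            # No fallback requested: let the failure propagate to the caller.
            raise
        # Fallback: empty page plus a note that processing failed.
        return "", f"OCR failed: {exc}"
```

The caller can record the note next to the empty page so that failed pages remain identifiable later.
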
(I think we don't need to do this now, let's just see if Tesseract fails at all)

Implement/check page orientation
https://git.archive.org/www/tesseract/-/issues/3
2020-10-14T11:58:39Z, Merlijn Wajer

Tesseract is not particularly happy when pages are rotated, so we might want to account for that. If the Tesseract OSD (PSM mode 0) can detect orientation properly, then we can rotate input images.
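
A rough sketch of the detection step, assuming Tesseract's OSD report contains `Rotate:` and `Orientation confidence:` lines as it does when run with `--psm 0`. The function names and the `min_confidence` default are illustrative, and the `tesseract ... stdout --psm 0` invocation in `run_osd` is an assumption about how the module would call it.

```python
import subprocess


def parse_osd(output):
    """Parse 'Key: value' lines of an OSD report into a dict of strings."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def rotation_to_apply(output, min_confidence=2.0):
    """Return the rotation in degrees suggested by OSD, or 0 if unsure."""
    fields = parse_osd(output)
    try:
        rotate = int(fields["Rotate"])
        confidence = float(fields["Orientation confidence"])
    except (KeyError, ValueError):
        return 0  # report missing or malformed: leave the image alone
    return rotate if confidence >= min_confidence else 0


def run_osd(image_path):
    """Run Tesseract in OSD-only mode and return its textual report."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "--psm", "0"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

`rotation_to_apply(run_osd(path))` gives the angle to rotate by before OCR; returning 0 on low confidence keeps uncertain pages untouched.
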
We'll have to make sure the resulting PDF and hOCR are sensible, though.

Document module features, limitations and input metadata/arguments in README.rst
https://git.archive.org/www/tesseract/-/issues/1
2020-10-14T10:58:07Z, Merlijn Wajer

It would be good to have standalone documentation for this module:
* What does the module do?
* What files does the module generate?
* What metadata is created/written by the module?
* What metadata keys from the item are taken into account?
* What task arguments are supported?
* What limitations does the module have, and how can we solve some of those?