Perhaps use character confidence somewhere
From: https://groups.google.com/g/tesseract-ocr/c/SN8L0IA_0D4
Hi,
I think the confidence score is returned by the neural network itself. In my experience, values below 95 are usually unusable and values above 99 are usually correct. I would set the threshold somewhere between 97.5 and 98.5, depending on your requirements.
The lowest value I have ever seen is 75, but anything below 90 is extremely rare, and even values below 95 are rare.
From a very rough measurement on the data I'm using, with a 97.5 score you have about 10% wrong characters on average, and about 2% at 99.
This is based on fine-tuned models (measured on validation data); it partially depends on which model you are using, image quality, etc.
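
A minimal sketch of how a threshold like the one suggested above might be applied. It assumes the per-character confidences have already been extracted as (character, confidence) pairs on a 0-100 scale; how to obtain them from Tesseract is not shown here. The names `DEFAULT_THRESHOLD`, `split_by_confidence` and `mask_low_confidence` are hypothetical, and 98.0 is just a value picked from the 97.5-98.5 range mentioned in the post.

```python
from typing import List, Tuple

CharConf = Tuple[str, float]

# Assumption: a value picked from the 97.5-98.5 range suggested in the post.
DEFAULT_THRESHOLD = 98.0


def split_by_confidence(
    chars: List[CharConf], threshold: float = DEFAULT_THRESHOLD
) -> Tuple[List[CharConf], List[CharConf]]:
    """Split (character, confidence) pairs into accepted and rejected lists."""
    accepted: List[CharConf] = []
    rejected: List[CharConf] = []
    for ch, conf in chars:
        # Characters at or above the threshold are kept; the rest are flagged.
        if conf >= threshold:
            accepted.append((ch, conf))
        else:
            rejected.append((ch, conf))
    return accepted, rejected


def mask_low_confidence(
    chars: List[CharConf],
    threshold: float = DEFAULT_THRESHOLD,
    placeholder: str = "?",
) -> str:
    """Rebuild the text, replacing low-confidence characters with a placeholder."""
    return "".join(ch if conf >= threshold else placeholder for ch, conf in chars)


if __name__ == "__main__":
    # Made-up confidences, for illustration only.
    sample = [("c", 99.5), ("a", 96.0), ("t", 98.2)]
    accepted, rejected = split_by_confidence(sample)
    print(mask_low_confidence(sample))
    print(len(accepted), "accepted,", len(rejected), "rejected")
```

Raising the threshold toward 99 should flag more characters for review while letting fewer wrong ones through; the right trade-off depends on the model and image quality, as the post notes.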