Sample pages for script detection
Script detection is taking quite a while: about 40-50% of the time it takes to OCR a page. So let's run it on a sample of pages rather than every page, to lessen the compute load.
We will want to come up with a clever sampling algorithm. For a book with only 5 pages, I assume we just want to process all pages. For a book with 20 pages, we could sample 5-10 pages. For a book with 50 pages, 10-15 is probably enough. For 100 pages, 10 is probably enough. For anything between 100 and 200 pages, we could just sample 15%. Above 200 pages, let's just do 10%.
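A minimal sketch of how these tiers could look in code, as a starting point for discussion rather than a final design. The function names `sample_size` and `sample_pages` are hypothetical, the exact cut-offs inside each range (8 pages for books up to 20 pages, 10 pages for books up to 100 pages) are assumptions picked from the ranges above, and uniform random sampling with a fixed seed is also an assumption, since the selection method is still open.

```python
import random


def sample_size(num_pages: int) -> int:
    """Return how many pages to run script detection on.

    The tiers follow the rough numbers proposed above; the exact
    cut-offs within each range are assumptions, not settled values.
    """
    if num_pages <= 0:
        return 0
    if num_pages <= 5:
        return num_pages              # tiny book: process every page
    if num_pages <= 20:
        return min(num_pages, 8)      # assumed value from the 5-10 range
    if num_pages <= 100:
        return 10                     # assumed flat 10 pages up to 100
    if num_pages <= 200:
        return (num_pages * 15 + 99) // 100   # 15%, rounded up (integer math)
    return (num_pages + 9) // 10              # 10%, rounded up (integer math)


def sample_pages(num_pages: int, seed: int = 0) -> list[int]:
    """Pick zero-based page indices to sample, in ascending order.

    Uniform random sampling with a fixed seed is an assumption here;
    it keeps repeated runs on the same book reproducible.
    """
    k = sample_size(num_pages)
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_pages), k))
```

One thing this sketch makes visible: with the tiers as proposed, a 200-page book gets 30 sampled pages while a 201-page book gets 21, so the boundaries may need smoothing before we settle on final numbers.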