Commit f467c624 authored by Merlijn Wajer

Merge hocr-to-epub

Thanks to Aram Verstegen, still work in progress.

commit 6f6a91929eae49f7fb81813dd6eee2f8ead0e8d8
Author: Merlijn Wajer <>
Date:   Thu Jan 20 18:07:26 2022 +0100

    hocr-to-epub: remove epub verify

    Depends on deprecated code

commit c65a544bcdd18b616ee342ad5b5c370b3454c206
Author: Aram Verstegen <>
Date:   Thu Nov 4 19:13:05 2021 +0100

    Don't abort for low confidence documents. Allow all file paths to be specified externally

commit c8bd25fa3d672841bb8396458810eb1a09ca5e48
Author: Aram Verstegen <>
Date:   Tue Sep 28 00:37:14 2021 +0200

    Trying to improve dehyphenation

commit f603149b00081b4b21ffc6f50b4ed3f722d9d399
Author: Aram Verstegen <>
Date:   Tue Sep 28 00:02:35 2021 +0200

    Use fast storage if available

commit cd1fed48ff9d620da997b52abcef8c243310a864
Author: Aram Verstegen <>
Date:   Mon Sep 27 23:22:16 2021 +0200

    Only add textual metadata tag when text is present

commit 406e780aa069e56dc03811208e3b74931aadc7a4
Author: Aram Verstegen <>
Date:   Mon Sep 27 23:19:36 2021 +0200

    Use WORKING_DIR constant for imagestack basenames

commit ddf098c921cc15651055cb3bf3a13b1c9826cc20
Author: Aram Verstegen <>
Date:   Mon Sep 27 23:15:21 2021 +0200

    Avoid divide by zero

commit 7a99123b4db07bd9f52467c67f3eb0880b4b81e2
Author: Aram Verstegen <>
Date:   Mon Sep 27 23:06:06 2021 +0200

    Added comments

commit 9f57ed9b352c653e8468ac71fb35f582d299d205
Author: Aram Verstegen <>
Date:   Mon Sep 27 23:03:44 2021 +0200

    Forgot a word

commit 800daa28389e4c2e8adbf00d84351f6de23235d6
Author: Aram Verstegen <>
Date:   Mon Sep 27 23:00:46 2021 +0200

    Keep the decoded temp files around to speed up cropping multiple images from one page. Keep track of all the temporary files and delete them in the destructor. Tried to improve stylesheet

commit aa11017506896ce084d15a737c30af6d111436f8
Author: Aram Verstegen <>
Date:   Mon Sep 27 21:05:05 2021 +0200

    Added jp2000 to TIFF conversion using kakadu

commit 139e5fb3d29bd3a3006f438fed343eede451fe7c
Author: Aram Verstegen <>
Date:   Mon Sep 27 20:11:05 2021 +0200


commit 0bf4cd8645fea43c76e8e8f0a5e988844063dbe0
Author: Aram Verstegen <>
Date:   Mon Sep 27 20:08:16 2021 +0200


commit 07805eb01feb04da3926c3e896bf00e98c6ab20a
Author: Aram Verstegen <>
Date:   Mon Sep 27 20:07:57 2021 +0200

    Show warning instead of omitting pages, try to clean up hyphenation

commit 5f068bfd73de26f468bed5621c7605e5a90e4410
Author: Aram Verstegen <>
Date:   Mon Sep 27 17:24:08 2021 +0200

    Removed shortcut for debugging

commit eb649fc0918a13087cd5d6b35cb0a6f3df1d354c
Author: Aram Verstegen <>
Date:   Mon Sep 27 17:23:39 2021 +0200

    Cleaned up accessibility summary spacing

commit 9f9daea19ffb77f08eaf6de57d3959757faefbd2
Author: Aram Verstegen <>
Date:   Mon Sep 27 17:21:57 2021 +0200

    Fixed usage of iso639 module

commit 4d0d8e487b13c0b575d8c8696dc9f21550bca150
Author: Aram Verstegen <>
Date:   Mon Sep 27 17:16:34 2021 +0200

    Fixed (accessibility) metadata

commit ef9cd514f71d4fee22ac7207e0eef8ba8c9c39fe
Author: Aram Verstegen <>
Date:   Tue Sep 7 12:21:39 2021 +0200

    Skipping pages based on scandata.xml info

commit 7d62725ed6b9c2c373002110cb3778d2d04d2e2c
Author: Aram Verstegen <>
Date:   Tue Sep 7 10:52:59 2021 +0200

    Adding the cropped images in the epub file

commit a0d80ea9e641b96a448214b53eeb8f5316d962a5
Author: Aram Verstegen <>
Date:   Mon Sep 6 22:51:13 2021 +0200

    Organisation and comments

commit 94c6f5f1a34919f0d37992b5d96c6b1d9ca1ae0f
Author: Aram Verstegen <>
Date:   Mon Sep 6 22:37:27 2021 +0200

    Comments and naming

commit fdf363e41b500aa7f4693e41b2f6028532090d0e
Author: Aram Verstegen <>
Date:   Mon Sep 6 21:38:51 2021 +0200


commit 48b363550fe7064daafd3a7c51430e8690967dcb
Author: Aram Verstegen <>
Date:   Mon Sep 6 21:25:55 2021 +0200

    Make minimum_page_area_pct actually work as a percentage

commit 3e3c53ad163f3de1e5a7d26fdb58cd5f13580d51
Author: Aram Verstegen <>
Date:   Mon Sep 6 21:01:30 2021 +0200

    Fixed photo box cleaning logic

commit 1394a947461274715687dd5010411940f30c7083
Author: Aram Verstegen <>
Date:   Mon Sep 6 18:33:04 2021 +0200

    Starting with Image Stack (zip file) parsing

commit 587328d5724f11b305ca065c9b0ff33091f7b505
Author: Aram Verstegen <>
Date:   Mon Sep 6 18:32:39 2021 +0200

    Removed recursive requirement

commit 4493c39c063b48323173f6947ad8748225ebf565
Author: Aram Verstegen <>
Date:   Mon Sep 6 18:32:22 2021 +0200

    Added hocr_page_to_photo_data function

commit 4e98dbb1c53641bea647f977f9bddf4dfeb6ff90
Author: Aram Verstegen <>
Date:   Mon Aug 30 17:28:25 2021 +0200

    Take care of metadata provided as lists. Track average OCR word confidence scores

commit c0d074a3833edde98834ee76085bee78751ffc4c
Author: Aram Verstegen <>
Date:   Mon Aug 23 19:49:36 2021 +0200

    Don't create useless TOCs

commit 171507455f3c5202ed1f73a8d50ea7e054acf4cc
Author: Aram Verstegen <>
Date:   Mon Aug 23 19:17:49 2021 +0200

    Updated requirements.txt with pinned versions for dependencies

commit 9107a7d6e124278235b44cf93c2b9d41aef7a326
Author: Aram Verstegen <>
Date:   Mon Aug 23 18:56:07 2021 +0200

    Put code into a class. Added metadata parsing and first steps toward verification

commit 4d8f5a0277257851ec534c5bbc678ee6fad64173
Author: Aram Verstegen <>
Date:   Fri Jul 23 13:57:21 2021 +0200

    Initial PoC for hOCR to EPUB conversion
parent 504032b9
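
Several of the commits above concern per-page OCR confidence: tracking the average word confidence, avoiding a divide by zero on pages without words, and showing a warning instead of omitting low-confidence pages. The following is a minimal sketch of that idea, not the merged implementation. The helper name `page_warning` is hypothetical; the 75.0 threshold and the warning text mirror the values that appear in the diff below.

```python
def page_warning(confidences, threshold=75.0):
    """Return an HTML warning for a low-confidence page, or '' otherwise.

    `confidences` is a list of per-word OCR confidence scores (0-100).
    Hypothetical helper for illustration only.
    """
    # A page without words has no meaningful average; skip it rather
    # than dividing by zero.
    if not confidences:
        return ''
    average = sum(confidences) / len(confidences)
    if average < threshold:
        return ("<b>The text on this page is estimated to be only "
                "%0.02f%% accurate</b>" % average)
    return ''
```

The warning is meant to be prepended to the page HTML, so a low-confidence page is still included in the book rather than dropped.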
#!/usr/bin/env python
import sys
import argparse
from collections import OrderedDict
import hocr.parse
from ebooklib import epub
from derivermodule.metadata import parse_item_metadata
from internetarchivepdf.scandata import *
import iso639
from PIL import Image
import zipfile
import os
import shutil
import subprocess
# Use fast storage for temporary files if it is available
if os.path.exists('/var/tmp/fast'):
    WORKING_DIR = '/var/tmp/fast/'
else:
    WORKING_DIR = '/tmp/'
class ImageStack(object):
filenames = []
images_per_page = {}
temp_files = []
def __init__(self, image_archive_file_path, output_basename):
self.output_basename = output_basename
self.image_archive_file_path = image_archive_file_path
self.tempdir_zip = os.path.join(WORKING_DIR, 'kakadu_input')
self.tempfile_jp2 = os.path.join(WORKING_DIR, 'temp.jp2')
def parse_zip(self):
# Get all the images in the filename order (this should correspond to the page ordering)
self.zf = zipfile.ZipFile(self.image_archive_file_path)
for idx, img in enumerate(sorted(self.zf.namelist())):
info = self.zf.getinfo(img)
if info.is_dir():
def crop_image(self, page, box):
# Keep track of the number of images cropped out from each page
try:
    self.images_per_page[page] += 1
except KeyError:
    self.images_per_page[page] = 0
output_filename = "%s_%04u_%02u.jpeg" % (self.output_basename, page, self.images_per_page[page])
#return output_filename
from datetime import datetime
print("%s - Cropping page %u to box %s" % (, page, box))
# Extract the image from the zipfile
tempfile_tiff = os.path.join(WORKING_DIR, 'page_%u.tiff' % page)
if tempfile_tiff not in self.temp_files:
self.zf.extract(self.filenames[page], self.tempdir_zip)
extracted_file_path = os.path.join(self.tempdir_zip, self.filenames[page])
os.rename(extracted_file_path, self.tempfile_jp2)
cmd = [
'-num_threads', str(1),
'-i', self.tempfile_jp2,
'-o', tempfile_tiff
cmd, stdout=subprocess.DEVNULL, check=True
except subprocess.CalledProcessError as e:
raise RuntimeError(
"Can't convert JP2 to TIFF: {}".format(e)
# Keep track of the temp files so we can delete them later
#fh =[page])
img =
#img =
region = img.crop(box)
return output_filename
def __del__(self):
# Close zipfile
# Clean up temporary files
for tempfile in self.temp_files:
class EpubGenerator(object):
__version__ = '0.0.1'
front_matter = (
'<div class="offset">'
'<p dir="ltr">This book was produced in EPUB format by the '
'Internet Archive.</p> '
'<p dir="ltr">The book pages were scanned and converted to EPUB '
'format automatically. This process relies on optical character '
'recognition, and is somewhat susceptible to errors. The book may '
'not offer the correct reading sequence, and there may be '
'weird characters, non-words, and incorrect guesses at '
'structure. Some page numbers and headers or footers may remain '
'from the scanned page. The process which identifies images might '
'have found stray marks on the page which are not actually images '
'from the book. The hidden page numbering which may be available '
'to your ereader corresponds to the numbered pages in the print '
'edition, but is not an exact match; page numbers will increment '
'at the same rate as the corresponding print edition, but we may '
'have started numbering before the print book\'s visible page '
'numbers. The Internet Archive is working to improve the '
'scanning process and resulting books, but in the meantime, we '
'hope that this book will be useful to you.</p> '
'<p dir="ltr">The Internet Archive was founded in 1996 to build '
'an Internet library and to promote universal access to all '
'knowledge. The Archive\'s purposes include offering permanent '
'access for researchers, historians, scholars, people with '
'disabilities, and ' 'the general public to historical '
'collections that exist in digital format. The Internet Archive '
'includes texts, audio, moving images, '
'and software as well as archived web pages, and provides '
'specialized services for information access for the blind and '
'other persons with disabilities.</p>'
'<p>Created with hocr-to-epub (v.%s)</p></div>'
) % __version__
# define CSS style
style = """
.center {text-align: center}
.sr-only {
    width: 1px;
    height: 1px;
    padding: 0;
    margin: -1px;
    overflow: hidden;
    clip: rect(0,0,0,0);
    border: 0;
}
.strong {font-weight: bold;}
.italic {font-style: italic;}
.serif {font-family: serif;}
.sans {font-family: sans-serif;}
.big {font-size: 1.5em;}
.small {font-size: .75em;}
.offset {
    margin: 1em;
    padding: 1.5em;
    border: black 1px solid;
}
img {
    padding: 0;
    margin: 0;
    max-width: 100%;
    max-height: 100%;
    column-count: 1;
    break-inside: avoid;
    oeb-column-number: 1;
}
p {
    text-indent: 4em;
}
"""
strip_whitespaces = True
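def __init__(self,
        hocr_xml_file_path,
        meta_xml_file_path,
        image_stack_zip_file_path,
        scandata_xml_file_path,
        epub_zip_file_path):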
# Copy arguments to locals
self.hocr_xml_file_path = hocr_xml_file_path
self.meta_xml_file_path = meta_xml_file_path
self.image_stack_zip_file_path = image_stack_zip_file_path
self.scandata_xml_file_path = scandata_xml_file_path
self.epub_zip_file_path = epub_zip_file_path
# Set sensible defaults for arguments that weren't provided
if not self.meta_xml_file_path:
self.meta_xml_file_path = self.hocr_xml_file_path.replace('_hocr.html', '_meta.xml')
if not self.image_stack_zip_file_path:
self.image_stack_zip_file_path = self.hocr_xml_file_path.replace('_hocr.html', '')
if not self.scandata_xml_file_path:
self.scandata_xml_file_path = self.hocr_xml_file_path.replace('_hocr.html', '_scandata.xml')
if not self.epub_zip_file_path:
self.epub_zip_file_path = self.hocr_xml_file_path.replace('_hocr.html', '_ebook.epub')
self.img_stack = ImageStack(self.image_stack_zip_file_path, os.path.join(WORKING_DIR, "epub_img"))
self.metadata = parse_item_metadata(self.meta_xml_file_path)
raise RuntimeError("Could not find _meta.xml file for this item")
self.skip_pages = scandata_xml_get_skip_pages(self.scandata_xml_file_path)
self.skip_pages = []
print("Parsing file %s" % self.hocr_xml_file_path)
def normalize_language(self, language):
"""Attempt to convert a language tag to a valid ISO 639-1 language code"""
return iso639.to_iso639_1(language)
return language
def set_metadata(self):
"""Set the metadata on the epub object"""
if 'language' in self.metadata.keys():
if type(self.metadata['language']) is str:
self.metadata['language'] = self.normalize_language(self.metadata['language'])['language'])
elif type(self.metadata['language']) is list:
self.metadata['language'] = '; '.join(map(self.normalize_language, self.metadata['language']))['language'])
if 'title' in self.metadata.keys():['title'])
if 'creator' in self.metadata.keys():
if type(self.metadata['creator']) is str:['creator'])
elif type(self.metadata['creator']) is list:
for i, creator in enumerate(self.metadata['creator']):
creator_uid = 'creator_{creator_uid}'.format(creator_uid=i), uid=creator_uid)
if 'description' in self.metadata.keys():
if type(self.metadata['description']) is str:'DC', 'description', self.metadata['description'])
elif type(self.metadata['description']) is list:
for description in self.metadata['description']:'DC', 'description', description)
if 'publisher' in self.metadata.keys():
if type(self.metadata['publisher']) is str:'DC', 'publisher', self.metadata['publisher'])
elif type(self.metadata['publisher']) is list:
for publisher in self.metadata['publisher']:'DC', 'publisher', publisher)
if 'identifier-access' in self.metadata.keys():
if type(self.metadata['identifier-access']) is str:
'DC', 'identifier', 'Access URL: {}'.format(
elif type(self.metadata['identifier-access']) is list:
for identifier_access in self.metadata['identifier-access']:
'DC', 'identifier', 'Access URL: {}'.format(
if 'identifier-ark' in self.metadata.keys():
if type(self.metadata['identifier-ark']) is str:
'DC', 'identifier', 'urn:ark:{}'.format(self.metadata['identifier-ark'])
elif type(self.metadata['identifier-ark']) is list:
for identifier_ark in self.metadata['identifier-ark']:
'DC', 'identifier', 'urn:ark:{}'.format(identifier_ark)
if 'isbn' in self.metadata.keys():
if type(self.metadata['isbn']) is str:
'DC', 'identifier', 'urn:isbn:{}'.format(self.metadata['isbn'])
elif type(self.metadata['isbn']) is list:
for isbn in self.metadata['isbn']:
'DC', 'identifier', 'urn:isbn:{}'.format(isbn)
if 'oclc-id' in self.metadata.keys():
if type(self.metadata['oclc-id']) is str:
'DC', 'identifier', 'urn:oclc:{}'.format(self.metadata['oclc-id'])
elif type(self.metadata['oclc-id']) is list:
for oclc_id in self.metadata['oclc-id']:
'DC', 'identifier', 'urn:oclc:{}'.format(oclc_id)
if 'external-identifier' in self.metadata.keys():
if type(self.metadata['external-identifier']) is str:'DC', 'identifier', self.metadata['external-identifier'])
elif type(self.metadata['external-identifier']) is list:
for external_identifier in self.metadata['external-identifier']:'DC', 'identifier', external_identifier)
if 'related-external-id' in self.metadata.keys():
if type(self.metadata['related-external-id']) is str:'DC', 'identifier', self.metadata['related-external-id'])
elif type(self.metadata['related-external-id']) is list:
for related_external_id in self.metadata['related-external-id']:'DC', 'identifier', related_external_id)
if 'subject' in self.metadata.keys():
if type(self.metadata['subject']) is str:'DC', 'subject', self.metadata['subject'])
elif type(self.metadata['subject']) is list:
for subject in self.metadata['subject']:'DC', 'subject', subject)
if 'date' in self.metadata.keys():'DC', 'date', self.metadata['date'])
def set_accessibility_metadata(self):
summary = ''
# Add the accessibility metadata to the publication
summary += (
'The publication was generated using automated character '
'recognition, therefore it may not be an accurate rendition '
'of the original text, and it may not offer the correct '
'reading sequence.'
modes = []
modes_sufficient = []
if self.has_text:
if self.has_images:
summary += ' This publication is missing meaningful alternative text.'
summary += ' The publication otherwise meets WCAG 2.0 Level A.'
OrderedDict([('property', 'schema:accessibilitySummary')])
for mode in modes:
OrderedDict([('property', 'schema:accessMode')])
for mode in modes_sufficient:
OrderedDict([('property', 'schema:accessModeSufficient')])
features = ['none', ]
for feature in features:
OrderedDict([('property', 'schema:accessibilityFeature')])
# these states will be true for any static content, which we know
# is guaranteed for OCR generated texts.
hazards = [
controls = [
for hazard in hazards:
OrderedDict([('property', 'schema:accessibilityHazard')])
for control in controls:
OrderedDict([('property', 'schema:accessibilityControl')])
def generate(self, confidence_threshold=75.0): = epub.EpubBook()
css_file = epub.EpubItem(
front_matter_epub = epub.EpubHtml(title='Notice', file_name='notice.html', lang='en')
pages_hocr = hocr.parse.hocr_page_iterator(self.hocr_xml_file_path)
pages_epub = []
# Iterate all the pages
images_found = 0
words_found = 0
for page_idx, page in enumerate(pages_hocr):
if page_idx in self.skip_pages:
# Get all the words on the page
word_data = hocr.parse.hocr_page_to_word_data(page)
# Get all the photos on the page
photo_boxes = hocr.parse.hocr_page_to_photo_data(page)
page_content = []
page_confidence = 0
words_on_page = 0
# ABBYY converter sometimes identifies linebreaks as a negation sign
hyphens = ['-', '¬']
# Combine all the words on the page
for element in word_data:
line_content = []
for line in element['lines']:
for word in line['words']:
# Save text data
text = word['text']
if self.strip_whitespaces:
text = text.strip()
# Count word confidence scores
page_confidence += word['confidence']
words_found += 1
words_on_page += 1
# Examine the last character of the last element of the line
if len(line_content) and len(line_content[-1]) and line_content[-1][-1] in hyphens:
# Remove the last character if it is a hyphen
line_content[-1] = line_content[-1][:-1]
# Add placeholder value
page_content += line_content
# Flatten list into string and add spaces
page_text = ' '.join(page_content)
# Remove placeholder and spaces in the positions that previously had a line break hyphen
page_text = page_text.replace(' \x7f ', '')
# Create HTML/epub page
page_html = u"<p>%s</p>" % page_text
# Add a warning if the confidence in the text is below the given threshold
if words_on_page:
page_confidence = page_confidence/words_on_page
if page_confidence < confidence_threshold:
page_html = (u"<b>The text on this page is estimated to be only %0.02f%% accurate</b>" % page_confidence) + page_html
# Add all the images from the page
images_on_page = 0
for image_idx, box in enumerate(photo_boxes):
cropped_image_filename = self.img_stack.crop_image(page_idx, box)
# Read the cropped JPEG and close the file handle afterwards
with open(cropped_image_filename, "rb") as fh:
    cropped_jpeg_data = fh.read()
image_filename_epub = "image_%04u_%02u.jpeg" % (page_idx, image_idx)
image_epub = epub.EpubImage()
image_epub.file_name = image_filename_epub
image_epub.media_type = "image/jpeg"
page_html += "<img src=\"%s\" alt=\"Image %u\"/>" % (image_filename_epub, images_found)
images_found += 1
images_on_page += 1
if words_on_page or images_on_page:
page_epub = epub.EpubHtml(title='Page %s' % page_idx,
file_name='page_%s.html' % page_idx,
href='style/style.css', rel='stylesheet', type='text/css'
# Apply some transformations to remove headings and page numbers
#for page_epub in pages_epub:
# print(page_epub.get_body_content())
# Add all the pages to the book
for page_epub in pages_epub:
self.has_text = words_found > 0
self.has_images = images_found > 0
# We don't have enough information to create TOC/chapters/sections yet
#book.toc = pages_epub = ['cover', 'nav', ] + pages_epub
epub.write_epub(self.epub_zip_file_path,, {})
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='hOCR to ePUB converter')
parser.add_argument('-f', '--infile', help='Item _hocr.html file',
type=str, default=None)
parser.add_argument('-o', '--outfile', help='Output _ebook.epub file',
type=str, default=None)
parser.add_argument('-m', '--metafile', help='Item _meta.xml file',
type=str, default=None)
parser.add_argument('-i', '--imagestack', help='Item file',
type=str, default=None)
parser.add_argument('-s', '--scandata', help='Item _scandata.xml file',
type=str, default=None)
parser.add_argument('-w', '--workingdir', help='Directory used for temp files',
type=str, default=None)
args = parser.parse_args()
if not args.infile:
raise Exception("Must provide hOCR input file with -f")
# Allow external caller to override working directory from default /tmp/ or /var/tmp/fast/
if args.workingdir:
WORKING_DIR = args.workingdir
EpubGenerator(args.infile, args.metafile, args.imagestack, args.scandata, args.outfile)
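
The dehyphenation step in `generate()` is hard to follow in the listing above because some lines were lost. As a reading aid, here is a minimal standalone sketch of the technique the comments describe: strip a trailing line-break hyphen, mark the position with a placeholder, join everything with spaces, then remove the placeholder together with its surrounding spaces. The function name `join_lines` and the input shape (a list of lines, each a list of word strings) are assumptions for illustration; only the hyphen characters and the `\x7f` placeholder come from the source.

```python
# ABBYY sometimes emits a negation sign where a line-break hyphen was printed
HYPHENS = ('-', '¬')
# Control character that should never occur in OCR text
PLACEHOLDER = '\x7f'


def join_lines(lines):
    """Join lines of words into one string, merging hyphen-split words.

    Hypothetical helper: `lines` is a list of lines, each a list of words.
    """
    tokens = []
    for words in lines:
        # Drop empty words so indexing the last character is safe
        line_tokens = [w.strip() for w in words if w.strip()]
        if line_tokens and line_tokens[-1][-1] in HYPHENS:
            # Remove the hyphen and remember where the line break was
            line_tokens[-1] = line_tokens[-1][:-1]
            line_tokens.append(PLACEHOLDER)
        tokens += line_tokens
    text = ' '.join(tokens)
    # Glue the two halves of each split word back together
    text = text.replace(' %s ' % PLACEHOLDER, '')
    # A hyphen on the very last line leaves a trailing placeholder
    return text.replace(' ' + PLACEHOLDER, '')
```

Note that this approach also merges words that are legitimately hyphenated at a line end, such as compound words, which may be why the commit log describes dehyphenation as something still being improved.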
@@ -107,7 +107,7 @@ def hocr_page_to_word_data(hocr_page, scaler=1):
-    A list of paragraph, each paragraph containing a list of lines, and each
+    A list of paragraphs, each paragraph containing a list of lines, and each
line containing a list of words, plus properties.
Paragraphs have the following attributes:
@@ -214,6 +214,59 @@ def hocr_page_to_word_data(hocr_page, scaler=1):
return paragraphs
def hocr_page_to_photo_data(hocr_page, minimum_page_area_pct=10):
Parses a single hocr_page into photo data.
* hocr_page: a single hocr_page as returned by hocr_page_iterator
* (optional) minimum_page_area_pct: the minimum percentage of the page area a photo must cover to be kept
A list of bounding boxes where photos were found
# Get the actual boxes from the page
photo_boxes = []
for photo in hocr_page.xpath('.//*[@class="ocr_photo"]'):
box =['title']).group(1).split()
box = [float(i) for i in box]
# Helper function to determine if there are nested boxes
def box_contains_box(box_a, box_b):
return box_a[0] <= box_b[0] and box_a[1] <= box_b[1] \
and box_a[2] >= box_b[2] and box_a[3] >= box_b[3]
# Clean up the box data a bit
cleaned_photo_boxes = list(photo_boxes)
dim = hocr_page_get_dimensions(hocr_page)
area_page = dim[0]*dim[1]
for box_a in photo_boxes:
# Image must cover at least minimum_page_area_pct of page
width, height = box_a[2]-box_a[0], box_a[3]-box_a[1]
area_box = width*height
if area_box < area_page*(minimum_page_area_pct/100.):