Commit f467c624 authored by Merlijn Wajer

Merge hocr-to-epub

Thanks to Aram Verstegen, still work in progress.

commit 6f6a91929eae49f7fb81813dd6eee2f8ead0e8d8
Author: Merlijn Wajer <merlijn@wizzup.org>
Date:   Thu Jan 20 18:07:26 2022 +0100

    hocr-to-epub: remove epub verify

    Depends on deprecated code

commit c65a544bcdd18b616ee342ad5b5c370b3454c206
Author: Aram Verstegen <aram@factorit.nl>
Date:   Thu Nov 4 19:13:05 2021 +0100

    Don't abort for low confidence documents. Allow all file paths to be specified externally

commit c8bd25fa3d672841bb8396458810eb1a09ca5e48
Author: Aram Verstegen <aram@factorit.nl>
Date:   Tue Sep 28 00:37:14 2021 +0200

    Trying to improve dehyphenation

commit f603149b00081b4b21ffc6f50b4ed3f722d9d399
Author: Aram Verstegen <aram@factorit.nl>
Date:   Tue Sep 28 00:02:35 2021 +0200

    Use fast storage if available

commit cd1fed48ff9d620da997b52abcef8c243310a864
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 27 23:22:16 2021 +0200

    Only add textual metadata tag when text is present

commit 406e780aa069e56dc03811208e3b74931aadc7a4
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 27 23:19:36 2021 +0200

    Use WORKING_DIR constant for imagestack basenames

commit ddf098c921cc15651055cb3bf3a13b1c9826cc20
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 27 23:15:21 2021 +0200

    Avoid divide by zero

commit 7a99123b4db07bd9f52467c67f3eb0880b4b81e2
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 27 23:06:06 2021 +0200

    Added comments

commit 9f57ed9b352c653e8468ac71fb35f582d299d205
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 27 23:03:44 2021 +0200

    Forgot a word

commit 800daa28389e4c2e8adbf00d84351f6de23235d6
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 27 23:00:46 2021 +0200

    Keep the decoded temp files around to speed up cropping multiple images from one page. Keep track of all the temporary files and delete them in the destructor. Tried to improve stylesheet

commit aa11017506896ce084d15a737c30af6d111436f8
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 27 21:05:05 2021 +0200

    Added jp2000 to TIFF conversion using kakadu

commit 139e5fb3d29bd3a3006f438fed343eede451fe7c
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 27 20:11:05 2021 +0200

    Cleanup

commit 0bf4cd8645fea43c76e8e8f0a5e988844063dbe0
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 27 20:08:16 2021 +0200

    Cleanup

commit 07805eb01feb04da3926c3e896bf00e98c6ab20a
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 27 20:07:57 2021 +0200

    Show warning instead of omitting pages, try to clean up hyphenation

commit 5f068bfd73de26f468bed5621c7605e5a90e4410
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 27 17:24:08 2021 +0200

    Removed shortcut for debugging

commit eb649fc0918a13087cd5d6b35cb0a6f3df1d354c
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 27 17:23:39 2021 +0200

    Cleaned up accessibility summary spacing

commit 9f9daea19ffb77f08eaf6de57d3959757faefbd2
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 27 17:21:57 2021 +0200

    Fixed usage of iso639 module

commit 4d0d8e487b13c0b575d8c8696dc9f21550bca150
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 27 17:16:34 2021 +0200

    Fixed (accessibility) metadata

commit ef9cd514f71d4fee22ac7207e0eef8ba8c9c39fe
Author: Aram Verstegen <aram@factorit.nl>
Date:   Tue Sep 7 12:21:39 2021 +0200

    Skipping pages based on scandata.xml info

commit 7d62725ed6b9c2c373002110cb3778d2d04d2e2c
Author: Aram Verstegen <aram@factorit.nl>
Date:   Tue Sep 7 10:52:59 2021 +0200

    Adding the cropped images in the epub file

commit a0d80ea9e641b96a448214b53eeb8f5316d962a5
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 6 22:51:13 2021 +0200

    Organisation and comments

commit 94c6f5f1a34919f0d37992b5d96c6b1d9ca1ae0f
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 6 22:37:27 2021 +0200

    Comments and naming

commit fdf363e41b500aa7f4693e41b2f6028532090d0e
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 6 21:38:51 2021 +0200

    Cleanup

commit 48b363550fe7064daafd3a7c51430e8690967dcb
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 6 21:25:55 2021 +0200

    Make minimum_page_area_pct actually work as a percentage

commit 3e3c53ad163f3de1e5a7d26fdb58cd5f13580d51
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 6 21:01:30 2021 +0200

    Fixed photo box cleaning logic

commit 1394a947461274715687dd5010411940f30c7083
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 6 18:33:04 2021 +0200

    Starting with Image Stack (zip file) parsing

commit 587328d5724f11b305ca065c9b0ff33091f7b505
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 6 18:32:39 2021 +0200

    Removed recursive requirement

commit 4493c39c063b48323173f6947ad8748225ebf565
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Sep 6 18:32:22 2021 +0200

    Added hocr_page_to_photo_data function

commit 4e98dbb1c53641bea647f977f9bddf4dfeb6ff90
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Aug 30 17:28:25 2021 +0200

    Take care of metadata provided as lists. Track average OCR word confidence scores

commit c0d074a3833edde98834ee76085bee78751ffc4c
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Aug 23 19:49:36 2021 +0200

    Don't create useless TOCs

commit 171507455f3c5202ed1f73a8d50ea7e054acf4cc
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Aug 23 19:17:49 2021 +0200

    Updated requirements.txt with pinned versions for dependencies

commit 9107a7d6e124278235b44cf93c2b9d41aef7a326
Author: Aram Verstegen <aram@factorit.nl>
Date:   Mon Aug 23 18:56:07 2021 +0200

    Put code into a class. Added metadata parsing and first steps toward verification

commit 4d8f5a0277257851ec534c5bbc678ee6fad64173
Author: Aram Verstegen <aram@factorit.nl>
Date:   Fri Jul 23 13:57:21 2021 +0200

    Initial PoC for hOCR to EPUB conversion
parent 504032b9
@@ -107,7 +107,7 @@ def hocr_page_to_word_data(hocr_page, scaler=1):
     Returns:
-    A list of paragraph, each paragraph containing a list of lines, and each
+    A list of paragraphs, each paragraph containing a list of lines, and each
     line containing a list of words, plus properties.
     Paragraphs have the following attributes:
@@ -214,6 +214,59 @@ def hocr_page_to_word_data(hocr_page, scaler=1):
     return paragraphs
 
 
+def hocr_page_to_photo_data(hocr_page, minimum_page_area_pct=10):
+    """
+    Parses a single hocr_page into photo data.
+
+    Args:
+
+    * hocr_page: a single hocr_page as returned by hocr_page_iterator
+    * (optional) minimum_page_area_pct: a minimum percentage of the page area the picture should inhabit
+
+    Returns:
+
+    A list of bounding boxes where photos were found
+    """
+    # Get the actual boxes from the page
+    photo_boxes = []
+    for photo in hocr_page.xpath('.//*[@class="ocr_photo"]'):
+        box = BBOX_REGEX.search(photo.attrib['title']).group(1).split()
+        box = [float(i) for i in box]
+        photo_boxes.append(box)
+
+    # Helper function to determine if there are nested boxes
+    def box_contains_box(box_a, box_b):
+        return box_a[0] <= box_b[0] and box_a[1] <= box_b[1] \
+           and box_a[2] >= box_b[2] and box_a[3] >= box_b[3]
+
+    # Clean up the box data a bit
+    cleaned_photo_boxes = list(photo_boxes)
+    dim = hocr_page_get_dimensions(hocr_page)
+    area_page = dim[0]*dim[1]
+    for box_a in photo_boxes:
+        # Image must cover at least minimum_page_area_pct of page
+        width, height = box_a[2]-box_a[0], box_a[3]-box_a[1]
+        area_box = width*height
+        if area_box < area_page*(minimum_page_area_pct/100.):
+            try:
+                cleaned_photo_boxes.remove(box_a)
+                #print("Box %s is too small, removing" % (box_a))
+            except: # Already removed
+                pass
+
+        # Nested boxes are redundant
+        for box_b in photo_boxes:
+            if box_a == box_b:
+                continue
+            if box_contains_box(box_a, box_b):
+                try:
+                    cleaned_photo_boxes.remove(box_b)
+                    #print("Box %s is fully inside box %s, removing" % (box_b, box_a))
+                except: # Already removed
+                    pass
+    return cleaned_photo_boxes
+
+
 def get_title_attrs(title):
     # Assume Tesseract generated hOCR, where every ';' has a space after it