    Merge hocr-to-epub · f467c624
    Merlijn Wajer authored
    Thanks to Aram Verstegen, still work in progress.
    
    commit 6f6a91929eae49f7fb81813dd6eee2f8ead0e8d8
    Author: Merlijn Wajer <merlijn@wizzup.org>
    Date:   Thu Jan 20 18:07:26 2022 +0100
    
        hocr-to-epub: remove epub verify
    
        Depends on deprecated code
    
    commit c65a544bcdd18b616ee342ad5b5c370b3454c206
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Thu Nov 4 19:13:05 2021 +0100
    
        Don't abort for low confidence documents. Allow all file paths to be specified externally
    
    commit c8bd25fa3d672841bb8396458810eb1a09ca5e48
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Tue Sep 28 00:37:14 2021 +0200
    
        Trying to improve dehyphenation
    
    commit f603149b00081b4b21ffc6f50b4ed3f722d9d399
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Tue Sep 28 00:02:35 2021 +0200
    
        Use fast storage if available
    
    commit cd1fed48ff9d620da997b52abcef8c243310a864
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 27 23:22:16 2021 +0200
    
        Only add textual metadata tag when text is present
    
    commit 406e780aa069e56dc03811208e3b74931aadc7a4
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 27 23:19:36 2021 +0200
    
        Use WORKING_DIR constant for imagestack basenames
    
    commit ddf098c921cc15651055cb3bf3a13b1c9826cc20
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 27 23:15:21 2021 +0200
    
        Avoid divide by zero
    
    commit 7a99123b4db07bd9f52467c67f3eb0880b4b81e2
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 27 23:06:06 2021 +0200
    
        Added comments
    
    commit 9f57ed9b352c653e8468ac71fb35f582d299d205
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 27 23:03:44 2021 +0200
    
        Forgot a word
    
    commit 800daa28389e4c2e8adbf00d84351f6de23235d6
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 27 23:00:46 2021 +0200
    
        Keep the decoded temp files around to speed up cropping multiple images from one page. Keep track of all the temporary files and delete them in the destructor. Tried to improve stylesheet
    
    commit aa11017506896ce084d15a737c30af6d111436f8
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 27 21:05:05 2021 +0200
    
        Added jp2000 to TIFF conversion using kakadu
    
    commit 139e5fb3d29bd3a3006f438fed343eede451fe7c
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 27 20:11:05 2021 +0200
    
        Cleanup
    
    commit 0bf4cd8645fea43c76e8e8f0a5e988844063dbe0
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 27 20:08:16 2021 +0200
    
        Cleanup
    
    commit 07805eb01feb04da3926c3e896bf00e98c6ab20a
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 27 20:07:57 2021 +0200
    
        Show warning instead of omitting pages, try to clean up hyphenation
    
    commit 5f068bfd73de26f468bed5621c7605e5a90e4410
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 27 17:24:08 2021 +0200
    
        Removed shortcut for debugging
    
    commit eb649fc0918a13087cd5d6b35cb0a6f3df1d354c
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 27 17:23:39 2021 +0200
    
        Cleaned up accessibility summary spacing
    
    commit 9f9daea19ffb77f08eaf6de57d3959757faefbd2
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 27 17:21:57 2021 +0200
    
        Fixed usage of iso639 module
    
    commit 4d0d8e487b13c0b575d8c8696dc9f21550bca150
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 27 17:16:34 2021 +0200
    
        Fixed (accessibility) metadata
    
    commit ef9cd514f71d4fee22ac7207e0eef8ba8c9c39fe
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Tue Sep 7 12:21:39 2021 +0200
    
        Skipping pages based on scandata.xml info
    
    commit 7d62725ed6b9c2c373002110cb3778d2d04d2e2c
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Tue Sep 7 10:52:59 2021 +0200
    
        Adding the cropped images in the epub file
    
    commit a0d80ea9e641b96a448214b53eeb8f5316d962a5
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 6 22:51:13 2021 +0200
    
        Organisation and comments
    
    commit 94c6f5f1a34919f0d37992b5d96c6b1d9ca1ae0f
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 6 22:37:27 2021 +0200
    
        Comments and naming
    
    commit fdf363e41b500aa7f4693e41b2f6028532090d0e
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 6 21:38:51 2021 +0200
    
        Cleanup
    
    commit 48b363550fe7064daafd3a7c51430e8690967dcb
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 6 21:25:55 2021 +0200
    
        Make minimum_page_area_pct actually work as a percentage
    
    commit 3e3c53ad163f3de1e5a7d26fdb58cd5f13580d51
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 6 21:01:30 2021 +0200
    
        Fixed photo box cleaning logic
    
    commit 1394a947461274715687dd5010411940f30c7083
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 6 18:33:04 2021 +0200
    
        Starting with Image Stack (zip file) parsing
    
    commit 587328d5724f11b305ca065c9b0ff33091f7b505
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 6 18:32:39 2021 +0200
    
        Removed recursive requirement
    
    commit 4493c39c063b48323173f6947ad8748225ebf565
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Sep 6 18:32:22 2021 +0200
    
        Added hocr_page_to_photo_data function
    
    commit 4e98dbb1c53641bea647f977f9bddf4dfeb6ff90
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Aug 30 17:28:25 2021 +0200
    
        Take care of metadata provided as lists. Track average OCR word confidence scores
    
    commit c0d074a3833edde98834ee76085bee78751ffc4c
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Aug 23 19:49:36 2021 +0200
    
        Don't create useless TOCs
    
    commit 171507455f3c5202ed1f73a8d50ea7e054acf4cc
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Aug 23 19:17:49 2021 +0200
    
        Updated requirements.txt with pinned versions for dependencies
    
    commit 9107a7d6e124278235b44cf93c2b9d41aef7a326
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Mon Aug 23 18:56:07 2021 +0200
    
        Put code into a class. Added metadata parsing and first steps toward verification
    
    commit 4d8f5a0277257851ec534c5bbc678ee6fad64173
    Author: Aram Verstegen <aram@factorit.nl>
    Date:   Fri Jul 23 13:57:21 2021 +0200
    
        Initial PoC for hOCR to EPUB conversion