  1. 05 May, 2022 1 commit
    • pdf-to-hocr: fix ocr_line bug and add scaler · a7d074de
      Merlijn Wajer authored
      This commit fixes a bug where ocr_line elements had no title attribute
      for some PDFs (such as those created by OCRmyPDF).
      
      It also makes the PDF Metadata JSON a required input for generating
      hOCR files; the information it contains is used to estimate the DPI
      and to scale the hOCR coordinates.
  2. 16 Apr, 2022 1 commit
  3. 17 Feb, 2022 2 commits
  4. 14 Feb, 2022 3 commits
  5. 07 Feb, 2022 1 commit
  6. 22 Jan, 2022 3 commits
    • version: increase to 1.1.15 · 05c1e03d
      Merlijn Wajer authored
    • setup: make ebooklib optional · c75f179b
      Merlijn Wajer authored
    • Merge hocr-to-epub · f467c624
      Merlijn Wajer authored
      Thanks to Aram Verstegen; still a work in progress.
      
      commit 6f6a91929eae49f7fb81813dd6eee2f8ead0e8d8
      Author: Merlijn Wajer <merlijn@wizzup.org>
      Date:   Thu Jan 20 18:07:26 2022 +0100
      
          hocr-to-epub: remove epub verify
      
          Depends on deprecated code
      
      commit c65a544bcdd18b616ee342ad5b5c370b3454c206
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Thu Nov 4 19:13:05 2021 +0100
      
          Don't abort for low confidence documents. Allow all file paths to be specified externally
      
      commit c8bd25fa3d672841bb8396458810eb1a09ca5e48
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Tue Sep 28 00:37:14 2021 +0200
      
          Trying to improve dehyphenation
      
      commit f603149b00081b4b21ffc6f50b4ed3f722d9d399
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Tue Sep 28 00:02:35 2021 +0200
      
          Use fast storage if available
      
      commit cd1fed48ff9d620da997b52abcef8c243310a864
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 27 23:22:16 2021 +0200
      
          Only add textual metadata tag when text is present
      
      commit 406e780aa069e56dc03811208e3b74931aadc7a4
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 27 23:19:36 2021 +0200
      
          Use WORKING_DIR constant for imagestack basenames
      
      commit ddf098c921cc15651055cb3bf3a13b1c9826cc20
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 27 23:15:21 2021 +0200
      
          Avoid divide by zero
      
      commit 7a99123b4db07bd9f52467c67f3eb0880b4b81e2
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 27 23:06:06 2021 +0200
      
          Added comments
      
      commit 9f57ed9b352c653e8468ac71fb35f582d299d205
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 27 23:03:44 2021 +0200
      
          Forgot a word
      
      commit 800daa28389e4c2e8adbf00d84351f6de23235d6
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 27 23:00:46 2021 +0200
      
          Keep the decoded temp files around to speed up cropping multiple images from one page. Keep track of all the temporary files and delete them in the destructor. Tried to improve stylesheet
      
      commit aa11017506896ce084d15a737c30af6d111436f8
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 27 21:05:05 2021 +0200
      
          Added jp2000 to TIFF conversion using kakadu
      
      commit 139e5fb3d29bd3a3006f438fed343eede451fe7c
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 27 20:11:05 2021 +0200
      
          Cleanup
      
      commit 0bf4cd8645fea43c76e8e8f0a5e988844063dbe0
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 27 20:08:16 2021 +0200
      
          Cleanup
      
      commit 07805eb01feb04da3926c3e896bf00e98c6ab20a
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 27 20:07:57 2021 +0200
      
          Show warning instead of omitting pages, try to clean up hyphenation
      
      commit 5f068bfd73de26f468bed5621c7605e5a90e4410
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 27 17:24:08 2021 +0200
      
          Removed shortcut for debugging
      
      commit eb649fc0918a13087cd5d6b35cb0a6f3df1d354c
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 27 17:23:39 2021 +0200
      
          Cleaned up accessibility summary spacing
      
      commit 9f9daea19ffb77f08eaf6de57d3959757faefbd2
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 27 17:21:57 2021 +0200
      
          Fixed usage of iso639 module
      
      commit 4d0d8e487b13c0b575d8c8696dc9f21550bca150
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 27 17:16:34 2021 +0200
      
          Fixed (accessibility) metadata
      
      commit ef9cd514f71d4fee22ac7207e0eef8ba8c9c39fe
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Tue Sep 7 12:21:39 2021 +0200
      
          Skipping pages based on scandata.xml info
      
      commit 7d62725ed6b9c2c373002110cb3778d2d04d2e2c
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Tue Sep 7 10:52:59 2021 +0200
      
          Adding the cropped images in the epub file
      
      commit a0d80ea9e641b96a448214b53eeb8f5316d962a5
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 6 22:51:13 2021 +0200
      
          Organisation and comments
      
      commit 94c6f5f1a34919f0d37992b5d96c6b1d9ca1ae0f
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 6 22:37:27 2021 +0200
      
          Comments and naming
      
      commit fdf363e41b500aa7f4693e41b2f6028532090d0e
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 6 21:38:51 2021 +0200
      
          Cleanup
      
      commit 48b363550fe7064daafd3a7c51430e8690967dcb
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 6 21:25:55 2021 +0200
      
          Make minimum_page_area_pct actually work as a percentage
      
      commit 3e3c53ad163f3de1e5a7d26fdb58cd5f13580d51
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 6 21:01:30 2021 +0200
      
          Fixed photo box cleaning logic
      
      commit 1394a947461274715687dd5010411940f30c7083
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 6 18:33:04 2021 +0200
      
          Starting with Image Stack (zip file) parsing
      
      commit 587328d5724f11b305ca065c9b0ff33091f7b505
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 6 18:32:39 2021 +0200
      
          Removed recursive requirement
      
      commit 4493c39c063b48323173f6947ad8748225ebf565
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Sep 6 18:32:22 2021 +0200
      
          Added hocr_page_to_photo_data function
      
      commit 4e98dbb1c53641bea647f977f9bddf4dfeb6ff90
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Aug 30 17:28:25 2021 +0200
      
          Take care of metadata provided as lists. Track average OCR word confidence scores
      
      commit c0d074a3833edde98834ee76085bee78751ffc4c
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Aug 23 19:49:36 2021 +0200
      
          Don't create useless TOCs
      
      commit 171507455f3c5202ed1f73a8d50ea7e054acf4cc
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Aug 23 19:17:49 2021 +0200
      
          Updated requirements.txt with pinned versions for dependencies
      
      commit 9107a7d6e124278235b44cf93c2b9d41aef7a326
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Mon Aug 23 18:56:07 2021 +0200
      
          Put code into a class. Added metadata parsing and first steps toward verification
      
      commit 4d8f5a0277257851ec534c5bbc678ee6fad64173
      Author: Aram Verstegen <aram@factorit.nl>
      Date:   Fri Jul 23 13:57:21 2021 +0200
      
          Initial PoC for hOCR to EPUB conversion
  7. 20 Jan, 2022 1 commit
  8. 13 Jan, 2022 1 commit
  9. 08 Jan, 2022 2 commits
  10. 04 Jan, 2022 1 commit
  11. 04 Dec, 2021 3 commits
  12. 29 Nov, 2021 1 commit
  13. 28 Nov, 2021 4 commits
  14. 15 Oct, 2021 2 commits
  15. 12 Oct, 2021 1 commit
  16. 11 Oct, 2021 1 commit
  17. 10 Oct, 2021 1 commit
    • hocr/fts: More clean and robust matching · 37ea5e0d
      Merlijn Wajer authored
      The matching code is now clearer and more robust against Elasticsearch
      highlighting in unexpected ways (across paragraphs), which caused line
      matching to break; lines are entire paragraphs in our case.
      
      Multi-word matches should also be more accurate.
  18. 08 Oct, 2021 1 commit
  19. 07 Oct, 2021 1 commit
  20. 05 Oct, 2021 3 commits
  21. 04 Oct, 2021 2 commits
  22. 27 Sep, 2021 3 commits
  23. 06 Aug, 2021 1 commit