Merlijn Wajer / archive-hocr-tools
Commit 14896264, authored Nov 28, 2021 by Merlijn Wajer
hocr/parse: deal with em, strong in word elements
Parent: 9e505a69
Changes: 1 file
hocr/parse.py
@@ -269,8 +269,22 @@ def hocr_page_to_word_data_fast(hocr_page):
                    has_ocrx_cinfo = 2

            if wordbased:
                # Words may contains additional nodes like <em>
                while True:
                    children = word.getchildren()
                    if len(children) == 0:
                        break
                    if len(children) > 1:
                        raise ValueError('Not character based but word has multiple children?')
                    word = children[0]

                rawtext = word.text
                if word.text is None:
                    raise ValueError('Word with no text value?')

                word_data.append({'bbox': box, 'text': rawtext, 'confidence': conf})
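The descent loop in this hunk can be tried in isolation. The following is a minimal sketch, not the project's code: it assumes the standard library's xml.etree.ElementTree in place of the parser used in hocr/parse.py, so `list(node)` stands in for `word.getchildren()`, and the helper name `innermost_text` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def innermost_text(word):
    """Follow single-child wrappers such as <em> or <strong> down to the text."""
    node = word
    while True:
        # Assumption: list(node) replaces getchildren() from the original hunk.
        children = list(node)
        if len(children) == 0:
            break
        if len(children) > 1:
            raise ValueError('Not character based but word has multiple children?')
        node = children[0]
    if node.text is None:
        raise ValueError('Word with no text value?')
    return node.text


# Hypothetical word elements: one plain, one wrapped in <strong><em>.
plain = ET.fromstring('<span class="ocrx_word">hello</span>')
nested = ET.fromstring('<span class="ocrx_word"><strong><em>hello</em></strong></span>')
print(innermost_text(plain))
print(innermost_text(nested))
```

Both calls return the same string, which is the point of the change: formatting wrappers no longer hide the word text. Note that only the text of the innermost element is returned, so any text placed directly in an outer element before a child would be ignored by this approach.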