Aram Verstegen / archive-hocr-tools
Commit 5c15ef26, authored Nov 28, 2021 by Merlijn Wajer
hocr/parse: deal with em/strong in normal parse as well
parent 099aa0f2
hocr/parse.py
...
@@ -162,7 +162,22 @@ def hocr_page_to_word_data(hocr_page, scaler=1):
        wordbased = False

    if wordbased:
        rawtext = word.text
        wword = word
        # Words may contain additional nodes like <em>
        while True:
            children = wword.getchildren()
            if len(children) == 0:
                break
            if len(children) > 1:
                raise ValueError('Not character based but word has multiple children?')
            wword = children[0]
        rawtext = wword.text
        if wword.text is None:
            raise ValueError('Word with no text value?')
    box = BBOX_REGEX.search(word.attrib['title']).group(1).split()
    box = [float(i) for i in box]
...
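The descent through `<em>`/`<strong>` wrappers in this hunk can be tried in isolation. The following is a minimal sketch, not the project's code: it assumes the standard-library `xml.etree.ElementTree` instead of whatever parser `hocr/parse.py` uses, replaces `getchildren()` with `list(node)` (the former is not available in recent versions of the standard-library parser), and defines a hypothetical `BBOX_REGEX`, since the real pattern is not shown in the diff. The helper name `word_text_and_box` is also invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the project's BBOX_REGEX (the real pattern is
# not visible in this commit): captures the four numbers after "bbox".
BBOX_REGEX = re.compile(r'bbox((\s+\d+){4})')


def word_text_and_box(word):
    """Return (text, bbox) for a word element, following single-child
    wrappers such as <em> or <strong> down to the node holding the text."""
    node = word
    while True:
        children = list(node)  # assumption: used here instead of getchildren()
        if len(children) == 0:
            break
        if len(children) > 1:
            raise ValueError('Not character based but word has multiple children?')
        node = children[0]

    if node.text is None:
        raise ValueError('Word with no text value?')

    # The bounding box always comes from the outer word element's title.
    box = BBOX_REGEX.search(word.attrib['title']).group(1).split()
    return node.text, [float(i) for i in box]


# A plain word and the same word wrapped in <strong><em>...</em></strong>.
plain = ET.fromstring(
    '<span class="ocrx_word" title="bbox 10 20 110 60; x_wconf 95">hello</span>')
nested = ET.fromstring(
    '<span class="ocrx_word" title="bbox 10 20 110 60">'
    '<strong><em>hello</em></strong></span>')

print(word_text_and_box(plain))
print(word_text_and_box(nested))
```

Both calls should yield the same text and box, which is the point of the change: emphasis markup inside a word no longer hides its text from the word-based path.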