Commit 4279277a authored by Merlijn Wajer

Documentation updates

parent 38df5104
......@@ -6,8 +6,6 @@ Example deriver module
This module calls ``hocr-fold-chars`` from the ``archive-hocr-tools`` package,
and then gzips the content.
.. TODO: Annotate code some more
.. code-block:: python
#!/usr/bin/env python3
......@@ -27,19 +25,24 @@ and then gzips the content.
if __name__ == '__main__':
    # Log our module version
    logger.info('hocr-char-to-word module version %s', VERSION)
    # Read task.json
    info = get_task_info()
    # Item identifier
    identifier = info['identifier']
    # Read _meta.xml
    metadata = load_item_metadata(identifier)
    # sourceFile does not necessarily have to match the item identifier plus
    # a suffix, and can also be in a directory.
    source_file = info['sourceFile']
    target_file = info['targetFile']
    # Let's state our intentions
    logger.info('sourceFile: \'%s\' -> targetFile \'%s\'',
                source_file, target_file)
......@@ -68,4 +71,6 @@ and then gzips the content.
    # Write changes, if any.
    write_item_metadata(identifier, metadata)
    # TODO: Write module version to targetFile file metadata
......@@ -67,14 +67,12 @@ Useful keys:
Quickstart
----------
Check out the "Example deriver module" for a simple deriver module that uses
most of the functionality exposed by this library.
The subsections below will get you started with a simple module.
Additionally, the (internal) ``www/tesseract`` and ``www/pdf`` git repositories
might also be a good reference.
There is also this example repository, which is not python-specific:
https://git.archive.org/www/serverless
Also check out the "Example deriver module" for a simple deriver module that
uses most of the functionality exposed by this library. See
`modules using derivermodule`_ for more examples, and
https://git.archive.org/www/serverless for more (non-Python-specific)
documentation.
Build your module with Docker
......@@ -103,7 +101,8 @@ From the root directory of your project, run these steps:
CMD python3 main.py
2. Build a wheel from `derivermodule` using::
2. Build a wheel from ``derivermodule`` (from the directory of
``python-derivermodule``) using::
python3 setup.py bdist_wheel
......@@ -112,25 +111,38 @@ From the root directory of your project, run these steps:
mkdir docker-bin
cp -v /path/to/derivermodule/dist/*.whl docker-bin
4. Create ``main.py``
4. Create ``main.py``, the starting point of your module
5. Build the container::
5. Build the container (pick a name other than ``example-container``)::
sudo docker build -t example-container .
Running your module with Docker
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From the root directory of your project, run these steps:
First, create a directory to store the test items that you would like to test
with. Do not store the items in the directory of your Docker module: Docker
reads **all** files in your project directory when building a module (even
though it won't necessarily include them in the final artifact), so having lots
of large files around slows down builds.
For each item that you would like to test your module on, you will have to
create a ``task`` and ``item`` directory. The ``task`` directory has to
contain the ``task.json`` and ``petabox.json`` files, and the ``item``
directory has to contain all the (relevant) item files (at least
``<identifier>_meta.xml``, ``<identifier>_files.xml``, and your ``sourceFile``
and ``targetFile``).
1. ``mkdir -p items/my_identifier/{task,item}``
Step by step, from the root directory of your project:
1. ``mkdir -p ../test-items/my_identifier/{task,item}``
2. Place ``my_identifier_meta.xml`` and any other required files in
``items/my_identifier/item``.
``../test-items/my_identifier/item``.
3. Create ``items/my_identifier/task/task.json`` with something like this::
3. Create ``../test-items/my_identifier/task/task.json`` with something like this::
cat > items/my_identifier/task/task.json
cat > ../test-items/my_identifier/task/task.json
{
"identifier" : "my_identifier",
"sourceFile" : "/item/my_identifier_chocr.html.gz",
......@@ -142,15 +154,67 @@ From the root directory of your project, run these steps:
}
}
4. Run the container::
4. Run the container (make sure to swap out ``example-container`` for the name
you picked)::
sudo docker run -v `pwd`/../test-items/my_identifier/task:/task -v `pwd`/../test-items/my_identifier/item:/item -i -t example-container
5. Wait for the module to finish.
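The directory layout and ``task.json`` from the steps above could also be created with a short script. This is only a sketch: ``create_test_item`` is a hypothetical helper (it is not part of ``derivermodule``), and it does not create ``petabox.json`` or copy any item files, which you still have to provide yourself.

```python
import json
from pathlib import Path


def create_test_item(root, identifier, task_info):
    """Create <root>/<identifier>/{task,item} and write task/task.json.

    Hypothetical helper for local testing; not part of derivermodule.
    Returns the (task_dir, item_dir) paths.
    """
    base = Path(root) / identifier
    task_dir = base / 'task'
    item_dir = base / 'item'

    # Equivalent to: mkdir -p <root>/<identifier>/{task,item}
    task_dir.mkdir(parents=True, exist_ok=True)
    item_dir.mkdir(parents=True, exist_ok=True)

    # Write the task description that the module reads at startup.
    with open(task_dir / 'task.json', 'w') as fp:
        json.dump(task_info, fp, indent=4)

    return task_dir, item_dir
```

It could be called as ``create_test_item('../test-items', 'my_identifier', info)``, where ``info`` is a dictionary holding at least the ``identifier``, ``sourceFile`` and ``targetFile`` keys, plus whatever other keys your module expects in ``task.json``.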
If you need to make changes, you usually have to rebuild your module. To avoid
a rebuild while testing, you can map (override) specific files from your
directory by passing arguments like this to ``docker run``::
-v `pwd`/main.py:/app/main.py
This will map your new ``main.py`` to the container's ``/app/main.py``.
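If you run the container often, assembling the ``docker run`` arguments in a small script can save some typing. The following is a sketch under a few assumptions: ``docker_run_command`` is a hypothetical helper, and it assumes the ``task`` and ``item`` directories live next to each other as described above. It only uses the ``-v``, ``-i`` and ``-t`` options already shown in this section.

```python
import os


def docker_run_command(item_root, container, overrides=None, use_sudo=True):
    """Build the argument list for ``docker run`` with /task and /item mapped.

    item_root -- directory containing the ``task`` and ``item`` directories
    container -- name of the container image to run
    overrides -- optional dict mapping local files to paths in the container,
                 e.g. {'main.py': '/app/main.py'}
    """
    item_root = os.path.abspath(item_root)
    cmd = ['sudo'] if use_sudo else []
    cmd += ['docker', 'run',
            '-v', '%s:/task' % os.path.join(item_root, 'task'),
            '-v', '%s:/item' % os.path.join(item_root, 'item')]

    # Map (override) individual files, so no rebuild is needed to test them.
    for local_path, container_path in (overrides or {}).items():
        cmd += ['-v', '%s:%s' % (os.path.abspath(local_path), container_path)]

    cmd += ['-i', '-t', container]
    return cmd
```

The resulting list can be passed to ``subprocess.run(cmd, check=True)``; for example, ``docker_run_command('../test-items/my_identifier', 'example-container', {'main.py': '/app/main.py'})`` reproduces the command from step 4 with the ``main.py`` override added.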
General Deriver Module Guidelines
---------------------------------
When writing a deriver module, it's important to keep the following things in
mind:
* Ensure that your module returns a non-zero exit code upon fatal errors. This
is done with ``sys.exit(<any positive non-zero number>)``, or when a
python exception is raised and not caught. Doing so will cause the derive to
'red row', marking the process as "needs administration attention", which in
turn allows for you or someone else to find the problem and analyse it.
* When starting out, it's better to hard-fail rather than silently ignore
errors, and deal with any potential red-rows later on.
* When failing, try to make one of the last lines of your program be clear and
unique, so that the red row analyser can find and classify your red rows:
* Don't::
>>> print('Something went wrong! :-('); sys.exit(1)
* Do::
sudo docker run -v `pwd`/items/my_identifier/task:/task -v `pwd`/items/my_identifier/item:/item -i -t example-container
>>> print('FATAL: Invalid wibble marker for this wobble, exiting'); sys.exit(1)
* Version your module, and increase the module version when it makes sense.
Log the version to the task log at least; also consider writing the module
version to the file metadata of the file(s) you create (e.g. to the ``targetFile``).
You can also write the module version information to the item metadata, but
consult with the collections team before doing so.
* Consider if your module should support task arguments (see
``derivermodule.task.get_task_arg_or_environ_arg``).
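As an illustration of the exit code, error message and versioning guidelines above, a module could funnel all fatal errors through one small function. This is a minimal sketch, not part of ``derivermodule``: ``fatal``, ``log_version``, ``MODULE_NAME`` and ``VERSION`` are hypothetical names, and the standard ``logging`` module stands in for whatever logger your module uses.

```python
import logging
import sys

# Hypothetical module name and version, for illustration only.
MODULE_NAME = 'example-module'
VERSION = '1.0.0'

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(MODULE_NAME)


def log_version():
    """Log the module version, so that it ends up in the task log."""
    logger.info('%s module version %s', MODULE_NAME, VERSION)


def fatal(message, exit_code=1):
    """Print a clear, unique last line and exit with a non-zero code."""
    if exit_code <= 0:
        raise ValueError('exit_code must be a positive, non-zero number')
    # Keep this the last line of output, so that it is easy to match on.
    print('FATAL: %s, exiting' % message, flush=True)
    sys.exit(exit_code)
```

A call such as ``fatal('Invalid wibble marker for this wobble')`` then prints the same line as the "Do" example above and exits with code 1.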
Modules using derivermodule
---------------------------
.. TODO: Add some remarks on clear error messages, ones that can be matched for
red row classification, etc
The following deriver modules make use of this library:
* https://git.archive.org/www/tesseract
* https://git.archive.org/www/pdf
* https://git.archive.org/www/microfilm-issue-generator
* https://git.archive.org/www/hocr-char-to-word
* https://git.archive.org/www/hocr-fts-text
* https://git.archive.org/www/hocr-pageindex
Components
----------
......