What's New

From Paper to Screen (Part 2): What is Segmentation? 

March 17th, 2025  

Fig. 1 – Segmentation of VTH Jer 1 

In our previous post (Part 1 of our From Paper to Screen series) we introduced the four-step workflow we use to produce a digital edition of Kennicott’s Vetus Testamentum Hebraicum: image acquisition, segmentation, automatic transcription, and conversion of the output into a machine-actionable format.

After all images have been imported, we move on to the segmentation process. In eScriptorium, segmentation is the process by which the page layout is analysed and the areas that contain text are identified. This is where the digitisation begins: each image is analysed by a machine-learning model trained specifically for this task. Segmentation takes place at two levels: lines and regions.

Fig. 2 – Lines and masks: reference text (Isa 1:1-11)

Fig. 3 – Lines and masks: critical apparatus (Isa 1:1-11)

At the first level, the lines of text (or segments) to be transcribed are identified. The program draws the lines (baselines or toplines) and then automatically calculates the masks, i.e. the pixel areas that enclose each line (the coloured boxes in figs. 2-3) and that eScriptorium will process for transcription.

Fig. 4 – Regions: reference text (Isa 1:1-11)

Fig. 5 – Regions: critical apparatus (Isa 1:1-11) 

At the second level, the program identifies regions. Regions are themselves pixel areas, each assigned a specific logical-semantic value within the image. In our case, segmentation by regions helps reproduce the logical structure of the text as laid out on the page: it allows us to distinguish and handle separately text and paratext — the critical apparatus in particular (figs. 4-5) — so as to generate coherent transcriptions that reflect the correct reading order.

eScriptorium allows these operations to be performed either manually or automatically. For automatic segmentation, machine-learning models can be trained to recognise regions and lines of text within the image. Training a model requires a sample of manually annotated images — called ground truth — from which the machine can “learn” to recognise the layout of other images by itself. Once the model has been trained, we run the automatic segmentation and manually verify its results. This procedure is particularly convenient for the Vetus Testamentum Hebraicum, as the text is quite long and highly homogeneous: starting from a relatively limited sample of images, it is possible to segment 1300+ pages automatically in far less time than the same operation would take manually.
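To make the two levels a bit more concrete, here is a minimal sketch of how a single page could be segmented programmatically with kraken, the engine that powers eScriptorium. This is not our production code: the file and model names are placeholders, and the exact shape of the result varies between kraken versions.

```python
# Illustrative sketch only: automatic baseline segmentation with kraken.
# File and model names are hypothetical placeholders.
from PIL import Image
from kraken import blla
from kraken.lib import vgsl

im = Image.open("vth_page_0001.jpg")                           # hypothetical page scan
seg_model = vgsl.TorchVGSLModel.load_model("vth_seg.mlmodel")  # hypothetical trained model

seg = blla.segment(im, model=seg_model)

# Depending on the kraken version, `seg` is either a dict or a Segmentation object;
# in both cases it exposes the detected lines (baseline + boundary polygon, i.e. the
# mask) and the labelled regions (e.g. main text vs. critical apparatus).
lines = seg.lines if hasattr(seg, "lines") else seg["lines"]
regions = seg.regions if hasattr(seg, "regions") else seg["regions"]

print(f"{len(lines)} lines detected")
for region_type, region_list in regions.items():
    print(f"region type {region_type!r}: {len(region_list)} region(s)")
```

In eScriptorium itself these steps are triggered from the web interface; the snippet only shows what happens under the hood.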

Fine-tuning the Segmentation Model 

Fig. 6 – Segmentation of Jer 1:1-17 through Kraken’s default model 

Fig. 7 – Segmentation of Jer 1:1-17 through a specifically-trained model 

eScriptorium offers a ready-to-use default segmentation model. However, the advantage of using a model trained on Kennicott’s VTH can be clearly observed in figs. 6-7. eScriptorium’s default model interprets the image as a continuum of indistinct text and therefore fails to recognise titles, main text, and apparatus. It correctly places the lines as well as the two-column layout, but fails to identify the text orientation (left-to-right for the apparatus; right-to-left for the reference text) and the reading order of the lines. As a result we are left with a completely unusable transcription. Most of these difficulties are solved by training the default model on the Vetus Testamentum Hebraicum (fig. 7).
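For readers who want to try something similar, kraken's training tool ketos can fine-tune the default baseline model on ground truth exported from eScriptorium (e.g. as ALTO/PAGE XML). The sketch below is purely indicative: the paths are placeholders and the available options differ between kraken versions, so check ketos segtrain --help before running it.

```python
# Hedged sketch: fine-tuning kraken's default segmentation model on manually
# annotated ground truth exported from eScriptorium. All paths are placeholders.
import glob
import subprocess

ground_truth = sorted(glob.glob("ground_truth/*.xml"))  # hypothetical ALTO/PAGE XML exports

subprocess.run(
    [
        "ketos", "segtrain",
        "-i", "blla.mlmodel",   # start from kraken's default baseline segmentation model
        "-o", "vth_seg",        # hypothetical name for the fine-tuned model
        "-f", "xml",            # ground-truth format (ALTO/PAGE XML)
        *ground_truth,
    ],
    check=True,
)
# Depending on the kraken version, further options (e.g. --resize) may be needed
# when the region and line types of the ground truth differ from the base model.
```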

Fig. 8 – Isaiah 1: a case of irregular layout

Fig. 9 – Job 3: a case of irregular layout 

The layout of the VTH is basically stable: two columns of main text and apparatus. There are, however, some exceptions that decrease the model’s predictive abilities. In poetic works (e.g. Isaiah in fig. 8) the text is distributed irregularly, with shorter lines of uneven length. Moreover, the placement of the verse numbers (on the far right of each line, separated by a blank space) makes it difficult for the model to identify them as part of the corresponding line. The alphabetic psalms, Proverbs and the poetic section of Job (fig. 9) present further peculiarities: the text is arranged distichously, with each couplet of hemistichs forming a single line split halfway by a blank space. The model mistakes this layout for a regular two-column one, thus creating two separate lines and disrupting the correct reading order.

After carefully verifying the output of the segmentation, we can proceed to generate the automatic transcription, which will be the subject of a later post.

From Paper to Screen: Creating a Digital Edition of the Vetus Testamentum Hebraicum

February 28th, 2025  

msIA eScriptorium's Homepage 

The main goals of DiKe — as clearly stated in the project description — are to produce a digital edition of Kennicott’s Vetus Testamentum Hebraicum cum variis lectionibus and to create a corpus of linguistically and philologically annotated variants.

This is the first of a series of posts presenting the workflow (i.e. the methods and tools) that we are using to achieve these goals. Each post will focus on one of the steps that make up the workflow. In this post we will provide a general introduction to the digitisation process and the OCR/HTR software we are using; then, we will concentrate on the operations of image acquisition (Step 1).


The Workflow

To create a digital edition that would allow scholars either to access specific points of data or to perform quantitative analyses on the impressive corpus of variants collected by Kennicott, we needed a solution offering the largest possible degree of flexibility. We opted for a methodology developed by Luigi Bambaci for the project Reverse Engineering Kennicott (REK), which yielded promising results in creating complete manuscript transcriptions from Kennicott’s apparatus and aligning them to the HTR output obtained from the same manuscripts in eScriptorium.

The workflow consists of 4 steps:

1. image acquisition;
2. segmentation (layout analysis);
3. automatic transcription;
4. conversion of the output into a machine-actionable format.

This workflow enables us to convert into digital format both the reference text of the Hebrew Bible against which Kennicott collated all other textual witnesses and the critical apparatus where he listed the variants (variae lectiones). Steps 1-3 enable us to “extract” the text from the physical pages and create a digital Doppelgänger, i.e. a .txt file containing the entire transcription of the source text. Step 4 takes the output of steps 1-3 and converts it into a machine-actionable format.

The “extraction” process happens through optical character recognition (OCR), which in recent years has become quite popular thanks to the “smartphone revolution”. We therefore needed OCR software able to (a) process a mixed text (Latin and Hebrew scripts; left-to-right and right-to-left writing) and (b) generate a complete transcription of the ca. 1400-page Vetus Testamentum Hebraicum in a reasonable amount of time, while reducing the need for manual intervention as much as possible.

eScriptorium: a Virtual Research Environment 

Fig. 1 – eScriptorium Logo 

In the last two decades the OCR/HTR software market has expanded considerably. A wide array of software solutions geared toward the needs of specialists in various fields has been developed: Transkribus, Aletheia, and Tesseract, to cite the most popular ones. We settled on eScriptorium, a virtual research environment for manuscript cultures developed by the École Pratique des Hautes Études, Université Paris Sciences et Lettres (EPHE – PSL).

eScriptorium provides the tools to produce a digital text from both handwritten (hence, Handwritten Text Recognition or HTR) and printed sources (OCR) using machine learning techniques. We chose eScriptorium over the other candidates because it is open source and meets both of the requirements outlined above: it handles mixed scripts and writing directions, and it can process a large number of pages with limited manual intervention.

eScriptorium is decentralised software: multiple instances are deployed on different servers, and access to each must be obtained from the server’s owners. Alternatively, thanks to its openness, eScriptorium can be deployed on any machine by anyone with a working knowledge of OCR systems.

Thanks to a collaboration with Daniel Stökl Ben Ezra and Luigi Bambaci, we were able to access the Paris eScriptorium instance msIA.

Image Acquisition

Fig. 2 – Importing images in eScriptorium 

As we have seen, the first step in the process is the acquisition of the source text images. eScriptorium allows images to be imported in three ways: from the local file system, from a PDF file, or via the IIIF framework.

Importing images from the local file system is as easy as dragging and dropping image files from a local folder or hard drive. Alternatively, a PDF file can be uploaded and the system will automatically extract the images. Lastly, eScriptorium supports the IIIF framework, which allows importing high-quality images — complete with their metadata — through the URL of a IIIF manifest (a JSON file that contains all the information pertaining to a IIIF object).
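As a small illustration of how self-describing a manifest is, the following sketch (assuming a IIIF Presentation API 2.x manifest; the URL is a placeholder) collects the image URLs listed in it.

```python
# Minimal sketch, assuming a IIIF Presentation API 2.x manifest; the URL is a placeholder.
import json
from urllib.request import urlopen

MANIFEST_URL = "https://example.org/iiif/manifest.json"  # placeholder, not a real manifest

with urlopen(MANIFEST_URL) as resp:
    manifest = json.load(resp)

# A 2.x manifest nests its images as sequences -> canvases -> images -> resource.
image_urls = [
    image["resource"]["@id"]
    for sequence in manifest.get("sequences", [])
    for canvas in sequence.get("canvases", [])
    for image in canvas.get("images", [])
]

print(f"{len(image_urls)} images listed in the manifest")
```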

Kennicott’s Vetus Testamentum Hebraicum is available in a variety of formats on the Internet Archive (vol. 1; vol. 2). High-resolution JPEG scans are required so that eScriptorium’s OCR engine, Kraken, can perform automatic layout analysis and transcription on each page. ZIP folders containing .jp2 files of both volumes can be downloaded from the Internet Archive and then uploaded to eScriptorium through its Import function (Fig. 2).
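If the JPEG 2000 files need to be converted to JPEG locally before upload, a few lines of Python are enough. The sketch below is only indicative: the archive name is a placeholder, and reading .jp2 files with Pillow requires a build with OpenJPEG support.

```python
# Hedged sketch: unpack an Internet Archive ZIP of .jp2 scans and convert the pages
# to high-quality JPEGs. File names are placeholders, not the actual archive names.
import zipfile
from pathlib import Path

from PIL import Image  # needs a Pillow build with OpenJPEG support for .jp2 files

archive = Path("kennicott_vol1_jp2.zip")  # placeholder for the downloaded ZIP
jp2_dir = Path("jp2_pages")
jpg_dir = Path("jpeg_pages")
jpg_dir.mkdir(exist_ok=True)

with zipfile.ZipFile(archive) as zf:
    zf.extractall(jp2_dir)

for jp2 in sorted(jp2_dir.rglob("*.jp2")):
    Image.open(jp2).convert("RGB").save(jpg_dir / (jp2.stem + ".jpg"), quality=95)
```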

After all images have been imported we can start the segmentation process (i.e. the layout analysis). This, however, will be the subject of a later post.

How did Kennicott Structure the Critical Apparatus? 

January 31st, 2025

B. Kennicott, Vetus Testamentum Hebraicum cum variis lectionibus, p. 4

Kennicott’s work provides a huge number of textual variants to the reference text. These variants, which were collected by comparing hundreds of manuscripts and printed editions of the Hebrew Bible, are presented in the critical apparatus.

The apparatus is located in the lower part of each page, below the reference text. It is introduced by the words Variae lectiones and laid out in two columns. Each variant is followed by the reference number of the manuscript or printed edition that attests it, and each set of variants is introduced by the number of the biblical verse it refers to.

In the Latin preface (praefatio) to his work (pp. I-IV), Kennicott illustrates, among other issues, the conventions he adopted to present and organise the textual variants. For example, a long dash separates the different forms of a single word, together with the corresponding manuscripts (fig. 1).

Fig. 1.  Apparatus entry to Gen 1:21

The symbol resembling an "S" rotated 90 degrees is used to signal that the order of some letters (fig. 2) or words (fig. 3) was inverted as compared to the reference text.

Fig. 2.  Apparatus entry to Gen 1:16

Fig. 3.  Apparatus entry to Gen 1:26

Similarly, the symbol ˰ indicates that a word is lacking in the witness (see fig. 1). Erased words or letters (rasurae) are indicated by a symbol resembling an asterisk (fig. 4).

Fig. 4.  Apparatus entry to Gen 1:3

In cases where a later scribe erased a word and wrote over it, Kennicott records the deleted word preceded by the Latin primo (“first” or “earlier”) if it is clearly legible; otherwise the word forte (“perhaps”) is used. Where the earlier word was not erased but a second hand wrote another word on top of it, Kennicott uses nunc (“now”) if the replacement is clearly legible, and videtur (“seemingly”) if it is not. This is one of the few cases in which Kennicott uses natural language instead of symbols in the apparatus. Indeed, unlike De Rossi’s edition of the Hebrew Bible, Kennicott’s critical apparatus rarely makes use of natural language and, more importantly, it is consistently organised and formalised. As we will detail in future posts, this makes the Vetus Testamentum Hebraicum particularly suitable for digitization through automatic processing systems.

Which Bible Did Kennicott Use? 

January 15th, 2025

Biblia Hebraica, ed. Everard van der Hooght,  1705, front page

Benjamin Kennicott chose as the reference text for his work the edition of the Hebrew Bible curated by Everard van der Hooght (1705), which was in turn based on the Rabbinic Bible printed by Daniel Bomberg in Venice in the early 16th century.

The first edition of the Rabbinic Bible was completed in 1517-18 under the editorial supervision of Felix Pratensis, a Jew who had converted to Christianity, and it included the Tanak (i.e. the Hebrew Bible), the Targum Onqelos and some medieval commentaries. The same structure was adopted in the second edition, printed in 1524-25, whose editor was Jacob ben Ḥayyim ben Isaac ibn Adoniyyah. This important kabbalist, talmudist and masoretic scholar was born in Tunis in ca. 1470, but had to flee to Italy because of anti-Jewish persecutions. In ca. 1520 he arrived in Venice, where he too converted to Christianity and started to collaborate with Daniel Bomberg. We do not know on which manuscripts he based the new edition of the Bible, though it is likely that he used Sephardic witnesses of the Ben-Asher Masoretic tradition. He decided to include the Massorah and other medieval commentaries, which became standard in all later editions of the Rabbinic Bible, known as Miqra’ot Gedolot from the 19th century on. The innovative intent behind the printing of this second edition is evident in the title that was given to it: Šaʽar YHWH ha-Ḥadaš, i.e. The New Gate of YHWH.

The 1524-25 edition by Daniel Bomberg and Jacob ben Ḥayyim became the text on which the first scholarly editions of the Biblia Hebraica were based (1906 and 1913). This led Rudolf Kittel to consider this edition as the textus receptus of the Hebrew Bible, although this term should be used with caution. In the third edition of the Biblia Hebraica, which was prepared between 1929 and 1937, Codex Leningradensis, the oldest complete witness of the Hebrew Bible (dated between 1008 and 1009), became the reference text.

However, the influence of the Bomberg-Ben Ḥayyim edition is evident in all later developments of the Rabbinic Bible and in the first tentative critical editions prepared in the 18th century, such as the Variae Lectiones Veteris Testamenti by Giovanni Bernardo De Rossi and Kennicott’s Vetus Testamentum Hebraicum.

Digitizing Kennicott's Vetus Testamentum: First Steps

October 2nd, 2024 

B. Kennicott, Vetus Testamentum Hebraicum cum variis lectionibus, p. 1

The first phase of the project was dedicated to the digitization of the two volumes of Kennicott's work. The PDF version of the Vetus Testamentum, accessible on Archive.org, was imported into eScriptorium, an open-source software for OCR (Optical Character Recognition) and HTR (Handwritten Text Recognition). This software is explicitly designed to recognize the text of printed works and manuscripts in their different layouts and scripts and to transcribe it automatically. eScriptorium, which is based on the software Kraken, is also able to work on texts with mixed scripts and different writing directions: in our case, left to right for Latin and right to left for Hebrew.


In order to produce a machine-readable version of the text, the software first needs to perform a segmentation of every single page, analysing its layout and its lines. The segmentation process works on two levels. First, the software recognizes the lines of text that will then be transcribed. Secondly, it groups them into different visual units, called regions, each one bearing a precise label, such as "header", "main text", "apparatus", "left column", "right column", etc. This is essential to obtain an automatic transcription of the text in its correct order.


The first months of our work were dedicated to the segmentation process. This required training machine learning models capable of working on the different layouts that Kennicott's work presents. 


After the optical analysis of the text and the manual correction of the errors produced by the segmentation process, the text is now being transcribed automatically into a workable format. This allows us to manually correct the (rare) errors of the transcription.
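For the curious, this is roughly what a single-page run looks like from kraken's command line (wrapped here in Python); the image and model names are hypothetical and the options may vary slightly between kraken versions.

```python
# Illustrative sketch: segmenting and transcribing a single page with the kraken CLI.
# The image and recognition model names are hypothetical placeholders.
import subprocess

subprocess.run(
    [
        "kraken",
        "-i", "vth_page_0001.jpg", "vth_page_0001.txt",  # input image / output transcription
        "segment", "-bl",                                # baseline layout analysis
        "ocr", "-m", "vth_rec.mlmodel",                  # hypothetical recognition model
    ],
    check=True,
)
```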