We live in an imperfect world. So let’s hope for the best and prepare for the worst.
The Apache Unstructured Information Management Architecture (UIMA) is a battle-proven framework for building applications to analyze large volumes of unstructured information (e.g., natural language documents). While UIMA can operate on non-text artifacts, we will limit our focus to just natural language documents. According to the project’s website (link):
UIMA enables applications to be decomposed into components, for example "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names, etc.)". Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.
In other words, UIMA is a processing framework that defines interfaces allowing custom Analysis Engines (AEs) to run together as a pipeline. An AE is a UIMA component that analyzes and annotates a Cas/JCas. The Common Analysis Structure (Cas), and its Java counterpart (JCas), is an object that gives UIMA components a common representation of, and mechanism for accessing, the scrutinized artifact and the current analysis results. Put more simply, a Cas or JCas is a data object that represents a document and serves as the repository for annotations created by the AEs that process it. An AE in a UIMA processing pipeline ingests a Cas, adds its annotations, and emits the Cas, which then goes to the next AE, and the process repeats. The AEs thus build on one another: the pipeline is the sum of its parts.
An annotation is a piece of metadata produced by an AE from its analysis of the Cas’s source text and currently held annotations. For example, an AE may have the task of finding number words (e.g., one, two, and so on) and annotating them with their numerical representations. So, the AE creates an annotation for each number word it finds. The terms Cas and JCas will be used interchangeably to refer to the same concept.
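As a rough sketch of that number-word example in plain Java (this is not the actual UIMA API — a real AE would extend UIMA’s `JCasAnnotator_ImplBase` and add `Annotation` instances to the Cas’s indexes; the class names and tiny lexicon here are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a UIMA annotation: a span over the
// document text plus the metadata the AE derived from it.
class NumberAnnotation {
    final int begin, end, value;
    NumberAnnotation(int begin, int end, int value) {
        this.begin = begin; this.end = end; this.value = value;
    }
}

public class NumberWordAnnotator {
    // Hypothetical lexicon; a real AE would use a fuller resource.
    private static final Map<String, Integer> LEXICON =
        Map.of("one", 1, "two", 2, "three", 3);

    // Scan the document text and emit one annotation per number word,
    // mirroring how an AE adds annotations to the Cas it receives.
    public static List<NumberAnnotation> process(String text) {
        List<NumberAnnotation> annotations = new ArrayList<>();
        int pos = 0;
        for (String token : text.split("\\s+")) {
            int begin = text.indexOf(token, pos);
            String key = token.toLowerCase().replaceAll("\\W", "");
            Integer value = LEXICON.get(key);
            if (value != null) {
                annotations.add(
                    new NumberAnnotation(begin, begin + token.length(), value));
            }
            pos = begin + token.length();
        }
        return annotations;
    }
}
```

In real UIMA, a downstream AE would find these annotations by querying the Cas indexes rather than receiving a list directly; the point is only that each AE leaves its results attached to the document for later components.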
A Collection Reader (CR) creates Cases. A CR is a UIMA component that ingests the artifacts to be analyzed (e.g., it reads the documents off the local disk) and creates a Cas for each one. The CR sends each Cas to a designated pipeline for processing. Once the pipeline has finished processing a Cas, it informs the CR.
I came across UIMA in 2012 and have used it on many projects. Checkpointing was a feature I found myself wanting repeatedly.
Checkpointing is a mechanism for storing the state of a computation so that it can be retrieved later and resumed. UIMA does not have checkpointing. Nothing persists a Cas between its creation and the end of its processing. If a Cas fails at the last AE, we lose the time spent by all the previous AEs. That is inefficient. UIMA’s design encourages the creation of relatively simple AEs that can be combined into complex processing pipelines. The lack of checkpointing limits error recovery and undercuts this design.
So what are some solutions? The most obvious one is to create AEs that persist the Cases they receive. The drawback is that purpose-built CRs are then needed to reconstitute those Cases and send them to custom pipelines, made up of the remaining AEs, to continue processing. An additional mechanism is also needed to clean up checkpointed Cases that go on to be successfully processed. Otherwise, there would be duplication of processing effort.
Another possible solution is to break up a pipeline. We save the Cases after each sub-pipeline, and each sub-pipeline has its own CR. Like the previous approach, checkpointing occurs at predetermined points. It avoids duplicated processing effort since only Cases that make it through one sub-pipeline move on to the next. There is also no need for purpose-built CRs and pipelines, since the sub-pipelines and their CRs already play that role. However, saving the Cases to the local disk limits scaling across multiple servers, and losing a machine means losing its Cases. Checkpointing to another server (e.g., a file store, database, and so on) solves that problem, but it necessitates a mechanism to manage and identify the persisted Cases.
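The save-and-reconstitute flow between sub-pipelines can be sketched in plain Java (hypothetical names; a Cas is stood in for by a serializable map of analysis results — real UIMA has its own serialized Cas formats and IO utilities, which this toy does not use):

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Minimal sketch of checkpointing between sub-pipelines.
public class SubPipelineCheckpoint {
    // Persist a Cas stand-in to disk after a sub-pipeline finishes with it.
    static Path checkpoint(Map<String, String> cas, Path dir, String name)
            throws IOException {
        Path file = dir.resolve(name + ".ckpt");
        try (ObjectOutputStream out =
                new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(new HashMap<>(cas));
        }
        return file;
    }

    // The next sub-pipeline's CR reconstitutes the Cas from disk.
    @SuppressWarnings("unchecked")
    static Map<String, String> restore(Path file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                new ObjectInputStream(Files.newInputStream(file))) {
            return (Map<String, String>) in.readObject();
        }
    }
}
```

Writing to the local filesystem is exactly the limitation described above: swapping `dir` for a remote store removes the single point of failure but forces you to manage and identify the persisted blobs.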
The UIMA JCas Repository Service (CRS) takes the second solution and adds scalability plus a management and identification mechanism. It stores Cases as binary blobs and references them with unique, generated IDs (Cas IDs). It writes the Cas IDs to different places in its metadata store, and CRs that read the IDs from those areas route the Cases to specific pipelines. In short, the CRS is a scalable, managed “file store” for Cases.
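A toy model of that storage contract (hypothetical names; the actual CRS persists blobs and IDs to external stores, not an in-memory map):

```java
import java.util.*;

// Toy model of the CRS's contract: Cases go in as opaque binary
// blobs; a generated Cas ID comes back and is the only handle a
// later CR needs to retrieve and route the Cas.
public class CasRepository {
    private final Map<String, byte[]> blobs = new HashMap<>();

    // Store a serialized Cas and return its generated unique Cas ID.
    public String store(byte[] serializedCas) {
        String casId = UUID.randomUUID().toString();
        blobs.put(casId, serializedCas.clone());
        return casId;
    }

    // A CR fetches the blob by ID to reconstitute the Cas,
    // or gets null if no such Cas exists.
    public byte[] fetch(String casId) {
        byte[] blob = blobs.get(casId);
        return blob == null ? null : blob.clone();
    }
}
```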
We looked at why UIMA needs checkpointing. Then, we discussed some possible solutions. To cap things off, we introduced the CRS. Next time, we’ll examine its inner workings.