UIMA JCas Repository Service: Checkpointing for UIMA

Chuong Ngo, Technical Consultant

Theories do not pay the bills. In the real world, we need something we can use.

Last time, we looked at why UIMA needs checkpointing and introduced the CRS. This time, let’s look at the inner workings of the CRS.

To Store, To Store

The CRS is a scalable, managed “file store” for UIMA Cases. It allows for the checkpointing of UIMA Cases at predetermined points, the retrieval of the checkpointed Cases, and the routing of Cases between pipelines. Scaling the CRS is trivial. Deploying it to the cloud is also easy. A no-frills implementation of the CRS can be found here.

Miles To Go Before I REST

The CRS is a RESTful interface on top of a data repository, metadata store, and cache. The data repository holds the binary Cas data. The CRS also writes the Cases' metadata to the metadata store. Because the metadata store is segmented into smaller stores, it enables the routing of Cases between UIMA pipelines. The CRS generates a unique Cas ID every time a Cas is stored, and that ID is later used to retrieve the Cas. The cache stores recent IDs to help ensure that IDs generated in the future are unique.
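The ID uniqueness check described above can be sketched as follows. This is only an illustration under stated assumptions: the class name `CasIdGenerator`, the use of random UUIDs as candidate IDs, and the bounded in-memory map standing in for the cache are all hypothetical and are not taken from the CRS code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;
import java.util.function.Supplier;

// Hypothetical sketch: an in-memory map stands in for the CRS cache of recent Cas IDs.
class CasIdGenerator {
    private final Supplier<String> candidateSource;
    private final Map<String, Boolean> recentIds;

    CasIdGenerator(Supplier<String> candidateSource, int maxRecent) {
        this.candidateSource = candidateSource;
        // Insertion-ordered map that evicts the oldest ID once the cap is exceeded.
        this.recentIds = new LinkedHashMap<String, Boolean>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxRecent;
            }
        };
    }

    // Draw candidates until one is not among the recent IDs, then remember it.
    synchronized String nextId() {
        String candidate = candidateSource.get();
        while (recentIds.containsKey(candidate)) {
            candidate = candidateSource.get();
        }
        recentIds.put(candidate, Boolean.TRUE);
        return candidate;
    }

    public static void main(String[] args) {
        // Assumption: random UUIDs as candidate IDs; the real ID scheme may differ.
        CasIdGenerator generator =
                new CasIdGenerator(() -> UUID.randomUUID().toString(), 1000);
        System.out.println(generator.nextId());
    }
}
```

Taking the candidate source as a parameter keeps the collision handling testable without depending on randomness.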

The CRS enables dynamic routing of UIMA Cases.

The data repository, metadata store, and cache are scaled independently of the REST interface. The REST interface itself can be scaled by running multiple instances behind a load balancer, like Nginx.

Integrating the CRS into a UIMA-based system requires creating CRS-aware UIMA AEs and CRs. A CRS-aware AE communicates with the CRS to store the Cases it receives. It also tells the CRS where to register each Cas in the metadata store. While the CRS-aware AEs can be placed anywhere in a pipeline, it is best to include them at the end.

Upon initialization, each CR reads from its own location in the metadata store to retrieve the Cas IDs of the Cases it should get. Using those Cas IDs, the CR retrieves the Cases from the CRS. It reconstitutes each Cas object and sends it off to the CR's associated pipeline for processing. It also deletes a Cas from the CRS once the Cas has finished processing.

Having individual CRs and AEs use different parts of the metadata store is how the CRS enables the dynamic routing of Cases. A CR is unaware of Cas IDs written after its initialization, so the AEs and CRs can run in parallel. Deleting successfully processed Cases eliminates duplicated processing. In short, we loosely couple pipelines by using CRS-aware AEs and CRs.
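To make the routing idea concrete, here is a minimal in-memory sketch. It is not how the CRS is implemented (the real metadata store is a separate service); the class `MetadataStoreSketch`, its method names, and the segment names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the segmented metadata store.
class MetadataStoreSketch {
    // One list of Cas IDs per segment; a segment is addressed by a query key.
    private final Map<String, List<String>> segments = new HashMap<>();

    // AE side: register a stored Cas under the segment of the target pipeline.
    synchronized void register(String queryKey, String casId) {
        segments.computeIfAbsent(queryKey, key -> new ArrayList<>()).add(casId);
    }

    // CR side: copy the IDs of one segment at initialization time.
    synchronized List<String> snapshot(String queryKey) {
        return new ArrayList<>(segments.getOrDefault(queryKey, new ArrayList<>()));
    }

    // CR side: drop a Cas ID everywhere once its Cas has been processed.
    synchronized void remove(String casId) {
        segments.values().forEach(ids -> ids.remove(casId));
    }

    public static void main(String[] args) {
        MetadataStoreSketch store = new MetadataStoreSketch();
        store.register("pipeline-b", "cas-1");
        List<String> seenAtInit = store.snapshot("pipeline-b");
        store.register("pipeline-b", "cas-2"); // written after the CR initialized
        System.out.println(seenAtInit);
        System.out.println(store.snapshot("pipeline-b"));
    }
}
```

Because the CR works from a copy taken at initialization, an AE can keep registering new IDs in the same segment without affecting a CR that is already running.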

Details, Details

A no-frills implementation of the CRS is available here. It is a Java 11 web application. The data repository and metadata store leverage MongoDB, and Redis is the cache. The RESTful interface uses Jersey 3.0.3 and Jakarta 5.0.0, while SLF4J provides logging. Communication between the CRS and CRS-aware components uses Protobuf objects. Any compatible web server can accept the WAR file.

A diagram of the CRS in action.

Upon initialization, the CRS reads a Properties file (i.e., web.config.properties) to get the cache URI. It then reads the remaining configuration values from the cache. Next, it stores singleton data access objects (DAOs) for the data repository, metadata store, and cache in the ServletContext. The DAOs use synchronous connections to their respective stores instead of asynchronous ones. Because the CRS scales by running multiple instances, the additional complexity of asynchronous connections was not worth it.
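The first bootstrap step, reading the cache URI from a Properties file, might look roughly like the sketch below. The property name `cache.uri`, the class name `CacheUriLoader`, and the example Redis URI are assumptions for illustration; the actual key used in web.config.properties may differ.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.Properties;

// Hypothetical sketch of the first bootstrap step: find the cache URI.
class CacheUriLoader {
    // Assumed property name; not taken from the CRS source.
    static final String CACHE_URI_KEY = "cache.uri";

    static URI loadCacheUri(Reader source) {
        Properties properties = new Properties();
        try {
            properties.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = properties.getProperty(CACHE_URI_KEY);
        if (value == null || value.trim().isEmpty()) {
            // Fail fast: without the cache, no other configuration can be read.
            throw new IllegalStateException("Missing required property: " + CACHE_URI_KEY);
        }
        return URI.create(value.trim());
    }

    public static void main(String[] args) {
        // A StringReader stands in for the real properties file.
        Reader source = new StringReader("cache.uri=redis://localhost:6379\n");
        System.out.println(loadCacheUri(source));
    }
}
```

Keeping only the cache URI in the file means every other setting lives in one shared place, which suits a service that scales by adding instances.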

Saving a Cas

An AE kicks this off by creating a Cas Protobuf object. The AE then affixes the serialized UIMA Cas to it as a binary array. It also adds the binary array's CRC hash to the Protobuf object. Then, the AE creates a Message Protobuf object to hold the Cas Protobuf object, the document ID string, and a query key. The query key specifies how to save the UIMA Cas's metadata in the metadata store. Finally, the AE serializes the Message object and sends it off to the /rest/store endpoint with a POST.

The CRS takes the binary data and checks it against the hash. Then, it generates a unique Cas ID and stores the serialized UIMA Cas in its data repository with that ID. The generated Cas ID is then registered with the metadata store using the provided query key. Finally, the CRS creates Cas and Message Protobuf objects to return to the calling AE. The returned objects hold the Cas ID, a status code, and a message detailing how the process went. In the case of an error state, they also hold an exception class.
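The integrity check in this exchange can be sketched with the JDK's `java.util.zip.CRC32`. This assumes the "CRC hash" is a CRC-32 checksum, which the post does not state, and the `CasPayload` class is a plain hypothetical stand-in for the Cas Protobuf object rather than the generated Protobuf type.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch of the checksum handshake between an AE and the CRS.
class CasChecksum {
    // Plain stand-in for the Cas Protobuf object: bytes plus their checksum.
    static final class CasPayload {
        final byte[] data;
        final long crc;

        CasPayload(byte[] data, long crc) {
            this.data = data;
            this.crc = crc;
        }
    }

    // Assumption: CRC-32 as implemented by java.util.zip.CRC32.
    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // AE side: attach the serialized Cas and its checksum.
    static CasPayload wrap(byte[] serializedCas) {
        return new CasPayload(serializedCas.clone(), crcOf(serializedCas));
    }

    // CRS side: recompute the checksum and compare before storing.
    static boolean isIntact(CasPayload payload) {
        return crcOf(payload.data) == payload.crc;
    }

    public static void main(String[] args) {
        byte[] serializedCas = "pretend Cas bytes".getBytes(StandardCharsets.UTF_8);
        CasPayload payload = wrap(serializedCas);
        System.out.println("intact: " + isIntact(payload));
    }
}
```

A checksum like this catches accidental corruption in transit; it is not a defense against deliberate tampering.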


Retrieving a Cas

When initialized, a CRS-aware CR grabs a list of Cas IDs from the metadata store. For each Cas ID, it retrieves the stored Cas, reconstitutes it, and sends it off for processing. To retrieve a Cas, the CR attaches the Cas ID to a Cas Protobuf object, which is in turn attached to a Message Protobuf object. The CR sends that Message object to the /rest/get endpoint with a POST. The CRS gets the checkpointed UIMA Cas from its data repository and returns it attached to a Cas Protobuf object inside a Message Protobuf object. Then, the calling CR extracts the binary data and reconstitutes a UIMA Cas from it. The reconstituted Cas is sent off to the CR’s associated pipeline for processing.
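On the wire, each of these calls is a POST carrying a serialized Message. The sketch below builds such a request with the Java 11 `java.net.http` API. The base URI, the `application/x-protobuf` content type, and the `CrsRequests` helper are assumptions for illustration, and the byte array stands in for a serialized Message Protobuf object.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper for building requests to the CRS endpoints.
class CrsRequests {
    // Builds a POST whose body is an already-serialized Message object.
    static HttpRequest post(URI baseUri, String endpoint, byte[] serializedMessage) {
        return HttpRequest.newBuilder()
                .uri(baseUri.resolve(endpoint))
                // Assumed media type; the CRS may expect a different one.
                .header("Content-Type", "application/x-protobuf")
                .POST(HttpRequest.BodyPublishers.ofByteArray(serializedMessage))
                .build();
    }

    public static void main(String[] args) {
        // Assumed deployment location; adjust to the actual host and context path.
        URI baseUri = URI.create("http://localhost:8080");
        byte[] serializedMessage = new byte[] {1, 2, 3}; // placeholder bytes
        HttpRequest request = post(baseUri, "/rest/get", serializedMessage);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

The same helper would serve the /rest/store and /rest/delete endpoints, since all three accept a POST. Sending the request is then a matter of passing it to `HttpClient.newHttpClient().send(...)` with a byte-array body handler and parsing the returned bytes as a Message.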

Deleting a Cas

Once a CR is notified that its associated pipeline has finished processing a Cas, the CR will instruct the CRS to delete that Cas. It sends a Cas Protobuf object attached to a Message Protobuf object to the /rest/delete endpoint using a POST. The CRS extracts the Cas ID and removes all entries with it from its data repository and metadata store.

Wrapping Things Up

Overall, the CRS is quite simple in design, concept, and implementation. For Java code examples of how to interact with the CRS, see the integration tests.

Banner image credit to Cybrain