For the past few years, I have been working as a solution architect for a large-scale digital preservation programme. One Internet-facing application developed for the programme was a Java servlet-based viewer that displays the contents of archived websites harvested using the Web Curator Tool and stored in the digital preservation system.
A web harvest contains a set of ARC files and one ARC index file (in the CDX file format). Each entry in the CDX file for a given web harvest is an ARC record - a structure that holds information about a harvested web resource, such as its original URL and where its underlying content stream can be found (the name of the ARC file, the seek position and the stream length). The CDX file can thus be considered the "table of contents" of all resources captured inside the various ARC files of that particular web harvest.
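To make that structure concrete, here is a minimal sketch of what such an ARC record might look like as a Java class. The class and field names are illustrative assumptions for this article, not the actual types used in the viewer.

```java
// Illustrative sketch only - the class and field names are assumptions,
// not the actual classes used in the viewer.
public class ArcRecord {
    private final String originalUrl;  // URL of the harvested resource
    private final String arcFileName;  // ARC file holding the content stream
    private final long offset;         // seek position within the ARC file
    private final long length;         // length of the content stream in bytes

    public ArcRecord(String originalUrl, String arcFileName, long offset, long length) {
        this.originalUrl = originalUrl;
        this.arcFileName = arcFileName;
        this.offset = offset;
        this.length = length;
    }

    public String getOriginalUrl() { return originalUrl; }
    public String getArcFileName() { return arcFileName; }
    public long getOffset()        { return offset; }
    public long getLength()        { return length; }
}
```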
To improve the performance of web harvest viewing, we employed an Ehcache-based caching solution in the viewer. As soon as the user clicks to view a given web harvest, the viewer converts the contents of the corresponding CDX file into a collection of ARC records and stores this collection as an element in the cache (see the class diagram). When the user then requests various pages of the same web harvest, the application does not need to parse the CDX file again; it simply retrieves the harvest's ARC record collection from the cache, locates the particular ARC record, and reads the corresponding ARC file from disk for the resource content (a rough sketch of this flow is given below). The viewer was deployed on the same application server that also runs the third-party digital preservation software.
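In outline, the caching flow looks something like the following sketch, written against the Ehcache 2.x API. The cache name, the harvest-id key and the parseCdxFile helper are illustrative assumptions, not the viewer's actual code.

```java
import java.util.List;

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class HarvestViewerCacheSketch {

    private final Cache cache;

    public HarvestViewerCacheSketch(CacheManager cacheManager) {
        // "arcRecordCache" is an assumed cache name, defined in the Ehcache configuration.
        this.cache = cacheManager.getCache("arcRecordCache");
    }

    /** Returns the ARC record collection for a harvest, parsing the CDX file only on a cache miss. */
    public List<ArcRecord> getArcRecords(String harvestId) {
        Element element = cache.get(harvestId);
        if (element != null) {
            @SuppressWarnings("unchecked")
            List<ArcRecord> cached = (List<ArcRecord>) element.getObjectValue();
            return cached;
        }
        // Cache miss: parse the CDX file and cache the whole collection under the harvest id.
        List<ArcRecord> records = parseCdxFile(harvestId);  // hypothetical CDX parser
        cache.put(new Element(harvestId, records));
        return records;
    }

    private List<ArcRecord> parseCdxFile(String harvestId) {
        // CDX parsing omitted for brevity.
        throw new UnsupportedOperationException("not shown");
    }
}
```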
All worked well until the application server started throwing Java "Out Of Memory" errors every now and then. Analysis of a heap dump taken on one such occasion revealed that the ARC record collections held in the cache occupied about 50% (close to 2 GB) of the used heap space. With the web archive viewer "stealing" memory in this way, the digital preservation application was struggling to find enough space for its own processing. The screenshot of the memory analyser shown below indicates the heap utilisation of the cached objects.
What happened here to make the cache eat up so much memory?
- The design has a constraint that every cache element must be the collection of ARC records for a given web harvest. Some web harvests were small, containing only a few hundred web resources. However, the preservation system also held many "killer" web harvests of close to 100,000 web resources each. Viewing one such harvest creates a single ARC record collection containing that many ARC record objects - big enough to bloat the cache and, in turn, the Java heap. And since the preservation system may store even bigger web harvests in the future, it was not possible to reliably profile the size of the objects stored in the cache.
- The original configuration parameters of the cache did not match the usage pattern of the application. For example, the cache was configured with a very high value for the "maximum elements in memory" parameter, yet the average number of users in the system was only a small fraction of that value. In addition, the cache used Least Frequently Used (LFU) as its "memory store eviction policy". This turned out to be unsuitable: the largest web harvests scored highest as the most frequently used objects, so they were never evicted from the cache even after expiry! (A configuration along these lines is sketched after this list.)
- In Ehcache, "expired" elements do not mean "evicted" elements. The eviction happens only when the threshold is reached. With the cache configured with a large value for the maximum number of elements, the cache got filled very soon with very many ‘dead’ ARC record collection objects in the cache occupying the space, but not being used at all.
Once the web archive viewer started affecting the stability of the application server and the digital preservation application, I took a very close look at how to tune the cache. Among the parameters tuned were the maximum number of elements in memory and the eviction policy.
Interested to know more? Please continue reading Part 2 of this article.