January 28, 2011

TIFF to JPEG 2000 backlog, losslessness, and a perplexing speed issue

In October 2010 we initiated our "TIFF to JPEG 2000 backlog project", an endeavor to convert all the legacy images that make up our current image archive (Wellcome Images), as well as around 120,000 images that had been created during the Archives digitisation project. Over 450,000 images comprise the backlog, saved in a multitude of folders, on different servers on our Pillar SAN storage system. Converting the Wellcome Images TIFFs to lossless JPEG 2000 will save us around 12 Tb of storage space alone.

Why lossless, you ask? We have indeed expounded on the merits of lossy compression for large image sets created as a result of digitisation projects. But there is a significant difference with regards to the backlog project. While digitisation projects are usually carried out on collections of material that have fairly similar physical formats (modern printed books, paper documents, Arabic manuscripts, etc.), lending themselves to a generalised approach to compression determined via testing, this backlog project has no overall commonality (other than that they are all TIFFs of one flavour or another). Wellcome Images is populated one image at a time, or by small sets of images, including born digital photography and represent a cross-section of hundreds of different content types. There was no feasible way to group these images into sets that could be assessed for compression tolerance. The decision was made, therefore, to convert the entire Wellcome Images backlog to lossless JP2 files, thus removing any doubt whether the compression levels were appropriate.

During the initial stages of this project, we tested our installation of the LuraWave conversion tool (v.2.1.21.10) with high volumes of images stored on our network storage (as all the archived TIFFs are). What we found surprised us - instead of 20 min or so we expected for a batch of around 600 25Mb images, it was taking all night (around 6 hours). Was it a bandwidth issue? With the support of our IT team we carried out tests over the 1Gb network area. It was still unacceptably slow, showing that bandwith was not the issue. We moved the same batch of images onto the local hard drive of the machine that LuraWave was installed on, and confirmed that, yes, LuraWave can convert those images in around 20 min when they are colocated.

We turned to our suppliers, LuraTech, who quickly ferreted out the problem. LuraWave was programmed to convert images in parallel, to speed up the process, but it also buffers images in parallel. This buffering process, when carried out across our 100Mb network cable, slowed down considerably due to the parallel running. LuraTech modified the programme to cache each image onto the local disk first, individually, before then buffering and converting in parallel as usual. This brought the overall time down by 80%. The version we are currently using is 2.1.22.10.

In practice our approach has been tailored to suit individual sets of images within our backlog. A balance has to be struck between ease of use and the practicalities of applying multiple processing stages to files over a 100Mb network. Some image sets are copied locally to external hard drives, taking advantage of the speed gains this gives, whereas others that are more straightforward can be processed directly over the network using the much improved processing speeds. The combined effeciencies made converting our entire backlog feasible within the timeframe we had to spend on it.

We are now about a third of the way through the conversion backlog, and on track to become virtually TIFF-free by May 2011. What I haven't mentioned is the colour profile embedding issues that cropped up, the legacy colour space problems, and the work LuraTech did in addressing these issues - the topic of a future blog post.