September 09, 2011

Simplifying our JPEG2000 conversion workflow

Over the summer, we have been working to streamline our JPEG 2000 conversion workflow. With the help of software developers from Genisys - one of the Trust’s strategic IT development and support partners - we have put the LuraWave command line interface to use in automating batch conversion.

Up to now we have been using the GUI that comes with the LuraWave software, manually entering parameters and initiating the conversion process for each batch of images. This was useful for us as we settled into a large-scale digitisation workflow incorporating RAW - TIFF - JP2 conversion, cleared our backlog and established our compression testing methodology (as described in previous posts on this blog). With no relevant in-house programming expertise, the GUI was essential during these early stages.

Now that we have a firm idea of how we want to use LuraWave, where it fits into the overall workflow, and what kind of throughput we need on a day-to-day basis, it is time to set up an automated solution.

The Wellcome Trust operates in an (almost) entirely Windows environment, so we commissioned the Genisys software engineers to code a .NET wrapper script running as an executable.  The wrapper script invokes LuraWave’s command line conversion to allow us to convert images with no manual intervention. An XML configuration file that contains the following information is used to control how the wrapper script invokes LuraWave:
  • "Inbox" directory (files ready for conversion)
  • Temporary directory (files copied before conversion)
  • "Outbox" directory (converted files)
  • LuraWave command line
  • Error directory
  • List of any files to exclude from conversion
LuraWave retains the original folder structure, so the "Inbox" and "Outbox" are the top-level directories, with the original folder hierarchy maintained throughout the conversion process.
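
To make the list above concrete, here is a minimal sketch of what such a configuration file and its loader might look like. It is written in Python rather than the .NET used for the actual wrapper, and the element names (`inbox`, `temp`, `outbox`, `errors`, `command`, `exclude`), the paths and the `load_config` helper are all hypothetical. The LuraWave command line is left as a placeholder because the real one is not given in this post.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from pathlib import PureWindowsPath

# Hypothetical configuration: element names and paths are illustrative only.
EXAMPLE_CONFIG = """\
<conversion>
  <inbox>D:\\jp2\\inbox</inbox>
  <temp>D:\\jp2\\temp</temp>
  <outbox>D:\\jp2\\outbox</outbox>
  <errors>D:\\jp2\\errors</errors>
  <command>PLACEHOLDER_LURAWAVE_COMMAND</command>
  <exclude>
    <file>Thumbs.db</file>
    <file>target.tif</file>
  </exclude>
</conversion>
"""


@dataclass
class WrapperConfig:
    inbox: PureWindowsPath    # files ready for conversion
    temp: PureWindowsPath     # files copied here before conversion
    outbox: PureWindowsPath   # converted files
    errors: PureWindowsPath   # files that could not be converted
    command: str              # converter command line (placeholder here)
    exclude: set = field(default_factory=set)


def load_config(xml_text: str) -> WrapperConfig:
    """Parse the (hypothetical) XML layout into a WrapperConfig."""
    root = ET.fromstring(xml_text)

    def text(tag: str) -> str:
        node = root.find(tag)
        if node is None or not (node.text or "").strip():
            raise ValueError("missing <{}> in configuration".format(tag))
        return node.text.strip()

    return WrapperConfig(
        inbox=PureWindowsPath(text("inbox")),
        temp=PureWindowsPath(text("temp")),
        outbox=PureWindowsPath(text("outbox")),
        errors=PureWindowsPath(text("errors")),
        command=text("command"),
        # Compare file names case-insensitively, as Windows does.
        exclude={f.text.strip().lower() for f in root.findall("exclude/file") if f.text},
    )
```

Keeping the directories and command line in a file like this means a new input stream can be added by writing a new configuration rather than changing the wrapper itself.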

Polling of the specified input folder is handled with Windows Scheduler, which can be run on a PC or on a server (we run it on a virtual server). Every 5 minutes Windows Scheduler prompts the script to check for TIFFs in the "Inbox". LuraWave is then invoked, converting the TIFFs to JP2s that are copied out to the "Outbox". We have error handling in place so that if one rogue file can’t be converted, the rest of the files still get converted. This is essential when converting big volumes: we don’t want the first file failing and halting an overnight run of thousands of files.
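
The behaviour described above (convert each file independently, and set a failing file aside instead of stopping the run) can be sketched as follows. This is an illustration in Python, not the Genisys .NET code. The `convert_batch` function and its `convert` callback are hypothetical, and the callback stands in for the real call to LuraWave, whose command line is not shown here.

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def convert_batch(inbox: Path, outbox: Path, errors: Path,
                  convert: Callable[[Path, Path], None],
                  exclude: Iterable[str] = ()) -> Tuple[List[Path], List[Path]]:
    """Convert every TIFF under inbox, mirroring the folder hierarchy in outbox.

    convert(src, dst) is a stand-in for invoking the real converter and is
    expected to raise an exception when a file cannot be converted.
    """
    excluded = {name.lower() for name in exclude}
    converted, failed = [], []
    # Take a sorted snapshot first so that moving files does not disturb the walk.
    for src in sorted(inbox.rglob("*")):
        if not src.is_file() or src.suffix.lower() not in {".tif", ".tiff"}:
            continue
        if src.name.lower() in excluded:
            continue
        rel = src.relative_to(inbox)
        dst = (outbox / rel).with_suffix(".jp2")
        dst.parent.mkdir(parents=True, exist_ok=True)
        try:
            convert(src, dst)
            converted.append(dst)
        except Exception:
            # One rogue file must not halt the run: move it aside and carry on.
            quarantine = errors / rel
            quarantine.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(quarantine))
            failed.append(quarantine)
    return converted, failed
```

A scheduled task would call something like `convert_batch` once per poll. What happens to successfully converted TIFFs left in the "Inbox" is not described in this post, so the sketch simply leaves them in place.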

Windows Scheduler does not process jobs in parallel, so folders are queued for conversion. With speeds of around 30 GB (at least 1,200 TIFFs) per hour, this is quick enough for our needs.

This implementation means that a single LuraWave license can be used for any number of input streams, and with the facility to "call" multiple definitions, it can also convert images to multiple JPEG 2000 profiles (we currently have a lossless profile and a lossy profile).

With thanks to Alastair Reid, Wellcome Trust IT Account Manager, for providing this information and reviewing this post.

June 21, 2011

Thoughts on the 2011 JP2 Summit

I attended the JP2 Summit in Washington D.C. in May (initiated and organised by Robert Buckley and Steve Puglia and hosted by the Library of Congress) representing both the Wellcome Library and the JP2K-UK Working Group. I found this event an interesting counterpart to the JPEG2000 Seminar we held here at the Wellcome Trust last year.

There were around 90 people at the Summit, most from the D.C. area and eastern seaboard cultural institutions such as the LoC; National Archives; Smithsonian libraries and archives; a range of university libraries including Yale, Harvard, U. of Virginia, UConn; NARA; and many others. The level of experience in digital imaging and preservation was generally quite high, while the understanding of JPEG2000 ranged from very little to highly informed. It was nearly a mirror of the audience at the Wellcome Trust event, though perhaps with fewer privately funded organisations represented (there were some, including Google).

The day began with a tutorial by Robert Buckley, and although I had heard much of this in previous presentations, or through reading up on JP2, I always find it hard to keep the details fresh in my mind. So it was useful to get a refresher, and it set the stage well for people who had little knowledge of the technical issues and background to the format.

After the tutorial, there was a series of presentations, all of which are listed on the JPEG2000 page of the FADGI website. I won't go into the details here (and you can read more on Steve Puglia's blog post), but we heard about a range of practical issues around use of JP2 for newspaper digitisation, digital video, special collections and Google books; technical developments around implementing JP2 as part of a workflow including quality assurance and issues of long-term preservation; and the results of a survey of use and attitudes toward JP2 in libraries and archives.

In the library and archive community JP2 is being adopted mainly for mass digitisation, with storage costs being the primary driver - there is no denying that. What was clear here - as with the presentations given last year - was that while JP2 is not yet the most practical solution in terms of usability, it is becoming more and more widely accepted for its flexibility and robustness as well as for its space-saving intelligent compression. With increasing knowledge of the format, practitioners are now coming to see JP2 in the context of these other important features, and investigating - even demanding - ways to use these other features more easily.

Of course, not everyone is 100% convinced that JP2 can meet the needs of digital archiving, or digital image delivery. Many concerns seem to have been allayed by the presentations and tutorial - simply by finding out how many people are using the format, and how much value they get from it. There are still barriers to people taking up JP2 more enthusiastically - mainly around the lack of adoption by digital cameras and browsers, loss of information in lossy compression, the risk that there still isn't a wide enough take-up in the community to maintain the currency of the format in the longer term, and the small range of tools for implementing the format, which simply can't meet their needs.

The second day of the Summit finished off with a small-group discussion session around JP2 implementation. For me, the most interesting part of this discussion was around community building.

While we may never see digital cameras natively producing JP2s, for example, some barriers can be broken down by simply sharing. Information on and results of testing, tools and ways to use them, workflow advice, and preservation technologies are all important and can easily be shared. Use of JP2 doesn't always boil down to technical reassessment, however. It also means revisiting certain aspects of digital preservation strategy, such as defining significant properties/data, predicting migration scenarios and what they really entail, and determining what the use of the digital content really is. It means recognising emotional responses to preservation risks, too, and the fact that these decisions have a long-term effect, shaping the legacy of entire collections. The leap to JP2 is best done in collaboration, and moral support should not be discounted!

June 15, 2011

The JP2K-UK wiki has moved

The wiki created as part of the JP2K-UK working group has been moved to a dedicated space on the Open Planets wiki. The content has now been transferred and is in the process of being updated and added to. We welcome contributions - all you have to do is log into the OPF wiki.

May 27, 2011

ICC profiles and LuraWave

Johan van der Knijff's long-awaited D-Lib paper, JPEG 2000 for long term preservation: JP2 as a preservation format, has now come out. In this paper he mentions the various ways LuraWave has handled colour profile information, and I thought it was a good time to elaborate on the developments we have commissioned from Luratech regarding this issue.

As Johan mentions in the paper, when we started using LuraWave and carrying out JHOVE testing to determine whether the files were compliant with the standard, we found that where an ICC display profile was included in the TIFF (and this was virtually standard across our image set) LuraWave automatically encoded the file as JPX in a JP2 wrapper. This ensured compliance with the standard, but we were not happy with using JPX. So we asked Luratech to modify LuraWave to include an additional command that allowed us to tell the application to ignore the ICC profile completely. This meant that we got a 100% JP2 file, but the colour profile information was then stripped out.

We wanted to include a colour profile in our digital image files. This prevents ambiguity when decoding the images in an image editor or image viewer. We were left with only one option - convert everything to sRGB and allow LuraWave to include the numerical value of sRGB in the file, which is allowed by the standard. Adobe RGB 1998, as Johan explains in detail in his article, is allowed only as an input profile, and our images did not include an input profile (and we didn't know how we could go about adding an input profile to our images).

We knew that it wouldn't matter to us, to the user, or to the decoding programme, how the profile was labelled - as long as it was there. It mattered only to the standard. So we asked Luratech to modify LuraWave yet again in order to read the display profile in our TIFFs and embed it into the JP2 file as an input profile. It is not an input profile. But we were limited by the standard, and this was our best option within those limitations to ensure we could include colour information without having to limit ourselves to sRGB - and without having to add in a workflow step to convert all our legacy images to sRGB.

This is the version of LuraWave that we currently use (2.1.22.10 - which includes other enhancements around improving performance, as reported in an earlier blog post). However, since Johan has succeeded in raising awareness of the deficient colour space provision in the standard, leading to agreement in the JPEG Committee to change the standard to accommodate real use scenarios such as our own, we can envisage requesting further changes to the LuraWave command line tool once this is finalised.

April 28, 2011

Guest post: Color in JP2

Rob Buckley, colour imaging expert and author of JPEG 2000 as a Preservation and Access Format for the Wellcome Library, writes about the implementation of colour space metadata in the JP2 format and planned changes to the specification to better accommodate this information.

When I talk about JPEG 2000, I point out that most if not all still image applications that use JPEG 2000, especially in the cultural heritage community, can be satisfied with the JP2 file format. JP2 is the basic file format defined in Part 1 of the JPEG 2000 standard, along with the core decoder. Part 2 of the standard defines extended versions of both the file format and decoder, offering features aimed at specialized or advanced applications.

One point of confusion about the use of JP2 has had to do with its support for color spaces. When we were developing JP2 in the late 1990s (JPEG 2000 was intended to come out in 2000), the application that most influenced the design was digital photography: JP2 was expected to be the next digital camera format. So support for sRGB was built in, along with support for the YCC and grayscale versions of sRGB. Other RGB color spaces used for image capture would be supported by using ICC input profiles, leaving aside display and output profiles. However, not all ICC input profiles were allowed: support was restricted to the ones needed for grayscale and RGB image data. Not supported, and considered too complex for applications without a full color management engine, was the input profile type that used a full multi-dimensional lookup table. So users had the choice of specifying color in a JP2 file by name as sRGB (or sYCC or sGray) or via a simple ICC input profile.
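
These two choices correspond to the two methods of the JP2 Colour Specification (`colr`) box: an enumerated color space, or a restricted ICC profile. The following Python sketch shows how the payload of that box can be interpreted. The `describe_colr` function and the dictionary it returns are assumptions made for this illustration, not part of any particular library, and the sketch ignores the precedence and approximation bytes.

```python
import struct

# Enumerated colourspace codes available in a JP2 file.
ENUMERATED_COLOURSPACES = {16: "sRGB", 17: "greyscale", 18: "sYCC"}


def describe_colr(payload: bytes) -> dict:
    """Interpret the payload of a JP2 'colr' box (the bytes after the box header).

    Layout: METH (1 byte), PREC (1 byte), APPROX (1 byte), then either a
    4-byte enumerated colourspace code (METH = 1) or an ICC profile (METH = 2).
    """
    if len(payload) < 3:
        raise ValueError("colr payload too short")
    method = payload[0]
    if method == 1:
        if len(payload) < 7:
            raise ValueError("enumerated colr payload too short")
        (code,) = struct.unpack(">I", payload[3:7])
        name = ENUMERATED_COLOURSPACES.get(code, "unknown ({})".format(code))
        return {"method": "enumerated", "colourspace": name}
    if method == 2:
        profile = payload[3:]
        if len(profile) < 128:
            raise ValueError("ICC profile shorter than its 128-byte header")
        # Bytes 12-15 of the ICC header hold the profile class, e.g. 'scnr' (input).
        profile_class = profile[12:16].decode("ascii", "replace")
        return {"method": "restricted ICC", "profile_class": profile_class}
    return {"method": "other ({})".format(method)}
```

For a file that names sRGB directly, the result is `{"method": "enumerated", "colourspace": "sRGB"}`. For a file carrying an ICC profile, the reported profile class shows whether it is an input profile or something else.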

After the release of the JPEG 2000 standard, two things happened. First, digital cameras kept exporting the JPEG Baseline format; when they added a new export format, it was Raw and not JP2. The drive was toward more creative control rather than better compression when what they had was good enough.

The second thing was that most people ended up using ICC display profiles for RGB spaces rather than input profiles. A small thing, you’d think, especially when the only difference between the display profiles they used and the input profiles supported by JP2 was the profile class value in the profile’s header: except for that, the data content of the two profile types is identical for RGB color spaces. As a result, I could take a JP2 file containing an RGB display profile (which technically makes the JP2 file illegal), change the profile class from display to input (by changing four bytes in the profile header and leaving everything else the same) and produce a legal JP2 file. It turns out that most readers ignore this value anyway and read the file fine either way. Using the extended file format was no help because it only extended color support to all types of input profiles, plus some other named and vendor-specified color spaces.
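
The four-byte change described above can be sketched in a few lines of Python. In an ICC profile the class is a four-character signature at bytes 12 to 15 of the 128-byte header: `mntr` for a display profile and `scnr` for an input profile. The `display_to_input` function name is hypothetical, and the sketch works on the profile bytes alone; locating the profile inside a JP2 or TIFF file is left out.

```python
PROFILE_CLASS = slice(12, 16)  # profile class field in the 128-byte ICC header


def display_to_input(profile: bytes) -> bytes:
    """Return a copy of an ICC display profile relabelled as an input profile.

    Only the four-byte class signature is rewritten; everything else,
    including all the colour data, is copied unchanged.
    """
    if len(profile) < 128:
        raise ValueError("an ICC profile starts with a 128-byte header")
    current = profile[PROFILE_CLASS]
    if current != b"mntr":
        raise ValueError("expected a display profile ('mntr'), found {!r}".format(current))
    patched = bytearray(profile)
    patched[PROFILE_CLASS] = b"scnr"
    return bytes(patched)
```

One caveat: version 4 profiles may carry a profile ID checksum in the header, and this sketch does not recompute it.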

This confusion needed to be addressed as more and more institutions are using JP2 as a long-term preservation format, where predictability and clarity are prized. The solution is straightforward: amend the JP2 file format specification, aligning it with current practice so that it supports ICC display profiles as well as the set of input profiles it supports now.

And this is what is happening. Richard Clark and I led an activity that culminated in the JPEG 2000 committee approving a new activity to amend JP2 when it met this past February in Tokyo. This means that JP2 will support a wide range of RGB color spaces, which was the original intent, via both ICC input and display profiles. Since the JP2 spec was first issued, the ICC spec has undergone a major revision from V2 to V4 and been issued as an ISO standard. While this revision hardly affects the profiles used for RGB color spaces, it will also be addressed as part of the amendment. (The amendment will also address the ambiguity in the JP2 definition of resolution that Johan van der Knijff has brought up on this blog.)

The final outcome of all this will be a JP2 file format standard that aligns with current practice; supports RGB spaces such as Adobe RGB 1998, ProPhoto RGB and eci RGB v2; and provides a smooth migration path from TIFF masters as JP2 increasingly becomes used as an image preservation format.

January 28, 2011

TIFF to JPEG 2000 backlog, losslessness, and a perplexing speed issue

In October 2010 we initiated our "TIFF to JPEG 2000 backlog project", an endeavor to convert all the legacy images that make up our current image archive (Wellcome Images), as well as around 120,000 images that had been created during the Archives digitisation project. Over 450,000 images comprise the backlog, saved in a multitude of folders, on different servers on our Pillar SAN storage system. Converting the Wellcome Images TIFFs to lossless JPEG 2000 will save us around 12 TB of storage space alone.

Why lossless, you ask? We have indeed expounded on the merits of lossy compression for large image sets created as a result of digitisation projects. But there is a significant difference with regard to the backlog project. While digitisation projects are usually carried out on collections of material that have fairly similar physical formats (modern printed books, paper documents, Arabic manuscripts, etc.), lending themselves to a generalised approach to compression determined via testing, the images in this backlog have no overall commonality (other than that they are all TIFFs of one flavour or another). Wellcome Images is populated one image at a time, or by small sets of images, including born-digital photography, and represents a cross-section of hundreds of different content types. There was no feasible way to group these images into sets that could be assessed for compression tolerance. The decision was made, therefore, to convert the entire Wellcome Images backlog to lossless JP2 files, thus removing any doubt whether the compression levels were appropriate.

During the initial stages of this project, we tested our installation of the LuraWave conversion tool (v.2.1.21.10) with high volumes of images stored on our network storage (as all the archived TIFFs are). What we found surprised us - instead of the 20 min or so we expected for a batch of around 600 25 MB images, it was taking all night (around 6 hours). Was it a bandwidth issue? With the support of our IT team we carried out tests over the 1Gb network area. It was still unacceptably slow, showing that bandwidth was not the issue. We moved the same batch of images onto the local hard drive of the machine that LuraWave was installed on, and confirmed that, yes, LuraWave can convert those images in around 20 min when they are colocated.

We turned to our suppliers, LuraTech, who quickly ferreted out the problem. LuraWave was programmed to convert images in parallel, to speed up the process, but it also buffers images in parallel. This buffering process, when carried out across our 100Mb network cable, slowed down considerably due to the parallel running. LuraTech modified the programme to cache each image onto the local disk first, individually, before then buffering and converting in parallel as usual. This brought the overall time down by 80%. The version we are currently using is 2.1.22.10.
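
To put these figures side by side, here is a small back-of-the-envelope calculation in Python. It rests on two assumptions that the text does not state outright: that the image size quoted above means 25 megabytes per image, and that the 80% reduction applies to the 6-hour network run. The variable and function names are illustrative.

```python
BATCH_IMAGES = 600            # images in the test batch
IMAGE_MB = 25                 # assumed: megabytes per image
NETWORK_MINUTES = 6 * 60      # "all night (around 6 hours)" over the network
LOCAL_MINUTES = 20            # same batch on the local hard drive
REDUCTION_PERCENT = 80        # improvement reported after LuraTech's change

batch_mb = BATCH_IMAGES * IMAGE_MB  # 15,000 MB, roughly 15 GB


def mb_per_second(total_mb: float, minutes: float) -> float:
    """Average throughput for a batch converted in the given time."""
    return total_mb / (minutes * 60)


local_rate = mb_per_second(batch_mb, LOCAL_MINUTES)       # 12.5 MB/s
network_rate = mb_per_second(batch_mb, NETWORK_MINUTES)   # about 0.69 MB/s

# Assumed: the 80% saving is measured against the 6-hour network run.
fixed_minutes = NETWORK_MINUTES * (100 - REDUCTION_PERCENT) / 100  # 72 minutes

print("local: {:.1f} MB/s, network before fix: {:.2f} MB/s".format(local_rate, network_rate))
print("network after fix: about {:.0f} minutes per batch".format(fixed_minutes))
```

On these assumptions the modified version would still be slower over the network than a local run (about 72 minutes against 20).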

In practice our approach has been tailored to suit individual sets of images within our backlog. A balance has to be struck between ease of use and the practicalities of applying multiple processing stages to files over a 100Mb network. Some image sets are copied locally to external hard drives, taking advantage of the speed gains this gives, whereas others that are more straightforward can be processed directly over the network using the much improved processing speeds. The combined efficiencies made converting our entire backlog feasible within the timeframe we had to spend on it.

We are now about a third of the way through the conversion backlog, and on track to become virtually TIFF-free by May 2011. What I haven't mentioned are the colour profile embedding issues that cropped up, the legacy colour space problems, and the work LuraTech did in addressing these issues - the topic of a future blog post.