August 13, 2010

The JPEG2000 problem for this week

JPEG2000 isn’t the easiest of formats to disseminate. Browsers typically struggle with the format and require plugins or extensions to render it. We don’t want our users to have to download anything just to be able to view our material online. So we plan to convert our JPEG2000 files to a browser-friendly JPEG or PDF for dissemination. Both formats are handled admirably by browsers. (OK, PDF needs an Adobe plugin, but it's commonly included with browsers.) Other formats may come along later. The thing is, how do we do that conversion? There are plenty of conversion tools out there; we use Lurawave for the image conversion. But then the question becomes: when do we convert from a master to a dissemination format, especially if we want speedy delivery of content to the end user?

One of the guiding principles behind our decision to use JPEG2000 was that we could reduce our overall storage requirements by creating smaller files than we might have done if we’d used, say, TIFF. So if we automatically convert every JPEG2000 to a low-res thumbnail JPEG, a medium-res JPEG, a high-res JPEG and a PDF, then we’re back to having to find storage for these dissemination files. OK, JPEG won’t consume terabytes of storage and nor will PDF, but we’d need structured storage to keep track of each manifestation, and metadata to tell our front end delivery system which JPEG to use in which circumstances. True, this has been done very successfully for many projects before now, but alongside efficiency of storage we also want efficiency in managing what we have stored, and speedy delivery.

So we plan to convert JPEG2000 to JPEG or PDF on-the-fly at the time each image is requested. The idea is that we serve JPEG2000 images out of our DAM to an image server, the image is converted and the dissemination file served up. Instead of paying for large volumes of static storage we believe that putting the saving on storage into a fast image server will directly benefit those who want to use our material online.

One outcome of a conversation with DLConsulting is that we've learned that on-the-fly conversion is a potentially system-intensive (and at worst inefficient) activity that could create a bottleneck in the delivery of content to the end user. We've said that speed is an issue. We need to efficiently process the tiled and layered JPEG2000 files we plan to create. A faster, more powerful image server may help, but good conversion software will be key. Alongside on-the-fly conversion we plan to use a cache that would hold, in temporary storage, the most requested images/PDFs. The cache would work something like this. It has a limited size/capacity and contains the most popular/most often requested images/PDFs. If an image/PDF in the cache were not requested for n amount of time, it would be removed from the cache. In practice, a user requests a digitised image of a painting and the front end system queries the cache to see if the image is there; if it is, it's served directly and swiftly to the user. If not, the front end system calls the file from the back end DAM. The DAM delivers that image to the image server, which converts the JPEG2000 to JPEG and places that image in the cache, from where it can be passed to the front end system and the end user. Smooth, fast and efficient in the use of system resources.
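To make the cache behaviour described above a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the design of any real system: the class name DisseminationCache, the stand-in functions fetch_from_dam and jp2_to_jpeg, and the capacity and idle-time values are all hypothetical. It also approximates "most often requested" with a simple least-recently-used rule, which is a simplification of true popularity tracking.

```python
import time
from collections import OrderedDict
from typing import Callable


class DisseminationCache:
    """Size-limited cache that also drops entries left unrequested for max_idle seconds.

    Assumption: "most popular" is approximated by least-recently-used eviction.
    """

    def __init__(self, capacity: int, max_idle: float,
                 fetch_master: Callable[[str], bytes],
                 convert: Callable[[bytes], bytes],
                 clock: Callable[[], float] = time.monotonic) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.max_idle = max_idle          # the "n amount of time" from the text
        self.fetch_master = fetch_master  # hypothetical call to the back end DAM
        self.convert = convert            # hypothetical JPEG2000 -> JPEG conversion
        self.clock = clock
        self._entries = OrderedDict()     # image_id -> (data, last_requested)

    def _evict_idle(self) -> None:
        now = self.clock()
        # Entries are kept in order of last request, oldest first,
        # so we can stop at the first entry that is still fresh.
        while self._entries:
            key, (_, last_requested) = next(iter(self._entries.items()))
            if now - last_requested <= self.max_idle:
                break
            del self._entries[key]

    def get(self, image_id: str) -> bytes:
        self._evict_idle()
        if image_id in self._entries:
            # Cache hit: serve directly, no conversion needed.
            data, _ = self._entries.pop(image_id)
        else:
            # Cache miss: fetch the master from the DAM and convert it.
            data = self.convert(self.fetch_master(image_id))
            if len(self._entries) >= self.capacity:
                self._entries.popitem(last=False)  # drop least recently requested
        # Re-insert at the end so the entry counts as most recently requested.
        self._entries[image_id] = (data, self.clock())
        return data


# Hypothetical stand-ins so the sketch runs without a DAM or a converter.
def fetch_from_dam(image_id: str) -> bytes:
    return f"<jp2 master for {image_id}>".encode()


def jp2_to_jpeg(master: bytes) -> bytes:
    return b"<jpeg derived from " + master + b">"


cache = DisseminationCache(capacity=1000, max_idle=3600.0,
                           fetch_master=fetch_from_dam, convert=jp2_to_jpeg)
first = cache.get("painting-001")   # miss: fetched from the DAM and converted
second = cache.get("painting-001")  # hit: served straight from the cache
```

The point of the sketch is that the expensive steps (the DAM fetch and the conversion) only run on a miss, so the cost of on-the-fly conversion is paid once per popular image rather than once per request.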

But there are still questions. If we pass the JPEG2000 to the image server for conversion to JPEG, that’s fine; but what happens next? Is the JPEG2000 discarded after the conversion process, leaving only the JPEGs? Is this the best way to support the zooming in on image sections that we want to offer? The original proposal was to hold only dissemination formats in the cache; now we’re thinking that for flexibility we may prefer to hold the JPEG2000 images and convert them as the image is requested by a user. Is this still the most efficient process? It's easy to build bottlenecks into a system that slow processes down, and much more difficult to design a system for speed and efficiency. We’re pretty certain that conversion on-the-fly is a good idea, and we think the cache is too. Unless you know differently...

7 comments:

Carsten said...

Many very good questions!

We - LuraTech - thought about them all and more on a product level already. Our Image Content Server product solves it out of the box more or less.

It offers on-the-fly conversion from JPEG2000 repositories to JPEG, including on-the-fly conversion of zoomed portions of images, and allows you to zoom and pan pretty smoothly, without plugins, using just pure JPEG and HTML. For more advanced features like page-turn viewing of books, some JavaScript on the client side is needed.

The key to performance and foreseeable server hardware requirements is an intelligent caching mechanism though. Sure, you can't store forever all the different zoomed/panned views of every image (so upfront conversion is a rather limited option!), but if you track what is going on, you can come up with good caching rules.

Take a look at it:
http://ics.luratech.com

or at the KB library newspaper use of that product:
http://kranten.kb.nl

Christy Henshaw said...

Carsten, do you have any indicative stats for your imaging server regarding speed of conversion from JP2 to other formats (JPEG, PDF, GIF, etc.)? Similarly, do you have stats regarding space requirements, i.e. for any given JP2, what is the proportionate storage space required for delivery manifestations and working space? Do you convert from JP2 directly to the output format, or use an interim format? (Presumably this will affect the space requirements in the "cache" or working space.)

Leonard said...

Since the JPEG2000 data in the PDF is "Raw" (meaning that it's the same as .jp2/.jpx), why not just convert all your existing JP2K files to PDF and then simply store the PDFs?

Then you only need to keep ONE FORMAT (PDF) that is known to work with all browsers and it takes full advantage of JPEG2000.

Leonard Rosenthol
PDF Standards Architect
Adobe Systems

lovelycode said...
This comment has been removed by the author.
lovelycode said...

Have you tried Djatoka? I believe it's an on-the-fly JP2-to-JPEG image server with caching. Might be a relatively easy way to evaluate that approach. Includes bookmarkable image panning and zooming, which would be rather difficult with PDF.

(Removed my earlier comment due to a typo)

Christy Henshaw said...

Regarding the use of PDF - the problem is that PDF is proprietary and therefore not a viable format for long term preservation. It works great as a dissemination format, so we will offer it to the end user. JPEG 2000 ticks the boxes of both standardisation and flexibility in how you can use it, even if it isn't (yet) handled by browsers.

We do know about Djatoka, but unfortunately we're not really set up here to run open-source software that doesn't work "out of the box" as an evaluation tool.

Johan said...

My two cents on the PDF-related comments: although PDF used to be a proprietary format, it was released as an open ISO standard in 2008. See e.g.:

http://www.iso.org/iso/pressrelease.htm?refid=Ref1141

The ISO specification is based on PDF 1.7.

However, I would not advise using PDF as a preservation format in this case (although it's absolutely fine for dissemination).

As Leonard pointed out, a PDF can hold JPEG2000 image data, and these data are stored in such a way that the 'raw' data of the original image are embedded as a 'JPEG2000 data stream' in the PDF file. Or, seeing it the other way round: the PDF here acts as a 'wrapper' around the JPEG2000 image data. From a preservation point of view there are two problems with this approach:

1. If you ever need to migrate these files to some alternative format (which, in the long run, is inevitable since eventually any file format will at some point cease to be used), you'll have to deal with both the raw image data and the wrapper around it. This adds an extra (unnecessary) layer of complexity, which may lead to various problems.

2. When people talk about JPEG2000, what they usually mean is the JP2 format (defined by Part 1 of the standard). However, JPEG2000 data streams in PDF follow a subset of the JPX format, which is an extension of JP2 that is defined by Part 2 of the standard. From what I understand the main reason for going for JPX (rather than JP2) in PDF was that JP2 poses some restrictions on the use of colour spaces and ICC profile data, and these restrictions were removed in JPX. Unfortunately, very few existing software applications support JPX, which increases the odds of things going wrong in future migrations. Because of this, I wouldn't recommend it as a preservation format.
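As an illustration of the JP2/JPX distinction in point 2, the sketch below reads the "brand" field that JPEG2000 family files carry in their File Type box, which follows the 12-byte signature box at the start of the file. The function name jpeg2000_brand is made up for this example, and the sketch assumes a standalone file whose first 24 bytes are available; a raw codestream without the file wrapper has no such boxes and is rejected.

```python
import struct

# The 12-byte JPEG2000 signature box that opens every JP2/JPX file.
JP2_SIGNATURE = b"\x00\x00\x00\x0cjP  \r\n\x87\n"


def jpeg2000_brand(header: bytes) -> str:
    """Return the File Type box brand, e.g. 'jp2 ' (Part 1) or 'jpx ' (Part 2).

    `header` must hold at least the first 24 bytes of the file.
    """
    if header[:12] != JP2_SIGNATURE:
        raise ValueError("signature box missing: not a JP2/JPX file")
    if len(header) < 24:
        raise ValueError("need at least 24 bytes to read the brand")
    # Box header: 4-byte big-endian length, then the 4-byte box type.
    _box_length, box_type = struct.unpack(">I4s", header[12:20])
    if box_type != b"ftyp":
        raise ValueError("File Type box does not follow the signature box")
    # The brand is the first 4 bytes of the File Type box contents.
    return header[20:24].decode("ascii")
```

A check like this only reports what the file declares about itself; it says nothing about whether a given application can actually decode the JPX features used.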

The combination of 1 + 2 above would, in my opinion, make using PDF for preservation quite problematic in this case.

To be completely clear: these comments only apply to this particular application, and do not in any way reflect my views on PDF in general. For instance, PDF/A is a subset of the full PDF feature set (published as a separate ISO standard) which was designed specifically for long-term archiving. For many applications this is an excellent archiving format.

In this particular case however, I don't think this is the way to go on the preservation side...

Johan van der Knijff
KB / National Library of the Netherlands