August 27, 2010

JPEG 2000 for the practitioner - free one-day seminar

This recently announced call for papers/registration may be of interest to our readers:

JPEG 2000 for the practitioner - a one-day seminar

A free seminar to explore and examine the use of JPEG 2000 in the cultural heritage industry will be held at the Wellcome Trust. The seminar will include specific case studies of JPEG 2000 use. It will explain technical issues that have an impact on practical implementation of the format, and explore the context of how and why organisations may choose to use JPEG 2000. Although the seminar will have an emphasis on digitisation and digital libraries, the papers will be relevant to a range of research and creative industries. Places are limited to 80 attendees. Papers will be made available online after the event.

Tuesday 16 November 2010
9am - 5pm
Wellcome Trust, 215 Euston Road, London, UK

This seminar is hosted by the JPEG 2000 Implementation Working Group and the Wellcome Library.

Contributors: please submit the title and a brief abstract of your proposed paper and a bio of the speaker/s to c.henshaw@wellcome.ac.uk by October 4, 2010.

Delegates: if you would like to attend please email your name and the name of your institution to c.henshaw@wellcome.ac.uk by 1 November, 2010.

August 24, 2010

Determining rates of JPEG 2000 compression on a collection-by-collection basis

As a result of our decision to "go lossy", we need to make sure that the level of lossiness is appropriate to the image content. We can't do this on the individual image level, as there are simply too many images. But we can do this on the collection level. We came up with a rule of thumb:

For any given collection of physical formats we will apply a range of different compression rates to a representative sample from that collection. We will continue compressing at regular intervals (i.e. 2:1, 4:1, 6:1, and so on) until visual artefacts begin to appear on any individual image.
Once we have determined the compression level at which the worst-performing image begins to show visual artefacts, we will choose the next lowest compression level (if the worst-performing image shows artefacts at 10:1, we choose 6:1) and apply that to the entire collection, regardless of how much more compression other material types in that collection might bear.
This rule of thumb allowed us to strike a balance between storage savings and the time and effort in assessing compression levels for a large number of images.
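
To make the rule concrete, here is a minimal sketch of the selection logic in Python. The ratio list and sample names are illustrative only; in practice the "first artefact" point is judged by eye, not computed.

```python
# Sketch of the collection-level rule of thumb described above.
# Assumes we have noted, for each sample image, the lowest tested ratio
# at which visual artefacts first became visible.

TESTED_RATIOS = [2, 4, 6, 10, 25, 50, 100]

def collection_ratio(first_artefact_ratio):
    """Pick the ratio one step below the point where the worst-performing
    sample (the one that degrades earliest) first shows artefacts."""
    worst = min(first_artefact_ratio.values())
    index = TESTED_RATIOS.index(worst)
    return TESTED_RATIOS[index - 1] if index > 0 else 1  # fall back to 1:1 lossy

# Hypothetical sample results: the colour photograph degrades first, at 10:1.
samples = {"colour_photo": 10, "pencil_notebook": 50, "typescript": 100}
print(collection_ratio(samples))  # -> 6
```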

The first "real life" test of this methodology was carried out in relation to our archives digitisation project. We are currently digitising a series of paper archives (letters, notebooks, photos, invitations, memos, etc.) in-house. The scope runs to something like half a million images over a couple of years, and includes the papers of some notable individuals and organisations (Francis Crick being the foremost of these). Archives can be quite miscellaneous in the types of things that you find, but different collections within the archives tend to contain a similar range of materials. This presents a problem if you want to treat images differently depending on their content. The photographer doesn't know, from one file of material to the next, what sort of content they will be handling. So even for a miscellaneous collection, once the image count gets high enough, you have to make the compromise by taking a collection-level decision on compression rates.

For archival collections we needed to test things like faint pencil marks on a notebook page, typescript on translucent letter paper, black and white photos, printed matter, newsprint, colour drawings, and so on. We chose 10 samples for the test. As this was our first test, and we were curious just how far we could go for some of the material types in our sample, we started with 1:1 lossy compression and increased this to 100:1. We used LuraWave for this testing.

For the archives, the compression intervals were: 1:1 lossy, 2:1, 4:1, 6:1, 10:1, 25:1, 50:1, and 100:1. The idea is that at 2:1, the compression will reduce the file size by half in comparison to the source TIFF, and so on.

Not surprisingly, the biggest drop in file size came from converting from TIFF to JPEG 2000 in the first place. At a 1:1 compression rate, this reduced the average file size by 86% (ranging from 67% to 95%). A 2:1 compression resulted in no noticeable drop in file size compared with 1:1, raising the question of what difference there could possibly be between the 1:1 and 2:1 settings in the LuraWave software. At the average file size (5 MB) at this compression (2:1), a 500,000-image repository (our estimate for the archives project) would require around 2.4 TB of storage. These averages are somewhat misleading, because while they represent a spread of material, they do not represent the relative proportions of this material in the actual collection as a whole (and we can't estimate that yet).
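
The storage estimate above is simple arithmetic; a small worked example (binary units assumed):

```python
# Back-of-the-envelope storage estimate from the averages above.
avg_file_mb = 4.96        # average JPEG 2000 size at the 1:1/2:1 setting
image_count = 500_000     # rough estimate for the archives project

total_tb = avg_file_mb * image_count / 1024 / 1024   # MB -> GB -> TB
print(f"{total_tb:.1f} TB")   # roughly 2.4 TB
```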

File size reduction was relatively minimal between 2:1 and 10:1. What is obvious here is that setting the compression rate at, say, 2:1 does not give you a 2:1 ratio; in fact you can achieve a 14:1 ratio or higher. An interesting point about the very high experimental compression rates of 25:1 and above was that output file sizes were essentially homogeneous across all the images, whereas at 10:1 and lower, file sizes ranged from 1.5 MB to 11.5 MB.

TIFF = 35 MB
1:1/2:1 = 4.96 MB (86% reduction)
4:1 = 4.56 MB (87% reduction)
6:1 = 3.89 MB (89% reduction)
10:1 = 2.87 MB (92% reduction)
25:1 = 1.39 MB (96% reduction)
50:1 = 0.72 MB (98% reduction)
100:1 = 0.37 MB (99% reduction)
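
Dividing the source TIFF size by the average output size gives the ratio actually achieved at each setting, which is what the point above is driving at. A small sketch using the averages listed (individual images vary far more widely):

```python
# Achieved compression ratio and reduction for the averages listed above.
tiff_mb = 35.0
avg_jp2_mb = {"1:1/2:1": 4.96, "4:1": 4.56, "6:1": 3.89, "10:1": 2.87,
              "25:1": 1.39, "50:1": 0.72, "100:1": 0.37}

for setting, size in avg_jp2_mb.items():
    achieved = tiff_mb / size
    reduction = 1 - size / tiff_mb
    print(f"{setting:>7} setting -> {achieved:5.1f}:1 achieved ({reduction:.0%} reduction)")

# e.g. the nominal 2:1 setting already achieves roughly 7:1 on these averages,
# and considerably more on the most compressible individual images.
```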

We found that the most colourful images in the collection (such as a colour photograph of a painting) performed the worst, as expected, and started to show artefacts at 10:1. These were extremely minor artefacts, but they could be seen. Other material types were impossible to differentiate from the originals even at 50:1 or 100:1, surprisingly. These tended to be black and white textual items. Using our rule of thumb, we chose 6:1 lossy compression for the archive collections. Were an archive to consist solely of printed pieces of paper, we would reassess and choose a higher compression rate, but an 89% reduction was highly acceptable in storage savings terms.

You may ask: why not just use 1:1 across the board? Is the extra saving actually worth it? Viewed in comparison to the 1:1 setting, we were getting a better than 20% reduction at 6:1 on average. This continues to represent a significant storage saving when you consider the ultimate goal is to digitise around 3.5 million images from the archive collections. Bearing in mind all the other collections we plan to digitise in future (up to 30m images), the savings become further magnified if we strive to reduce file sizes within the limits of what is visually acceptable.

There are a couple of follow-on questions remaining from all this: first, what size original should you begin with? And secondly, is it possible to automate compression using a quality control measure (such as peak signal-to-noise ratio) that allows you to compress different images at different rates depending on an accepted level of accuracy? These will be the subject of future posts.
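
On the second question, one possible approach (not something we have implemented) would be to compute the peak signal-to-noise ratio between the original and a decoded compressed candidate, and accept the highest compression rate whose PSNR stays above an agreed threshold. A minimal sketch, assuming 8-bit images and using NumPy and Pillow purely for illustration:

```python
# Sketch: PSNR as an automated quality check between an original and a
# decoded compressed version of the same image. Illustrative only.
import numpy as np
from PIL import Image

def psnr(original_path, compressed_path):
    a = np.asarray(Image.open(original_path).convert("RGB"), dtype=np.float64)
    b = np.asarray(Image.open(compressed_path).convert("RGB"), dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")                 # images are identical
    return 10 * np.log10(255.0 ** 2 / mse)  # 255 = peak value for 8-bit images

# Hypothetical use: keep increasing the compression rate while
# psnr("master.tif", "candidate_decoded.tif") stays above, say, 45 dB.
```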

August 13, 2010

The JPEG2000 problem for this week

JPEG2000 isn’t the easiest of formats to disseminate. Browsers typically can’t render the format natively and require plugins or extensions to display it. We don’t want our users to have to download anything just to be able to view our material online. So we plan to convert our JPEG2000 files to a browser-friendly JPEG or PDF for dissemination. Both formats are handled admirably by browsers. (OK, PDF needs an Adobe plugin, but it's commonly included with browsers.) Other formats may come along later. The thing is, how do we do that conversion? There are plenty of conversion tools out there – we use LuraWave for the image conversion. But then the question becomes: when do we convert from a master to a dissemination format? Especially if we want speedy delivery of content to the end user.

One of the guiding principles behind our decision to use JPEG2000 was that we could reduce our overall storage requirements by creating smaller files than we might have done if we’d used, say, TIFF. So if we automatically convert every JPEG2000 to a low-res thumbnail JPEG, a medium-res JPEG, a high-res JPEG and a PDF, then we’re back to having to find storage for these dissemination files. OK, JPEG won’t consume terabytes of storage and nor will PDF, but we’d need structured storage to keep track of each manifestation, and metadata to tell our front-end delivery system which JPEG to use in which circumstances. True, this has been done very successfully for many projects before now, but alongside efficiency of storage we also want efficiency in managing what we have stored, and speedy delivery.

So we plan to convert JPEG2000 to JPEG or PDF on the fly at the time each image is requested. The idea is that we serve JPEG2000 images out of our DAM to an image server, the image is converted, and the dissemination file is served up. Instead of paying for large volumes of static storage, we believe that putting the saving on storage into a fast image server will directly benefit those who want to use our material online.

One outcome of a conversation with DLConsulting is that we've learned that on-the-fly conversion is a potentially system-intensive (and at worst inefficient) activity that could create a bottleneck in the delivery of content to the end user. We've said that speed is an issue. We need to efficiently process the tiled and layered JPEG2000 files we plan to create. A faster, more powerful image server may help, but good conversion software will be key. Alongside on-the-fly conversion we plan to use a cache that would hold, in temporary storage, the most requested images/PDFs. The cache would work something like this: it has a limited size/capacity and contains the most popular/most often requested images/PDFs. If an image/PDF in the cache were not requested for n amount of time, it would be removed from the cache. In practice, a user requests a digitised image of a painting; the front-end system queries the cache to see if the image is there; if it is, it's served directly and swiftly to the user. If not, the front-end system calls the file from the back-end DAM. The DAM delivers that image to the image server, which converts the JPEG2000 to JPEG and places that image in the cache, from where it can be passed to the front-end system and the end user. Smooth, fast and efficient in the use of system resources.
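
For what it's worth, the cache behaviour described above might look something like the following sketch. All names, sizes and the timeout are placeholders, and the real conversion would be done by the image server, not in code like this.

```python
# Sketch of the dissemination cache described above: limited capacity,
# entries dropped when not requested for a while. Names are illustrative.
import time
from collections import OrderedDict

class DerivativeCache:
    def __init__(self, max_items=10_000, ttl_seconds=24 * 3600):
        self.max_items = max_items
        self.ttl = ttl_seconds
        self._store = OrderedDict()   # image id -> (derivative bytes, last access time)

    def get(self, image_id, convert):
        """Return a cached JPEG/PDF; on a miss, convert from the JPEG2000 master."""
        now = time.time()
        # Drop anything that has not been requested within the timeout.
        for key in [k for k, (_, seen) in self._store.items() if now - seen > self.ttl]:
            del self._store[key]
        if image_id in self._store:
            data, _ = self._store.pop(image_id)          # cache hit
        else:
            data = convert(image_id)                     # ask the image server / DAM
            if len(self._store) >= self.max_items:
                self._store.popitem(last=False)          # evict least recently used
        self._store[image_id] = (data, now)              # mark as most recently used
        return data
```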

But there are still questions. If we pass the JPEG2000 to the image server for conversion to JPEG, that’s fine; but what happens next? Is the JPEG2000 discarded after the conversion process, leaving only the JPEGs? Is this the best way to support the zooming in on image sections that we want to offer? The original proposal was to hold only dissemination formats in the cache; now we’re thinking that for flexibility we may prefer to hold the JPEG2000 images and convert them as the image is requested by a user. Is this still the most efficient process? It's easy to build bottlenecks into a system that slow processes down, and much more difficult to design a system for speed and efficiency. We’re pretty certain that conversion on the fly is a good idea, and we also think the cache is too. Unless you know differently…