JPEG 2000 at the Wellcome Library: June 2010

June 25, 2010

JPEG 2000 workshop with Richard Clark

In the wake of taking on board the recommendations from the Buckley/Tanner report (see a previous blog post), we needed to start looking at how we would actually create these JPEG 2000s as part of our digitisation workflow. As the JP2K-UK group meeting showed us, there is not a lot of knowledge in our industry regarding the tools we could use - not only for creating the JPEG 2000s in the first place, but also for managing, displaying and converting them back into other (browser-friendly, for example) formats. We knew of a few tools, but wanted a more thorough understanding of the possibilities.

We turned to software engineer Richard Clark, who was deeply involved in the JPEG Committee and has worked on the JPEG 2000 technology. Richard is based in the UK, and currently owns Elysium Ltd., offering software and IT support solutions to businesses and organisations. We asked Richard to deliver a half-day workshop for those Wellcome Library staff who would be involved in implementing our JPEG 2000 solution.

The workshop focused on options for the practical implementation of JPEG 2000, and the situation regarding software support for the format. Richard also touched on the workflow issues we need to be aware of and address in planning our strategy. The workshop helped us determine which solution would work best for us - as will be described in subsequent posts on this blog. You can read a version of his presentation, embedded in this post. Richard also shared with us some of the more technical details from his presentation at the British Library in 2007, available on Scribd.

J2K Workshop for the Wellcome Library

June 22, 2010

Initiating the JP2K-UK Implementation Working Group

By Autumn 2009, we were committed to using the JP2 format for our digitisation projects. However, we knew that the lack of good information and communication between practitioners was a risk factor. First of all, we wanted to know what was going on so we didn't have to keep re-inventing the wheel. Has anyone else carried out compression tests on historic materials? Who uses which tools, and why? Secondly, if we could improve communications, presumably more people would feel comfortable about using JPEG 2000, serving to broaden the user base and further entrench the format in practice - essential to ensuring longevity.

This information wasn't just going to come out of the woodwork - or not as quickly as we would have liked - so we set up the "JP2K-UK Implementation Working Group", with a starting membership of one. We then cold-called a number of contacts from relevant organisations to test the level of interest in joining such a group. We optimistically booked a small meeting room, with a free lunch as an added temptation, hoping someone would be interested.

Someone was indeed interested; in fact, nearly every single person or organisation we contacted had a high level of interest in JPEG 2000, and most were actively pursuing JPEG 2000 implementation in some way as a practitioner, consultant or software developer. We booked a much bigger room, shelled out for a lot more sandwiches, and realised we needed a proper agenda. At this point we set up the JP2K-UK wiki to consolidate online resources, provide dates of any events, and list the member organisations.

Our first meeting was held in December, and we had 17 initial individual members representing 12 organisations, drawn primarily from the library world (see the wiki for member organisations). The bulk of the meeting was taken up with small groups, discussing what they knew of the different technical aspects of JPEG 2000 (formats and features, compression, IPR, and tools), where the knowledge gaps were, what the general opinion of JPEG 2000 was, and how we might act to make the use of JPEG 2000 a little bit easier for everyone.

Not surprisingly, there was a range of opinions, levels of knowledge and understanding, and intended uses of the format; but every attendee was keen to work toward creating a resource for practitioners and disseminating information through a series of workshops, seminars and/or conferences. As an initial discussion, the meeting set the tone for the future of the group, and further developments will be posted here very soon.

June 18, 2010

Bringing in the experts

There are numerous articles, reviews, and technical reports on the JPEG 2000 format, many free to view online. Despite this, we found it difficult to determine how we could make best use of the format in a practical way. There are 13 "parts" to JPEG 2000 - from basic image formats to a metadata format, and even a digital cinema format. Mostly these parts are extensions to the core specification. Through our own reading, we were able to determine that Parts 1 and 2 were the ones we needed to look at. But which one to use? Part 1 specifies both a compression algorithm, and a format. Part 2 specifies a different algorithm, and extensions to the format. We could find little - short of becoming a technical expert - that would allow us to adequately weigh up the pros and cons of the various options, and even less on how others have made their decisions.

In Spring 2009 we turned to Simon Tanner, Director of King's Digital Consultancy Services, for some advice. Simon agreed to search out the experts and provide us with a report setting out clear recommendations: primarily which format and compression to use for preservation and access, and what features we should implement. We provided him with a brief of our requirements, the background to our intended digitisation activities, and some sample images.

Simon did find an expert to work on the report: Robert Buckley, colour digital imaging expert and member of the JPEG Committee. Rob carried out a number of tests on the images we supplied, looking at the implications of lossless v. lossy compression, how we might get the best out of certain JPEG 2000 features, how we should manage technical metadata, and more. This provided the evidence, set out in the report, that backed up his final recommendations.

The key recommendation was that we use the Part 1 compression and JP2 format for our digitisation projects, for both the archival master and the access copy. Also important was the recommendation that we use lossy rather than lossless compression - maintaining a high quality that could be considered "visually lossless". Although this results in a loss of information that is non-recoverable, the data that is lost was never visible to the human eye, and is therefore simply unnecessary for our needs. The Wellcome Library intends to follow the recommendations as closely as possible for future digitisation projects, although the exact compression levels used will need to be determined on a collection-by-collection basis with further tests.

The report is available to view on our website.

June 16, 2010

We need how much storage?

In 2009, the Wellcome Library set out an ambitious vision to digitise a large proportion of its historic collections. This would take the annual digitisation activities of the Library from hundreds, or at most thousands, of images per year to several million images per year. Collections were to include a wide range of content types - archives, printed books from the 15th to the 20th century, manuscripts, paintings and drawings, ephemera. Once we added up all these collections, using broad estimates of what we believed was there, we realised this could see the generation of up to 30m images over 5 years. Exciting, but perhaps slightly daunting, considering we didn't yet have an infrastructure to fully support such a large collection of digital assets.

Anyone reading this blog will understand why the scale of the programme is key to the blog topic. When we asked our IT department to tell us how much it would cost to store 30m TIFF files - our de facto standard for the couple of hundred thousand images in our existing picture library - we were stunned. Two petabytes of online, spinning disk storage with a top-of-the-line enterprise management system and remote backup would cost how much? We learned that the cost would be something like a fifth of our total budget for the entire digitisation programme.
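For readers who like to see the sums, here is a rough back-of-envelope sketch of those figures. It assumes the two petabytes are decimal petabytes of primary capacity shared evenly across all 30m images, which the original estimate may not have meant exactly. The function names and the 10:1 compression ratio in the usage note are purely illustrative and are not figures from our tests.

```python
# Back-of-envelope check of the storage figures quoted in this post.
IMAGES = 30_000_000        # "up to 30m images over 5 years"
TOTAL_BYTES = 2 * 10**15   # "two petabytes", read as decimal PB (assumption)


def average_image_size_mb(total_bytes: int, image_count: int) -> float:
    """Average size per image, in decimal megabytes."""
    return total_bytes / image_count / 10**6


def storage_after_compression_pb(total_bytes: int, ratio: float) -> float:
    """Storage in decimal PB if every file shrank by ratio:1.

    The ratio is a hypothetical input, not a measured value.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return total_bytes / ratio / 10**15


if __name__ == "__main__":
    print(round(average_image_size_mb(TOTAL_BYTES, IMAGES), 1))
```

Under these assumptions the estimate works out at an average of roughly 67 MB per TIFF. Calling `storage_after_compression_pb(TOTAL_BYTES, 10)` shows what an illustrative 10:1 ratio would do to the total; the real ratio would depend on the material and the quality we settle on.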

Should we consider a lower-cost storage solution? Even tape back-up was quite expensive for that scale, and you can't serve images up online from tape anyway. We revised our image sizes, factoring in lower and lower resolutions and/or bit depths for material like the printed books, which didn't need full colour, high resolution images. We still couldn't afford the storage costs.

Finally, we saw the light and started looking into a relatively new image format called JPEG 2000. We knew almost nothing about it, except that it employed an extremely efficient compression algorithm that could, possibly, allow us to reduce our storage costs without compromising too much on quality.

This was the start of our journey into the complicated and mystifying world of JPEG 2000. This blog charts our progress to date in determining what type of JPEG 2000 we would use, how we would use it, and how it would impact on the rest of the Digital Library infrastructure. We have by no means worked out all the details around how we are going to implement JPEG 2000, so this blog will also serve as a diary of our progress as we go along. Happy reading, and feel free to post comments.