December 20, 2010

Guest post: LoC response to discussion on long-term preservation of JPEG 2000

Carl Fleischhauer, Program Officer at NDIIPP, Library of Congress, responds to recent posts from Johan van der Knijff and the Wellcome Library regarding long-term preservation of JPEG 2000. Both posts mentioned the need to rate the JPEG 2000 format for long-term sustainability using criteria drawn up by the Library of Congress and the National Archives, UK (we have helpfully created an openly available/editable Google doc to make this a collaborative effort).

Thanks for provocative blogs

Thanks to Johan van der Knijff and Dave Thompson for the helpful blog postings here that frame some important questions about the sustainability of the JPEG 2000 format. Caroline Arms and I were flattered to see that our list of format-assessment factors was cited, along with the criteria developed at the UK National Archives. We certainly agree that many of these factors have a theoretical turn and that judgments about sustainability must be leavened by actual experience.

We also call attention to the importance of what we call Quality and Functionality factors (hereafter Q&F factors). It is possible that some formats will "score" high enough on these factors to outweigh perceived shortcomings on the Sustainability Factor front.

As I drafted this response, I benefited from comments from Caroline and Michael Stelmach, the Library of Congress staffer who chairs the Federal Agencies Still Image Digitization Guidelines Working Group.

Colorspace (as it relates to the LoC's Q&F factor Color Maintenance)

We agree that the JPEG 2000 specification would be improved by the ability to use and declare a wider array of color spaces and/or ICC profile categories. We join you in endorsing Rob Buckley's valuable work on a JP2 extension to accomplish that outcome.

When Michael and I were chatting about this topic, he said that he had been doing some informal evaluations of the spectra represented in printed matter at the Library of Congress. This is an informal investigation (so far) and his comment was off the cuff, but he said he had been surprised to see that the colors he had identified in a wide array of original items could indeed be represented within the sRGB color gamut, one of the enumerated color spaces in part 1 of the JPEG 2000 standard.

Michael added that he knew that some practitioners favor scRGB - not included in the JPEG 2000 enumerated list - because of scRGB's increased gamut and/or perhaps because it allows for linear-to-intensity representations of brightness rather than only gamma-corrected representations. The extended gamut - compared to sRGB - will be especially valuable when reproducing items like works of fine art. And we agree with Johan van der Knijff's statement that there will be times when we will wish to go beyond input-class ICC profiles and embrace 'working' color spaces. All the more reason to support Rob Buckley's effort.

Adoption (the LoC Sustainability criteria include adoption as a factor)

This is an area in which we all have mixed feelings: there is adoption of JPEG 2000 in some application areas but we wish there were more. Caroline pointed to one positive indicator: many practitioners who preserve and present high-pixel-count images like scanned maps have embraced JPEG 2000, in part because of its support for efficient panning and zooming. The online presentation of maps at the Library of Congress is one good example (for a given map you see an 'old' JPEG in the browser, generated from JPEG 2000 data under the covers).

Caroline adds that the geospatial community uses JPEG 2000 as a standard (publicly documented, non-proprietary) alternative to the proprietary MrSID. Both formats continue to be used. LizardTech tools now support both equally. Meanwhile, GeoTIFF is used a lot too. Caroline notes that LizardTech re-introduced a free stand-alone viewer for JPEG2000/MrSID images last year in response to customer demand. And a new service for solar physics from NASA, Helioviewer, is based on JPEG2000. NASA includes a justification for using the format on their website.

For my part, I can report encountering some JPEG 2000 uptake in moving image circles, ranging from its use in the digital cinema's 'package' specification (see a slightly out of date summary) to its inclusion in Front Porch Digital's SAMMA device, used to reformat videotapes in a number of archives, including the Library of Congress.

Meanwhile, Michael recalled seeing papers that explored the use of JPEG 2000 compression in medical imaging (where JPEG 2000 is an option in the DICOM standard), with findings that indicated that diagnoses were just as successful in JPEG 2000 compressed images as they were when radiologists consulted uncompressed images. An online search using a set of terms like "JPEG2000, medical imaging, radiology" will turn up a number of relevant articles on this topic, including Juan Paz et al, 2009, "Impact of JPEG 2000 compression on lesion detection in MR imaging," in Medical Physics, which provides evidence to this effect.

On the other hand - negative indicators, I guess - we have the example of non-adoption by professional still photographers. On the creation-and-archiving side, their fondness for retaining sensor data motivates them to retain raw files or to wrap that raw data in DNG. I was curious about the delivery side, and looked at the useful dpBestFlow website and book, finding that the author-photographer Richard Anderson reports that he and his professional brethren deliver the following to their customers: RGB or CMYK files (I assume in TIFF or one of the pre-press PDF wrappers), "camera JPEGs" (old style), "camera TIFFs," or DNGs or raw files. There is no question that the lack of uptake of JPEG 2000 by professional photographers hampers the broader adoption of JPEG 2000.

Software tools (their existence is part of the Sustainability Factor of Adoption; their misbehavior is, um, misbehavior)

It was very instructive to see Johan van der Knijff's report on his experiments with LuraTech, Kakadu, PhotoShop, and ImageMagick. If he is correct, these packages do misbehave a bit and we should all encourage the manufacturers to fix what is broken. There is of course a dynamic between the application developers and adoption by their customers. If there is not greater uptake in realms like professional photography, will the software developers like Adobe take the time to fix things or even continue to support the JPEG 2000 side of their products?

Caroline, Michael, and I pondered Johan van der Knijff's suggestion that "the best way to ensure sustainability of JPEG 2000 and the JP2 format would be to invest in a truly open JP2 software library." We found ourselves of two minds about this. On the one hand, such a thing would be very helpful but, on the other, building such a package is definitely a non-trivial exercise. What level of functionality would be desired? The more we want, the more difficult it is to build. Johan van der Knijff's comments about JasPer remind us that some open source packages never receive enough labor to produce a product that rivals commercial software in terms of reliability, robustness, and functional richness. Would we be happy with a play-only application, to let us read the files we created years earlier with commercial packages that, by that future time, are defunct? In effect such an application would be the front end of a format-migration tool, restoring the raster data so that it can be re-encoded into our new preferred format. As we thought about this, we wondered if people would come forward to keep updating the software for new programming languages and operating systems, so that it remains in working order.

As a sidebar, Johan van der Knijff summarizes David Rosenthal's argument that "preserving the specifications of a file format doesn’t contribute anything to practical digital preservation" and "the availability of working open-source rendering software is much more important." We would like to assert that you gotta have 'em both: it would be no good to have the software and not the spec to back it up.

Error resilience

Preamble to this point: In drafting this, I puzzled over the fit of error resilience to our Sustainability and Quality/Functionality factors. In our description of JPEG 2000 core coding we mention error resilience in the Q&F slot Beyond Normal. But this might not be the best place for it. Caroline points out that error resilience applies beyond images and she notes that it may conflict with transparency (one of our Sustainability Factors). We find ourselves wishing for a bit of discussion of this sub-topic. Should error resilience be added as a Sustainability Factor, or expressed within one of the existing factors? Meanwhile, how important is transparency as a factor?

Here's the point in the case of JPEG 2000: Johan van der Knijff's blog does not comment on the error resilience elements in the JPEG 2000 specification. These are summarized in annex J, section 7, of the specification (pages 167-68 in the 2004 version), where the need for error resilience is associated with the "delivery of image data over different types of communication channels." We have heard varying opinions about the potential impact of these elements on long term preservation but tend to feel, "it can't be bad."

Here are a few of the elements, as outlined in annex J.7:
  • The entropy coding of the quantized coefficients is done within code-blocks. Since the encoding and decoding of the code-blocks are independent, bit errors in the bit stream of a code-block will be contained within that code-block.
  • Termination of the arithmetic coder is allowed after every coding pass. Also, the contexts may be reset after each coding pass. This allows the arithmetic coder to continue to decode coding passes after errors.
  • The optional arithmetic coding bypass style puts raw bits into the bit stream without arithmetic coding. This prevents the types of error propagation to which variable length coding is susceptible.
  • Short packets are achieved by moving the packet headers to the PPM (Packed Packet headers, Main header marker) or PPT (Packed packet header, Tile-part header marker) segments. If there are errors, the packet headers in the PPM or PPT marker segments can still be associated with the correct packet by using the sequence number in the SOP (Start of Packet marker).
  • A segmentation symbol is a special symbol. The correct decoding of this symbol confirms the correctness of the decoding of this bit-plane which allows error detection.
  • A packet with a resynchronization marker SOP allows spatial partitioning and resynchronization. This is placed in front of every packet in a tile with a sequence number starting at zero. It is incremented with each packet.
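
To make the last element a little more concrete, here is a rough Python sketch of how a checking tool might use the SOP sequence numbers to spot a missing or damaged packet. It assumes the SOP marker segment layout of part 1 of the standard as we understand it (marker code 0xFF91, a two-byte length field equal to 4, then a two-byte sequence number), and it assumes that the bytes handed in are the packet data of a single tile. The function name check_sop_sequence is ours, invented for illustration, and the naive byte search is no substitute for a real codestream parser.

```python
import struct

SOP_MARKER = b"\xff\x91"  # assumed marker code for Start of Packet


def check_sop_sequence(tile_data):
    """Return (offset, expected, found) for every SOP whose number is unexpected.

    Hypothetical helper: scans the packet data of one tile for SOP marker
    segments and checks that the sequence numbers count up from zero.
    """
    problems = []
    expected = 0
    pos = tile_data.find(SOP_MARKER)
    while pos != -1 and pos + 6 <= len(tile_data):
        # Layout: marker (2 bytes), length field (2 bytes), sequence number (2 bytes)
        length, number = struct.unpack(">HH", tile_data[pos + 2:pos + 6])
        if length == 4:
            if number != expected:
                problems.append((pos, expected, number))
            # Resynchronize on the number actually found (it wraps at 65536)
            expected = (number + 1) % 65536
            pos = tile_data.find(SOP_MARKER, pos + 6)
        else:
            # Not a plausible SOP segment; keep searching after these two bytes
            pos = tile_data.find(SOP_MARKER, pos + 2)
    return problems
```

An empty result means the numbers counted up without a gap. Each tuple in a non-empty result gives the byte offset where the sequence broke, which is the point from which a decoder could resynchronize.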

Thanks to the Wellcome Library for helping all of us focus on this important topic. We look forward to a continuing conversation.

December 08, 2010

Suitability of JPEG2000 for preservation, help us do some further work

Following on from Johan van der Knijff's guest post on this blog we were interested in following up issues that Johan raised. If, as Johan suggests, there are some gaps in the tool sets available for working with JPEG2000 in a reliable way and if some of the long term preservation issues are not well understood, perhaps we could begin to explore where the gaps are. Specifically, we were wondering if we could compare the suitability of just one part of JPEG2000 - the JP2 format - for long term preservation against the two sets of criteria that Johan mentioned.

These criteria were

1. The Library of Congress Sustainability of Digital Formats Planning for Library of Congress Collections, and
2. The National Archives Digital Preservation Guidance Note 1: Selecting file formats for long-term preservation.

Our thinking is that we could do a quick, targeted exercise utilising our community expertise to provide an overview that might reveal useful areas for future research. We propose to limit our investigation to just the JP2 format (for now) and the two sets of suitability criteria. We're looking for high level properties of the JP2 format in relation to the TNA and LoC criteria. High level in the sense that we think that it should be possible to set out properties of JP2 as a series of bullet points against each of the TNA and LoC criteria. It's not a perfect approach by any means, but as a starting point it seems to offer interesting possibilities.

It's not meant to be definitive, but to serve as an information sharing exercise to help non-technical archivists/librarians better understand the suitability of JP2 to long term preservation, and to highlight areas where more work may be required. In this way we hope to point the way for developers and the more technically minded to do further work that makes JPEG2000 a more suitable format for long term preservation by providing better information/documentation to support that.

So we're asking you to collaborate with us in this piece of work. We've created a framework document and put it onto GoogleDocs, where it can be viewed and edited. This document summarises the TNA and LoC criteria (the full criteria can be seen online, following the links given above) and provides space to add your response as bullet points in the right hand column.

Remember that we're thinking about JP2 only and we're looking for a high level overview - so be brief and stick with the bullet points for now. We'll take on the editing and management of the document.

We will publish the results sometime in early 2011, providing we can get a sufficient and meaningful response. If you have any questions, please ask!

December 02, 2010

Guest post: Ensuring the suitability of JPEG 2000 for preservation

Johan van der Knijff, of the KB/National Library of the Netherlands, follows up his presentation at the JPEG 2000 seminar with a guest blog post on long-term preservation of JPEG 2000.

In my presentation during the JPEG 2000 seminar I discussed the suitability of JPEG 2000 (and more specifically its JP2 format) for long-term preservation. I highlighted the erroneous restriction in the JP2 (and JPX) format specification that only allows ICC profiles of the 'input' class to be used. This effectively prohibits the use of all working colour spaces such as Adobe RGB, which are defined using 'display device' profiles. I also showed how different software vendors interpret the format specification in subtly different ways, and how such issues can create problems in the long term, such as the loss of colour space and resolution information after some future migration.
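
As a small illustration of what the 'input' class restriction means in practice, the Python sketch below reads the profile class from the header of an ICC profile. It assumes the ICC header layout as I understand it: a 128-byte header with a four-character class signature at byte offset 12, where 'scnr' denotes an input profile and 'mntr' a display profile. The function names are made up for this example, and a real JP2 validator would have to check considerably more than the class alone.

```python
ICC_HEADER_SIZE = 128  # assumed fixed size of the ICC profile header

# Assumed mapping of class signatures to profile classes
PROFILE_CLASSES = {
    b"scnr": "input",
    b"mntr": "display",
    b"prtr": "output",
    b"link": "device link",
    b"spac": "colour space conversion",
    b"abst": "abstract",
    b"nmcl": "named colour",
}


def icc_profile_class(profile):
    """Return the profile class named in an ICC header (hypothetical helper)."""
    if len(profile) < ICC_HEADER_SIZE:
        raise ValueError("too short to contain an ICC header")
    signature = profile[12:16]  # class signature at byte offset 12
    return PROFILE_CLASSES.get(signature, "unknown")


def is_input_class(profile):
    """True if the profile declares the 'input' class, the only class JP2 accepts."""
    return icc_profile_class(profile) == "input"
```

A working colour space profile such as Adobe RGB would report 'display' here, which is exactly the kind of profile that the restriction rules out.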

This leads us to the question: to what extent can we predict a specific file format's suitability for long-term preservation? The answer is not that straightforward. The Library of Congress assesses file formats against 7 'sustainability factors', whereas the National Archives have formulated a list of 12 criteria. It is beyond the scope of this blog post to present a detailed analysis of the extent to which JP2 lives up to either set of criteria. However, it is interesting to have a look at whether these criteria could have been helpful in identifying the issues covered by my presentation.

Format specifications
First, both the LoC's 'sustainability factors' and the TNA criteria acknowledge the importance of having published specifications of a file format. The LoC uses a 'Disclosure' factor, which refers to “the existence of complete documentation, preferably subject to external expert evaluation”. TNA take this one step further by also defining a 'Documentation Quality' criterion, which expresses the degree to which documentation is comprehensive, accurate and comprehensible. This last criterion largely covers the JPEG 2000 ICC issue, although it's questionable how useful this would have been to identify it a priori. A problem with errors and ambiguities in format specifications is that they can be incredibly easy to overlook, and you may only become aware of them after discovering that different software products interpret the specifications in slightly different ways.

Formats that are widely used are typically well supported by an array of software tools, and such formats are unlikely to disappear into obsolescence. TNA expresses this through an 'Ubiquity' criterion, which essentially reflects a file format's overall popularity. The definition of the LoC's 'Adoption' factor includes a list of criteria that can be used as “evidence of adoption”. The first set of criteria here includes “bundling of tools with personal computers, native support in Web browsers or market-leading content creation tools, and the existence of many competing products for creation, manipulation, or rendering of digital objects in the format”.

Note that JP2 isn't doing particularly well when measured against any of these criteria. However, the LoC list adds that “a format that has been reviewed by other archival institutions and accepted as a preferred or supported archival format also provides evidence of adoption”. This certainly seems to be the case for JP2. But how relevant is this, really? Going back to the ICC profiles issue: the JP2 file format has been around for about 10 years now, and its acceptance by the archival community has been growing steadily over the last 5 years or so. Yet, this whole issue seems to have gone unnoticed in the archival community for all those years, and I think this is slightly worrying.

Now let's imagine for a moment that JP2 had been picked up by the digital photography and graphic design communities. For such uses the ability to do proper colour management is a basic prerequisite, and limiting the support of ICC profiles to the 'input' class would have made the format virtually useless to these user communities. My guess is that in this -entirely fictional- scenario, the format specification would have either improved quickly (based on feedback from the user community), or the respective user communities would have simply stopped using the format altogether. The problem here seems to be that very few people in the archiving community are even aware of such things as colour spaces and colour management, let alone their importance within the context of preservation. With more established formats such as TIFF this may not be as much of a problem, if only because TIFF has been 'road tested' for decades by the photography and graphic design communities. As an archiving community we cannot fall back on any similar 'road testing' in the case of JP2. And this brings me to my next point.

Importance of hands-on experience
Preservation criteria such as those of the LoC or TNA are invaluable for assessing the suitability of a format for preservation, but I believe it is equally important to have actual hands-on experience with the tools that are used for creating, modifying, and reading the format. For instance, the TNA criteria use the number of software tools that support a given format as an indicator for the extent of current software support of that format. But knowing the number of tools says nothing about how good or useful these tools actually are! In the case of JP2, quite a large number of (mostly free or open-source) tools exist that, under the hood, are using the open JasPer library. JasPer is known to have performance and stability issues that make it unsuitable for most professional applications (for which, I should emphasise, it was never developed in the first place!). These issues affect all software tools that are using JasPer. So, only counting the number of available tools may be simply missing the point without incorporating any additional quality criteria. But how would you define these?

Part of the answer, I think, is that assessing a format's suitability for long-term preservation is not a purely top-down process. Most of the software-related issues that I showed in my presentation were found by simply experimenting with actual files, encoders and characterisation tools: convert a TIFF to JP2; convert it back to TIFF; use existing metadata-extraction and characterisation tools such as ExifTool and JHOVE to analyse the in- and output files; try to understand the output of these tools; compare the output before and after the conversion, and so on. Such experiments are extremely useful for getting a feel for the strengths and weaknesses of specific software tools, and they can reveal problems that are not readily captured by pre-defined criteria. In some cases, their results may be used to refine existing criteria, or even add new ones.
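
The "compare the output before and after the conversion" step can be sketched in a few lines of Python. This is only an illustration: it assumes that the output of a characterisation tool has already been loaded into a plain dictionary of field names and values for each file, and the function name and field names below are hypothetical rather than taken from any particular tool.

```python
def compare_metadata(before, after):
    """Summarise which metadata fields were lost, added or changed.

    Hypothetical helper: 'before' and 'after' are dictionaries of
    field name -> value, one for the source file and one for the
    file that came out of the conversion round trip.
    """
    lost = sorted(key for key in before if key not in after)
    added = sorted(key for key in after if key not in before)
    changed = {
        key: (before[key], after[key])
        for key in before
        if key in after and before[key] != after[key]
    }
    return {"lost": lost, "added": added, "changed": changed}


# Made-up values for a TIFF and the TIFF produced after a JP2 round trip
source_tiff = {"width": 100, "x_resolution": 300, "colour_space": "Adobe RGB"}
round_trip_tiff = {"width": 100, "colour_space": "sRGB", "compression": "none"}

report = compare_metadata(source_tiff, round_trip_tiff)
```

In this made-up case the report lists the resolution field as lost and the colour space as changed, which are the two kinds of loss mentioned at the start of this post.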

Final notes on preservation criteria
Although I wouldn’t downplay the importance of preservation criteria such as those used by the LoC or TNA, I think it’s important to realise that such criteria are largely based on theoretical considerations. In most cases they are not based on any empirical data, and as a result their predictive value is largely unknown. For example, an interesting blog post by David Rosenthal argues that preserving the specifications of a file format doesn’t contribute anything to practical digital preservation. According to Rosenthal, the availability of working open-source rendering software is much more important, and he explains how “formats with open source renderers are, for all practical purposes, immune from format obsolescence”.

This takes us directly to the lack of JPEG 2000-related activity in the open source community, which I also referred to in my presentation. Perhaps the best way to ensure sustainability of JPEG 2000 and the JP2 format would be to invest in a truly open JP2 software library, and release this under a free software license. This could either take the form of the development of a completely new library, or investing in the improvement and further development of an existing one, such as OpenJPEG. This would require an investment from the archival community, but the payoff may be well worth it.

Acknowledgement: this blog entry was largely inspired by an e-mail discussion that was started by Richard Clark, and in particular by a contribution to this discussion by William Kilbride.