June 16, 2010

We need how much storage?

In 2009, the Wellcome Library set out an ambitious vision to digitise a large proportion of its historic collections. This would take the Library's digitisation activity from hundreds, or at most thousands, of images per year to several million images per year. Collections were to include a wide range of content types - archives, printed books from the 15th to the 20th century, manuscripts, paintings and drawings, and ephemera. Once we added up all these collections, using broad estimates of what we believed was there, we realised this could generate up to 30m images over 5 years. Exciting, but perhaps slightly daunting, considering we didn't yet have an infrastructure that could fully support such a large collection of digital assets.

Anyone reading this blog will understand why the scale of the programme is key to the blog topic. When we asked our IT department to tell us how much it would cost to store 30m TIFF files - our de facto standard for the couple of hundred thousand images in our existing picture library - we were stunned. Two petabytes of online, spinning disk storage with a top-of-the-line enterprise management system and remote backup would cost how much? We learned that the cost would be something like a fifth of our total budget for the entire digitisation programme.
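As a rough back-of-envelope check, the two figures above (two petabytes for 30m images) imply an average size per TIFF. The sketch below assumes decimal units (1 PB = 10^15 bytes), which is our assumption rather than something the IT estimate specified, and the function name is purely illustrative.

```python
# Back-of-envelope: average file size implied by 2 PB for 30m images.
# Assumption: decimal (SI) units, i.e. 1 PB = 10**15 bytes, 1 MB = 10**6 bytes.
PETABYTE = 10**15
MEGABYTE = 10**6


def average_image_size_mb(total_bytes, image_count):
    """Return the average size per image in megabytes."""
    return total_bytes / image_count / MEGABYTE


avg_mb = average_image_size_mb(2 * PETABYTE, 30_000_000)
print(f"Implied average size per image: {avg_mb:.1f} MB")
```

Under those assumptions the average works out at roughly 67 MB per image, which is the scale of file the storage estimate was built on.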

Should we consider a lower-cost storage solution? Even tape back-up was quite expensive for that scale, and you can't serve images up online from tape anyway. We revised our image sizes, factoring in smaller and smaller resolutions and/or bit depths for material like the printed books, which didn't need full colour, high resolution images. We still couldn't afford the storage costs.
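To illustrate why resolution and bit depth matter so much, here is a minimal sketch of how the size of an uncompressed image scales with pixel dimensions, number of colour channels and bits per channel. The pixel dimensions are hypothetical examples, not our actual capture settings, and the calculation ignores file headers and metadata.

```python
# Uncompressed raster size = width x height x channels x bits per channel.
# The dimensions below are hypothetical, chosen only to show the scaling.
MEGABYTE = 10**6  # decimal megabytes


def uncompressed_size_mb(width_px, height_px, channels, bits_per_channel):
    """Approximate size in MB of an uncompressed image (no headers or metadata)."""
    total_bits = width_px * height_px * channels * bits_per_channel
    return total_bits / 8 / MEGABYTE


# Full colour: 3 channels at 8 bits each
full_colour = uncompressed_size_mb(4000, 6000, 3, 8)
# Greyscale at half the linear resolution: 1 channel at 8 bits
greyscale = uncompressed_size_mb(2000, 3000, 1, 8)

print(f"Full colour: {full_colour:.1f} MB")
print(f"Greyscale, half resolution: {greyscale:.1f} MB")
print(f"Reduction factor: {full_colour / greyscale:.0f}x")
```

Halving the linear resolution cuts the pixel count to a quarter, and dropping from three colour channels to one cuts it by a further factor of three, so the hypothetical greyscale image is about a twelfth of the size of the full colour one.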

Finally, we saw the light and started looking into a relatively new image format called JPEG 2000. We knew almost nothing about it, except that it employed an extremely efficient compression algorithm that could, possibly, allow us to reduce our storage costs without compromising too much on quality.

This was the start of our journey into the complicated and mystifying world of JPEG 2000. This blog charts our progress to date in determining what type of JPEG 2000 we would use, how we would use it, and how it would affect the rest of the Digital Library infrastructure. We have by no means worked out all the details of how we are going to implement JPEG 2000, so this blog will also serve as a diary of our progress as we go along. Happy reading, and feel free to post comments.

2 comments:

Anonymous said...

Hi, I'm curious about the storage required and the technical solutions you eliminated. There are a number of different options available - e.g. my napkin numbers say going completely cloud with the storage would incur opex of approx US$3 million per year, while buying decent SAN storage would be approx US$4.5 million capex plus opex of (roughly) US$1 million per year. There are other associated costs such as dev time to use these, but those estimates include maintenance/backup. Of course I don't know your usage estimates so am guessing a lot for these numbers.

Also, storage costs decrease on a monthly basis and (I'm guessing) you're fortunate in that your data store 1) doesn't change (once digitized, an image stays the same) and 2) grows at a predictable rate (you can only digitize so much at once). 1) means backup is easy, as you only need an initial copy and deltas are (virtually) nonexistent, and 2) means you can buy storage in chunks at the cheapest current price instead of buying it all upfront and not using most of it until the project nears completion.

So, I'm curious, just what was the fifth-of-total-budget cost you were looking at, and what was the capex/opex breakdown? And what are your target expenses?

Christy Henshaw said...

Hello anonymous. Very good comment and questions. We will be posting soon about the whole storage issue in more detail, so I'll have to ask for a little patience as we get that one ready!