I was asked to deliver a presentation to the Scientific Archivists Group conference in Nice on “Electronic Signatures”. This group tends to focus on the needs of regulated industries but also covers other areas.
It was an interesting talk to write… I knew I needed to give a high-level overview avoiding the deeply technical stuff, but as I drafted it I realised it was the human dimension that mattered most. Which is where I ended up.
This is a good story about the problems of proprietary file formats; it has a happy ending, but only after what appears to be days of effort and some Open Source tools.
I really don’t know why customers put up with vendors shipping undocumented private file formats. It just locks up your data.
Purchasing departments should require Open File Formats for any piece of software that is bought – it should be an absolute corporate mandate. Sadly the purchasing playbook has yet to gain a section on what is probably one of the most business-critical aspects of scientific computing.
Not all of the “Stuff” sloshing around the lab is the same, and distinguishing between the kinds helps tease out the best place to store things. We use a simple Triangle Diagram (originally proposed by John Trigg of PhaseFour), which really just tries to point out that the stuff is related, but it’s at different levels of abstraction:
It is quite hard to draw definite lines around things, but I think most people can appreciate that a raw data dump from an instrument is somewhat different from a report to management, or that an experimental write-up in Word is different from some tabular data in a spreadsheet. The differences between the levels show up in:
- The software that’s used to read the file and interpret the content. Some will require very specific software (e.g. from an instrument vendor), but a PDF or text file can be read by many different things.
- Who might be interested in the data. Again, some files are useful to anyone (for example, a report), but others are only useful to certain people with specific training.
- How long your company might want to keep the data, and indeed how long you are realistically able to keep the data. Typically the lower you go, the harder it is to keep something, so if you feel it’s business critical you really need to pay attention to the formats used.
This differentiation can really help in ELN system design. Partly it draws your attention to what needs to be stored in the ELN (typically the “Experiment” write-up level), and what can be left in other systems, e.g. a database or a file server, which the ELN can point to.
Not everything needs to be stored in the ELN, and indeed it would be unrealistic to expect to do so. The important thing is common keys, so you can offer the user a link to more information; the advent of web-based systems has made this level of “integration” so trivial that one sometimes feels a bit of a fraud describing it as such.
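To make the point about how trivial this is, here is a minimal sketch of link-by-common-key in Python. Everything here is invented for illustration – the system names, URL patterns, and the `link_for` helper are hypothetical, not any real product’s API:

```python
from urllib.parse import quote

# Hypothetical external systems the ELN might point at; the URL
# templates are made up purely for this example.
EXTERNAL_SYSTEMS = {
    "raw-data": "https://files.example.com/instruments/{key}",
    "database": "https://lims.example.com/samples/{key}",
}

def link_for(system: str, experiment_key: str) -> str:
    """Build a web link from the common key shared by the ELN and the
    external system. This really is all the 'integration' amounts to."""
    template = EXTERNAL_SYSTEMS[system]
    return template.format(key=quote(experiment_key, safe=""))

print(link_for("raw-data", "EXP-2011-0042"))
# https://files.example.com/instruments/EXP-2011-0042
```

As long as both systems agree on the key, the ELN needs to know nothing else about the system it is linking to.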
By building on the storage tools you have in place, and focusing an ELN on Experiments, the resulting system is cheap to run, costs little to acquire, and results in little disruption to existing practices.
You can read the final part of this series here.
This is a good post showing the problems of “Standards” and how they aren’t necessarily a guarantee of any kind of interoperability.
Standards can not force a vendor to be interoperable. If a vendor wishes deliberately to withhold interoperability from the market, then they will always be able to do so, and, in most cases, devise an excuse using the text of the standard as a scapegoat.
You can’t rely on any representations about a “standard” unless there’s clear evidence of different implementations working together.
This really matters for ELNs, for two reasons:
- Your data is locked up in the ELN unless you know you have proved you can get it out.
- The need for long-term (typically 30+ years) access to proof of experimental activity for patent purposes means you can’t rely on any of the existing vendors being around, or on the software you are currently using still being functional.
We recommend that customers reassure themselves, before they purchase any ELN product, that they can get the data they need out into some suitable format – you have leverage with your supplier before you purchase; after that you’re locked in. I am amazed how many customers are aggressive about price discounts etc. but don’t look carefully at the “open data” situation – which is surely much more important than a 3% saving on the license price.
There are a number of ELNs around at the moment which can’t reliably create a PDF of an experiment. To some extent it’s a hard problem, but my concern is that “How do we get our data out?” is answered with a wave of the hand and “You can just export to PDF”, and everyone takes that at face value. In fact, in some cases the PDF export facility is unreliable, doesn’t contain all the data, and can’t be automated! Buyer beware…
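Even a crude automated check beats taking the export at face value. The sketch below, using only Python’s standard library, flags exports that are empty or don’t even start with a PDF header – the file names and directory layout are invented for the example, and this obviously catches only the grossest failures, not missing data inside a structurally valid PDF:

```python
from pathlib import Path
import tempfile

def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: a real PDF starts with b'%PDF-'. This catches
    zero-byte or truncated exports, though not deeper corruption."""
    return data[:5] == b"%PDF-"

def check_exports(export_dir: Path) -> list[str]:
    """Return names of exported files failing the basic check, so a
    nightly job can flag them instead of trusting the export blindly."""
    return sorted(p.name for p in export_dir.glob("*.pdf")
                  if not looks_like_pdf(p.read_bytes()))

# Demo on a throwaway directory: one plausible PDF, one empty export.
with tempfile.TemporaryDirectory() as d:
    Path(d, "good.pdf").write_bytes(b"%PDF-1.4 ...")
    Path(d, "bad.pdf").write_bytes(b"")
    bad = check_exports(Path(d))
print(bad)  # ['bad.pdf']
```

If a vendor’s export can’t even be driven by a script like this, that tells you something about how “open” the data really is.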
In terms of picking good “standards” we prefer published, straightforward, and well-used file formats. In our case that currently means XML and PDF/A, both of which are very open and used extensively by a wide range of software.
We’ve updated the PatentSafe Repository Checker script, and importantly released it under an Open Source license (the GPL) which means anyone can check the integrity of a PatentSafe repository.
The project is on GitHub – here’s the project page.
The checker script is a completely separate implementation of the signature and repository code, and is a useful way for anyone – Amphora customer or not – to check that things are OK with their data.
An important part of PatentSafe’s value is that it creates an open repository which you can read and take into court without needing any additional software from Amphora. Everything is completely open as standard: no complicated export step, and no software needed except a PDF reader, a text reader, and OpenSSL. The open release of the checker script is just part of this.
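The real checker verifies PatentSafe’s signatures, which I won’t reproduce here. But the general idea of an independent integrity check – recompute a digest of each document and compare it against a value recorded earlier – can be sketched with nothing but Python’s standard library. The manifest format below is invented purely for illustration; it is not PatentSafe’s repository layout:

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    # Stream the file so large documents need not fit in memory.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(manifest: Path, root: Path) -> list[str]:
    """Each manifest line is '<hex digest>  <relative path>' (a made-up
    format for this example). Returns the paths that fail to verify."""
    failures = []
    for line in manifest.read_text().splitlines():
        digest, _, rel = line.partition("  ")
        if sha256_of(root / rel) != digest:
            failures.append(rel)
    return failures

# Demo: one intact document, one tampered with after the manifest was written.
with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "a.pdf").write_bytes(b"original contents")
    (root / "b.pdf").write_bytes(b"original contents")
    lines = [f"{sha256_of(root / n)}  {n}" for n in ("a.pdf", "b.pdf")]
    (root / "manifest.txt").write_text("\n".join(lines))
    (root / "b.pdf").write_bytes(b"tampered contents")
    failed = verify_manifest(root / "manifest.txt", root)
print(failed)  # ['b.pdf']
```

The point is that a second, independent implementation – whether this toy one or the GPL’d checker script – is what lets you trust the repository without trusting the vendor.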