Bibliographies could be made easy if research material reliably self-documented what it is, using DOIs.

Scientists read. A lot. They also write research papers and grant applications (and blogs!).

All these need references and these references need to be cited.

Most of us (surely all of us, these days??) use reference management software of one kind or another. Isis has a post asking what people recommend for reference management software.

There is a wide range of options, and to varying degrees they all work. Most let you drag a reference from the bibliography dataset into the open document you’re working on and “voilà”, instant citation. Most come with a range of styles as specified by the publications, so that your citations match, say, what Nature wants (we can always dream…).

One issue that bothers me is the initial creation of the bibliographic entries. Most programs work on the basis of obtaining from the publisher or publication aggregator (Web of Science, etc.) a bibliographic entry, which you then add to your database.

Another approach is to extract the DOI from the downloaded research paper, which in my case is invariably a PDF file. (Some fields offer LaTeX files, and DOIs can refer to other kinds of files too.)

The DOI (Digital Object Identifier) is a unique identifier assigned to a document. If you give this identifier to one of a number of internet-based databases, they’ll return the (full) citation information. One such database is http://dx.doi.org/

So, for example, you’ll see something like ‘DOI:10.1000/182’ or ‘doi: 10.1093/nar/gkn763’ somewhere in most recent research papers. The ‘10.1000/182’ is the DOI. This first one, incidentally, is for the DOI® Handbook. (The other is my most recent research paper!) More information can be found at the DOI main page.

This process can be done by software; the program can contact the DOI database on your behalf and get the citation for you.
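As a rough sketch of what such a program might do, the Python below asks the DOI resolver for a citation rather than the paper itself. It assumes the resolver honours an `Accept` header requesting BibTeX, which depends on the agency that registered the DOI. The function names are my own, made up for illustration, and `https://doi.org/` is the current address of the resolver linked above.

```python
import urllib.parse
import urllib.request

def build_doi_request(doi, style="application/x-bibtex"):
    """Build (but do not send) a request asking the DOI resolver for citation data."""
    # Keep the slashes that separate the DOI prefix from its suffix.
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    # Assumption: the resolver returns a citation in the format named by Accept.
    return urllib.request.Request(url, headers={"Accept": style})

def fetch_citation(doi):
    """Send the request and return the response body as text (needs network access)."""
    with urllib.request.urlopen(build_doi_request(doi)) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_citation("10.1093/nar/gkn763")` should, if the resolver cooperates, hand back a citation entry ready to drop into a bibliography database.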

It’s a great development, offering a one-stop shop for referencing documents of all kinds.

Several programs attempt to scan PDF files for DOIs, but with mixed success. I spent some time looking into this about a year ago. The concept is nice, and it would suit people like me who have large archives of PDF files of research papers.

However, I write “with mixed success” because, while you’d wish that scanning a PDF for a DOI were simple, it’s not. What these programs try to do is scan the text of the document looking for the word ‘doi’ (or ‘DOI’) and extract the following text, which should be the DOI itself.
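To make that concrete, here is a minimal sketch of the naive scan in Python. It is my own illustration, not the code of any particular reference manager, and it assumes the PDF's text has already been extracted into a string.

```python
import re

# Naive pattern: the word "doi" (any case), an optional colon, then non-space text.
NAIVE_DOI = re.compile(r"\bdoi:?\s*(\S+)", re.IGNORECASE)

def naive_find_doi(text):
    """Return whatever follows the first 'doi' in the text, or None."""
    match = NAIVE_DOI.search(text)
    return match.group(1) if match else None

print(naive_find_doi("Nucleic Acids Res. doi: 10.1093/nar/gkn763"))
print(naive_find_doi("Available at http://dx.doi.org/10.1000/182"))
```

The first call picks out the DOI cleanly. The second grabs `.org/10.1000/182`, because the word ‘doi’ also turns up inside the resolver's address.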

If it were only that simple…! Publishers use different ways of citing the DOI, which is ironic considering that DOIs are supposed to solve a citation problem by standardising it.

Some publishers list a URL to the DOI, rather than the DOI itself. Some place a colon after ‘DOI’, others don’t. Some allow the DOI to be broken over a line; this is doubly tricky for scanning software if it’s also broken over columns of text. And so on… parsing out the DOI from text isn’t as easy as you’d think.

It gets worse. Some PDFs contain more than one DOI, so which one refers to the current document?! This happens, for example, when the publication doesn’t keep papers on separate pages, so that the end of another article can be at the top of the one you downloaded (or, alternatively, the start of the following paper is at the end of the article you downloaded). Also, there’s nothing stopping authors from citing DOIs in the article or its references.
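The ambiguity is easy to show. The Python sketch below swaps the search for the word ‘doi’ for a match on the shape of the identifier itself ("10.", some digits, a slash, a suffix), so it copes with URLs and missing colons. The pattern is my own approximation, not an official DOI grammar, and it does nothing for DOIs broken across lines.

```python
import re

# Approximate DOI shape: "10.", a numeric prefix, a slash, then a suffix.
DOI_PATTERN = re.compile(r"\b(10\.\d{4,9}/[^\s\"<>]+)")

def find_doi_candidates(text):
    """Return every distinct DOI-shaped string, in order of first appearance."""
    seen = []
    for match in DOI_PATTERN.finditer(text):
        # Trim punctuation that belongs to the sentence, not the DOI.
        candidate = match.group(1).rstrip(".,;)")
        if candidate not in seen:
            seen.append(candidate)
    return seen

page = (
    "...end of the previous article. doi:10.1000/182\n"
    "Title of the paper you actually downloaded\n"
    "DOI 10.1093/nar/gkn763\n"
    "See also http://dx.doi.org/10.1000/182.\n"
)
print(find_doi_candidates(page))
```

This prints two candidates, and nothing in the text says which of them identifies the file in hand.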

Solution?

What I’d like to see is use of a tag (or specialised comment) within the PDF file itself (not the text you read, but the encoding of that text) that carries the DOI so that it can be reliably extracted.

This concept should be extended to other file formats.

The key here is self-documenting, something I’m very much a fan of.

Computer files should document themselves: they should say what they are within themselves, without needing any further external information (what folder they’re in, the year they were written, etc.).

I picked up this many years ago when writing a database of protein sequences as a Ph.D. student. (It happened to contain CCHH zinc finger proteins, but that isn’t important here.)

My software would generate, in the output of an analysis, a “commented out” copy of the input command file, the source of the data, the version of the database used, the date and time executed, and so on. Among other things, this let me feed the output back in as input if I lost the input file, and it could also be used to verify the data.
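The idea can be sketched in a few lines of Python. This is an illustration only, not the original software: the function names, the `#` comment marker and the `#> ` prefix for command lines are all invented here.

```python
from datetime import datetime, timezone

def self_document(command_text, source, db_version, results):
    """Return output text that opens with a commented-out record of its origin."""
    header = [
        f"# source: {source}",
        f"# database version: {db_version}",
        f"# executed: {datetime.now(timezone.utc).isoformat()}",
    ]
    # Hypothetical convention: each input command line is kept behind "#> ".
    header += [f"#> {line}" for line in command_text.splitlines()]
    return "\n".join(header + list(results)) + "\n"

def recover_commands(output_text):
    """Rebuild the input command file from the commented header."""
    return "\n".join(
        line[3:] for line in output_text.splitlines() if line.startswith("#> ")
    )

commands = "search CCHH\nreport all"
output = self_document(commands, "example.db", "1.0", ["hit1", "hit2"])
```

Because the commands travel inside the output, `recover_commands(output)` gives back the original command text even if the input file is long gone.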

In particular what I’d created was a self-documented output file: it said in some detail precisely what it contained within itself.

In the case of research papers, the DOI provides an extremely simple way of self-documenting what the file is. The trouble is how this is encoded within the document.

Having it “lying loose” in the document is hopeless, as we’ve seen. As I mentioned earlier, there is a (huge!) irony in that the citations of the DOIs themselves are inconsistent and unreliable.

The solution would be to simply define a “meta tag” that contains nothing but the “raw” DOI.

Modern text formats (PDF, .doc, HTML, LaTeX [LaTeX is modern!??], etc.) have within them markup “tags” indicating the nature of portions of the document: whether each is a header, a paragraph, the title, and so on.

A logical thing to do would be to define a tag containing the DOI for the document, so that it becomes self-documenting, and to have an equivalent present in all the main document formats.
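To show how much simpler extraction becomes, here is a minimal Python sketch for the HTML case. It assumes a hypothetical tag of the form `<meta name="doi" content="...">`; the tag name is invented for illustration and is not an agreed standard, and other formats would need their own equivalent.

```python
from html.parser import HTMLParser

class DoiTagReader(HTMLParser):
    """Collect the content of <meta name="doi" ...>, a hypothetical tag name."""

    def __init__(self):
        super().__init__()
        self.doi = None

    def handle_starttag(self, tag, attrs):
        fields = dict(attrs)
        # Only the dedicated tag counts; DOIs in the visible text are ignored.
        if tag == "meta" and fields.get("name") == "doi" and self.doi is None:
            self.doi = fields.get("content")

def read_doi_tag(html_text):
    """Return the DOI carried by the document's own tag, or None if absent."""
    reader = DoiTagReader()
    reader.feed(html_text)
    return reader.doi

page = (
    '<html><head><meta name="doi" content="10.1000/182"></head>'
    "<body>Cited elsewhere: doi: 10.9999/other</body></html>"
)
print(read_doi_tag(page))
```

There is no guessing about colons, line breaks or stray DOIs from neighbouring articles: the reader looks in one place and takes the raw value.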

This seems an obvious solution to me and one that would have a major impact on maintenance of documents.

I for one would welcome the day all documents were self-documented.