Is a paper just a PDF file?

Practically all papers on this site are in PDF format, as with most online publication channels. Is this format sufficient? I think that supplementary material such as datasets, source code, and reviews would significantly increase the utility of a paper for the scientific community. Currently, such supplementary material (if any) is put on repositories detached from this site. This can create inconsistencies, e.g. when the dataset is removed or altered. So, what about allowing authors of accepted papers to publish a “fat paper”, i.e. the printable paper (PDF) plus supplementary material? Are there standards available for such papers? Would you prefer a mandatory semantic annotation of the supplementary material? How can licenses be handled, e.g. for source code?

— Manfred Jeusfeld

  1. langec said:

    Good idea! Let’s first think about source material that’s available anyway within the editorial workflow:

    Sources of papers (usually LaTeX) are useful for researchers who want to do document analysis. I am aware of a project that does such things with the arXiv document sources. In the EasyChair editorial workflow suggested by ceur-make, the LaTeX sources are collected anyway, but not used at the moment. However, the authors would have to agree to the sources being published under some licence. arXiv suggests full Creative Commons licences or, what I think authors usually choose, a licence that only allows storage and processing for the purposes of arXiv.

    Reviews: EasyChair doesn’t have functionality to export them; however, it would be an easy job for the editors to save the corresponding HTML files from EasyChair or any other review system they used. But the reviewers would have to agree. The Semantic Web Journal does something similar.

    For source code I’d rather encourage authors to publish it online at a site that’s better suited for this job, e.g. GitHub, and then link to it. The same goes for data, which could be published at Datahub. Links to such source/data publications could be included in an extended version of our index.html tables of contents.

    –Christoph Lange (technical editor)

  2. Thanks! Utilizing LaTeX (perhaps with one of the semantically enriching macro packages) would open more opportunities, e.g. to cite a certain theorem in a paper rather than the whole paper.
    However, not all authors use LaTeX, and not all workshop editors use EasyChair.

    Source code is tricky if it requires a specific platform. But source code is not just computer programs; it can also include database schemas or other models. My argument for the “fat paper” was that the paper text and the dataset belong together. Storing them on different platforms weakens the link between them and makes archiving the “fat paper” a difficult task. I used such a dataset repository some time ago. It was quite complicated to get the data into it, and I am sure that nobody found it.

  3. langec said:

    Got your point about external source code repositories vs. a “fat paper”. We should aim for “fat and smart”, i.e. not just storing the fat material, but at the very least having meaningful links to it from index.html (ideally RDFa-annotated). If the paper itself is HTML we could, of course, have fine-grained RDFa links from the paper into the source code or other data. That’s within the authors’ responsibility, but we should prescribe a certain directory layout and, via the editors, communicate it to the authors.

    But there is one reason to prefer external source/data repositories. If projects evolve beyond the publication of that one workshop paper, the external repository will contain up-to-date sources. OTOH, for reproducing results mentioned in a paper it’s probably a good idea to additionally have a snapshot of the sources from the time the paper was written. And I do agree that this site is a good place to publish the snapshot.

    I do like semantic LaTeX and have, in fact, contributed to the development of sTeX. But I am not aware of many users; even I myself have rarely used it for publications. (Michael Kohlhase, the main developer, does all of his lecture notes with it, but I think he is the only one.)

    You have a better overview than I have, but my impression is that the majority of papers are written using LaTeX (and if not: we could also publish *.doc sources, they are just less useful), and that at least a relative majority of workshops use EasyChair. That is, providing special support for the LaTeX/EasyChair workflow is probably the best investment.

  4. Yes, fat implies frozen here. Ongoing development cannot be frozen, hence it should not be materialized on this site, even though workshops by their very nature *report* on ongoing work. I am not sure whether the majority of papers on this site were produced with LaTeX. I guess it is 50% LaTeX and 50% Office.

    It is fine when a paper refers to external data sets or project sites. This is easy and beyond our control.

    What we could learn from (Open)Office is the packaging of a document into a zipped archive that has some structure.
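The packaging idea can be illustrated with a minimal Python sketch. It is only an assumption about what a “fat paper” archive might look like: the `manifest.json` file name, its fields, and the example paths are hypothetical and not an existing standard. The manifest records a SHA-256 checksum for each file, so that a frozen package can later be checked for alterations.

```python
import hashlib
import io
import json
import zipfile


def build_fat_paper(files: dict[str, bytes]) -> bytes:
    """Pack files into a zip archive together with a manifest.json.

    `files` maps archive-relative paths (e.g. "paper.pdf") to raw bytes.
    The manifest layout is hypothetical, chosen only for illustration.
    """
    manifest = {
        "files": [
            {
                "path": path,
                "sha256": hashlib.sha256(content).hexdigest(),
                "size": len(content),
            }
            for path, content in sorted(files.items())
        ]
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Store the manifest alongside the payload files.
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
        for path, content in sorted(files.items()):
            archive.writestr(path, content)
    return buffer.getvalue()


def verify_fat_paper(package: bytes) -> bool:
    """Return True if every file listed in the manifest matches its checksum."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        manifest = json.loads(archive.read("manifest.json"))
        return all(
            hashlib.sha256(archive.read(entry["path"])).hexdigest() == entry["sha256"]
            for entry in manifest["files"]
        )
```

An editor could call `build_fat_paper({"paper.pdf": pdf_bytes, "data/results.csv": csv_bytes})` once at publication time and run `verify_fat_paper` on the stored archive whenever its integrity is in question.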

  5. I strongly believe that papers that build heavily on data should make this data publicly available as much as possible. Furthermore, if tools have been used to analyze the data, these tools should be publicly available as well.

    Of course, this is not always possible, for example when NDAs or licensing issues come into play. However, in my opinion, this is currently not the main reason why people do not make their data available. I believe it is simply because data and source code don’t count as publications. There is no h-index for datasets, nor is there such a thing for tools. Hence, any effort in the direction of standardizing data availability should focus on this problem.

    In The Netherlands, we started an initiative called the 3TU Datacenter. This datacenter allows researchers to store large volumes of data as if they were publishing the data. The data gets a DOI and can be referred to from papers in a standardized way. Much like a journal or a proceedings series, once data is published, it will not change anymore.

    Unfortunately, some data cannot be archived this way. For example, when trying to archive process models (such as Petri nets, EPCs, BPMN models etc.), the datacenter cannot help, simply because it cannot commit to having tools available to read these models.

  6. Yes, tools for the data are an issue. If license conditions allow, then the tools could be packed into a virtual machine appliance that is likely to be executable for a couple of years. I never did this, but a VM would make the tool rather independent of the user’s own operating platform.

    PS: I did that for the ConceptBase software:

  7. With the PDF format you can include different types of “annotations” such as attachments, comments, links, form elements, and JavaScript programs. For the “supplementary requirements” you mentioned, I think PDF is sufficient.

    Here is a reference to the PDF standard: There is also software for “reading” and “authoring” PDF documents. Personally, I use a paid version of a program that can edit PDF documents.

    I think it would be a very good idea to allow authors to include annotations in their “papers” submitted as PDF documents for publication here. If certain data sets and special computer programs go together, they can be included as attachments to the PDF document. The entire submission can be treated as a unit that can be preserved together.

    If the data set is very large, attaching it to the PDF document may not be a good idea, because many users only browse the visual and textual content of the PDF document. Perhaps authors should submit two types of PDF documents: a “fat paper” that includes everything the author submits for publication, and a “thin paper” that is for “browsing” only. The “thin paper” may include a link or another type of reference to the “fat paper”.

    • PDF is a wonderful format, but embedding metadata in PDF files might conflict with archiving goals such as those promoted by PDF/A. But I agree with the idea of a “fat paper” very much. PDF/A is not a single standard but a series of standards, and different national libraries support different versions. In version 1.0, they disallow URLs in the PDF, which I find quite a restriction. They disallowed the URLs because links make a document dependent on external resources.
      If we could materialize the annotations as an XML document *and* at the same time embed them into the PDF, then we would have a basis that caters for multiple use cases.
      If such a function could be provided as automated scripts based on free, open-source tools, then we could even integrate it into our workflow … script donations are very welcome!
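As a rough sketch of the XML half of that idea, the following Python snippet writes a list of annotations as a standalone XML document using only the standard library. The element and attribute names (`annotations`, `annotation`, `type`, `target`) are assumptions made for illustration, not an existing schema, and embedding the same data into the PDF would require a separate PDF library that is not shown here.

```python
import xml.etree.ElementTree as ET


def annotations_to_xml(paper: str, annotations: list[dict[str, str]]) -> str:
    """Serialize annotations for one paper as an XML string.

    Each annotation is a dict with the hypothetical keys "type" and "target"
    and an optional free-text "note".
    """
    root = ET.Element("annotations", {"paper": paper})
    for item in annotations:
        node = ET.SubElement(
            root, "annotation", {"type": item["type"], "target": item["target"]}
        )
        node.text = item.get("note", "")
    return ET.tostring(root, encoding="unicode")
```

The resulting string could be stored next to the PDF as a sidecar file, which keeps the annotations readable even where an archival PDF profile restricts what may be embedded.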

  8. Yesterday I saw a video promoting the “inquiry” style of textbook created by Vinay Chaudhri and others at SRI. They essentially transcode the knowledge statements in a text into a knowledge base, which can then be queried interactively. The application is geared towards teaching, but this should also be applicable to academic papers, at least up to a certain level of knowledge facts. So, the addition of data sets is not the only way to go beyond the “print” style of papers. Papers and their accompanying objects should be subject to querying.
