The CEUR-WS.org Core Values include “freeness and openness” and a “clear copyright”. With regard to the openness of its data, our current implementation of these two values leads to a stark self-contradiction. Here is why.
Let’s start by revisiting these values:
- Freeness and openness: The publication service is free of cost and openly accessible for the academic community. The freeness of costs refers to the main publication service, i.e. to publish a submission that is essentially free of errors.
- Clear copyright: The authors shall keep the copyright to their papers. The editors keep the copyright to the proceedings as a whole.
Seems reasonable, doesn’t it? – It does, but only for the papers we publish, not for the metadata about these papers.
I’m starting this discussion in my role as the CEUR-WS.org technical editor. This is so far my personal view, not (yet) the consensus of the CEUR-WS.org team. Part of my mission is working towards the publication of the CEUR-WS.org metadata as Linked Open Data. In particular, I helped to shape the definitions of the 2014 and 2015 Semantic Publishing Challenges to make them a major driver of the technical developments necessary for this mission.
We are an open access publication platform; thus, any paper published with CEUR-WS.org is gold open access. Not only accessing papers, but also publishing them is free of charge.
However, we do not actually publish open content: according to the Open Definition, open content “can be freely used, modified, and shared by anyone for any purpose”. This contradicts the way we are currently implementing the “clear copyright” value: neither paper authors nor volume editors have to grant any permission; they reserve all rights.
By the same argument, the metadata about the papers and workshop volumes is not open. Let’s first discuss why data should be open. According to the Open Knowledge Foundation, there are three common reasons, and all of them apply to scientific publishing:
- Transparency: Not only do citizens want to understand what their governments are doing, the members of the scientific community also want to be able to assess the quality of the scientific output of their peers (which is the primary motivation for the Semantic Publishing Challenges).
- Releasing social and commercial value: Not only assessing the quality of a workshop series or of a paper, but even finding a good paper about some topic, or finding an expert in some field, requires access to data. By merely being able to download the HTML and PDF files of CEUR-WS.org workshops, it is hard to realise retrieval or quality assessment in practice. It is even harder to deliver additional social and commercial value. To give a concrete example, researchers recently enquired about the possibility of developing a summarization service for our volumes and re-publishing such summaries. This would only be possible with the consent of the copyright owners, i.e. the paper authors; but, to keep the publication process simple, CEUR-WS.org does not ask them for such consent.
- Participation and engagement: CEUR-WS.org is participatory, by its third fundamental value (“from scientists for scientists”). Every scientist can participate in CEUR-WS.org by publishing a workshop volume, or contributing their papers to such a volume – but once such a volume is published, participation gets reduced to being able to look at papers.
Now assume you want to open your data – how do you, technically, implement this openness, including transparency, the possibility to add value, and the possibility to participate and engage? The 5 Star Open Data scheme argues that Linked Data is the way to go:
- using Web-wide unique identifiers (i.e. URIs) for things (here: papers, proceedings volumes, authors, conferences, etc.) – CEUR-WS.org has been using stable URIs such as http://ceur-ws.org/Vol-1155/ for a long time,
- using HTTP URLs for these identifiers so that information about a thing (here, e.g., the table of contents of a proceedings volume) can be downloaded by simply typing its identifier into the browser’s address bar – this is the case at CEUR-WS.org,
- providing machine-comprehensible information about things for download from these URLs – this is not the case, as we only serve HTML and PDF designed for human consumption,
- providing links to other things so that further information can be discovered – this is not the case, as we leave submitted HTML and PDF files unchanged.
The first two Linked Data principles listed above are prerequisites for 4-star open data, as is the third; the fourth is a prerequisite for the fifth star. All in all, the CEUR-WS.org papers, published as PDF, earn one star, and the HTML tables of contents earn between one and three stars: you can manipulate them (e.g. enlarge the font size for readability) without proprietary software, but you can only manipulate their presentational aspects; you cannot, e.g., access them like a database to filter papers by topic or by author.
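To make the difference concrete, here is a minimal Python sketch of what “access them like a database” could mean once a table of contents is available as machine-comprehensible statements. It is an illustration only, not the CEUR-WS.org data model: the paper fragment identifiers (`#paper-01`, `#paper-02`), the titles and the `example.org` author URIs are made up, and Dublin Core Terms is just one vocabulary that could be used. Only the volume URI is taken from the text above.

```python
class URI(str):
    """Marks a string as a URI (a link to a thing) rather than a literal."""


DCT = "http://purl.org/dc/terms/"  # Dublin Core Terms namespace
VOLUME = URI("http://ceur-ws.org/Vol-1155/")

# Hypothetical identifiers, for illustration only.
PAPER_1 = URI(VOLUME + "#paper-01")
PAPER_2 = URI(VOLUME + "#paper-02")
ALICE = URI("http://example.org/person/alice")
BOB = URI("http://example.org/person/bob")

# Each statement is a (subject, predicate, object) triple.
TRIPLES = [
    (PAPER_1, URI(DCT + "isPartOf"), VOLUME),
    (PAPER_1, URI(DCT + "title"), "An Example Paper"),
    (PAPER_1, URI(DCT + "creator"), ALICE),
    (PAPER_2, URI(DCT + "isPartOf"), VOLUME),
    (PAPER_2, URI(DCT + "title"), "Another Example Paper"),
    (PAPER_2, URI(DCT + "creator"), BOB),
]


def to_ntriples(triples):
    """Serialize the triples in N-Triples syntax, one statement per line."""
    lines = []
    for s, p, o in triples:
        if isinstance(o, URI):
            obj = f"<{o}>"  # a link to another thing
        else:
            escaped = o.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'  # a plain string literal
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


def papers_by(author, triples):
    """Return the URIs of all papers whose dcterms:creator is `author`."""
    return [s for s, p, o in triples if p == DCT + "creator" and o == author]


if __name__ == "__main__":
    print(to_ntriples(TRIPLES))
    print(papers_by(ALICE, TRIPLES))
```

The filter in `papers_by` is trivial precisely because the data is structured; doing the same against HTML designed for human consumption would require an information extraction step first.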
After the 2014 Semantic Publishing Challenge, and on the verge of announcing the 2015 Challenge, we are technically ready to publish at least the metadata of all CEUR-WS.org papers as Linked Data. The information extraction tools developed by the participants of the 2014 Challenge, in particular the winning one by Maxim Kolchin and Fedor Kozlov, combined with some scripts for automating the publishing workflow, make it possible.
However, there is a legal obstacle. The editors of the proceedings volumes own the copyright, and, in particular, CEUR-WS.org never asked for their permission to re-publish derivatives of the metadata of workshops and papers. An RDF representation of a workshop’s table of contents is such a derivative, even if just w.r.t. the technical format, not w.r.t. the content. One may argue that the fact that someone published a paper somewhere is public, non-copyrightable information, and our tables of contents contain little more information than that. One may also argue that others have been publishing derivatives of the CEUR-WS.org metadata for a long time: DBLP indexes a subset of CEUR-WS.org with the consent of the CEUR-WS.org publisher, but not with the consent of the copyright owners, i.e. the proceedings editors. It even publishes these derivatives under an open license and makes them available as RDF Linked Data. This is widely regarded as fair use, but DBLP is doing so at its own risk – and would CEUR-WS.org itself want to run such a risk?
To be fair, CEUR-WS.org has been making an effort towards open data and linked data for a while: based on the results of a survey among former editors, the CC0 open data license became mandatory for metadata in 2014 (effective as of volume 1263). The first linked data enthusiasts published a volume annotated with machine-comprehensible RDFa attributes as early as 2009. RDFa became officially supported in 2013, and the ceur-make tool facilitates its generation – but this is still something for technophiles, used by fewer than 1 in 10 volume editors.
As a result, most of CEUR-WS.org’s data is neither open nor linked. We could wait until volume 2526, when CC0-licensed metadata will be in the majority, but thorough quality analysis requires a look back into the history of workshops, and the “old” proceedings volumes also still provide the majority of connection points to other linked open datasets, including DBLP, the Semantic Web Dog Food Corpus, COLINDA and even datasets of commercial publishers.
So, what can we do to open and to link the metadata of all volumes < 1263? Note that technically it is possible to partition a linked dataset and to give its different parts different licenses – CC0 for volumes ≥ 1263, and “all rights reserved” for volumes < 1263. The question is whether this is how we want to continue implementing our values.
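The partitioning mentioned above is technically simple, which is why the question is one of policy rather than engineering. As a minimal sketch, assuming volumes are identified by their number and follow the `http://ceur-ws.org/Vol-N/` URI pattern, and using `dcterms:license` as one possible (not agreed-upon) way to state the license, it could look like this in Python. The function names are hypothetical.

```python
CC0 = "http://creativecommons.org/publicdomain/zero/1.0/"
DCT_LICENSE = "http://purl.org/dc/terms/license"
FIRST_CC0_VOLUME = 1263  # CC0 is effective as of volume 1263


def license_for(volume_number):
    """Return the CC0 URI for volumes >= 1263, else None (all rights reserved)."""
    return CC0 if volume_number >= FIRST_CC0_VOLUME else None


def partition(volume_numbers):
    """Split volume numbers into a CC0 part and an all-rights-reserved part."""
    parts = {"cc0": [], "all_rights_reserved": []}
    for number in volume_numbers:
        key = "cc0" if license_for(number) else "all_rights_reserved"
        parts[key].append(number)
    return parts


def license_statements(volume_numbers):
    """Emit one N-Triples license statement per CC0 volume.

    Volumes without an open license get no statement at all.
    """
    return [
        f"<http://ceur-ws.org/Vol-{n}/> <{DCT_LICENSE}> <{CC0}> ."
        for n in volume_numbers
        if license_for(n) is not None
    ]


if __name__ == "__main__":
    volumes = [1155, 1262, 1263, 1300]
    print(partition(volumes))
    print("\n".join(license_statements(volumes)))
```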