Cobaltmetrics: Web-Scale Citation Tracking

    Aim

    With Cobaltmetrics, Thunken is on a mission to make altmetrics genuinely alternative. Traditional citation indexes have stringent inclusion criteria and focus on privileged publication venues. Altmetrics were designed to overcome some of these limitations, but most data providers still rely, in one way or another, on predefined lists of citable or indexable research outputs, so they only scratch the surface of the web. We argue that the only way forward is to embrace web-scale citation tracking.

    Methods

    Cobaltmetrics crawls the web to index hyperlinks and persistent identifiers as first-class citations. We analyze a wide range of websites to reveal insightful links between documents. Cobaltmetrics goes deeper than backlink databases and altmetrics aggregators to help you report on all types of content: publications, books, clinical trials, patents, software artifacts, derivative works, etc. The web is our corpus, and our URI transmutation API collates citations to all known versions of a document.
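    As a rough illustration of treating hyperlinks and persistent identifiers as citations, the sketch below pulls `href` targets and DOI-like strings out of a single HTML page using only the Python standard library. It is not Cobaltmetrics' crawler: the class name `CitationExtractor`, the function `extract_citations`, and the simplified DOI pattern are all assumptions made for this example.

```python
import re
from html.parser import HTMLParser

# Simplified DOI pattern (an assumption: real DOIs allow more characters).
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


class CitationExtractor(HTMLParser):
    """Collects href targets and DOI-like strings from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []  # hyperlink targets, in document order
        self.dois = []   # DOI-like strings found in text nodes

    def handle_starttag(self, tag, attrs):
        # Every <a href="..."> is recorded as a candidate citation.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Persistent identifiers may appear as plain text, not only as links.
        self.dois.extend(DOI_RE.findall(data))


def extract_citations(html):
    """Return (links, dois) found in the given HTML string."""
    parser = CitationExtractor()
    parser.feed(html)
    parser.close()
    return parser.links, parser.dois
```

    A production crawler would also need to resolve relative URLs, handle other identifier schemes, and trim trailing punctuation from matched identifiers; none of that is attempted here.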

    Cobaltmetrics combines the best of citation indices, altmetrics aggregators, and backlink databases.

    Citation indices like OpenCitations, Scopus, or Web of Science focus on citations between traditional scholarly publications. Our approach is both complementary and much broader. In Cobaltmetrics, we track citations between all types of content on the web, not only publications. We think that it is not up to citation aggregators to define what is citable, so we have no selection criteria based on a document's format, language, publication venue, persistent identifiers, etc.

    Altmetrics aggregators like Altmetric, Crossref Event Data, or Plum Analytics are quite similar to Cobaltmetrics. However, we think that they are not "alt" enough: for many data sources, they focus on data published in a handful of languages and/or apply restrictive selection criteria to the documents they index. Our goal is to go deeper: the web is our corpus, and we index all citations, no matter the language, the format, or the identifier. We also think that our URI transmutation API surpasses their search engines when it comes to aggregating or deduplicating results.

    On the web, backlinks and citations are similar objects. That being said, backlink databases lack an equivalent of our URI transmutation API, i.e. the ability to collate backlinks to all known versions of a document. With Cobaltmetrics, you can discover not only that a given page links to your content, but also all the short URLs and other identifiers that directly or indirectly identify your content.

    One of our core principles is that it is not up to altmetrics data providers to decide what is citable; our role is to observe all citation patterns on the web. The web is not FAIR (and will most likely never be), and that is just fine. To produce a corpus that is diverse and inclusive, we track all URIs: every hyperlink, every occurrence of a URI, is a citation. One of our biggest challenges is to collate URIs that directly or indirectly identify the same resource, so that citation counts and attention scores can be tallied accurately. We will present the design rationale of our URI transmutation API, and discuss how it relates to other approaches like meta-resolvers (e.g. identifiers.org and n2t.net) and PID graphs (e.g. FREYA).
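    To make the collation problem concrete, here is a minimal sketch, assuming that pairs of equivalent URIs are already known (for example, a short URL and the DOI it redirects to). It groups equivalent URIs with a union-find structure and tallies citations per group. This is a generic technique chosen for illustration, not a description of how the URI transmutation API works; `UriCollator`, `link`, and `tally` are hypothetical names.

```python
from collections import Counter


class UriCollator:
    """Groups URIs known to identify the same resource (union-find)."""

    def __init__(self):
        self.parent = {}  # maps each URI to its parent in the forest

    def find(self, uri):
        """Return the representative URI of the group containing `uri`."""
        self.parent.setdefault(uri, uri)
        root = uri
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[uri] != root:
            self.parent[uri], uri = root, self.parent[uri]
        return root

    def link(self, a, b):
        """Record that URIs `a` and `b` identify the same resource."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def tally(collator, cited_uris):
    """Count citations per resource rather than per URI spelling."""
    return Counter(collator.find(uri) for uri in cited_uris)
```

    With this structure, citations to `doi:10.1234/abcd`, to its `https://doi.org/` form, and to a short URL pointing at it all count toward one resource, provided the equivalences were supplied via `link`. Discovering those equivalences is the hard part and is outside the scope of this sketch.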

    Results and Discussion

    We will then move into a discussion of web-scale altmetrics, a.k.a. alt-altmetrics. We must forget all limitations regarding publishing formats, languages, APIs, and, most importantly, data sources. Metrics are a sampling game, and the web is our corpus. We have started building an infrastructure for web-scale altmetrics by ingesting the massive datasets produced by the CommonCrawl project. Cobaltmetrics is thus in no way restricted to the scholarly web, and we hope the corpus will be useful to other communities. We will discuss how Cobaltmetrics compares to other altmetrics data providers such as Altmetric, Crossref Event Data, and PlumX Metrics. We will then share the lessons we have learned in the past 18 months, including implementation choices, negative results, and tips to pull citation data at scale with our API.

    In particular, we will present results from the analysis of legal citations. Recent initiatives to open access to the law now make it possible to track and analyze legal data on a large scale. Cobaltmetrics partnered with CourtListener to explore the potential in tracking and analyzing citations to and from court opinions from all state and federal courts in the US. Evaluating legal data gives insight into how resources are used, how resources influence other courts and other resources, and how different resources are connected across jurisdictions. We will discuss the main challenges in extracting and normalizing citations in court opinions.
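    As a small illustration of what extracting and normalizing such citations can involve, the sketch below matches "volume reporter page" citations and maps spelling variants of a reporter to one canonical form. It is an assumption-laden example, not the method used by Cobaltmetrics or CourtListener: the `REPORTERS` table is a tiny hypothetical sample, and real opinions contain many more reporters and citation formats.

```python
import re

# Hypothetical, tiny reporter table: spelling variant -> canonical form.
REPORTERS = {
    "U.S.": "U.S.",
    "U. S.": "U.S.",
    "F.2d": "F.2d",
    "F. 2d": "F.2d",
    "F.3d": "F.3d",
    "F. 3d": "F.3d",
}

# Try longer variants first so that spaced spellings are not cut short.
_alternatives = "|".join(
    re.escape(name) for name in sorted(REPORTERS, key=len, reverse=True)
)
CITATION_RE = re.compile(
    rf"\b(\d{{1,4}})\s+({_alternatives})\s+(\d{{1,5}})\b"
)


def extract_legal_citations(text):
    """Return normalized 'volume reporter page' citations found in text."""
    return [
        f"{volume} {REPORTERS[reporter]} {page}"
        for volume, reporter, page in CITATION_RE.findall(text)
    ]
```

    Normalizing to a single spelling means that "410 U. S. 113" and "410 U.S. 113" are counted as citations to the same opinion rather than as two distinct strings.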

    Conclusion

    We will then present preliminary results regarding the most cited domains in the CourtListener corpus. We will conclude with a special announcement about Cobaltmetrics, linked data, and permissive data licenses!

    • Damien Vannson, Thunken

      Builder at heart, driven by the satisfaction of turning shower thoughts and back-of-the-envelope plans into full-fledged, user-friendly…

    • Luc Boruta, Thunken

      Ph.D. in computational linguistics, natural language processor, interested in linked data and linguistic diversity. In previous lives, Luc…