So long Zotero, and thanks for all the fish

If you're in or around academia, chances are pretty high that you use Zotero to manage your papers. I don't anymore.

For those who need a refresher, Zotero is paper and citation management software -- advertised as "Your personal research assistant". Along with its browser extensions, it lets you easily add and snapshot resources or papers into a library, view those resources offline, annotate, tag, and cite them. It integrates into your document preparation software of choice, syncs libraries between different machines, has support for shared libraries with collaborators, and, with the plugin system, does so much more. It's really cool software, entirely open-source, and I've been a Zotero proselytizer for many years. I still am: for most people, I think it's absolutely the right tool.

It's just not right for me.

This post will offer praise and criticism of Zotero, and introduce what I'm doing instead to manage (far too) many research papers and bibliographies.

Marmite

I love Zotero but I also hate Zotero. I've been saying this for a couple of years now. I have by no means explored all of Zotero's features, and rarely use many of the "online" ones: I sync to a WebDAV server I set up at home, and I avoid shared libraries like the plague. But the features I did use, I thought were very useful: for example,

  • Excellent snapshot system. Being able to keep a copy of a webpage exactly as it was when you visited it is invaluable.
  • Annotation and notes. Some of the UI choices in Zotero are really good, such as the small tooltip of highlight colours that pops up when you select text in a document. Highlighted text then appears in a sidebar, where you can add comments or links about that specific passage. It makes annotating documents really straightforward.
  • Export library as BibTeX. Absolutely, yes please. Why wouldn't that be good?

Some examples of things I found really not very good:

  • Search is slow and imprecise. I've indicated my frustration with the search in Zotero before, and wrote a command line tool that makes it better. I find the way search results are presented opaque: I don't immediately know where a particular term matched, I can't order the search results without also changing my library view, and refining searches means digging through an advanced search menu, when a syntax like NASA/ADS author:"Name" would be more than sufficient here.
  • Annotations aren't baked into documents. I sort of understand why they are stored in the SQLite database, but it's frustrating that I can't send a PDF to someone, or open it in a different reader, and keep the highlighting and annotations in place. There is an export feature that will bake the annotations in, but that's more clicks and things to remember.
  • Collections don't nest like you'd expect by default. If I had some collection (directory) A and within that is another collection A.B, then I would expect that any item in A.B would also naturally be a member of A. But if I look in collection A I only see things that are in A but not in A.B. This makes it quite tedious to organise things by default. I only recently discovered there is an option to "Show Items from Subcollections" in the View options, which I don't remember seeing in previous versions of Zotero.

There are a number of things that I think are double-edged swords: I understand how they are supposed to be good, but I think they are bad, or worse, could even lead to bad habits.

  • Tagging system. Tags can be added manually and are automatically added from the keywords of a paper from a given publisher. This means you likely have hundreds of tags already, none of which are particularly useful for searching across a library whose keywords come from different journals. It also generates weird strings as tags, such as profile--plasma. This goes hand in hand with the lack of any check for whether a tag already exists: you can very easily fat-finger your way into, or forgetfully end up having, X-ray, x-rays, xray, Xrays, x ray all as entirely separate tags, when they really should be the same thing.
  • Filesystem layout. When I gave a talk about Zotero I showed how Zotero stores items in the file system. Two people audibly gasped. It's great from a programmatic perspective having all of the directories be hash keys, but it makes it really difficult for people to navigate. Given that Zotero already has a way of 'uniquely' naming items based on the parent metadata, it would be great if the file system could work similarly, and if notes and annotations could be stored in plain text next to the PDFs instead of in the database. This would permit other tools to easily make use of Zotero's data, or even allow you to write your notes in a different editor and have Zotero use them for the convenience of displaying everything together.
  • Interoperability is difficult. Unless you're happy to use Zotero as your sole paper management tool, it's pretty tricky to develop things that will work with Zotero. Parts of the documentation that would make this sort of thing easier are simply missing -- for example, the documentation will not tell you how the connectors send a URI to Zotero. If you want to add a paper from the command line, you need to use Wireshark to sniff the packets sent by the connector and replicate the request (a sketch of replaying this with curl follows below):
    /connector/saveSnapshot
    {
      "url": "https://arxiv.org/pdf/2411.08554",
      "pdf": true,
      "detailedCookies": "...",
      "uri": "https://arxiv.org/pdf/2411.08554"
    }
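
To illustrate what replicating that request looks like, here's a sketch of replaying it with curl. This isn't documented behaviour I can vouch for: it assumes Zotero is running locally with its connector server on the default port 23119, and I haven't checked exactly which headers or cookie fields the endpoint insists on.

# Hypothetical sketch: replay the connector's saveSnapshot request by hand.
# Assumes Zotero is running and listening on the connector's default port 23119.
curl -X POST "http://127.0.0.1:23119/connector/saveSnapshot" \
    -H "Content-Type: application/json" \
    -d '{
          "url": "https://arxiv.org/pdf/2411.08554",
          "pdf": true,
          "uri": "https://arxiv.org/pdf/2411.08554"
        }'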

But the main thing that made me want to give up on Zotero is what I think is its most substantial problem: it is much easier to put information into Zotero than it is to get it out.

Buttle, Tuttle, and Information Retrieval

Putting information in is easy. If it's too easy to put information in, a lot of information will enter, not all of it useful. Getting information out is hard. The redundancy of useless information makes this harder.

The limited search and filtering has the knock-on effect that useless information isn't just wasted space: it actively hinders your ability to find the useful information. The tagging system, which could help here, is too unwieldy and undermines itself with automated tags.

Zotero also has a strange way of handling duplicates. You can add the same file as many times as you like, and each one will be a separate PDF on your disk. There is a list where it shows you all of the duplicate items, but the way it handles these is unsatisfying. The only option is to merge them into a single item, meaning you'll often end up with a lot of PDFs attached to a single parent. You can't delete just one, because the duplicate items tab only lets you select all of them at once. This requires manual cleanup, which is time-consuming, since you now have to find those parents with multiple (identical) children. The whole process is fiddly due to the point-and-click nature of the program, and arguably it should have warned you about the duplicates when you added them in the first place.

The metadata that Zotero extracts using the connectors is mostly pretty good, but it will often get the BibTeX entries very wrong, giving the illusion that it's making your life easier until you actually need them to write a peer-reviewed paper. Combine that with the duplicate handling, and you can end up with citation entries that don't correspond to the version of the PDF assigned as the default attachment of the parent item.

These smaller things would be less severe if it weren't just one click in the browser to add a paper.

But I think there's also something more fundamental here. Not everything needs to be saved. I want to be more engaged in the process of archiving my library, and not automate the parts that are essential for me to know what is in my library. Typing out metadata makes me momentarily think about what the paper is, and helps me remember that I have it. It also helps ensure that that metadata is meaningful, as I, the person who will be searching through the descriptors at a later date, will filter according to what I think is important and what my interests are. I'm also more likely to spot any duplicates in the process of doing so.

By making it a little bit harder to put information in, it makes it much easier to get information out.

Pulling out roots, sowing seeds, watering flowers

My new approach is to use the versatility of the PDF format to keep as much information as possible in the file instead of adjacent to it. PDFs can do a surprising amount, and although many of the spec's features are questionable -- like integrating JavaScript execution, leading to things like Linux running in a PDF -- at its core, it's just a deliberately limited PostScript file. I'd been told this by a few people before but never really understood what it meant. Opening a PDF file in a text editor very quickly demystifies the whole thing:

%PDF-1.7
%<80><80><80><80>

13 0 obj
<<
  /Type /Font
  /Subtype /Type0
  /BaseFont /YZVLWG+NimbusSanL-Bold
  /Encoding /Identity-H
  /DescendantFonts [14 0 R]
  /ToUnicode 16 0 R
>>
endobj
...

This is followed by lots of obj and stream entries, and what looks like binary-encoded data belonging to those streams, describing the content, fonts, and layout of the document.

I'm not going to explain the PDF spec here (I am still learning), but hopefully you look at the above and feel the same "oh right" feeling that I felt. Suddenly it makes sense.

The bit of the spec that I am interested in at the moment is the section on Metadata (Section 14.3 of the PDF 1.7 spec). My idea is to dump the information that Zotero was dumping into a SQLite database directly into the PDF, and then query it as a means of searching through my library. My thinking is, if I provide the tooling to easily pull this information out, I can use all my favourite command line tools to search and organise my library. A collection is a directory. Notes are files. Search is anything you want it to be.

Metadata can be encoded in the PDF file in one of two ways:

  1. The metadata stream: used to encode metadata via a stream. It can be specific to another component within the PDF (e.g. an embedded image), or apply to the entire document. Handling streams in PDFs requires quite a lot of work (see Table 6 in the PDF 1.7 spec for a list of the different filtering methods for stream data), but streams are consequently efficient and versatile, as they can be compressed in various ways.

    The contents of these streams is Extensible Metadata Platform (XMP) data, which is itself XML.

  2. The document information dictionary: a simple dictionary of key-value pairs. Table 317 of the PDF 1.7 spec has a list of entries that may appear in this section. Importantly, despite there being a list of common entries, the spec allows any arbitrary entry to be included here.

The document information dictionary has an additional benefit for my purpose over the stream: it must appear in the trailer of the PDF document, i.e. at the end. One problem I was anticipating with my approach is that to extract e.g. the title of the PDF, I would need to parse the whole PDF to read the metadata, but that is not so -- if we know the dictionary of interest is at the end of the file, we simply search for it starting at the end.
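
As a rough sketch of that idea, you don't even need a PDF library: grab the last couple of kilobytes of the file and look for the trailer there. (This assumes the trailer is stored as plain text, which is common but not guaranteed -- some PDFs use compressed cross-reference streams instead.)

# Illustration only: read just the tail of the file and look for the trailer
# dictionary there, instead of parsing the whole PDF. The -a flag makes grep
# treat the partly-binary input as text. (paper.pdf is a placeholder filename.)
tail -c 2048 paper.pdf | grep -a -A 6 'trailer'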

In a PDF I had at hand, the trailer looks like this:

trailer
<<
  /Size 56
  /Root 55 0 R
  /Info 53 0 R
  /ID [(B/5HKDg9V3MOW7//bPassg==) (B/5HKDg9V3MOW7//bPassg==)]
>>

The << and >> denote a PostScript map, with the keys of the map prefixed with / and the value anything that follows. The /Info key is the metadata key, and in this case it uses an object reference -- denoted by the R -- to say the metadata is in the object labeled 53. If we look at that object:

53 0 obj
<<
  /Creator (Typst 0.12.0)
  /CreationDate (D:20250131163031Z)
  /ModDate (D:20250131163031Z)
>>
endobj

we see the metadata.

We can use pdfinfo, part of the Poppler utilities (shipped as poppler-utils on most Linux distributions), to pull this information out:

$ pdfinfo test.pdf
Creator:         Typst 0.12.0
CreationDate:    Fri Jan 31 16:30:31 2025 GMT
ModDate:         Fri Jan 31 16:30:31 2025 GMT
Custom Metadata: no
...

Modifying object 53 in a text editor will modify this output:

$ pdfinfo test.pdf | grep Creator
Creator:         Slartibartfast

There's no reason why we need to refer to an object here either; PostScript lets us nest dictionaries directly:

trailer
<<
  /Size 56
  /Root 55 0 R
  /Info <<
    /Creator (Mice)
    /CreationDate (D:20250131163031Z)
    /ModDate (D:20250131163031Z)
  >>
  /ID [(B/5HKDg9V3MOW7//bPassg==) (B/5HKDg9V3MOW7//bPassg==)]
>>

Now:

$ pdfinfo test.pdf | grep Creator
Creator:         Mice

If we add a new entry to this dictionary, pdfinfo will not show it but report that there is custom metadata present in the file. That is because pdfinfo is looking for the common metadata as defined in the spec, but adding custom metadata still gives a perfectly compliant PDF.

This is all just string editing, nothing special needed.
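
To convince yourself of that, you don't need any PDF tooling at all -- a binary-safe grep pulls the entry straight out of the file from above:

# The Info entries are plain text in the file, so ordinary text tools can see them.
# With the edit above, this should print: /Creator (Mice)
grep -a -o '/Creator ([^)]*)' test.pdf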

So here's the plan.

Bash the state

Squeeze all of the metadata associated with a paper into the PDF file, and use command line scripts to do the squeezing and the retrieval. These scripts will just be simple Bash scripts to begin with, making use of tools like pdfinfo and the brilliant exiftool.

This approach does have a performance bottleneck for querying the library. Since the metadata parsing tools parse the whole PDF, including the metadata of components within the file and not just of the file itself, this can be a slowish process when done many times over. But as we saw above, the metadata can be found from the trailer, which is always at the end of the file, leading to an enormous optimisation that we can make. My plan is to develop the tooling for this project organically, and solve problems as they come up instead of anticipating them, but this particular one I will address later in this post.

The bits that are important to me to begin with:

  1. Putting information in: most of the papers I am interested in are from the arXiv, or PDFs that I can download. I want to avoid making it too easy to clutter the library, which is achieved by manually entering the metadata. So, to add a new paper I want to be able to do something like

    add_to_library URL_OR_FILEPATH AUTHORS YEAR TITLE[ ...]
    # example
    add_to_library \
        "https://arxiv.org/abs/1804.04024v1" \
        Doyle+Sethi 2018 Conway\'s doughnut

    I want this to create a file whose name contains the surnames of the authors and the year, so that it's easy to find a paper by the first few authors. Pretty much everything else will get baked into the PDF.

  2. Organising: tags are excellent ways of associating papers with a given subject. I can more or less justify putting tags into the Keywords section of the PDF metadata, as defined by the spec. These will be space-delimited and kebab-case. To avoid fracturing tags as with my x-ray example, there will be a list of 'accepted tags' that the system will validate against.

    For 'collections', I can simply make new directories and symlink or even hard-link files around.

  3. Getting information out: I want to be able to easily query tags and find papers about a given subject.

    My first instinct is to reach for fzf to make it possible to search through the metadata (e.g. the titles) interactively (a one-liner sketch of this follows after this list), but I know I will likely end up using my own fuzzy finder library to make more specific tools.

  4. Citation information. A bibtex file at the root directory that I populate as and when needed is sufficient for now.
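
As a flavour of the fzf idea from point 3, an interactive title picker only needs a one-liner. This is a sketch rather than one of my actual scripts -- it assumes GNU xargs and an xdg-open-style opener, and it calls pdfinfo once per file, so it is not fast:

# Sketch: build "filename <TAB> title" lines, fuzzy-pick on the title, open the file.
for pdf in *.pdf; do
    printf '%s\t%s\n' "$pdf" "$(pdfinfo "$pdf" | grep '^Title:' | cut -d: -f2- | sed 's/^ *//')"
done | fzf --delimiter='\t' --with-nth=2 | cut -f1 | xargs -r xdg-open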

As it stands, I've implemented a handful of small bash scripts and have been trying this system for the last month or so. It's worked really well, and I rarely even have to resort to using anything other than the paper-adding scripts, since ls and grep make it very easy to check whether I have a certain author. Using ls -tr I can also list the papers in order of last access (something that Zotero doesn't do, to the best of my knowledge?), which is enormously helpful.

In total, I have the following scripts:

  • arxiv.sh: download and add a file from the arXiv.
  • add.sh: copy a file by path and add to the library.
  • list.sh: list all of the files in alphabetical order by first author, along with the tags that have been applied.
  • info.sh: get info about a single file.

These sit in the root directory of my library along with all of the PDF files. To avoid too much code duplication, all of the utility is implemented in a lib.sh, so that the scripts themselves are only a handful of lines that check arguments and then call functions in the library.

I also made a few little additions to my original plan:

The scripts for adding from various sources (arxiv.sh and add.sh) all look something like the script below. I get them to append the command that I used to fetch the paper into a _sources file for data provenance, so that I could recreate my library by executing the _sources file, and, importantly, can use it to work out where I got a file from.

#!/bin/bash

set -e

# capture the full argument list (as a single string) for provenance
ORIGINAL_COMMAND="$*"
# tags that are added automatically
DEFAULT_KEYWORDS="unread"

# do a string replace on abs so I can copy the arxiv PDF or abstract url
URL="${1//abs/pdf}" ; shift
AUTHOR="$1" ; shift
YEAR="$1" ; shift
TITLE="$*" # remaining arguments form the title

OUTFILE="${AUTHOR}_${YEAR}.pdf"

# download the file
curl "$URL" --output "$OUTFILE"

# append to _sources
echo "./arxiv.sh $ORIGINAL_COMMAND" >> _sources

# write the metadata
exiftool -Author="$AUTHOR $YEAR" \
    -Title="$TITLE" \
    -Keywords="${DEFAULT_KEYWORDS}" \
    -overwrite_original \
    "$OUTFILE"

I save the authors and year into the Author field as it seemed to make sense at the time. So when I query the metadata, I see:

$ pdfinfo Doyle+Sethi_2018.pdf
Title:           Conway's doughnut
Author:          Doyle+Sethi 2018
Keywords:        unread
Creator:         LaTeX with hyperref package
Producer:        pdfTeX-1.40.17
CreationDate:    Thu Apr 12 01:22:32 2018 BST
ModDate:         Thu Apr 12 01:22:32 2018 BST

The tagging system reads and modifies the Keywords entry, and uses a newline-separated list of 'allowed tags' from a file I call _tags. If I try to tag something with a tag that is not in this list, it will fail with an error.

$ ./tag curio Doyle+Sethi_2018.pdf
Error: Not a valid tag: curio

$ ./tag funny,mathematics Doyle+Sethi_2018.pdf
Doyle+Sethi_2018.pdf
    1 image files updated
Added tag: funny mathematics

A fun quirk of exiftool is that it thinks all files are images :p
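
For completeness, the validation step amounts to little more than a lookup in the _tags file. The helper below is a hypothetical sketch of that idea, not the actual code in lib.sh:

# Hypothetical sketch of the tag check: every requested tag must appear
# verbatim (as a whole line) in the _tags allow-list.
validate_tags() {
    local tag
    for tag in "$@"; do
        if ! grep -qxF "$tag" _tags; then
            echo "Error: Not a valid tag: $tag" >&2
            return 1
        fi
    done
}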

I can list the library quite straightforwardly. I show only an excerpt below, as I have 108 papers in the library at the time of writing. I haven't moved everything over from Zotero yet, because I don't think I actually want a lot of the papers I have there.

The output is colourful in the terminal, but I haven't yet implemented coloured terminal output on my website:

$ ./list.sh
...

Wilkins+Gallo, 2014
Title : the comptonisation of accretion disc X-ray emission: consequences for x-ray reflection and the geometry of AGN corona
Tags  : unread agn reflection x-ray corona
Wilkins+Gallo_2014.pdf

XRISM-Collaboration, 2025
Title : XRISM reveals low non-thermal pressure in the core of the hot relaxed galaxy cluster Abell 2029
Tags  : unread clusters xrism x-ray
XRISM-Collaboration_2025.pdf

...

To list only those things with a given tag, I can use:

$ ./list.sh | grep -A1 -B3 ixpe

I've folded this sort of filtering into the list.sh script, so e.g. I can do something like:

$ ./list.sh author=Fabian

I've found I don't really use that very much.
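
Even so, for the record, that kind of key=value filtering only needs a few lines. The following is a simplified sketch of the idea rather than what list.sh actually does:

# Sketch: print papers whose pdfinfo field matches "key=value", e.g. author=Fabian.
KEY="${1%%=*}"
VALUE="${1#*=}"
for pdf in *.pdf; do
    if pdfinfo "$pdf" | grep -i "^${KEY}:" | grep -qi -- "$VALUE"; then
        echo "$pdf"
    fi
done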

I use Okular as my PDF reader, which is working absolutely fine, and it squashes all of my annotations into the file itself.

Quick list

For my 108 papers, it currently takes around 2 seconds to print all the information about them using list.sh. To speed things up a little bit I took what I learned about how metadata is encoded into PDFs and wrote a little Zig program that pulls this information out.

The program is extremely simple: it uses mmap to map the PDF into memory, then searches for the last occurrence of /Info and checks whether it is a map or a reference to an object. If it is a reference, it searches backwards through the file for that object to find the map.

The map is then parsed into key-value pairs and formatted to standard output.

I've made the source code available here.

It's pretty fast. Using poop to benchmark:

bibl (main) $ poop 'pdfinfo Doyle+Sethi_2018.pdf' './zig-out/bin/bibl Doyle+Sethi_2018.pdf'
Benchmark 1 (585 runs): pdfinfo Doyle+Sethi_2018.pdf
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          7.91ms ± 1.26ms    5.82ms … 15.4ms          2 ( 0%)        0%
  peak_rss           13.9MB ±  111KB    13.5MB … 14.2MB          5 ( 1%)        0%
  cpu_cycles         8.19M  ± 5.93M        0   … 15.4M           0 ( 0%)        0%
  instructions       17.0M  ± 12.0M        0   … 29.8M           0 ( 0%)        0%
  cache_references   73.6K  ± 51.7K        0   …  134K           0 ( 0%)        0%
  cache_misses       23.6K  ± 16.5K        0   … 59.2K           0 ( 0%)        0%
  branch_misses      74.6K  ± 57.9K        0   …  146K           0 ( 0%)        0%
Benchmark 2 (10000 runs): ./zig-out/bin/bibl Doyle+Sethi_2018.pdf
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           229us ± 46.0us     171us … 1.06ms        502 ( 5%)        ⚡- 97.1% ±  0.3%
  peak_rss            958KB ± 1.29KB     913KB …  958KB         10 ( 0%)        ⚡- 93.1% ±  0.0%
  cpu_cycles         1.99K  ± 9.69K        0   … 68.3K         438 ( 4%)        ⚡-100.0% ±  1.4%
  instructions        938   ± 4.53K        0   … 23.6K         438 ( 4%)        ⚡-100.0% ±  1.4%
  cache_references   17.7   ± 87.2         0   …  770          438 ( 4%)        ⚡-100.0% ±  1.4%
  cache_misses       1.56   ± 10.4         0   …  265          428 ( 4%)        ⚡-100.0% ±  1.4%
  branch_misses      9.24   ± 47.2         0   …  363          438 ( 4%)        ⚡-100.0% ±  1.5%

This brings my time-to-list down to 1.2 seconds, which is noticeably faster. I'm currently using three invocations of grep together with cut to pull out the fields of interest, but if I specifically format the output of my little program to be more helpful, I can use three sed -n ${N}p instead.
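
Concretely, if I make the program print, say, the title, author, and keywords on fixed lines (an assumed output format, purely for the sake of the example), the extraction becomes:

# Sketch, assuming bibl prints Title, Author, and Keywords on lines 1-3.
INFO="$(./zig-out/bin/bibl Doyle+Sethi_2018.pdf)"
TITLE="$(sed -n 1p <<< "$INFO")"
AUTHOR="$(sed -n 2p <<< "$INFO")"
TAGS="$(sed -n 3p <<< "$INFO")"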

This cuts the search down to 0.5 seconds, which is perfectly acceptable for now. At some point I will entirely replace the list.sh script (and the rest) with the Zig program, but for a first iteration, this suits me well, and lets me continue to prototype things quickly in Bash.

Small fish in a big pond

So, that's what I've been using to manage my papers. As I say, I think Zotero is the right tool for a lot of people, but I think my tool is right for me. By making it harder to put information in, all of the papers I have in my library are genuinely interesting and easy to search for. The allowed-list of tags stops me from losing papers, and automatically applying the unread tag means I know which papers are still uncatalogued. Since it's all now command line based, I have access to all sorts of fun tools for interacting with and querying the files, making the system highly interoperable.

My synchronisation method is currently a combination of rsync to a server at our flat, and snapper periodically backing things up.

I'm curious to see how this will go as I write my thesis. I will post any substantial updates as time goes by. Currently, I find I spend less time getting frustrated at duplicates, poor searches, and where my annotations are going, and more time reading papers, which was the ultimate goal.