Idioms and dictionaries
08 December 2024
When you don't know the meaning of a word or phrase, what do you do?
I think most people today will open a web browser, type define telluric
, see the results in their search engine, maybe there's a good definition there already, maybe not, so then they click through to a dictionary website, decline cookies, scroll down past some ads, and, finally, find the meaning of the word. At least, this is what I had been doing.
Some months back I began to use dict
. If you haven't encountered it before, then I think the description from the manual best sums up what dict
is:
dict is a client for the Dictionary Server Protocol (DICT), a TCP transaction based query/response protocol that provides access to dictionary definitions from a set of natural language dictionary databases.
The page is dated 15 February 1998. A month before MathML was released, for those who read my previous blog post. DICT as a protocol dates to 1997. This is another blog post that can be filed under "fergus rediscovers software from the 90s".
Dictionary Server Protocol
DICT, and the command line tool dict
, are extremely cool. To illustrate with an example:
$ dict telluric
1 definition found
From The Collaborative International Dictionary of English v.0.53 [gcide]:
Telluric \Tel*lu"ric\, a. [L. tellus, -uris, the earth: cf. F.
tellurique.]
1. Of or pertaining to the earth; proceeding from the earth.
[1913 Webster]
Amid these hot, telluric flames. --Carlyle.
[1913 Webster]
2. (Chem.) Of or pertaining to tellurium; derived from, or
resembling, tellurium; specifically, designating those
compounds in which the element has a higher valence as
contrasted with {tellurous} compounds; as, telluric acid,
which is analogous to sulphuric acid.
[1913 Webster]
{Telluric bismuth} (Min.), tetradymite.
{Telluric silver} (Min.), hessite.
[1913 Webster]
You give dict a word, and it gives you definitions. You can also add different dictionaries: I have the GNU Collaborative International Dictionary of English (GCIDE) and a thesaurus on my machine, both available via my distribution's package manager.
DICT has a client / server architecture. You can run your own DICT server using the dictd
service, and can configure it to serve all sorts of different dictionaries -- and this is how it works "locally", as in, your DICT server is only listening on localhost
. But if your dictionary becomes so large you can't keep it on your machine, or is changing so quickly it's a pain to maintain, then you can make TCP requests to get definitions from other places.
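For the curious, the wire format is simple enough to sketch. Below is a minimal illustration of building a DEFINE command and parsing a definition header, assuming the line formats described in RFC 2229; the sample response line is made up for illustration, not real server output.

```python
import shlex

# The DICT protocol (RFC 2229) is plain text over TCP, port 2628:
# the client sends commands like "DEFINE <database> <word>" and the
# server replies with numbered status lines, much like SMTP or FTP.

def define_command(database: str, word: str) -> str:
    # "!" asks for the first database with a match; "*" asks all of them
    return f"DEFINE {database} {word}\r\n"

def parse_definition_header(line: str) -> dict:
    # A 151 line introduces one definition:
    #   151 "word" database "human-readable database description"
    code, rest = line.split(" ", 1)
    if code != "151":
        raise ValueError(f"not a definition header: {line!r}")
    word, db, desc = shlex.split(rest)
    return {"word": word, "database": db, "description": desc}

print(define_command("!", "telluric").strip())
print(parse_definition_header(
    '151 "telluric" gcide "The Collaborative International Dictionary"'
))
```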
Large dictionaries are less of a problem than you might think. Because this software dates from before ubiquitous and large storage, it uses a special compression system with an index table to keep definitions. GCIDE has over 203 thousand definitions, but the DICT compressed .dz
dictionary is 13MB.
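The core idea can be sketched in a few lines. This is my own toy simplification, not the real .dz format: compress each definition separately and keep an index of (offset, length), so a lookup only decompresses the bytes it needs rather than the whole file.

```python
import zlib

# Toy illustration of the dictzip idea (not the actual .dz layout):
# per-entry compression plus a small index enables random access.
entries = {
    "telluric": "Of or pertaining to the earth; proceeding from the earth.",
    "hessite": "Telluric silver.",
}

blob = bytearray()
index = {}  # headword -> (offset, length) into blob
for word, definition in entries.items():
    compressed = zlib.compress(definition.encode())
    index[word] = (len(blob), len(compressed))
    blob += compressed

def lookup(word: str) -> str:
    offset, length = index[word]
    return zlib.decompress(bytes(blob[offset:offset + length])).decode()

print(lookup("hessite"))  # Telluric silver.
```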
The "new" compression format isn't prohibitive, and it's easy to add your own dictionaries: dictfmt
can convert all sorts of different markup-based dictionaries into the correct indexed format, and dictzip
will compress or decompress it for you. More than that, you can convert jargon files, which you can easily write by hand, into DICT dictionaries:
:head words:My definition.
I don't think you quite realise how useful that is. I didn't until recently. Stick all the definitions and weird things you need to remember for your interest (acronyms, stars or planets, quotes, artists) into a jargon file with the above format, and put whatever you like as the definition. Then, run a quick
dictfmt --utf8 --allchars -s "Black Holes" -j black-holes < black-holes.txt
Add it to your dictd
local configuration, and now you've got a fast and very flexible lookup tool:
$ dict -s substring m87
From Black Holes [black-holes]:
M87* (Messier 87), Virgo A or NGC 4486
A supermassive black hole in the Messier 87 elliptical galaxy
M87 is located in the Virgo constellation. The source is notable as the
first supermassive black hole to be observed by the Event Horizon
Telescope Collaboration.
Estimated to be 7.22 * 10^9 Solar Masses.
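The "add it to your dictd configuration" step above looks roughly like this; the paths and database name here are illustrative, and the exact location of your dictd configuration file will depend on your distribution:

```
database black-holes {
    data  /var/lib/dictd/black-holes.dict.dz
    index /var/lib/dictd/black-holes.index
}
```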
Yes, that -s
is the search strategy. You can change how dict
resolves headwords, so you can have more flexible searches. To list the strategies my local server supports:
idioms $ dict -S
Strategies available:
exact Match headwords exactly
prefix Match prefixes
nprefix Match prefixes (skip, count)
substring Match substring occurring anywhere in a headword
suffix Match suffixes
re POSIX 1003.2 (modern) regular expressions
regexp Old (basic) regular expressions
soundex Match using SOUNDEX algorithm
lev Match headwords within Levenshtein distance one
word Match separate words within headwords
first Match the first word within headwords
last Match the last word within headwords
Those strategies are configured on the server side. Yes, they can be plugins: if a search strategy is not working for you, no problem, make one that does.
I hope this has convinced you that DICT is extremely cool, and at the very least to try it.
A dictionary for idioms
The main part of this blog post is going to be about making a dictionary of idioms.
Using DICT has genuinely changed my relationship with language, as I can so easily and quickly explore new words. I've spent several evenings going down rabbit holes, learning weird relationships between words I'd never realised before. I can pipe the output to other programs to help me find what I'm looking for, and trivially use a thesaurus when I'm writing. I've configured my shell to never delete history, so I have a grep-able history of all the words I've looked up in my dictionary.
When I'm in a seminar and someone says a word I don't fully understand (telluric was one such word), it's now trivial for me to get a short but informative definition, and even elements of the etymology. I'd really like to have the same thing for idioms.
Sourcing a collection of defined idioms is not that easy. There's The Idioms, a website which seems to be aimed at non-native English speakers for learning the language. There's about 1700 idioms on that page, with descriptions and examples. Their site is accessible and friendly static HTML. I could write a simple xidel and ripgrep expression to extract the idioms along with their meanings, and format it in the jargon format:
xidel -se '//div[@class="idiom"]' --xml "$PAGE_HTML" \
| rg '<strong><a href="(.+?)">(.*?)</a>.+Meaning:</strong> (.*?)</p>' -or ':$2:$3' \
>> "$JARGON_FILE"
Then grease the wheels and put this into a loop to enumerate all of their pages (and put a few seconds delay between each wget
because you're a good denizen), and within a few minutes there's a complete dictionary of idioms that you can use in DICT.
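For illustration, the same extraction could be done in Python against a saved page. The HTML snippet here is a simplified assumption about the markup, not the site's exact structure, but the regex mirrors the ripgrep expression above:

```python
import re

# Simplified, assumed shape of an entry on the page
sample_html = '''
<div class="idiom"><strong><a href="/break-the-ice">break the ice</a></strong>
<p><strong>Meaning:</strong> to ease initial social awkwardness</p></div>
'''

pattern = re.compile(
    r'<strong><a href="(.+?)">(.*?)</a>.*?Meaning:</strong> (.*?)</p>',
    re.DOTALL,
)

# Emit one ":headword:definition" jargon line per idiom
jargon_lines = [f":{idiom}:{meaning}"
                for _href, idiom, meaning in pattern.findall(sample_html)]
print(jargon_lines[0])  # :break the ice:to ease initial social awkwardness
```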
So far, so good. Most of the idioms are, however, quite common, things like "God forbid", "as clear as mud", "spice things up", "bide time", or "to not hurt a fly". Some of the ones I would maybe have expected, such as "between hammer and anvil", or "flower of one's youth", aren't present in their collection.
In too deep
I was searching around for other lists of idioms and quickly stumbled on the Wiktionary Category of English idioms: a list of all pages that are in the category of English idioms. At the time of writing, this page states:
The following 200 pages are in this category, out of 9,250 total.
Nearly every entry is unique. There's a little overlap (for example, "albatross around one's neck" and "albatross round one's neck" are, for whatever reason, two separate pages), but the amount of information present here is excellent.
A friend told me some time ago that you can download Wikipedia's latest textual content in a compressed format. Wikimedia prefers people access their data this way rather than through thousands of network requests. The compression is BZIP2, which is a block-based compression algorithm. So, alongside the dumps are index files, which contain a line of information for each page in the form:
byte-offsets:page-id:title
This is very similar to what DICT is doing.
You would then look in the index file for the title or ID of the page you're interested in, skip to the offset in the compressed file and extract a block (between 100kb and 900kb depending on compression quality). This then contains some XML which tells you about the page, the last person to edit it, when it was last edited, any comments, and the page Wikitext.
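Parsing such an index line is a one-liner; the values below are made up for illustration, not taken from a real dump:

```python
# byte-offsets:page-id:title; the title itself may contain ":",
# so split at most twice from the left
line = "654868:9181:jump rope"  # hypothetical index entry

offset, page_id, title = line.split(":", 2)
offset, page_id = int(offset), int(page_id)
print(offset, page_id, title)
```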
Dumps exist for all sorts of different wikis, primarily categorised by their languages. I was able to download the entirety of the English language Wiktionary in a single multi-stream compressed 1.6G blob. The index, once extracted, is 315MB of text. But a crucial bit of information is missing, namely which pages are part of the English idioms category.
The way the wind is blowing
Since the category relations are metadata about Wiktionary's pages, we need another dump, this time from the relational database that Wiktionary uses as part of its search and presentation. The SQL dumps can be found in the same place as all the other downloadables, and I had a guess as to which one would have the category information in it. It turns out the one we are interested in is quite big:
7.6G enwiktionary-latest-categorylinks.sql
This database contains a mapping from page ID to category. There's another SQL dump that I downloaded first, which is also quite useful, namely enwiktionary-latest-category.sql
(only 50MB). This one contains the category ID, the category name, and how many pages are in the category. I was using this one to find which categories I was interested in, and could get the page IDs from the other.
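For instance, to browse for candidate categories (column names as in MediaWiki's standard schema; worth double-checking against the dump itself):

```sql
-- Search the smaller category table for likely category names
SELECT cat_id, cat_title, cat_pages
FROM category
WHERE cat_title LIKE '%idioms%'
ORDER BY cat_pages DESC;
```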
I used a MariaDB (10.6.19, other versions didn't seem to work?) container with podman
and imported the dumps. The categorylinks dump took about 7 hours to import (!!!), but after that I could start querying things about how the pages relate:
MariaDB [wiki]> SELECT COUNT(*) FROM categorylinks;
+----------+
| count(*) |
+----------+
| 67605410 |
+----------+
1 row in set (1 min 19.407 sec)
MariaDB [wiki]> SELECT UNIQUE(cl_type) FROM categorylinks;
+---------+
| cl_type |
+---------+
| page |
| subcat |
| file |
+---------+
3 rows in set (1 min 22.917 sec)
MariaDB [wiki]> SELECT COUNT(*) FROM categorylinks WHERE cl_type = "page";
+----------+
| COUNT(*) |
+----------+
| 63277124 |
+----------+
1 row in set (1 min 19.407 sec)
Queries were slow. There are over 63 million entries for pages in this database. If we look at those in the English_idioms
category:
MariaDB [wiki]> SELECT COUNT(*) FROM categorylinks WHERE cl_to="English_idioms";
+----------+
| COUNT(*) |
+----------+
| 9247 |
+----------+
1 row in set (0.037 sec)
Which is very close to the actual reported number on the live version of Wiktionary.
There's a few columns in the big database, but there's only two that we need, namely the page ID (cl_from
) and the category it is mapped to (cl_to
). We can dump just the IDs to a file and move on to the next step:
SELECT cl_from
FROM categorylinks
WHERE cl_to="English_idioms"
INTO OUTFILE 'idiom-page-ids';
Lucky dip
The next thing to find is the offset into the compressed Wiktionary data for each page filed as an English idiom. I wrote a quick-and-dirty little Python script to do that matching: it finds just those index entries that also appear in the database-dumped IDs:
#!/usr/bin/python
with open("./idiom-page-ids", "r") as f:
    data = [int(i) for i in f.read().split() if i]

# sets do lookups in constant time, whereas lists have O(n)
data = set(data)

idioms = []
with open("./enwiktionary-latest-pages-articles-multistream-index.txt", "r") as f:
    line = f.readline()
    i = 1
    while line:
        print(f"Reading line {i}", end="\r")
        # index lines have the form offset:page-id:title
        page_id = int(line.split(":")[1])
        if page_id in data:
            idioms.append(line)
        line = f.readline()
        i += 1

print(f"Matched {len(idioms)} idioms")
with open("./idiom-info", "w") as f:
    f.write("".join(idioms))
Now it's just a matter of decompressing each chunk and reading out the XML for that page. I won't put the entire script here (it tangled quickly), but I'll share the non-trivial extraction function.
There's a toy example that the Wikimedia website points to for using the multi-stream compressed blobs. I used the same BZIP2 extraction code, but handled the XML slightly differently. The standard Python XML library expects the XML document to have no errors and to have only a single root. Since we're streaming compressed blocks, it's not guaranteed that we a) have fully valid XML, and b) have a single root. So we stream the decompressed XML in page chunks as well:
def extract_idioms(offset: int, ids: set[int]) -> list[IdiomRaw]:
    decompressor = bz2.BZ2Decompressor()
    with open("enwiktionary-latest-pages-articles-multistream.xml.bz2", "rb") as f:
        f.seek(offset)
        block = f.read(262144)  # magic number from the toy example
        ddata = decompressor.decompress(block).decode()

    matched_pages = []
    start = -1
    while True:
        start = ddata.find("<page>", start + 1)
        if start == -1:
            break
        end = ddata.find("</page>", start + 1)
        if end == -1:
            break
        page_content = ddata[start : end + 7]  # 7 == len("</page>")
        try:
            root = ElementTree.fromstring(page_content)
        except Exception:
            logger.exception("Skipping")
            continue
        identifier = int(root.find("./id").text)
        if identifier in ids:
            matched_pages.append(root)

    # extract the idiom information of interest
    idioms = []
    for m in matched_pages:
        idiom = IdiomRaw(m.find("./title").text, m.find("./revision/text").text)
        idioms.append(idiom)
    return idioms
This isn't perfect but it's pretty good; the only ones that failed were page IDs that I wasn't interested in, so I was able to pull out all 9247 idiom definitions this way.
A mess of pottage
Page content extracted, it's still not massively informative for making a dictionary of idioms. The titles of the pages make good headwords, but the Wikitext is a little indecipherable. It looks something like this:
===Alternative forms===
* {{alt|en|jumprope|jump-rope}}
===Pronunciation===
* {{enPR|jŭmp rōp}}, {{IPA|en|/dʒʌmp ɹəʊp/}}
** '''Noun:''' ''jump'' always stressed (&quot;used a '''''jump''''' rope&quot;)
** '''Verb:''' ''rope'' sometimes stressed, but stress on ''jump'' also common (&quot;let's jump '''''rope'''''&quot; or &quot;let's '''''jump'''''-rope&quot;)
*** Verb tenses always as follows: ''jumps '''rope'''''; '''''jump''' ropes''; ''jumping '''rope'''''; '''''jump''' roping''; ''jumped '''rope'''''; '''''jump''' roped''.
* {{audio|en|En-au-jump rope.ogg|a=AU}}
===Noun===
[[Image:Funchal statue girl in Botanic Garden A.jpg|thumb|Statue of jumping rope]]
{{en-noun|~}}
# {{lb|en|uncountable}} (also '''jump-roping''', '''jumping rope''') The [[activity]], [[game]] or [[exercise]] in which a person must [[jump]], [[bounce]] or [[skip]] [[repeatedly]] while a length of rope is [[swing|swung]] over and under, both ends held in the hands of the [[jumper]], or alternately, held by two other [[participant]]s.
# The [[length]] of [[rope]], sometimes with [[handle]]s, [[casing]] or other [[addition]]s, used in that [[activity]].
# {{lb|en|colloquial}} A single jump in this game or activity, counted as a measure of achievement.
#* '''2001''', [[w:Matt Groening|Matt Groening]], “[[w:The Cyber House Rules|The Cyber House Rules]]”, ''[[w:Futurama|Futurama]]'', [[w:List_of_Futurama_episodes#Season_3:_2001-2002|season 3, episode 11]], &lt;small&gt;[[infosphere:Transcript:The Cyber House Rules|transcript]]&lt;/small&gt;
#*: '''S&lt;small&gt;ALLY&lt;/small&gt;:''' One time, I did a hundred '''jump ropes'''.
====Synonyms====
* {{sense|the game}} {{l|en|skipping}}, {{l|en|skip rope}}
* {{sense|the rope}} {{l|en|skipping rope}}
* {{sense|single jump}} {{l|en|rope-jump}}
====Translations====
{{trans-top|game or activity}}
* Arabic: {{t|ar|نَط اَلْحَبْل|m}}
* Catalan: {{t|ca|saltar a corda|m}}
* Chinese:
*: Mandarin: {{t+|cmn|跳繩|tr=tiàoshéng}}
* Danish: {{t+|da|sjippetov|n}}
* Dutch: {{t+|nl|touwtjespringen|n}}
* Esperanto: {{t|eo|ŝnursaltado}}
* Finnish: {{t+|fi|naruhyppely}}
* French: {{t+|fr|saut à la corde|m}}, {{t+|fr|corde à sauter|f}}
* German: {{t+|de|Seilspringen|n}}
* Hungarian: {{t|hu|ugrókötelezés}}
* Icelandic: {{t|is|sippuband|n}}
* Japanese: {{t+|ja|縄跳び|tr=なわとび, nawatobi}}
* Marathi: {{t|mr|दोरीउड्या}}
* Mongolian: {{t|mn|дээсээр тоглох}}
* Norwegian: {{t|no|hoppetau}}
* Polish: {{t+|pl|skakanka|f}}
* Portuguese: {{t|pt|pula corda|m}} {{q|Brazil}}, {{t|pt|saltar à corda}}
* Russian: {{t+|ru|скака́лка|f}}
* Spanish: {{t+|es|comba|f}}
* Swedish: {{t+|sv|hopprep}}
* Tagalog: {{t|tl|lubid-luksuhan}}
{{trans-bottom}}
The curly-braced terms are templates, which get expanded into links or formatting. It's not easy to quickly see where the information related to the definition is. There's a few good libraries for parsing Wikitext in Python, though; I'll mention mwparserfromhell in particular as it's the one I ended up using. With it, you can parse the text, filter for just the templated bits, and replace them:
wikicode = mwparserfromhell.parse(text)
templates = wikicode.filter_templates()
for t in templates:
    # replace the template with its last parameter
    wikicode.replace(t, str(t.params[-1]))
print(str(wikicode))
This will take something like {{l|en|rope-jump}}
, a link to the English definition of rope-jump
, and replace it with the string rope-jump
.
The translations are not of particular interest for my dictionary of idioms, so I added some code that drops the things I don't care about. In the end, I kept the raw definition of the idiom itself (which comes under the heading "Noun", "Phrase", "Adjective", etc.) and some of the quoted examples of it, similar to what GCIDE does.
The parser as it stands doesn't do much more than that, and it's not perfect. There's a lot of edge cases that I need to handle, but it's good enough.
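As a rough sketch of the approach (my own reconstruction, not my actual parser): split the Wikitext on its ===Heading=== lines and keep only the sections under a part-of-speech-style heading:

```python
import re

# Headings to keep; everything else (Translations, Pronunciation, ...)
# is dropped. This list is illustrative, not exhaustive.
KEEP = {"Noun", "Verb", "Adjective", "Adverb", "Phrase", "Prepositional phrase"}

def keep_sections(wikitext: str) -> str:
    # re.split with one capture group yields
    # [preamble, heading1, body1, heading2, body2, ...]
    sections = re.split(r"^=+\s*(.+?)\s*=+\s*$", wikitext, flags=re.MULTILINE)
    kept = []
    for heading, body in zip(sections[1::2], sections[2::2]):
        if heading in KEEP:
            kept.append(body.strip())
    return "\n".join(kept)

text = """===Translations===
* French: ...
===Noun===
# A length of rope used for skipping.
"""
print(keep_sections(text))  # only the Noun section survives
```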
Here's all idioms that contain the substring evil
:
$ dict -s substring -d wikioms -m evil
wikioms: "between the devil and the deep blue sea"
"blue devils" "dance with the devil" "devil dancing"
"devil in disguise" "devil lies in the details" "devil's advocate"
"devil's luck" "evil twin" "folk devil" "give the devil his due"
"lesser of two evils" "necessary evil" "speak of the devil"
"speak of the devil and he appears"
"speak of the devil and he shall appear" "talk of the devil"
"the devil" "the devil a one" "the devil is a liar"
"the devil is in the details" "what the devil"
You can again see some near-duplicate entries, in this case "speak of the devil" and "talk of the devil".
Here's for shoe
:
$ dict -s substring -d wikioms -m shoe
wikioms: "act one's age, not one's shoe size"
"as ever trod shoe leather" "as ever trod shoe-leather"
"dead men's shoes" "fill someone's shoes" "horseshoe up one's ass"
"if the shoe fits" "if the shoe fits, wear it"
"in someone's shoes" "in the same shoes" "on a shoestring"
"pair of shoes" "pebble in one's shoe"
"put on one's dancing shoes" "put the same shoe on every foot"
shoe-leather "soft shoe" "stand in someone's shoes"
"step into someone's shoes" "the shoe is on the other foot"
"wait for the other shoe to drop" "walk a mile in someone's shoes"
"which foot the shoe is on"
I can look up a specific idiom, say primrose path:
$ dict -s substring -d wikioms "primrose path"
From Wiktionary Idioms [wikioms]:
primrose path
/noun/
- An easy and pleasant life; a self-indulgent or hedonistic life; such a
life that leads to damnation.
Many men in his position would have preferred the *primrose path* of
dalliance to the steep heights of duty; but Lord Arthur was too
conscientious to set pleasure above principle.
-- Oscar Wilde (1891)
It's really fast, and it's 2MB compressed!
To end, here's an unusual idiom: The rabbit died!
From Wiktionary Idioms [wikioms]:
the rabbit died
/phrase/
- (euphemistic)
Robin finally was able to announce: "Good news! *The rabbit died*!"
She said she was thrilled and that everybody was happy for her except
her mother.
-- Maurice Apprey (1993)
My Wikitext parsing didn't capture the etymology of this idiom on the first run. I modified it, and then learned it derives from the rabbit test, a pregnancy test from the 1930s. It's surprisingly morbid, but quite interesting:
The rabbit test became a widely used bioassay (animal-based test) to test for pregnancy. The term "rabbit test" was first recorded in 1949, and was the origin of a common euphemism, "the rabbit died", for a positive pregnancy test. The phrase was, in fact, based on a common misconception about the test. While many people assumed that the injected rabbit would die only if the woman was pregnant, in fact all rabbits used for the test died, as they had to be dissected in order to examine the ovaries.
I will at some point make this dictionary available for download under the GNU Free Documentation License and CC BY-SA 4.0, in compliance with Wikimedia's licensing. For now, I have lots more cleaning to do.