Computational Attacks on the Voynich Manuscript

Repetition of Words – What is Going On Going Going On?

August 11, 2024 JB 3 comments

An examination of the running text in the Manuscript reveals that there are some folios that contain lines with four or five occurrences of a word (there are never more). Here is the complete list (for Language B):

I was looking at this feature as potential evidence of the words on each line being sorted in some way, which could explain why there are repeats of words one after the other, and why some words prefer the starts of lines, and others the ends.

But the above shows that that cannot explain what is going on!
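For anyone wanting to repeat the search, here is a minimal sketch of how lines with many repeats of one word could be found. It assumes each line of the transcription is available as a string of space-separated EVA words; the function name and the sample lines are made up for illustration and are not real transcription data.

```python
from collections import Counter

def lines_with_repeats(lines, min_count=4):
    """Return (line_index, word, count) for every word that occurs
    at least min_count times on a single line."""
    hits = []
    for i, line in enumerate(lines):
        counts = Counter(line.split())
        for word, n in counts.items():
            if n >= min_count:
                hits.append((i, word, n))
    return hits

# Made-up sample lines in EVA-like spelling (not real VMS data).
sample = [
    "qokedy chedy qokedy qokedy shedy qokedy",
    "daiin chol shol chol",
]
print(lines_with_repeats(sample))  # [(0, 'qokedy', 4)]
```

Lowering min_count to 2 or 3 would also pick up the more common doubled and tripled words.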

For reference, here are images of the text lines listed above.

Categories: f111v, f113v, f75r, f76r, f78r, f81r, Features, Repeating Sequences

In Honour of Voynich Day, 4th August

August 3, 2024 JB 14 comments

Brother Massimo Cipriani (b. 1401, d. 1486), lead scribe on the Voynich Manuscript, showing an earlier work and his Occitan enciphering apparatus.

Categories: People Tags: monk, Occitan, scribe

Word Positions on the Folios – Conjectures

January 27, 2023 JB 6 comments

Conjecture #1: The Text is Song/Poetry Transcriptions

Rhymes in songs and poetry feature words at the ends of lines that rhyme with other line end words. So a plausible explanation for why certain words mostly appear at the ends of lines is that they are words that easily rhyme with others.

Conjecture #2: The Words in the Text are Codes

We know that the VMS word lengths follow a binomial distribution, which suggests they are actually numbers. The fact that many words have the property that if you remove the first glyph the resulting word is also valid (i.e. observed in the text) strongly supports this. E.g. the word “12345” gives “2345”, which is a valid number.

Conjecture #3: The Text is a Transcription of Old Occitan

Old Occitan, which was on its way out at around the dating of the manuscript, is a nice candidate source language for various reasons (which I will leave for later). The Conjecture is that the VMS text words are numbers (Conjecture #2), which when looked up in a (long-lost) dictionary, equate to Occitan words. Since many of the existing manuscripts in Old Occitan are records of songs sung by the troubadours, this fits with Conjecture #1.

Categories: Theories

Word Positions on the Folios – Part Deux

December 30, 2022 JB Leave a comment

In the previous post, we looked at Voynich words that have a marked affinity for the first position in a line of text, with a restriction to words on the Currier B folios. Those words are:

Words that, when they occur, are often the first word on a line.

In this post, we’ll look at the words that have an affinity for the ends of the text lines, and words that are disinclined to appear line initial or terminal.

For this, we introduce the metric that is the fractional position of the word in the text line. For the word at the start of the line, this metric is 0.0, and for the word at the end of the line it is 1.0, and for a word half way along the line it is 0.5. We can plot this metric for each word.
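A minimal sketch of this metric follows. It assumes the position is measured by word index (word i of n gets i/(n-1)); measuring by glyph offset instead would be an equally reasonable reading. The function names and sample lines are invented for illustration.

```python
def fractional_positions(line):
    """Return (word, fraction) pairs for one line: 0.0 for the first word,
    1.0 for the last. Assumes position is measured by word index."""
    words = line.split()
    n = len(words)
    if n < 2:
        # A one-word line has no meaningful fraction; call it 0.0.
        return [(w, 0.0) for w in words]
    return [(w, i / (n - 1)) for i, w in enumerate(words)]

def positions_of(word, lines):
    """Collect the fractional positions of every instance of word."""
    return [pos for line in lines
            for w, pos in fractional_positions(line) if w == word]

# Made-up sample lines (not real VMS data).
sample = ["daiin chedy qokeedy shedy dy", "daiin shedy qokal"]
print(positions_of("daiin", sample))  # [0.0, 0.0]
print(positions_of("shedy", sample))  # [0.75, 0.5]
```

The list returned by positions_of is what would be binned to draw a histogram for a single word.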

Firstly, let’s look at the word “daiin” with the metric:

The upper half of the Figure shows the page positions of all instances of “daiin”, with the fractional line positions on the x axis. The lower half shows a histogram of the fractional line positions. We can see what we observed in the first post, that “daiin” has a strong affinity for the first line position (fraction 0.0). But we also see that it sometimes appears as the last position in lines of text (fraction 1.0).

Looking at all the words, we can categorize them into six categories as follows:

Words often in First Position – dai, daiin, dair, sai, saiin, sar, sol
Words often in Last Position – am, dy, oky, oly, qoky
Words never in First Position – ai, aiin, air, al, am, chcthy, chdy, checkhy, cheedy, cheky, cheody, chody, chy, dy, kai, kedy, keedy, okal, okedy, oky, oly, opchedy, otai, otal, otar, oty, raiin, shckhy
Words never in Last Position – ai, cheo, dai, kai, kaiin, okai, otai, qokai, qokain, qotai, sai, shckhy, sheedy
Words never in First or Last Position – ai, kai, otai, shckhy
The rest – ar, chckhy, chedy, cheey, cheol, chey, chol, dal, dar, dol, kar, lchedy, lchey, lkaiin, okaiin, okar, okeedy, okeey, ol, or, otaiin, otedy, oteedy, oteey, qokaiin, qokal, qokar, qokchdy, qokedy, qokeedy, qokeey, qokey, qokol, qol, qotaiin, qotal, qotar, qotedy, qoteedy, qoty, shdy, shedy, sheey, sheol, shey, shol, tedy
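The categories above can be assigned mechanically, along the lines of the sketch below. The 20% threshold for "often" is taken from the first-position analysis in the previous post and is only assumed to apply to the last position as well. As in the lists above, a word can carry more than one label.

```python
from collections import defaultdict

OFTEN = 20.0  # percent; assumed threshold for "often first" and "often last"

def categorize(lines, often=OFTEN):
    """Label each word by how it behaves at the start and end of lines."""
    total = defaultdict(int)
    first = defaultdict(int)
    last = defaultdict(int)
    for line in lines:
        words = line.split()
        if not words:
            continue
        for w in words:
            total[w] += 1
        first[words[0]] += 1
        last[words[-1]] += 1

    categories = {}
    for w, n in total.items():
        labels = []
        if 100.0 * first[w] / n > often:
            labels.append("often first")
        if 100.0 * last[w] / n > often:
            labels.append("often last")
        if first[w] == 0:
            labels.append("never first")
        if last[w] == 0:
            labels.append("never last")
        if first[w] == 0 and last[w] == 0:
            labels.append("never first or last")
        categories[w] = labels or ["the rest"]
    return categories

# Made-up sample lines (not real VMS data).
sample = ["daiin chedy aiin dy", "daiin aiin chedy dy", "chedy aiin daiin dy"]
for word, labels in categorize(sample).items():
    print(word, labels)
```

With real data, the word list would first be filtered as in the first post (Currier B only, minimum word length and minimum count).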

Words often in Last Position

Words that, when they appear, are often terminal on a line.

Words Never in First or Last Position

Words that, when they appear, are never line initial or terminal.

In the next post, we’ll look at possible explanations of these word positional data.

Categories: Features, Folios, Lines

Word Positions on the Folios

December 26, 2022 JB 5 comments

One of the puzzling aspects of the text in the Voynich Manuscript (VMS) is the unusual positioning of some words on the folios. In particular, some words appear more often at the beginnings of lines, and some more often at the ends. For example, “daiin” shows a particular fondness for the first position in the line:

Whereas “aiin” shows a strong aversion for the first position:

In contrast to “daiin” and “aiin”, the word “chedy” shows no preference for any particular position on the folio lines:

(There are some words that appear more often near the top of the folio, and some more often near the bottom, although this may simply be due to a topic change within the folio and a consequent change of vocabulary, so I don’t consider them further.)

These cases, the words that have an affinity for either the start of a line or the end of it, are curious and warrant some investigation.

For this study, I have restricted the data to only use folios in Currier B. For all words that appear on Currier B folios, I extract the glyph position of the word in the line:

In the example above, the line of text contains the word “chedy”, shown in red, at glyph position 24. The data are further restricted so that only lines of text are considered that have at least three words, so as to avoid labels and other short text lines. Finally, only words of length at least three glyphs, and that appear at least 40 times over all folios are considered. For every word in the resulting dataset, and for every time it appears on a folio, the glyph and line position of the appearance are stored for analysis.
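The filtering steps just described can be sketched as follows. Several details are assumptions rather than the exact procedure used: each EVA character is counted as one glyph, the glyph position is the 0-based offset of the word's first glyph with spaces not counted, and the 40-occurrence threshold is applied after short lines have been dropped. The input is assumed to be (folio, line number, text) tuples for Currier B folios only.

```python
from collections import Counter

MIN_WORDS_PER_LINE = 3   # skip labels and other short lines
MIN_GLYPHS = 3           # minimum word length
MIN_OCCURRENCES = 40     # minimum number of appearances

def word_records(folio_lines, min_occurrences=MIN_OCCURRENCES):
    """Return (word, folio, line_no, glyph_pos) for every retained word instance."""
    records = []
    for folio, line_no, text in folio_lines:
        words = text.split()
        if len(words) < MIN_WORDS_PER_LINE:
            continue
        offset = 0  # glyphs seen so far on this line, spaces not counted
        for w in words:
            records.append((w, folio, line_no, offset))
            offset += len(w)
    counts = Counter(r[0] for r in records)
    return [r for r in records
            if len(r[0]) >= MIN_GLYPHS and counts[r[0]] >= min_occurrences]

# Made-up sample data (not real VMS lines); threshold lowered for the demo.
sample = [
    ("f75r", 1, "daiin chedy ol daiin"),
    ("f75r", 2, "okal dy"),
    ("f76r", 1, "chedy qokeedy daiin"),
]
for rec in word_records(sample, min_occurrences=2):
    print(rec)
```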

With the data, we can look at some metrics. A simple metric is the number of times a word appears as the first word in a line, Nf, as a percentage of the number of times it appears, N. Let’s call this metric PI = (100*Nf/N). If the word only ever appears as the first word, PI will be 100%, and if it never appears first, PI will be 0%.

The histogram above shows the distribution for the metric PI. There are seven words with the property PI > 20%, i.e. each of these words, when it appears, is the first word on a folio line more than 1/5 of the time. The words are: dai, daiin, dair, sai, saiin, sar, sol

Words that appear more than 20% of the time at the beginning of a line.
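As a rough sketch, PI can be computed from a list of text lines like this. The function name and the sample lines are invented; the real input would be the filtered Currier B lines described above.

```python
from collections import Counter

def line_initial_percentages(lines):
    """Return {word: PI}, where PI = 100 * Nf / N, Nf is the number of times
    the word is first on a line and N is its total number of appearances."""
    total = Counter()
    first = Counter()
    for line in lines:
        words = line.split()
        if not words:
            continue
        total.update(words)
        first[words[0]] += 1
    return {w: 100.0 * first[w] / n for w, n in total.items()}

# Made-up sample lines (not real VMS data).
sample = [
    "daiin chedy aiin",
    "daiin aiin chedy",
    "chedy daiin aiin",
    "sol daiin aiin",
]
pi = line_initial_percentages(sample)
print(sorted(w for w, v in pi.items() if v > 20.0))  # ['chedy', 'daiin', 'sol']
```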

Here are the detailed folio distributions for each of these:

In the next post, we’ll look at VMS words that have an aversion to being first on a line, and then see if we can deduce any patterns between the two groups. What governs where a word is “allowed” to appear on a line – is it related to its component parts, i.e. does the word’s prefix/midfix/suffix dictate where it should appear on a line?

Categories: Currier, Features

The Wheels hit a bump

August 22, 2021 JB 11 comments

To recap, the hypothesis is that the VMS text was written by use of a number of cipher wheels, each wheel containing a number of glyphs from which none or one was used. In addition, it was theorized that one of the wheels contained just the Gallows glyphs. The attractiveness of this hypothesis can be summarized as follows:

The length of words created using a set of wheels in this way should be binomially distributed. This is the case, and was first observed by Stolfi.
The number of words containing a gallows glyph should be about 50% (since the gallows wheel is chosen 50% of the time). This is approximately true (the VMS has about 60%).
The number of words in the VMS that begin with a gallows glyph is about 13%, and this closely matches the number obtained when the wheel containing the gallows glyphs is the third wheel.
The average length of a word that starts with a gallows glyph, should be shorter than the average length of all words that contain a gallows glyph (since the first two wheels were not used). This is also true, by one glyph on average.
If the gallows glyphs only appear on one wheel, then gallows glyphs should never appear next to one another in a word. This is approximately true: there is one case in the VMS where two gallows glyphs appear together:

This may in fact be two words: EVA ot and EVA kchedy. (Aside: the challenge of deciding where one VMS word ends and the next begins is well known – what is a space, how big is it, and did the transcribers get it right?!)

So far, so good. But from the wheels hypothesis we can make another prediction: for words containing a gallows glyph, there should be at most two glyphs preceding the gallows, if the gallows are all on the third wheel. This is not the case: there are many words that have more than two glyphs preceding the gallows, even if you count some glyph combinations such as EVA qo and EVA ch, sh as one glyph.

Another prediction we can make is for the number of words that end with a gallows glyph. This number can be calculated from the wheel number and layout, and it turns out to be much smaller than what is observed in the VMS. Specifically, in the VMS, there are 85 words that end in a gallows glyph (about 1% of all words), but only about 0.1% are predicted.
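The predicted fraction can be checked by brute force, as in the sketch below. It assumes every non-empty selection of wheels is equally likely (each wheel used with probability 0.5) and that the gallows glyphs occupy wheel 3. The post does not say how many wheels were assumed for the 0.1% figure, so two counts are shown: 12 wheels (the number that fits the unreduced EVA word lengths) gives about 0.1%, while 9 wheels gives about 0.8%.

```python
from itertools import product

def fraction_ending_with_gallows(n_wheels, gallows_wheel=3):
    """Fraction of non-empty wheel selections whose last used wheel
    is the gallows wheel (so the word ends in a gallows glyph)."""
    hits = total = 0
    for used in product((False, True), repeat=n_wheels):
        if not any(used):
            continue  # no wheel used, so no word
        total += 1
        last_used = max(i for i, u in enumerate(used, start=1) if u)
        if last_used == gallows_wheel:
            hits += 1
    return hits / total

for n in (9, 12):
    print(n, f"{100 * fraction_ending_with_gallows(n):.2f}%")
# 9 0.78%
# 12 0.10%
```

Either way, the prediction is below the roughly 1% observed in the VMS.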

Categories: Algorithms Tags: cipher, Cipher Wheels, gallows, Stolfi, wheels

Grove Word Lengths

August 19, 2021 JB 10 comments

In the previous post, we looked at how the Grove words (words with an initial gallows glyph) are distributed in the VMS, and how their frequency is explained by the use of cipher wheels to generate VMS words.

Marco commented on that post with the astute observation that if this generation scheme is valid then gallows initial words should be shorter than other words, on average, as only wheels 3 onwards are used to create them.

Here are the data: these show the lengths of Grove words compared with the lengths of other words that contain at least one gallows glyph:

This confirms that, yes, Grove words are on average shorter than other gallows words (by about 1 glyph) – perhaps more evidence for the validity of this scheme?
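For comparison, here is what the idealized wheel model itself predicts for these averages. The sketch assumes nine wheels, one glyph per used wheel, each wheel used with probability 0.5, and gallows on wheel 3; these are modelling assumptions, not measurements from the VMS.

```python
from itertools import product

def mean_lengths(n_wheels=9, gallows_wheel=3):
    """Return mean word length for (Grove words, other gallows words,
    all gallows words) under the idealized wheel model."""
    grove, other = [], []
    for used in product((0, 1), repeat=n_wheels):
        if not used[gallows_wheel - 1]:
            continue  # word contains no gallows glyph
        length = sum(used)
        if any(used[:gallows_wheel - 1]):
            other.append(length)   # gallows present but not word-initial
        else:
            grove.append(length)   # gallows is the first glyph
    everything = grove + other
    return (sum(grove) / len(grove),
            sum(other) / len(other),
            sum(everything) / len(everything))

grove, other, all_gallows = mean_lengths()
print(grove, round(other, 2), all_gallows)  # 4.0 5.33 5.0
```

So the model predicts Grove words to be exactly 1 glyph shorter than gallows words overall, and about 1.3 glyphs shorter than gallows words that are not gallows-initial.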

For interest (as requested by Rene), here are the distributions for EVA l and EVA r:

Categories: gallows Tags: gallows, Grove

Fun with Grove Words and Cipher Wheels

August 18, 2021 JB 7 comments

What is a Grove word? The answer is a little fuzzy, but simplistically a Grove word is a VMS word that begins with one of the gallows glyphs. These words are often page or paragraph initial. Emma May Smith has a good explanation in her recent blog entry.

Mr. Grove observed the peculiar feature that some words beginning with a gallows glyph are also valid words if you remove the gallows glyph. For example, the word EVA kodaiin starts with gallows k, and odaiin is also a valid word.

It turns out that if you look at all words in the VMS, 46% of them have this property: remove the first glyph and you are left with a valid VMS word. Compare this with English, where only around 8% of words produce valid words if you remove the first letter. Making up the 46% we have 38% from non-Grove words (i.e. non-gallows initial), and 8% from Grove words.
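A minimal sketch of this test is below. It assumes that each EVA character counts as one glyph and that the percentage is taken over distinct words rather than over every occurrence; the sample vocabulary is illustrative only.

```python
def fraction_with_valid_tail(vocabulary):
    """Fraction of distinct words that are still in the vocabulary
    after their first glyph is removed."""
    vocab = set(vocabulary)
    if not vocab:
        return 0.0
    hits = sum(1 for w in vocab if len(w) > 1 and w[1:] in vocab)
    return hits / len(vocab)

# Tiny illustrative vocabulary, not a real word list.
sample = ["kodaiin", "odaiin", "daiin", "aiin", "chedy", "ol"]
print(fraction_with_valid_tail(sample))  # 0.5
```

The same function can be run on an English word list to reproduce the comparison.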

To round out the statistics, about 13% of all VMS words have an initial gallows glyph.

Consider the nine wheels above, where one of the wheels contains gallows glyphs, and the other wheels contain other glyphs. These wheels can be selected in 2⁹ − 1, i.e. 511, different ways, to make words of length between 1 and 9.

The probability of selecting wheel 3 as the first wheel for the word is about 12.5%. In other words, with these 9 wheels, 12.5% of the time we’d create a gallows-initial “Grove” word – very close to what we observe in the VMS (13%). In fact, this figure of 12.5% is independent of the number of cipher wheels: as long as there are at least three wheels and they are used left to right, and the gallows glyphs fully occupy the third wheel, then 12.5% of the generated words will be Grove types.
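The 12.5% figure can be checked by enumerating every way of selecting wheels, as sketched here. The assumption is that each non-empty selection is equally likely. Counting the empty selection as well gives exactly 1/8 for any number of wheels; excluding it, as below, gives 64/511 for nine wheels, which is marginally higher and approaches 1/8 as wheels are added.

```python
from itertools import product

def grove_fraction(n_wheels, gallows_wheel=3):
    """Fraction of non-empty wheel selections whose first used wheel
    is the gallows wheel (a gallows-initial word)."""
    grove = total = 0
    for used in product((False, True), repeat=n_wheels):
        if not any(used):
            continue  # no wheel used, so no word
        total += 1
        first_used = used.index(True) + 1  # 1-based wheel number
        if first_used == gallows_wheel:
            grove += 1
    return grove / total

for n in (3, 9, 12):
    print(n, f"{100 * grove_fraction(n):.2f}%")
# 3 14.29%
# 9 12.52%
# 12 12.50%
```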

As a corollary, it’s clear that for Grove types generated with the wheels, removing the first glyph will produce a valid word, as it is equivalent to generating a word starting at wheel 4 or later.

So what of the 54% of VMS words that lack this property, i.e. removing the initial glyph does not produce a valid VMS word? This can be explained if the number of different words used and written in the VMS is simply less than the total number of possible words that the author’s wheels can produce. What is the expected vocabulary size if we know there are 7,552 words written in the VMS (Takahashi), and we are missing 54% of them? It is simply 1.54 x 7,552 = 11,630 words, or thereabouts.

(Aside: the wheels above could just as easily be represented and used as a table with nine columns.)

In summary, “Grove” words (gallows initial) are ~13% of all words in the Voynich manuscript, and this fraction is what you’d expect if the text was produced using cipher wheels.

Categories: Algorithms, cipher, English, gallows, Grove Tags: cipher, Cipher Wheels, English, gallows, Grove

Word Length Distributions

August 12, 2021 JB 7 comments

In the previous blog post, we looked at the distribution of word lengths in the EVA transcription, and compared it with the binomial distribution with n = 9, as per the work of Stolfi. They matched well enough, because I had treated EVA ch, sh, ain, aiin and qo as single glyphs, in a similar fashion to Stolfi:

For this page, we will define symbol as Currier did; i.e. EVA ch ans sh will be counted as single symbols, and so are EVA cth, ckh, etc..
https://www.ic.unicamp.br/~stolfi/voynich/00-12-21-word-length-distr/

i.e. he reduced some of the EVA glyph sequences to single symbols.

Without making these reductions, so leaving the EVA transcription unchanged, the distributions of course tend to higher values. As a check of my sanity, Marco Ponzi was kind enough to send me a list of VMS words he’d extracted from the ZL transcription, so that I could compare it with the words I extracted from the Takahashi EVA. In the following plot I show the three word length distributions: EVA, ZL and the reduced EVA with ch, sh, ain, aiin and qo as single glyphs.

Reassuringly, the EVA and ZL (green and blue curves) match quite well, as they should, and the Reduced matches Stolfi’s result. (Curiously, the ZL transcription has a total of 8078 different words, compared with 7552 for Takahashi EVA, which warrants further investigation.)

The EVA distribution now matches a binomial of (n=12,p=0.5), i.e. using 12 cipher wheels with a probability of 50% for a glyph being used from each wheel.
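As a sketch of how such a comparison can be made, the code below computes the binomial probabilities and a simple distance between them and an observed length distribution. The helper names are my own, the word list is a placeholder, and word length is taken as the number of EVA characters. Note that the binomial assigns some probability to length 0, which has no observable counterpart in the text.

```python
from collections import Counter
from math import comb

def binomial_pmf(n, p=0.5):
    """Probability of each length k = 0..n when n wheels are each used with probability p."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def length_distribution(words, max_len):
    """Observed fraction of words at each length 0..max_len
    (longer words are lumped into the last bin)."""
    counts = Counter(min(len(w), max_len) for w in words)
    total = sum(counts.values())
    return [counts[k] / total for k in range(max_len + 1)]

def total_variation(p, q):
    """Half the summed absolute difference: 0 for identical distributions, 1 for disjoint ones."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Placeholder word list; a real run would use the full transcription vocabulary.
words = ["daiin", "chedy", "qokeedy", "ol"]
observed = length_distribution(words, 12)
print(total_variation(observed, binomial_pmf(12)))
```

Running the same comparison against binomial_pmf(9) and binomial_pmf(12) shows which wheel count fits a given transcription better.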

Categories: Algorithms, cipher, Stolfi Tags: Binomial, Cipher Wheels, Word Lengths

Nine Cipher Wheels

August 9, 2021 JB 5 comments

UPDATE (12 Aug 2021): the plots and results discussed in this post used a version of EVA that replaces some common glyph sequences by a single glyph, namely ch, sh, ain, aiin, and qo. Clearly, this tends to reduce the average word length. A later post will discuss the distributions obtained with words without this simplification.

The lengths of VMS words follow a binomial distribution with n = 9, as observed by Stolfi, and as discussed in Rene’s recent paper. The same distribution can be obtained from a set of 9 cipher wheels, where each wheel has a 50% chance of contributing one of its glyphs to the word being assembled. Plotting the lengths of the resulting words gives:

Word lengths obtained from a set of 9 cipher wheels where a glyph is picked from each wheel with probability 0.5.

In the above plot, the orange line shows the distribution of word lengths from EVA, and the blue line shows the distribution of word lengths obtained by using the following set of 9 cipher wheels to generate a large number of random words:

Example cipher wheels used to generate words.
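The generation procedure can be sketched as follows. The wheel contents here are placeholders chosen only so that the gallows (EVA k, t, p, f) sit on the third wheel; they are not the wheels in the figure. Empty selections are discarded, which is an assumption about how a "word" of length zero should be handled.

```python
import random
from collections import Counter

# Placeholder wheels for illustration only; wheel 3 holds the gallows glyphs.
WHEELS = [
    ["q", "d", "s"], ["o", "y"], ["k", "t", "p", "f"],
    ["c", "s"], ["h", "e"], ["e", "o"],
    ["d", "l", "r"], ["a", "y"], ["n", "m"],
]

def generate_word(wheels, p=0.5, rng=random):
    """Walk the wheels left to right; each contributes one random glyph with probability p."""
    return "".join(rng.choice(w) for w in wheels if rng.random() < p)

def length_histogram(n_words, wheels=WHEELS, p=0.5, seed=0):
    """Generate n_words non-empty words and count how many have each length."""
    rng = random.Random(seed)
    counts = Counter()
    generated = 0
    while generated < n_words:
        word = generate_word(wheels, p, rng)
        if word:  # skip the case where no wheel was used
            counts[len(word)] += 1
            generated += 1
    return counts

hist = length_histogram(2000)
for length in sorted(hist):
    print(length, hist[length])
```

The printed counts should peak around lengths 4 and 5, as expected for a binomial with n = 9 and p = 0.5.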

With the cipher wheels shown, about 50% of the generated words will contain a gallows glyph, and this is, perhaps not coincidentally, the case in the VMS text, too.

Using the same technique as applied in my earlier blog post, where I looked at the counts between gallows glyphs in the VMS text, we can look at the same distributions for words generated with the above wheels, assembled into lines of text, and ignoring spaces between words. The results are very similar, and shown below.

Comparing the counts from f to the next gallows in the VMS (left plot) and synthetic text generated with 9 wheels, one wheel having gallows glyphs (right plot).

Here are the others:

Categories: cipher, gallows, Rene Zandbergen, Stolfi Tags: cipher, gallows
