This collection produces passwords using models built on a corpus of source text. It manages to guarantee generation of passwords with known entropy by building markov models whose transitions are guided by huffman trees, allowing the use of variable numbers of bits of entropy for each transition.
To run it, use
It also has a number of command-line arguments that you can use to control the password generation process.
You can specify the source text using the -t flag. A megabyte of text is a reasonable size. I’ve included the text of Dickens’ A Tale of Two Cities, and this is used by default. Extremely long source texts will take a while to build models from, and extremely short source texts (e.g., "banana" will work, but will generate extremely long passwords.
The tool builds a markov model mapping character-based n-grams to possible next letters. By default, it uses order 2. That is, the model will take a look at every instance of "pr", for instance, and see what could follow it.
You can also specify a larger order. If you specify order 3, then it will build its model using strings of length 3 as states; in this case, it would compute, e.g., what follows "pro" (and every other 3-letter string that occurs in the source text).
Larger models produce passwords that are more characteristic of the source text. Surprisingly, this isn’t usually such a great thing. Here are some 56-bit passwords of order 2:
Yess ott was whoc
ming to mys, ale, and ar
and the factinand cale
toner gre,' Throur and w
dis whishe coullows ing
inesperept his calty
he led, againeved--th
And here are some 56-bit passwords of order 8:
took horse for Dover; and then stood smoking, too! replied: My child, this world
out of these footsteps destination and by the wall where he
and alluvial mud, under such thought madame, drawing themselves through her f
by Lucie at her sole reliance, under a bank overhanging about. There, said he, i
in such differed principle. And look at, am I? Why don't you be afraid of, for he stooped, and leth
the nose became much more wicked foreign woman; what plate as to matter. But when I ask your
did a great Stilton cheese, until to-morrow. For to-morrow! Now, a judiciously sensibility she wound about he
young ladies made to stand up against, and notice, when two dusty men passionate, loving, thankful that if s
The second set are clearly more Dickensian... but would make terrible passwords! I mean, yes, you’d probably be the only girl on the block to have "Stilton cheese, until tomorrow" in your password, but at the end of the day, I bet you’d be happier typing "Yess ott was whoc" when you lift the lid on your laptop.
In fact, some people might even prefer the order-1 passwords, which look like this (again, 56-bit):
a ffty,'l g wed
maged he tof hed ored
arerer pen le wagh
hemishemat. t wher
s ton toticould the
waped t If h inge t
Well, maybe not. Order-2 seems like a nice compromise.
The generator uses a list of booleans to generate a password. Each character’s generation consumes zero or more booleans from this list. There is a bijection between the set of booleans of some length n and the set of passwords generated (assuming a fixed order and source text).
These booleans are generated by drawing randomness from "/dev/urandom" (sorry, Windows folks, tell me how to get random bits on your platform).
What this means is that passwords generated by this system are guaranteed to have the required entropy. That is, if I generate a 56-bit password this system, it’s guaranteed to have 56 bits of entropy, even though the attacker knows your source text and what order you’ve chosen.
For more about security and how it all works, take a look at my preprint on ArXiv.
Currently, the tool simply builds the model every time it’s called. This may seem inefficient, but the fact is that for order-2 models on 1-megabyte texts, it only takes a few seconds on my machine.
Larger models take longer to build, but they’re also huge, and would take up lots of space on disk, and aren’t really very useful (as noted above). If there’s really any demand for it, it wouldn’t be hard to add a means to compile source texts for fast password generation. Finally, this tool usually isn’t something that you run every day; waiting 30 seconds for password generation probably isn’t a big deal.
These passwords are as secure as the bits that come from "/dev/urandom". In fact, there’s a bijection between them.
The library contains lots of other stuff that I explored in the process of creating this; there’s some code for extracting the text of e-mails, and for html processing, and for word-based generation, etc. etc. etc. Hopefully, it’s all fairly easy to read.
Naturally, you can also call this code as a racket library.
atotc-model : model?
base-model : model?