Saturday, September 16, 2006

Neither GREP nor GOOGLE: Chestertonians search upside down.

Tomorrow, of course, is Sunday.
"The fact is," said Syme serenely, "the truth is I am a Sabbatarian. I have been specially sent here to see that you show a due observance of Sunday." [The Man Who Was Thursday CW6:496]
It is ACS blogg policy to "show a due observance of Sunday" - hence, no postings will be made tomorrow. (You may, however, read and comment at your pleasure.)

And I have a piece of news! This is my last posting standing in for Nancy Brown, who shall resume her excellent work on Monday.

Yes, I have sailed quite close to the edge of the envelope in mentioning technology along with GKC - so I will complete my discussion, which (as it turns out) is quite fittingly related to Nancy's recent column "Finding Clairity" which was also about a kind of search. This posting will complete my little discussion on words, and finding certain things with a computer, and provide a little surprise.

...it is the test of a good encyclopaedia that it does two rather different things at once. The man consulting it finds the thing he wants; he also finds how many thousand things there are that he does not want.
[GKC The Common Man 240]
Some time ago there was a cartoon version of "The Cat in the Hat" which I ought to have used in writing my doctoral dissertation. I cannot recall what happens in the story, but the two children are trying to find something, and the Cat attempts to assist them in finding the whatever-it-is. Anyway, the Cat introduces them to a method called "Calculatus Eliminatus" which (if I remember correctly) is the method by which one finds something by finding out where it isn't. Lots of good fun for human and feline.

If you play on the INTERNET, you know about "GOOGLE" and its rivals, which are tools for doing "searches" - they seek for words among web pages. Or, if you work with computers, you know about the famous search tool called GREP, which seeks for words among the files on the computer's disks. Such tools could be said to work on the biblical principle of "seek and you shall find": like the man with a ring of keys, the computer takes the word being sought, and tries it against every part of the file, step by step, announcing each "successful match", until the end is reached. (We humans usually stop once we find a key that works. Computers do as they are told, no matter how boring their work.)
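For the curious, here is a minimal sketch of that "ring of keys" idea in Python. It is only an illustration of trying the word at every position in turn; it is not how GREP or GOOGLE are actually built, and the name find_all and the sample text are made up for this example.

```python
def find_all(word, text):
    """Try `word` at every position in `text`; collect each successful match."""
    matches = []
    # Step through every possible starting position, like trying every key.
    for start in range(len(text) - len(word) + 1):
        if text[start:start + len(word)] == word:
            matches.append(start)  # announce a "successful match"
    # Unlike a human, we keep going until the end is reached.
    return matches


text = "seek and you shall find; seek again"
print(find_all("seek", text))
```

The list that comes back holds the starting position of every match, in order; an empty list means the word is nowhere in the text.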

This wonderful mechanism works fine for basic kinds of searches like the GREP and GOOGLE kind. But for things like rRNA sequences, which have their own challenges, or for upside-down searches, like "Calculatus Eliminatus" - well, a different kind of trick is needed.

Some biologist-friends needed to find sequences in RNA from bacteria - sequences which were in one species but weren't in any other. So I looked into ways of using the computer to help them. The technical details I defer for now - but it is an interesting challenge to express the question in - er - "lit'ry" terms:
We are given a certain edition of a certain book of some decent size, such as GKC's The Everlasting Man. As a kind of puzzle, we wish to find a "key" word for each page, which acts just like the page number does. That is, a word which appears only once in the whole book. So if one were to mention that word (like maybe in a kind of cryptogram) one would know what page number was being referred to.
So we want to find a word which isn't anywhere else.

There are ways of solving such things, but I won't go into details here. The important thing about one technique is that it provides you with both the singular words, and also repeated words.
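Without going into those details, a small Python sketch can at least show the flavor of a method that yields both kinds of words at once. This is a plain word tally, swapped in for illustration and not the technique used on the RNA sequences; the function name singular_and_repeated, the pages dictionary, and the rule for what counts as a "word" are all assumptions.

```python
import re
from collections import Counter


def singular_and_repeated(pages):
    """pages: hypothetical dict mapping page number -> text of that page."""
    counts = Counter()
    first_page = {}
    for number, page_text in pages.items():
        # Assumption: a "word" is a run of letters or apostrophes, case ignored.
        for word in re.findall(r"[a-z']+", page_text.lower()):
            counts[word] += 1
            first_page.setdefault(word, number)  # remember where it first appeared
    # A singular word occurs once in the whole book, so it names its page.
    singular = {w: first_page[w] for w, n in counts.items() if n == 1}
    repeated = {w: n for w, n in counts.items() if n > 1}
    return singular, repeated


pages = {1: "the cat sat", 2: "the hat sat still"}
singular, repeated = singular_and_repeated(pages)
print(singular)
print(repeated)
```

One pass over the text fills the tally, and the same tally answers both questions: which words could serve as a "key" for a page, and which words repeat.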

So, with a bit of trickery, I managed to make my software process some English text instead of RNA sequences. And I found that among the over 100,000 words of The Everlasting Man, there were around 4000 words which appear only once... For example, of the "many thousand things" I did not want, I learned that "Anselm" only appears on CW2:386 and "yesterday" on CW2:373. But I had already found out these singular words by a much easier route. So I asked the harder question:

What phrases repeat, and what is the longest repeated phrase?

I was shocked when I read the answer. The following 13 words appear in this order three times:

"Heaven and earth shall pass away, but my words shall not pass away." [quoting Luke 21:33; see CW2:327,392,393]

If you want to try finding them yourself, go ahead - you may have to stand on your head to get anywhere. But if you want to save yourself the effort, as I did, just use your mouse to highlight the above gap, and the answer will appear.

Those particular words seem quite fitting, considering the subject and title of GKC's book. And they are strangely related to the quote I mentioned previously about "the smallest part of a letter".
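For anyone who would rather not stand on their head, the harder question of the longest repeated phrase can be sketched in Python too. This is a simple grow-the-phrase approach chosen for clarity, not the software described above, and it would be slow on a whole book; the name longest_repeated_phrase and the sample sentence are invented for the example.

```python
import re
from collections import Counter


def longest_repeated_phrase(text):
    """Return every longest run of words that occurs more than once in `text`."""
    # Assumption: a "word" is a run of letters or apostrophes, case ignored.
    words = re.findall(r"[a-z']+", text.lower())
    best = []
    n = 1
    while True:
        # Count every phrase of exactly n consecutive words.
        grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
        repeated = [g for g, c in grams.items() if c > 1]
        if not repeated:
            return best  # nothing of length n repeats, so length n - 1 was the longest
        best = repeated
        n += 1


sample = ("heaven and earth shall pass away but the cat sat "
          "and heaven and earth shall pass")
for phrase in longest_repeated_phrase(sample):
    print(" ".join(phrase))
```

The loop relies on one fact: if a phrase of n words repeats, so does every shorter phrase inside it. So the search can stop at the first length where nothing repeats. Note that overlapping occurrences are counted as repeats in this sketch.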

Well, that concludes the discussion for today. I am quite grateful to Nancy for the opportunity to "splash around" here - and I am happy to say this is not farewell. She has asked, and I have agreed, to make a weekly appearance here, with bits of Chestertonian wit and technology. Can you guess what weekday it will be?

2 comments:

  1. Next challenge: find the longest string of repeated ideas. Does GKC start by discussing the need for "wonder", then move to the importance of "smallness / home / the particular", and then comment on "sacramentalism"? Are his ideas interrelated, and does he occasionally present strings of them in the same order?

  2. We've started that work too, but it cannot be done mechanically, at least at the present stage of our resources. We call these things "GKC motifs" - the collection might be part of the annotations for The Everlasting Man - but more likely it will be an independent work.

    I'll post something on this over on my own blogg...
