Thursday, October 01, 2009

Thinking of Roses - and Signatures

Since today, Thursday October 1, is the feast day of the great Little Flower, Saint Thérèse of Lisieux, I naturally was going... eh?

Oh, excuse me. That's Doctor Little Flower!

Of course it is... what a great thought for us, this dear little sister who, though cloistered, helps out those who live far away... amazing.

Well, perhaps I ought to do something doctoral to celebrate.

All right. Let's try using my very own doctoral skills on Chesterton. Note: don't try this at home, or even on the INTERNET, I mean out here in the e-cosmos. I am a professional, and know what I'm getting into. You might get all kinds of nondeterministic effects, after all! Whew. All right, now that I took care of my warning, let's go.

I wondered whether there was some quick way to get a glimpse of Chesterton's vocabulary use over his writing career, and realized that my work on the uniqueness of rRNA strings could readily be applied to his writing. I thought it would be interesting to learn what words might appear in exactly ONE year of his ILN essays. If we could acquire the signatures, grouped by years, for his ILN essays, we might get a hint of how his vocabulary altered.

What's a signature? That's what we called a portion of an rRNA sequence which we found to be unique to a given species. That is, the signature is some sequence of RNA bases which appears only in one species, and in no other, so it can therefore act as a signature of that species. In the same way, if there is a word which appears only in GKC's ILN essays for 1911 (say), but never in any other of his ILN essays, then that word is a signature for 1911. (As you will learn, "signatory" is a signature for 1911 - talk about paradoxes!)

So I dusted off the machinery, rubbed my hands a few times, said the usual starting prayers for software development - hey, I use 13th century metaphysics, since I want to get things done (see Heretics CW1:46 for more!). Then I proceeded with the experiment. Heh, heh, heh. (No, that's NOT my usual "hee hee" - that's the doctoral mad scientist laugh. We doctors take special classes to learn to do it effectively, along with how to wear those funny little beanies, and Latin, and all kinds of fun things. It's great.)

Since I am also an engineer, I used some tricks, and devised a tidy little linear-time algorithm (which took lots less time than it does to tell you). And then I wrote the program, and ran it. (Actually I run the program as I write it, which was something I learned to do long before I became a doctor.) And I got some interesting results - and then I also checked the results, since I know what happens when one does not check one's work... it makes one's boss very unhappy, and one's customer FURIOUS... But things looked good, so I decided I could risk telling you here.
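
For the curious, here is a minimal sketch of the signature idea in Python. It is an illustration only, not the actual program: the function name, the word-splitting rule, and the two toy "essays" (made-up phrases, not quotations from GKC) are all assumptions. It makes one pass over the text and records the years in which each word occurs, so the work grows roughly in proportion to the number of words read.

```python
import re
from collections import defaultdict


def signatures_by_year(essays_by_year):
    """Return {year: sorted words that occur in that year and in no other}.

    essays_by_year maps a year to a list of essay texts (a hypothetical
    input layout chosen for this sketch).
    """
    years_seen = defaultdict(set)  # word -> set of years in which it occurs
    for year, texts in essays_by_year.items():
        for text in texts:
            # Lower-case first, so "The" and "the" count as the same word.
            # Assumed word rule: letters, with internal hyphens/apostrophes.
            for word in re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower()):
                years_seen[word].add(year)

    result = {year: [] for year in essays_by_year}
    for word, years in years_seen.items():
        if len(years) == 1:  # seen in exactly one year: a signature
            (year,) = years
            result[year].append(word)
    for words in result.values():
        words.sort()
    return result


# Toy data, invented for illustration only.
sample = {
    1905: ["The aggregate of opinion"],
    1906: ["The opinion of circumlocutions"],
}
print(signatures_by_year(sample))
```

Note that in this sketch a word repeated several times within a single year still counts as a signature of that year; only its appearance in a second year disqualifies it.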

Of course 1905 and 1936 are the smallest, since he only wrote for parts of those years, and some others (such as 1915 and 1920) are also on the low side. As I examined the list I noted that there are some indications that a handful of words are still spelled incorrectly in AMBER, and there are a few hyphenation issues also. But I did some checks, and the signatures appear to be authentic:
For example, "aggregate" only appears in 1905 and "circumlocutions" in 1906 and "Ecuador" in 1909.... but there are a goodly number of others.

Here is the list of the years and signatures:
1905 171
1906 718
1907 643
1908 528
1909 694
1910 685
1911 630
1912 574
1913 508
1914 526
1915 347
1916 616
1917 419
1918 285
1919 328
1920 260
1921 303
1922 381
1923 452
1924 420
1925 433
1926 450
1927 469
1928 441
1929 392
1930 478
1931 518
1932 489
1933 495
1934 486
1935 406
1936 227

And here is a graph showing the same information:

[Graph: number of signatures per year, 1905-1936]

Very curious, you say, but what does it mean?

Well... one might make any sort of argument about what all this means, but I am not trying to argue anything at all. I merely wanted to give a suitable tribute as a Chestertonian Computer Scientist (and a doctor) to our Doctor Little Flower for her feast day. I am sure she will have a good laugh with GKC and FBC about it.

3 comments:

  1. Doctor, what is the total "aggregate" of the signatures? I mean, the sum of the right-hand column? Because - if I understand your experiment correctly - that number (that is, the number of words which Chesterton was able to use ONLY ONCE in a career of professional writing) would be a wonderful snapshot of the depth of his vocabulary. Also, I'd be curious as to whether he used a given word only in a certain year but multiple times that year (say, for example, he used "Kaiser" in 1914 only, but did so fifteen times in a certain series of essays).

  2. Ah... the sum is 14772 (if I typed correctly), but there are a number of complications to the larger issue, as you may suspect.

    Remember that I am ONLY treating his ILN essays (grouped by year) - the signatures apply only within those ILN-years, not to the entire AMBER collection.

    I hesitate to quote exact figures, since I am all too aware of the typos, but I will state that my software has reported the total UNIQUE words appearing within AMBER as over 50,000. Bear in mind that this is an absolute wordcount, as it distinguishes "words" - hence it includes every variation (e.g. plurals of nouns, comparative/superlative of adjectives, inflected endings of verbs) - though I have already removed the typographical upper/lower "case".

    (Yes, the signatures can (and sometimes do) appear multiple times within a given year, but I don't have that data reported.)

    Just for your amazement, I can tell you that the number of singular words - that is, words which appear once within AMBER, and so are formally signatures of their containing entity - that number is "over 20,000". That includes numbers, as appear in this hilarious line:

    While a teacher is considered enlightened and even advanced if he firmly refuses to teach more than five and a half babies how to dissect a dandelion, a system of teaching is also considered enlightened and advanced if it can boast that 5,000,000.05 babies are all dissecting exactly the same sort of dandelions at exactly the same instant of time.
    [GKC ILN May 28 1921 CW32:175]

    Yes, I too am dissecting a dandelion here! Hee hee.

  3. With respect, Dr. Thursday (and Joey), I doubt that these figures reveal much about the development of Chesterton's style or the extent of his vocabulary. Since the data were drawn from GKC's essays for the ILN (which often -- though not always -- dealt with topical matters), the so-called "signature words" are likelier to reflect the newsworthy events of a particular year. (Remember how the word "chad" suddenly started appearing in newspapers and news magazines in the wake of the 2000 election? How often have you encountered it recently?)

    Why did Chesterton use the word "Ecuador" only in 1909? I have no idea. Perhaps there was an earthquake or a revolution there in 1909, or Chesterton might have used the name more or less at random in making one of his vivid analogies or illustrations. ("Now a man may become a millionaire by sending tinned sardines to Ecuador" -- I just made that up, but it's the sort of thing GKC might have written.)

    The figures do show that Chesterton had a large and supple vocabulary -- but surely we all know that!
