|
|
 | | From: | Guy Macon | | Subject: | Guy Macon's adventures with ASCII character frequency | | Date: | Sat, 22 Jan 2005 13:12:14 +0000 |
|
|
 | http://deafandblind.com/word_frequency.htm gives me the following data:
-------------------------------------------
Letter Frequency in the English Language e t a o i n s r h l d c u m f p g w y b v k x j q z
Letter Frequency in Press Reporting e t a o n i s r h l d c m u f p g w y b v k j x q z
Letter Frequency in Religious Writings e t i a o n s r h l d c u m f p y w g b v k x j q z
Letter Frequency in Scientific Writings e t a i o n s r h l c d u m f p g y b w v k x q j z
Letter Frequency of the most common first letters in a word t o a w b c d s f m r h i y e g l n o u j k
Letter Frequency of the most common second letter in a word h o e i a u n r t
Letter Frequency of the most common third letter in a word e s a r n i
Letter Frequency of the most common last letter in a word e s t d n r y f l o g h a k m p u w
More than half of all words end with e t d s
Letter Frequency of the letters most likely to follow the e r s n d
Digraph Frequency in the English Language th he an in er on re ed nd ha at en es of nt ea ti to io le is ou ar as de rt ve...
Trigraph Frequency in the English Language the and tha ent ion tio for nde has nce tis oft men...
Double Letter Frequency in the English Language ss ee tt ff ll mm oo...
-------------------------------------------
...but that's not exactly what I am looking for.
I want a list that includes the space, punctuation, numerals, upper case and lower case, not just letters. I strongly suspect that the space character is more common than E or e is, for example.
Does anyone know where I can find such a list?
How about one with all the possible Digraphs?
-- Guy Macon
|
|
 | | From: | Douglas A. Gwyn | | Subject: | Re: Guy Macon's adventures with ASCII character frequency | | Date: | Sat, 22 Jan 2005 17:48:57 -0500 |
|
|
 | Guy Macon wrote:
> I want a list that includes the space, punctuation,
> numerals, upper case and lower case, not just letters.
> I strongly suspect that the space character is more
> common than E or e is, for example.
The letter E occurs about 17% of the time on the average in "telegraphic" English text (which has no nonalphabetic characters, and spells out some punctuation). Since the average English word size is about five characters, if space is included as a word separator that means that space occurs about 17% of the time, and E occurs about 14% of the time.
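The arithmetic in the paragraph above can be checked with a short sketch. It assumes, as the post does, an average word of five letters followed by exactly one space; the function name `with_spaces` is made up for illustration, and the 17% input is the post's figure, not an independent measurement.

```python
# Rescale a letter frequency measured on letters-only ("telegraphic") text
# to text in which every word is followed by one space.
# Assumption: average word length of 5 letters, as stated in the post.

def with_spaces(letter_freq, avg_word_len=5.0):
    """Return (space_freq, rescaled_letter_freq) once spaces are counted."""
    chars_per_word = avg_word_len + 1.0          # letters plus one separator
    space_freq = 1.0 / chars_per_word            # one space per word
    rescaled = letter_freq * avg_word_len / chars_per_word
    return space_freq, rescaled

space, e = with_spaces(0.17)
print(f"space: {space:.1%}, E: {e:.1%}")   # space: 16.7%, E: 14.2%
```

Feeding in a different letters-only figure for E rescales the same way, so the space estimate does not depend on which value for E one accepts.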
> Does anyone know where I can find such a list?
If you want to determine relative character frequencies in a corpus of representative text stored in files, it is easy to do so with a simple computer program.
> How about one with all the possible Digraphs?
All digraphs are *possible*, QXIHML. If you want to determine their relative frequencies in a corpus of representative text stored in files, it is easy to do so with a simple computer program.
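For readers who want to see what such a "simple computer program" might look like, here is a minimal sketch in Python. It counts every character (spaces, digits, punctuation and both cases included) and every pair of adjacent characters. The function names and the sample string are placeholders, not anything from the thread; a real run would read a corpus from files.

```python
from collections import Counter

def char_frequencies(text):
    """Relative frequency of every character, including space and punctuation."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def digraph_frequencies(text):
    """Relative frequency of each pair of adjacent characters."""
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

# Placeholder sample; substitute the contents of your own corpus files.
sample = "the quick brown fox jumps over the lazy dog and the cat"
chars = char_frequencies(sample)
pairs = digraph_frequencies(sample)

# Show the five most frequent characters, most common first.
for ch, f in sorted(chars.items(), key=lambda kv: -kv[1])[:5]:
    print(repr(ch), f"{f:.3f}")
```

The counting is the easy part; the result is only as meaningful as the text fed into it.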
|
|
 | | From: | Joe Peschel | | Subject: | Re: Guy Macon's adventures with ASCII character frequency | | Date: | Sun, 23 Jan 2005 05:51:19 -0000 |
|
|
 | "Douglas A. Gwyn" wrote in news:-PCdnb_Qo_XMRW_cRVn-sQ@comcast.com:
> The letter E occurs about 17% of the time on the > average in "telegraphic" English text (which has > no nonalphabetic characters, and spells out some > punctuation).
Seventeen percent? Kullback found around 12.7-13.7 percent. What am I misunderstanding?
J
-- __________________________________________ When will Bush be tried for war crimes?
"Our enemies are innovative and resourceful, and so are we. They never stop thinking about new ways to harm our country and our people, and neither do we." --G. W. B.
Joe Peschel D.O.E. SysWorks http://members.aol.com/jpeschel/index.htm __________________________________________
|
|
 | | From: | Guy Macon | | Subject: | Re: Guy Macon's adventures with ASCII character frequency | | Date: | Sun, 23 Jan 2005 01:23:49 +0000 |
|
|
 | Douglas A. Gwyn wrote:
>Guy Macon wrote:
>
>> I want a list that includes the space, punctuation,
>> numerals, upper case and lower case, not just letters.
>> I strongly suspect that the space character is more
>> common than E or e is, for example.
>
>The letter E occurs about 17% of the time on the
>average in "telegraphic" English text (which has
>no nonalphabetic characters, and spells out some
>punctuation). Since the average English word
>size is about five characters, if space is
>included as a word separator that means that space
>occurs about 17% of the time, and E occurs about
>14% of the time.
>
>> Does anyone know where I can find such a list?
>
>If you want to determine relative character frequencies
>in a corpus of representative text stored in files, it
>is easy to do so with a simple computer program.
>
>> How about one with all the possible Digraphs?
>
>All digraphs are *possible*, QXIHML. If you want to
>determine their relative frequencies in a corpus of
>representative text stored in files, it is easy to do
>so with a simple computer program.
 | Of course I can. That's trivial. Collecting the corpus and making an intelligent guess as to whether it is representative is not. Given how many web pages have this data for a-z, I would be surprised if there wasn't at least one that has the data for all ASCII characters.
-- Guy Macon
|
|
 | | From: | Peter Pearson | | Subject: | Re: Guy Macon's adventures with ASCII character frequency | | Date: | Sat, 22 Jan 2005 18:06:09 -0800 |
|
|
 | Guy Macon wrote:
> Douglas A. Gwyn wrote:
>>Guy Macon wrote:
>>
>>> I want a list that includes the space, punctuation,
>>> numerals, upper case and lower case, not just letters.
>>> I strongly suspect that the space character is more
>>> common than E or e is, for example.
[snip]
>>> Does anyone know where I can find such a list?
>>
>>If you want to determine relative character frequencies
>>in a corpus of representative text stored in files, it
>>is easy to do so with a simple computer program.
[snip]
> Of course I can. That's trivial. Collecting the corpus
> and making an intelligent guess as to whether it is
> representative is not.
If everybody agrees that the frequencies vary with the "kind" of text, then perhaps we should ask how much better it is to have frequencies averaged over all kinds of text, rather than frequencies averaged over whatever kinds of text happen to be handy. My bet is that, if we're dealing with text samples of modest size, the sample-to-sample variation within a text kind is large compared with the kind-to-kind variation.
(I remember statistics on same-kind, same-author samples going way out of bounds on a message that included a short discussion of kayaking.)
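One way to put the bet above to a test is to compare character-frequency tables pairwise and see whether samples of the same kind sit closer together than samples of different kinds. The sketch below uses total variation distance as the measure, which is a choice made here and not something proposed in the thread; the sample texts and kind labels are hypothetical placeholders and far too short to show anything real.

```python
from collections import Counter

def char_frequencies(text):
    """Relative frequency of every character in the text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def total_variation(p, q):
    """Half the L1 distance between two frequency tables (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def mean_distance(texts_a, texts_b):
    """Mean distance over all pairs from the two lists; within one list, each pair once."""
    dists = [total_variation(char_frequencies(a), char_frequencies(b))
             for i, a in enumerate(texts_a)
             for j, b in enumerate(texts_b)
             if not (texts_a is texts_b and i >= j)]
    return sum(dists) / len(dists)

# Hypothetical placeholder samples; replace with real texts grouped by kind
# (each kind needs at least two samples for the within-kind comparison).
samples = {
    "press":   ["The council voted on Tuesday.", "Officials said the road will reopen."],
    "science": ["The sample was heated to 300 K.", "We measured the decay rate twice."],
}

within = mean_distance(samples["press"], samples["press"])
between = mean_distance(samples["press"], samples["science"])
print(f"within-kind: {within:.3f}  between-kind: {between:.3f}")
```

If the within-kind figure is comparable to the between-kind figure for samples of the size you care about, averaging over "all kinds of text" buys little.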
-- Peter Pearson To get my email address, substitute: nowhere -> spamcop, invalid -> net
|
|
 | | From: | Guy Macon | | Subject: | Re: Guy Macon's adventures with ASCII character frequency | | Date: | Sun, 23 Jan 2005 03:18:12 +0000 |
|
|
 | Peter Pearson wrote:
>If everybody agrees that the frequencies vary with the "kind" >of text, then perhaps we should ask how much better it is to >have frequencies averaged over all kinds of text, rather than >frequencies averaged over whatever kinds of text happen to be >handy. My bet is that, if we're dealing with text samples of >modest size, the sample-to-sample variation within a text kind >is large compared with the kind-to-kind variation.
Good point.
I rather suspect that space will always be more frequent than A, A more frequent than V, and V more frequent than ~, but I am less confident about where 0 or 9 or . or ' should be. Just knowing that would be a big help in trying to crack ciphertexts.
|
|
 | | From: | Mok-Kong Shen | | Subject: | Re: Guy Macon's adventures with ASCII character frequency | | Date: | Sun, 23 Jan 2005 15:09:52 +0100 |
|
|
 |
Guy Macon wrote:
> I rather suspect that space will always be more frequent than A, > A more frequent than V, and V more frequent than ~, but I am > less confident about where 0 or 9 or . or ' should be. Just > knowing that would be a big help in trying to crack ciphertexts.
I don't know but I guess that frequency analysis probably wouldn't help you very much in cracking a good modern cipher. For encryption with classical ciphers, on the other hand, one would even today likely retain the traditional way of writing messages, i.e. confining oneself to the use of 25 or 26 characters of the alphabet without spaces, I would think.
M. K. Shen
|
|
 | | From: | Guy Macon | | Subject: | Re: Guy Macon's adventures with ASCII character frequency | | Date: | Sun, 23 Jan 2005 14:59:15 +0000 |
|
|
 | Mok-Kong Shen wrote:
>
>Guy Macon wrote:
>
>> I rather suspect that space will always be more frequent than A,
>> A more frequent than V, and V more frequent than ~, but I am
>> less confident about where 0 or 9 or . or ' should be. Just
>> knowing that would be a big help in trying to crack ciphertexts.
>
>I don't know but I guess that frequency analysis probably
>wouldn't help you very much in cracking a good modern cipher.
Remember, I am a hobbyist. I am at the moment working on cracking a 4, 5, and 6-bit version of RC4 in less time than it would take to guess which of the possible permutations of the state array is in use.
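The post does not say how the reduced versions are defined. A common reading, assumed here, is that n-bit RC4 keeps the usual key-scheduling and output loops but uses a state array of 2^n entries with all arithmetic mod 2^n. The sketch below follows that assumption; the function name and the example key are illustrative only. It also prints the size of the permutation space that a brute-force guess of the state array would face.

```python
import math

def rc4_keystream(key, n_bits, count):
    """Generate `count` n-bit keystream words from RC4 scaled to 2**n_bits state entries."""
    size = 1 << n_bits
    if not key or any(not 0 <= k < size for k in key):
        raise ValueError("key must be a non-empty sequence of n-bit values")
    state = list(range(size))

    # Key scheduling: mix the key into the permutation.
    j = 0
    for i in range(size):
        j = (j + state[i] + key[i % len(key)]) % size
        state[i], state[j] = state[j], state[i]

    # Output generation: one n-bit word per step.
    i = j = 0
    out = []
    for _ in range(count):
        i = (i + 1) % size
        j = (j + state[i]) % size
        state[i], state[j] = state[j], state[i]
        out.append(state[(state[i] + state[j]) % size])
    return out

# Number of possible state-array permutations for each reduced size.
for n in (4, 5, 6):
    perms = math.factorial(1 << n)
    print(f"{n}-bit: {1 << n} entries, about 2^{math.log2(perms):.0f} permutations")

# Hypothetical 4-bit key, for illustration only.
print(rc4_keystream([1, 2, 3, 4], n_bits=4, count=8))
```

With `n_bits=8` the same function is ordinary RC4, which makes it easy to check the sketch against published test vectors before trusting the reduced variants.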
|
|
 | | From: | Mok-Kong Shen | | Subject: | Re: Guy Macon's adventures with ASCII character frequency | | Date: | Mon, 24 Jan 2005 00:25:52 +0100 |
|
|
 |
Guy Macon wrote:
> Mok-Kong Shen wrote:
> [snip]
>>I don't know but I guess that frequency analysis probably
>>wouldn't help you very much in cracking a good modern cipher.
>
> Remember, I am a hobbyist. I am at the moment working on cracking
> a 4, 5, and 6-bit version of RC4 in less time than it would take to
> guess which of the possible permutations of the state array is in use.
I'd just like to say that, according to some rather limited experimental data I know of, even the scaled-down 4-bit version of RC4 is fairly good statistically. I can't know, though, whether this fact has any significance for your work or not.
M. K. Shen
|
|
 | | From: | Guy Macon | | Subject: | Re: Guy Macon's adventures with ASCII character frequency | | Date: | Mon, 24 Jan 2005 02:48:03 +0000 |
|
|
 | Mok-Kong Shen wrote:
>I just like to say that, according to some rather limited >experimental data I know of, even the scaled down 4 bit version >of RC4 is fairly good statistically. I can't know, though, >whether this fact has any significance for your work or not.
It does. Finding a weakness in 4-bit RC4 might point the way to finding a weakness in 8-bit RC4.
-- Guy Macon
|
|
 | | From: | Bill Unruh | | Subject: | Re: Guy Macon's adventures with ASCII character frequency | | Date: | 23 Jan 2005 01:48:56 GMT |
|
|
 | Guy Macon writes:
>Douglas A. Gwyn wrote:
>>Guy Macon wrote:
>>
>>> I want a list that includes the space, punctuation,
>>> numerals, upper case and lower case, not just letters.
>>> I strongly suspect that the space character is more
>>> common than E or e is, for example.
>>
>>The letter E occurs about 17% of the time on the
>>average in "telegraphic" English text (which has
>>no nonalphabetic characters, and spells out some
>>punctuation). Since the average English word
>>size is about five characters, if space is
>>included as a word separator that means that space
>>occurs about 17% of the time, and E occurs about
>>14% of the time.
>>
>>> Does anyone know where I can find such a list?
>>
>>If you want to determine relative character frequencies
>>in a corpus of representative text stored in files, it
>>is easy to do so with a simple computer program.
>>
>>> How about one with all the possible Digraphs?
>>
>>All digraphs are *possible*, QXIHML. If you want to
>>determine their relative frequencies in a corpus of
>>representative text stored in files, it is easy to do
>>so with a simple computer program.
>Of course I can. That's trivial. Collecting the corpus
>and making an intelligent guess as to whether it is
>representative is not. Given how many web pages have
And what makes you think that that putative web page has made an intelligent guess? There is no such thing as "a representative sample". English is a language for communicating in many different situations. Is Shakespeare "representative"? Is Gore Vidal? Is Playboy? Is the National Enquirer? Representative of what?
>this data for a-z, I would be surprised if there wasn't >at least one that has the data for all ASCII characters.
Why don't you just make it up? It would be just as representative.
|
|
 | | From: | Bryan Olson | | Subject: | Re: Guy Macon's adventures with ASCII character frequency | | Date: | Sun, 23 Jan 2005 05:46:53 GMT |
|
|
 | Guy Macon wrote:
> Of course I can. That's trivial. Collecting the corpus
> and making an intelligent guess as to whether it is
> representative is not. Given how many web pages have
> this data for a-z, I would be surprised if there wasn't
> at least one that has the data for all ASCII characters.
Have you looked at the corpora the compression guys use, such as the Brown Corpus, Canterbury Corpus and Calgary Corpus?
-- --Bryan
|
|
 | | From: | Guy Macon | | Subject: | Re: Guy Macon's adventures with ASCII character frequency | | Date: | Sun, 23 Jan 2005 14:12:21 +0000 |
|
|
 | Bryan Olson wrote:
>
>Guy Macon wrote:
>> Of course I can. That's trivial. Collecting the corpus
>> and making an intelligent guess as to whether it is
>> representative is not. Given how many web pages have
>> this data for a-z, I would be surprised if there wasn't
>> at least one that has the data for all ASCII characters.
>
>Have you looked at the corpora the compression guys use,
>such as the Brown Corpus, Canterbury Corpus and
>Calgary Corpus?
I will now. Thanks!
http://www.es.ac.uk/linguistics/clmt/w3c/corpus_ling/content/corpora/list/
|
|
 | | From: | Bill Unruh | | Subject: | Re: Guy Macon's adventures with ASCII character frequency | | Date: | 23 Jan 2005 19:51:24 GMT |
|
|
 | Guy Macon <_see.web.page_@_www.guymacon.com_> writes:
>Mok-Kong Shen wrote:
>>
>>Guy Macon wrote:
>>
>>> I rather suspect that space will always be more frequent than A,
>>> A more frequent than V, and V more frequent than ~, but I am
>>> less confident about where 0 or 9 or . or ' should be. Just
>>> knowing that would be a big help in trying to crack ciphertexts.
>>
>>I don't know but I guess that frequency analysis probably
>>wouldn't help you very much in cracking a good modern cipher.
>Remember, I am a hobbyist. I am at the moment working on cracking >a 4, 5, and 6-bit version of RC4 in less time than it would take to >guess which of the possible permutations of the state array is in use.
In that case you want the frequency analysis, not of English as some amorphous thing, but of the specific kinds of text you want to decrypt. Communications between zoologists are liable to have x and z occurring far more often than in communication between computer scientists (for whom various braces and brackets are liable to be far more common).
|
|
 | | From: | Joe Peschel | | Subject: | Re: Guy Macon's adventures with ASCII character frequency | | Date: | Sat, 22 Jan 2005 20:46:13 -0000 |
|
|
 | Guy Macon wrote in news:10v4k9okg1k7vfe@corp.supernews.com:
> I want a list that includes the space, punctuation,
> numerals, upper case and lower case, not just letters.
> I strongly suspect that the space character is more
> common than E or e is, for example.
>
> Does anyone know where I can find such a list?
>
Wouldn't it be better to create your own list?
J
|
|
|