Current group: sci.crypt

Guy Macon's adventures with ASCII character frequency

From:Guy Macon
Subject:Guy Macon's adventures with ASCII character frequency
Date:Sat, 22 Jan 2005 13:12:14 +0000

http://deafandblind.com/word_frequency.htm
gives me the following data:

-------------------------------------------

Letter Frequency in the English Language
e t a o i n s r h l d c u m f p g w y b v k x j q z

Letter Frequency in Press Reporting
e t a o n i s r h l d c m u f p g w y b v k j x q z

Letter Frequency in Religious Writings
e t i a o n s r h l d c u m f p y w g b v k x j q z

Letter Frequency in Scientific Writings
e t a i o n s r h l c d u m f p g y b w v k x q j z

Letter Frequency of the most common first letters in a word
t o a w b c d s f m r h i y e g l n o u j k

Letter Frequency of the most common second letter in a word
h o e i a u n r t

Letter Frequency of the most common third letter in a word
e s a r n i

Letter Frequency of the most common last letter in a word
e s t d n r y f l o g h a k m p u w

More than half of all words end with
e t d s

Letter Frequency of the letters most likely to follow the e
r s n d

Digraph Frequency in the English Language
th he an in er on re ed nd ha at en es of
nt ea ti to io le is ou ar as de rt ve...

Trigraph Frequency in the English Language
the and tha ent ion tio for nde has nce tis oft men...

Double Letter Frequency in the English Language
ss ee tt ff ll mm oo...

-------------------------------------------

...but that's not exactly what I am looking for.

I want a list that includes the space, punctuation,
numerals, upper case and lower case, not just letters.
I strongly suspect that the space character is more
common than E or e is, for example.

Does anyone know where I can find such a list?

How about one with all the possible Digraphs?

--
Guy Macon
From:Douglas A. Gwyn
Subject:Re: Guy Macon's adventures with ASCII character frequency
Date:Sat, 22 Jan 2005 17:48:57 -0500
Guy Macon wrote:
> I want a list that includes the space, punctuation,
> numerals, upper case and lower case, not just letters.
> I strongly suspect that the space character is more
> common than E or e is, for example.

The letter E occurs about 17% of the time on the
average in "telegraphic" English text (which has
no nonalphabetic characters, and spells out some
punctuation). Since the average English word
size is about five characters, if space is
included as a word separator that means that space
occurs about 17% of the time, and E occurs about
14% of the time.
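
The arithmetic behind those two figures can be sketched as follows. This
is only an illustration, assuming every word is exactly five letters long
and is followed by exactly one space; the variable names are made up for
the example.

```python
# Assumed inputs, taken from the figures quoted in the post above.
avg_word_len = 5          # letters per word (rough average)
e_share_letters = 0.17    # share of E when only letters are counted

# One space per word means 1 character in every (5 + 1) is a space.
space_share = 1 / (avg_word_len + 1)

# The letters now make up only 5/6 of the stream, so E shrinks by that factor.
e_share_with_space = e_share_letters * avg_word_len / (avg_word_len + 1)

print(f"space: {space_share:.1%}, E: {e_share_with_space:.1%}")
```

Starting from a different letter-only share for E scales the second
figure in the same way.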

> Does anyone know where I can find such a list?

If you want to determine relative character frequencies
in a corpus of representative text stored in files, it
is easy to do so with a simple computer program.

> How about one with all the possible Digraphs?

All digraphs are *possible*, QXIHML. If you want to
determine their relative frequencies in a corpus of
representative text stored in files, it is easy to do
so with a simple computer program.
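
A minimal sketch of such a program, assuming the corpus has already been
read into a single string. The function names char_frequencies and
digraph_frequencies are invented for this example. Every character is
counted as-is, so space, punctuation, digits and both cases all appear in
the result.

```python
from collections import Counter


def char_frequencies(text):
    """Relative frequency of every character in text, most common first."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.most_common()}


def digraph_frequencies(text):
    """Relative frequency of every overlapping two-character sequence."""
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(pairs.values())
    return {dg: n / total for dg, n in pairs.most_common()}


if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog. The end."
    for ch, share in list(char_frequencies(sample).items())[:5]:
        print(repr(ch), f"{share:.3f}")
    for dg, share in list(digraph_frequencies(sample).items())[:5]:
        print(repr(dg), f"{share:.3f}")
```

The counting is the easy part; the result is only as meaningful as the
text that is fed in.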
From:Joe Peschel
Subject:Re: Guy Macon's adventures with ASCII character frequency
Date:Sun, 23 Jan 2005 05:51:19 -0000
"Douglas A. Gwyn" wrote in news:-PCdnb_Qo_XMRW_cRVn-sQ@comcast.com:

> The letter E occurs about 17% of the time on the
> average in "telegraphic" English text (which has
> no nonalphabetic characters, and spells out some
> punctuation).

Seventeen percent? Kullback found around 12.7-13.7 percent. What am I
misunderstanding?

J

--
__________________________________________
When will Bush be tried for war crimes?

"Our enemies are innovative and resourceful, and so are we. They
never stop thinking about new ways to harm our country and our
people, and neither do we." --G. W. B.

Joe Peschel
D.O.E. SysWorks
http://members.aol.com/jpeschel/index.htm
__________________________________________
From:Guy Macon
Subject:Re: Guy Macon's adventures with ASCII character frequency
Date:Sun, 23 Jan 2005 01:23:49 +0000

Douglas A. Gwyn wrote:

>Guy Macon wrote:
>
>> I want a list that includes the space, punctuation,
>> numerals, upper case and lower case, not just letters.
>> I strongly suspect that the space character is more
>> common than E or e is, for example.
>
>The letter E occurs about 17% of the time on the
>average in "telegraphic" English text (which has
>no nonalphabetic characters, and spells out some
>punctuation). Since the average English word
>size is about five characters, if space is
>included as a word separator that means that space
>occurs about 17% of the time, and E occurs about
>14% of the time.
>
>> Does anyone know where I can find such a list?
>
>If you want to determine relative character frequencies
>in a corpus of representative text stored in files, it
>is easy to do so with a simple computer program.
>
>> How about one with all the possible Digraphs?
>
>All digraphs are *possible*, QXIHML. If you want to
>determine their relative frequencies in a corpus of
>representative text stored in files, it is easy to do
>so with a simple computer program.

Of course I can. That's trivial. Collecting the corpus
and making an intelligent guess as to whether it is
representative is not. Given how many web pages have
this data for a-z, I would be surprised if there wasn't
at least one that has the data for all ASCII characters.

--
Guy Macon
From:Peter Pearson
Subject:Re: Guy Macon's adventures with ASCII character frequency
Date:Sat, 22 Jan 2005 18:06:09 -0800
Guy Macon wrote:
> Douglas A. Gwyn wrote:
>>Guy Macon wrote:
>>
>>> I want a list that includes the space, punctuation,
>>> numerals, upper case and lower case, not just letters.
>>> I strongly suspect that the space character is more
>>> common than E or e is, for example.
[snip]
>>> Does anyone know where I can find such a list?
>>
>>If you want to determine relative character frequencies
>>in a corpus of representative text stored in files, it
>>is easy to do so with a simple computer program.
[snip]
> Of course I can. That's trivial. Collecting the corpus
> and making an intelligent guess as to whether it is
> representative is not.

If everybody agrees that the frequencies vary with the "kind"
of text, then perhaps we should ask how much better it is to
have frequencies averaged over all kinds of text, rather than
frequencies averaged over whatever kinds of text happen to be
handy. My bet is that, if we're dealing with text samples of
modest size, the sample-to-sample variation within a text kind
is large compared with the kind-to-kind variation.

(I remember statistics on same-kind, same-author samples going
way out of bounds on a message that included a short discussion
of kayaking.)
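
One way to put that bet to the test is to measure how far apart two
samples' character distributions are, then compare the distances between
samples of the same kind with the distances between samples of different
kinds. The sketch below uses total variation distance, which is just one
convenient choice and not something proposed in this thread; the function
names are invented for the example.

```python
from collections import Counter


def char_distribution(text):
    """Map each character of a non-empty text to its relative frequency."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def total_variation(text_a, text_b):
    """Total variation distance between two character distributions.

    0.0 means identical frequencies; 1.0 means no characters in common.
    """
    p = char_distribution(text_a)
    q = char_distribution(text_b)
    return 0.5 * sum(abs(p.get(ch, 0.0) - q.get(ch, 0.0))
                     for ch in set(p) | set(q))
```

If the distances between short samples of the same kind come out about as
large as the distances between kinds, that supports the bet.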

--
Peter Pearson
To get my email address, substitute:
nowhere -> spamcop, invalid -> net
From:Guy Macon
Subject:Re: Guy Macon's adventures with ASCII character frequency
Date:Sun, 23 Jan 2005 03:18:12 +0000

Peter Pearson wrote:

>If everybody agrees that the frequencies vary with the "kind"
>of text, then perhaps we should ask how much better it is to
>have frequencies averaged over all kinds of text, rather than
>frequencies averaged over whatever kinds of text happen to be
>handy. My bet is that, if we're dealing with text samples of
>modest size, the sample-to-sample variation within a text kind
>is large compared with the kind-to-kind variation.

Good point.

I rather suspect that space will always be more frequent than A,
A more frequent than V, and V more frequent than ~, but I am
less confident about where 0 or 9 or . or ' should be. Just
knowing that would be a big help in trying to crack ciphertexts.
From:Mok-Kong Shen
Subject:Re: Guy Macon's adventures with ASCII character frequency
Date:Sun, 23 Jan 2005 15:09:52 +0100


Guy Macon wrote:

> I rather suspect that space will always be more frequent than A,
> A more frequent than V, and V more frequent than ~, but I am
> less confident about where 0 or 9 or . or ' should be. Just
> knowing that would be a big help in trying to crack ciphertexts.

I don't know but I guess that frequency analysis probably
wouldn't help you very much in cracking a good modern cipher.
For encryption with classical ciphers, on the other hand, one
would even today likely retain the traditional way of writing
messages, i.e. confining oneself to the use of 25 or 26
characters of the alphabet without spaces, I would think.

M. K. Shen
From:Guy Macon
Subject:Re: Guy Macon's adventures with ASCII character frequency
Date:Sun, 23 Jan 2005 14:59:15 +0000

Mok-Kong Shen wrote:
>
>Guy Macon wrote:
>
>> I rather suspect that space will always be more frequent than A,
>> A more frequent than V, and V more frequent than ~, but I am
>> less confident about where 0 or 9 or . or ' should be. Just
>> knowing that would be a big help in trying to crack ciphertexts.
>
>I don't know but I guess that frequency analysis probably
>wouldn't help you very much in cracking a good modern cipher.

Remember, I am a hobbyist. I am at the moment working on cracking
a 4, 5, and 6-bit version of RC4 in less time than it would take to
guess which of the possible permutations of the state array is in use.
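
For readers who have not seen a reduced RC4, here is a minimal sketch of
one common way to scale it down: an n-bit version keeps a state array of
2^n entries and does all index arithmetic modulo 2^n. The thread does not
say exactly how the 4, 5 and 6-bit versions are defined, so this layout is
an assumption, and the names rc4_keystream and state_permutations are
invented for the example.

```python
from math import factorial


def rc4_keystream(key, n_bits, count):
    """Return `count` keystream words from RC4 scaled to n_bits-wide words.

    `key` is a non-empty sequence of integers; each is reduced modulo 2**n_bits.
    With n_bits=8 this is ordinary RC4.
    """
    size = 1 << n_bits
    mask = size - 1

    # Key scheduling: start from the identity permutation and shuffle it.
    s = list(range(size))
    j = 0
    for i in range(size):
        j = (j + s[i] + key[i % len(key)]) & mask
        s[i], s[j] = s[j], s[i]

    # Output generation: keep swapping and emit one state entry per step.
    i = j = 0
    out = []
    for _ in range(count):
        i = (i + 1) & mask
        j = (j + s[i]) & mask
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) & mask])
    return out


def state_permutations(n_bits):
    """Number of possible orderings of the state array alone."""
    return factorial(1 << n_bits)


if __name__ == "__main__":
    print(rc4_keystream([1, 2, 3, 4], 4, 8))
    print(state_permutations(4))
```

For the 4-bit case the state array has 16 entries, so there are 16!
(roughly 2.09 x 10^13) possible orderings, before the two index values are
taken into account.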
From:Mok-Kong Shen
Subject:Re: Guy Macon's adventures with ASCII character frequency
Date:Mon, 24 Jan 2005 00:25:52 +0100


Guy Macon wrote:

> Mok-Kong Shen wrote:
>
[snip]
>>I don't know but I guess that frequency analysis probably
>>wouldn't help you very much in cracking a good modern cipher.
>
> Remember, I am a hobbyist. I am at the moment working on cracking
> a 4, 5, and 6-bit version of RC4 in less time than it would take to
> guess which of the possible permutations of the state array is in use.

I would just like to say that, according to some rather limited
experimental data I know of, even the scaled down 4 bit version
of RC4 is fairly good statistically. I can't know, though,
whether this fact has any significance for your work or not.

M. K. Shen
From:Guy Macon
Subject:Re: Guy Macon's adventures with ASCII character frequency
Date:Mon, 24 Jan 2005 02:48:03 +0000

Mok-Kong Shen wrote:

>I would just like to say that, according to some rather limited
>experimental data I know of, even the scaled down 4 bit version
>of RC4 is fairly good statistically. I can't know, though,
>whether this fact has any significance for your work or not.

It does. Finding a weakness in 4-bit RC4 might point the way to
finding a weakness in 8-bit RC4.

--
Guy Macon
From:Bill Unruh
Subject:Re: Guy Macon's adventures with ASCII character frequency
Date:23 Jan 2005 01:48:56 GMT
Guy Macon writes:


>Douglas A. Gwyn wrote:

>>Guy Macon wrote:
>>
>>> I want a list that includes the space, punctuation,
>>> numerals, upper case and lower case, not just letters.
>>> I strongly suspect that the space character is more
>>> common than E or e is, for example.
>>
>>The letter E occurs about 17% of the time on the
>>average in "telegraphic" English text (which has
>>no nonalphabetic characters, and spells out some
>>punctuation). Since the average English word
>>size is about five characters, if space is
>>included as a word separator that means that space
>>occurs about 17% of the time, and E occurs about
>>14% of the time.
>>
>>> Does anyone know where I can find such a list?
>>
>>If you want to determine relative character frequencies
>>in a corpus of representative text stored in files, it
>>is easy to do so with a simple computer program.
>>
>>> How about one with all the possible Digraphs?
>>
>>All digraphs are *possible*, QXIHML. If you want to
>>determine their relative frequencies in a corpus of
>>representative text stored in files, it is easy to do
>>so with a simple computer program.

>Of course I can. That's trivial. Collecting the corpus
>and making an intelligent guess as to whether it is
>representative is not. Given how many web pages have

And what makes you think that that putative web page has made an
intelligent guess? There is no such thing as "a representative sample".
English is a language for communicating in many different situations. Is
Shakespeare "representative"? Is Gore Vidal? Is Playboy? Is the National
Enquirer? Representative of what?


>this data for a-z, I would be surprised if there wasn't
>at least one that has the data for all ASCII characters.

Why don't you just make it up? It would be just as representative.
From:Bryan Olson
Subject:Re: Guy Macon's adventures with ASCII character frequency
Date:Sun, 23 Jan 2005 05:46:53 GMT
Guy Macon wrote:
> Of course I can. That's trivial. Collecting the corpus
> and making an intelligent guess as to whether it is
> representative is not. Given how many web pages have
> this data for a-z, I would be surprised if there wasn't
> at least one that has the data for all ASCII characters.

Have you looked at the corpora the compression guys use,
such as the Brown Corpus, Canterbury Corpus and
Calgary Corpus?


--
--Bryan
From:Guy Macon
Subject:Re: Guy Macon's adventures with ASCII character frequency
Date:Sun, 23 Jan 2005 14:12:21 +0000
Bryan Olson wrote:
>
>
>Guy Macon wrote:
>> Of course I can. That's trivial. Collecting the corpus
>> and making an intelligent guess as to whether it is
>> representative is not. Given how many web pages have
>> this data for a-z, I would be surprised if there wasn't
>> at least one that has the data for all ASCII characters.
>
>Have you looked at the corpora the compression guys use,
>such as the Brown Corpus, Canterbury Corpus and
>Calgary Corpus?

I will now. Thanks!

http://www.es.ac.uk/linguistics/clmt/w3c/corpus_ling/content/corpora/list/
From:Bill Unruh
Subject:Re: Guy Macon's adventures with ASCII character frequency
Date:23 Jan 2005 19:51:24 GMT
Guy Macon <_see.web.page_@_www.guymacon.com_> writes:


>Mok-Kong Shen wrote:
>>
>>Guy Macon wrote:
>>
>>> I rather suspect that space will always be more frequent than A,
>>> A more frequent than V, and V more frequent than ~, but I am
>>> less confident about where 0 or 9 or . or ' should be. Just
>>> knowing that would be a big help in trying to crack ciphertexts.
>>
>>I don't know but I guess that frequency analysis probably
>>wouldn't help you very much in cracking a good modern cipher.

>Remember, I am a hobbyist. I am at the moment working on cracking
>a 4, 5, and 6-bit version of RC4 in less time than it would take to
>guess which of the possible permutations of the state array is in use.

In that case you want the frequency analysis, not of English as some
amorphous thing, but of the specific kinds of text you want to decrypt.
Communications between zoologists are liable to have x and z occurring far
more often than in communication between computer scientists (for whom
various braces and brackets are liable to be far more common).
From:Joe Peschel
Subject:Re: Guy Macon's adventures with ASCII character frequency
Date:Sat, 22 Jan 2005 20:46:13 -0000
Guy Macon wrote in
news:10v4k9okg1k7vfe@corp.supernews.com:

> I want a list that includes the space, punctuation,
> numerals, upper case and lower case, not just letters.
> I strongly suspect that the space character is more
> common than E or e is, for example.
>
> Does anyone know where I can find such a list?
>

Wouldn't it be better to create your own list?

J

   
