Current group: comp.realtime

MTBF for Windows and Linux

From:Ian Mckay
Subject:MTBF for Windows and Linux
Date:25 Nov 2004 12:24:59 -0800
hi,

I'm doing a project on the suitability (or lack thereof) of Windows
and Linux for use in critical systems. Try as I might, I can't find any
articles discussing this topic (only heated discussions in the odd
forum). I am particularly interested in finding figures for the mean
time between failures for these two OSes, as well as common faults that
are well known. If anyone can point me in the right direction, that
would be very helpful.
From:Joseph Gwinn
Subject:Re: MTBF for Windows and Linux
Date:Fri, 26 Nov 2004 04:36:32 GMT
In article ,
imm102@hotmail.com (Ian Mckay) wrote:

> hi,
>
> I'm doing a project on the suitability (or lack thereof) of Windows
> and Linux for use in critical systems. Try as I might, I can't find any
> articles discussing this topic (only heated discussions in the odd
> forum). I am particularly interested in finding figures for the mean
> time between failures for these two OSes, as well as common faults that
> are well known. If anyone can point me in the right direction, that
> would be very helpful.

The whole idea of computing MTBFs for software is suspect: software is
not hardware. Software doesn't fail because long use wears something
out, it fails because a new situation or sequence of events caused a new
path in the code to be taken, and this newly executed code was somehow
flawed.

That said, Windows fails far more often and floridly than Linux,
although neither is intended for high reliability applications.

Is realtime performance also needed? Will embedded hardware be
controlled?

When you say "critical systems", what do you mean? What is the
application, and what are the consequences of failure? Who dies?

Joe Gwinn
From:Paul Keinanen
Subject:Re: MTBF for Windows and Linux
Date:Fri, 26 Nov 2004 09:55:27 +0200
On Fri, 26 Nov 2004 04:36:32 GMT, Joseph Gwinn
wrote:

>The whole idea of computing MTBFs for software is suspect: software is
>not hardware. Software doesn't fail because long use wears something
>out, it fails because a new situation or sequence of events caused a new
>path in the code to be taken, and this newly executed code was somehow
>flawed.

Software can "wear" out after long use. A typical example would be
dynamic memory pool fragmentation in a badly designed system, which
might cause a failure after a year or two, when there is not a single
contiguous block with sufficient size available for allocation.
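
The fragmentation effect described above can be illustrated with a toy
allocator. This is only a sketch: the 16-slot pool, the first-fit policy and
the helper names are assumptions made for the example, not a model of any
real memory manager.

```python
def largest_free_run(pool):
    """Return the length of the longest contiguous run of free (None) slots."""
    best = run = 0
    for slot in pool:
        run = run + 1 if slot is None else 0
        best = max(best, run)
    return best


def allocate(pool, size, tag):
    """First-fit: claim `size` contiguous free slots, or return None."""
    run = 0
    for i, slot in enumerate(pool):
        run = run + 1 if slot is None else 0
        if run == size:
            start = i - size + 1
            pool[start:i + 1] = [tag] * size
            return start
    return None


def free(pool, tag):
    """Release every slot owned by `tag`."""
    for i, slot in enumerate(pool):
        if slot == tag:
            pool[i] = None


pool = [None] * 16                   # hypothetical 16-slot pool
for n in range(16):                  # fill it with one-slot blocks
    allocate(pool, 1, n)
for n in range(0, 16, 2):            # free every other block
    free(pool, n)

free_slots = pool.count(None)        # half the pool is free...
biggest = largest_free_run(pool)     # ...but only as one-slot holes
result = allocate(pool, 2, "big")    # so a two-slot request fails
print(free_slots, biggest, result)
```

Half of the pool is free, yet the two-slot request cannot be satisfied
because no two free slots are adjacent.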

Running 1000 identical units in parallel in a reliability test for one
week or a month would indicate no failures.

Paul

From:Joseph Gwinn
Subject:Re: MTBF for Windows and Linux
Date:Fri, 26 Nov 2004 14:58:25 GMT
In article ,
Paul Keinanen wrote:

> On Fri, 26 Nov 2004 04:36:32 GMT, Joseph Gwinn
> wrote:
>
> >The whole idea of computing MTBFs for software is suspect: software is
> >not hardware. Software doesn't fail because long use wears something
> >out, it fails because a new situation or sequence of events caused a new
> >path in the code to be taken, and this newly executed code was somehow
> >flawed.
>
> Software can "wear" out after long use. A typical example would be
> dynamic memory pool fragmentation in a badly designed system, which
> might cause a failure after a year or two, when there is not a single
> contiguous block with sufficient size available for allocation.

Worn-out hardware is not fixed by a reboot, so the analogy seems
strained.

I've never seen a system with bad memory management take a year to fail,
unless what took the year was either the wait for a sufficiently heavy
system load, or the wait for sufficiently bad luck. But, the heavier
the load, the worse the luck.

Bad design of the memory manager code is the root cause here, and the
"new situation" is mempool exhaustion or fragmentation.


> Running 1000 identical units in parallel in a reliability test for one
> week or a month would indicate no failures.

True. This is in exact contrast to hardware, where one estimates the
failure rate by running many units in parallel, often under some kind of
environmental stress, and counts the bodies. With modern hardware, the
failure rates are too low for any other approach to be practical.


Joe Gwinn
From:Marc Girod
Subject:Re: MTBF for Windows and Linux
Date:Fri, 26 Nov 2004 08:26:58 GMT
>>>>> "PK" == Paul Keinanen writes:

PK> Software can "wear" out after long use. A typical example would be
PK> dynamic memory pool fragmentation in a badly designed system,
PK> which might cause a failure after a year or two, when there is not
PK> a single contiguous block with sufficient size available for
PK> allocation.

Right. Nice try.

One may either consider this a somewhat far-fetched metaphor or, on the
contrary, see that it demonstrates a difference in granularity: this
is about "wearing" at the level of atoms in a crystal.
It means that if we ever end up building complex software (a few
orders of magnitude more complex than today's most complex systems),
this kind of effect will obviously generalize...

--
Marc Girod P.O. Box 323 Voice: +358-71 80 25581
Nokia BI 00045 NOKIA Group Mobile: +358-50 38 78415
Valimo 21 / B616 Finland Fax: +358-71 80 64474
From:Lukas Ruf
Subject:Re: MTBF for Windows and Linux
Date:26 Nov 2004 10:38:04 +0100
Joseph Gwinn wrote on [Fri, 26 Nov 2004 04:36:32 GMT]
> In article ,
> imm102@hotmail.com (Ian Mckay) wrote:
>
> > hi,
> >
> > I'm doing a project on the suitability (or lack thereof) of
> > Windows and Linux for use in critical systems. Try as I might, I
> > can't find any articles discussing this topic (only heated
> > discussions in the odd forum). I am particularly interested in
> > finding figures for the mean time between failures for these two
> > OSes, as well as common faults that are well known. If anyone can
> > point me in the right direction, that would be very helpful.
>
> The whole idea of computing MTBFs for software is suspect: software
> is not hardware. Software doesn't fail because long use wears
> something out, it fails because a new situation or sequence of
> events caused a new path in the code to be taken, and this newly
> executed code was somehow flawed.
>
> That said, Windows fails far more often and floridly than Linux,
> although neither is intended for high reliability applications.
>

Although I agree with Joseph's statement on MTBF I'd like to add my
point of view:

For high-availability systems it is also crucial that they do not
need to be taken off-line for any strange reason like simple
software updates.

That said: I had to update my Windows Server recently. It took me a
couple of hours because stupid Windows had to reboot after nearly
every patch. At the same time I run a set of Linux servers. They
are updated every morning. One 'top' that is currently running on
screen reports:

  top - 10:33:28 up 182 days, 38 min, .....
                    ^^^

I do not remember having had my Windows Server installation running
for that long without the need to reboot.

wbr,
Lukas
--
Lukas Ruf | Wanna know anything about raw IP?
From:Steve Jorgensen
Subject:Re: MTBF for Windows and Linux
Date:Fri, 26 Nov 2004 20:18:44 GMT
On 26 Nov 2004 10:38:04 +0100, Lukas Ruf wrote:

>Joseph Gwinn wrote on [Fri, 26 Nov 2004 04:36:32 GMT]
>> In article ,
>> imm102@hotmail.com (Ian Mckay) wrote:
>>
>> > hi,
>> >
>> > I'm doing a project on the suitability (or lack thereof) of
>> > Windows and Linux for use in critical systems. Try as I might, I
>> > can't find any articles discussing this topic (only heated
>> > discussions in the odd forum). I am particularly interested in
>> > finding figures for the mean time between failures for these two
>> > OSes, as well as common faults that are well known. If anyone can
>> > point me in the right direction, that would be very helpful.
>>
>> The whole idea of computing MTBFs for software is suspect: software
>> is not hardware. Software doesn't fail because long use wears
>> something out, it fails because a new situation or sequence of
>> events caused a new path in the code to be taken, and this newly
>> executed code was somehow flawed.
>>
>> That said, Windows fails far more often and floridly than Linux,
>> although neither is intended for high reliability applications.
>>
>
>Although I agree with Joseph's statement on MTBF I'd like to add my
>point of view:
>
>For high-availability systems it is also crucial that they do not
>need to be taken off-line for any strange reason like simple
>software updates.
>
>That said: I had to update my Windows Server recently. It took me a
>couple of hours because stupid Windows had to reboot after nearly
>every patch. At the same time I run a set of Linux servers. They
>are updated every morning. One 'top' that is currently running on
>screen reports:
>
>  top - 10:33:28 up 182 days, 38 min, .....
>                    ^^^
>
>I do not remember having had my Windows Server installation running
>for that long without the need to reboot.

As a mitigating factor on this point, Microsoft has a tool available to allow
a set of updates, each ostensibly requiring a reboot, to be applied together
using a single reboot. I don't recall the name of the tool just now, but a
search of the MS site should bring it up.
From:Juhan Leemet
Subject:Re: MTBF for Windows and Linux
Date:Sat, 27 Nov 2004 18:05:39 -0300
On Fri, 26 Nov 2004 04:36:32 +0000, Joseph Gwinn wrote:
> In article ,
> imm102@hotmail.com (Ian Mckay) wrote:
>
>> hi,
>>
>> I'm doing a project on the suitability (or lack thereof) of Windows
>> and Linux for use in critical systems. Try as I might, I can't find any
>> articles discussing this topic (only heated discussions in the odd
>> forum). I am particularly interested in finding figures for the mean
>> time between failures for these two OSes, as well as common faults that
>> are well known. If anyone can point me in the right direction, that
>> would be very helpful.
>
> The whole idea of computing MTBFs for software is suspect: software is
> not hardware. Software doesn't fail because long use wears something
> out, it fails because a new situation or sequence of events caused a new
> path in the code to be taken, and this newly executed code was somehow
> flawed.

Yes, you are right, for static (unchanging) software, which is rare.
Usually, software succumbs to what a friend of mine calls "bit-rot", the
introduction of bugs through tinkering, either fixes or customization. I
have unfortunately worked on software whose basic architecture was so
strained by the extensions and customizations that it virtually collapsed
of its own weight (of internal contradictions). In that case, a redesign
and reimplementation should have been done to extend original functions.

The software implementation may also "drift" into failure modes as a
result of a change in the environment causing the software to begin
encountering buried bugs, never before encountered in (bad?) testing or
previous operation. This can also happen to firmware.

--
Juhan Leemet
Logicognosis, Inc.
From:smilemac
Subject:Re: MTBF for Windows and Linux
Date:Sun, 28 Nov 2004 00:35:25 +0800

"Joseph Gwinn" wrote in message
news:JoeGwinn-648999.23363225112004@netnews.comcast.net...
> In article ,
> imm102@hotmail.com (Ian Mckay) wrote:
>
> > hi,
> >
> > I'm doing a project on the suitability (or lack thereof) of Windows
> > and Linux for use in critical systems. Try as I might, I can't find any
> > articles discussing this topic (only heated discussions in the odd
> > forum). I am particularly interested in finding figures for the mean
> > time between failures for these two OSes, as well as common faults that
> > are well known. If anyone can point me in the right direction, that
> > would be very helpful.
>
> The whole idea of computing MTBFs for software is suspect: software is
> not hardware. Software doesn't fail because long use wears something
> out, it fails because a new situation or sequence of events caused a new
> path in the code to be taken, and this newly executed code was somehow
> flawed.
>

Software also wears things out: accumulated state, memory leaks caused
by bad coding, exceeded disk size limits, and any change in the
external environment. All of these tend to cause the software to fail.

smilemac
From:David Lightstone
Subject:Re: MTBF for Windows and Linux
Date:Sat, 27 Nov 2004 18:18:05 GMT

"smilemac" wrote in message
news:coaa09$5881@imsp212.netvigator.com...
>
> "Joseph Gwinn" wrote in message
> news:JoeGwinn-648999.23363225112004@netnews.comcast.net...
> > In article ,
> > imm102@hotmail.com (Ian Mckay) wrote:
> >
> > > hi,
> > >
> > > I'm doing a project on the suitability (or lack thereof) of Windows
> > > and Linux for use in critical systems. Try as I might, I can't find any
> > > articles discussing this topic (only heated discussions in the odd
> > > forum). I am particularly interested in finding figures for the mean
> > > time between failures for these two OSes, as well as common faults that
> > > are well known. If anyone can point me in the right direction, that
> > > would be very helpful.
> >
> > The whole idea of computing MTBFs for software is suspect: software is
> > not hardware. Software doesn't fail because long use wears something
> > out, it fails because a new situation or sequence of events caused a new
> > path in the code to be taken, and this newly executed code was somehow
> > flawed.
> >
>
> Software also wears things out: accumulated state, memory leaks caused
> by bad coding, exceeded disk size limits, and any change in the
> external environment. All of these tend to cause the software to fail.

When did a poor algorithm become a degrading component?

The manifestations certainly are similar to those of something wearing out,
no doubt about it. Let's not confuse a flaw with its manifestation. The
software was always "broken"; its failure was just not noticed before!

>
> smilemac
>
>
>
From:leslie
Subject:Re: MTBF for Windows and Linux
Date:Sat, 27 Nov 2004 21:43:37 GMT
Ian Mckay (imm102@hotmail.com) wrote:
: hi,
:
: I'm doing a project on the suitability (or lack thereof) of Windows
: and Linux for use in critical systems. Try as I might, I can't find any
: articles discussing this topic (only heated discussions in the odd
: forum). I am particularly interested in finding figures for the mean
: time between failures for these two OSes, as well as common faults that
: are well known. If anyone can point me in the right direction, that
: would be very helpful.
:

Here's an article against using Windows for mission-critical applications
such as pipeline control:

http://www.seattleweekly.com/features/0002/tech-scigliano.shtml
Seattle Weekly - tech: A worm in the works

"Could disasters loom as more and more pipeline operators switch
to Windows NT?

[snip]

But not one that's likely to pan out. That's because, according to
Olympic IT manager Dan Swathman, the pipeline's SCADA system does
not run on Windows, which might have made it vulnerable to the
e-mail-borne Worm.Explore.zip: "Our current SCADA system is on VMS,
from Digital Equipment [now Compaq], running on an Alpha Chip. GMI
makes the system. A couple computers are dedicated to it." And,
Swathman adds, these computers are not connected to the Windows-based
e-mail and office systems through which the worm could have gotten in.

That's reassuring--but the broader picture may not be. Across this
country, and as far away as China, pipeline systems are switching from
VMS and Unix systems to versatile, ubiquitous, user-friendly Windows
NT. "The scada market has been moving towards Windows NT as the
dominant operating system," Oil & Gas Journal reported (3/24/97)..."


The systems in operation at the time of the explosion were still VAXes,
not Alphas:

http://www.ntsb.gov/publictn/2002/PAR0202.htm
NTSB Abstract PAR-02/02

http://www.ntsb.gov/publictn/2002/PAR0202.pdf

"The Olympic Pipeline SCADA system consisted of Teledyne Brown
Engineering 20 SCADA Vector software, version 3.6.1., running on
two Digital Equipment Corporation (DEC) VAX Model 4000-300 computers
with VMS operating system Version 7.1..."

The VAXes were replaced by dual-CPU Alphas in 2001. I was a member of
the project team that performed the upgrade.

The U.S. Navy's Smart Ship program uses Windows-based systems:

http://www.gcn.com/vol19_no27/dod/2868-1.html
Navy carrier to run Win 2000

http://www.cnn.com/2000/TECH/computing/08/08/carrier.windows.idg/
CNN.com - Technology -
Futuristic Windows version to control aircraft carrier - August 8, 2000

"...The CVN-77 win is a key triumph for Microsoft in the defense
industry, because it sets the stage for the company's participation in
the Navy's long-term, three-phase future carrier design program. "This
is not just the one ship. It will decide the architectures for the
next three ships," Roach said. Microsoft's agreement also includes a
back-fit program for seven other carriers, bringing the total to 10."

At least some PLCs are also used, per this description of the "Smart Ship"
system...

http://www.e-d-i.com/products_control.html
L-3 Communications SPD Technologies - Control Systems

VMS systems support multi-site clusters, which can tolerate the loss of
an entire datacenter:

http://h71000.www7.hp.com/openvms/brochures/commerzbank/
hp Alphaserver technology helps Commerzbank tolerate disaster
on September 11

"testing disaster tolerance

While most large organizations today have plans for Disaster Tolerance
(DT), few have to put them to the test. The North American
headquarters of Commerzbank, located less than 100 yards from the
World Trade Center in New York City, put its DT plan into action on
September 11, 2001. Because Commerzbank relies on OpenVMS wide-area
clustering, volume shadowing and AlphaServer GS160 systems from HP,
the bank was able to function on September 11 because its critical
banking applications continued to run at the primary site and were
available from the bank's remote site..."


http://www.enterpriseitplanet.com/storage/features/article.php/3396941
OpenVMS Gets a Case of the DT's

OpenVMS Gets a Case of the DT's
August 18, 2004
By Drew Robb

It's not uncommon for alcoholics to suffer from the DT's (Delirium
Tremens -- severe alcohol withdrawal characterized by agitation,
violence, anxiety, insomnia, muscle cramps, tremor, delusion,
hallucinations, and fever). But whoever heard of an operating system
(OS) suffering from the malady? Well, the OpenVMS OS apparently has an
acute case of the DT's. Though in this instance, we are talking about
disaster recovery.

Disaster Tolerance (DT) is a concept that extends beyond disaster
recovery (DR). Traditional DR focuses on minimizing downtime then
picking up the pieces and reconstructing any lost data afterwards..."


http://makeashorterlink.com/?V48121DA9
OpenVMS Survives and Thrives - Computerworld

There have been some recent articles on VMS:

The original URL, wrapped to 2 lines:

http://www.computerworld.com/softwaretopics/software/story/
0,10801,97032,00.html
OpenVMS Survives and Thrives - Computerworld

OpenVMS Survives and Thrives

The 'legacy' operating system maintains a substantial base in large
organizations, and there's promise of new interest as it moves to
64-bit Itanium.

[snip]

According to David Freund, an analyst at IT research firm Illuminata
Inc. in Nashua, N.H., several financial services businesses in the
towers and numerous others in the immediate vicinity had OpenVMS
disaster-tolerant clusters with backup sites outside the area. Every
one of them had their operations running just moments after the
catastrophe, says Freund.

Following that awful day, OpenVMS seems to have gained new prominence.
In some IT circles, it's now regarded as the creme de la creme in
disaster recovery and high availability, according to users and
analysts.

"OpenVMS uptimes can be measured in years," says Stenz. "This is
certainly preferable to a culture of rebooting and disruption that
plagues other platforms due to viruses, Trojans, denial-of-service
attacks and endless patching of systems..."


HTH,

--Jerry Leslie
Note: leslie@jrlvax.houston.rr.com is invalid for email
"VMS: Uptime measured with a calendar instead of a stopwatch"
From:Paul E. Bennett
Subject:Re: MTBF for Windows and Linux
Date:Sun, 28 Nov 2004 19:55:25 +0000
leslie wrote:

> Here's an article against using Windows for mission-critical applications
> such as pipeline control:
>
> http://www.seattleweekly.com/features/0002/tech-scigliano.shtml
> Seattle Weekly - tech: A worm in the works
>
> "Could disasters loom as more and more pipeline operators switch
> to Windows NT?

Anyone considering MS Windows based products for any mission critical
application should take a look at the licence conditions for starters. If
you do use it for such applications you are standing alone and may be seen
as liable for the consequences of any unfavourable outcome. All MS terms
and conditions actually forbid use of their products in mission critical
applications without consultation with MS first.

> The systems in operation at the time of the explosion were still VAXes,
> not Alphas:

I suppose those who are responsible for critical systems on other OSes
might like to comment on what the situation is with their set-up. As I only
use such systems at the HMI level, not the "Safety Critical" level, there
is little risk involved even if the whole HMI failed.

As I indicated in my last posting on this topic, utilising a multi-layer
approach to critical systems will ensure that the critical parts are
concentrated in sub-systems that you can prove will operate safely and
correctly under all non-optimal situations.

--
********************************************************************
Paul E. Bennett ....................
Forth based HIDECS Consultancy .....
Mob: +44 (0)7811-639972 .........NOW AVAILABLE:- HIDECS COURSE......
Tel: +44 (0)1235-811095 .... see http://www.feabhas.com for details.
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************
From:Nick Landsberg
Subject:Re: MTBF for Windows and Linux
Date:Thu, 25 Nov 2004 20:54:56 GMT
Ian Mckay wrote:

> hi,
>
> I'm doing a project on the suitability (or lack thereof) of Windows
> and Linux for use in critical systems. Try as I might, I can't find any
> articles discussing this topic (only heated discussions in the odd
> forum). I am particularly interested in finding figures for the mean
> time between failures for these two OSes, as well as common faults that
> are well known. If anyone can point me in the right direction, that
> would be very helpful.

Google for either "Carrier Grade Linux" or "High Availability
Linux" or "real time Linux".

I doubt that anyone has published MTBF or MTTR, however.
The Carrier Grade Linux effort is too new to have many
results to report, and do you really
think MS would publish data of that kind about Windows?

If you do get any data, just be skeptical and look
at it with a jaundiced eye. Are 12 one minute outages
a year better or worse than one 12 minute outage a year?

The answer is "it probably depends on the application."
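
A rough sketch of why the answer depends on the application: both outage
patterns give exactly the same availability, so that single figure cannot
tell them apart, while the time between failures and the longest outage
differ a great deal. The function names are illustrative only, and a
365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def availability(outage_minutes):
    """Fraction of the year the system was up, given a list of outage lengths."""
    return 1 - sum(outage_minutes) / MINUTES_PER_YEAR


def mean_time_between_failures_days(outage_minutes):
    """Average uptime per failure, in days."""
    uptime = MINUTES_PER_YEAR - sum(outage_minutes)
    return uptime / len(outage_minutes) / (60 * 24)


many_short = [1] * 12   # twelve one-minute outages
one_long = [12]         # one twelve-minute outage

for name, outages in (("12 x 1 min", many_short), ("1 x 12 min", one_long)):
    print(name,
          f"availability={availability(outages):.6f}",
          f"MTBF={mean_time_between_failures_days(outages):.1f} days",
          f"longest outage={max(outages)} min")
```

An application that can ride through a one-minute gap may prefer the first
pattern; one that loses all its state on every failure may prefer the second.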

NPL

--
"It is impossible to make anything foolproof
because fools are so ingenious"
- A. Bloch
From:smilemac
Subject:Re: MTBF for Windows and Linux
Date:Sun, 28 Nov 2004 00:50:00 +0800
"Nick Landsberg" wrote in message
news:AErpd.974448$Gx4.550056@bgtnsc04-news.ops.worldnet.att.net...
> Ian Mckay wrote:
>
> > hi,
> >
> > I'm doing a project on the suitability (or lack thereof) of Windows
> > and Linux for use in critical systems. Try as I might, I can't find any
> > articles discussing this topic (only heated discussions in the odd
> > forum). I am particularly interested in finding figures for the mean
> > time between failures for these two OSes, as well as common faults that
> > are well known. If anyone can point me in the right direction, that
> > would be very helpful.
>
> Google for either "Carrier Grade Linux" or "High Availability
> Linux" or "real time Linux".
>
> I doubt that anyone has published MTBF or MTTR, however.
> The Carrier Grade Linux effort is too new to have many
> results to report, and do you really
> think MS would publish data of that kind about Windows?
>
> If you do get any data, just be skeptical and look
> at it with a jaundiced eye. Are 12 one minute outages
> a year better or worse than one 12 minute outage a year?
>
> The answer is "it probably depends on the application."
>
> NPL
>
> --
> "It is impossible to make anything foolproof
> because fools are so ingenious"
> - A. Bloch

There is no point in working out the difference in MTBF figures between
Linux and M$. No standalone OS is reliable enough for a "critical
system". Reliability must be built on a hot-standby-style architecture.

smilemac
From:Nick Landsberg
Subject:Re: MTBF for Windows and Linux
Date:Sat, 27 Nov 2004 20:19:09 GMT
smilemac wrote:
> "Nick Landsberg" wrote in message
> news:AErpd.974448$Gx4.550056@bgtnsc04-news.ops.worldnet.att.net...
>
>>Ian Mckay wrote:
>>
>>
>>>hi,
>>>
>>>I'm doing a project on the suitability (or lack thereof) of Windows
>>>and Linux for use in critical systems. Try as I might, I can't find any
>>>articles discussing this topic (only heated discussions in the odd
>>>forum). I am particularly interested in finding figures for the mean
>>>time between failures for these two OSes, as well as common faults that
>>>are well known. If anyone can point me in the right direction, that
>>>would be very helpful.
>>
>>Google for either "Carrier Grade Linux" or "High Availability
>>Linux" or "real time Linux".
>>
>>I doubt that anyone has published MTBF or MTTR, however.
>>The Carrier Grade Linux effort is too new to have many
>>results to report, and do you really
>>think MS would publish data of that kind about Windows?
>>
>>If you do get any data, just be skeptical and look
>>at it with a jaundiced eye. Are 12 one minute outages
>>a year better or worse than one 12 minute outage a year?
>>
>>The answer is "it probably depends on the application."
>>
>>NPL
>>
>>--
>>"It is impossible to make anything foolproof
>>because fools are so ingenious"
>> - A. Bloch
>
>
> There is no point in working out the difference in MTBF figures between
> Linux and M$. No standalone OS is reliable enough for a "critical
> system". Reliability must be built on a hot-standby-style architecture.
>
> smilemac
>
>

Agreed.

Considering that commercially available server hardware is
usually on the order of 3-9's reliability (about 8-9 hours
a year down time), you _already_ need hot standby to get
anywhere near 5-9's. Assuming of course, that you have
software assist to detect the failure and switch
to the hot standby. I, personally, prefer a "dual-hot"
configuration where each side is taking half the load
under normal conditions but can pick up the whole
load if its "mate" fails. There are several reasons
which I won't get into here.
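
The arithmetic behind the "nines" can be sketched as follows. The redundancy
formula is the textbook idealization and rests on two assumptions that do not
come from this thread: the two units fail independently, and failover is
instant and always works. Real detection and switchover time eats into the
result.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760; ignores leap years


def downtime_hours_per_year(availability):
    """Expected yearly downtime for a given steady-state availability."""
    return (1 - availability) * HOURS_PER_YEAR


def redundant_availability(single, units=2):
    """Availability of `units` independent systems where any one suffices.

    Assumes independent failures and instant, perfect failover.
    """
    return 1 - (1 - single) ** units


three_nines = 0.999
five_nines = 0.99999
pair = redundant_availability(three_nines)

print(f"one 3-nines server : {downtime_hours_per_year(three_nines):.2f} h/year")
print(f"5-nines target     : {downtime_hours_per_year(five_nines) * 60:.1f} min/year")
print(f"hot-standby pair   : {downtime_hours_per_year(pair) * 3600:.0f} s/year")
```

On paper the pair exceeds five nines, which is why a single three-nines
server cannot get there alone but a redundant pair can, provided the
failure detection and switchover actually work.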

In addition, studies in telephone systems some years
ago indicated that only about 20% of system failures
were due to hardware. There were 40% due to software
and another 40% due to pilot error. How this relates
to this day and age is debatable since telephone
systems were built using home-grown hardened hardware
back in those days.

One key parameter here is the restoration time.
While the first system is down, you are "at risk"
of the backup system taking a fault (HW, SW, other).
The components of restoration time include:
 - Fault detection (how long before someone realizes
   that it's down?)
 - HW replacement (if it was a hardware fault).
   Add travel time if you need to go get it out
   of some warehouse somewhere or have it shipped.
 - OS reboot
 - Application startup. (The OS being up and running
   is not enough. If this system is dedicated to a
   certain set of applications, it isn't fully functional
   until all the applications are up and running.)
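
The "at risk" window can be put into rough numbers. Everything in this sketch
is an assumption for illustration: the component durations, the one-month
MTBF of the surviving unit, and above all the constant failure rate
(exponential model), which is borrowed from hardware practice and which
software failures may not follow.

```python
import math

# Hypothetical restoration-time components, in hours.
restoration = {
    "fault detection": 0.25,
    "hardware replacement": 4.0,
    "OS reboot": 0.1,
    "application startup": 0.5,
}
backup_mtbf_hours = 30 * 24  # assumed MTBF of the surviving unit: one month


def exposure_probability(window_hours, mtbf_hours):
    """Chance the backup fails during the window, assuming a constant failure rate."""
    return 1 - math.exp(-window_hours / mtbf_hours)


window = sum(restoration.values())
risk = exposure_probability(window, backup_mtbf_hours)
print(f"restoration time: {window:.2f} h, risk of double failure: {risk:.2%}")
```

Shortening any component of the restoration time shrinks the window and
therefore the chance of losing both sides at once.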

The OS can only try to address its own failure rates
and try to address the reboot time.

Still and all, I would prefer an OS with an
MTBF of 6 months to one with an MTBF of 1 month.
(Not naming any names.)
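
To see what the two MTBF figures mean for a single, non-redundant machine,
the usual steady-state formula, availability = MTBF / (MTBF + MTTR), can be
applied. The 15-minute repair time and the 730-hour month are assumptions
chosen only for the example.

```python
HOURS_PER_MONTH = 730  # average month: 8,760 h / 12


def steady_state_availability(mtbf_hours, mttr_hours):
    """Classic availability formula: uptime / (uptime + repair time)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


mttr_hours = 0.25  # hypothetical: 15 minutes to reboot and restart applications

for months in (1, 6):
    a = steady_state_availability(months * HOURS_PER_MONTH, mttr_hours)
    downtime_min = (1 - a) * 8760 * 60
    print(f"MTBF {months} month(s): availability {a:.6f}, "
          f"about {downtime_min:.0f} min down per year")
```

With the same repair time, the six-fold longer MTBF cuts the expected yearly
downtime by roughly a factor of six.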

NPL

--
"It is impossible to make anything foolproof
because fools are so ingenious"
- A. Bloch
From:Joseph Gwinn
Subject:Re: MTBF for Windows and Linux
Date:Sat, 27 Nov 2004 23:07:37 GMT
In article <1j5qd.65611$7i4.28989@bgtnsc05-news.ops.worldnet.att.net>,
Nick Landsberg wrote:

> smilemac wrote:
> > "Nick Landsberg" wrote in message
> > news:AErpd.974448$Gx4.550056@bgtnsc04-news.ops.worldnet.att.net...
> >
> >>Ian Mckay wrote:
> >>
> >>
> >>>hi,
> >>>
> >>>I'm doing a project on the suitability (or lack thereof) of Windows
> >>>and Linux for use in critical systems. Try as I might, I can't find any
> >>>articles discussing this topic (only heated discussions in the odd
> >>>forum). I am particularly interested in finding figures for the mean
> >>>time between failures for these two OSes, as well as common faults that
> >>>are well known. If anyone can point me in the right direction, that
> >>>would be very helpful.
> >>
> >>Google for either "Carrier Grade Linux" or "High Availability
> >>Linux" or "real time Linux".
> >>
> >>I doubt that anyone has published MTBF or MTTR, however.
> >>The Carrier Grade Linux effort is too new to have many
> >>results to report, and do you really
> >>think MS would publish data of that kind about Windows?
> >>
> >>If you do get any data, just be skeptical and look
> >>at it with a jaundiced eye. Are 12 one minute outages
> >>a year better or worse than one 12 minute outage a year?
> >>
> >>The answer is "it probably depends on the application."
> >>
> >>NPL
> >>
> >>--
> >>"It is impossible to make anything foolproof
> >>because fools are so ingenious"
> >> - A. Bloch
> >
> >
> > There is no point in working out the difference in MTBF figures between
> > Linux and M$. No standalone OS is reliable enough for a "critical
> > system". Reliability must be built on a hot-standby-style architecture.
> >
> > smilemac
> >
> >
>
> Agreed.
>
> [snip]
>
> In addition, studies in telephone systems some years
> ago indicated that only about 20% of system failures
> were due to hardware. There were 40% due to software
> and another 40% due to pilot error. How this relates
> to this day and age is debatable since telephone
> systems were built using home-grown hardened hardware
> back in those days.

I've heard of these studies too. Do you (or anybody else) have the
references? I'd like to get copies.


> [snip]
>
> Still and all, I would prefer an OS with an
> MTBF of 6 months to one with an MTBF of 1 month.
> (Not naming any names.)

Again, I've always used one month, as this seems to be the goal for
desktop operating systems. The usual form of the question I ask is the
following: "Is it OK if crashes once a month, requiring a
hard reset?" If the answer is yes, then it's plausible to use a desktop
OS for the purpose. If the answer is no, then one must use a real RTOS,
perhaps even one suited for full safety certification, depending on the
nature of .

By the way, do you know the reference for the one month and six months
figures?

Joe Gwinn
From:Nick Landsberg
Subject:Re: MTBF for Windows and Linux
Date:Sun, 28 Nov 2004 02:25:45 GMT
Joseph Gwinn wrote:

> In article <1j5qd.65611$7i4.28989@bgtnsc05-news.ops.worldnet.att.net>,
> Nick Landsberg wrote:
>
>
>>smilemac wrote:
>>
>>>"Nick Landsberg"
>>>??????:AErpd.974448$Gx4.550056@bgtnsc04-news.ops.worldnet.att.net...
>>>
>>>
>>>>Ian Mckay wrote:
>>>>
>>>>
>>>>
>>>>>hi,
>>>>>
>>>>>Im doing a project on the suitability (or lack there-of) of Windows
>>>>>and Linux for use in critical systems. Try as i might i can't find any
>>>>>articles discussing this topic (Only heated discussions in the odd
>>>>>forum). I am particularly interested in finding figures for the mean
>>>>>time between failure for these two OS's as well as common faults that
>>>>>are well known. If anyone can point me in the right direction that
>>>>>would be very helpful.
>>>>
>>>>Google for either "Carrier Grade Linux" or "High Availability
>>>>Linux" or "real time Linux".
>>>>
>>>>I doubt that anyone has published MTBF or MTTR, however.
>>>>The Carrier Grade Linux effort is too new to have many
>>>>results to report, and do you really
>>>>think MS would publish data of that kind about Windows?
>>>>
>>>>If you do get any data, just be skeptical and look
>>>>at it with a jaundiced eye. Are 12 one minute outages
>>>>a year better or worse than one 12 minute outage a year?
>>>>
>>>>The answer is "it probably depends on the application."
>>>>
>>>>NPL
>>>>
>>>>--
>>>>"It is impossible to make anything foolproof
>>>>because fools are so ingenious"
>>>> - A. Bloch
>>>
>>>
>>>There's no any meanings to find out the figure difference of MTBF between
>>>Linux and M$. No one standalone OS is reliable enough for a "critical
>>>system". Reliabilty must be built on the hot standby style architecture.
>>>
>>>smilemac
>>>
>>>
>>
>>Agreed.
>>
>>[snip]
>>
>>In addition, studies in telephone systems some years
>>ago indicated that only about 20% of system failures
>>were due to hardware. There were 40% due to software
>>and another 40% due to pilot error. How this relates
>>to this day and age is debatable since telephone
>>systems were built using home-grown hardened hardware
>>back in those days.
>
>
> I've heard of these studies too. Do you (or anybody else) have the
> references? I'd like to get copies.

The studies I was referring to were done by AT&T for the
4ESS and 5ESS switches. As far as I know, they were not published
outside of AT&T. (And I do not have access to them any
more.)

>
>
>
>>[snip]
>>
>>Still and all, I would prefer an OS with an
>>MTBF of 6 months to one with an MTBF of 1 month.
>>(Not naming any names.)
>
>
> Again, I've always used one month, as this seems to be the goal for
> desktop operating systems. The usual form of the question I ask is the
> following: "Is it OK if crashes once a month, requiring a
> hard reset?" If the answer is yes, then it's plausible to use a desktop
> OS for the purpose. If the answer is no, then one must use a real RTOS,
> perhaps even one suited for full safety certification, depending on the
> nature of .
>
> By the way, do you know the reference for the one month and six months
> figures?

I was just throwing out what seemed like reasonable figures
for the classes of systems which you mentioned (desktop vs.
"other"). Consider them hearsay until proven otherwise.

I do not work in the "hard real-time world" but have
had chances to review some of their work (and made
good use of the learnings they had to share). I work in
a "soft real-time world." i.e. rather than 5-10 second
switchover, I get the luxury of 5-10 minutes.

For this we use Solaris and we have learned enough about
its idiosyncrasies that we've achieved 5-9's in an
"all active" (either mated pair or N+K) configuration (if we
discount planned down time for software upgrades).

The SUN literature claims a HW reliability of 3-9's
for their "hardened" servers (NEBS certified). We have not verified
this in enough field tests to be meaningful.

Since 3-9's is about 8 hours a year, I assumed that this
was 2 hardware outages of 4 hours restoration time
each, but I could be wrong.
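
As a rough check on the "nines" arithmetic, here is a minimal sketch that converts an availability of N nines into allowed downtime per year. The function name and the 8760-hour year are illustrative assumptions, not anything from the SUN literature.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def downtime_hours_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines.

    3 nines means 99.9% available, i.e. an unavailability of 10**-3.
    """
    unavailability = 10.0 ** (-nines)
    return HOURS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 5):
        hours = downtime_hours_per_year(n)
        print(f"{n} nines: {hours:.4f} h/year = {hours * 60:.2f} min/year")
```

Under these assumptions 3-9's comes to roughly 8.76 hours a year and 5-9's to a little over 5 minutes a year. The figure says nothing about how that downtime is split into individual outages.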

(Note: This is not meant as a testimonial for SUN. It's
just what our experience has been with them. We have
other groups within the company which swear by HP, and still
others which insist on using VXworks (sp?) for any real-time
stuff. YMMV.)

NPL

>
> Joe Gwinn


--
"It is impossible to make anything foolproof
because fools are so ingenious"
- A. Bloch
From:Joseph Gwinn
Subject:Re: MTBF for Windows and Linux
Date:Sun, 28 Nov 2004 18:38:08 GMT
In article ,
Nick Landsberg wrote:

> Joseph Gwinn wrote:
>
> > In article <1j5qd.65611$7i4.28989@bgtnsc05-news.ops.worldnet.att.net>,
> > Nick Landsberg wrote:
> >
> >
> >>smilemac wrote:
> >>
> >>>"Nick Landsberg"
> >>>??????:AErpd.974448$Gx4.550056@bgtnsc04-news.ops.worldnet.att.net...
> >>>
> >>>[snip]
> >>[snip]
> >>
> >>In addition, studies in telephone systems some years
> >>ago indicated that only about 20% of system failures
> >>were due to hardware. There were 40% due to software
> >>and another 40% due to pilot error. How this relates
> >>to this day and age is debatable since telephone
> >>systems were built using home-grown hardened hardware
> >>back in those days.
> >
> >
> > I've heard of these studies too. Do you (or anybody else) have the
> > references? I'd like to get copies.
>
> The studies I was referring to were done by AT&T for the
> 4ESS and 5ESS switches. As far as I know, they were not published
> outside of AT&T. (And I do not have access to them any
> more.)

Ahh. I vaguely recall some articles in the Bell System Technical
Journal on just this subject. That's where it would be documented
publicly, if anywhere.


> >>[snip]
> >>
> >>Still and all, I would prefer an OS with an
> >>MTBF of 6 months to one with an MTBF of 1 month.
> >>(Not naming any names.)
> >
> >
> > Again, I've always used one month, as this seems to be the goal for
> > desktop operating systems. The usual form of the question I ask is the
> > following: "Is it OK if crashes once a month, requiring a
> > hard reset?" If the answer is yes, then it's plausible to use a desktop
> > OS for the purpose. If the answer is no, then one must use a real RTOS,
> > perhaps even one suited for full safety certification, depending on the
> > nature of .
> >
> > By the way, do you know the reference for the one month and six months
> > figures?
>
> I was just throwing out what seemed like reasonable figures
> for the classes of systems which you mentioned (desktop vs.
> "other"). Consider them hearsay until proven otherwise.

My one-month estimate comes from reading lots of articles in the
computer press where this seemed to be the shining ideal. I have also
seen reports saying that this or that shop had made measurements and
got figures of just that kind, but I was never able to get the
details.


> I do not work in the "hard real-time world" but have
> had chances to review some of their work (and made
> good use of the learnings they had to share). I work in
> a "soft real-time world." i.e. rather than 5-10 second
> switchover, I get the luxury of 5-10 minutes.

Who is "their"?

The degree of realtime performance and the switchover time are
unrelated. In other words, it does not follow that the more stringent
the realtime the faster the switchover.

One example would be the firmware in an ordinary disk drive. This
firmware is stringently realtime, being used to implement servo control
of disk and head motion. However, if the disk is used in an office
worker's desktop system, nobody will die if the disk breaks and has to
be replaced, even if this takes a day or two.

Despite the example, my experience has been that Windows crashes are by
far the more common cause of lost files than hardware failure, but
still...

Hope the office worker believes in making backups, though. Sooner or
later, he's going to need it.


> For this we use Solaris and we have learned enough about
> it's idiosyncracies that we've achieved 5-9's in a
> "all active" (either mated pair or N+K) configuration (if we
> discount planned down time for software upgrades).
>
> The SUN literature claims a HW reliability of 3-9's
> for their "hardened" servers (NEBS certified). We have not verified
> this in enough field tests to be meaningful.
>
> Since 3-9's is about 8 hours a year, I assumed that this
> was 2 hardware outages of 4 hours restoration time
> each, but I could be wrong.
>
> (Note: This is not meant as a testimonial for SUN. It's
> just what our experience has been with them. We have
> other groups within the company which swear by HP, and still
> others which insist on using VxWorks for any real-time
> stuff. YMMV.)

Yes. I've used or seen used Sun, SGI, HP, Stratus, Tandem, DEC VAX,
Data General, various makes and models of single-board computers, et al,
in the dual-redundant or triple-redundant (where a single string
suffices) topology. All worked, and the redundancy converted hardware
reliability into a maintenance cost issue only, rather than a system
availability issue.

The one triple-redundant system was to allow any two strings to be
online while the third string was in pieces on the floor, being repaired.

I've also used VxWorks in many systems, including a quad-redundant naval
system that had to survive massive battle damage (three cruise-missile
strikes).


The bottom line is that the system availability was achieved by use of
redundant hardware with application code designed to make full use of
the replicated hardware (not all of which is computers); the
availability of the overall system vastly exceeded that of the hardware,
the operating systems, and the application software.

That said, this only works if the operating system is both realtime
enough for the intended application, and reliable enough that the
overall system isn't killed by double faults. Use of redundant
computers helps greatly, but does have its limits, especially in the
face of common-mode faults.
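
The effect of redundancy on availability, and the limit set by common-mode faults, can be sketched with the textbook parallel-availability formula. This is only an illustration: it assumes the strings fail independently, and the `common_mode_unavailability` parameter is a hypothetical stand-in for faults that take out every string at once. None of the numbers come from a real system.

```python
def parallel_availability(unit_availability: float,
                          n_units: int,
                          common_mode_unavailability: float = 0.0) -> float:
    """Availability of n redundant strings where any one string suffices.

    Assumes independent failures between strings. The optional
    common-mode term (hypothetical) models faults that hit all
    strings together and so cannot be masked by redundancy.
    """
    # Probability that every string is down at the same time.
    all_down = (1.0 - unit_availability) ** n_units
    return (1.0 - common_mode_unavailability) * (1.0 - all_down)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "strings, no common mode:",
              parallel_availability(0.999, n))
    print("2 strings, common-mode unavailability 1e-4:",
          parallel_availability(0.999, 2, common_mode_unavailability=1e-4))
```

With independent failures, two 3-9's strings give about 6-9's. Once a common-mode term is added, it dominates the result, however many strings are installed.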

Then there is the issue of surviving software development and system
integration.

Joe Gwinn
From:Andrew Gabb
Subject:Re: MTBF for Windows and Linux
Date:Sat, 27 Nov 2004 22:54:48 +1030
Ian Mckay wrote:
> Im doing a project on the suitability (or lack there-of) of Windows
> and Linux for use in critical systems. Try as i might i can't find any
> articles discussing this topic (Only heated discussions in the odd
> forum). I am particularly interested in finding figures for the mean
> time between failure for these two OS's as well as common faults that
> are well known. If anyone can point me in the right direction that
> would be very helpful.

I know of sites where the MTBF of Windows (and the apps used) is a
month or more. True!

Of course, they don't do much and it's almost always the same thing.
Hmmm, sounds a bit like Linux.

Seriously though, the reliability of OSs depends on so many factors,
you'd have trouble categorising them, let alone coming up with MTBF
or other reliability/availability figures that anyone in the know
would actually believe.

The only ones I've seen have always come from sources who have a
serious conflict of interest anyway. Like when I was called in by a
fairly typical small business (with about 7 staff/PCs), where
someone was pissing in their ear about Linux, complete with a
comparison article. It would have been a disaster for them, but it
would have made the (potential) service provider quite wealthy. Go
figure.

FWIW I also know that for some orgs, Linux *can* be a cost effective
part of their system.

Andrew
--
Andrew Gabb
email: agabb@tpgi.com.au Adelaide, South Australia
phone: +61 8 8342-1021, fax: +61 8 8269-3280
-----
From:Paul E. Bennett
Subject:Re: MTBF for Windows and Linux
Date:Fri, 26 Nov 2004 19:46:18 +0000
Ian Mckay wrote:

> hi,
>
> Im doing a project on the suitability (or lack there-of) of Windows
> and Linux for use in critical systems. Try as i might i can't find any
> articles discussing this topic (Only heated discussions in the odd
> forum). I am particularly interested in finding figures for the mean
> time between failure for these two OS's as well as common faults that
> are well known. If anyone can point me in the right direction that
> would be very helpful.

Well, you have already had some interesting responses which have all
pointed out that such figures will be difficult to find. Individual
SysAdmins may have records enough but then you will have to query them
regarding versions. Remember, any change made to the software changes the
system, and such OSs may not be readily amenable to re-validation.

If you really are interested in bringing up a "critical system" application
and need to depend on the OS then you should obtain a full source listing
of the OS and go through it with a fine-toothed comb to understand what it
is doing and ensure that there are no hidden gotchas.

Your system architecture is going to be one of the most important aspects
of the design decisions to be made. Think of it as a number of layers from
sensors and actuators at the lowest level, through group controllers and up
to the HMI network. The sensor/actuator level is where you need the
strongest safety interlocking. The group controllers can issue the
elements below them with a permit to operate and be considered as an
intermediate safety interlocking. The HMI layer can then use any convenient
HMI software so long as you can provide reasonable security of operator
identification.

The lowest systems in the layer will probably not need an OS at all. The
intermediate layer (group controllers) may be a simple PLC-type system with
networking interfaces. The HMI can be any Unix/Linux/Windows networking
terminal and/or server system. At the upper layer criticality is eliminated
as an issue.

--
********************************************************************
Paul E. Bennett ....................
Forth based HIDECS Consultancy .....
Mob: +44 (0)7811-639972 .........NOW AVAILABLE:- HIDECS COURSE......
Tel: +44 (0)1235-811095 .... see http://www.feabhas.com for details.
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************
From:Bernhard Holzmayer
Subject:Re: MTBF for Windows and Linux
Date:Mon, 29 Nov 2004 11:30:22 +0100
Ian Mckay wrote:

> hi,
>
> Im doing a project on the suitability (or lack there-of) of Windows
> and Linux for use in critical systems. Try as i might i can't find any
> articles discussing this topic (Only heated discussions in the odd
> forum). I am particularly interested in finding figures for the mean
> time between failure for these two OS's as well as common faults that
> are well known. If anyone can point me in the right direction that
> would be very helpful.

Hi Ian,

if one of these OSs is to be used in a critical system, there are certainly
a lot of restrictions which apply and which influence the general situation
so much that you cannot take figures from the mainstream.

We use a WinNT/Win2000 based system in a production line. It's critical in
the sense that production must be stopped if the system fails (which is
expensive for the customer, and therefore undesired).
It runs in combination with a VxWorks machine which does the real-time
critical parts. Thus Windows is critical because it must run (and do some
work), but Windows only has to fulfil (very) soft realtime requirements.

That's what you should do: define your system as narrowly as possible, and
restrict it to the minimum scenario. It's the only chance to get comparable
figures.

Then find out which MTBF would be acceptable in your system.
And then design your application software, including the OS, so that it is
able to achieve a significantly better rate.

In my opinion, you cannot do better.

MTBF is only a statistical measure. It doesn't say anything about the real
time-to-next-failure. Even if Linux had a ten times better MTBF, who
guarantees that it doesn't fail first?
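
To put a number on this, here is a minimal sketch that assumes both systems have exponentially distributed times to failure (a constant failure rate, which is a modelling assumption and not a measured property of either OS). Under that assumption the probability that the system with the better MTBF still fails first has a simple closed form, which the simulation cross-checks. All names and figures are illustrative.

```python
import random


def prob_better_fails_first(mtbf_better: float, mtbf_worse: float) -> float:
    """P(better system fails before the worse one), exponential model.

    For independent exponential lifetimes with rates r1 and r2,
    P(T1 < T2) = r1 / (r1 + r2), where rate = 1 / MTBF.
    """
    rate_better = 1.0 / mtbf_better
    rate_worse = 1.0 / mtbf_worse
    return rate_better / (rate_better + rate_worse)


def simulate(mtbf_better: float, mtbf_worse: float,
             trials: int = 100_000, seed: int = 1) -> float:
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        t_better = rng.expovariate(1.0 / mtbf_better)
        t_worse = rng.expovariate(1.0 / mtbf_worse)
        if t_better < t_worse:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # MTBF in arbitrary units; only the 10:1 ratio matters here.
    print("closed form:", prob_better_fails_first(10.0, 1.0))
    print("simulated:  ", simulate(10.0, 1.0))
```

With a ten times better MTBF, the closed form gives 1/11, so in this model the "better" system is still the first to fail about 9% of the time.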

Therefore, instead of trying to find an MTBF or other rate, define scenarios
which could make your system fail.
Then calculate the cost and/or possibility of repairing the system.

If it means that the reset button needs to be pressed, and a day's work is
spoiled, the loss may not be too severe.
If a satellite is traveling through space and fails on its way to Saturn,
that might cost millions and years of work.

I wouldn't rely on the better MTBF in that case...

Probably you have to convince your partners/chiefs/... that it's not MTBF
which you need.
It's certainly a relation between cost and return during
planning/development/maintenance, and then there is the acceptance of the
customer/user who must buy your system.

When our system was designed, it was obvious that Windows was not the
best-performing system under discussion, not even the second...
But it was obvious too, that the customers wouldn't accept another system
(at that time).

Bernhard
   
