|
|
 | | From: | Ian Mckay | | Subject: | MTBF for Windows and Linux | | Date: | 25 Nov 2004 12:24:59 -0800 |
|
|
 | hi,
I'm doing a project on the suitability (or lack thereof) of Windows and Linux for use in critical systems. Try as I might, I can't find any articles discussing this topic (only heated discussions in the odd forum). I am particularly interested in finding figures for the mean time between failures for these two OSes, as well as common faults that are well known. If anyone can point me in the right direction, that would be very helpful.
|
|
 | | From: | Joseph Gwinn | | Subject: | Re: MTBF for Windows and Linux | | Date: | Fri, 26 Nov 2004 04:36:32 GMT |
|
|
 | In article , imm102@hotmail.com (Ian Mckay) wrote:
> hi, > > Im doing a project on the suitability (or lack there-of) of Windows > and Linux for use in critical systems. Try as i might i can't find any > articles discussing this topic (Only heated discussions in the odd > forum). I am particularly interested in finding figures for the mean > time between failure for these two OS's as well as common faults that > are well known. If anyone can point me in the right direction that > would be very helpful.
The whole idea of computing MTBFs for software is suspect: software is not hardware. Software doesn't fail because long use wears something out, it fails because a new situation or sequence of events caused a new path in the code to be taken, and this newly executed code was somehow flawed.
That said, Windows fails far more often and floridly than Linux, although neither is intended for high reliability applications.
Is realtime performance also needed? Will embedded hardware be controlled?
When you say "critical systems", what do you mean? What is the application, and what are the consequences of failure? Who dies?
Joe Gwinn
|
|
 | | From: | Paul Keinanen | | Subject: | Re: MTBF for Windows and Linux | | Date: | Fri, 26 Nov 2004 09:55:27 +0200 |
|
|
 | On Fri, 26 Nov 2004 04:36:32 GMT, Joseph Gwinn wrote:
>The whole idea of computing MTBFs for software is suspect: software is >not hardware. Software doesn't fail because long use wears something >out, it fails because a new situation or sequence of events caused a new >path in the code to be taken, and this newly executed code was somehow >flawed.
Software can "wear" out after long use. A typical example would be dynamic memory pool fragmentation in a badly designed system, which might cause a failure after a year or two, when there is not a single contiguous block with sufficient size available for allocation.
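The fragmentation failure described above can be shown with a toy model. The following is only a sketch under stated assumptions: a 16-slot pool, a first-fit policy, and the helper names `first_fit_alloc`, `free`, and `largest_free_run` are all hypothetical and not taken from any real allocator. It shows a request failing even though half the pool is free, because no contiguous run is large enough.

```python
def largest_free_run(pool):
    """Length of the longest run of free (None) slots in the pool."""
    best = run = 0
    for slot in pool:
        run = run + 1 if slot is None else 0
        best = max(best, run)
    return best


def first_fit_alloc(pool, size, tag):
    """Claim the first contiguous free run of `size` slots.

    Returns the start index, or None if no such run exists.
    """
    run = 0
    for i, slot in enumerate(pool):
        run = run + 1 if slot is None else 0
        if run == size:
            start = i - size + 1
            pool[start:i + 1] = [tag] * size
            return start
    return None


def free(pool, tag):
    """Release every slot owned by `tag`."""
    for i, slot in enumerate(pool):
        if slot == tag:
            pool[i] = None


pool = [None] * 16                 # hypothetical 16-slot memory pool
for tag in range(8):               # eight 2-slot blocks fill the pool
    first_fit_alloc(pool, 2, tag)
for tag in range(0, 8, 2):         # free every other block
    free(pool, tag)

total_free = pool.count(None)
print("free slots:", total_free)
print("largest contiguous run:", largest_free_run(pool))
print("4-slot request:", first_fit_alloc(pool, 4, "big"))
```

In this toy case the state that triggers the failure is reached in a few steps; in a real system the equivalent allocation pattern may take a long time, or a heavy load, to appear.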
Running 1000 identical units in parallel in a reliability test for one week or a month would indicate no failures.
Paul
|
|
 | | From: | Joseph Gwinn | | Subject: | Re: MTBF for Windows and Linux | | Date: | Fri, 26 Nov 2004 14:58:25 GMT |
|
|
 | In article , Paul Keinanen wrote:
> On Fri, 26 Nov 2004 04:36:32 GMT, Joseph Gwinn > wrote: > > >The whole idea of computing MTBFs for software is suspect: software is > >not hardware. Software doesn't fail because long use wears something > >out, it fails because a new situation or sequence of events caused a new > >path in the code to be taken, and this newly executed code was somehow > >flawed. > > Software can "wear" out after long use. A typical example would be > dynamic memory pool fragmentation in a badly designed system, which > might cause a failure after a year or two, when there is not a single > contiguous block with sufficient size available for allocation.
Worn-out hardware is not fixed by a reboot, so the analogy seems strained.
I've never seen a system with bad memory management take a year to fail, unless what took the year was either the wait for a sufficiently heavy system load, or the wait for sufficiently bad luck. But, the heavier the load, the worse the luck.
Bad design of the memory manager code is the root cause here, and the "new situation" is mempool exhaustion or fragmentation.
> Running 1000 identical units in parallel in a reliability test for one > week or a month would indicate no failures.
True. This is in exact contrast to hardware, where one estimates the failure rate by running many units in parallel, often under some kind of environmental stress, and counts the bodies. With modern hardware, the failure rates are too low for any other approach to be practical.
Joe Gwinn
|
|
 | | From: | Marc Girod | | Subject: | Re: MTBF for Windows and Linux | | Date: | Fri, 26 Nov 2004 08:26:58 GMT |
|
|
 | >>>>> "PK" == Paul Keinanen writes:
PK> Software can "wear" out after long use. A typical example would be PK> dynamic memory pool fragmentation in a badly designed system, PK> which might cause a failure after a year or two, when there is not PK> a single contiguous block with sufficient size available for PK> allocation.
Right. Nice try.
One may either consider this a somewhat far-fetched metaphor or, on the contrary, see that it demonstrates a difference in granularity: this is about "wearing" at the level of atoms in a crystal. It means that if we ever end up building complex software (a few orders of magnitude more complex than today's most complex systems), this kind of effect will obviously generalize...
-- Marc Girod P.O. Box 323 Voice: +358-71 80 25581 Nokia BI 00045 NOKIA Group Mobile: +358-50 38 78415 Valimo 21 / B616 Finland Fax: +358-71 80 64474
|
|
 | | From: | Lukas Ruf | | Subject: | Re: MTBF for Windows and Linux | | Date: | 26 Nov 2004 10:38:04 +0100 |
|
|
 | Joseph Gwinn wrote on [Fri, 26 Nov 2004 04:36:32 GMT] > In article , > imm102@hotmail.com (Ian Mckay) wrote: > > > hi, > > > > Im doing a project on the suitability (or lack there-of) of > > Windows and Linux for use in critical systems. Try as i might i > > can't find any articles discussing this topic (Only heated > > discussions in the odd forum). I am particularly interested in > > finding figures for the mean time between failure for these two > > OS's as well as common faults that are well known. If anyone can > > point me in the right direction that would be very helpful. > > The whole idea of computing MTBFs for software is suspect: software > is not hardware. Software doesn't fail because long use wears > something out, it fails because a new situation or sequence of > events caused a new path in the code to be taken, and this newly > executed code was somehow flawed. > > That said, Windows fails far more often and floridly than Linux, > although neither is intended for high reliability applications. >
Although I agree with Joseph's statement on MTBF I'd like to add my point of view:
For high-availability systems it is also crucial that they do not need to be taken off-line for any strange reason like simple software updates.
That said: I had to update my Windows Server recently. It took me a couple of hours because stupid Windows had to reboot after nearly every patch. At the same time I run a set of Linux servers. They are updated every morning. One 'top' that is currently running on screen reports:
top - 10:33:28 up 182 days, 38 min, .....
                  ^^^
I do not remember having had my Windows Server installation running for that long without the need to reboot.
wbr, Lukas -- Lukas Ruf
|
|
 | | From: | Steve Jorgensen | | Subject: | Re: MTBF for Windows and Linux | | Date: | Fri, 26 Nov 2004 20:18:44 GMT |
|
|
 | On 26 Nov 2004 10:38:04 +0100, Lukas Ruf wrote:
>Joseph Gwinn wrote on [Fri, 26 Nov 2004 04:36:32 GMT] >> In article , >> imm102@hotmail.com (Ian Mckay) wrote: >> >> > hi, >> > >> > Im doing a project on the suitability (or lack there-of) of >> > Windows and Linux for use in critical systems. Try as i might i >> > can't find any articles discussing this topic (Only heated >> > discussions in the odd forum). I am particularly interested in >> > finding figures for the mean time between failure for these two >> > OS's as well as common faults that are well known. If anyone can >> > point me in the right direction that would be very helpful. >> >> The whole idea of computing MTBFs for software is suspect: software >> is not hardware. Software doesn't fail because long use wears >> something out, it fails because a new situation or sequence of >> events caused a new path in the code to be taken, and this newly >> executed code was somehow flawed. >> >> That said, Windows fails far more often and floridly than Linux, >> although neither is intended for high reliability applications. >> > >Although I agree with Joseph's statement on MTBF I'd like to add my >point of view: > >For high-availability systems it is also crucial that they do not >need to be taken off-line for any strange reason like simple >software updates. > >That said: I had to update my Windows Server recently. I took me a >couple of hours because stupid Windows had to reboot after nearly >every patch. At the same time I run a set of Linux servers. They >are updated every morning. One 'top' that is currently running on >screen reports: > > top - 10:33:28 up 182 days, 38 min, ..... > ^^^ > >I do not remember having had my Windows Server installation running >for that long without the need to reboot.
As a mitigating factor on this point, Microsoft has a tool available that allows a set of updates, each ostensibly requiring a reboot, to be applied together using a single reboot. I don't recall the name of the tool just now, but a search of the MS site should bring it up.
|
|
 | | From: | Juhan Leemet | | Subject: | Re: MTBF for Windows and Linux | | Date: | Sat, 27 Nov 2004 18:05:39 -0300 |
|
|
 | On Fri, 26 Nov 2004 04:36:32 +0000, Joseph Gwinn wrote: > In article , > imm102@hotmail.com (Ian Mckay) wrote: > >> hi, >> >> Im doing a project on the suitability (or lack there-of) of Windows >> and Linux for use in critical systems. Try as i might i can't find any >> articles discussing this topic (Only heated discussions in the odd >> forum). I am particularly interested in finding figures for the mean >> time between failure for these two OS's as well as common faults that >> are well known. If anyone can point me in the right direction that >> would be very helpful. > > The whole idea of computing MTBFs for software is suspect: software is > not hardware. Software doesn't fail because long use wears something > out, it fails because a new situation or sequence of events caused a new > path in the code to be taken, and this newly executed code was somehow > flawed.
Yes, you are right, for static (unchanging) software, which is rare. Usually, software succumbs to what a friend of mine calls "bit-rot", the introduction of bugs through tinkering, either fixes or customization. I have unfortunately worked on software whose basic architecture was so strained by the extensions and customizations that it virtually collapsed of its own weight (of internal contradictions). In that case, a redesign and reimplementation should have been done to extend original functions.
The software implementation may also "drift" into failure modes as a result of a change in the environment causing the software to begin encountering buried bugs, never before encountered in (bad?) testing or previous operation. This can also happen to firmware.
-- Juhan Leemet Logicognosis, Inc.
|
|
 | | From: | smilemac | | Subject: | Re: MTBF for Windows and Linux | | Date: | Sun, 28 Nov 2004 00:35:25 +0800 |
|
|
 | "Joseph Gwinn" ??????:JoeGwinn-648999.23363225112004@netnews.comcast.net... > In article , > imm102@hotmail.com (Ian Mckay) wrote: > > > hi, > > > > Im doing a project on the suitability (or lack there-of) of Windows > > and Linux for use in critical systems. Try as i might i can't find any > > articles discussing this topic (Only heated discussions in the odd > > forum). I am particularly interested in finding figures for the mean > > time between failure for these two OS's as well as common faults that > > are well known. If anyone can point me in the right direction that > > would be very helpful. > > The whole idea of computing MTBFs for software is suspect: software is > not hardware. Software doesn't fail because long use wears something > out, it fails because a new situation or sequence of events caused a new > path in the code to be taken, and this newly executed code was somehow > flawed. >
Software can also wear out, in a sense: accumulated state, memory leaks caused by bad coding, disk capacity being exceeded, and any changes in the external environment. All of these tend to cause the software to fail.
smilemac
|
|
 | | From: | David Lightstone | | Subject: | Re: MTBF for Windows and Linux | | Date: | Sat, 27 Nov 2004 18:18:05 GMT |
|
|
 | "smilemac" wrote in message news:coaa09$5881@imsp212.netvigator.com... > > "Joseph Gwinn" > ??????:JoeGwinn-648999.23363225112004@netnews.comcast.net... > > In article , > > imm102@hotmail.com (Ian Mckay) wrote: > > > > > hi, > > > > > > Im doing a project on the suitability (or lack there-of) of Windows > > > and Linux for use in critical systems. Try as i might i can't find any > > > articles discussing this topic (Only heated discussions in the odd > > > forum). I am particularly interested in finding figures for the mean > > > time between failure for these two OS's as well as common faults that > > > are well known. If anyone can point me in the right direction that > > > would be very helpful. > > > > The whole idea of computing MTBFs for software is suspect: software is > > not hardware. Software doesn't fail because long use wears something > > out, it fails because a new situation or sequence of events caused a new > > path in the code to be taken, and this newly executed code was somehow > > flawed. > > > > Software will also wear something out, state accumlating, memory leak caused > from bad coding, hard disk size limitation exceeding, and any changes of > external environment. All these intend to cause the software fail.
When did a poor algorithm become a degrading component?
The manifestations certainly are similar to those of something wearing out, no doubt about it. Let's not confuse a flaw with its manifestation. The software was always "broken"; its failure was just not noticed before!
|
|
 | | From: | leslie | | Subject: | Re: MTBF for Windows and Linux | | Date: | Sat, 27 Nov 2004 21:43:37 GMT |
|
|
 | Ian Mckay (imm102@hotmail.com) wrote: : hi, : : Im doing a project on the suitability (or lack there-of) of Windows : and Linux for use in critical systems. Try as i might i can't find any : articles discussing this topic (Only heated discussions in the odd : forum). I am particularly interested in finding figures for the mean : time between failure for these two OS's as well as common faults that : are well known. If anyone can point me in the right direction that : would be very helpful. :
Here's an article against using Windows for mission-critical applications such as pipeline control:
http://www.seattleweekly.com/features/0002/tech-scigliano.shtml Seattle Weekly - tech: A worm in the works
"Could disasters loom as more and more pipeline operators switch to Windows NT?
[snip]
But not one that's likely to pan out. That's because, according to Olympic IT manager Dan Swathman, the pipeline's SCADA system does not run on Windows, which might have made it vulnerable to the e-mail-borne Worm.Explore.zip: "Our current SCADA system is on VMS, from Digital Equipment [now Compaq], running on an Alpha Chip. GMI makes the system. A couple computers are dedicated to it." And, Swathman adds, these computers are not connected to the Windows-based e-mail and office systems through which the worm could have gotten in.
That's reassuring--but the broader picture may not be. Across this country, and as far away as China, pipeline systems are switching from VMS and Unix systems to versatile, ubiquitous, user-friendly Windows NT. "The scada market has been moving towards Windows NT as the dominant operating system," Oil & Gas Journal reported (3/24/97)..."
The systems in operation at the time of the explosion were still VAXes, not Alphas:
http://www.ntsb.gov/publictn/2002/PAR0202.htm NTSB Abstract PAR-02/02
http://www.ntsb.gov/publictn/2002/PAR0202.pdf
"The Olympic Pipeline SCADA system consisted of Teledyne Brown Engineering 20 SCADA Vector software, version 3.6.1., running on two Digital Equipment Corporation (DEC) VAX Model 4000-300 computers with VMS operating system Version 7.1..."
The VAXes were replaced by dual-CPU Alphas in 2001. I was a member of the project team that performed the upgrade.
The U.S. Navy's Smart Ship program uses Windows-based systems:
http://www.gcn.com/vol19_no27/dod/2868-1.html Navy carrier to run Win 2000
http://www.cnn.com/2000/TECH/computing/08/08/carrier.windows.idg/ CNN.com - Technology - Futuristic Windows version to control aircraft carrier - August 8, 2000
"...The CVN-77 win is a key triumph for Microsoft in the defense industry, because it sets the stage for the company's participation in the Navy's long-term, three-phase future carrier design program. "This is not just the one ship. It will decide the architectures for the next three ships," Roach said. Microsoft's agreement also includes a back-fit program for seven other carriers, bringing the total to 10."
At least some PLCs are also used, per this description of the "Smart Ship" system...
http://www.e-d-i.com/products_control.html L-3 Communications SPD Technologies - Control Systems
VMS systems support multi-site clusters, which can tolerate the loss of an entire datacenter:
http://h71000.www7.hp.com/openvms/brochures/commerzbank/ hp Alphaserver technology helps Commerzbank tolerate disaster on September 11
"testing disaster tolerance
While most large organizations today have plans for Disaster Tolerance (DT), few have to put them to the test. The North American headquarters of Commerzbank, located less than 100 yards from the World Trade Center in New York City, put its DT plan into action on September 11, 2001. Because Commerzbank relies on OpenVMS wide-area clustering, volume shadowing and AlphaServer GS160 systems from HP, the bank was able to function on September 11 because its critical banking applications continued to run at the primary site and were available from the bank's remote site..."
http://www.enterpriseitplanet.com/storage/features/article.php/3396941 OpenVMS Gets a Case of the DT's
OpenVMS Gets a Case of the DT's August 18, 2004 By Drew Robb
It's not uncommon for alcoholics to suffer from the DT's (Delirium Tremens -- severe alcohol withdrawal characterized by agitation, violence, anxiety, insomnia, muscle cramps, tremor, delusion, hallucinations, and fever). But whoever heard of an operating system (OS) suffering from the malady? Well, the OpenVMS OS apparently has an acute case of the DT's. Though in this instance, we are talking about disaster recovery.
Disaster Tolerance (DT) is a concept that extends beyond disaster recovery (DR). Traditional DR focuses on minimizing downtime then picking up the pieces and reconstructing any lost data afterwards..."
http://makeashorterlink.com/?V48121DA9 OpenVMS Survives and Thrives - Computerworld
There have been some recent articles on VMS:
The original URL, wrapped to 2 lines:
http://www.computerworld.com/softwaretopics/software/story/ 0,10801,97032,00.html OpenVMS Survives and Thrives - Computerworld
OpenVMS Survives and Thrives
The 'legacy' operating system maintains a substantial base in large organizations, and there's promise of new interest as it moves to 64-bit Itanium.
[snip]
According to David Freund, an analyst at IT research firm Illuminata Inc. in Nashua, N.H., several financial services businesses in the towers and numerous others in the immediate vicinity had OpenVMS disaster-tolerant clusters with backup sites outside the area. Every one of them had their operations running just moments after the catastrophe, says Freund.
Following that awful day, OpenVMS seems to have gained new prominence. In some IT circles, it's now regarded as the creme de la creme in disaster recovery and high availability, according to users and analysts.
"OpenVMS uptimes can be measured in years," says Stenz. "This is certainly preferable to a culture of rebooting and disruption that plagues other platforms due to viruses, Trojans, denial-of-service attacks and endless patching of systems..."
HTH,
--Jerry Leslie Note: leslie@jrlvax.houston.rr.com is invalid for email "VMS: Uptime measured with a calendar instead of a stopwatch"
|
|
 | | From: | Paul E. Bennett | | Subject: | Re: MTBF for Windows and Linux | | Date: | Sun, 28 Nov 2004 19:55:25 +0000 |
|
|
 | leslie wrote:
> Here's an article against using Windows for mission-critical applications > such as pipeline control: > > http://www.seattleweekly.com/features/0002/tech-scigliano.shtml > Seattle Weekly - tech: A worm in the works > > "Could disasters loom as more and more pipeline operators switch > to Windows NT?
Anyone considering MS Windows based products for any mission critical application should take a look at the licence conditions for starters. If you do use it for such applications you are standing alone and may be seen as liable for the consequences of any unfavourable outcome. All MS terms and conditions actually forbid use of their products in mission critical applications without consultation with MS first.
> The systems in operation at the time of the explosion were still VAXes, > not Alphas:
I suppose those who are responsible for critical systems on other OSes might like to comment on what the situation is with their set-up. As I only use such systems at the HMI level, not the "Safety Critical" level, there is little risk involved even if the whole HMI failed.
As I indicated in my last posting on this topic, utilising a multi-layer approach to critical systems will ensure that the critical parts are concentrated in sub-systems that you can prove will operate safely and correctly under all non-optimal situations.
-- ******************************************************************** Paul E. Bennett .................... Forth based HIDECS Consultancy ..... Mob: +44 (0)7811-639972 .........NOW AVAILABLE:- HIDECS COURSE...... Tel: +44 (0)1235-811095 .... see http://www.feabhas.com for details. Going Forth Safely ..... EBA. www.electric-boat-association.org.uk.. ********************************************************************
|
|
 | | From: | Nick Landsberg | | Subject: | Re: MTBF for Windows and Linux | | Date: | Thu, 25 Nov 2004 20:54:56 GMT |
|
|
 | Ian Mckay wrote:
> hi, > > Im doing a project on the suitability (or lack there-of) of Windows > and Linux for use in critical systems. Try as i might i can't find any > articles discussing this topic (Only heated discussions in the odd > forum). I am particularly interested in finding figures for the mean > time between failure for these two OS's as well as common faults that > are well known. If anyone can point me in the right direction that > would be very helpful.
Google for either "Carrier Grade Linux" or "High Availability Linux" or "real time Linux".
I doubt that anyone has published MTBF or MTTR, however. The Carrier Grade Linux effort is too new to have many results to report, and do you really think MS would publish data of that kind about Windows?
If you do get any data, just be skeptical and look at it with a jaundiced eye. Are 12 one minute outages a year better or worse than one 12 minute outage a year?
The answer is "it probably depends on the application."
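The arithmetic behind that question can be made explicit. The sketch below is illustrative only: it assumes a 365-day year, and the function name `outage_profile` is hypothetical. It shows that both outage patterns give the same yearly availability while differing by a factor of twelve in MTBF and in time to restore, which is why a single availability figure does not settle the matter.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # leap years ignored


def outage_profile(outages_per_year, minutes_per_outage):
    """Summarise a yearly outage pattern as availability, MTBF and MTTR."""
    downtime = outages_per_year * minutes_per_outage
    uptime = MINUTES_PER_YEAR - downtime
    return {
        "availability": uptime / MINUTES_PER_YEAR,
        "mtbf_days": uptime / outages_per_year / (24 * 60),
        "mttr_minutes": minutes_per_outage,
    }


many_short = outage_profile(12, 1)   # twelve one-minute outages
one_long = outage_profile(1, 12)     # one twelve-minute outage

for name, profile in (("12 x 1 min", many_short), ("1 x 12 min", one_long)):
    print(f"{name}: availability={profile['availability']:.6f}, "
          f"MTBF={profile['mtbf_days']:.1f} days, "
          f"MTTR={profile['mttr_minutes']} min")
```

An application that loses in-flight work on every interruption would likely prefer the single long outage; one whose users can tolerate a one-minute blip but not a twelve-minute gap would prefer the other.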
NPL
-- "It is impossible to make anything foolproof because fools are so ingenious" - A. Bloch
|
|
 | | From: | smilemac | | Subject: | Re: MTBF for Windows and Linux | | Date: | Sun, 28 Nov 2004 00:50:00 +0800 |
|
|
 | "Nick Landsberg" ??????:AErpd.974448$Gx4.550056@bgtnsc04-news.ops.worldnet.att.net... > Ian Mckay wrote: > > > hi, > > > > Im doing a project on the suitability (or lack there-of) of Windows > > and Linux for use in critical systems. Try as i might i can't find any > > articles discussing this topic (Only heated discussions in the odd > > forum). I am particularly interested in finding figures for the mean > > time between failure for these two OS's as well as common faults that > > are well known. If anyone can point me in the right direction that > > would be very helpful. > > Google for either "Carrier Grade Linux" or "High Availability > Linux" or "real time Linux". > > I doubt that anyone has published MTBF or MTTR, however. > The Carrier Grade Linux effort is too new to have many > results to report, and do you really > think MS would publish data of that kind about Windows? > > If you do get any data, just be skeptical and look > at it with a jaundiced eye. Are 12 one minute outages > a year better or worse than one 12 minute outage a year? > > The answer is "it probably depends on the application." > > NPL > > -- > "It is impossible to make anything foolproof > because fools are so ingenious" > - A. Bloch
There is little point in working out the difference in MTBF figures between Linux and M$. No single standalone OS is reliable enough for a "critical system". Reliability must be built on a hot-standby style architecture.
smilemac
|
|
 | | From: | Nick Landsberg | | Subject: | Re: MTBF for Windows and Linux | | Date: | Sat, 27 Nov 2004 20:19:09 GMT |
|
|
 | smilemac wrote: > "Nick Landsberg" > ??????:AErpd.974448$Gx4.550056@bgtnsc04-news.ops.worldnet.att.net... > >>Ian Mckay wrote: >> >> >>>hi, >>> >>>Im doing a project on the suitability (or lack there-of) of Windows >>>and Linux for use in critical systems. Try as i might i can't find any >>>articles discussing this topic (Only heated discussions in the odd >>>forum). I am particularly interested in finding figures for the mean >>>time between failure for these two OS's as well as common faults that >>>are well known. If anyone can point me in the right direction that >>>would be very helpful. >> >>Google for either "Carrier Grade Linux" or "High Availability >>Linux" or "real time Linux". >> >>I doubt that anyone has published MTBF or MTTR, however. >>The Carrier Grade Linux effort is too new to have many >>results to report, and do you really >>think MS would publish data of that kind about Windows? >> >>If you do get any data, just be skeptical and look >>at it with a jaundiced eye. Are 12 one minute outages >>a year better or worse than one 12 minute outage a year? >> >>The answer is "it probably depends on the application." >> >>NPL >> >>-- >>"It is impossible to make anything foolproof >>because fools are so ingenious" >> - A. Bloch > > > There's no any meanings to find out the figure difference of MTBF between > Linux and M$. No one standalone OS is reliable enough for a "critical > system". Reliabilty must be built on the hot standby style architecture. > > smilemac > >
Agreed.
Considering that commercially available server hardware is usually on the order of 3-9's reliability (about 8-9 hours a year down time), you _already_ need hot standby to get anywhere near 5-9's. Assuming of course, that you have software assist to detect the failure and switch to the hot standby. I, personally, prefer a "dual-hot" configuration where each side is taking half the load under normal conditions but can pick up the whole load if its "mate" fails. There are several reasons which I won't get into here.
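The relation between "nines" and yearly downtime, and the effect of a standby, can be sketched as follows. This is a rough model, not a prediction: it assumes a 365-day year, independent failures, and instant, perfect failover, none of which holds exactly in practice. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760; leap years ignored


def downtime_hours_per_year(availability):
    """Expected yearly downtime for a given availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


def redundant_availability(availability, units=2):
    """Availability of `units` redundant systems.

    Assumes independent failures and instant, perfect failover,
    so the service is down only when every unit is down at once.
    """
    return 1.0 - (1.0 - availability) ** units


single = 0.999                          # "3-9's" server
pair = redundant_availability(single)   # idealised hot-standby pair

print(f"single: {downtime_hours_per_year(single):.2f} hours/year down")
print(f"pair:   {downtime_hours_per_year(pair) * 60:.2f} minutes/year down")
```

Under these idealised assumptions the pair comfortably exceeds 5-9's; real figures are lower once failure detection time, switchover time, and common-mode faults are included.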
In addition, studies in telephone systems some years ago indicated that only about 20% of system failures were due to hardware. There were 40% due to software and another 40% due to pilot error. How this relates to this day and age is debatable since telephone systems were built using home-grown hardened hardware back in those days.
One key parameter here is the restoration time. While the first system is down, you are "at risk" of the backup system taking a fault (HW, SW, other). The components of restoration time include:
- Fault detection (how long before someone realizes that it's down?)
- HW replacement (if it was a hardware fault). Add travel time if you need to go get it out of some warehouse somewhere or have it shipped.
- OS reboot
- Application startup. (The OS being up and running is not enough. If this system is dedicated to a certain set of applications, it isn't fully functional until all the applications are up and running.)
The OS can only try to address its own failure rates and try to address the reboot time.
Still and all, I would prefer an OS with an MTBF of 6 months to one with an MTBF of 1 month. (Not naming any names.)
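To see how the restoration components and the MTBF interact, here is a minimal sketch using the usual steady-state relation availability = MTBF / (MTBF + MTTR). The minute values are made up purely for illustration, the month length is a rough average, and the function name `availability` is hypothetical.

```python
HOURS_PER_MONTH = 730.0  # rough average: 8760 / 12


def availability(mtbf_hours, restoration_minutes):
    """Steady-state availability, MTBF / (MTBF + MTTR).

    MTTR is taken as the sum of the restoration components.
    """
    mttr_hours = sum(restoration_minutes.values()) / 60.0
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical restoration budget for a software fault, in minutes.
restoration = {
    "fault_detection": 5,
    "hw_replacement": 0,       # zero here: not a hardware fault
    "os_reboot": 3,
    "application_startup": 7,
}

a_1_month = availability(1 * HOURS_PER_MONTH, restoration)
a_6_month = availability(6 * HOURS_PER_MONTH, restoration)

print(f"MTBF 1 month:  {a_1_month:.6f}")
print(f"MTBF 6 months: {a_6_month:.6f}")
```

With the same restoration budget, the longer MTBF gives the better availability, and the OS itself only influences two of the terms: its own failure rate and its reboot time.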
NPL
-- "It is impossible to make anything foolproof because fools are so ingenious" - A. Bloch
|
|
 | | From: | Joseph Gwinn | | Subject: | Re: MTBF for Windows and Linux | | Date: | Sat, 27 Nov 2004 23:07:37 GMT |
|
|
 | In article <1j5qd.65611$7i4.28989@bgtnsc05-news.ops.worldnet.att.net>, Nick Landsberg wrote:
> smilemac wrote: > > "Nick Landsberg" > > ??????:AErpd.974448$Gx4.550056@bgtnsc04-news.ops.worldnet.att.net... > > > >>Ian Mckay wrote: > >> > >> > >>>hi, > >>> > >>>Im doing a project on the suitability (or lack there-of) of Windows > >>>and Linux for use in critical systems. Try as i might i can't find any > >>>articles discussing this topic (Only heated discussions in the odd > >>>forum). I am particularly interested in finding figures for the mean > >>>time between failure for these two OS's as well as common faults that > >>>are well known. If anyone can point me in the right direction that > >>>would be very helpful. > >> > >>Google for either "Carrier Grade Linux" or "High Availability > >>Linux" or "real time Linux". > >> > >>I doubt that anyone has published MTBF or MTTR, however. > >>The Carrier Grade Linux effort is too new to have many > >>results to report, and do you really > >>think MS would publish data of that kind about Windows? > >> > >>If you do get any data, just be skeptical and look > >>at it with a jaundiced eye. Are 12 one minute outages > >>a year better or worse than one 12 minute outage a year? > >> > >>The answer is "it probably depends on the application." > >> > >>NPL > >> > >>-- > >>"It is impossible to make anything foolproof > >>because fools are so ingenious" > >> - A. Bloch > > > > > > There's no any meanings to find out the figure difference of MTBF between > > Linux and M$. No one standalone OS is reliable enough for a "critical > > system". Reliabilty must be built on the hot standby style architecture. > > > > smilemac > > > > > > Agreed. > > [snip] > > In addition, studies in telephone systems some years > ago indicated that only about 20% of system failures > were due to hardware. There were 40% due to software > and another 40% due to pilot error. How this relates > to this day and age is debatable since telephone > systems were built using home-grown hardened hardware > back in those days.
I've heard of these studies too. Do you (or anybody else) have the references? I'd like to get copies.
> [snip] > > Still and all, I would prefer an OS with an > MTBF of 6 months to one with an MTBF of 1 month. > (Not naming any names.)
Again, I've always used one month, as this seems to be the goal for desktop operating systems. The usual form of the question I ask is the following: "Is it OK if crashes once a month, requiring a hard reset?" If the answer is yes, then it's plausible to use a desktop OS for the purpose. If the answer is no, then one must use a real RTOS, perhaps even one suited for full safety certification, depending on the nature of .
By the way, do you know the reference for the one month and six months figures?
Joe Gwinn
|
|
 | | From: | Nick Landsberg | | Subject: | Re: MTBF for Windows and Linux | | Date: | Sun, 28 Nov 2004 02:25:45 GMT |
|
|
 | Joseph Gwinn wrote:
> In article <1j5qd.65611$7i4.28989@bgtnsc05-news.ops.worldnet.att.net>, > Nick Landsberg wrote: > > >>smilemac wrote: >> >>>"Nick Landsberg" >>>??????:AErpd.974448$Gx4.550056@bgtnsc04-news.ops.worldnet.att.net... >>> >>> >>>>Ian Mckay wrote: >>>> >>>> >>>> >>>>>hi, >>>>> >>>>>Im doing a project on the suitability (or lack there-of) of Windows >>>>>and Linux for use in critical systems. Try as i might i can't find any >>>>>articles discussing this topic (Only heated discussions in the odd >>>>>forum). I am particularly interested in finding figures for the mean >>>>>time between failure for these two OS's as well as common faults that >>>>>are well known. If anyone can point me in the right direction that >>>>>would be very helpful. >>>> >>>>Google for either "Carrier Grade Linux" or "High Availability >>>>Linux" or "real time Linux". >>>> >>>>I doubt that anyone has published MTBF or MTTR, however. >>>>The Carrier Grade Linux effort is too new to have many >>>>results to report, and do you really >>>>think MS would publish data of that kind about Windows? >>>> >>>>If you do get any data, just be skeptical and look >>>>at it with a jaundiced eye. Are 12 one minute outages >>>>a year better or worse than one 12 minute outage a year? >>>> >>>>The answer is "it probably depends on the application." >>>> >>>>NPL >>>> >>>>-- >>>>"It is impossible to make anything foolproof >>>>because fools are so ingenious" >>>> - A. Bloch >>> >>> >>>There's no any meanings to find out the figure difference of MTBF between >>>Linux and M$. No one standalone OS is reliable enough for a "critical >>>system". Reliabilty must be built on the hot standby style architecture. >>> >>>smilemac >>> >>> >> >>Agreed. >> >>[snip] >> >>In addition, studies in telephone systems some years >>ago indicated that only about 20% of system failures >>were due to hardware. There were 40% due to software >>and another 40% due to pilot error. 
How this relates >>to this day and age is debatable since telephone >>systems were built using home-grown hardened hardware >>back in those days. > > > I've heard of these studies too. Do you (or anybody else) have the > references? I'd like to get copies.
The studies I was referring to were done by AT&T for the 4ESS and 5ESS switches. As far as I know, they were not published outside of AT&T. (And I do not have access to them any more.)
> > > >>[snip] >> >>Still and all, I would prefer an OS with an >>MTBF of 6 months to one with an MTBF of 1 month. >>(Not naming any names.) > > > Again, I've always used one month, as this seems to be the goal for > desktop operating systems. The usual form of the question I ask is the > following: "Is it OK if crashes once a month, requiring a > hard reset?" If the answer is yes, then it's plausible to use a desktop > OS for the purpose. If the answer is no, then one must use a real RTOS, > perhaps even one suited for full safety certification, depending on the > nature of . > > By the way, do you know the reference for the one month and six months > figures?
I was just throwing out what seemed like reasonable figures for the classes of systems which you mentioned (desktop vs. "other"). Consider them hearsay until proven otherwise.
I do not work in the "hard real-time world" but have had chances to review some of their work (and made good use of the learnings they had to share). I work in a "soft real-time world." i.e. rather than 5-10 second switchover, I get the luxury of 5-10 minutes.
For this we use Solaris and we have learned enough about its idiosyncrasies that we've achieved 5-9's in an "all active" (either mated pair or N+K) configuration (if we discount planned down time for software upgrades).
The SUN literature claims a HW reliability of 3-9's for their "hardened" servers (NEBS certified). We have not verified this in enough field tests to be meaningful.
Since 3-9's is about 8 hours a year, I assumed that this was 2 hardware outages of 4 hours restoration time each, but I could be wrong.
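The "nines" arithmetic above can be checked with a short sketch. It assumes a 365-day year and treats availability as simply 1 minus the fraction of the year spent down; the function names are made up for illustration. It also shows why the earlier question about twelve one-minute outages versus one twelve-minute outage cannot be settled by an availability figure alone: both give the same number.

```python
from typing import Sequence

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def downtime_hours_per_year(nines: int) -> float:
    """Downtime budget per year for an availability of `nines` nines (3 -> 99.9%)."""
    unavailability = 10.0 ** (-nines)
    return HOURS_PER_YEAR * unavailability


def availability(outage_minutes: Sequence[float]) -> float:
    """Availability over one year given the duration of each outage in minutes."""
    total_minutes = HOURS_PER_YEAR * 60
    return 1.0 - sum(outage_minutes) / total_minutes


if __name__ == "__main__":
    for n in (3, 5):
        hours = downtime_hours_per_year(n)
        print(f"{n} nines: {hours:.4f} hours/year ({hours * 60:.2f} minutes)")
    # Twelve 1-minute outages and one 12-minute outage are indistinguishable here.
    print(availability([1] * 12) == availability([12]))
```

So 3-9's works out to roughly 8.76 hours a year, consistent with the "about 8 hours" figure, and 5-9's to a little over five minutes.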
(Note: This is not meant as a testimonial for SUN. It's just what our experience has been with them. We have other groups within the company which swear by HP, and still others which insist on using VxWorks for any real-time stuff. YMMV.)
NPL
> > Joe Gwinn
-- "It is impossible to make anything foolproof because fools are so ingenious" - A. Bloch
|
|
 | | From: | Joseph Gwinn | | Subject: | Re: MTBF for Windows and Linux | | Date: | Sun, 28 Nov 2004 18:38:08 GMT |
|
|
 | In article , Nick Landsberg wrote:
> Joseph Gwinn wrote: > > > In article <1j5qd.65611$7i4.28989@bgtnsc05-news.ops.worldnet.att.net>, > > Nick Landsberg wrote: > > > > > >>smilemac wrote: > >> > >>>"Nick Landsberg" > >>>??????:AErpd.974448$Gx4.550056@bgtnsc04-news.ops.worldnet.att.net... > >>> > >>>[snip] > >>[snip] > >> > >>In addition, studies in telephone systems some years > >>ago indicated that only about 20% of system failures > >>were due to hardware. There were 40% due to software > >>and another 40% due to pilot error. How this relates > >>to this day and age is debatable since telephone > >>systems were built using home-grown hardened hardware > >>back in those days. > > > > > > I've heard of these studies too. Do you (or anybody else) have the > > references? I'd like to get copies. > > The studies I was referring to were done by AT&T for the > 4ESS and 5ESS switches. As far as I know, they were not published > outside of AT&T. (And I do not have access to them any > more.)
Ahh. I vaguely recall some articles in the Bell System Technical Journal on just this subject. That's where it would be documented publicly, if anywhere.
> >>[snip] > >> > >>Still and all, I would prefer an OS with an > >>MTBF of 6 months to one with an MTBF of 1 month. > >>(Not naming any names.) > > > > > > Again, I've always used one month, as this seems to be the goal for > > desktop operating systems. The usual form of the question I ask is the > > following: "Is it OK if crashes once a month, requiring a > > hard reset?" If the answer is yes, then it's plausible to use a desktop > > OS for the purpose. If the answer is no, then one must use a real RTOS, > > perhaps even one suited for full safety certification, depending on the > > nature of . > > > > By the way, do you know the reference for the one month and six months > > figures? > > I was just throwing out what seemed like reasonable figures > for the classes of systems which you mentioned (desktop vs. > "other"). Consider them hearsay until proven otherwise.
My one-month estimate comes from reading lots of articles in the computer press where this seemed to be the shining ideal. I have also seen reports saying that this or that shop had made measurements, and got just those kinds of measurements, but was never able to get the details.
> I do not work in the "hard real-time world" but have > had chances to review some of their work (and made > good use of the learnings they had to share). I work in > a "soft real-time world." i.e. rather than 5-10 second > switchover, I get the luxury of 5-10 minutes.
Who is "their"?
The degree of realtime performance and the switchover time are unrelated. In other words, it does not follow that the more stringent the realtime the faster the switchover.
One example would be the firmware in an ordinary disk drive. This firmware is stringently realtime, being used to implement servo control of disk and head motion. However, if the disk is used in an office worker's desktop system, nobody will die if the disk breaks and has to be replaced, even if this takes a day or two.
Despite the example, my experience has been that Windows crashes are a far more common cause of lost files than hardware failure, but still...
Hope the office worker believes in making backups, though. Sooner or later, he's going to need them.
> For this we use Solaris and we have learned enough about > it's idiosyncracies that we've achieved 5-9's in a > "all active" (either mated pair or N+K) configuration (if we > discount planned down time for software upgrades). > > The SUN literature claims a HW reliability of 3-9's > for their "hardened" servers (NEBS certified). We have not verified > this in enough field tests to be meaningful. > > Since 3-9's is about 8 hours a year, I assumed that this > was 2 hardware outages of 4 hours restoration time > each, but I could be wrong. > > (Note: This is not meant as a testimonial for SUN. It's > just what our experience has been with them. We have > other groups within the company which swear by HP, and still > others which insist on using VxWorks for any real-time > stuff. YMMV.)
Yes. I've used or seen used Sun, SGI, HP, Stratus, Tandem, DEC VAX, Data General, various makes and models of single-board computers, et al, in the dual-redundant or triple-redundant (where a single string suffices) topology. All worked, and the redundancy converted hardware reliability into a maintenance cost issue only, rather than a system availability issue.
The one triple-redundant system was to allow any two strings to be online while the third string was in pieces on the floor, being repaired.
I've also used VxWorks in many systems, including a quad-redundant naval system that had to survive massive battle damage (three cruise-missile strikes).
The bottom line is that the system availability was achieved by use of redundant hardware with application code designed to make full use of the replicated hardware (not all of which is computers); the availability of the overall system vastly exceeded that of the hardware, the operating systems, and the application software.
That said, this only works if the operating system is both realtime enough for the intended application, and reliable enough that the overall system isn't killed by double faults. Use of redundant computers helps greatly, but does have its limits, especially in the face of common-mode faults.
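A rough way to see both the benefit and the limit is the textbook k-of-n availability formula. The sketch below is not a model of any system mentioned here. It assumes the strings fail independently with the same availability, and it models common-mode faults crudely as one extra unavailability term that takes out every string at once; all names and numbers are hypothetical.

```python
from math import comb


def k_of_n_availability(a: float, n: int, k: int = 1) -> float:
    """Availability of n independent, identical strings when at least k must be up."""
    return sum(comb(n, i) * a**i * (1.0 - a) ** (n - i) for i in range(k, n + 1))


def with_common_mode(a_redundant: float, u_common: float) -> float:
    """Apply a common-mode unavailability that defeats all strings together."""
    return a_redundant * (1.0 - u_common)


if __name__ == "__main__":
    single = 0.999  # one 3-nines string (illustrative figure)
    pair = k_of_n_availability(single, n=2, k=1)
    triple = k_of_n_availability(single, n=3, k=1)
    print(f"single string      : {single:.9f}")
    print(f"1-of-2 redundant   : {pair:.9f}")
    print(f"1-of-3 redundant   : {triple:.9f}")
    # A common-mode fault rate of 1e-4 dominates no matter how many strings are added.
    print(f"1-of-3 + common    : {with_common_mode(triple, 1e-4):.9f}")
```

Under these assumptions two 3-nines strings give about six nines, but any common-mode term puts a ceiling on the result that additional strings cannot raise.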
Then there is the issue of surviving software development and system integration.
Joe Gwinn
|
|
 | | From: | Andrew Gabb | | Subject: | Re: MTBF for Windows and Linux | | Date: | Sat, 27 Nov 2004 22:54:48 +1030 |
|
|
 | Ian Mckay wrote: > Im doing a project on the suitability (or lack there-of) of Windows > and Linux for use in critical systems. Try as i might i can't find any > articles discussing this topic (Only heated discussions in the odd > forum). I am particularly interested in finding figures for the mean > time between failure for these two OS's as well as common faults that > are well known. If anyone can point me in the right direction that > would be very helpful.
I know of sites where the MTBF of Windows (and the apps used) is a month or more. True!
Of course, they don't do much and it's almost always the same thing. Hmmm, sounds a bit like Linux.
Seriously though, the reliability of OSs depends on so many factors, you'd have trouble categorising them, let alone coming up with MTBF or other reliability/availability figures that anyone in the know would actually believe.
The only ones I've seen have always come from sources who have a serious conflict of interest anyway. Like when I was called in by a fairly typical small business (with about 7 staff/PCs), where someone was pissing in their ear about Linux, complete with a comparison article. It would have been a disaster for them, but it would have made the (potential) service provider quite wealthy. Go figure.
FWIW I also know that for some orgs, Linux *can* be a cost effective part of their system.
Andrew -- Andrew Gabb email: agabb@tpgi.com.au Adelaide, South Australia phone: +61 8 8342-1021, fax: +61 8 8269-3280 -----
|
|
 | | From: | Paul E. Bennett | | Subject: | Re: MTBF for Windows and Linux | | Date: | Fri, 26 Nov 2004 19:46:18 +0000 |
|
|
 | Ian Mckay wrote:
> hi, > > Im doing a project on the suitability (or lack there-of) of Windows > and Linux for use in critical systems. Try as i might i can't find any > articles discussing this topic (Only heated discussions in the odd > forum). I am particularly interested in finding figures for the mean > time between failure for these two OS's as well as common faults that > are well known. If anyone can point me in the right direction that > would be very helpful.
Well, you have already had some interesting responses which have all pointed out that such figures will be difficult to find. Individual SysAdmins may have records enough but then you will have to query them regarding versions. Remember, any change made to the software changes the system, and such OSs may not be readily amenable to re-validation.
If you really are interested in bringing up a "critical system" application and need to depend on the OS then you should obtain a full source listing of the OS and go through it with a fine-toothed comb to understand what it is doing and ensure that there are no hidden gotchas.
Your system architecture is going to be one of the most important aspects of the design decisions to be made. Think of it as a number of layers from sensors and actuators at the lowest level, through group controllers and up to the HMI network. The sensor/actuator level is where you need the strongest safety interlocking. The group controllers can issue those elements below it with a permit to operate and be considered as an intermediate safety interlocking. The HMI layer can then use any convenient HMI software so long as you can provide reasonable security of operator identification.
The systems in the lowest layer will probably not need an OS at all. The intermediate layer (group controllers) may be a simple PLC-type system with networking interfaces. The HMI can be any Unix/Linux/Windows networked terminal and/or server system. At the upper layer criticality is eliminated as an issue.
-- ******************************************************************** Paul E. Bennett .................... Forth based HIDECS Consultancy ..... Mob: +44 (0)7811-639972 .........NOW AVAILABLE:- HIDECS COURSE...... Tel: +44 (0)1235-811095 .... see http://www.feabhas.com for details. Going Forth Safely ..... EBA. www.electric-boat-association.org.uk.. ********************************************************************
|
|
 | | From: | Bernhard Holzmayer | | Subject: | Re: MTBF for Windows and Linux | | Date: | Mon, 29 Nov 2004 11:30:22 +0100 |
|
|
 | Ian Mckay wrote:
> hi, > > Im doing a project on the suitability (or lack there-of) of Windows > and Linux for use in critical systems. Try as i might i can't find any > articles discussing this topic (Only heated discussions in the odd > forum). I am particularly interested in finding figures for the mean > time between failure for these two OS's as well as common faults that > are well known. If anyone can point me in the right direction that > would be very helpful.
Hi Ian,
if one of these OSs shall be used in a critical system, there are certainly a lot of restrictions which apply and which influence the general situation so much that you cannot take figures from the mainstream.
We use a WinNT/Win2000 based system in a production line. It's critical in the sense that production must be stopped if the system fails (which is expensive for the customer, and therefore undesired). It runs in combination with a VxWorks machine which does the real-time critical parts; thus Windows is critical because it must run (and do some work), but Windows only has to fulfil (very) soft realtime requirements.
That's what you should do: define your system as narrowly as possible, and restrict it to the minimum scenario. It's the only chance to get comparable figures.
Then find out, which MTBF would be acceptable in your system. And then, design your application software including OS, so that it is able to achieve a significantly better rate.
In my opinion, you cannot do better.
MTBF is only a statistical measure. It doesn't say anything about the real time-to-next-failure. Even if Linux had a ten times better MTBF, who guarantees that it doesn't fail first?
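To put a number on that, here is a small sketch under one strong assumption: that time-to-failure is exponentially distributed with a mean equal to the MTBF, which is a common textbook model and not something measured for either OS. The function names and MTBF values are invented for illustration. Under that model the system with the better MTBF still fails first with probability rate_a / (rate_a + rate_b).

```python
import random


def prob_a_fails_first(mtbf_a: float, mtbf_b: float) -> float:
    """P(A fails before B) for independent exponential lifetimes."""
    rate_a, rate_b = 1.0 / mtbf_a, 1.0 / mtbf_b
    return rate_a / (rate_a + rate_b)


def simulate(mtbf_a: float, mtbf_b: float, trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability, as a sanity check."""
    rng = random.Random(seed)
    a_first = sum(
        rng.expovariate(1.0 / mtbf_a) < rng.expovariate(1.0 / mtbf_b)
        for _ in range(trials)
    )
    return a_first / trials


if __name__ == "__main__":
    # System A has ten times the MTBF of system B (units are arbitrary).
    print(f"analytic : {prob_a_fails_first(10.0, 1.0):.4f}")
    print(f"simulated: {simulate(10.0, 1.0):.4f}")
```

With a tenfold MTBF advantage the "better" system still fails first about one time in eleven under this model, which is the point: the figure describes a population average, not the next failure.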
Therefore, instead of trying to find an MTBF or other rate, define scenarios which could make your system fail. Then calculate the cost and/or possibility of repairing the system.
If it means that the reset button needs to be pressed, and a day's work is spoiled, the loss may be not too severe. If a satellite is traveling through space and fails on its way to Saturn, that might cost millions and years of work.
I wouldn't rely on the better MTBF in that case...
Probably you have to convince your partners/chiefs/... that it's not MTBF which you need. It's certainly a relation between cost and return during planning/development/maintenance, and then there is the acceptance of the customer/user who must buy your system.
When our system was designed, it was obvious that Windows was not the best-performing system under discussion, not even the second... But it was obvious too that the customers wouldn't accept another system (at that time).
Bernhard
|
|
|