newsgroups-index (beta)

Current group: comp.arch

Firewire for remote DMA?

From:Paul Rubin
Subject:Firewire for remote DMA?
Date:09 Jan 2005 03:21:26 -0800
Hi, I don't know if this is the right newsgroup for this kind of
question, but I'm wondering what people think of the idea of using
Firewire-800 as a pseudo-RDMA mechanism when most of the traffic is
small packets (say 64 bytes). The idea is it's a lot cheaper than
Myrinet or Infiniband, while (this is what I'm asking) hopefully
avoiding the protocol overhead of Ethernet. With the most common
Firewire interfaces, you can initiate a transfer that writes directly into
the target computer's memory at the address of your choice (yes,
that's a huge security hole if your cluster has untrusted boxes). I
don't know if you can also read from the target's memory at your
choice of address, but most Firewire cards support multiple buses, so
you could use one bus for reading and one for writing.

Any thoughts about this? I haven't studied Firewire in detail; the
above is just impressions that I've gotten.
From:Del Cecchi
Subject:Re: Firewire for remote DMA? Use InfiniBand
Date:Tue, 11 Jan 2005 15:48:31 -0600
Paul Rubin wrote:
> Hi, I don't know if this is the right newsgroup for this kind of
> question, but I'm wondering what people think of the idea of using
> Firewire-800 as a pseudo-RDMA mechanism when most of the traffic is
> small packets (say 64 bytes). The idea is it's a lot cheaper than
> Myrinet or Infiniband, while (this is what I'm asking) hopefully
> avoiding the protocol overhead of Ethernet. With the most common
> Firewire interfaces, you can initiate a transfer that writes directly into
> the target computer's memory at the address of your choice (yes,
> that's a huge security hole if your cluster has untrusted boxes). I
> don't know if you can also read from the target's memory at your
> choice of address, but most Firewire cards support multiple buses, so
> you could use one bus for reading and one for writing.
>
> Any thoughts about this? I haven't studied Firewire in detail; the
> above is just impressions that I've gotten.

I had a nice talk with someone from one of the IB vendors. Apparently
the "OEM pricing" for an end node is around 100 USD for the 4x (10 Gb)
InfiniBand HCA, if I understood correctly.

A switch is required for more than 2 nodes, and that costs a few
hundred, like 300 USD, per port and comes in a box. So that costs a few
thousand per minimum increment, depending on the vintage of the hardware.
Old stuff came in chunks of 8 ports and new stuff comes in 24-port
increments. The switch includes subnet management and all that stuff.

Maybe in a while the older switches will start showing up used on eBay.

del cecchi
From:Pete Zaitcev (OTID3)
Subject:Re: Firewire for remote DMA?
Date:Sun, 09 Jan 2005 12:54:46 -0800
On Sun, 09 Jan 2005 03:21:26 -0800, Paul Rubin wrote:

> Hi, I don't know if this is the right newsgroup for this kind of
> question, but I'm wondering what people think of the idea of using
> Firewire-800 as a pseudo-RDMA mechanism when most of the traffic is
> small packets (say 64 bytes). The idea is it's a lot cheaper than
> Myrinet or Infiniband, while (this is what I'm asking) hopefully
> avoiding the protocol overhead of Ethernet. [...]

I can see two problems with this idea.

First, the so-called protocol overhead of Ethernet is largely a myth.
If your fabric does not allow short packets to overtake or split large
packets, your latency is going to be 100 times larger than the theoretical
numbers. Also, the way the software works has a large influence. This
perception is undoubtedly colored by my experience, which was with database
and storage clusters. But I tend to be skeptical of fancy interconnects
in general. I do not remember what the MTU of FireWire is and if it
allows packets to be split.

Second, there's no such thing as a big-ass FireWire switch, unlike, say
288-port Infiniband switches. This makes your cluster laughably small.

My friend Wim's group at Oracle has their DB running on top of FireWire.
It works well for 2 and 4 node clusters. But there's not enough advantage
to make it a shipping product, even though small scale clusters for
databases make a whole lot more sense than small scale HPC clusters,
for the failover purposes. If FireWire were ubiquitous on server motherboards,
then such a product would be more interesting.

-- Pete
From:Paul Rubin
Subject:Re: Firewire for remote DMA?
Date:09 Jan 2005 20:21:40 -0800
"Pete Zaitcev (OTID3)" writes:
> > question, but I'm wondering what people think of the idea of using
> > Firewire-800 as a pseudo-RDMA mechanism when most of the traffic is
> I can see two problems with this idea.
>
> First, the so-called protocol overhead of Ethernet is largely a myth.
> If your fabric does not allow short packets to overtake or split large
> packets, your latency is going to be 100 times larger than the theoretical
> numbers. Also, the way the software works has a large influence. This
> perception is undoubtedly colored by my experience, which was with database
> and storage clusters. But I tend to be skeptical of fancy interconnects
> in general. I do not remember what the MTU of FireWire is and if it
> allows packets to be split.

Thanks, this is the kind of answer I was looking for. I think it may
be reasonable to set up the software so all packets are small. Larger
transfers could use an additional channel, or 1G ethernet. The
application is closer to a database cluster than a numerical
supercomputer. I don't envision huge cluster sizes, but yeah, 2-4
nodes is a bit limiting.

I had thought most ethernet RDMA implementations went through the TCP
stack, which has messages in both directions, multiple layers of
protocol, and two round trips through the kernel. I saw some figures
indicating about 20k null RPC's per second across 1G ethernet, or 50
microseconds per RPC. I'd hoped for something much faster than that.
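
For reference, the conversion from call rate to per-call time, plus a rough way to measure a null round trip on one machine, could be sketched in Python as below. This is only an illustration: the function names are made up for this sketch, and the ping-pong runs over a local socket pair, so it shows syscall and scheduling cost only, not wire or NIC latency.

```python
import socket
import threading
import time


def rate_to_latency_us(calls_per_second: float) -> float:
    """Convert a serial call rate into microseconds per call."""
    return 1e6 / calls_per_second


def null_rpc_round_trip_us(iterations: int = 2000) -> float:
    """Mean round-trip time of a 1-byte ping-pong over a local socket pair.

    Assumption: both ends live on the same host, so no network is involved.
    """
    client, server = socket.socketpair()

    def echo() -> None:
        # Echo each byte straight back, like a null RPC handler.
        for _ in range(iterations):
            data = server.recv(1)
            if not data:
                return
            server.sendall(data)

    worker = threading.Thread(target=echo)
    worker.start()
    start = time.perf_counter()
    for _ in range(iterations):
        client.sendall(b"x")
        client.recv(1)
    elapsed = time.perf_counter() - start
    worker.join()
    client.close()
    server.close()
    return elapsed / iterations * 1e6


if __name__ == "__main__":
    print(rate_to_latency_us(20_000))   # 50.0
    print(null_rpc_round_trip_us())     # machine dependent
```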

> If FireWire were ubiquitous on server motherboards, then such a
> product would be more interesting.

It's not ubiquitous, but it's fairly common, and PCI cards are cheap.
From:Erno Kuusela
Subject:Re: Firewire for remote DMA?
Date:10 Jan 2005 15:04:33 +0200
Paul Rubin writes:

> I saw some figures
> indicating about 20k null RPC's per second across 1G ethernet, or 50
> microseconds per RPC. I'd hoped for something much faster than that.

Superficial googling leads one to think FireWire has an intrinsic
latency in the hundreds of microseconds...

-- erno
From:Greg Lindahl
Subject:Re: Firewire for remote DMA?
Date:10 Jan 2005 10:19:00 -0800
In article ,
Pete Zaitcev (OTID3) wrote:

>First, the so-called protocol overhead of Ethernet is largely a myth.
>If your fabric does not allow short packets to overtake or split large
>packets, your latency is going to be 100 times larger than the theoretical
>numbers.

Sorry, are these 2 sentences supposed to be related? Overhead and
latency are different things. The high overhead in Ethernet is the
host cpu time spent processing packets. Latency is, well, latency.

The end of the second sentence is plain wrong. Firewire 800 runs at
800 megabits. This means a 1024 byte packet will delay a short packet
by ~ 10 microseconds. Now that's a latency. If you think about the
latency plus overhead of a packet, you'll find that this sum doesn't
go up by 100 times for TCP/IP, because the so-called protocol overhead
of TCP/IP is so large. [Of course, you can run cut-down protocols on
Ethernet, but I think you had TCP/IP in mind.]
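
The ~10 microsecond figure is just the serialization time of the long packet. A minimal sketch of that arithmetic follows; the function name is an illustration, and it ignores headers, arbitration, and any other framing cost on the bus.

```python
def serialization_delay_us(packet_bytes: int, link_mbit_per_s: float) -> float:
    """Time to clock a packet onto the wire, in microseconds.

    1 Mbit/s is one bit per microsecond, so bits / (Mbit/s) gives microseconds.
    """
    return packet_bytes * 8 / link_mbit_per_s


if __name__ == "__main__":
    # A 1024-byte packet on an 800 Mbit/s link holds up a short packet for:
    print(serialization_delay_us(1024, 800))  # 10.24
    # The 64-byte packet itself needs:
    print(serialization_delay_us(64, 800))    # 0.64
```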

> Also, the way the software works has a large influence. This
> perception is undoubtedly colored by my experience, which was with database
> and storage clusters.

It's worth mentioning that most database/storage clusters have a mix
of short packets (control and lock messages) and long packets (disk
blocks), so they exhibit this behavior all the time. HPC clusters
don't necessarily have this issue. Also note that you could have 2
high speed networks for your database cluster, one for short and one
for long packets. The price of the database software is so high that
this additional hardware cost tends to be in the noise.

-- greg
From:Pete Zaitcev (OTID3)
Subject:Re: Firewire for remote DMA?
Date:Mon, 10 Jan 2005 14:47:45 -0800
On Mon, 10 Jan 2005 10:19:00 -0800, Greg Lindahl wrote:

> [Of course, you can run cut-down protocols on
> Ethernet, but I think you had TCP/IP in mind.]

I was remembering attempts to run so-called "ST/SST", which I thought at the
time would be competitive with iSCSI (good thing I didn't bet my career
on that guess).

You're right about the correct meaning and usage of the word "overhead".
Somehow I assumed the original poster meant "unnecessary latency introduced
by all the extra processing" rather than "usecs of CPU time per I/O",
which may be more common. I shouldn't have made those assumptions.

--- Pete
From:Paul Rubin
Subject:Re: Firewire for remote DMA?
Date:10 Jan 2005 16:08:40 -0800
"Pete Zaitcev (OTID3)" writes:
> You're right about the correct meaning and usage of the word "overhead".
> Somehow I assumed the original poster meant "unnecessary latency introduced
> by all the extra processing" rather than "usecs of CPU times per I/O",
> which may be more common. I should've not done those assumptions.

I did mean extra latency. CPU time matters as well, but I was
thinking the extra latency came from the context switches through the
kernel TCP stack, and also the round trip packet delays because
of the ACK packets etc.
From:Niels_Jørgen_Kruse
Subject:Re: Firewire for remote DMA?
Date:Tue, 11 Jan 2005 00:52:16 +0100
Pete Zaitcev (OTID3) wrote:

> On Sun, 09 Jan 2005 03:21:26 -0800, Paul Rubin wrote:
>
> > Hi, I don't know if this is the right newsgroup for this kind of
> > question, but I'm wondering what people think of the idea of using
> > Firewire-800 as a pseudo-RDMA mechanism when most of the traffic is
> > small packets (say 64 bytes). The idea is it's a lot cheaper than
> > Myrinet or Infiniband, while (this is what I'm asking) hopefully
> > avoiding the protocol overhead of Ethernet. [...]
>
> I can see two problems with this idea.
>
> First, the so-called protocol overhead of Ethernet is largely a myth.
> If your fabric does not allow short packets to overtake or split large
> packets, your latency is going to be 100 times larger than the theoretical
> numbers. Also, the way the software works has a large influence. This
> perception is undoubtedly colored by my experience, which was with database
> and storage clusters. But I tend to be skeptical of fancy interconnects
> in general. I do not remember what the MTU of FireWire is and if it
> allows packets to be split.

IIRC FireWire divides bandwidth into a number of channels, which are
interleaved on the wire. If you don't allocate all the bandwidth to the
transfer of a large piece of data, a smaller one can overtake it.

> Second, there's no such thing as a big-ass FireWire switch, unlike, say
> 288-port Infiniband switches. This makes your cluster laughably small.

Right, all bandwidth is shared on FireWire. However, with multiple
FireWire cards in each node, you could have overlapping small clusters.
If most communication is gridlike (i.e. neighbour to neighbour), I
suppose it could be made to work OK.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
From:Del Cecchi
Subject:Re: Firewire for remote DMA?
Date:Tue, 11 Jan 2005 08:30:20 -0600
Niels Jørgen Kruse wrote:
> Pete Zaitcev (OTID3) wrote:
>
>
>>On Sun, 09 Jan 2005 03:21:26 -0800, Paul Rubin wrote:
>>
>>
>>>Hi, I don't know if this is the right newsgroup for this kind of
>>>question, but I'm wondering what people think of the idea of using
>>>Firewire-800 as a pseudo-RDMA mechanism when most of the traffic is
>>>small packets (say 64 bytes). The idea is it's a lot cheaper than
>>>Myrinet or Infiniband, while (this is what I'm asking) hopefully
>>>avoiding the protocol overhead of Ethernet. [...]
>>
>>I can see two problems with this idea.
>>
>>First, the so-called protocol overhead of Ethernet is largely a myth.
>>If your fabric does not allow short packets to overtake or split large
>>packets, your latency is going to be 100 times larger than the theoretical
>>numbers. Also, the way the software works has a large influence. This
>>perception is undoubtedly colored by my experience, which was with database
>>and storage clusters. But I tend to be skeptical of fancy interconnects
>>in general. I do not remember what the MTU of FireWire is and if it
>>allows packets to be split.
>
>
> IIRC FireWire divides bandwidth into a number of channels, which are
> interleaved on the wire. If you don't allocate all bandwidth for the
> transfer of a large piece of data, a smaller can overtake.
>
>
>>Second, there's no such thing as a big-ass FireWire switch, unlike, say
>>288-port Infiniband switches. This makes your cluster laughably small.
>
>
> Right, all bandwidth is shared on FireWire. However, with multiple
> Firewirecards in each node, you could have overlapping small clusters.
> If most communication is gridlike (ie. neighbour to neighbour), I
> suppose it could be made to work OK.
>
And to comment on a post from the point-to-point bus thread that was
supposed to be here: if one controlled both ends of the link, one would
be free to write different software to drive the "Ethernet" NIC, which
would reduce the "overhead" associated with TCP/IP. Note that some terms
are in quotes since this would no longer really be Ethernet, and I'm not
sure how much better the software could be.

del cecchi
From:Paul Rubin
Subject:Re: Firewire for remote DMA?
Date:11 Jan 2005 06:44:01 -0800
Del Cecchi writes:
> And to comment on a post from the point to point bus thread that was
> supposed to be here, if one controlled both ends of the link, one
> would be free to write different software to drive the "Ethernet" NIC,
> that would reduce the "overhead" associated with TCPIP. Note that some
> terms are in quotes since this would no longer really be ethernet and
> I'm not sure how much better the software could be.

Of course it would still be Ethernet. TCP/IP is a protocol that runs
on top of Ethernet, just like IPX, PUP, or various other protocols that
have been used instead of TCP/IP. I don't know whether doing
something other than TCP/IP would improve things much either, though.

Characteristic of IP and Ethernet is that both are unreliable
protocols (they are allowed to drop packets) so TCP is needed to put
an acknowledge/retry layer over IP. One improvement (I'm not a TCP
expert though, so maybe it already can work like this) could be to put
the TCP acknowledgement and the application level response into the
same packet. That would get rid of the separate response packet, and
as well, the response packet would itself not need to be acked (there
would be a retry request if it didn't arrive).
From:Jason Ozolins
Subject:Re: Firewire for remote DMA?
Date:Wed, 12 Jan 2005 10:47:12 +1100
Paul Rubin wrote:
> Characteristic of IP and Ethernet is that both are unreliable
> protocols (they are allowed to drop packets) so TCP is needed to put
> an acknowledge/retry layer over IP. One improvement (I'm not a TCP
> expert though, so maybe it already can work like this) could be to put
> the TCP acknowledgement and the application level response into the
> same packet. That would get rid of the separate response packet, and
> as well, the response packet would itself not need to be acked (there
> would be a retry request if it didn't arrive).

Try Googling for "delayed ACK". :-)

-Jason
From:Del Cecchi
Subject:Re: Firewire for remote DMA?
Date:Tue, 11 Jan 2005 10:55:22 -0600
Paul Rubin wrote:
> Del Cecchi writes:
>
>>And to comment on a post from the point to point bus thread that was
>>supposed to be here, if one controlled both ends of the link, one
>>would be free to write different software to drive the "Ethernet" NIC,
>>that would reduce the "overhead" associated with TCPIP. Note that some
>>terms are in quotes since this would no longer really be ethernet and
>>I'm not sure how much better the software could be.
>
>
> Of course it would still be ethernet. TCP/IP is a protocol that runs
> on top of ethernet, just like IPX or PUP various other protocols that
> have been used instead of TCP/IP. I don't know whether doing
> something other than tcp/ip would improve things much either, though.
>
> Characteristic of IP and Ethernet is that both are unreliable
> protocols (they are allowed to drop packets) so TCP is needed to put
> an acknowledge/retry layer over IP. One improvement (I'm not a TCP
> expert though, so maybe it already can work like this) could be to put
> the TCP acknowledgement and the application level response into the
> same packet. That would get rid of the separate response packet, and
> as well, the response packet would itself not need to be acked (there
> would be a retry request if it didn't arrive).

It is sort of scary when a circuit designer like me talks about
software. I probably get all sorts of stuff wrong. What I meant was
that since, as several long threads in the past have shown, there is a
lot of code between the application and the bare metal of the Ethernet
card that perhaps could be reduced in complexity for the specific
application you are interested in. And this reduction might accomplish
a reduction in latency for your specific application.

What you end up with might well not meet the definition of Ethernet.

del cecchi
From:Phillip Fayers
Subject:Re: Firewire for remote DMA?
Date:Wed, 12 Jan 2005 12:12:12 +0000
Del Cecchi wrote:

> It is sort of scary when a circuit designer like me talks about
> software. I probably get all sorts of stuff wrong.

I'm sure that there are a whole load of people on this newsgroup
who are less qualified than you, like me for instance.

> What I meant was
> that since, as several long threads in the past have shown, there is a
> lot of code between the application and the bare metal of the Ethernet
> card that perhaps could be reduced in complexity for the specific
> application you are interested in. And this reduction might accomplish
> a reduction in latency for your specific application.

My guess is that the reason there is a fair bit of latency in various
parts of the machine -> machine packet path is that, well, it wasn't
that important to deal with it. People have been concentrating on
throughput on the network for some time. Every now and then someone
spots a new application and they go back and redesign.

One example I can think of is the SunOS TCP/IP stack. I started using
Suns back in the SPARCstation 1 days, and one of those machines could
saturate a 10Mb/s network without too much trouble. Then someone went
and invented the web and suddenly you couldn't saturate the bandwidth
because you couldn't handle the required number of connections. The
set up/tear down time of network connections was too slow. So the
engineers pulled the software stack apart and eliminated the
bottlenecks. If I remember right, the Sun stack used to do 3 data copies
for each packet (user to kernel, another to checksum, another to copy
to the device), which the engineers hacked down to 1.

People are now applying the same effort to low-latency Ethernet,
with various solutions getting the machine-to-machine latency down
from >40us to <10us. Not quite the <4us latency of dedicated
high-performance interconnects, but pretty close.

--
Phillip Fayers School of Psychology, Cardiff University
Fayers@cf.ac.uk http://www.astro.cf.ac.uk/pub/Phillip.Fayers/
Tel: +44 (0)29 2087 9337 Attribute these comments to me not UWC.
From:Patrick Geoffray
Subject:Re: Firewire for remote DMA?
Date:Tue, 11 Jan 2005 12:19:42 -0500
Del Cecchi wrote:
> It is sort of scary when a circuit designer like me talks about
> software. I probably get all sorts of stuff wrong. What I meant was

Not bad at all. I wish I could talk about circuit design the same way.

> that since, as several long threads in the past have shown, there is a
> lot of code between the application and the bare metal of the Ethernet
> card that perhaps could be reduced in complexity for the specific
> application you are interested in. And this reduction might accomplish
> a reduction in latency for your specific application.

It has been done in the past multiple times (the GAMMA project, for
example), the main problem being that the life span of a particular
revision of a particular chip on an Ethernet NIC is quite short, and you
have to reverse engineer the interface every time because you cannot get
the specs. Time consuming for academics and not profitable for business
unless you do the hardware part too.

You can achieve decent latency by talking directly to the hardware, but
then the Ethernet switching latency becomes much more important, and
there is no real way to avoid it. According to my Ethernet expert, the
spanning tree part of the Ethernet spec pretty much requires
store-and-forward in the switch.

> What you end up with might well not meet the definition of Ethernet.

You have to meet the Ethernet specs if you want to go through a switch,
but you can have some freedom if you restrict yourself to point-to-point
links, which may be just fine for small clusters.

Patrick
From:Stephen Fuld
Subject:Re: Firewire for remote DMA?
Date:Tue, 11 Jan 2005 21:57:15 GMT

"Patrick Geoffray" wrote in message
news:cs11rc$qtb@flex.myri-local.com...
> Del Cecchi wrote:
>> It is sort of scary when a circuit designer like me talks about software.
>> I probably get all sorts of stuff wrong. What I meant was
>
> Not bad at all. I wish I could talk about circuit design the same way.
>
>> that since, as several long threads in the past have shown, there is a
>> lot of code between the application and the bare metal of the Ethernet
>> card that perhaps could be reduced in complexity for the specific
>> application you are interested in. And this reduction might accomplish a
>> reduction in latency for your specific application.
>
> It has been done in the past multiple times (The GAMMA project for
> example), the main problem being that the life span of a particular
> revision of a particular chip on an Ethernet NIC is quite short, and you
> have to reverse engineer the interface every time because you cannot get
> the specs. Time consuming for academics and not profitable for business
> unless you do the hardware part too.
>
> You can achieve decent latency by talking directly to the hardware, but
> then the Ethernet switching latency becomes much more important, and there
> is no real way to avoid it. According to my Ethernet expert, The spanning
> tree part of the Ethernet spec pretty much requires store-and-forward in
> the switch.

Isn't "most" of the overhead in the TCP part of TCP/IP and thus could one
develop a replacement for TCP that was slimmed down to the needs of the
particular application and save a lot of overhead, yet still use standard
Ethernet switches?

--
- Stephen Fuld
e-mail address disguised to prevent spam
From:Patrick Geoffray
Subject:Re: Firewire for remote DMA?
Date:Tue, 11 Jan 2005 17:40:00 -0500
Stephen Fuld wrote:
> Isn't "most" of the overhead in the TCP part of TCP/IP and thus could one
> develop a replacement for TCP that was slimmed down to the needs of the
> particular application and save a lot of overhead, yet still use standard
> Ethernet switches?

Let's see. If you send a tiny packet (we are talking latency here,
right?) then:
* syscall
* one copy on the send side
* DMA down + checksum offload.
* wire + switch
* DMA up
* interrupt
* one copy on the recv side
* syscall

If the packet is tiny, the cost of the 2 copies is small, there is no
fragmentation overhead, and the DMA cost is mostly initialization (1
us). What's left is the 2 syscalls (1 us each on average, faster on
Opteron), the wire + switch and the interrupt. The interrupt is
definitely expensive, 10-15 us. The wire + switch is smaller, but
still on the order of 5 us for a decent switch.

The part that is actually in the software stack itself is not that big
in this case (but it becomes quite bad when doing fragmentation on large
packets). You can actually save a lot by removing the interrupt (busy
polling on copyblock) and the syscalls (OS-bypass). However, doing the
OS-bypass part cleanly is a pain, and busy-polling burns CPU cycles.
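
Adding up the figures above gives a rough budget. The Python sketch below is only illustrative and rests on assumptions that go beyond the numbers quoted: the two small copies are treated as free, and the ~1 us DMA cost is charged once per direction.

```python
# (low, high) cost in microseconds for each step of a tiny-packet send.
BUDGET_US = {
    "send syscall": (1.0, 1.0),
    "send copy": (0.0, 0.0),       # assumption: negligible for a tiny packet
    "DMA down": (1.0, 1.0),        # assumption: ~1 us setup per direction
    "wire + switch": (5.0, 5.0),
    "DMA up": (1.0, 1.0),          # assumption: ~1 us setup per direction
    "interrupt": (10.0, 15.0),
    "recv copy": (0.0, 0.0),       # assumption: negligible for a tiny packet
    "recv syscall": (1.0, 1.0),
}


def total_us(budget, skip=()):
    """Sum the (low, high) costs, leaving out any step named in `skip`."""
    low = sum(lo for name, (lo, _hi) in budget.items() if name not in skip)
    high = sum(hi for name, (_lo, hi) in budget.items() if name not in skip)
    return low, high


if __name__ == "__main__":
    # Conventional path through the kernel with an interrupt on receive.
    print(total_us(BUDGET_US))  # (19.0, 24.0)
    # OS-bypass plus busy-polling: no syscalls, no interrupt.
    print(total_us(BUDGET_US, skip=("send syscall", "recv syscall", "interrupt")))  # (7.0, 7.0)
```

Under these assumptions the interrupt alone is more than half of the total, which is why polling buys so much.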

When you take (OS-bypass + busy-polling + wormhole switching) and you
shake very hard, you get today's high-speed interconnects, more or less.
Only the switching part is hardware-dependent.

CPU overhead is a problem for big messages; there it may be useful to
do a zero-copy, non-OS-bypass implementation.

Patrick
From:Terje Mathisen
Subject:Re: Firewire for remote DMA?
Date:Wed, 12 Jan 2005 08:45:23 +0100
Stephen Fuld wrote:

> "Patrick Geoffray" wrote in message
>>You can achieve decent latency by talking directly to the hardware, but
>>then the Ethernet switching latency becomes much more important, and there
>>is no real way to avoid it. According to my Ethernet expert, The spanning
>>tree part of the Ethernet spec pretty much requires store-and-forward in
>>the switch.
>
> Isn't "most" of the overhead in the TCP part of TCP/IP and thus could one
> develop a replacement for TCP that was slimmed down to the needs of the
> particular application and save a lot of overhead, yet still use standard
> Ethernet switches?

The easy solution would be to use Xerox' original XNS (?) or the Novell
version IPX:

Pretty much raw Ethernet packets, with just a tiny header (to allow
routing), but that would not be needed here, right?
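
As an illustration of how little is needed, here is a Python sketch that builds a raw Ethernet frame with a 4-byte application header and no IP or TCP at all. The header layout (message id plus length) and the function name are hypothetical, invented for this example; 0x88B5 is one of the EtherType values IEEE reserves for local experimental use. Actually sending such a frame needs a raw socket and suitable privileges, which is not shown.

```python
import struct

ETHERTYPE_LOCAL_EXPERIMENTAL = 0x88B5  # IEEE 802 local experimental EtherType
MIN_PAYLOAD = 46                        # Ethernet pads shorter payloads to 46 bytes


def build_frame(dst_mac: bytes, src_mac: bytes, msg_id: int, payload: bytes) -> bytes:
    """Build an Ethernet II frame (without the FCS, which the NIC appends).

    The 4-byte (msg_id, length) application header is a made-up format.
    """
    if len(dst_mac) != 6 or len(src_mac) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    # 14-byte Ethernet header: destination, source, EtherType.
    header = struct.pack("!6s6sH", dst_mac, src_mac, ETHERTYPE_LOCAL_EXPERIMENTAL)
    # Tiny application header followed by the data, padded to the minimum size.
    body = struct.pack("!HH", msg_id, len(payload)) + payload
    body = body.ljust(MIN_PAYLOAD, b"\x00")
    return header + body


if __name__ == "__main__":
    frame = build_frame(b"\x02\x00\x00\x00\x00\x02", b"\x02\x00\x00\x00\x00\x01",
                        msg_id=1, payload=b"lock 42")
    print(len(frame))  # 60
```

The explicit length field is there because padding makes the received payload size ambiguous for short messages.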

Terje

--
-
"almost all programming can be viewed as an exercise in caching"
   
