|
|
 | | From: | Paul Rubin | | Subject: | Firewire for remote DMA? | | Date: | 09 Jan 2005 03:21:26 -0800 |
|
|
 | Hi, I don't know if this is the right newsgroup for this kind of question, but I'm wondering what people think of the idea of using Firewire-800 as a pseudo-RDMA mechanism when most of the traffic is small packets (say 64 bytes). The idea is it's a lot cheaper than Myrinet or Infiniband, while (this is what I'm asking) hopefully avoiding the protocol overhead of Ethernet. With the most common Firewire interfaces, you can initiate a transfer that writes directly into the target computer's memory at the address of your choice (yes, that's a huge security hole if your cluster has untrusted boxes). I don't know if you can also read from the target's memory at your choice of address, but most Firewire cards support multiple buses, so you could use one bus for reading and one for writing.
Any thoughts about this? I haven't studied Firewire in detail; the above is just impressions that I've gotten.
|
|
 | | From: | Del Cecchi | | Subject: | Re: Firewire for remote DMA? Use InfiniBand | | Date: | Tue, 11 Jan 2005 15:48:31 -0600 |
|
|
 | Paul Rubin wrote: > Hi, I don't know if this is the right newsgroup for this kind of > question, but I'm wondering what people think of the idea of using > Firewire-800 as a pseudo-RDMA mechanism when most of the traffic is > small packets (say 64 bytes). The idea is it's a lot cheaper than > Myrinet or Infiniband, while (this is what I'm asking) hopefully > avoiding the protocol overhead of Ethernet. With the most common > Firewire interfaces, you can initiatate a transfer write directly into > the target computer's memory at the address of your choice (yes, > that's a huge security hole if your cluster has untrusted boxes). I > don't know if you can also read from the target's memory at your > choice of address, but most Firewire cards support multiple buses, so > you could use one bus for reading and one for writing. > > Any thoughts about this? I haven't studied Firewire in detail; the > above is just impressions that I've gotten.
I had a nice talk with someone from one of the IB vendors. Apparently the "OEM pricing" for an end node is around 100 USD for the 4x (10 Gb) InfiniBand HCA, if I understood correctly.
A switch is required for more than 2 nodes, and that costs a few hundred, like 300 USD, per port and comes in a box. So that costs a few thousand per minimum increment, depending on the vintage of the hardware. Old stuff came in chunks of 8 and new stuff comes in 24-port increments. The switch includes subnet management and all that stuff.
Maybe in a while the older switches will start showing up used on ebay.
del cecchi
|
|
 | | From: | Pete Zaitcev (OTID3) | | Subject: | Re: Firewire for remote DMA? | | Date: | Sun, 09 Jan 2005 12:54:46 -0800 |
|
|
 | On Sun, 09 Jan 2005 03:21:26 -0800, Paul Rubin wrote:
> Hi, I don't know if this is the right newsgroup for this kind of > question, but I'm wondering what people think of the idea of using > Firewire-800 as a pseudo-RDMA mechanism when most of the traffic is > small packets (say 64 bytes). The idea is it's a lot cheaper than > Myrinet or Infiniband, while (this is what I'm asking) hopefully > avoiding the protocol overhead of Ethernet. [...]
I can see two problems with this idea.
First, the so-called protocol overhead of Ethernet is largely a myth. If your fabric does not allow short packets to overtake or split large packets, your latency is going to be 100 times larger than the theoretical numbers. Also, the way the software works has a large influence. This perception is undoubtedly colored by my experience, which was with database and storage clusters. But I tend to be skeptical of fancy interconnects in general. I do not remember what the MTU of FireWire is and if it allows packets to be split.
Second, there's no such thing as a big-ass FireWire switch, unlike, say 288-port Infiniband switches. This makes your cluster laughably small.
My friend Wim's group at Oracle has their DB running on top of FireWire. It works well for 2 and 4 node clusters. But there's not enough advantage to make it a shipping product, even though small scale clusters for databases make a whole lot more sense than small scale HPC clusters, for the failover purposes. If FireWire were ubiquitous on server motherboards, then such a product would be more interesting.
-- Pete
|
|
 | | From: | Paul Rubin | | Subject: | Re: Firewire for remote DMA? | | Date: | 09 Jan 2005 20:21:40 -0800 |
|
|
 | "Pete Zaitcev (OTID3)" writes: > > question, but I'm wondering what people think of the idea of using > > Firewire-800 as a pseudo-RDMA mechanism when most of the traffic is > I can see two problems with this idea.> > First, the so-called protocol overhead of Ethernet is largely a myth. > If your fabric does not allow short packets to overtake or split large > packets, your latency is going to be 100 times larger than the theoretical > numbers. Also, the way the software works has a large influence. This > perception is undoubtedly colored by my experience, which was with database > and storage clusters. But I tend to be skeptical of fancy interconnects > in general. I do not remember what the MTU of FireWire is and if it > allows packets to be split.
Thanks, this is the kind of answer I was looking for. I think it may be reasonable to set up the software so all packets are small. Larger transfers could use an additional channel, or 1G ethernet. The application is closer to a database cluster than a numerical supercomputer. I don't envision huge cluster sizes, but yeah, 2-4 nodes is a bit limiting.
I had thought most ethernet RDMA implementations went through the TCP stack, which has messages in both directions, multiple layers of protocol, and two round trips through the kernel. I saw some figures indicating about 20k null RPC's per second across 1G ethernet, or 50 microseconds per RPC. I'd hoped for something much faster than that.
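The two figures above are reciprocals of each other. A quick sketch of the conversion in Python, assuming the RPCs are issued strictly one after another with no pipelining (the function name is made up for illustration):

```python
def per_call_latency_us(calls_per_second: float) -> float:
    """Average time per call in microseconds.

    Assumes calls are issued back to back with no overlap, so the
    per-call latency is simply the reciprocal of the call rate.
    """
    return 1_000_000 / calls_per_second


print(per_call_latency_us(20_000))  # 50.0
```

If calls were pipelined, the rate would overstate how fast a single round trip is, so this is only a rough reading of the quoted number.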
> If FireWire were ubiquitous on server motherboards, then such a > product would be more interesting.
It's not ubiquitous, but it's fairly common, and PCI cards are cheap.
|
|
 | | From: | Erno Kuusela | | Subject: | Re: Firewire for remote DMA? | | Date: | 10 Jan 2005 15:04:33 +0200 |
|
|
 | Paul Rubin writes:
> I saw some figures > indicating about 20k null RPC's per second across 1G ethernet, or 50 > microseconds per RPC. I'd hoped for something much faster than that.
Superficial googling leads one to think FireWire has an intrinsic latency in the hundreds of microseconds...
-- erno
|
|
 | | From: | Greg Lindahl | | Subject: | Re: Firewire for remote DMA? | | Date: | 10 Jan 2005 10:19:00 -0800 |
|
|
 | In article , Pete Zaitcev (OTID3) wrote:
>First, the so-called protocol overhead of Ethernet is largely a myth. >If your fabric does not allow short packets to overtake or split large >packets, your latency is going to be 100 times larger than the theoretical >numbers.
Sorry, are these 2 sentences supposed to be related? Overhead and latency are different things. The high overhead in Ethernet is the host cpu time spent processing packets. Latency is, well, latency.
The end of the second sentence is plain wrong. Firewire 800 runs at 800 megabits. This means a 1024 byte packet will delay a short packet by ~ 10 microseconds. Now that's a latency. If you think about the latency plus overhead of a packet, you'll find that this sum doesn't go up by 100 times for TCP/IP, because the so-called protocol overhead of TCP/IP is so large. [Of course, you can run cut-down protocols on Ethernet, but I think you had TCP/IP in mind.]
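The ~10 microsecond figure can be checked with a little arithmetic. A minimal sketch in Python, assuming the nominal 800 megabit rate quoted above and ignoring framing, bus arbitration, and propagation delay (the function name is illustrative only):

```python
def serialization_delay_us(size_bytes: int, link_mbit_per_s: float) -> float:
    """Time to clock size_bytes onto the wire, in microseconds.

    One Mbit/s is one bit per microsecond, so the delay is just
    bits divided by the link rate in Mbit/s.
    """
    bits = size_bytes * 8
    return bits / link_mbit_per_s


print(serialization_delay_us(1024, 800))  # 10.24
print(serialization_delay_us(64, 800))    # 0.64
```

So a 64-byte packet stuck behind a single 1024-byte packet waits roughly sixteen times its own wire time, which is noticeable but nowhere near a factor of 100.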
> Also, the way the software works has a large influence. This > perception is undoubtedly colored by my experience, which was with database > and storage clusters.
It's worth mentioning that most database/storage clusters have a mix of short packets (control and lock messages) and long packets (disk blocks), so they exhibit this behavior all the time. HPC clusters don't necessarily have this issue. Also note that you could have 2 high speed networks for your database cluster, one for short and one for long packets. The price of the database software is so high that this additional hardware cost tends to be in the noise.
-- greg
|
|
 | | From: | Pete Zaitcev (OTID3) | | Subject: | Re: Firewire for remote DMA? | | Date: | Mon, 10 Jan 2005 14:47:45 -0800 |
|
|
 | On Mon, 10 Jan 2005 10:19:00 -0800, Greg Lindahl wrote:
> [Of course, you can run cut-down protocols on > Ethernet, but I think you had TCP/IP in mind.]
I was remembering attempts to run so-called "ST/SST", which I thought at the time would be competitive with iSCSI (good thing I didn't bet my career on that guess).
You're right about the correct meaning and usage of the word "overhead". Somehow I assumed the original poster meant "unnecessary latency introduced by all the extra processing" rather than "usecs of CPU time per I/O", which may be more common. I shouldn't have made those assumptions.
--- Pete
|
|
 | | From: | Paul Rubin | | Subject: | Re: Firewire for remote DMA? | | Date: | 10 Jan 2005 16:08:40 -0800 |
|
|
 | "Pete Zaitcev (OTID3)" writes: > You're right about the correct meaning and usage of the word "overhead". > Somehow I assumed the original poster meant "unnecessary latency introduced > by all the extra processing" rather than "usecs of CPU times per I/O", > which may be more common. I should've not done those assumptions.
I did mean extra latency. CPU time matters as well, but I was thinking the extra latency came from the context switches through the kernel TCP stack, and also the round trip packet delays because of the ACK packets etc.
|
|
 | | From: | Niels_Jørgen_Kruse | | Subject: | Re: Firewire for remote DMA? | | Date: | Tue, 11 Jan 2005 00:52:16 +0100 |
|
|
 | Pete Zaitcev (OTID3) wrote:
> On Sun, 09 Jan 2005 03:21:26 -0800, Paul Rubin wrote: > > > Hi, I don't know if this is the right newsgroup for this kind of > > question, but I'm wondering what people think of the idea of using > > Firewire-800 as a pseudo-RDMA mechanism when most of the traffic is > > small packets (say 64 bytes). The idea is it's a lot cheaper than > > Myrinet or Infiniband, while (this is what I'm asking) hopefully > > avoiding the protocol overhead of Ethernet. [...] > > I can see two problems with this idea. > > First, the so-called protocol overhead of Ethernet is largely a myth. > If your fabric does not allow short packets to overtake or split large > packets, your latency is going to be 100 times larger than the theoretical > numbers. Also, the way the software works has a large influence. This > perception is undoubtedly colored by my experience, which was with database > and storage clusters. But I tend to be skeptical of fancy interconnects > in general. I do not remember what the MTU of FireWire is and if it > allows packets to be split.
IIRC FireWire divides bandwidth into a number of channels, which are interleaved on the wire. If you don't allocate all bandwidth for the transfer of a large piece of data, a smaller one can overtake it.
> Second, there's no such thing as a big-ass FireWire switch, unlike, say > 288-port Infiniband switches. This makes your cluster laughably small.
Right, all bandwidth is shared on FireWire. However, with multiple FireWire cards in each node, you could have overlapping small clusters. If most communication is gridlike (i.e. neighbour to neighbour), I suppose it could be made to work OK.
-- Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
|
|
 | | From: | Del Cecchi | | Subject: | Re: Firewire for remote DMA? | | Date: | Tue, 11 Jan 2005 08:30:20 -0600 |
|
|
 | Niels Jørgen Kruse wrote: > Pete Zaitcev (OTID3) wrote: > > >>On Sun, 09 Jan 2005 03:21:26 -0800, Paul Rubin wrote: >> >> >>>Hi, I don't know if this is the right newsgroup for this kind of >>>question, but I'm wondering what people think of the idea of using >>>Firewire-800 as a pseudo-RDMA mechanism when most of the traffic is >>>small packets (say 64 bytes). The idea is it's a lot cheaper than >>>Myrinet or Infiniband, while (this is what I'm asking) hopefully >>>avoiding the protocol overhead of Ethernet. [...] >> >>I can see two problems with this idea. >> >>First, the so-called protocol overhead of Ethernet is largely a myth. >>If your fabric does not allow short packets to overtake or split large >>packets, your latency is going to be 100 times larger than the theoretical >>numbers. Also, the way the software works has a large influence. This >>perception is undoubtedly colored by my experience, which was with database >>and storage clusters. But I tend to be skeptical of fancy interconnects >>in general. I do not remember what the MTU of FireWire is and if it >>allows packets to be split. > > > IIRC FireWire divides bandwidth into a number of channels, which are > interleaved on the wire. If you don't allocate all bandwidth for the > transfer of a large piece of data, a smaller can overtake. > > >>Second, there's no such thing as a big-ass FireWire switch, unlike, say >>288-port Infiniband switches. This makes your cluster laughably small. > > > Right, all bandwidth is shared on FireWire. However, with multiple > Firewirecards in each node, you could have overlapping small clusters. > If most communication is gridlike (ie. neighbour to neighbour), I > suppose it could be made to work OK. >
And to comment on a post from the point to point bus thread that was supposed to be here: if one controlled both ends of the link, one would be free to write different software to drive the "Ethernet" NIC, which would reduce the "overhead" associated with TCP/IP.
Note that some terms are in quotes since this would no longer really be ethernet and I'm not sure how much better the software could be.
del cecchi
|
|
 | | From: | Paul Rubin | | Subject: | Re: Firewire for remote DMA? | | Date: | 11 Jan 2005 06:44:01 -0800 |
|
|
 | Del Cecchi writes: > And to comment on a post from the point to point bus thread that was > supposed to be here, if one controlled both ends of the link, one > would be free to write different software to drive the "Ethernet" NIC, > that would reduce the "overhead" associated with TCPIP. Note that some > terms are in quotes since this would no longer really be ethernet and > I'm not sure how much better the software could be.
Of course it would still be ethernet. TCP/IP is a protocol that runs on top of ethernet, just like IPX, PUP, or various other protocols that have been used instead of TCP/IP. I don't know whether doing something other than TCP/IP would improve things much either, though.
Characteristic of IP and Ethernet is that both are unreliable protocols (they are allowed to drop packets) so TCP is needed to put an acknowledge/retry layer over IP. One improvement (I'm not a TCP expert though, so maybe it already can work like this) could be to put the TCP acknowledgement and the application level response into the same packet. That would get rid of the separate response packet, and as well, the response packet would itself not need to be acked (there would be a retry request if it didn't arrive).
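To make the packet counting concrete, here is a toy model in Python of a single request/response exchange. It is only a sketch of the idea in the paragraph above, not a description of how any real TCP stack behaves, and the function and parameter names are invented for illustration:

```python
def packets_per_rpc(piggyback_ack: bool, ack_response: bool) -> int:
    """Count packets on the wire for one request/response exchange.

    piggyback_ack: the ACK for the request rides inside the response
                   instead of travelling as its own packet.
    ack_response:  the response itself is acknowledged with a packet.
    """
    packets = 1                  # the request
    if not piggyback_ack:
        packets += 1             # standalone ACK for the request
    packets += 1                 # the response
    if ack_response:
        packets += 1             # ACK for the response
    return packets


print(packets_per_rpc(piggyback_ack=False, ack_response=True))   # 4
print(packets_per_rpc(piggyback_ack=True, ack_response=True))    # 3
print(packets_per_rpc(piggyback_ack=True, ack_response=False))   # 2
```

The last case corresponds to the scheme suggested above, where a lost response is recovered by a retry request rather than by acknowledging every response.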
|
|
 | | From: | Jason Ozolins | | Subject: | Re: Firewire for remote DMA? | | Date: | Wed, 12 Jan 2005 10:47:12 +1100 |
|
|
 | Paul Rubin wrote: > Characteristic of IP and Ethernet is that both are unreliable > protocols (they are allowed to drop packets) so TCP is needed to put > an acknowledge/retry layer over IP. One improvement (I'm not a TCP > expert though, so maybe it already can work like this) could be to put > the TCP acknowledgement and the application level response into the > same packet. That would get rid of the separate response packet, and > as well, the response packet would itself not need to be acked (there > would be a retry request if it didn't arrive).
Try Googling for "delayed ACK". :-)
-Jason
|
|
 | | From: | Del Cecchi | | Subject: | Re: Firewire for remote DMA? | | Date: | Tue, 11 Jan 2005 10:55:22 -0600 |
|
|
 | Paul Rubin wrote: > Del Cecchi writes: > >>And to comment on a post from the point to point bus thread that was >>supposed to be here, if one controlled both ends of the link, one >>would be free to write different software to drive the "Ethernet" NIC, >>that would reduce the "overhead" associated with TCPIP. Note that some >>terms are in quotes since this would no longer really be ethernet and >>I'm not sure how much better the software could be. > > > Of course it would still be ethernet. TCP/IP is a protocol that runs > on top of ethernet, just like IPX or PUP various other protocols that > have been used instead of TCP/IP. I don't know whether doing > something other than tcp/ip would improve things much either, though. > > Characteristic of IP and Ethernet is that both are unreliable > protocols (they are allowed to drop packets) so TCP is needed to put > an acknowledge/retry layer over IP. One improvement (I'm not a TCP > expert though, so maybe it already can work like this) could be to put > the TCP acknowledgement and the application level response into the > same packet. That would get rid of the separate response packet, and > as well, the response packet would itself not need to be acked (there > would be a retry request if it didn't arrive).
It is sort of scary when a circuit designer like me talks about software. I probably get all sorts of stuff wrong. What I meant was that, as several long threads in the past have shown, there is a lot of code between the application and the bare metal of the Ethernet card, and perhaps it could be reduced in complexity for the specific application you are interested in. And this reduction might accomplish a reduction in latency for your specific application.
What you end up with might well not meet the definition of Ethernet.
del cecchi
|
|
 | | From: | Phillip Fayers | | Subject: | Re: Firewire for remote DMA? | | Date: | Wed, 12 Jan 2005 12:12:12 +0000 |
|
|
 | Del Cecchi wrote:
> It is sort of scary when a circuit designer like me talks about > software. I probably get all sorts of stuff wrong.
I'm sure that there are a whole load of people on this newsgroup who are less qualified than you, like me for instance.
> What I meant was > that since, as several long threads in the past have shown, there is a > lot of code between the application and the bare metal of the Ethernet > card that perhaps could be reduced in complexity for the specific > application you are interested in. And this reduction might accomplish > a reduction in latency for your specific application.
My guess is that the reason there is a fair bit of latency in various parts of the machine -> machine packet path is that, well, it wasn't that important to deal with it. People have been concentrating on throughput on the network for some time. Every now and then someone spots a new application and they go back and redesign.
One example I can think of is the SunOS TCP/IP stack. I started using Suns back in the SPARCstation 1 days and one of those machines could saturate a 10Mb/s network without too much trouble. Then someone went and invented the web and suddenly you couldn't saturate the bandwidth because you couldn't handle the required number of connections. The setup/teardown time of network connections was too slow. So the engineers pulled the software stack apart and eliminated the bottlenecks. If I remember right, the Sun stack used to do 3 data copies for each packet (user to kernel, another to checksum, another to copy to the device), which the engineers hacked down to 1.
People are now applying the same effort to low-latency ethernet, with various solutions getting the machine-to-machine latency down from >40us to <10us. Not quite the <4us latency of dedicated high-performance interconnects, but pretty close.
-- Phillip Fayers School of Psychology, Cardiff University Fayers@cf.ac.uk http://www.astro.cf.ac.uk/pub/Phillip.Fayers/ Tel: +44 (0)29 2087 9337 Attribute these comments to me not UWC.
|
|
 | | From: | Patrick Geoffray | | Subject: | Re: Firewire for remote DMA? | | Date: | Tue, 11 Jan 2005 12:19:42 -0500 |
|
|
 | Del Cecchi wrote: > It is sort of scary when a circuit designer like me talks about > software. I probably get all sorts of stuff wrong. What I meant was
Not bad at all. I wish I could talk about circuit design the same way.
> that since, as several long threads in the past have shown, there is a > lot of code between the application and the bare metal of the Ethernet > card that perhaps could be reduced in complexity for the specific > application you are interested in. And this reduction might accomplish > a reduction in latency for your specific application.
It has been done in the past multiple times (the GAMMA project, for example), the main problem being that the life span of a particular revision of a particular chip on an Ethernet NIC is quite short, and you have to reverse engineer the interface every time because you cannot get the specs. Time consuming for academics and not profitable for business unless you do the hardware part too.
You can achieve decent latency by talking directly to the hardware, but then the Ethernet switching latency becomes much more important, and there is no real way to avoid it. According to my Ethernet expert, the spanning tree part of the Ethernet spec pretty much requires store-and-forward in the switch.
> What you end up with might well not meet the definition of Ethernet.
You have to meet the Ethernet specs if you want to go through a switch, but you can have some freedom if you restrict yourself to point-to-point links, which may be just fine for small clusters.
Patrick
|
|
 | | From: | Stephen Fuld | | Subject: | Re: Firewire for remote DMA? | | Date: | Tue, 11 Jan 2005 21:57:15 GMT |
|
|
 | "Patrick Geoffray" wrote in message news:cs11rc$qtb@flex.myri-local.com... > Del Cecchi wrote: >> It is sort of scary when a circuit designer like me talks about software. >> I probably get all sorts of stuff wrong. What I meant was > > Not bad at all. I wish I could talk about circuit design the same way. > >> that since, as several long threads in the past have shown, there is a >> lot of code between the application and the bare metal of the Ethernet >> card that perhaps could be reduced in complexity for the specific >> application you are interested in. And this reduction might accomplish a >> reduction in latency for your specific application. > > It has been done in the past multiple times (The GAMMA project for > example), the main problem being that the life span of a particular > revision of a particular chip on a Ethernet NIC is quite short, and you > have to reverse engineer the interface every time because you cannot get > the specs. Time consuming for acadmic and not profitable for business > unless you do the hardware part too. > > You can achieve decent latency by talking directly to the hardware, but > then the Ethernet switching latency becomes much more important, and there > is no real way to avoid it. According to my Ethernet expert, The spanning > tree part of the Ethernet spec pretty much requires store-and-forward in > the switch.
Isn't "most" of the overhead in the TCP part of TCP/IP and thus could one develop a replacement for TCP that was slimmed down to the needs of the particular application and save a lot of overhead, yet still use standard Ethernet switches?
-- - Stephen Fuld e-mail address disguised to prevent spam
|
|
 | | From: | Patrick Geoffray | | Subject: | Re: Firewire for remote DMA? | | Date: | Tue, 11 Jan 2005 17:40:00 -0500 |
|
|
 | Stephen Fuld wrote: > Isn't "most" of the overhead in the TCP part of TCP/IP and thus could one > develop a replacement for TCP that was slimmed down to the needs of the > particular application and save a lot of overhead, yet still use standard > Ethernet switches?
Let's see. If you send a tiny packet (we are talking latency here, right?) then:
* syscall
* one copy on the send side
* DMA down + checksum offload
* wire + switch
* DMA up
* interrupt
* one copy on the recv side
* syscall
If the packet is tiny, the cost of the 2 copies is small, there is no fragmentation overhead, and the DMA cost is mostly initialization (1 us). What's left is the 2 syscalls (1 us each on average, faster on Opteron), the wire + switch, and the interrupt. The interrupt is definitely expensive, 10-15 us. The wire + switch is smaller, but still on the order of 5 us for a decent switch.
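Adding up the estimates above gives a rough one-way budget. A small sketch in Python, with two assumptions that are not stated in the post: the 1 us DMA cost is taken per DMA, and the copies for a tiny packet are counted as 0 since they are only described as "small". The names are illustrative.

```python
# Per-stage cost as (low, high) in microseconds, from the estimates above.
# Assumptions: DMA cost is 1 us per DMA; tiny-packet copies are taken as 0.
BUDGET_US = {
    "send syscall":  (1, 1),
    "send copy":     (0, 0),
    "DMA down":      (1, 1),
    "wire + switch": (5, 5),
    "DMA up":        (1, 1),
    "interrupt":     (10, 15),
    "recv copy":     (0, 0),
    "recv syscall":  (1, 1),
}


def one_way_latency_us(budget):
    """Return (low, high) total latency in microseconds."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


low, high = one_way_latency_us(BUDGET_US)
print(low, high)  # 19 24
```

Under these assumptions the interrupt alone accounts for more than half of the total in both the low and the high case.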
The part that is actually in the software stack itself is not that big in this case (but it becomes quite bad when doing fragmentation on large packets). You can actually save a lot by removing the interrupt (busy polling on copyblock) and the syscalls (OS-bypass). However, doing the OS-bypass part cleanly is a pain, and busy polling burns CPU cycles.
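For readers who have not seen busy polling, here is a minimal user-space sketch in Python. It spins on an ordinary non-blocking socket, so every poll is still a syscall; it is not OS-bypass and not the copyblock scheme mentioned above, only an illustration of trading CPU cycles for not sleeping until a wakeup. The helper name and the `max_spins` limit are made up for the example.

```python
import socket


def busy_poll_recv(sock, bufsize=64, max_spins=1_000_000):
    """Spin on a non-blocking socket until data arrives.

    Returns (data, spins), where spins is the number of empty polls
    before data showed up. Raises TimeoutError if nothing arrives
    within max_spins polls.
    """
    sock.setblocking(False)
    for spins in range(max_spins):
        try:
            return sock.recv(bufsize), spins
        except BlockingIOError:
            continue  # nothing there yet: burn a cycle and poll again
    raise TimeoutError("no data after max_spins polls")


# Usage: a connected pair of local sockets stands in for the two nodes.
a, b = socket.socketpair()
a.send(b"ping")
data, spins = busy_poll_recv(b)
print(data)
a.close()
b.close()
```

The loop never yields the CPU while it waits, which is exactly the cost being pointed out: latency goes down, but one core is kept busy doing nothing useful.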
When you take (OS-bypass + busy polling + wormhole switching) and you shake very hard, you get today's high-speed interconnects, more or less. Only the switching part is hardware dependent.
CPU overhead is a problem for big messages; there it may be useful to do a zero-copy, non-OS-bypass implementation.
Patrick
|
|
 | | From: | Terje Mathisen | | Subject: | Re: Firewire for remote DMA? | | Date: | Wed, 12 Jan 2005 08:45:23 +0100 |
|
|
 | Stephen Fuld wrote:
> "Patrick Geoffray" wrote in message >>You can achieve decent latency by talking directly to the hardware, but >>then the Ethernet switching latency becomes much more important, and there >>is no real way to avoid it. According to my Ethernet expert, The spanning >>tree part of the Ethernet spec pretty much requires store-and-forward in >>the switch. > > Isn't "most" of the overhead in the TCP part of TCP/IP and thus could one > develop a replacement for TCP that was slimmed down to the needs of the > particular application and save a lot of overhead, yet still use standard > Ethernet switches?
The easy solution would be to use Xerox's original XNS (?) or the Novell version, IPX:
Pretty much raw Ethernet packets, with just a tiny header (to allow routing), but that would not be needed here, right?
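As a rough illustration of what "raw Ethernet packets with just a tiny header" could look like, here is a sketch in Python that only builds the bytes of such a frame and does not send anything. The 4-byte header (sequence number plus payload length) is purely hypothetical and is not XNS or IPX; the EtherType is one of the values IEEE reserves for local experimental use.

```python
import struct

ETHERTYPE_EXPERIMENTAL = 0x88B5  # IEEE 802 local experimental EtherType
MIN_FRAME_NO_FCS = 60            # 64-byte minimum frame minus the 4-byte FCS


def build_frame(dst: bytes, src: bytes, seq: int, payload: bytes) -> bytes:
    """Build an Ethernet II frame carrying a hypothetical tiny header.

    Layout: 6-byte destination MAC, 6-byte source MAC, 2-byte EtherType,
    then the made-up header (16-bit sequence number, 16-bit payload
    length), then the payload, zero-padded to the minimum frame size.
    """
    if len(dst) != 6 or len(src) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    ether_header = struct.pack("!6s6sH", dst, src, ETHERTYPE_EXPERIMENTAL)
    tiny_header = struct.pack("!HH", seq, len(payload))
    frame = ether_header + tiny_header + payload
    return frame.ljust(MIN_FRAME_NO_FCS, b"\x00")


dst = bytes.fromhex("020000000002")  # locally administered example addresses
src = bytes.fromhex("020000000001")
frame = build_frame(dst, src, seq=1, payload=b"x" * 64)
print(len(frame))  # 82
```

A 64-byte payload costs 18 bytes of headers here, compared with 14 + 20 + 20 bytes for Ethernet plus minimal IPv4 and TCP headers.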
Terje
-- - "almost all programming can be viewed as an exercise in caching"
|
|
|