|
|
 | | From: | Paul A. Clayton | | Subject: | RISC vs. CISC design principles | | Date: | 12 Jan 2005 17:15:23 GMT |
|
|
 | In the CISC vs. RISC debate, it seems that the design principles behind each are generally not considered in their historical context.
ISTM that a major concern for CISC was memory capacity. This concern is expressed in efforts at code density (variable length instructions, complex instructions [often targeting the 'common case' of higher level programming, meshing with semantic gap theory], implicit arguments [leading to special-purpose registers], and fewer registers [fewer bits to encode]), hardware support of unaligned loads (to improve data density), and finer-grained memory protection (segment-based rather than page-based). In earlier hardware the cost of unaligned loads may have been smaller due to the smaller width of memory interfaces. The cost of ROM relative to RAM (especially fast RAM) may have tended to encourage the use of static micro-code even beyond the general code density advantage. Earlier systems may also have used fewer segments per application and less dynamic resizing, possibly making segmentation more efficient.
Also earlier semantic gap theory may have seemed more reasonable, making compilers (and assembly-level programming) simpler when programming effort was perhaps a greater consideration (and as mentioned above it meshes well with targeting code density).
OTOH, the main RISC design principles are pipelinability (leading to fixed-sized instructions, few instruction formats, simpler memory addressing modes, etc.), compiler optimization ('reduced' operations can be independently scheduled, a relatively large number of [fast] registers allows software to cache data [e.g., Common Subexpression Elimination] and use faster/less expensive procedure interfaces), and simpler/faster hardware (in addition to pipelinability aids, aligned memory accesses [which can simplify a usually time-critical path] and 'reduced' instructions [particularly separation of memory accesses from other operations]).
The design principles of RISC place more burden on the compiler, which may allow system developers to take advantage of late-binding. It would certainly seem to allow system developers to leverage a greater volume of software developers relative to hardware developers.
At current hardware budgets, the aligned memory access requirement is probably the least useful of the RISC mechanisms. The largish number of general purpose registers may become more burdensome if microthreading becomes more common, but the benefits of this mechanism still seem to outweigh the disadvantages significantly. With the exception of some embedded systems (for which two-sized instructions are common), fixed-sized instructions seem to provide more benefit than cost. The emphasis on scalar decode may be considered a weakness of RISC in the world of superscalar processing, though the simple generally explicit encoding of RISC does help somewhat even there.
CISC design principles may be said to depend too much on expensive memory capacity to remain practical in most modern circumstances.
Paul A. Clayton (a 'Dysthymicdolt' reachable at aol.com)
|
|
 | | From: | MrTibbs | | Subject: | Re: Unaligned accesses (was Re: RISC vs. CISC design principles) | | Date: | 15 Jan 2005 22:05:14 -0800 |
|
|
 | > Putting only a few extra gates on a chip to allow unaligned accesses,
I'm not so sure it's just a few gates. What if the unaligned access crosses a cache line boundary, and one line is in the cache and one isn't? What if it crosses a page boundary, and blah blah...
There's the MOESI/whatever protocol for multiprocessors as well. Although few programs may do unaligned accesses on shared memory, it has to work right if it is advertised.
It may or may not be a few gates, but I think the hardware folks, with unaligned accesses, now have to deal with a whole bunch of corner cases that they wouldn't have to consider otherwise.
jim
|
|
 | | From: | John Savard | | Subject: | Re: Unaligned accesses (was Re: RISC vs. CISC design principles) | | Date: | Mon, 17 Jan 2005 00:25:25 GMT |
|
|
 | On 15 Jan 2005 22:05:14 -0800, "MrTibbs" wrote, in part:
>> Putting only a few extra gates on a chip to allow unaligned accesses, > >I'm not so sure it's just a few gates. What if the unaligned access >crosses a cache line boundary, and one line is in the cache and one >isn't? What if it crosses a page boundary, and blah blah...
You make it into a few gates by turning an unaligned access into multiple accesses of smaller things. If you want a smaller performance penalty, *then* it's more gates.
John Savard http://home.ecn.ab.ca/~jsavard/index.html
|
|
 | | From: | Seongbae Park | | Subject: | Re: Unaligned accesses (was Re: RISC vs. CISC design principles) | | Date: | Tue, 18 Jan 2005 16:58:46 +0000 (UTC) |
|
|
 | John Savard wrote: > On 15 Jan 2005 22:05:14 -0800, "MrTibbs" wrote, > in part: > >>> Putting only a few extra gates on a chip to allow unaligned accesses, >> >>I'm not so sure it's just a few gates. What if the unaligned access >>crosses a cache line boundary, and one line is in the cache and one >>isn't? What if it crosses a page boundary, and blah blah... > > You make it into a few gates by turning an unaligned access into > multiple accesses of smaller things.
You can't simply turn it into multiple smaller accesses without locking multiple cache lines (or potentially even TLB entries if it crosses a page boundary) if the ISA defines the memory operations to be atomic (most ISAs do). Locking multiple anything will cost more than "just a few gates" if otherwise you don't need to do so.
-- #pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"
|
|
 | | From: | Terje Mathisen | | Subject: | Re: Unaligned accesses (was Re: RISC vs. CISC design principles) | | Date: | Tue, 18 Jan 2005 21:07:12 +0100 |
|
|
 | Seongbae Park wrote:
> John Savard wrote: > >>You make it into a few gates by turning an unaligned access into >>multiple accesses of smaller things. > > You can't simply turn it into multiple smaller accesses > without locking multiple cache lines (or potentially even TLB entries > if it crosses page boundary) > if the ISA defines the memory operations to be atomic (most ISAs do). > Locking multiple anything will cost more than "just a few gates" > if otherwise you don't need to do so.
In the cases we've been discussing allowing mis-aligned accesses to be not atomic wouldn't cost anything at all:
After all, this is what the alternative sequence has to do anyway, right?
I.e. I'd be perfectly happy with a "best effort" alignment handler in hw:
Load a single item (quickly) if aligned, otherwise load two items into the barrel shifter, shift to align, and return the result.
This would be at least comparable to an explicit sw sequence to do the same task, and it would simplify programming quite a bit.
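The "load two items, shift to align" step can be sketched in C roughly as follows. This is only an illustration, not how any particular CPU does it: `load32_two_step` is a made-up name, memory is modeled as an array of aligned 32-bit words, and byte 0 of each word is assumed to be its least significant byte (little-endian numbering).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fetch the 32-bit value that starts at byte address
 * `byte_addr`, using only aligned word reads from `words`.
 * Assumes little-endian byte numbering within each word. */
uint32_t load32_two_step(const uint32_t *words, size_t nwords, size_t byte_addr)
{
    size_t index = byte_addr / 4;                    /* word holding the first byte */
    unsigned shift = (unsigned)(byte_addr % 4) * 8;  /* 0, 8, 16 or 24 bits */

    assert(index < nwords);
    if (shift == 0)
        return words[index];                         /* aligned: a single access */

    assert(index + 1 < nwords);                      /* misaligned: needs the next word too */
    return (words[index] >> shift) | (words[index + 1] << (32 - shift));
}
```

The aligned path costs one access and no shifting, and the misaligned path costs two accesses plus a shift and merge, which is roughly what an explicit software sequence would have to do as well.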
(I.e. aligned writes and misaligned reads are nearly the same speed as having both aligned on most x86 implementations!)
Using a LOCK prefix should trap in such a case.
Terje
-- - "almost all programming can be viewed as an exercise in caching"
|
|
 | | From: | Eric P. | | Subject: | Re: Unaligned accesses (was Re: RISC vs. CISC design principles) | | Date: | Tue, 18 Jan 2005 16:01:22 -0500 |
|
|
 | Terje Mathisen wrote: > > Seongbae Park wrote: > > > John Savard wrote: > > > >>You make it into a few gates by turning an unaligned access into > >>multiple accesses of smaller things. > > > > You can't simply turn it into multiple smaller accesses > > without locking multiple cache lines (or potentially even TLB entries > > if it crosses page boundary) > > if the ISA defines the memory operations to be atomic (most ISAs do). > > Locking multiple anything will cost more than "just a few gates" > > if otherwise you don't need to do so. > > In the cases we've been discussing allowing mis-aligned accesses to be > not atomic wouldn't cost anything at all:
Note that the Intel x86 does NOT guarantee atomic access to nonaligned values that straddle 32 byte cache lines. (Vol 3, Sys Prog Guide, section 7.1.1)
> After all this is what the alternative sequence have to do anyway, right? > > I.e. I'd be perfectly happy with a "best effort" alignment handler in hw: > > Load a single item (quickly) if aligned, otherwise load two items into > the barrel shifter, shift to align, and return the result.
Most of this hw support would likely already be present in the L1 data cache as it is required for byte and aligned word/dword/qword access. Nonaligned access should require only minor extensions.
> This would be at least comparable to an explicit sw sequence to do the > same task, and it would simplify programming quite a bit.
The sw trap incurs a pipeline flush that a hw sequencer does not.
> (I.e. aligned writes and misaligned reads are nearly the same speed as > having both aligned on most x86 implementations!) > > Using a LOCK prefix should trap in such a case.
Hmmm... what else might be affected?
- Load-Store queue must do more complex overlap checks before allowing read or write reordering
- On store operations that straddle pages, MMU must probe TLB for both pages before starting so they do not fault half way through. If both are valid then emit physical addresses to L1.
- Write combine buffer must do more complex check for straddles. Also must try not to evict one needed part when loading another.
Anything else?
Eric
|
|
 | | From: | Terje Mathisen | | Subject: | Re: Unaligned accesses (was Re: RISC vs. CISC design principles) | | Date: | Wed, 19 Jan 2005 08:35:22 +0100 |
|
|
 | Eric P. wrote:
> Terje Mathisen wrote: > Hmmm... what else might be affected? > > - Load-Store queue must do more complex overlap checks before > allowing read or write reordering
Not too much though: Currently it must take into consideration both the base and the length of each operation; this extension could conservatively extend this to be the aligned base and the extended length.
> - On store operations that straddle pages, MMU must probe TLB for both pages before starting so they do not fault half way through. If both are valid then emit physical addresses to L1.
>
> - Write combine buffer must do more complex check for straddles. Also must try not to evict one needed part when loading another.
None of these would seem to apply if the store that crosses a cache line boundary is turned into multiple micro-ops, with traps allowed between them. I.e. in case of a store that traps halfway, the first half could get written either once or twice, with no guarantee of what would actually happen, except that both halves would eventually make it to the destination.
Terje
-- - "almost all programming can be viewed as an exercise in caching"
|
|
 | | From: | Eric P. | | Subject: | Re: Unaligned accesses (was Re: RISC vs. CISC design principles) | | Date: | Wed, 19 Jan 2005 15:05:52 -0500 |
|
|
 | Terje Mathisen wrote: > > Eric P. wrote: > > > Terje Mathisen wrote: > > Hmmm... what else might be affected? > > > > - Load-Store queue must do more complex overlap checks before > > allowing read or write reordering > > Not too much though: Currently it must take into consideration both base > and length of each operation, this extension could conservatively > extend this to be the aligned base, and the extended length.
Oops yes, I should have seen that. I kept thinking this required arithmetic. If the largest operand is 8 bytes then round down to a 16 byte boundary and check for overlap on the 16 byte blocks.
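A rough C sketch of the conservative check, under the assumption that accesses are compared by the 16-byte blocks they touch. The function names are invented for illustration, and real hardware would more likely compare masked address bits than compute block ranges as done here; the point is only that the block-level test never misses a true overlap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SHIFT 4  /* 16-byte blocks, as suggested for operands of up to 8 bytes */

/* Conservative test: returns true whenever the two accesses touch a common
 * 16-byte block. It may report overlap where none exists, never the reverse. */
bool may_overlap(uint64_t addr_a, unsigned len_a, uint64_t addr_b, unsigned len_b)
{
    assert(len_a >= 1 && len_b >= 1);
    uint64_t first_a = addr_a >> BLOCK_SHIFT;
    uint64_t last_a  = (addr_a + len_a - 1) >> BLOCK_SHIFT;
    uint64_t first_b = addr_b >> BLOCK_SHIFT;
    uint64_t last_b  = (addr_b + len_b - 1) >> BLOCK_SHIFT;
    return first_a <= last_b && first_b <= last_a;
}

/* Exact byte-granular test, shown only for comparison. */
bool really_overlaps(uint64_t addr_a, unsigned len_a, uint64_t addr_b, unsigned len_b)
{
    return addr_a < addr_b + len_b && addr_b < addr_a + len_a;
}
```

The price of the simpler check is an occasional false positive (two disjoint accesses in the same block), which only delays a reordering that would have been legal.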
Eric
|
|
 | | From: | MitchAlsup at aol.com | | Subject: | Re: RISC vs. CISC design principles | | Date: | 17 Jan 2005 09:03:46 -0800 |
|
|
 | "Why aren't Intel/AMD doing this already?" Rumor has it that Dothan does this, at least for integer moves.
Mitch
|
|
 | | From: | Terje Mathisen | | Subject: | Re: RISC vs. CISC design principles | | Date: | Mon, 17 Jan 2005 21:48:29 +0100 |
|
|
 | MitchAlsup@aol.com wrote:
> "Why aren't Intel/AMD doing this already?" > Rumor has it that Dothan does this, at least for integer moves.
Terje
-- - "almost all programming can be viewed as an exercise in caching"
|
|
 | | From: | already5chosen at yahoo.com | | Subject: | Re: RISC vs. CISC design principles | | Date: | 19 Jan 2005 12:04:38 -0800 |
|
|
 | What is your definition for "all x86 CPUs"? I would imagine it doesn't include outsiders like SiS, Transmeta and Geode GX. How about VIA? Does the definition include P6 that currently has near-zero market share but still dominates installed base?
|
|
 | | From: | MitchAlsup at aol.com | | Subject: | Re: RISC vs. CISC design principles | | Date: | 12 Jan 2005 11:56:35 -0800 |
|
|
 | "At current hardware budgets, the aligned memory access requirement is probably the least useful of the RISC mechanisms."
I, respectfully, disagree.
At current hardware budgets, the least useful RISC mechanism is the fixed length instruction format. Both Intel and AMD have shown that they/we can decode just as many instructions per unit time as the RISC guys.
Consider an x86 machine like an Athlon (or P3 or P4). How much performance is sacrificed by having to decode multibyte instructions? Answer; with modern branch prediction, the added pipe stages extract a penalty around only 1%-2% compared to fixed length machines! Yet these decoded multibyte instructions contain more semantic units of work than the equivalent 4-ish wide RISC decoders. But, in neither category of machines is the basic throughput significantly dependent upon the performance of the decoder(s)!
I could address the rest of the statements piecemeal; however, the general premise is wrong. The evolution of x86 is proceeding faster than the evolution of other CPUs because of the amount of cubic dollars that can be thrown at teams of designers to solve yesterday's problems and develop tomorrow's monster machines. Cubic dollars beats architectural cleanliness every time.
Mitch
|
|
 | | From: | Paul Rubin | | Subject: | Re: RISC vs. CISC design principles | | Date: | 12 Jan 2005 12:00:18 -0800 |
|
|
 | MitchAlsup@aol.com writes: > Consider a x86 machine like an Athlon (or P3 or P4). How much > performance is sacrificed by having to decode multibyte instructions? > Answer; with modern branch prediction, the added pipe stages extract a > penalty around only 1%-2% compared to fixed length machines!
What do you mean by that? Don't those added pipe stages and decoder logic burn a lot of silicon area that could be used for more functional units or caches or something? What about the x86's register starvation, couldn't code run faster with more registers? The x86-64 supports more registers (16 instead of 8), but 16 still isn't an awful lot, and it makes the instructions longer.
|
|
 | | From: | prep at prep.synonet.com | | Subject: | Re: Unaligned accesses | | Date: | Tue, 18 Jan 2005 04:47:22 +0800 |
|
|
 | jsavard@excxn.aNOSPAMb.cdn.invalid (John Savard) writes:
> Putting only a few extra gates on a chip to allow unaligned > accesses, and then warning programmers that these accesses will have > a performance penalty, so they should not be used unless really > needed, is usually the best tradeoff, though. It eliminates a > potential source of confusion and error at the lowest cost.
Because you are paying the gate delay penalty for EVERY access that now has to go through them.
See the Alpha papers for the messy details, and more.
-- Paul Repacholi 1 Crescent Rd., +61 (08) 9257-1001 Kalamunda. West Australia 6076 comp.os.vms,- The Older, Grumpier Slashdot Raw, Cooked or Well-done, it's all half baked. EPIC, The Architecture of the future, always has been, always will be.
|
|
 | | From: | Terje Mathisen | | Subject: | Re: Unaligned accesses | | Date: | Tue, 18 Jan 2005 08:41:05 +0100 |
|
|
 | prep@prep.synonet.com wrote:
> jsavard@excxn.aNOSPAMb.cdn.invalid (John Savard) writes: > > >>Putting only a few extra gates on a chip to allow unaligned >>accesses, and then warning programmers that these accesses will have >>a performance penalty, so they should not be used unless really >>needed, is usually the best tradeoff, though. It eliminates a >>potential source of confusion and error at the lowest cost. > > > Because you are paying the gate delay penalty for EVERY access > that now has to go through them.
Is that really true?
_Either_ you pay the gate delay penalty of being able to detect misaligned accesses, and convert those to a trap,
_or_ you pay the gate delay penalty of being able to detect misaligned accesses, and convert those into slower/microcoded sequences. :-)
I'll accept that generating a trap is probably easier, since you need that for other problem cases (i.e. out-of-bounds) anyway, but the HW that allows the cpu to do a realtime decision of the path to follow should be very similar.
It is only if/when the trap is async that this really becomes worrisome, since at this point the cpu must revert to the last checkpoint and singlestep forward to the point of the trap.
If the same mechanism is used to handle misaligned accesses, then they will be so slow as to make the alternate (aligned only) code sequence faster except when misalignment is very rare.
OK, I guess I'm sorta/reluctantly agreeing with you. :-(
Terje
-- - "almost all programming can be viewed as an exercise in caching"
|
|
 | | From: | Wilco Dijkstra | | Subject: | Re: Unaligned accesses | | Date: | Tue, 18 Jan 2005 22:41:30 GMT |
|
|
 | "Terje Mathisen" wrote in message news:csieii$uf6$1@osl016lin.hda.hydro.com... > prep@prep.synonet.com wrote: > > > jsavard@excxn.aNOSPAMb.cdn.invalid (John Savard) writes: > > > > > >>Putting only a few extra gates on a chip to allow unaligned > >>accesses, and then warning programmers that these accesses will have > >>a performance penalty, so they should not be used unless really > >>needed, is usually the best tradeoff, though. It eliminates a > >>potential source of confusion and error at the lowest cost. > > > > > > Because you are paying the gate delay penalty for EVERY access > > that now has to go through them. > > Is that really true?
No, it's not. Those extra gates are already needed to select words, halfwords and bytes, endian swap them (perhaps dynamically) and zero- or sign-extend them as necessary. If you look at it you're close to a full crossbar switch already, so it isn't much more work to support unaligned accesses. Initial Alphas didn't support any of this as they didn't have those gates indeed, but they did pay for this as code using chars and shorts ran slow.
ARM is perhaps the only RISC that added support for unaligned access due to customer demand. It speeds up code that occasionally does do unaligned accesses as the cost on ARMs is high (>10x slower than an aligned access as ARM has no funnel shifter). It's essential for SIMD as unaligned accesses often outnumber aligned ones (eg. SAD in motion estimation).
As Stephen Fuld guessed the hardware people didn't like it initially but then again ARM already has instructions that can straddle up to 4 cache lines...
> _Either_ you pay the gate delay penalty of being able to detect > misaligned accesses, and convert those to a trap, > > _or_ you pay the gate delay penalty of being able to detect misaligned > accesses, and convert those into slower/microcoded sequences. > :-) > > I'll accept that generating a trap is probably easier, since you need > that for other problem cases (i.e. out-of-bounds) anyway, but the HW > that allows the cpu to do a realtime decision of the path to follow > should be very similar.
Indeed you have a lot more time for a trap as you only have to generate it just before the cache returns the hit signal. However generating an unaligned signal is so easy it can be done during effective address generation at virtually no cost. This can then be used to stall the load store unit for an extra cycle to access the other cacheline (the ARM11 does this). If the execution units are statically scheduled you'll have to replay the load, but since cachelines are large nowadays this doesn't matter much (see below).
> It is only if/when the trap is async that this really becomes worrisome, > since at this point the cpu must revert to the last checkpoint and > singlestep forward to the point of the trap. > > If the same mechanism is used to handle misaligned accesses, then they > will be so slow as to make the alternate (aligned only) code sequence > faster except when misalignment is very rare.
Assuming a 10-cycle cost for an unaligned word access crossing a 64-byte cacheline it would take 192 cycles for the replay mechanism to be worse! So in principle it would be possible to add unaligned access to a CPU that doesn't support it by taking a trap, inserting the instructions for an unaligned access using a micro code engine and still get a (small) speedup :-)
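The post does not spell out where 192 comes from. One reading that reproduces the figure, offered as an assumption rather than the author's stated derivation: a software sequence that is always safe costs 10 cycles against 1 cycle for an aligned load, byte offsets are uniformly distributed, and a 4-byte access crosses a 64-byte line at only 3 of the 64 possible offsets. The replay path then breaks even when its cost times 3/64 equals the 9 extra cycles. The function names below are invented for this sketch.

```c
#include <assert.h>

/* Number of byte offsets within a cache line at which an access of
 * `access_bytes` bytes spills into the next line. */
unsigned crossing_offsets(unsigned access_bytes, unsigned line_bytes)
{
    unsigned count = 0;
    assert(access_bytes >= 1 && access_bytes <= line_bytes);
    for (unsigned off = 0; off < line_bytes; off++)
        if (off + access_bytes > line_bytes)
            count++;
    return count;
}

/* Replay cost (in cycles) at which paying it only on line-crossing accesses
 * averages out to the same overhead as always using the software sequence.
 * Assumes uniformly distributed offsets. */
double break_even_replay_cycles(double sw_cycles, double aligned_cycles,
                                unsigned access_bytes, unsigned line_bytes)
{
    unsigned crossings = crossing_offsets(access_bytes, line_bytes);
    assert(crossings > 0);
    return (sw_cycles - aligned_cycles) * line_bytes / crossings;
}
```

With these inputs, `break_even_replay_cycles(10.0, 1.0, 4, 64)` evaluates to 9 * 64 / 3 = 192.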
Wilco
|
|
 | | From: | Paul A. Clayton | | Subject: | Re: RISC vs. CISC design principles | | Date: | 13 Jan 2005 15:58:19 GMT |
|
|
 | In article <1105559795.307333.292340@c13g2000cwb.googlegroups.com>, MitchAlsup@aol.com wrote:
>"At current hardware budgets, the aligned memory access >requirement is probably the least useful of the RISC mechanisms." > >I, respectfully, disagree. > >At current hardware budgets, the least useful RISC mechanism is the >fixed length instruction format. Both Intel and AMD have shown that >they/we can decode just as many instructions per unit time as the RISC >guys. > >Consider a x86 machine like an Athlon (or P3 or P4). How much >performance is sacrificed by having to decode multibyte instructions? >Answer; with modern branch prediction, the added pipe stages extract a >penalty around only 1%-2% compared to fixed length machines! Yet these [snip]
I assume a 1% performance penalty for complex instructions is greater than the performance penalty (if even positive) of supporting unaligned memory operations in hardware. Is this incorrect?
>I could address the rest of the statements piecemeal, however, the >general premise is wrong. The evolution of x86 is proceeding faster >than the evolution of other CPUs because of the amount of cubic dollars >that can be thrown at teams of designers to solve yesterday's problems >and develop tomorrow's monster machines. Cubic dollars beats >architectural cleanliness every time.
I agree.
I should not have used practicality. My concern was with comparing design principles given a clean slate for modern tradeoffs and trying to understand the reasoning behind the design choices that generated CISCs and RISCs in their historical context.
Paul A. Clayton just a technophile, not a computer professional
|
|
 | | From: | Stephen Fuld | | Subject: | Re: RISC vs. CISC design principles | | Date: | Thu, 13 Jan 2005 17:34:04 GMT |
|
|
 | "Paul A. Clayton" wrote in message news:20050113105819.01270.00000034@mb-m23.aol.com...
snip
> I should not have used practicality. My concern was with comparing > design principles given a clean slate for modern tradeoffs and trying > to understand the reasoning behind the design choices that generated > CISCs and RISCs in their historical context.
With regard to both variable length instructions and unaligned storage operations, you have to go back in history to the original RISC era. The idea was that you gained so much by eliminating the off chip connection delay in favor of everything within one chip that the single chip requirement pretty much dominated everything else. Now look at the number of transistors one could get on a single chip at that time. That dictated eliminating a lot of features that might otherwise be desirable. So all instructions being the same length saved a lot of transistors (and speeded decoding) and that was a much bigger issue then than it is now. That is why you see the multi-length instruction sets added in some RISC chips, and the minimal cost of pretty much full generality of X-86 being quite fast.
With regard to unaligned memops, I think it is useful to divide them into two cases. The first is where the entire operation is contained within one cache line/page. These will be much more frequent and probably are easier to make fast. The other is where a cache line or even a page boundary is crossed, which are much less frequent, and of course have to be done correctly, but are less important to be fast. Note that as cache lines get larger, the first case becomes more frequent. And since (to go back to your question), the first RISC chips had no on-chip cache, all cases had to go directly to memory, and unaligned accesses were much costlier (both in terms of time and the then all-important transistor count).
I suspect if some architect demanded unaligned access support in a hypothetical new chip or new version of an existing chip that doesn't now have it, the hardware guys would grumble a lot and then do a good job of making it fast in the common cases and correct in all cases. But I would appreciate comments from people who know more about that aspect of things than I do.
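The two cases can be told apart from the address and size alone. The following C sketch is illustrative only; the enum, the function name and the line/page sizes passed in are assumptions made for the example, not anything a specific processor defines.

```c
#include <assert.h>
#include <stdint.h>

enum access_kind {
    ACCESS_ALIGNED,       /* naturally aligned */
    ACCESS_WITHIN_LINE,   /* misaligned, but inside one cache line (the common case) */
    ACCESS_CROSSES_LINE,  /* straddles two cache lines within one page */
    ACCESS_CROSSES_PAGE   /* straddles two pages (and therefore two lines) */
};

/* Classify an access of `size` bytes at `addr`. `size` is assumed to be a
 * power of two no larger than a cache line. */
enum access_kind classify_access(uint64_t addr, unsigned size,
                                 unsigned line_bytes, unsigned page_bytes)
{
    assert(size >= 1 && size <= line_bytes && line_bytes <= page_bytes);
    uint64_t last = addr + size - 1;   /* address of the final byte touched */

    if (addr / page_bytes != last / page_bytes)
        return ACCESS_CROSSES_PAGE;
    if (addr / line_bytes != last / line_bytes)
        return ACCESS_CROSSES_LINE;
    if (addr % size != 0)
        return ACCESS_WITHIN_LINE;
    return ACCESS_ALIGNED;
}
```

For a fixed access size, doubling the line size halves the fraction of offsets that land in the crossing cases, which is the effect noted above.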
-- - Stephen Fuld e-mail address disguised to prevent spam
|
|
 | | From: | Kai Harrekilde-Petersen | | Subject: | Re: RISC vs. CISC design principles | | Date: | Sun, 16 Jan 2005 23:50:44 +0100 |
|
|
 | nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
> In article , > Andi Kleen wrote: >>nmm1@cus.cam.ac.uk (Nick Maclaren) writes: >>> >>> In my career, I have never seen a significant use of it except to >>> cover up misdesigned interfaces - in particular, ones that have >>> failed to take the decision whether they are based on semi-abstract >>> types like integers and floating-point or on precisely specified >>> bit patterns. >> >>It's useful to process IPv4 packets. On a aligned ethernet packet the >>TCP header ends up being unaligned. Same is true for other protocols. > > That is precisely what I am describing as a misdesigned protocol.
Are you poking at the 14 byte Ethernet header or the IPv4 header here? - I thought the IPv4 header was quite well-laid out, with everything aligned to natural boundaries.
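For reference, the arithmetic behind the question can be sketched in C. The constants are the standard untagged Ethernet header length (14 bytes) and the offsets of the source and destination addresses within an IPv4 header (12 and 16); the function name and the idea of passing the frame's start address are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

enum {
    ETH_HEADER_BYTES = 14,  /* 6 dst MAC + 6 src MAC + 2 EtherType, no VLAN tag */
    IPV4_SRC_OFFSET  = 12,  /* source address, relative to the IPv4 header */
    IPV4_DST_OFFSET  = 16,  /* destination address, relative to the IPv4 header */
    IPV4_MAX_HEADER  = 60   /* longest possible IPv4 header, with options */
};

/* Misalignment (0..3) relative to a 4-byte boundary of a field inside the
 * IPv4 header, given where the Ethernet frame starts in memory. */
unsigned misalignment4(size_t frame_start, size_t field_offset_in_ip_header)
{
    assert(field_offset_in_ip_header < IPV4_MAX_HEADER);
    return (unsigned)((frame_start + ETH_HEADER_BYTES + field_offset_in_ip_header) % 4);
}
```

Within the IPv4 header itself both addresses sit at multiples of 4, so with a 4-byte-aligned frame it is the 14-byte Ethernet header that leaves them 2 bytes off; starting the frame at an address that is 2 mod 4 brings them back into alignment.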
Regards,
Kai -- Kai Harrekilde-Petersen
|
|
 | | From: | Nick Maclaren | | Subject: | Re: RISC vs. CISC design principles | | Date: | 12 Jan 2005 20:32:50 GMT |
|
|
 | In article <1105559795.307333.292340@c13g2000cwb.googlegroups.com>, wrote: >"At current hardware budgets, the aligned memory access >requirement is probably the least useful of the RISC mechanisms." > >I, respectfully, disagree. > >At current hardware budgets, the least useful RISC mechanism is the >fixed length instruction format. Both Intel and AMD have shown that >they/we can decode just as many instructions per unit time as the RISC >guys.
Yes. But let's ignore that and go back to the alignment issue. Speaking as a software engineer from way back:
"Allowing unaligned memory access is probably the least useful of common CISC features."
In my career, I have never seen a significant use of it except to cover up misdesigned interfaces - in particular, ones that have failed to take the decision whether they are based on semi-abstract types like integers and floating-point or on precisely specified bit patterns.
The point is that the former have no trouble with padding being inserted to create alignment, and the latter are uniformly better done by the use of packing and unpacking primitives because there are almost certainly other things to fix up than alignment (e.g. endianness).
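As an illustration of the packing and unpacking primitives meant here, a minimal C sketch for a 32-bit big-endian field follows. The function names are invented for the example. Because it touches memory one byte at a time, it works at any offset and on a host of either endianness, so alignment and byte order are fixed up together.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit big-endian value from any byte position. */
uint32_t unpack_be32(const unsigned char *p)
{
    assert(p != NULL);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Write a 32-bit value in big-endian order at any byte position. */
void pack_be32(unsigned char *p, uint32_t v)
{
    assert(p != NULL);
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}
```

Whether a compiler turns such a sequence into a single load plus byte swap depends on the compiler and target.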
Regards, Nick Maclaren.
|
|
 | | From: | Terje Mathisen | | Subject: | Re: RISC vs. CISC design principles | | Date: | Thu, 13 Jan 2005 08:29:41 +0100 |
|
|
 | Nick Maclaren wrote:
> In article <1105559795.307333.292340@c13g2000cwb.googlegroups.com>, > wrote: >>At current hardware budgets, the least useful RISC mechanism is the >>fixed length instruction format. Both Intel and AMD have shown that >>they/we can decode just as many instructions per unit time as the RISC >>guys. > > > Yes. But let's ignore that and go back to the alignment issue. > Speaking as a software engineer from way back: > > "Allowing unaligned memory access is probably the least useful > of common CISC features." > > In my career, I have never seen a significant use of it except to > cover up misdesigned interfaces - in particular, ones that have > failed to take the decision whether they are based on semi-abstract > types like integers and floating-point or on precisely specified > bit patterns.
A recent post in c.l.a.x86 made me go back to C string.h functions, as well as the *BSD-inspired strl*() replacements.
Efficient handling of C strings pretty much requires you to process a full register's worth of data (usually 4 or 8 chars), while you cannot depend on either the source or destination to be properly aligned, right?
Besides alignment, another key problem is that the terminating zero byte in the source string could well be the last byte in a memory block, meaning that any access past this point will cause a trap.
Handling both of these at the same time pretty much requires either unaligned load and/or store operations, together with the capability to do non-trapping (speculative) load operations past the end of allocated memory, or you need to re-invent the Alpha:
I.e. load operations that disregard the bottommost (alignment) bits, together with fast shift/mask/merge operations based on those same bits, so that you can synthesize unaligned operations this way.
> The point is that the former have no trouble with padding being > inserted to create alignment, and the latter are uniformly better > done by the use of packing and unpacking primitives because there > are almost certainly other things to fix up than alignment (e.g. > endianness).
Even these are much better off if you can specify them in such a way as to allow the compiler to generate optimal code, i.e. not just a set of byte load/shift/merge operations.
Terje
-- - "almost all programming can be viewed as an exercise in caching"
|
|
 | | From: | Nick Maclaren | | Subject: | Re: RISC vs. CISC design principles | | Date: | 13 Jan 2005 11:02:46 GMT |
|
|
 | In article , Terje Mathisen writes: |> |> A recent post in c.l.a.x86 made me go back to C string.h functions, as |> well as the *BSD-inspired strl*() replacements. |> |> Efficient handling of C strings pretty much requires you to process a |> full register's worth of data (usually 4 or 8 chars), while you cannot |> depend on either the source or destination to be properly aligned, right?
Well, there are other approaches, but I agree that is one of the few that works on most current hardware.
|> Besides alignment, another key problem is that the terminating zero |> byte in the source string could well be the last byte in a memory block, |> meaning that any access past this point will cause a trap. |> |> Handling both of these at the same time pretty much requires either |> unaligned load and/or store operations, together with the capability to |> do non-trapping (speculative) load operations past the end of allocated |> memory, or you need to re-invent the Alpha:
Well, no, it doesn't. Look at BSD libraries for examples of how you can (semi-portably) use only aligned loads and stores.
Also, you are asking for MUCH more than just unaligned operations. For example, you are relying on three features that are generally not the case:
1) EFFICIENT unaligned loads and stores, whether in terms of cycle counts, cache use or TLB use.
2) The non-trapping aspects you mentioned, which can be very important indeed.
3) No error detection in such operations. This is less obvious, but I have almost never seen code that operates that way (including relying on non-trapping aspects) AND correctly traps when using a genuinely invalid location.
Regards, Nick Maclaren.
|
|
 | | From: | Terje Mathisen | | Subject: | Re: RISC vs. CISC design principles | | Date: | Thu, 13 Jan 2005 15:40:04 +0100 |
|
|
 | Nick Maclaren wrote:
> In article , > Terje Mathisen writes: > |> > |> A recent post in c.l.a.x86 made me go back to C string.h functions, as > |> well as the *BSD-inspired strl*() replacements. > |> > |> Efficient handling of C strings pretty much requires you to process a > |> full register's worth of data (usually 4 or 8 chars), while you cannot > |> depend on either the source or destination to be properly aligned, right? > > Well, there are other approaches, but I agree that is one of the > few that works on most current hardware.
My statement was re. requirements, not implementations. :-)
The proper way to handle this on a CISC is to have REP SCASB/REP MOVSB opcodes that actually do the right thing with the hw, in all versions of the architecture and in all combinations of memory types.
However, according to many notes from Andy Glew, this isn't very likely to happen. :-(
> |> Besides alignment, another key problem is that the terminating zero
> |> byte in the source string could well be the last byte in a memory block,
> |> meaning that any access past this point will cause a trap.
> |>
> |> Handling both of these at the same time pretty much requires either
> |> unaligned load and/or store operations, together with the capability to
> |> do non-trapping (speculative) load operations past the end of allocated
> |> memory, or you need to re-invent the Alpha:
>
> Well, no, it doesn't. Look at BSD libraries for examples of how
> you can (semi-portably) use only aligned loads and stores.
You can obviously do it quite portably, even endian-independently, by just being able to assume that a register will hold a power of two number of 8-bit characters. Doing the same with 36 or 60-bit register sizes is slightly harder, unless you're allowed some init code to detect the current environment. :-)
strlen() is easy: Load single bytes until alignment is reached, then process (safely, since a memory block cannot end in the middle of an aligned word!) full words until one is found that contains at least one zero byte.
At this point you switch back to reloading and checking each character, or if you could set up an array of masks at startup, just check the current word against those masks. (Due to cache-misses, the first option might well be the faster one.)
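The scheme described above can be sketched in C roughly as follows. This is only an illustration: the name `strlen_wordwise` is made up here, it is not the BSD code, and it takes the "reload and check each character" option for the final word. It assumes 8-bit chars and that an aligned word never straddles the end of a memory block, so the word loads may read a few bytes past the terminator, which ISO C does not sanction.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name: an illustration only, not the BSD or any libc code. */
size_t strlen_wordwise(const char *s)
{
    const char *p = s;
    const uintptr_t ones  = UINTPTR_MAX / 0xFF;  /* 0x01 in every byte */
    const uintptr_t highs = ones << 7;           /* 0x80 in every byte */
    uintptr_t w;

    /* 1. Single-byte loads until p sits on a word boundary. */
    while ((uintptr_t)p % sizeof w != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* 2. Aligned full-word loads until a word contains a zero byte.
          Assumption: an aligned word cannot straddle the end of a memory
          block, so reading past the terminator does not trap. */
    for (;;) {
        memcpy(&w, p, sizeof w);                 /* aligned word load */
        if (((w - ones) & ~w & highs) != 0)      /* some byte is zero */
            break;
        p += sizeof w;
    }

    /* 3. Reload that word one character at a time to locate the zero. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

The zero-byte test in step 2 is endian-independent, which is why strlen() is the easy case; only finding the position inside the word (step 3) would need endian-specific masks if done without byte reloads.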
The copying operations (strcpy, strlcpy, strlcat, strncpy etc) are harder because you want to use aligned accesses for both source and destination, which means that you _must_ do some form of shift/mask/merge to convert from source to destination alignment, and this cannot be done both portably and efficiently without introducing some level of endian-dependent coding.
>
> Also, you are asking for MUCH more than just unaligned operations.
> For example, you are relying on two features that are generally
> not the case:
>
> 1) EFFICIENT unaligned loads and stores, whether in terms of
> cycle counts, cache use or TLB use.
Unaligned load operations have always been very efficient on x86, as long as the load didn't straddle a cache line boundary. I.e. the effective overhead of reading a stream this way is _much_ lower than the cost of shifting a set of aligned loads!
> 2) The non-trapping aspects you mentioned, which can be very > important indeed.
They help a lot by allowing the last load to straddle the end of the buffer, yeah.
>
> 3) No error detection in such operations. This is less obvious,
> but I have almost never seen code that operates that way (including
> relying on non-trapping aspects) AND correctly traps when using a
> genuinely invalid location.
I've seen it done, by having special non-trapping load operations.
This will work as long as the input was valid, i.e. a terminating zero was actually found.
The faster solution is to be able to have a user-level trap of such a load, and turn it into a load of zeroes. That way you can safely load a few words past the end of the input (for unrolling), while still never writing beyond the terminating zero of the output.
Terje
-- - "almost all programming can be viewed as an exercise in caching"
|
|
 | | From: | Nick Maclaren | | Subject: | Re: RISC vs. CISC design principles | | Date: | 13 Jan 2005 15:17:55 GMT |
|
|
 | In article , Terje Mathisen writes:
|>
|> strlen() is easy: Load single bytes until alignment is reached, then
|> process (safely, since a memory block cannot end in the middle of an
|> aligned word!) full words until one is found that contains at least one
|> zero byte.
|>
|> At this point you switch back to reloading and checking each character,
|> or if you could set up an array of masks at startup, just check the
|> current word against those masks. (Due to cache-misses, the first option
|> might well be the faster one.)
That's what the BSD code did when I looked at it.
|> > 3) No error detection in such operations. This is less obvious, |> > but I have almost never seen code that operates that way (including |> > relying on non-trapping aspects) AND correctly traps when using a |> > genuinely invalid location. |> |> I've seen it done, by having special non-trapping load operations. |> |> This will work as long as the input was valid, i.e. a terminating zero |> was actually found.
But did those correctly diagnose the error if the input was NOT valid? That is what I meant.
Regards, Nick Maclaren.
|
|
 | | From: | Terje Mathisen | | Subject: | Re: RISC vs. CISC design principles | | Date: | Thu, 13 Jan 2005 21:51:39 +0100 |
|
|
 | Nick Maclaren wrote:
> In article , > Terje Mathisen writes: > That's what the BSD code did when I looked at it.
Not too surprising, it is the obvious solution. :-)
> |> This will work as long as the input was valid, i.e. a terminating zero > |> was actually found. > > But did those correctly diagnose the error if the input was NOT > valid? That is what I meant.
To get correct behaviour, you'll have to reload the last (aligned!) word, using regular (trapping) operations.
This is actually similar to the way you can rewrite Java programs to only require tests at buffer ends, split the code path, and then add a known-to-trap load in the case where that should have happened. :-)
Terje
-- - "almost all programming can be viewed as an exercise in caching"
|
|
 | | From: | Stephen Fuld | | Subject: | Re: RISC vs. CISC design principles | | Date: | Wed, 12 Jan 2005 22:01:33 GMT |
|
|
 | "Nick Maclaren" wrote in message news:cs41hi$62t$1@gemini.csx.cam.ac.uk...
snip
> Yes. But let's ignore that and go back to the alignment issue. > Speaking as a software engineer from way back: > > "Allowing unaligned memory access is probably the least useful > of common CISC features." > > In my career, I have never seen a significant use of it except to > cover up misdesigned interfaces - in particular, ones that have > failed to take the decision whether they are based on semi-abstract > types like integers and floating-point or on precisely specified > bit patterns.
While I don't doubt that is true, it is perhaps so due to your specializing in HPC and not, say, business data processing. Think COBOL, reports where the aesthetics of the output are more important than alignment considerations, dealing with arbitrary input files, etc.
-- - Stephen Fuld e-mail address disguised to prevent spam
|
|
 | | From: | Nick Maclaren | | Subject: | Re: RISC vs. CISC design principles | | Date: | 12 Jan 2005 23:25:45 GMT |
|
|
 | In article <17hFd.8944$7N1.38@bgtnsc04-news.ops.worldnet.att.net>, Stephen Fuld wrote:
>
>> Yes. But let's ignore that and go back to the alignment issue.
>> Speaking as a software engineer from way back:
>>
>> "Allowing unaligned memory access is probably the least useful
>> of common CISC features."
>>
>> In my career, I have never seen a significant use of it except to
>> cover up misdesigned interfaces - in particular, ones that have
>> failed to take the decision whether they are based on semi-abstract
>> types like integers and floating-point or on precisely specified
>> bit patterns.
>
>While I don't doubt that is true, it is perhaps so due to your specializing
>in HPC and not say business data processing. Think COBOL, reports where the
>aesthetics of the output are more important than alignment considerations,
>dealing with arbitrary input files, etc.
Where did you get the idea from that I specialised in HPC for most of my career? I can assure you that is not so.
Firstly, formatted I/O is irrelevant, as that is always treated as characters on modern machines.
Secondly, the paragraph that you snipped explains why all portable programs (and most correct ones) use packing and unpacking primitives when dealing with arbitrary (binary) input files.
Please note that I have written serious code to convert MVS SL tapes using most BSAM/QSAM formats to Unix tar files and MS-DOS and MacOS ZIP files. And vice versa. Plus a good many related tasks, including reading M-bit data on N-bit systems. That is about as 'commercial' an application as you get :-)
And I have always been very much into producing aesthetic output, not least because properly aligned tables are a damn sight easier to check than unaligned ones, and I spent more years in the statistical area than the HPC one.
No, sorry. I was posting more with a 'commercial' hat on than an HPC one.
Regards, Nick Maclaren.
|
|
 | | From: | Stephen Fuld | | Subject: | Re: RISC vs. CISC design principles | | Date: | Thu, 13 Jan 2005 17:03:46 GMT |
|
|
 | "Nick Maclaren" wrote in message news:cs4blp$pvo$1@gemini.csx.cam.ac.uk...
> In article <17hFd.8944$7N1.38@bgtnsc04-news.ops.worldnet.att.net>,
> Stephen Fuld wrote:
>>
>>> Yes. But let's ignore that and go back to the alignment issue.
>>> Speaking as a software engineer from way back:
>>>
>>> "Allowing unaligned memory access is probably the least useful
>>> of common CISC features."
>>>
>>> In my career, I have never seen a significant use of it except to
>>> cover up misdesigned interfaces - in particular, ones that have
>>> failed to take the decision whether they are based on semi-abstract
>>> types like integers and floating-point or on precisely specified
>>> bit patterns.
>>
>>While I don't doubt that is true, it is perhaps so due to your specializing
>>in HPC and not say business data processing. Think COBOL, reports where the
>>aesthetics of the output are more important than alignment considerations,
>>dealing with arbitrary input files, etc.
>
> Where did you get the idea from that I specialised in HPC for most
> of my career? I can assure you that is not so.
I'm sorry for the mistake. Obviously, I only know you from your posts here and I inferred (apparently incorrectly) that most of your experience was with HPC.
> > Firstly, formatted I/O is irrelevant, as that is always treated as > characters on modern machines.
Yes, but business data processing seems to do more of it than, say, HPC.
> Secondly, the paragraph that you snipped explains why all portable > programs (and most correct ones) use packing and unpacking primitives > when dealing with arbitrary (binary) input files.
But don't these primitives benefit from being able to handle unaligned data efficiently?
-- - Stephen Fuld e-mail address disguised to prevent spam
|
|
 | | From: | Nick Maclaren | | Subject: | Re: RISC vs. CISC design principles | | Date: | 13 Jan 2005 17:49:49 GMT |
|
|
 | In article , Stephen Fuld wrote:
>
>> Secondly, the paragraph that you snipped explains why all portable
>> programs (and most correct ones) use packing and unpacking primitives
>> when dealing with arbitrary (binary) input files.
>
>But don't these primitives benefit from being able to handle unaligned data
>efficiently?
Yes and no. Because of the endian and other problems I mentioned, there is little point in accessing the data DIRECTLY - macros or functions are always a better solution. And the difference in efficiency between using (say) unaligned integer loads and loading a character at a time is usually small.
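As an illustration of the kind of primitive meant here, a 32-bit field can be packed and unpacked one character at a time as sketched below. The helper names `unpack_be32` and `pack_be32` are hypothetical, and big-endian byte order is assumed purely for the example; the point is that the code behaves the same on any host endianness and at any alignment.

```c
#include <stdint.h>

/* Hypothetical helpers: read/write a 32-bit big-endian field at any byte
   offset, using only single-byte accesses. */
uint32_t unpack_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
            (uint32_t)p[3];
}

void pack_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}
```

For a record held in `unsigned char rec[]`, a field starting at odd offset 1 is read with `unpack_be32(rec + 1)`; no unaligned word access is ever issued.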
Regards, Nick Maclaren.
|
|
 | | From: | Maynard Handley | | Subject: | Unaligned accesses (was Re: RISC vs. CISC design principles) | | Date: | Thu, 13 Jan 2005 01:46:34 GMT |
|
|
 | In article , nmm1@cus.cam.ac.uk (Nick Maclaren) wrote:
> In article <1105559795.307333.292340@c13g2000cwb.googlegroups.com>, > wrote: > >"At current hardware budgets, the aligned memory access > >requirement is probably the least useful of the RISC mechanisms." > > > >I, respectfully, disagree. > > > >At current hardware budgets, the least useful RISC mechanism is the > >fixed length instruction format. Both Intel and AMD have shown that > >they/we can decode just as many instructions per unit time as the RISC > >guys. > > Yes. But let's ignore that and go back to the alignment issue. > Speaking as a software engineer from way back: > > "Allowing unaligned memory access is probably the least useful > of common CISC features." > > In my career, I have never seen a significant use of it except to > cover up misdesigned interfaces - in particular, ones that have > failed to take the decision whether they are based on semi-abstract > types like integers and floating-point or on precisely specified > bit patterns. > > The point is that the former have no trouble with padding being > inserted to create alignment, and the latter are uniformly better > done by the use of packing and unpacking primitives because there > are almost certainly other things to fix up than alignment (e.g. > endianness). >
You obviously have never programmed AltiVec, have you, Nick?
While I understand why AltiVec does not allow for unaligned accesses, and accept that it may well have been and continue to be the correct tradeoff, the fact is that it is a pain to deal with. And, Nick, please don't give me any BS about how properly designed code would not require this. If you've no experience with either AltiVec programming or modern day audio and video compression algorithms, you're not in a position to make this claim.
On the other hand, regarding unaligned instructions: is density of instructions (either inability to load them fast enough, or capacity of I1$ or I-TLB) both really a big deal AND only about a factor of 1.5 off from ideal, meaning that unaligned instructions are worthwhile? The window for codes that meet both these requirements strikes me as pretty small, and I'd have to see some real evidence that the costs of I1$ misses (high but infrequent) are larger than the costs of an extra few cycles on branch misses (fewer cycles but frequent), not to mention the extra power and associated issues.
Maynard
|
|
 | | From: | Andrew Reilly | | Subject: | Re: Unaligned accesses (was Re: RISC vs. CISC design principles) | | Date: | Thu, 13 Jan 2005 14:00:05 +1100 |
|
|
 | On Thu, 13 Jan 2005 02:46:34 +0000, Maynard Handley wrote:
> In article , > nmm1@cus.cam.ac.uk (Nick Maclaren) wrote: > >> In article <1105559795.307333.292340@c13g2000cwb.googlegroups.com>, >> wrote: >> >"At current hardware budgets, the aligned memory access >> >requirement is probably the least useful of the RISC mechanisms." >> > >> >I, respectfully, disagree. >> > >> >At current hardware budgets, the least useful RISC mechanism is the >> >fixed length instruction format. Both Intel and AMD have shown that >> >they/we can decode just as many instructions per unit time as the RISC >> >guys. >> >> Yes. But let's ignore that and go back to the alignment issue. >> Speaking as a software engineer from way back: >> >> "Allowing unaligned memory access is probably the least useful >> of common CISC features." >> >> In my career, I have never seen a significant use of it except to >> cover up misdesigned interfaces - in particular, ones that have >> failed to take the decision whether they are based on semi-abstract >> types like integers and floating-point or on precisely specified >> bit patterns. >> >> The point is that the former have no trouble with padding being >> inserted to create alignment, and the latter are uniformly better >> done by the use of packing and unpacking primitives because there >> are almost certainly other things to fix up than alignment (e.g. >> endianness). >> > > You obviously have never programmed AltiVec, have you, Nick?
What's that got to do with anything? (I haven't programmed AltiVec, per se, myself. If a compiler has done it on my behalf, good on it. If a compiler hasn't been able to do it on my behalf, then perhaps that says something about the architecture of AltiVec.)
> While I understand why AltiVec does not allow for unaligned accesses, > and accept that it may well have been and continue to be the correct > tradeoff, the fact is that it is a pain to deal with. And, Nick, please > don't give me any BS about how properly designed code would not require > this. If you've no experience with either AltiVec programming or modern > day audio and video compression algorithms, you're not in a position to > make this claim.
I would say that modern-day audio and video compression standards are a good example of file (and communication) formats done *well*, by Nick's standards, as they are universally (in my experience) defined in terms of packed bit-strings, rather than fwrite(c-struct) /* and-hope-it-ports-ok, later */, which was what Nick was complaining about (I believe).
At an audio *algorithm* level, rather than file format level, I've never encountered anything that would enforce or encourage unaligned floating point accesses, which is just as well, since most of the DSPs I code for are still word-addressed.
-- Andrew
|
|
 | | From: | Maynard Handley | | Subject: | Re: Unaligned accesses (was Re: RISC vs. CISC design principles) | | Date: | Fri, 14 Jan 2005 02:45:16 GMT |
|
|
 | In article , Andrew Reilly wrote:
> On Thu, 13 Jan 2005 02:46:34 +0000, Maynard Handley wrote: > > > In article , > > nmm1@cus.cam.ac.uk (Nick Maclaren) wrote: > > > >> In article <1105559795.307333.292340@c13g2000cwb.googlegroups.com>, > >> wrote: > >> >"At current hardware budgets, the aligned memory access > >> >requirement is probably the least useful of the RISC mechanisms."
> >> In my career, I have never seen a significant use of it except to > >> cover up misdesigned interfaces - in particular, ones that have > >> failed to take the decision whether they are based on semi-abstract > >> types like integers and floating-point or on precisely specified > >> bit patterns. > >> > > > > You obviously have never programmed AltiVec, have you, Nick? > > What's that got to do with anything? (I haven't programmed AltiVec, > per-se, myself. If a compiler has done it on my behalf, good on it. If a > compiler hasn't been able to do it on my behalf, then perhaps that says > something about the architecture of AltiVec.)
So let's review. Nick says "unaligned memory access is not very useful". I say that it sodding well is useful, it's a shame (though very understandable) that AltiVec does not provide it, and that ten years of experience working with modern codecs has shown me many many situations where material is NOT "naturally" aligned.
Your response to that, parroted by Nick, is "I've never actually used AltiVec (but you're wrong anyway), and by the way modern codecs do a fine job of describing how the bit stream is packed". Excuse me for barfing at the sheer pointlessness of this reply, since the packedness of the material in the bitstream has ZERO to do with the issue of how well it is adapted to naturally aligned packing. Heck, even the most clueless undergrad should know that the first stage in decoding data (or the last stage in encoding data) consists of bit-parsing and twiddling to handle the entropy coding, usually followed by a table lookup. It's only at that point that you handle modelling (transforms, motion comp and so on) which is where something like AltiVec is useful.
My whole point was that the specific nature of these codecs (for example the way that H264 breaks the image up into variable sized blocks which can be as small as 4x4) means that however you slice and dice the problem (and you have complete control over the memory structures --- these are all internal) you're going to spend a lot of your time wanting to load vectors that are not aligned to a multiple of 16.
If Nick wants to say that unaligned memory access is not useful for his little corner of the world, a corner that does not deal with multi-media, that's fine. But Nick, as is his way, is very fond of making grandiose claims for the entire freaking computer universe.
(More about audio algorithms below.)
> > While I understand why AltiVec does not allow for unaligned accesses, > > and accept that it may well have been and continue to be the correct > > tradeoff, the fact is that it is a pain to deal with. And, Nick, please > > don't give me any BS about how properly designed code would not require > > this. If you've no experience with either AltiVec programming or modern > > day audio and video compression algorithms, you're not in a position to > > make this claim. > > I would say that modern-day audio and video compression standards are a > good example of file (and communication) formats done *well*, by Nick's > standards, as they are universally (in my experience) defined in terms of > packed bit-strings, rather than fwrite(c-struct) /* and-hope-it-ports-ok, > later */, which was what Nick was complaining about (I believe). > > At an audio *algorithm* level, rather than file format level, I've never > encountered anything that would enforce or encourage unaligned floating > point accesses, which is just as well, since most of the DSPs I code for > are still word-addressed.
So, for example, if one is dealing with, say, MPEG audio, one is faced with the problem of computing the convolution at pretty much the last stage of the algorithm, using an index that increments by one each iteration --- meaning that 3 times out of 4 the data one wants to load is not naturally aligned with AltiVec 16-byte wide (ie 4 fp wide) registers.
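The "3 times out of 4" figure is plain address arithmetic, which the small sketch below spells out. It assumes 4-byte floats and an array whose base is 16-byte aligned; the function name is made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: of the window start indices 0..n-1 into a float
   array whose base is 16-byte aligned, count those whose byte offset is a
   multiple of 16, i.e. where a 4-float vector load would be aligned. */
size_t count_aligned_starts(size_t n)
{
    size_t aligned = 0;
    for (size_t i = 0; i < n; i++) {
        size_t byte_offset = i * sizeof(float);  /* assumes sizeof(float) == 4 */
        if (byte_offset % 16 == 0)
            aligned++;
    }
    return aligned;
}
```

With an index that steps by one, only every fourth start (i = 0, 4, 8, ...) lands on a 16-byte boundary, so three quarters of the loads are misaligned however the buffer itself is placed.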
Maynard
|
|
 | | From: | Andrew Reilly | | Subject: | Re: Unaligned accesses (was Re: RISC vs. CISC design principles) | | Date: | Fri, 14 Jan 2005 17:00:23 +1100 |
|
|
 | On Fri, 14 Jan 2005 03:45:16 +0000, Maynard Handley wrote:
> So let's review. > Nick says "unaligned memory access is not very useful".
And the context was "RISC vs CISC", and (to me) the unalignedness was in terms of individual words of whatever sort. Natural alignment of data types. That's not a really big restriction, and I'll stick by saying that it's no biggie.
> I say that it sodding well is useful, it's a shame (though very > understandable) that AltiVec does not provide it, and that ten years of > experience working with modern codecs has shown me many many situations > where material is NOT "naturally" aligned.
And it seems, now, that you've leapt in and said that because AltiVec requires alignment not just on floating point boundaries but on entire fixed-length vectors of them, that "unaligned access" (without further restriction) is necessary. Of course. However understandable (your words), that does seem to be a pretty crippling deficiency of AltiVec, particularly for nearly all of the audio signal processing algorithms that I can think of. How "RISC" is AltiVec if compilers can't use it to help speed up existing algorithms and existing code? Is it RISC just because it has a monumental alignment restriction?
All I can say to that argument is that it's a pretty daft extension to the notion of "natural alignment", particularly if the object of the exercise is to be able to compute existing numeric algorithms efficiently, rather than just being able to claim the best peak flops numbers.
> Your response to that, parroted by Nick, is > "I've never actually used AltiVec (but you're wrong anyway), and by the > way modern codecs do a fine job of describing how the bit stream is > packed". > Excuse me for barfing at the sheer pointlessness of this reply, since > the packedness of the material in the bitstream has ZERO to do with the > issue of how well it is adapted to naturally aligned packing.
Try reading the thread again, after your barf. The issue being responded-to was "unaligned" values occurring in popular (but perhaps poorly or unfortunately specced) file and wire formats. The sentence above makes no sense at all in the context of the discussion.
> Heck, even
> the most clueless undergrad should know that the first stage in decoding
> data (or the last stage in encoding data) consists of bit-parsing and
> twiddling to handle the entropy coding, usually followed by a table
> lookup.
Yup. Nicely defined, and access not susceptible to endianness or alignment issues. Not like many disk file and network protocols, which are pretty much defined as fwrite(some_C_struct, sizeof(*some_C_struct), 1, desc), on some specific computer system, to the eventual annoyance of anyone using a system with different alignment / endianness / compiler struct padding / compiler switches / etc.
> It's only at that point that you handle modelling (transforms, motion > comp and so on) which is where something like AltiVec is useful.
You brought AltiVec up. Hadn't been mentioned before in the thread. We *had* been discussing file and wire formats and alignment issues, though.
> My whole point was that the specific nature of these codecs (for example > the way that H264 breaks the image up into variable sized blocks which > can be as small as 4x4) means that however you slice and dice the > problem (and you have complete control over the memory structures --- > these are all internal) you're going to spend a lot of your time wanting > to load vectors that are not aligned to a multiple of 16.
Yup. That's how maths works. You don't, however, ever need to read any of those individual floating point numbers from non-aligned addresses.
> If Nick wants to say that unaligned memory access is not useful for his > little corner of the world, a corner that does not deal with > multi-media, that's fine. But Nick, as is his way, is very fond of > making grandiose claims for the entire freaking computer universe. > > (More about audio algorithms below.) > >> > While I understand why AltiVec does not allow for unaligned accesses, >> > and accept that it may well have been and continue to be the correct >> > tradeoff, the fact is that it is a pain to deal with. And, Nick, >> > please don't give me any BS about how properly designed code would >> > not require this. If you've no experience with either AltiVec >> > programming or modern day audio and video compression algorithms, >> > you're not in a position to make this claim. >> >> I would say that modern-day audio and video compression standards are a >> good example of file (and communication) formats done *well*, by Nick's >> standards, as they are universally (in my experience) defined in terms >> of packed bit-strings, rather than fwrite(c-struct) /* >> and-hope-it-ports-ok, later */, which was what Nick was complaining >> about (I believe). >> >> At an audio *algorithm* level, rather than file format level, I've >> never encountered anything that would enforce or encourage unaligned >> floating point accesses, which is just as well, since most of the DSPs >> I code for are still word-addressed. > > So, for example, if one is dealing with, say, MPEG audio, one is faced > with the problem of computing the convolution at pretty much the last > stage of the algorithm, using an index that increments by one each > iteration --- meaning that 3 times out of 4 the data one wants to load > is not naturally aligned with AltiVec 16-byte wide (ie 4 fp wide) > registers.
Well, that sucks. Doesn't AltiVec have permutation operations to at least help with that sort of thing?
Is there no scope for doing the loop-order inversion trick, so that the words in your altivec vectors are successive bins, and the shifting-order index is over blocks of bins? That tends to need more memory bandwidth than the in-register accumulator approach, but maybe machines with AltiVec have such bandwidth (in cache, anyway)?
I'd just note that AltiVec and its restrictions don't by any means define the universe of multimedia and audio implementation strategies. Lots of that still takes place on DSPs and other embedded processors that work just fine one word at a time.
-- Andrew
|
|
 | | From: | Christian Bau | | Subject: | Re: Unaligned accesses (was Re: RISC vs. CISC design principles) | | Date: | Fri, 14 Jan 2005 08:51:17 +0000 |
|
|
 | In article , Andrew Reilly wrote:
> Well, that sucks. Doesn't AltiVec have permutation operations to at least > help with that sort of thing?
Obviously "aligned" vs. "unaligned" is always in terms of what you are trying to process. If you try to process a single floating point number, then four byte alignment = aligned, anything else = unaligned. If you try to process vectors of four floating point numbers, then sixteen bytes = aligned, anything else = unaligned. Especially floating-point aligned != vector aligned.
And obviously AltiVec has permutation operations; there are all kinds of tricks that you can use to make things faster (let's just say it smokes anything that is on any Intel processor). That doesn't change the fact that without alignment restrictions, some of these tricks wouldn't be needed.
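To make the kind of trick concrete, here is a portable C emulation of the usual approach on hardware that only has aligned vector loads: fetch the two aligned 16-byte blocks that cover the wanted data, then select the 16 bytes of interest. This is a sketch, not AltiVec code; `vec16`, `load_aligned` and `load_unaligned` are names invented here, with the byte-selection loop standing in for a hardware permute.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for a 128-bit vector register (hypothetical type). */
typedef struct { unsigned char b[16]; } vec16;

/* Emulates an aligned-only vector load: the low four address bits are
   ignored, so the enclosing 16-byte block is what gets loaded. */
static vec16 load_aligned(const unsigned char *p)
{
    vec16 v;
    memcpy(v.b, p - ((uintptr_t)p % 16), 16);
    return v;
}

/* Builds an unaligned 16-byte load from two aligned loads plus a
   byte selection (the role a permute instruction would play). */
vec16 load_unaligned(const unsigned char *p)
{
    size_t shift = (uintptr_t)p % 16;
    vec16 lo = load_aligned(p);
    vec16 hi = load_aligned(p + 15);  /* same block as lo when p is aligned */
    vec16 out;
    for (size_t i = 0; i < 16; i++) {
        size_t k = shift + i;
        out.b[i] = (k < 16) ? lo.b[k] : hi.b[k - 16];
    }
    return out;
}
```

Using `p + 15` rather than `p + 16` for the second load means an already aligned pointer never touches the following block. The extra load and the selection step are exactly the work that a hardware unaligned load would make unnecessary.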
> I'd just note that AltiVec and its restrictions don't by any means define > the universe of multimedia and audio implementation strategies. Lots of > that still takes place on DSPs and other embedded processors that work > just fine one word at a time.
The discussion was not how much one particular processor is used; the discussion was about the importance of unaligned accesses. Vector processors need unaligned access much more than non-vector processors.
|
|
 | | From: | Nick Maclaren | | Subject: | Re: Unaligned accesses (was Re: RISC vs. CISC design principles) | | Date: | 14 Jan 2005 08:56:57 GMT |
|
|
 | In article , Christian Bau wrote: >In article , > Andrew Reilly wrote: > >> Well, that sucks. Doesn't AltiVec have permutation operations to at least >> help with that sort of thing? > >Obviously "aligned" vs. "unaligned" is always in terms of what you are >trying to process. If you try to process a single floating point number, >then four byte alignment = aligned, anything else = unaligned. If you >try to process vectors of four floating point numbers, then sixteen >bytes = aligned, anything else = unaligned. Especially floating-point >aligned != vector aligned.
That's bonkers. What alignment does it require for vectors of length three, or doesn't it allow them?
Regards, Nick Maclaren.
|
|
 | | From: | Nick Maclaren | | Subject: | Re: Unaligned accesses (was Re: RISC vs. CISC design principles) | | Date: | 13 Jan 2005 10:54:05 GMT |
|
|
 | In article , Andrew Reilly writes: |> |> I would say that modern-day audio and video compression standards are a |> good example of file (and communication) formats done *well*, by Nick's |> standards, as they are universally (in my experience) defined in terms of |> packed bit-strings, rather than fwrite(c-struct) /* and-hope-it-ports-ok, |> later */, which was what Nick was complaining about (I believe).
Precisely.
Regards, Nick Maclaren.
|
|
 | | From: | John Savard | | Subject: | Re: Unaligned accesses (was Re: RISC vs. CISC design principles) | | Date: | Sat, 15 Jan 2005 17:45:08 GMT |
|
|
 | On Thu, 13 Jan 2005 01:46:34 GMT, Maynard Handley wrote, in part:
>You obviously have never programmed AltiVec, have you, Nick? > >While I understand why AltiVec does not allow for unaligned accesses, >and accept that it may well have been and continue to be the correct >tradeoff, the fact is that it is a pain to deal with.
AltiVec is a feature similar to MMX.
It works with small vectors which contain several items of a given data type.
It certainly is true that forcing these vectors to be aligned on a 128-bit boundary will impact many perfectly legitimate programming operations.
But that doesn't change the fact that it is very seldom necessary to allow a 64-bit floating-point number to start on an odd 32-bit boundary, and so on. If one has a compressed record format that includes 32-bit integer fields starting at odd bytes, one just uses byte instructions to construct the records.
Putting only a few extra gates on a chip to allow unaligned accesses, and then warning programmers that these accesses will have a performance penalty, so they should not be used unless really needed, is usually the best tradeoff, though. It eliminates a potential source of confusion and error at the lowest cost.
Pipelined arithmetic units allow for vector operations which allow overlapped, rather than simultaneous, operation on successive vector elements. The Cray and its predecessors are examples of this. While there's nothing wrong with having a parallel vector unit as well, it can be pipelined too, and vectorized as well: that is, given vector instructions that act on vectors whose length is a multiple of the length of the vectors on which it operates as elementary units.
Thus, when the fast wide arithmetic unit won't do, just use a vector instruction on the slow narrow arithmetic unit. Since they're two different arithmetic units, they could even be running at the same time, so that rather than having fewer FLOPS by using the slower arithmetic unit occasionally, one ends up with more FLOPS!
John Savard http://home.ecn.ab.ca/~jsavard/index.html
|
|
 | | From: | Bernd Paysan | | Subject: | Re: RISC vs. CISC design principles | | Date: | Thu, 13 Jan 2005 12:30:19 +0100 |
|
|
 | Paul A. Clayton |
|
|