Preparing For Future Software: AMD's Lightweight Profiling Proposal
by Ryan Smith on August 16, 2007 12:00 AM EST- Posted in
- CPUs
First Thoughts
In spite of the various performance opportunities offered by LWP it's important to note that this is by no means a silver bullet. Profiling tools in general are a very fine tool used to get the last, smallest parts of performance out of an application. Improved tools aren't a replacement for code that's better designed, or compilers that are better at identifying parallelizable code or more in tune with how a specific processor design performs. LWP won't solve the problem of poor implementations here hurting performance far more than a profiler can ever help.
More fundamentally however is that this is just a proposal, which is unusual for an industry that has silicon and roadmaps to go with new technology announcements. AMD has not announced what chip of theirs will be the first to implement the technology, and with Barcelona shipping this month we can safely assume that this technology won't be in that chip. The earliest we would see this technology would be in late 2008 with the Shanghai core, if not in 2009 with Bulldozer. Given the average two-year development cycle for most programs, this would push programs that make significant use of the technology out to 2010 and beyond, a long way away in the fast-changing computer industry.
There are also some miscellaneous issues that bear mentioning. Operating system support of the new instructions is required, and meanwhile Windows Vista does away with the ability to use the traditional interrupt controls; in a slightly humorous tone given the open-community nature of this proposal, AMD states "So let's figure one [method] out, or create a new one, or share an existing interrupt." There's also the question of Intel following AMD along with this as they have done on AMD's past few instruction set extensions; such a feature would be far more useful if both major vendors supported it.
Beyond LWP, AMD has indicated that there will be further new specifications released as part of the Extensions for Software Parallelism initiative. At this point AMD isn't offering any concrete details on what those will be, but our best guess would be general-use instructions for enhancing multithreading performance to go along with new hardware, and possibly more tools for developers. The former in particular was precedent in the x86 instruction set as Intel released a pair of such instructions as part of SSE3 for improving the performance of their now-abandoned hyperthreading technology.
With all of that said, we're left in an interesting situation where we're looking at a very interesting AMD proposal, but not a whole lot of hard numbers to back things up. For developers this proposal and the new instructions could be of significant help, but we're left without a way to measure or even predict the value of "significant." As we stated earlier, similar tools have offered an exceptional benefit to developers on embedded/closed platforms using standardized hardware, but we have to keep in mind that this isn't a perfect predictor for a general purpose computer.
Until AMD hammers out the finer details of their proposal and releases some hardware to go with it, we are going to be left with bated breath waiting for AMD to move this proposal to the next stage. We are looking forward to seeing what developers can do with such a dramatic leap in the ability of their profiling tools in the coming years.
11 Comments
View All Comments
MadBoris - Thursday, August 16, 2007 - link
I can't help but wonder what advantages this would have over Intel's existing open source TBB. Threading Building Blocks 2.0 seems to be a pretty robust runtime library to be able to do the hard work covering some of the most difficult things to manage currently...Efficient parallelism.
Automatic load balancing.
Easier thread management.
Help with concurrency issues.
Plus Intels companion tools are awesome. Expensive but pretty nice.
TBB is also open source and multiplatform friendly.
I'm also curious to learn more about real world TBB experiences.
Anyway, it's good to hear more work is being done on this stuff and different approaches will always help.
We have a long way to go because as it is Quad cores and above will not really be leveraged like everyone assumes they will be. It will take more than market saturation of multicores for them to be used efficiently. Unless an application is manipulating data streams, which has an easily splittable workload, like encoding, rendering, compression, etc. then you really won't see the type of workload granularity from other types of applications in truly leveraging multicores. Core's beyond 2 will offer ever decreasing negligible results for quite a while in mainstream applications without some advances.
I hope to hear more about advances on this in future articles because as it is, it seems quad will be somewhat of a wall for us of any real beneficial performance, anything above that will really just serve as a good heater unless it's for a very specific application.
Ryan Smith - Thursday, August 16, 2007 - link
LWP isn't intended to be competition for TBB, rather it augments it (at least as much as TBB is beneficial on an AMD chip). TBB is compiler and library help, an essential part of extracting maximum performance, but it doesn't include anything as far as application profiling goes. LWP is the final link as far as that goes, once TBB has taken you as far as it can, you break out profilers and start looking at what your code is doing that could be causing any more performance bottlenecks.JumpingJack - Sunday, August 19, 2007 - link
I think one of the points he is making is that this is really not a methodology or instruction set change that is used for multithreading, rather a HW based profiler which is only useful during product development.Profiling is not new, so while AMD is proposing some unique intructions to get a realtime peak of the architectural state (profiling), it does not directly speed up multithreading by some new or novel algorithm. Basically, what AMD is proposing is pretty much already used at a software level to an extent.
MadBoris - Sunday, August 19, 2007 - link
Yeah, while profiling is necessary especially for multithreaded apps for optimizing and finding overhead, stalls, IPC issues, synchronized contention, etc., I'm far more interested in the core issues of leveraging cores for better parallel execution. Rather than a better topical ointment we need to address the core cause. Initially I had thought AMD was doing more than just profiling with the extra architecture, but that was due to my silly skimming.I also mentioned TBB because I think it also may be worthy of an article someday, it is an intriguing route to the core problems developers are facing in multithreading in the future. I'm curious how well it works, from engineers feedback, something Anandtech would have access to. Making a thread and throwing it to another core today is all well and good in removing thread contention for primary threads and giving other threads more CPU budget, but lower level multithreading and parallel advantages are currently limited for most types of apps due to inherent limitations. That's the real core issue in multithreading, improving scaling and fully leveraging several cores.
I'm interested to see what AMD's methods may bring in improved performance and maybe ways to gather a few more tidbits of data, much still depends on the software element and how it will present the data and how robust it will be during development time. Who knows, maybe they are putting the horse out first then bringing in the cart next, it would be the right order of things. Having a HW profiler in place first would make it easier to produce and test, a HW based multithreading optimizing approach, which would be stellar someday down the road.
MadBoris - Thursday, August 16, 2007 - link
Maybe LWP can lessen overhead w/ HW and even be more precise, as you say. Although their is a good bit you can retrieve in software, so it will be interesting to see what it is I am missing, maybe some deeper level CPU cache usage metrics maybe beneficial. Time will tell.yyrkoon - Friday, August 17, 2007 - link
Dude, do you even know what software profiling *is* ? I have no idea what TBB *is*, but I can tell you with 99.9% certainty is is not even remotely related(except perhaps that it is a set of instructions).Direct quote from wikipedia:
http://en.wikipedia.org/wiki/Performance_analysis">http://en.wikipedia.org/wiki/Performance_analysis
MadBoris - Friday, August 17, 2007 - link
Thank you for the definition. I just finished saying I have used Intels profilers in profiling applications, so I may know.As to TBB, it's a runtime tool for making threaded applications more efficient and easier to produce, not a profiler. I initially thought AMD's approach was going to be more than just profiling, but that's what I get for skimming articles.
MadBoris - Thursday, August 16, 2007 - link
And of course the Intel Threading tools which work in cooperation with TBB are very robust as well, which have been around for a while. I've used the Thread Checker and Threading Profiler and I like them pretty well, pretty impressive actually, although you can incur some serious overhead with the profilers if your not careful.http://www.intel.com/cd/software/products/asmo-na/...">http://www.intel.com/cd/software/products/asmo-na/...
The more the merrier, just was wondering what LWP was adding beyond what I was already familiar with on the Intel side.
DigitalFreak - Thursday, August 16, 2007 - link
Unless Microsoft puts the boot to Intel again, I doubt Intel will jump on board. They have too much of a "nowhere but here" mentality.DeepThought86 - Thursday, August 16, 2007 - link
I think this article could have been written with 30% less words