Thursday, September 27, 2007

Herb Sutter's Talk on Machine Architecture

Herb Sutter says this is the first talk video for programmers in 25 years that he has recommended as a "Must View". The video is titled

Machine Architecture: Things Your Programming Language Never Told You
Unfortunately, the video is not the best quality, so you will need the PDF slides of the presentation - available at Sutter's blog.

Know Your Enemy: Latency

or rather,

KNOW YOUR ENEMY
LATENCY


To summarize, the talk describes how important latency is in computation today - an overwhelming fraction of the effort, money, and space of computing resources is dedicated to reducing latency. For example, 99% of the transistors in a modern CPU are used to "hide" or reduce latency, and only 1% for computation. Much of today's programming effort also has to be dedicated to reducing latency. Indeed, latency is the root of nearly all complexity in hardware.


Latency Lags Bandwidth

Reduction in latency has not kept pace with growth in bandwidth. For example, over 20 years, "CPU bandwidth" (computation speed) has grown 2,250X, but latency has decreased only 21X. This means the relative contribution of latency keeps growing. Memory latency is now 210 cycles, up from 1.4 cycles in 1980 - 150X worse. Therefore, registers and the L1 and L2 caches have become much more important.

Our Programming Model Must Graduate

Our programming model must graduate from the obsolete CPU-Memory-Disk view to include:
- CPU: CPUs, Registers, L1 Cache - Instruction (2 cycles) and Data (3 cycles), L2 Cache (14 cycles), Store Buffers
- Memory: DRAMs (200 cycles), Flash (3M cycles) and other nodes
- Disk: Flash, Hard Disk including compressed files (15M cycles), Network
These numbers give a sense of how important it is to change the computation model - see the figure here.

Little's Law says Concurrency = Bandwidth x Latency, and it finds applications in many areas of science. Therefore, we have to add concurrency everywhere to cope with latency. This may even affect the correctness of your code, not just its performance: compilers optimize code so aggressively that they can break some implicit (incorrect) assumptions programmers often make.
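For a rough sense of what Little's Law implies for memory, here is a minimal sketch with made-up figures (my numbers, not from the talk):

    // Little's Law: concurrency = bandwidth x latency.
    // The figures below are hypothetical, for illustration only.
    #include <iostream>

    int main() {
        const double latency_cycles   = 200.0; // one DRAM access, in cycles
        const double target_bandwidth = 0.5;   // memory requests completed per cycle
        const double concurrency = target_bandwidth * latency_cycles;
        // About 100 requests must be in flight at all times to sustain
        // that bandwidth despite the 200-cycle latency.
        std::cout << "Requests in flight needed: " << concurrency << "\n";
        return 0;
    }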

In summary, to preserve the correctness of your code, Sutter recommends the following:
* Use locks to protect the shared variables.
* Avoid lock-free code since it is too difficult.
* Avoid fences or memory barriers.
There is a scenario in which taking locks conditionally is inadvisable: the compiler may rearrange the code so that some of it executes outside the condition. So take locks pessimistically, even when updates are conditional and you believe a particular path does not need the lock.
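Here is a minimal C++ sketch of the pitfall (my own illustration; I use C++11's std::mutex and std::lock_guard for brevity, which postdate the talk):

    #include <mutex>

    std::mutex m;   // protects x
    int x = 0;      // shared with other threads

    void update(bool cond) {
        // Intent: write x only when cond holds, so one might think the
        // lock is unnecessary on the cond == false path:
        //
        //     if (cond) x = 42;
        //
        // But a legal compiler transformation (a speculative store) is:
        //
        //     r1 = x;
        //     x  = 42;
        //     if (!cond) x = r1;   // x is written even when cond is false!
        //
        // That unconditional write races with other threads' accesses to x.
        // The fix: take the lock pessimistically, whether or not cond holds.
        std::lock_guard<std::mutex> guard(m);
        if (cond)
            x = 42;
    }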

To exploit locality, memory allocation and memory access patterns become important.
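A classic illustration of access patterns (my own sketch, not from the talk): traversing a row-major matrix along rows touches memory sequentially and stays in cache, while traversing along columns jumps a whole row ahead on every access and pays the memory latency again and again.

    #include <vector>

    const size_t N = 4096;

    // Sum a row-major N x N matrix two ways.
    double sum_by_rows(const std::vector<double>& a) {
        double s = 0;
        for (size_t i = 0; i < N; ++i)      // cache-friendly: consecutive
            for (size_t j = 0; j < N; ++j)  // elements, one miss per cache line
                s += a[i * N + j];
        return s;
    }

    double sum_by_columns(const std::vector<double>& a) {
        double s = 0;
        for (size_t j = 0; j < N; ++j)      // cache-hostile: stride of N doubles,
            for (size_t i = 0; i < N; ++i)  // nearly every access misses the cache
                s += a[i * N + j];
        return s;
    }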

Saturday, September 15, 2007

First Encounter with Intel's Threading Building Blocks

Herb Sutter mailed about Intel's upcoming webinar series. The first in the series is Sutter's own talk on September 25, The Concurrency Revolution. Here is part of the abstract he posted:
We are now seeing the initial stages of the next major change in software development, as over the next few years the software industry brings concurrency pervasively into mainstream software development, just as it has done in the past for objects, garbage collection, generics, and other technologies.

The next speaker in the webinar series is James Reinders, on the topic "Steps to Parallelism NOW". He is also the author of "Intel® Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism". Chapter 1 of the book is available for download here (excerpt below in this article).

Intel® Threading Building Blocks

The Intel site says the "Intel Threading Building Blocks" library offers ready-to-use parallel algorithms, cross-platform support, task-based parallelism, highly concurrent containers, and generic algorithms. More on this here.
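As a taste of the library, here is a minimal sketch of tbb::parallel_for in the functor style of the early TBB releases (the array, scale factor, and grain size are my own choices, for illustration):

    #include "tbb/task_scheduler_init.h"
    #include "tbb/blocked_range.h"
    #include "tbb/parallel_for.h"

    // Body functor: TBB splits the range into chunks and hands each chunk,
    // as a task, to whichever worker thread steals it.
    struct Scale {
        float* a;
        float factor;
        void operator()(const tbb::blocked_range<size_t>& r) const {
            for (size_t i = r.begin(); i != r.end(); ++i)
                a[i] *= factor;
        }
    };

    int main() {
        tbb::task_scheduler_init init; // starts the worker threads (needed in early TBB)
        const size_t n = 1000000;
        float* a = new float[n];
        for (size_t i = 0; i < n; ++i) a[i] = 1.0f;

        Scale body = { a, 2.0f };
        // Grain size 10000: chunks below this size run serially, keeping
        // task overhead small relative to useful work.
        tbb::parallel_for(tbb::blocked_range<size_t>(0, n, 10000), body);

        delete[] a;
        return 0;
    }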

The library is open source, released under GPLv2 with the runtime exception - which means it is as good as using gcc's libstdc++ in your application, and I think (and please correct me if I'm wrong) commercial applications can use the library without restrictions and without releasing their code as open source.

Some related notes on Threading Building Blocks:

Why Too Many Threads Hurts Performance, and What to do About It

This article discusses the impact of having too many threads.
Partitioning a fixed amount of work among too many threads gives each thread so little work that the overhead of starting and terminating threads swamps the useful work.
Since the most efficient number of compute threads depends on the hardware, dividing the program directly into threads is not a good way to do multithreaded programming. It is often better to formulate the program as logical tasks, which are much more lightweight than logical threads. Task-based programming also allows tasks to be distributed to processors unfairly, giving up fairness to gain efficiency. It is easier, too, since the programmer is not forced to think about low-level parallelism, and scalability is easier to achieve with a task-based approach.

The article gives the example of summing the values in a tree. Under breadth-first traversal, parallelism is maximized, but so is memory consumption, because the number of tasks doubles at each level - soon there are many more tasks than the hardware parallelism can support. Depth-first traversal reduces both parallelism and memory consumption. In addition, the queue holding the tasks is a shared resource and may become a bottleneck due to communication overhead.

A modern method that addresses both problems is to keep a separate queue for each thread and to "work depth-first, steal breadth-first". Memory consumption stays low, while there are just enough tasks to keep the hardware busy. This technique was used by MIT's Cilk language project.
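A rough sketch of the tree-sum idea (my own code; I use C++11's std::async instead of TBB's task scheduler, and the cutoff depth of 3 is arbitrary). Near the root it exposes a little breadth for other workers; below the cutoff it is plain depth-first recursion with no task overhead:

    #include <future>

    struct Node {
        long value;
        Node* left;
        Node* right;
    };

    long sum(Node* n, int depth) {
        if (!n) return 0;
        if (depth >= 3) // cutoff: below this, ordinary depth-first recursion
            return n->value + sum(n->left, depth + 1) + sum(n->right, depth + 1);
        // Near the root, spawn the left subtree as a separate task (breadth
        // available to other threads) while this thread continues depth-first.
        std::future<long> lhs =
            std::async(std::launch::async, sum, n->left, depth + 1);
        long rhs = sum(n->right, depth + 1);
        return n->value + lhs.get() + rhs;
    }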

Excerpts from Chapter 1 of "Intel Threading Building Blocks" by James Reinders
When considering it for C programs, OpenMP has been referred to as “excellent for Fortran-style code written in C.” That is not an unreasonable description of OpenMP since it focuses on loop structures and C code. OpenMP offers nothing specific for C++. Looking to address weaknesses in OpenMP, Threading Building Blocks is designed for C++, and thus to provide the simplest possible solutions for the types of programs written in C++.

Sunday, September 02, 2007

Hi!


[Please Note: This blog is no longer updated. Please visit www.cattywumpus.in for updated articles.]

Amit Kumar

PhD Student
Interactive Intelligence Lab

Electrical and Computer Engineering
National University of Singapore


email: amit_kumar at nus dot edu.sg




Welcome to my page. I intend to post most of my research-related work here. All the entries on this site are always under construction.