Thursday, July 29, 2010

CUDA Delivers Lifelike 3D Ultrasound Images of a Fetus, with Support for Stereoscopic-Glasses Viewing




This week Siemens Healthcare announced availability of its next-generation 3D/4D ultrasound software, which produces realistic, vivid images of a fetus. When we saw Dr. Roee Lazebnik of Siemens demo this at CUDA Day last February, we were frankly amazed by it.
Now that the product (called syngo.fourSight Workplace) is shipping, we contacted Roee, director of product development, for an update. Here's an excerpt from our interview:

NVIDIA: Roee, congratulations on the release. How does this technology take ultrasound to the next level?
Roee: This is the world's first commercially available product for stereoscopic visualization of ultrasound data. It's a breakthrough because it provides anatomical information that can enable better communication between physicians and patients and aid in pre- and post-natal surgical planning.

NVIDIA: The technology leverages CUDA-based NVIDIA Quadro FX solutions as well as NVIDIA 3D Vision. Why is the 3D functionality important?
Roee: When we interact with volumetric data on a 2D screen, our ability to appreciate anatomical subtleties and depth-based details, such as the curvature of a skeletal feature, is limited. It is much more intuitive to visualize 3D data in stereo.

NVIDIA: The technology is described as "3D/4D." What does that mean?
Roee: 3D involves visualizing a static volumetric image. The term 4D refers to visualizing dynamic anatomy, such as the moving limbs or facial gestures of the fetus.
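
To make the 3D/4D distinction concrete, here is a minimal data-layout sketch (my own illustration; the type and field names are hypothetical, not from Siemens' product): a 3D scan is a single volume of voxels, while a 4D scan is simply a time-ordered sequence of such volumes.

    // A 3D ultrasound scan: one static volume of voxel intensities.
    // Voxel (x, y, z) lives at index x + y*nx + z*nx*ny.
    struct Volume3D {
        int nx, ny, nz;      // volume dimensions
        float *voxels;       // nx * ny * nz samples
    };

    // A "4D" scan: the same anatomy re-acquired over time, so motion
    // (a fetus's limbs, a beating heart) can be rendered as playback.
    struct Volume4D {
        int frames;          // number of time steps
        Volume3D *frame;     // frame[t] is the volume at time t
    };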

NVIDIA: In what other ways could this technology be applied in the future?
Roee: In the human body, everything from the beating heart to the abdomen can be visualized in a volume mode, meaning it can be reconstructed and viewed from any orientation. This technology has many exciting potential benefits both in diagnosis and visual communication of findings.
– See Siemens Video: http://vimeo.com/9214016
– See more GPU computing applications and demos from CUDA Day on YouTube: http://www.youtube.com/watch?v=ZOGLkl9cFPw

Sunday, July 25, 2010

Kudos for CUDA
by Dr. Vincent Natoli, President and Founder, Stone Ridge Technology
July 06, 2010


It's been almost three years since GPU computing broke into the mainstream of HPC with the introduction of NVIDIA's CUDA API in September 2007. Adoption of the technology since then has proceeded at a surprisingly strong and steady pace. Many organizations that began with small pilot projects a year or two ago have moved on to enterprise deployment, and GPU-accelerated machines are now represented on the TOP500 list starting at position two. The relatively rapid adoption of CUDA by a community not known for the rapid adoption of much of anything is a noteworthy signal. Contrary to the accepted wisdom that GPU computing is more difficult, I believe its success thus far signals that it is no more complicated than good CPU programming. Further, it more clearly and succinctly expresses the parallelism of a large class of problems, leading to code that is easier to maintain, more scalable and better positioned to map to future many-core architectures.

The continued growth of CUDA contrasts sharply with the graveyard of abandoned languages introduced to the HPC market over the last 20 to 25 years. Its success can largely be attributed to i) support from a major corporate backer as opposed to a consortium, ii) the maturity of its compilers, iii) adherence to a C syntax easily recognized by developers, and iv) a more ephemeral feature that can best be described as elegance or simplicity. Physicists and mathematicians often use the word "elegant" as a high compliment to describe particularly appealing solutions or equations that neatly represent complex physical phenomena; where the language of mathematics succinctly and...well...elegantly describes and captures symmetry and physics. CUDA is an elegant solution to the problem of representing parallelism in algorithms, not all algorithms, but enough to matter. It seems to resonate in some way with the way we think and code, allowing an easier, more natural expression of parallelism beyond the task level.

HPC developers writing parallel code today have two enterprise options: i) traditional multicore platforms built on CPUs from Intel/AMD, and ii) platforms accelerated with GPGPU options from NVIDIA and AMD/ATI. Developing performant, scalable parallel code for multicore architectures is still non-trivial and involves a multi-level programming model that includes inter-node parallelism handled with MPI, intra-node parallelism with MPI, OpenMP or pthreads, and register-level parallelism expressed via Streaming SIMD Extensions (SSE). The expression of parallelism in this multi-level model is often verbose and messy, obscuring the underlying algorithm. The developer is often left feeling as though he or she is shoehorning in the parallelism.

The CUDA programming model presents a different, in some ways refreshing, approach to expressing parallelism. The MPI, OpenMP and SSE trio evolved from a world centered on serial processing. CUDA, by contrast, arises from a decidedly parallel world, where thousands of simultaneous threads are managed as the norm. The programming model forces the developer to identify the irreducible level of parallelism in his or her problem. In a world that is rapidly moving to manycore, not multicore, this seems to be a better, more intuitive and extensible way to think about our problems.
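
As a concrete illustration of that point (my own sketch, not code from the article), here is the canonical CUDA pattern: the kernel states the irreducible unit of work, one output element per thread, and the launch configuration scales it to however many elements, and cores, there are.

    #include <cuda_runtime.h>

    // Each thread owns exactly one element; the parallelism is declared
    // once and the hardware schedules thousands of threads as the norm.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                       // guard the ragged last block
            y[i] = a * x[i] + y[i];
    }

    // Host side: launch enough 256-thread blocks to cover n elements.
    // (x and y are assumed to already be device pointers.)
    void run_saxpy(int n, float a, const float *x, float *y)
    {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        saxpy<<<blocks, threads>>>(n, a, x, y);
    }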

CUDA is a programming language with constructs that are designed for the natural expression of data-level parallelism. It's not hard to understand expressibility in languages and the idea that some concepts are more easily stated in specific languages. Computer scientists do this all the time as they create optimal structures to represent their data. DNA base pairs, for example, are neatly and compactly expressed as a sequence of 2-bit data fields, far better than a simple-minded ASCII representation. Our Italian exchange student was fond of pointing out the vast superiority of Italian over English for passionate argument.
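
The DNA example is easy to make concrete (a sketch of the idea only): with four bases, two bits per base suffice, so sixteen bases pack into one 32-bit word, four times denser than one-byte ASCII characters.

    #include <stdint.h>

    // Map a base to a 2-bit code: A=00, C=01, G=10, T=11.
    static uint32_t base2bits(char b)
    {
        switch (b) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;   // 'T'
        }
    }

    // Pack up to 16 bases into a single 32-bit word.
    uint32_t pack16(const char *seq, int n)
    {
        uint32_t word = 0;
        for (int i = 0; i < n && i < 16; ++i)
            word |= base2bits(seq[i]) << (2 * i);
        return word;
    }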

Similarly, we have found in many cases that the expression of algorithmic parallelism in CUDA in fields as diverse as oil and gas, bioinformatics and finance is more elegant, compact and readable than equivalently optimized CPU code, preserving and more clearly presenting the underlying algorithm. In a recent project we reduced 3,500 lines of highly optimized C code to a CUDA kernel of about 800 lines. The optimized C was peppered with inline assembly, SSE macros, unrolled loops and special cases, making it difficult to read, extract algorithmic meaning from and extend in the future. By comparison, the CUDA code was cleaner and more readable. Ultimately it will be easier to maintain.
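
The flavor of that contrast is easy to show with a toy example (mine, not the project's code). Even for a trivial loop, the SSE version buries the algorithm under vector widths, intrinsics and a cleanup loop:

    #include <xmmintrin.h>

    // SSE version of y = a*x + y: the 4-wide vector width, the
    // intrinsics and the scalar tail are all the programmer's problem.
    void saxpy_sse(int n, float a, const float *x, float *y)
    {
        __m128 va = _mm_set1_ps(a);
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 vx = _mm_loadu_ps(x + i);
            __m128 vy = _mm_loadu_ps(y + i);
            _mm_storeu_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
        }
        for (; i < n; ++i)               // cleanup for the ragged tail
            y[i] = a * x[i] + y[i];
    }

The CUDA equivalent is the short guarded kernel sketched earlier: the arithmetic reads exactly as the algorithm states it, and the parallel machinery lives in the launch configuration rather than in the loop body.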

Commodity parallel processing began as a way to divide large tasks over multiple loosely-connected processors. Programming models supported the idea of dividing problems into a number of smaller pieces of equivalent work. Over time those processors have grown closer to one another in terms of latency and bandwidth, first as single operating system multiprocessor nodes and next as multicore processor components of those nodes. Looking towards the future we see only more cores per chip and more chips per node.

Even though our computing cores are more tightly coupled, our view of them is still very much from a top-down, task parallel mindset, i.e., take a large problem, divide it into many small pieces, distribute them to processing elements and just deal with the communication. In this top-down approach, we must discover new parallelism at each level, domain level parallelism for MPI, "for-loop" level for OpenMP, and data level parallelism for SSE. What is intriguing about CUDA is that it takes a bottom-up point of view, identifying the atomic unit of parallelism and embedding that in a hierarchical structure, e.g., thread::warp::block::grid.
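
In code that hierarchy is explicit; every thread can locate itself at each level (a small sketch of my own; warpSize is a CUDA built-in, 32 on current hardware):

    // Each thread finds its coordinates in the thread::warp::block::grid
    // hierarchy; gridDim.x blocks of blockDim.x threads make up the grid.
    __global__ void whoami(int *out)
    {
        int lane   = threadIdx.x % warpSize;  // position within the warp
        int warp   = threadIdx.x / warpSize;  // warp within the block
        int block  = blockIdx.x;              // block within the grid
        int global = blockIdx.x * blockDim.x + threadIdx.x;
        out[global] = lane + 1000 * warp + 1000000 * block;  // packed id
    }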

The enduring contribution of GPU computing to HPC may well be a programming model that peels us away from the current top-down, multi-level, task-parallel approach, popularizing instead a more scalable, bottom-up, data-parallel alternative. It's not right for every problem, but for those that map well to it, such as finite difference stencils and molecular dynamics among many others, it provides a cleaner, more natural language for expressing parallelism. It should be recognized that this simpler, cleaner expression in code is a main driver of the relatively rapid adoption by commercial and academic practitioners. Further, there is no intrinsic reason scaling must stop at the grid or device level. One can easily imagine a generalization of CUDA on future architectures that abstracts one or more levels above the grid to accomplish an implementation across multiple devices, effectively aggregating global memory into one contiguous span; a sort of GPU/NUMA approach. If this can be done, then GPU computing will have made a great leap toward solving a key problem in parallel computing by reducing the programming model from three levels to one, for a simpler, more elegant solution.
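
A finite-difference stencil shows why the mapping is natural: each output point depends only on a small fixed neighborhood, so one thread per point is exactly the irreducible unit the model asks for. A minimal 1D sketch (my own, with the boundary points simply skipped):

    // 3-point stencil: out[i] = c0*in[i-1] + c1*in[i] + c2*in[i+1].
    // A production kernel would stage in[] through shared memory so
    // neighboring threads share their loads; the structure is the same.
    __global__ void stencil3(int n, const float *in, float *out,
                             float c0, float c1, float c2)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= 1 && i < n - 1)
            out[i] = c0 * in[i - 1] + c1 * in[i] + c2 * in[i + 1];
    }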

GPUs Boost Supercomputers' Energy Efficiency

Two of the top 10 systems on the 2010 Green500 list of the world's most energy-efficient supercomputers, released at the end of June, use NVIDIA GPUs.

GPUs are changing the way computers process data. One major advantage is that machines pairing CPUs with GPUs are more energy efficient: efficient computing comes from using the right tool for the right job, and for heavy computation that tool is the GPU.

Because of the nature of computer graphics, GPUs are especially good at executing many tasks at once, which makes them well suited to this new computing environment. What the GPU offers is the singular ability to attack hard problems in massively parallel fashion across hundreds of cores.

Supercomputers that mix CPUs with graphics processors made their mark on the Green500 list of top energy-efficient supercomputers released on Wednesday.
Eight of the world's greenest supercomputers combined specialized accelerators like GPUs with CPUs to boost performance and make supercomputers more power efficient, according to the Green500 list, which is released twice a year. The list was released by the same group that compiles the Top500 list.
Supercomputers with accelerators are three times more energy efficient than their non-accelerated counterparts on the list, according to Wu Feng, associate professor of electrical and computer engineering at Virginia Polytechnic Institute and State University's college of engineering.
Two of the top eight green supercomputers are new entrants from China, and combine graphics processors from Nvidia with Intel's CPUs. In the previous list issued in November, only one supercomputer combined CPUs with GPUs from Advanced Micro Devices, but that machine has now dropped to the 11th spot.
The Green500 list is compiled to "ensure that supercomputers only simulate climate change and not create it," according to the Green500 Web site.
The list rates the greenest supercomputers by measuring performance relative to power consumed: a supercomputer's megaflops-per-second (MFLOP/s) performance is divided by the watts of power it draws. Supercomputers with accelerators averaged 554 MFLOP/s per watt, while the measured supercomputers without accelerators produced 181 MFLOP/s per watt.
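
As a worked example with round, hypothetical numbers (not figures from the list): a system sustaining 100 teraflops, i.e. 100,000,000 MFLOP/s, while drawing 200 kW (200,000 W) would score 100,000,000 / 200,000 = 500 MFLOP/s per watt, roughly the average reported here for the accelerated machines.
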
The supercomputers combining GPUs with CPUs include the Dawning Nebulae supercomputer, which was in the fourth spot, and the Mole-8.5 supercomputer, which took the eighth spot. The supercomputers are in China and combine Nvidia's Tesla C2050 graphics processors with Intel's six-core Xeon X5650 processor, which runs at 2.66GHz. The Nebulae supercomputer achieved efficiency of around 492.64 MFLOP/s per watt, while the Mole-8.5 achieved efficiency of 431.88 MFLOP/s per watt.
The top three green supercomputers were IBM supercomputers, all in Germany. The supercomputers include PowerXCell 8i processors from IBM, with custom field-programmable gate array accelerators to boost application performance. The supercomputers were also the top three in the previous list issued in November.
Overall, IBM chips were used in six of the top eight green supercomputers.
There is growing interest in building supercomputers that use graphics processors along with CPUs. GPUs are typically faster than traditional CPUs at executing certain tasks, such as those used in scientific and computing applications. Some institutions like the Tokyo Institute of Technology have announced plans to deploy more GPUs in an effort to squeeze more performance out of servers.

Sunday, July 18, 2010

15x-40x Faster: Musemage, a Magical GPU-Accelerated Image Editor (it has no Chinese name yet, so I've nicknamed it "美之美")


Today a Chinese company named Paraken publicly released the first beta of Musemage. Paraken, a software house based in Chengdu, Sichuan Province, was founded by a group of GPU developers; Musemage is its first commercial product.

Musemage is a fully GPU-accelerated image-processing application: most of its functions, including color adjustments, filters, selections, and image display, are implemented on the GPU. Simply put, Musemage is built on the GPU from the ground up.

Musemage's most striking feature is speed. Testing a 21-megapixel image on a PC with an NVIDIA GTX 480, some filters (Radial Blur and Surface Blur at best quality, with all parameters set to their highest) ran 15x-40x faster than Photoshop on a fast Core i7 870 CPU. More importantly, every filter and adjustment in Musemage updates the working area in real time: with a fast enough GPU, even at 2560x1600 full screen the image follows your adjustments as smoothly as video playback. That speed brings other conveniences as well; for example, you can zoom and pan the working image with the scroll wheel and right mouse button at any moment, even in the middle of an adjustment. All told, Musemage offers a fluid working experience quite unlike any other image editor today.
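
Filters of this kind map naturally onto the GPU because every output pixel can be computed independently. Purely as an illustration (my own sketch, not Musemage's actual code), a brute-force box blur in CUDA looks like this, with one thread per pixel:

    // Naive box blur: each thread averages the (2r+1) x (2r+1) window
    // around its pixel; launched with one thread per pixel, e.g.
    // dim3 block(16, 16) and a grid that covers the w x h image.
    __global__ void boxBlur(const float *in, float *out,
                            int w, int h, int r)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;

        float sum = 0.0f;
        int   cnt = 0;
        for (int dy = -r; dy <= r; ++dy)
            for (int dx = -r; dx <= r; ++dx) {
                int xx = x + dx, yy = y + dy;
                if (xx >= 0 && xx < w && yy >= 0 && yy < h) {
                    sum += in[yy * w + xx];
                    ++cnt;
                }
            }
        out[y * w + x] = sum / cnt;
    }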

Musemage provides many of the features image editing requires, including paintbrushes, selections, layers, text, and RAW file support; you can almost think of it as a GPU version of Photoshop (naturally the feature set is not yet that complete for a first release, and some of the features above are not yet enabled in the beta). It also includes some entertaining extras, such as an impressive face-beautification tool (called "Soft Skin" in the software), with startlingly good results:

The face-beautification tool is also very fast: as you drag the before/after divider back and forth, you can watch the GPU recompute the result in real time. Even on a 21-megapixel image, the processing takes only 0.2 seconds on a GTX 480. Beyond that, Musemage offers many other useful filters, such as special lens effects, special color treatments (old photo and the like), water ripple, and glass effects. The developer plans to add more GPU-accelerated features in the near future, including image cut-out, panorama stitching, 3D stereo picture compositing, and HDR rendering.

Musemage is currently at Beta 1: it has a few known bugs, the UI is not completely finished, and there may be other, unknown issues as well. The developer wants more people to try the software and send back feedback. You can download the English beta from http://www.musemage.com/download.html. To run it you need an NVIDIA GeForce 8 or newer GPU; a Fermi-class GeForce 400 series card is best. According to the developer's schedule, the final 1.0 version will be released in mid-August, priced in the range of US$99-149.

Musemage is developed and tested entirely on NVIDIA GPUs, and the developer recommends running it only on NVIDIA GPUs for the best acceleration.