2010年11月17日 星期三

Amazon推出新的NVIDIA繪圖晶片運算服務

2010/11/17 12:15:01

Amazon推出新的NVIDIA繪圖晶片運算服務

為了反映科技界的未來趨勢,Amazon Web Services (AWS)新增了一項新的運算方式,它將使用電腦的繪圖晶片。
AWSElastic Compute Cloud (EC2)可讓用戶使用不同類型的線上運算資源,只要付費就能使用。EC2一開始只是傳統的企業伺服器設定,但是Amazon為它增加了不同的服務以滿足特定的運算需求。
新的Cluster GPU服務是一台伺服器,配備兩顆四核英特爾Nehalem系列Xeon X5570處理器、兩顆Nvidia Fermi系列Tesla M2050繪圖晶片、22GB的記憶體、1.7TB儲存空間、以及10 gigabit的乙太網路連線。
繪圖處理器一開始只是專門用來加速電腦的繪圖運作,主要是3D遊戲與設計軟體。但是近年來繪圖晶片的效能大幅增進,早已超出當初設計的目的,這也是他們會現身在超級電腦之中的原因,例如世界最快的超級電腦Tianhe-1A
深入來說,GPU可以用來處理媒體資料例如重新剪輯影像大小,或是壓縮音訊並且也可以處理一些平行執行的計算工作。這是因為繪圖晶片很擅於執行這種類型的運作。因為每一顆Nvidia M2050具有448個處理核心可處理這種平行工作。
但是設計這些混合式系統並不是容易的事,因為繪圖晶片與傳統的處理器各自有其記憶體。為了使用它,程式設計者可以直接寫入到Nvidia 專為GPU處理的CUDA技術,使用專供它運用的程式碼函式庫,或是使用較高層的介面如OpenCL標準。
新的GPU服務目前只有搭配Linux執行,並且僅在Amazon的北維吉尼亞州區提供。目前它最貴的服務為每小時2.10美元。相較於傳統執行Linux伺服器的每小時費用為34美分,Windows伺服器則為每小時48美分。
(ZDNet Taiwan編輯部/Rex Chang)

Amazon EC2 Functionality with NVIDIA Graphic Cores

Amazon EC2 presents a true virtual computing environment, allowing you to use web service interfaces to launch instances with a variety of operating systems, load them with your custom application environment, manage your network’s access permissions, and run your image using as many or few systems as you desire.

Cluster GPU Instances

Instances of this family provide general-purpose graphics processing units (GPUs) with proportionally high CPU and increased network performance for applications benefitting from highly parallelized processing, including HPC, rendering and media processing applications. While Cluster Compute Instances provide the ability to create clusters of instances connected by a low latency, high throughput network, Cluster GPU Instances provide an additional option for applications that can benefit from the efficiency gains of the parallel computing power of GPUs over what can be achieved with traditional processors. Learn more about use of this instance type for HPC applications.
Cluster GPU Quadruple Extra Large 22 GB memory, 33.5 EC2 Compute Units, 2 x NVIDIA Tesla “Fermi”M2050 GPUs, 1690 GB of local instance storage, 64-bit platform, 10 Gigabit Ethernet.

Amazon Web Services takes NVIDIA GPU servers to the cloud

Amazon Web Services on Monday launched graphical processing unit instances for its high performance computing workloads.
In a statement and blog post, Amazon said graphical processing unit (GPU) servers have become popular enough to bring to AWS. GPU servers, powered by Nvidia’s Tesla chips, are about to hit mainstream as Dell, HP and IBM bring formerly custom servers to market.
AWS said that these GPU servers have generally been out of reach for many companies due to costs and architecture. Now AWS will put these servers on its Elastic Compute Cloud (EC2) service.
Amazon’s Cluster GPU Instances allow for 22 GB of memory and 33.5 EC2 Compute Units. The GPU instances tap Amazon’s cluster network, which is designed for data intensive applications. Each GPU instance features two NVIDIA Tesla M2050 GPUs.

The key specs:
·         A pair of NVIDIA Tesla M2050 “Fermi” GPUs.
·         A pair of quad-core Intel “Nehalem” X5570 processors offering 33.5 ECUs (EC2 Compute Units).
·         22 GB of RAM.
·         1690 GB of local instance storage.
·         10 Gbps Ethernet, with the ability to create low latency, full bisection bandwidth HPC clusters.

For Nvidia, the AWS launch is a nice win. GPUs may get a larger footprint in the enterprise.
For Amazon, the GPU clusters are a nice way to tap verticals such as the oil and gas industry, graphics and engineering design.
Amazon noted that customers may mix and match the usual instances with the GPU flavor to coax the most performance out of the cloud.
The company has been testing out its GPU instances with a few customers such as Calgary Scientific, a medical imaging software company; BrightScope, a financial data analytics outfit; Elemental Technologies, which provides video processing applications.

2010年9月22日 星期三

NVIDIA CUDA GPU加速分析C型肝炎病毒(HCV)轉變 – 超過x40倍速 (39天變成22.5小時)


NVIDIA CUDA GPU加速分析C型肝炎病毒(HCV)轉變超過x40倍速 (39天變成22.5小時)

美國疾病管制與預防中心(CDC)正進行一項運算科學上的研發, 期望在生物學與統計學等需要大量平行運算領域上, 大幅縮短研究時程與成本.
評估新計畫是否有效的關鍵因素包括: 時間, 保密性, 成本, 耗用的晶片數量, 與未來軟硬體的擴充性, 儲存媒介, 指令與數據模式, 視覺化, 可驗證性, 以及設備運行的持續性.
C型肝炎病毒(HCV)機轉的研究, 就是一個有趣的案例.
AccelerEyes(使用NVIDIA CUDA GPU高速運算的公司)提供CDC在這方面研發上的協助, 並以平行運算大幅加速C型肝炎研究將原本需要40天的運算, 加速到一天之內就完成.
這個劇烈的改善, 只需使用超高運算效能的NVIDIA低成本個人超級電腦, 搭配MathworksMATLAB軟體, 掛上AccelerEyes神奇的Jacket外套程式 (神奇的地方在於不需更動MATLAB程式碼, 原本的程式穿上這件外套立刻即時轉譯成NVIDIA CUDA高速運算語言).
比起原本使用位於康乃爾大學的, 使用512CPU, 昂貴的TeraGrid超級電腦中心; 當然是便宜太多了.

C型肝炎病毒研究的困境
(延伸閱讀: 由於C型肝炎的非結構蛋白5B (NS5B) 所轉譯的以RNA為模板進行複製的RNA聚合酶 (RdRp) 缺乏了校正的功能,再加上C型肝炎病毒具有很高的複製能力,在快速複製的過程中因為無法校正而造成C型肝炎病毒具有很高的基因歧異度.)
C型肝炎名列人類主要疾病排行榜. 儘管C型肝炎病毒高度的基因變異性, 我們還是觀察到C型肝炎病毒演化時, 大量並列的胺基酸取代與變異可用遵循羃次法則的無尺度分佈(scale-free network)來研究.
這個發現開啟了一扇研究之門, 將分子演化與複雜網路分佈這兩門學問連結起來.
複雜網路分佈具有拓樸學的特徵, 同時這個複雜分佈也高度影響著其自身的變異.
C型肝炎病毒網路大量並列的胺基酸取代與變異, 與其他生物學, 社會學的複雜網路, 具有類似的拓樸學上的特徵.
這個網路分佈的拓樸結構與分級組織, 讓很少數的胺基酸選址變異, 就能對C型肝炎演化施以極大的影響力.
我們也發現, C型肝炎病毒很容易產生隨機變異, 並不喜歡依據特定的規律演化 , 但是, 它也很敏感於在最大量連結處產生變異(不太會在最大量連結處變化).
了解C型肝炎病毒拓樸結構, 可以設計新穎的分子鍵結以瓦解病毒功能, 或者妨礙病毒分子進行補償性的改變而減低抗藥性或增加疫苗的有效性, 會有很大的幫助.


Mutation modeling for HCV on the GPU


The Centers for Disease Control and Prevention (CDC) is currently performing Research and Development (R&D) in computational science in order to accelerate research and reduce the costs of the massive parallelization required by some biological and statistical algorithms. Relevant factors in platform decisions are: time, security, costs, processors required, scalability, memory model, instruction and data models, visualization, repeatability and sustainability. A case study of interest is the biological research regarding coordinated mutations of the Hepatitis C Virus (HCV). AccelerEyes provided collaborative R&D resources and greatly improved the speed of this HCV research with the use of parallelization, reducing the computing time from 40 days to less than 1 day. This vast improvement was achieved with a low-cost Personal Supercomputer with NVIDIA GPU's, Mathworks MATLAB®, and AccelerEyes Jacket. Moderately faster execution times were achieved with a high-cost supercomputer at Cornell University, a TeraGrid computing resource with 512 dedicated cores running MATLAB.

The Problem

Hepatitis C virus (HCV) is a major cause of liver disease worldwide (Alter, 2007). It has been previously shown that despite its extensive heterogeneity, HCV evolution of Hepatitis C virus is shaped by numerous coordinated substitutions between pairs of amino acid (aa) sites, which can be organized into a scale-free network (Campo et al 2008). This finding creates a bridge between molecular evolution and the science of complex networks. This scientific discipline studies the topological features of networks, which highly influence the dynamics of the processes executed on them. The HCV network of coordinated substitutions has a topology with some key characteristics identified in other biological, technological, and social complex networks. The topological structure and hierarchical organization of this network suggest that a small number of aa sites exert extensive impact on hepatitis C virus evolution. It was also found that HCV is highly resistant to disadvantageous mutations at random sites, but it is very sensitive to mutations at the sites with the highest number of links (hubs). The HCV network topology may help devise novel molecular intervention strategies for disrupting viral functions or impeding compensatory changes for vaccine escape or drug resistance mutations.
There are three different sources of covariation in related biological sequences (such as the dataset of HCV sequences): (i) chance, (ii) common ancestry and (iii) structural or functional constraints. Effectively discriminating among these underlying causes is a task with many statistical and computational difficulties. In oder to rule out the effect of the first source of covariation (chances), the statistical significance of each correlation was established with a permutation test whereby the aa at each site in the sequence alignment was vertically shuffled. Ten thousand random alignments were created this way, simulating the distribution of correlation values under the null hypothesis that substitutions of aa at two sites are statistically independent. For physiochemical factors, a pair of sites was considered significantly correlated if its correlation value in the observed dataset was higher than the correlation value for those two sites in any of the random datasets (p = 0.0001). This is a simple but computationally intensive process that was greatly benefited by parallelization. However, the confounding effects of common ancestry still need to be addressed and a new method was developed. A phylogenetic tree was built and the ancestral sequences were estimated by maximum likelihood. For each pair of sites, a new sample of mutation events is obtained by crawling from the tips to the root of the tree (Willis, unpublished data). This new subset is used to calculate the physicochemical correlation between the two sites, which is then compared with the distribution of 10,000 random correlation values obtained by permutation of the same subset. This method was intractable on regular scientific workstations but the usage of the TeraGrid Computing resource greatly reduced the analysis time via deployment of a computationally efficient parallel algorithm [1].

The Solution: Personal Supercomputer

AccelerEyes transformed the existing HCV covariation algorithm to GPUs leveraging the Jacket software platform. Several steps were taken to migrate the existing algorithm to GPUs and Jacket. First and foremost, the code was profiled using the MATLAB profiler to identify the most expensive regions of code that would be targets for optimization.
Upon completion of the profiling step, it was clear that the code would benefit most significantly, for both CPUs and GPUs, by moving the computationally expensive regions of code to vectorized implementations within MATLAB. As a result, step two was the vectorization process.
Here is a quick snapshot of vectorization on a small piece of the code:

The code is expanded a bit in the following example and is typical of many MATLAB algorithms in that the use of "for loops" is prevalent. By removing the "for loop" and flattening the code, performance improvements will result.

Vectorization of the "for loop" in the above example results in MATLAB code that looks like the following code snapshot:

Vectorization of MATLAB code provides several benefits even though it requires time and thought. First and foremost, vectorization provides substantial performance improvements in most codes. Furthermore, vectorization is likely a better approach then porting to low level programming languages such as C, C++, or Fortran in that vectorized MATLAB code is a hardware independent implementation of performance improvements and therefore can be easily deployed to other users without recompilation or optimization for different computing architectures.
During the vectorization process, the AccelerEyes and CDC team also incorporated Jacket specific functions in order to GPU enable the code along the way. You will notice through the Vectorized code example that the use of commands such as "gsingle" and "gzeros" are used. These are GPU datatypes established with the Jacket software platform and signal to the MATLAB code that this data is to be directed to the GPU for processing. Once data is in GPU memory, Jacket's just-in-time compiler and run-time system take over in order to execute the computationally intensive portions of the code on the GPU. Use of Jacket on GPU datatypes in this simplest form of GPU enabling algorithms and is made possible due to the vectorization effort.
To implement this algorithm for GPUs in less than 2 weeks from start to finish, the team first profiled the MATLAB code then vectorized the code for optimization and applied Jacket datatypes to move the code to GPUs.

Results

A significant result of this project is that less than 2 weeks were spent preparing the code for performance improvements for both CPU and GPU architectures. It is not uncommon for months or even years to be spent migrating high level algorithms written in MATLAB or other very high level languages to C, C++, or Fortran along with parallelization labor to get codes working with MPI or OpenMP. It is possible that these multi-month or year long efforts can result in significant speed ups, but the trade off of labor time with computing resources must be considered.
The primary objective of the project was to evaluate the performance possibilities of GPU technology and the Jacket platform for the MATLAB based algorithm. The following results were achieved and compared with alternative computing architectures. Any comparative data associated with labor time to run the code on each architecture was not contemplated.
The base line for running the algorithm was 10,000 permutations for each computing architecture. Some modifications to the algorithm for each architecture were assumed to be made but the output was only considered valid if consistent with the original CPU based code.
·         Scientific Workstation - CPU Only
To understand the significant run time of the 10,000 values only an estimate was made of the runtime using the original MATLAB code for CPU only. The CPU used for estimating was Intel's Nehalem CPU with 4 cores (8 threads) running at 2.66 GHz.

Cost of the workstation = ~$2,000US
Estimated Run time = ~39 days

·         Scientific Workstation - CPU + GPU hybrid
An initial estimate of the runtime for the hybrid workstation configuration with the same Nehalem CPU and 4 Nvidia C1060 GPUs was approximately 24 hours. As a result, the team ran the algorithm in total for 10,000 permutations to get the final data output and run time for the vectorized and Jacket-based code.

Cost of the hybrid workstation = ~$5,000US
Actual Run time = 22.5 hours

·         TeraGrid - 512 core CPU cluster
The same base line code used for this architectural work was also used to run the algorithm on Cornell's 512 core Teragrid resource. The same 10,000 permutations were used and as stated in the opening. Some algorithmic modifications were made to the code to maximize the potential of the cluster.

Cost of cluster = >$250,000
Actual Run time = ~5 hours

The performance and speed up results prove that various architectures yield different performance and, depending on the urgency of the results and the users access to computing resources, there are ways to solve computationally intense problems based on these circumstances. Most importantly the price/ performance ratios tell a strong story about what is possible in technical computing in 2010 depending on the user and the type of budget or access available. The economics of technical computing change dramatically with GPUs when using traditional measures of price/performance. In the following chart, run time speed up (X speed up) and amount of dollars spent(X $ spent) are calculated as usual when evaluating price/performance. However, with GPUs delivering substantial performance gain at such a low relative cost to traditional high end computing resources, it is necessary to calculate a relative price/performance (Relative P-P)figure to understand how much more performance an application gains given a small increment of dollars spent.


Conclusions

Although absolute performance may be the goal on any given algorithm or application and any cost could be expended to get the absolute lowest run times, when calculating relative price to performance the higher number will always represent the best value for any given application. Any relative price performance number less than 100% means that more dollars are being spent than the number of times an algorithm is sped up from the base line computing configuration.

The Team

·         CDC Research and Development Team
·         Gallagher Pryor, AccelerEyes
·         Sandeep Maddipatla, AccelerEyes

References

Alter M. 2007. Epidemiology of hepatitis C virus infection. World J Gastroenterol 13:2436-2441.

Campo, D. Dimitrova, Z. Mitchell, R. Lara, J. Khudyakov, Y. 2008. Coordinated evolution of the hepatitis C virus. Proc Natl Acad Sci USA.105(28):9685-9690.

Dimitrova, Z. Campo, D. Neuhaus, E. Khudyakov, Y. Woody, N. Lifka, D. Seamless Scaling of a Desktop MATLAB Application to an Experimental TeraGrid Resource. Poster. 21st Supercomputing Conference, November 12-14, Portland, Oregon. [1]

2010年8月2日 星期一

PGA高爾夫球錦標賽(PGA Championship) 將以3D實況立體轉播2場 - 8/13 & 8/14亞洲時間



















跟上次美國極限房車大賽(NASCAR)一樣, 這次的PGA高爾夫球錦標賽(PGA  Championship)將會以3D實況立體轉播2 (Par-3, 12&17).
時間:
·         (台灣/香港/中國時間) 8/13週五凌晨3:00AM-7:00AM = (美東時間) 8/12 3:00PM-7:00PM
·         (台灣/香港/中國時間) 8/14週六凌晨3:00AM-7:00AM = (美東時間) 8/13 3:00PM-7:00PM

觀看網址:

硬體需求:
1. NVIDIA 3D Vision 電腦
2. 最新的NVIDIA GeForce and 3D Vision 驅動程式
3. 微軟Silverlight 4軟體

----------------------------------------------------------------------------------------------------------------------------------
Earlier this week The PGA of America and Turner Sports announced that they will be televising and streaming 2 days of the PGA Championship in 3D, and just as with NASCAR (www.nascar.com/racebuddy3d), the streaming solution is powered by NVIDIA 3D Vision.

The production will be a bit different than the Masters as it will only feature the par-3 12th and 17th holes at Whistling Straits in Kohler, Wisconsin.  However, there will be eight cameras stationed around these two holes so it should enable a nice 3D viewing experience. 

Thursday, August 12th    3:00pm – 7:00pm ET
Friday, August 13th        3:00pm – 7:00pm ET

Requirements to watch the action:
1. A NVIDIA 3D Vision PC
2. The latest GeForce and 3D Vision Drivers
3. Silverlight 4 from Microsoft

2010年8月1日 星期日

Bunkspeed 發布NVIDIA CUDA GPU圖形處理器加速的繪圖軟體

Bunkspeed 發布 NVIDIA CUDA GPU圖形處理器加速的繪圖軟體
Bunkspeed Launches GPU-Accelerated 3D Rendering Software





















一家名為Bunkspeed的 3D繪圖與動畫軟體公司, 剛剛發布了用GPU圖形處理器加速的專業互動繪圖軟體 – Bunkspeed SHOT. 
這個
Bunkspeed軟體, 效果好得像是數位虛擬世界裡的攝影機”, 使用最先進逼真的iray光線追蹤繪圖技術, 可以使用NVIDIA Quadro, TeslaGeForce圖形處理器實現高速光線追蹤運算參考上圖與下圖.
Bunkspeed, provider of simplified 3D rendering and animation software, has released Bunkspeed SHOT, an interactive 3D application that brings the power of the GPU to creative professionals. The new version, which acts as a “camera” for 3D data, incorporates innovative iray technology from mental images and runs on NVIDIA Quadro, Tesla and GeForce platforms.

Bunkspeed2007年就開發了以CPU為基礎的互動光跡運算(最逼真的光線反射/折射/陰影模擬運算), 受限於CPU極為有限的運算能力, 光跡運算在繪圖領域一直無法發展開來. 直到現在, NVIDIA擁有20x~400x高速運算能力的CUDA GPU圖形處理器, 才能讓Bunkspeed推出Bunkspeed SHOT軟體, 戲劇性的加速複雜且龐大的光線追蹤工作.
即使是最複雜的, 互動的, 即時的場景, 使用CUDA 多核心GPU圖形處理器加速的Bunkspeed SHOT軟體仍然得心應手, 讓你輕鬆加速運算.
比起多核心CPU, 改使用多核心GPU圖形處理器可以省20%成本, 還能達到前所未有的繪圖效能與真實性.
Background: Bunkspeed brought interactive ray tracing (a rendering method that simulates light reflections, refractions and shadows) to the market in 2007 but it was limited to the CPU. The new Bunkspeed SHOT leverages CUDA-based GPUs to dramatically accelerate the user’s workflow. Even the most complex scenes render interactively and in real-time. Bunkspeed also enables users to scale up rendering over multiple machines. A new level of performance and realism is now achievable at less than 20% of the cost of the same level on multiple CPUs.

有個user以前使用過舊版的Bunkspeed說道: “我熟悉光跡運算, 且知道哪些事過去我們做不到的. 然而, 有了NVIDIA GPU圖形處理器, 加上Bunkspeed SHOT突破性的效能, 畫質, 更廣泛的應用, 以及使用上的樂趣 , 我現在拿它來繪製一年前我想都不敢想的, 複雜逼真的光跡運算繪圖” – Peter Crowther, 英國電腦繪圖藝術家
"As a user of the previous generation software from Bunkspeed, I knew the standard set at the time was tough to beat. Bunkspeed SHOT offers breakthrough usability, performance and quality, and is a pleasure to use…. I am rendering at speeds unimagined just a couple of years ago…." – Peter Crowther, UK-based computer graphics artist and illustrator