
Anthony's blog


Iceberg upgrade - part 6.

Posted by cs1ab Sep 22, 2011

Currently stress-testing our new Lustre storage system… unfortunately I don't have any exciting pictures of the hardware, mainly because disk-packs and servers don't look terribly interesting.

 

Lustre is a parallel file system, designed to cope with HPC-type workloads. Unlike our current NFS servers, the Lustre system provides a big pool of storage (80TB currently) which we can scale in performance or capacity by adding extra storage servers. Lustre can also run over Infiniband, which provides greater bandwidth than Ethernet networking.

 

I've been running some IOzone tests to compare the single-client performance of the new system with our existing storage servers. The results are as we expected - a considerable speed improvement for large files with large block sizes, plus read improvements due to the coherent cache that Lustre maintains. Small files and random IO are not so great, and in some cases worse, but that's to be expected - Lustre is designed for large amounts of data with large block sizes, with many clients accessing it.
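To give a feel for what a test like this measures, here is a minimal Python sketch (not IOzone itself) that times a sequential write and read of a single file at a chosen block size. The function name and its parameters are illustrative assumptions, not anything taken from the real tests, and it makes no attempt to defeat client-side caching, so the read figure will largely reflect cache.

```python
import os
import tempfile
import time


def sequential_throughput(directory, file_size, block_size):
    """Write then read one file sequentially; return (write, read) in MB/s.

    Illustrative only: real tools such as IOzone control caching and
    exercise many more access patterns than this.
    """
    block = b"\0" * block_size
    blocks = file_size // block_size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # force the data out before stopping the clock
        write_secs = time.perf_counter() - start

        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(block_size):  # read until end of file
                pass
        read_secs = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the scratch file

    megabytes = blocks * block_size / 1e6
    return megabytes / write_secs, megabytes / read_secs
```

Calling it twice on the same directory, once with a 4KB block size and once with a 1MB block size, gives a rough version of the small-block versus large-block comparison described above.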

 

I'm currently running multiple instances of IOzone and Bonnie across the new cluster to stress it (a task which would bring our existing storage servers to their knees!), and everything seems to be holding up OK. And just to make things interesting I've tried switching off various parts of the Lustre system to check that things fail over and fail back as they should… so far the system just pauses, re-configures its disk controllers, and carries on. I've included a screen-grab showing the load on the cluster… you can see the gap between 16:20 and 16:40 when I switched the storage server OSS2 off - OSS1 and OSS3 shared out its disks and everything carried on.

 

ganglia.jpg

 

Compared to traditional NFS servers, Lustre is quite a complicated beast. Our installation is relatively simple and small-scale, but we've got 2 metadata servers (for failover), 3 storage servers and 3 expansion disk arrays, all controlled by Lustre. Because of this we're not going to be moving all of our users' /home or /data areas on to the new system… instead we're going to use this new storage as a fast, temporary area where Iceberg users can put large amounts of data for processing.


Iceberg upgrade - part 5.

Posted by cs1ab Sep 19, 2011

We've spent the last week running the system hot with benchmarking and soak-testing codes… apart from a few loose cables we haven't seen any problems with the new kit.

 

Here's a picture of one of the new 'nodes':

 

P1000754.jpg

 

Unlike the previous generation of Iceberg where each node was a completely separate server, each new 'node' is actually a blade unit, with four blades in each chassis. The chassis contains all of the power supplies, fans and hard disks - sharing power supplies between nodes provides redundancy against failure, and increases the energy efficiency of the system. Each blade then contains a pair of CPUs, the RAM and networking/interface cards.

 

You can see the first CPU (or rather its heatsink…) on the left of the picture, with its bank of RAM above it. The second CPU is in the centre of the picture, underneath a plastic baffle, with its memory below it. Finally on the left you can see the Infiniband and PCIe card… This node is a bit special because it has both Infiniband networking ports and an external PCIe interface to connect to the GPGPU chassis.

 

Here's another view from the rear, showing the Infiniband ports (top left), PCIe interface (top right), Gigabit ethernet ports (below the Infiniband ports), plus the usual USB, serial, and VGA ports.

 

P1000757.jpg

 

For the technically inclined, the CPUs are six-core Intel X5650s giving 12 cores per node, with 24GB of RAM (plus a few high-memory nodes with 48GB of RAM). We've disabled Hyper-Threading on the CPUs, as HPC workloads tend not to benefit from this feature, and enabled Turbo Boost, so single-threaded applications can get a performance boost if the system has the power and cooling capacity to allow the CPU to over-clock itself.


Iceberg upgrade - part 4.

Posted by cs1ab Sep 7, 2011

It's alive!

 

P1000722.jpg

P1000744.jpg

We've now completely powered up all of the nodes, and our suppliers are currently running the system at full utilisation to generate as much heat as possible to check that all of the cable connections are stable as they heat up and expand.  It gets quite warm behind the racks!  So far everything seems to be working fine, and Alces have done a very thorough job of cabling, labelling and documenting the system, which should make my job a lot easier. 

 

I haven't talked yet about what new capabilities this upgrade will provide our users.... I'll go into more details about each of the new features as I get to grips with them myself, but to summarize:

1.  An additional 912 CPU cores (although we'll look at retiring some of Iceberg's older kit later, so the total core count may not increase by this much).

2.  All of the new nodes are connected via a high speed Infiniband network... so it's now theoretically possible to run a 912 core job across the system.

3.  A high performance parallel file system to provide a large area of faster (but temporary) disk store, accessible using the Infiniband network.

4.  A set of 8 GPGPU cards for codes designed to run on graphics cards.
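As a quick sanity check on the first item, the 912-core figure can be turned back into hardware using the numbers from the part 5 post (six-core CPUs, two CPUs per blade, four blades per chassis). The sketch below assumes every new core lives in a blade of that type, which the posts don't state explicitly, and the function name is just an illustration.

```python
NEW_CORES = 912          # from the list above
CORES_PER_CPU = 6        # six-core Intel X5650 (part 5)
CPUS_PER_BLADE = 2       # a pair of CPUs per blade (part 5)
BLADES_PER_CHASSIS = 4   # four blades in each chassis (part 5)


def blades_and_chassis(total_cores):
    """Return (blades, chassis) needed to provide total_cores.

    Assumes all cores come from identical 12-core blades.
    """
    cores_per_blade = CORES_PER_CPU * CPUS_PER_BLADE
    blades, spare = divmod(total_cores, cores_per_blade)
    if spare:
        raise ValueError("core count is not a whole number of blades")
    chassis = -(-blades // BLADES_PER_CHASSIS)  # ceiling division
    return blades, chassis


print(blades_and_chassis(NEW_CORES))  # (76, 19)
```

Under that assumption, 912 cores works out to 76 blades in 19 chassis.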


Iceberg upgrade - part 3.

Posted by cs1ab Sep 7, 2011

The nodes have now all been installed into the racks, and we've switched on the power.  Fortunately nothing has exploded yet, but there's still time....

 

Whilst our installers are checking cables, here's an explanation of why we chose to keep the name "iceberg" for our new cluster…

 

One of the obvious questions asked by our suppliers when they were building the new cluster was "What do you want it to be called?". This rather stumped us. On the one hand, this isn't a complete rip-and-replace of the old service, and we wanted users to be able to use the new system in exactly the same way that they used the old one. However, we also wanted the opportunity to advertise and get user buy-in for the new machine to help expand our user base beyond our traditional users. A new name for a new cluster might help do that… In the end we decided to throw the question open to our users… and surprisingly the overwhelming response was to keep the Iceberg name (although the poll has been left open, and "teratron" has now inched into the lead, too late for us to adopt).

 

As a colleague at another University asked me, "Why did you name your cluster after a lettuce?". The actual inspiration for the name comes from the White Rose Grid collaboration (http://www.wrgrid.org.uk/) between Sheffield, Leeds and York, with all three Universities purchasing HPC systems to form a shared computing grid. Each system was named after a white rose, with Iceberg chosen for Sheffield, possibly referencing its predecessor Titania…


Iceberg upgrade - part 2.

Posted by cs1ab Sep 6, 2011

Just a quick post - the new kit has arrived, and we've man-handled it into the data center.  The picture below shows some of the boxes and racks which need to be assembled into the new system.

P1000713.jpg

The original Iceberg (and its predecessor) were supplied by Sun Microsystems. However the new system is being supplied by Dell, with integration services provided by Alces Software, following a very competitive EU Tender process. The new hardware was delivered by Dell to Alces Software, who have racked, installed, cabled and configured the nodes in the racks and then removed the nodes to allow for easier transport to our data center. Alces are currently on-site re-assembling all of the nodes into the racks as I type…

 

Once everything has been reassembled the fun will really begin… Power-on, soak-testing, benchmarking and acceptance testing the system to make sure that it meets all of the requirements of the tender, and then lots of work to integrate the system into our existing cluster and test our main applications so that users can seamlessly take advantage of the new system.

 

Plus the latest upgrade adds some new capabilities to the cluster, which will provide researchers with the tools to perform exciting and innovative new research..... more on this later.

 

Edit - bonus picture of the racks being populated with nodes - note the network and power cables are already in place, so it's a relatively quick job to install the nodes:

 

P1000714.jpg


Iceberg upgrade - part 1.

Posted by cs1ab Sep 6, 2011

This is the first in what will hopefully be a series of blog posts documenting the current upgrade of Iceberg...

 

It probably makes most sense to first answer the question "What is Iceberg?".

 

Iceberg is the general purpose High Performance Computing (HPC) resource at the University of Sheffield. The service is provided by CiCS to support research, and any researcher/postgraduate student can apply for an account. Iceberg also supports the teaching of final year undergraduates, who can use the system for their project work. In addition, Iceberg is the Sheffield node of the White Rose Grid, which is a collaboration between the Universities of Leeds, York and Sheffield.

 

Here's a picture of Iceberg, taken from our website:

http://www.shef.ac.uk/polopoly_fs/1.1048!/image/icebergsm2.jpg

Unfortunately this isn't what Iceberg looks like anymore! This picture was taken when Iceberg was first installed, way back in 2005 or so. And 'Iceberg' is actually only half of the machine in the picture - the remainder belongs to the Physics Department as part of CERN's LHC data-crunching project GridPP.

 

Since then, CiCS have provided the funds to perform a rolling-upgrade of Iceberg, so that gradually the original nodes have been replaced with more modern machines (HPC clusters typically have a 3-4 year life span.... even though a lot of the nodes still function OK after four years it becomes economically and environmentally more cost-effective to purchase new nodes which are more energy efficient). 

 

In addition to providing a basic 'free' service to researchers/project students, we also charge a number of research groups for priority or dedicated CPU resource within the cluster (further info here: http://www.shef.ac.uk/wrgrid/iceberg/costing).  All of the revenue generated via this mechanism goes towards future upgrades of Iceberg, increasing the capacity of the cluster for all users.

 

Here's what's left of the old Iceberg:

icebergold.jpg

So where has the cluster gone? Not only have we been replacing the old nodes, but we've also moved the entire cluster into another data center due to space constraints in the original machine room. Iceberg has its own private network spanning the two data centers to ensure that our network traffic doesn't disrupt the University's core network.

 

This is what Iceberg currently looks like:

 

P1000693.jpg

P1000701.jpg

Iceberg currently comprises 2 head nodes (for redundancy), 127 worker nodes, some test nodes and three storage servers. This provides ~650 cores for general use and ~45TB of usable file store.

 

Here's a picture showing a view from the rear of the racks -

P1000703.jpg

At the top of each rack are some Gigabit network switches, which provide the cluster's private network - most of the nodes have a one gigabit network connection, with the storage servers (the two big machines in the middle-left of the picture) connected at 10 gigabits.

 

In addition to the standard Ethernet network, approximately one third of the current nodes also have a high performance Infiniband network link (you can see one of the Infiniband switches in the bottom left of the picture with the blue lights). Here's a closer look:

P1000705.jpg

 

Infiniband provides a high bandwidth (16 gigabits per second per node for this switch), low latency interconnect (~a couple of microseconds: ~1/1000 of the latency of Ethernet networking). This type of network is useful for parallel jobs which have lots of inter-process communication, where the speed at which processes communicate with each other limits the performance of the job - any program which uses MPI (either Open MPI or MVAPICH2) can use the Infiniband interconnect.
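To see why latency matters so much for chatty parallel jobs, a common back-of-the-envelope model is: time to send a message = latency + size / bandwidth. The Python sketch below plugs in the figures quoted above, reading "a couple of microseconds" as 2 µs and taking the ~1/1000 ratio at face value to get 2 ms for Ethernet. Both of those readings, and the names used, are assumptions for illustration rather than measurements.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bits_per_s):
    """Simple model: fixed latency plus time to push the bits down the wire."""
    return latency_s + message_bytes * 8 / bandwidth_bits_per_s


# (latency in seconds, bandwidth in bits per second) - assumed values
INFINIBAND = (2e-6, 16e9)   # ~2 microseconds, 16 gigabits per second
ETHERNET = (2e-3, 1e9)      # 1000x the latency, one gigabit per second


def infiniband_speedup(message_bytes):
    """How many times faster the model says Infiniband delivers a message."""
    return (transfer_time(message_bytes, *ETHERNET)
            / transfer_time(message_bytes, *INFINIBAND))


for size in (8, 64 * 1024, 10**9):
    print(f"{size:>12} bytes: {infiniband_speedup(size):8.1f}x faster")
```

In this model tiny messages are dominated by latency, so the advantage is close to 1000x, while for very large transfers it falls towards the bandwidth ratio of 16x. Jobs that exchange lots of small messages are the ones that gain the most.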

 

You might have noticed that each rack is only partly populated - this is because the trend for HPC hardware is towards smaller, denser nodes with shared components such as power supplies to improve the energy efficiency... however this increases the electrical power that each rack needs to run the kit, and the cooling required to take away the heat generated. As such, each rack in the data center is limited to ~7kW of power, so we can't use all of the space in the racks for worker nodes.

 

Unfortunately we've now run out of space in the current data center! (Actually, the data center still has lots of space in it as CiCS have virtualised/consolidated a lot of servers, however allocating a large block of uninterrupted space is a bit tricky.... The power/cooling constraints in this data center also limit the effective growth of the cluster).

 

This is why we're going to start re-building Iceberg back in its original home, which now has more space available.... Thanks to our colleagues in Estates we've had extra power sockets installed in the original data center, and our Data Center manager has re-jigged the data center into a 'hot-aisle' containment configuration to allow the system to be cooled much more efficiently (saving energy/money). We've now got the capacity to run 14kW of equipment per rack, which could be increased further in the future.... This will allow us to run a much more scalable and energy-efficient system.
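To illustrate what the jump from ~7kW to 14kW per rack means in practice, here is a rough Python sketch of the power budget. The 1500W draw per chassis is a purely hypothetical figure chosen for the example (no measured value is given here), and the calculation ignores switches and any physical space limits.

```python
ASSUMED_CHASSIS_WATTS = 1500   # hypothetical draw per chassis, not a measured figure


def chassis_per_rack(rack_limit_watts, chassis_watts=ASSUMED_CHASSIS_WATTS):
    """How many whole chassis fit inside a rack's power limit."""
    return rack_limit_watts // chassis_watts


for limit in (7000, 14000):   # the two per-rack limits mentioned in this post
    print(f"{limit / 1000:.0f}kW rack: {chassis_per_rack(limit)} chassis")
```

With that assumed draw, a 7kW rack holds 4 chassis and a 14kW rack holds 9, so doubling the power limit slightly more than doubles the usable density because less of the budget is left stranded.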

 

To finish, here's a picture of the empty space where the new kit should begin to be installed this afternoon... assuming it survives its journey from Oxfordshire. I'll blog again with details about the latest upgrade when I get time, although I might be a bit busy....!

P1000711.jpg