This is the first in what will hopefully be a series of blog posts documenting the current upgrade of Iceberg...
It probably makes most sense to first answer the question "What is Iceberg?".
Iceberg is the general purpose High Performance Computing (HPC) resource at the University of Sheffield. The service is provided by CiCS to support research, and any researcher/postgraduate student can apply for an account. Iceberg also supports the teaching of final year undergraduates, who can use the system for their project work. In addition, Iceberg is the Sheffield node of the White Rose Grid, which is a collaboration between the Universities of Leeds, York and Sheffield.
Here's a picture of Iceberg, taken from our website:

Unfortunately this isn't what Iceberg looks like anymore! This picture was taken when Iceberg was first installed, way back in 2005 or so. And 'Iceberg' is actually only half of the machine in the picture - the remainder belongs to the Physics Department as part of CERN's LHC data-crunching project GridPP.
Since then, CiCS have provided the funds to perform a rolling upgrade of Iceberg, so that the original nodes have gradually been replaced with more modern machines. (HPC clusters typically have a 3-4 year life span. Even though a lot of the nodes still function OK after four years, it becomes more cost-effective, both economically and environmentally, to purchase new nodes which are more energy efficient.)
In addition to providing a basic 'free' service to researchers/project students, we also charge a number of research groups for priority or dedicated CPU resource within the cluster (further info here: http://www.shef.ac.uk/wrgrid/iceberg/costing). All of the revenue generated via this mechanism goes towards future upgrades of Iceberg, increasing the capacity of the cluster for all users.
Here's what's left of the old Iceberg:

So where has the cluster gone? Not only have we been replacing the old nodes, but we've also moved the entire cluster into another data center due to space constraints in the original machine room. Iceberg has its own private network spanning the two data centers to ensure that our network traffic doesn't disrupt the University's core network.
This is what Iceberg currently looks like:


Iceberg currently comprises two head nodes (for redundancy), 127 worker nodes, some test nodes and three storage servers. This provides ~650 cores for general use and ~45TB of usable file store.
Here's a picture showing a view from the rear of the racks -

At the top of each rack are some Gigabit network switches, which provide the cluster's private network - most of the nodes have a one gigabit network connection, with the storage servers (the two big machines in the middle-left of the picture) connected at 10 gigabits.
In addition to the standard ethernet network, approximately one third of the current nodes also have a high performance Infiniband network link (you can see one of the Infiniband switches, with the blue lights, in the bottom left of the picture). Here's a closer look:

Infiniband provides a high-bandwidth (16 gigabits per second per node for this switch), low-latency interconnect (a couple of microseconds, roughly 1/1000 of the latency of ethernet networking). This type of network is useful for parallel jobs which have lots of inter-process communication, where the speed at which processes communicate with each other limits the performance of the job - any program which uses MPI (either Open MPI or MVAPICH2) can use the Infiniband interconnect.
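To get a rough feel for why latency matters so much for chatty parallel jobs, here's a back-of-the-envelope sketch in Python. It uses the simple "latency plus size divided by bandwidth" model and the figures quoted above; the ethernet latency is just the Infiniband figure multiplied by 1000, and the function and constant names are made up for illustration. Real MPI performance depends on much more than this (protocol overheads, switch hops, message patterns), so treat the numbers as indicative only.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bits_per_s):
    """Estimate the time to send one message: fixed latency plus time on the wire."""
    return latency_s + (message_bytes * 8) / bandwidth_bits_per_s


# Figures from the post (approximate). The ethernet latency is an assumption
# derived from the "~1/1000" ratio, not a measured value.
IB_LATENCY_S = 2e-6                      # "a couple of microseconds"
IB_BANDWIDTH = 16e9                      # 16 gigabits per second per node
ETH_LATENCY_S = IB_LATENCY_S * 1000
ETH_BANDWIDTH = 1e9                      # one gigabit per second

for size in (8, 1024, 1024 ** 2):        # 8 bytes, 1 KiB, 1 MiB
    ib = transfer_time(size, IB_LATENCY_S, IB_BANDWIDTH)
    eth = transfer_time(size, ETH_LATENCY_S, ETH_BANDWIDTH)
    print(f"{size:>8} bytes: Infiniband {ib * 1e6:10.2f} us, "
          f"ethernet {eth * 1e6:10.2f} us")
```

In this model small messages are dominated by latency and large ones by bandwidth, which is why jobs that exchange lots of small messages tend to benefit most from the Infiniband link.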
You might have noticed that each rack is only partly populated - this is because the trend for HPC hardware is towards smaller, denser nodes with shared components such as power supplies to improve the energy efficiency. However, this increases the electrical power that each rack needs to run the kit, and the cooling required to take away the heat generated. As such, each rack in the data center is limited to ~7kW of power, so we can't use all of the space in the racks for worker nodes.
Unfortunately we've now run out of space in the current data center! (Actually, the data center still has lots of space in it as CiCS have virtualised/consolidated a lot of servers; however, allocating a large block of uninterrupted space is a bit tricky. The power/cooling constraints in this data center also limit the effective growth of the cluster.)
This is why we're going to start re-building Iceberg back in its original home, which now has more space available. Thanks to our colleagues in Estates we've had extra power sockets installed in the original data center, and our Data Center manager has re-jigged the data center into a 'hot-aisle' containment configuration to allow the system to be cooled much more efficiently (saving energy/money). We've now got the capacity to run 14kW of equipment per rack, which could be increased further in the future. This will allow us to run a much more scalable and energy-efficient system.
To finish, here's a picture of the empty space where the new kit should begin to be installed this afternoon... assuming it survives its journey from Oxfordshire. I'll blog again with details about the latest upgrade when I get time, although I might be a bit busy....!
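As a quick illustration of what the per-rack power limit means in practice, here's a tiny Python sketch comparing the ~7kW and 14kW budgets. The 350W per-node figure is purely hypothetical (it is not a measurement from Iceberg), and the function name is made up for this example; the calculation also ignores switches, storage and any safety headroom.

```python
def nodes_per_rack(rack_power_w, node_power_w):
    """Whole number of nodes that fit within a rack's power budget."""
    return rack_power_w // node_power_w


HYPOTHETICAL_NODE_POWER_W = 350          # assumed draw per node, for illustration only

for rack_kw in (7, 14):                  # old and new per-rack limits from the post
    count = nodes_per_rack(rack_kw * 1000, HYPOTHETICAL_NODE_POWER_W)
    print(f"{rack_kw}kW rack: room for {count} nodes at {HYPOTHETICAL_NODE_POWER_W}W each")
```

Whatever the real per-node draw turns out to be, doubling the power budget roughly doubles the number of nodes a rack can host, which is the point of the move.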
