[IITAC-users] Update on I/O problems

Geoff Bradley kbradley at tchpc.tcd.ie
Tue Jan 11 16:00:13 GMT 2011


Dear colleagues,

You will no doubt be aware that we are experiencing ongoing difficulties
with the planned upgrade of our I/O cluster.  This I/O cluster is a
critical piece of infrastructure that serves the same GPFS parallel
file system to all compute clusters in HPC: currently IITAC, Lonsdale,
and Parsons.  This upgrade, although scheduled to coincide with the
rollout of our new compute cluster, Kelvin, was independently required
to ensure that the I/O infrastructure would be stable, and crucially,
supported and in warranty until the end of 2013.  This decision was
taken because the infrastructure was going out of warranty and there was
(is) no funding available to purchase a new, second I/O cluster (at an
estimated cost of €300k).  This I/O upgrade includes the provision
of new I/O nodes, new Fibre Channel cards and switches, new QLogic IB
HBAs and switches, an extended warranty on our DDN9900, and additional
GPFS licenses.  The advice from our vendors and suppliers indicated
that this planned upgrade, although complex, should run without major
technical difficulties.

The Problem:
Following a scheduled downtime period from 20th-23rd December for the upgrade,
all systems appeared to come back up and be stable.  Unfortunately, the
I/O nodes experienced unexplained kernel panics over the Christmas period
when subjected to an intense read/write load from the compute clusters.
To date, we have been unable to get the file system stable enough to
bring the compute queues back online.  However, for clarity, it should
be noted that the file system has been (mostly) available since last
Thursday, so users have access to their files/data.

The current proposed solution:
We have assembled a team in HPC that have been working to rectify this
problem, but, as indicated, we have not yet found a stable solution.
We have raised support calls with a number of vendors, worked through
a number of potential solution steps, and have a number of additional
steps planned.  We are hopeful that the situation can be
resolved by Thursday evening.  If we cannot find a stable solution by
then, we will spend Friday rolling back our entire I/O infrastructure to
the setup on 19th December 2010.  Although this will allow us to bring
up the IITAC, Lonsdale and Parsons queues, it will be a temporary fix
and it will not resolve our warranty issues.  If this is the course of
action required, we will then need to schedule a new downtime window
in a few weeks' time, resulting in more inconvenience, and spend the
intervening period working out a more permanent, stable
solution.

We understand that this outage is seriously inconveniencing some of you,
our colleagues (users).  We sincerely apologise for this unanticipated
outage and we thank you for your patience to date.  We will keep you
posted on developments, and please feel free to contact me if you want
a more detailed update.


Yours sincerely,
Geoff.


-- 
-----------------------------------------------------------------------
Geoff Bradley.  Ph.D., MBA.
Executive Director (Acting),
Trinity Centre for High Performance Computing,     
Lloyd Building,       |      Phone: ++353 1 896 3429 
Trinity College,      |      Fax  : ++353 1 679 8039 
Dublin 2.             |      Email: Geoff.Bradley at tchpc.tcd.ie
Ireland.              |      URL  : http://www.tchpc.tcd.ie/
-----------------------------------------------------------------------
Electronic mail to, from, or within Trinity College, Dublin, may be the 
subject of a request under the Freedom of Information Act.
-----------------------------------------------------------------------

