More on the EU-INDIA-GRID

September 27, 2007
DISCLAIMER: You will find lots of computer-speak and jargon in this post. Non-geeks may take a peep, but they may not blame the author for not warning them beforehand in the very likely case that they do not understand the contents! ;)



The past couple of days have been very interesting and informative. Working hands-on with a grid, although experimental and problematic, is a great learning experience. To complicate matters, we are holding this workshop on the campus of the National Centre for Radio Astrophysics (NCRA), Pune, and not at C-DAC, due to constraints of space and several other factors that I cannot as yet fathom! Apart from the facility, we also had to use some of the machines in the NCRA lab, but were under severe constraints about their use. The constraints meant that we were not supposed to do any local installations of the client software for the grid middleware we had used to set up the grid. To make matters even more difficult, gLite made it mandatory for even the client machine to run Scientific Linux and not any other flavor of Linux. We had come up with an idea (my boss Dr Sandeep Joshi's idea, really) of using a diskless setup in the lab. The plan was something like this:
1) Install the different components of the grid middleware, known as gLite, onto systems specifically chosen for the purpose (machines belonging to C-DAC, over which we had supreme powers!)
2) Set up one of our machines to act as a server for the diskless clients (machines belonging to NCRA)
3) The NCRA machines would fetch the Scientific Linux kernel image from the server over NFS.
4) The diskless clients, now running Scientific Linux and using an NFS-mounted partition on the server for data storage, would import the display from another server running the gLite user interface (the GUI).
5) These diskless clients, unknown to the user, would also function as worker nodes for the grid. We also had four dedicated worker nodes, keeping in mind the tendency of users to do crazy things on a grid, including turning off their machines (as they are accustomed to doing at home!) in spite of our asking them not to.
6) Users would have to use this GUI to fire jobs on the grid.
7) We would be able to monitor the status of the queues and the system resources of the individual components of the grid, as well as the grid as a whole, from our master console through a 16-port KVM switch (an extremely useful gizmo that allows a sysadmin to monitor up to 16 machines and work on them using just one keyboard, mouse and monitor. Howzaaat? Without it, we would need a mouse, keyboard and monitor for each of those machines. Just think of the space saved!)
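For the curious, here is a rough sketch of what the diskless boot described above could look like on the server side. Every path, IP address and hostname below is invented purely for illustration; our actual lab configuration may well have differed:

```shell
# --- On the boot server (all names/addresses hypothetical) ---

# /etc/exports: export a read-only Scientific Linux root filesystem and a
# read-write scratch area to the lab subnet:
#
#   /export/sl-root   192.168.1.0/255.255.255.0(ro,no_root_squash,sync)
#   /export/scratch   192.168.1.0/255.255.255.0(rw,sync)

# A PXE boot entry (e.g. in /tftpboot/pxelinux.cfg/default) telling each
# diskless client to fetch the Scientific Linux kernel and mount its root
# filesystem over NFS:
#
#   LABEL sl-diskless
#     KERNEL vmlinuz-sl
#     APPEND root=/dev/nfs nfsroot=192.168.1.10:/export/sl-root ip=dhcp ro

# --- On a booted diskless client ---
# The display of the gLite user interface machine is "imported", for example
# via SSH X11 forwarding (hostname invented):
ssh -X griduser@glite-ui.example.org
```

This is a configuration sketch, not a recipe: details such as whether the kernel comes over TFTP or NFS, and how the display is exported, depend entirely on the local setup.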

Unfortunately, things did not exactly go as we would have wanted them to. The system administrator, Alessio Terpin, who came down from ICTP, Italy, to assist with the setup, would have none of our plan, and so we were forced to do some minor installations on the NCRA machines. We installed VMware on them (a tool to run virtual machines on a physical host. Using it, one can create the illusion of several machines, each with its own IP address that can even be pinged from a physical machine. User-mode Linux, for the geeks out there. The grand-daddy of the virtual CD-ROM software that gamers used to get around pesky games demanding that the CD be in the tray for the game to proceed! There, only a CD-ROM drive was spoofed; here, it is an entire system!)
Alessio worked his magic on the machines, and the clients were also configured to act as worker nodes. We had a lot of teething issues with the grid and often had to combat pesky problems without the liberty of simply rebooting the machines; not rebooting is the first thing a grid sysadmin has to remember. Users have their jobs running on the grid, and those jobs cannot just be terminated because you have goofed up somewhere.
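To give a flavor of what "firing a job" on such a grid looks like from the user interface, here is a hedged sketch of the gLite job lifecycle. The JDL contents and file names are made up for illustration, and the exact command names varied between gLite releases (earlier ones used `edg-job-*` and `glite-job-*` variants):

```shell
# A minimal JDL (Job Description Language) file, say job.jdl -- the
# executable and sandbox file names here are invented examples:
#
#   Executable    = "/bin/hostname";
#   StdOutput     = "std.out";
#   StdError      = "std.err";
#   OutputSandbox = {"std.out", "std.err"};

# Submit the job from the UI machine, saving the returned job ID to a file:
glite-wms-job-submit -a -o jobid.txt job.jdl

# Poll the job's status (Submitted -> Scheduled -> Running -> Done):
glite-wms-job-status -i jobid.txt

# Once it is Done, retrieve the output sandbox:
glite-wms-job-output -i jobid.txt
```

Behind the scenes, the workload management system matches the job to a computing element, which hands it to a batch queue on one of the worker nodes, which is why a goofed-up node cannot simply be rebooted mid-run.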

The whole exercise has taught me a great deal, and I just can’t wait for our machines to come back home to C-DAC, where we will experiment further with them and hopefully make the grid fit enough to be put to production use. Once it enters production (if it does, that is), it will no longer be a test-bed for experiments but a reliable platform for users to submit their computing-power-hungry programs. It’ll be fun.