FAQ

Cluster FAQs

What are the things I should never do on the cluster?

  • Do NOT write your results directly to the home: in the cluster, the home is centralized on a common server which is then mounted by all the nodes. This means that each time a node writes to the home there is a network transfer. Multiply this by the number of nodes and the number of users and you can see that things can go pretty bad. On the other hand, the /tmp/ dir is local to the node, so no network here, good! (See the sketch after this list.)
  • Do NOT output multiple GB of data: are you sure that each one of your jobs needs to output 4 GB of data? Isn't it better to store only the data you need? Even better, why don't you process the data directly in your code without saving it to a file? Saving several GB of data can fill up the disk of a node, blocking that node for the other users as well.
  • Do NOT transfer GB of data from majorana to your PC: again, the network is limited, so preprocess the data on the nodes, compress it, and only then transfer it.
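
As a minimal sketch of the /tmp/ workflow (the script and program names below are only placeholders, adapt them to your own jobs):

  #!/bin/bash
  # inside a job: write everything to the node-local /tmp/, not to the home
  OUTDIR=/tmp/$USER/job_$$              # per-user, per-job scratch dir on the local disk
  mkdir -p "$OUTDIR"
  ./my_program > "$OUTDIR/results.txt"  # the heavy output stays on the local disk
  # only at the very end, move the (small) result to the central home in one transfer
  mv "$OUTDIR/results.txt" "$HOME/results/"
  rmdir "$OUTDIR"                       # leave the node's /tmp/ clean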

I have weird results, is the cluster broken?

There may be cases in which weird results are due to the cluster (e.g. disk full, network failure, etc.), but in 99% of the cases it is due to a bug in your code. Double check your code. Remember, valgrind and gdb are your friends. If you still think that your code is correct, ask around: is anybody else having the same problem? If somebody else is having the same problem, then it might be the cluster. Check with the cluster admins. Otherwise, sorry, it's your code, check it again.
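
If you want to see where your code actually blows up, a typical gdb session looks like the sketch below (this is generic debugging, not cluster-specific; my_program and input.dat are placeholder names):

  gcc -g -O0 my_program.c -o my_program   # compile with debug symbols and no optimization
  gdb --args ./my_program input.dat       # start the program under the debugger
  (gdb) run                               # run until it crashes
  (gdb) backtrace                         # show the call stack at the point of the crash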

Should I compress my data before sending? Isn’t it going to put a lot of stress on majorana?

Always remember that, in the cluster, CPU is abundant while network is scarce. This means that you should always try to minimize the amount of data to send. Even better, are you sure you need all the data? Don't you just need to extract one or a few values? Then process your data directly on the nodes, and copy only those values. You can insert your R script at the end of your normal script when launching the jobs.
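
For example, a job script could end with a small R step that reduces the raw output to the few values you actually need (a sketch with placeholder names: my_simulation, summarize.R):

  #!/bin/bash
  ./my_simulation > /tmp/$USER/raw_output.txt          # the big raw data stays on the node
  Rscript summarize.R /tmp/$USER/raw_output.txt \
          > /tmp/$USER/summary.txt                     # reduce it to a handful of values
  cp /tmp/$USER/summary.txt "$HOME/results/run_$$.txt" # only the tiny summary crosses the network
  rm /tmp/$USER/raw_output.txt /tmp/$USER/summary.txt  # clean up the node's /tmp/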

Should I send one big file or many small files?

To put less stress on the network and on the disk system, it's always better to send one big file than a million tiny files.
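
For example, assuming your results sit in a directory called results/ (just a placeholder), bundle and compress them first, and then do a single transfer from your own machine:

  tar czf results.tar.gz results/            # one compressed archive instead of thousands of small files
  # then, from your PC:
  scp username@majorana:results.tar.gz .     # a single transfer over the network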

What is s1-0? Is it the same as majorana?

When you log in to majorana (# ssh username@majorana) you are automagically redirected to the s1-0 node. As a normal user you are no longer allowed to log in to majorana (the front-end). This is done to prevent careless users from messing up the queuing system. Note that s1-0 is a normal node. In fact it's the node formerly known as c1-0. For this reason you can do all your computation stuff there without being afraid of breaking anything. You can use s1-0 to compile, test, compress, analyze, run R, run gnuplot, run python, etc. Do not use it to run your bulk scripts (submit them to the queues instead). Also, do not use the cluster to store your files (there is fallopius for that).

What should I do when my jobs fail or I cancel them?

When your jobs are running, they should produce their output in /tmp/ (you are storing the output in /tmp/, right? right?). Normally you move the results from the local /tmp/ to the central disk where your home is. If your jobs are cancelled before the script can move the results out of the /tmp/ dir, then the data is still there. If that data is not removed, the disk of one or more nodes can fill up, causing problems for all the users. Once in a while you should delete the leftover data in the /tmp/ dir by running # clean_temp_dirs.sh, which goes over all nodes and deletes your files in the /tmp/ dir.
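
If you want your jobs to clean up after themselves even when they are cancelled, one common bash trick (not part of the cluster tooling, just a sketch with placeholder names) is to register a cleanup trap:

  #!/bin/bash
  OUTDIR=/tmp/$USER/job_$$
  mkdir -p "$OUTDIR"
  trap 'rm -rf "$OUTDIR"' EXIT          # wipe the scratch dir when the script exits, for any reason
  trap 'exit 1' TERM INT                # make sure the EXIT trap also runs when the job is killed
  ./my_program > "$OUTDIR/results.txt"
  cp "$OUTDIR/results.txt" "$HOME/results/"

Note that a hard kill (SIGKILL) cannot be trapped, so you should still run clean_temp_dirs.sh from time to time.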

I have a special need (nodes, queues, pvm, memory, bla bla bla)

Come speak with us, and if you have been good you will receive a gift this Christmas.

Why does jobstats now give strange numbers for the “pending jobs”?

3500 is the total number of jobs you can have in the system, both running and in queue. Before, jobstats was incorrectly showing 3500 as the number of jobs you could still put in queue. This is not correct: if you already have 250 jobs in the system, you can only submit 3250 more. Use this command to see the total number of jobs that you have in the system (running + in queue): # qstat | wc | awk '{print $1 - 2}' (the “- 2” removes the two header lines printed by qstat).

My jobs are dying and I don’t know why

If your jobs die, it's very often because they exceed the memory limit or the time limit of the queue. Check here to see the limits for each queue. Use valgrind to see if you have memory leaks.
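
A typical valgrind invocation looks like the following (my_program and its arguments are placeholders); run it on s1-0 on a small test case, since valgrind slows execution down considerably:

  valgrind --leak-check=full ./my_program small_input.dat   # reports leaked blocks and where they were allocated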

I went in the cluster rooms and I’m preeeetty sure I saw a sheep there, is it normal?

Don’t ask. Seriously, don’t. And yes, the sheep is responsible for all the problems of the cluster. It’s not us, it’s the sheep.

I have another question

Send an email to cluster_admins[at]iridia.you.know.the.rest.
