[Smeagol-discuss] to:Lamjed

Sankar kesanakd at tcd.ie
Fri Feb 13 16:54:00 GMT 2009


Hi Lamjed,

>>> Dear smeagol users,
>>>
>>> I am using Smeagol on a new machine. To check that the compilation of
>>> the program works well, I ran a bulk calculation on 32 processors and
>>> a leads calculation, and the results agree with those from another
>>> machine. However, when I run the transport calculations (on 8, 16, 32,
>>> or 64 processors), I get the following message:
>>>
>>> * Maximum dynamic memory allocated =   103 MB
>>> firstiter
>>> siesta: reading saved Hamiltonian
>>> siesta: saved Hamiltonian not found
>>> gensvd: Leads decimation
>>> gensvd: Dim of H1 and S1 :    390
>>> gensvd: Rank of H1:            87
>>> gensvd: Rank of (H1,S1):      189
>>> gensvd: Decimated states:     114
>>> gensvd: Decimation from the left
>>> gensvd: Leads decimation
>>> gensvd: Dim of H1 and S1 :    390
>>> gensvd: Rank of H1:            87
>>> gensvd: Rank of (H1,S1):      189
>>> gensvd: Decimated states:     114
>>> gensvd: Decimation from the left
>>> gensvd: Leads decimation
>>> gensvd: Dim of H1 and S1 :    390
>>> gensvd: Rank of H1:            87
>>> gensvd: Rank of (H1,S1):      189
>>> gensvd: Decimated states:     114
>>> gensvd: Decimation from the left
>>> gensvd: Leads decimation
>>> gensvd: Dim of H1 and S1 :    390
>>> gensvd: Rank of H1:            87
>>> gensvd: Rank of (H1,S1):      189
>>> gensvd: Decimated states:     114
>>> gensvd: Decimation from the left
>>> rank 10 in job 1  r19i0n3_44341   caused collective abort of all ranks
>>>  exit status of rank 10: killed by signal 9
>>>
>>> When I run the same calculation on a single processor, it works fine.
>>>
>>> Do you have any idea about the origin of this problem?

******************************************************************************************

From the above message it is clear that node r19i0n3 crashed because the
job overloaded its memory. As far as I know, Smeagol needs a large memory
allocation at that point in the calculation; the same thing happened to me.
The only solution I can suggest is to use the -npernode flag instead of the
-np flag in your submission script. First, though, you need to work out a
suitable value for -npernode on your system. My suggestion is to submit the
Smeagol job and run 'qutil -u #Job No#', which shows all the node info.
Then ssh to one of the running nodes and run the 'top' command; there you
can see how much memory Smeagol is asking for before it crashes. Based on
that you can choose the -npernode value.

For example: in my case, when I ran 'top' on the node in question, Smeagol
needed 7GB per process before crashing. Since each node on Stokes holds
16GB across 8 processors, I could run only two processes per node (so that
each process gets access to 8GB). So I add '-npernode 2' to my submission
script.
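The arithmetic above can be sketched as a few shell lines (the numbers are
the ones from my example and are just placeholders; substitute what 'top'
reports on your own nodes):

```shell
#!/bin/sh
# Example figures only: a Stokes node with 16GB RAM and 8 cores,
# and 'top' showing one Smeagol process peaking at about 7GB.
node_mem_gb=16
per_proc_gb=7
cores_per_node=8

# Memory limits how many processes fit on a node...
npernode=$(( node_mem_gb / per_proc_gb ))
# ...but never ask for more processes than there are cores.
if [ "$npernode" -gt "$cores_per_node" ]; then
  npernode=$cores_per_node
fi
echo "$npernode"   # prints 2 for these numbers
```

With 16GB and 7GB per process this gives '-npernode 2', matching the value
I use.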
This is also why you get no errors when you run Smeagol in serial: in that
case the single process has access to the full 16GB and you only use the
memory the run actually requires. This is fine if your system needs more
than 8GB of dynamic memory for the Smeagol run, but if it needs less than
8GB it is advisable to use this flag so the resources are used properly.
You can also change the MKL_NUM_THREADS value in order to group the
processors per node. In my case I use 'MKL_NUM_THREADS=4' together with
'-npernode 2'.
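Putting those two settings together, the relevant part of a submission
script might look like the sketch below. The binary name, input file, and
mpirun flavour are placeholders (I am assuming an mpirun that accepts
-npernode, as Open MPI does; other MPI implementations use a different
flag, e.g. -ppn), so adapt them to your installation:

```shell
#!/bin/sh
# Let MKL use 4 threads inside each MPI process.
export MKL_NUM_THREADS=4

# Place only 2 processes on each node so each one can reach ~8GB.
# 'smeagol.x' and 'input.fdf' are placeholder names.
mpirun -npernode 2 ./smeagol.x < input.fdf > output.out
```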
I hope this explanation helps. You are welcome to ask any further
questions.

Sankar

