tmLQCD icon indicating copy to clipboard operation
tmLQCD copied to clipboard

LapH crashes in latest master

Open urbach opened this issue 13 years ago • 12 comments

With all the changes we introduced LapH doesn't work anymore.

The main reason for this is that it is switched on with a configure switch, and therefore, it will never be tested by anyone. So this needs fixing...! Why is this a configure switch at all?

@ggscorzato: could you please fix this!

urbach avatar Dec 11 '12 16:12 urbach

This will also include adding a "repro" mode to the random jacobi field function, otherwise it breaks reproducibilty.

kostrzewa avatar Dec 12 '12 09:12 kostrzewa

I think when real interleaving is added this will break even more.

kostrzewa avatar Dec 12 '12 09:12 kostrzewa

yes, its a mess...

We need to clean the geometry stuff and/or remove one of the versions, its not maintainable...!!

urbach avatar Dec 12 '12 11:12 urbach

LapH was broken because timesplitpar was broken.

This happened because the functions tm_sub_Hopping_Matrix and tm_times_Hopping_Matrix assumed some options of hopping matrix that did not include the time-split one.

I have fixed that.

ggscorzato avatar Dec 13 '12 09:12 ggscorzato

about the geometry and the LapH, I made some proposals that were rejected. I guess we should organize a phono-conf only for that.

ggscorzato avatar Dec 13 '12 09:12 ggscorzato

We will also need to move the random_jacobi_field function into start.c and give it the ability to produce random numbers reproducibly, like the other random_* functions. I have introduced some generalizations into the functions in start.c to reduce code duplication:

https://github.com/etmc/tmLQCD/pull/193

In particular I have removed unif_su3_vector and generalized random_su3_vector instead. Those two functions are now also limited in scope to this file so that they cannot be called externally to prevent hiccups of the random number generator in "repro" mode (reproduce_randomnumber_flag = 1).

kostrzewa avatar Dec 13 '12 09:12 kostrzewa

I don't think that we rejected anything, and as far as I understand we were still in a discussion about it. However, I think it was a good decision to do these changes not in parallel to all the stuff we did during the last weeks and are still doing.

But it became also very clear that as it is now it is not maintainable. Many of the new features will not work with timesplitpar and/or indexindepgeometry, and its not even clear to me what of this is still needed. I still don't understand why we need all the gI_L_0_0_0 things, because they are already in g_ipt[][][][]. So, we need a discussion, maybe on a Wiki page first, and then we could have a phone conference. This is at least my opinion.

urbach avatar Dec 13 '12 10:12 urbach

I think the gI_* structures could be used to make something truly independent of the content of the Index function by relabelling the coordinate system in the order (first parallelized direction, second parallelized direction, third parallelized direction, fourth parallelized direction).

In this way, for instance for an XYZT parallelization, gI_L_0_0_0 would correspond to a coordinate (0,LX,0,0) in the 'normal' (t,x,y,z) format. For the standard TXYZ parallelization it would correspond to (T,0,0,0). In this way it should be possible to express all communicators in a fully uniform fashion without any sort of arithmetic in the communicator itself.

The neighborhood arrays could then consist of g_nb_[0,1,2,3]_up/dn which correspond to the first, second, third and fourth parallelized directions. The same goes for the "edge" and "rand" memory sizes, gathers etc... So in XYZT parallelization the "01" edge would be XY, in TXYZ it would be TX.

kostrzewa avatar Dec 13 '12 12:12 kostrzewa

The same would work using g_ipt, or what am I missing?

urbach avatar Dec 13 '12 15:12 urbach

I need to think about it for a while... I guess you're right though.

kostrzewa avatar Dec 13 '12 15:12 kostrzewa

I defined gI_* for reasons of efficiency. I was worried that loading the huge g_ipt[][][][] array within the communication routine it would waste the cache. But I could be wrong. I will test it. Anyway I agree that it would make the code much more readable, and it is probably worth do it just for that.

ggscorzato avatar Dec 13 '12 15:12 ggscorzato

I don't know, I'm just trying to understand. Performance could be an issue...

urbach avatar Dec 13 '12 15:12 urbach