Feature/add lifo storage
This adds a LIFO storage utility that will be used once backward propagation is integrated in wave propagation. The utility first stores data on the GPU, then in host memory, and finally on disk. At least one buffer must be stored on both the device and the host.
@corbett5 Seems like something that could/should be in LvArray?
Yeah perhaps, although I think it could be cleaned up first regardless, both for performance and to remove complexity. For example, since all the entries in the queue are the same size I would use something like the following.
template< typename T >
struct FixedSizeDeque
{
  FixedSizeDeque( IndexType maxEntries, IndexType valuesPerEntry, LvArray::MemorySpace space )
  {
    m_storage.resizeWithoutInitializationOrDestruction( maxEntries, valuesPerEntry, space );
  }

  bool full() const;

  ArraySlice1d< T const > front() const;
  ArraySlice1d< T const > back() const;

  void pop_back();
  void pop_front();

  void emplace_back( ArraySlice1d< T const > const & src );
  void emplace_front( ArraySlice1d< T const > const & src );

  Array2d< T > m_storage;
  IndexType m_begin = 0;
  IndexType m_size = 0;
};
You could use LvArray::memcpy to do all the copying, which will automatically figure out the memory spaces for you. You could make communication between host and device truly asynchronous using streams. You could make pop_back, emplace_back and whatnot thread-safe using atomics and negate the need for locks. This would also be significantly more performant since it would eliminate the need for a memory allocation in every single push.
This wouldn't take care of pushing stuff out to disk obviously, but I think it would completely eliminate the need for futures or threads when just considering GPU and host.
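To make the intent concrete, here is a minimal sketch of what emplace_back and pop_front could look like with that layout. It assumes LvArray::memcpy copies between two 1D slices (as described above) and uses the begin/size bookkeeping from the struct; it is an illustration, not the code from this PR.

// Sketch only: ring-buffer bookkeeping over the fixed-size 2D storage.
template< typename T >
void FixedSizeDeque< T >::emplace_back( ArraySlice1d< T const > const & src )
{
  GEOSX_ERROR_IF( full(), "The deque is full." );
  // The slot right after the current back, wrapping around the storage.
  IndexType const pos = ( m_begin + m_size ) % m_storage.size( 0 );
  // memcpy resolves the source and destination memory spaces itself.
  LvArray::memcpy( m_storage[ pos ], src );
  ++m_size;
}

template< typename T >
void FixedSizeDeque< T >::pop_front()
{
  GEOSX_ERROR_IF( m_size == 0, "The deque is empty." );
  // No data movement is needed: the front slot is simply released.
  m_begin = ( m_begin + 1 ) % m_storage.size( 0 );
  --m_size;
}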
Thanks for the feedback, I'll update that. @corbett5, just a question: I don't know LvArray well, but if I understand it correctly, for one LvArray the memory is allocated in both host and GPU memory. Am I wrong? If I'm right, that is why I chose to use raw pointers and manage memory myself. Could you tell me if I'm wrong?
Ok, I think I understand now, I was wrong: resizeWithoutInitializationOrDestruction() takes the space for allocation as a parameter.
Exactly.
@corbett5 I tried to implement the FixedSizeDeque but the first emplace_front fails on the memcpy:
void emplace_front( const ArraySlice1D & src )
{
  LvArray::memcpy( m_storage[--m_begin], src );
}
src is supposed to be in CUDA space, and m_storage in CUDA space too.
void push( arrayView1d< T > array )
{
  std::cout << array[1] << std::endl;
  array.move( LvArray::MemorySpace::cuda, false );
  m_deviceDeque.emplace_front( array.toSliceConst() );
  push( array.data(), array.size() );
}
Any idea what I did wrong?
Is it a compile time error or a runtime error? Can you post the message? And if you push your code so I can take a look that would help too.
Just looking at how you implemented it, you'll need to add some logic for wrapping around m_begin when calling emplace_front. For example, if m_begin = 0 before the call, then you'll try to access m_storage[ -1 ], which is undefined (and should result in a runtime error with a nice message in debug). You'll need to handle this case as well as the case where the queue is full and you need to remove the end of the queue to make space. How exactly this is done depends on whether you keep track of begin and end, or begin and size.
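For instance, a minimal sketch of the wrap-around that emplace_front needs under the begin/size bookkeeping (again an illustration, not the code from this PR):

template< typename T >
void FixedSizeDeque< T >::emplace_front( ArraySlice1d< T const > const & src )
{
  GEOSX_ERROR_IF( full(), "The deque is full." );
  IndexType const capacity = m_storage.size( 0 );
  // When m_begin is 0, wrap to the last slot instead of indexing m_storage[ -1 ].
  m_begin = ( m_begin + capacity - 1 ) % capacity;
  LvArray::memcpy( m_storage[ m_begin ], src );
  ++m_size;
}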
Also, what is that second block of code that shows void push( arrayView1d< T > array )? Is that another error?
Hello, thanks for your help. It's a runtime error (segfault), and you saw correctly: I wanted to use an emplace_back(), and this emplace_front() definitely won't work.
Thanks for your comments, that really helped. I made plenty of errors in my FixedSizeDeque. I can now pass the testFixedSizeError, but the testLifoStorage won't pass because only values from the first buffer are popped out of the LIFO... Still have some work to do on it.
To simplify, for now I only use a deviceDeque to store my buffers and left the old code that needs to be updated. I only emplace_front into that deviceDeque each time I do a push on the LIFO, and retrieve and pop the front when I call lifo.pop().
I think I found out that for some reason this is wrong:
void push( arrayView1d< T > array )
{
  std::cout << __FILE__ << ":" << __LINE__ << " array[i] " << array[1] << std::endl;
  array.move( LvArray::MemorySpace::cuda, false );
  m_deviceDeque.emplace_front( array.toSliceConst() );
  array1d< T > a( array.size() );
  LvArray::memcpy( a.toSlice(), m_deviceDeque.front() );
  std::cout << __FILE__ << ":" << __LINE__ << " array[i] " << a[1] << std::endl;
  push( array.data(), array.size() );
}
The two prints are different.
EDIT: that move is useless except for my old code that does hardcoded copies. EDIT2: and that move is in fact responsible for the error... I don't exactly get why.
This is entirely possible, and it all depends on the state of array when you pass it into this function. In general, accessing array on the host is not allowed until you move it to the host. Here is an example of how this might happen:
// Create an array on the host and populate it with values.
Array1d< int > foo( 10 );
for( int i = 0; i < 10; ++i )
{
foo[ i ] = -i;
}
// Create a view into foo. It is only allocated on the host and its active pointer is on the host.
ArrayView1d< int > const fooView = foo.toView();
// Capture a copy of fooView on device and fill it with new values
forAll< cuda_exec >( 10, [fooView] ( int i )
{
fooView[ i ] = i;
} );
// At this point foo is allocated on both the host and device and was last touched on device.
// The host allocation has {0, -1, -2, ... } and the device allocation has {0, 1, 2, ...}.
// However the active pointer of both foo and fooView is still to the host allocation!
for( int i = 0; i < 10; ++i )
{
GEOSX_ERROR_IF_NE( foo[ i ], -i );
GEOSX_ERROR_IF_NE( fooView[ i ], -i );
}
// We can update the active pointer manually by calling `move`.
// Since the data is already allocated on the device and was last touched there
// all this does is update the active pointer. But only for the object `move` was called on!
foo.move( LvArray::MemorySpace::cuda, false );
int * fooPtr = foo.data();
// fooPtr now points to the device allocation. Calling foo[ ... ] on host would hopefully segfault.
forAll< cuda_exec >( 10, [fooPtr] ( int i )
{
GEOSX_ERROR_IF_NE( fooPtr[ i ], i );
} );
// fooView's active pointer is still to the host allocation.
for( int i = 0; i < 10; ++i )
{
GEOSX_ERROR_IF_NE( fooView[ i ], -i );
}
LvArray::memcpy will figure out where both the source and destination were last updated and execute the copy directly between those allocations, regardless of what their active pointers are.
https://github.com/GEOSX/LvArray/blob/b146a6f3bb8326bda2513226b71cf2e6b710c350/src/memcpy.hpp#L165
Ok, thanks for the explanation. I thought that LvArray would ensure data coherency between host and device and forbid accessing foo on the host when it has been changed on device without doing a memcpy first.
Hello, I think there is something I don't get in the LvArray philosophy. I don't understand why this is OK (everything is run with serialPolicy for now):
array.move( RAJAHelper< POLICY >::space );
float * dataPointer = array.data();
forAll< POLICY >( elemCnt, [=] GEOSX_HOST_DEVICE ( int i ) { dataPointer[ i ] = j*elemCnt+i; } );
forAll< POLICY >( elemCnt, [=] GEOSX_HOST_DEVICE ( int i ) { PORTABLE_EXPECT_EQ( array[ i ], j*elemCnt+i ); } );
while this is not
array.move( RAJAHelper< POLICY >::space );
float * dataPointer = array.data();
forAll< POLICY >( elemCnt, [=] GEOSX_HOST_DEVICE ( int i ) { array[ i ] = j*elemCnt+i; } );
forAll< POLICY >( elemCnt, [=] GEOSX_HOST_DEVICE ( int i ) { PORTABLE_EXPECT_EQ( array[ i ], j*elemCnt+i ); } );
Ok, thanks for the explanation. I thought that LvArray would ensure data coherency between host and device and forbid accessing foo on the host when it has been changed on device without doing a memcpy first.
Well, there's not really an easy way to do this. operator[] needs to be as fast as possible, so it can't check if the data is up to date on the host. The takeaway is that you really can only access the data an LvArray object contains in two situations: either you capture a view in a RAJA lambda, or you explicitly move it to the host and then access it outside a lambda.
If array is an LvArray::ArrayView then the second example should work. However, if array is an LvArray::Array then when it's captured by value in the first lambda a completely new array is created, just like if you did the following:
Array1d< int > foo( 4 );
foo[ 0 ] = -1;
// Create `bar` as a copy of `foo`. This has an entirely new allocation.
Array1d< int > bar( foo );
EXPECT_NE( foo.data(), bar.data() );
Compare this to the case when you use views
Array1d< int > foo( 4 );
ArrayView1d< int > const fooView = foo;
fooView[ 0 ] = -1;
// Create `barView` as a copy of `fooView`.
ArrayView1d< int > const barView = fooView;
EXPECT_EQ( fooView.data(), barView.data() );
Thanks for the explanation, I get it now. With a [&array] capture it might also have worked, I guess.
@corbett5 I updated the code. I added streams, but I still need threads because I need to wait for an available buffer in my queues. I added two threads to be able to do device-to-host and host-to-device copies concurrently. I removed the futures and used condition_variable instead. I tested it with my unit test, but I still need to validate it with a large test and measure the improvement. In my previous test I could go from 850s to 757s on a large test case in which I save snapshots at each time step (which is more than in reality, I think). I hope the new version will be better.
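For reference, a generic sketch of the kind of condition_variable wait described here, in plain standard C++ (this is not the actual PR code; the queue of ints stands in for the queued buffers):

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical bounded queue: the pushing thread blocks until a slot is free,
// and the popping thread notifies once it has released one.
std::mutex queueMutex;
std::condition_variable queueNotFull;
std::deque< int > bufferQueue;
constexpr std::size_t maxQueueSize = 4;

void pushWhenSpaceAvailable( int buffer )
{
  std::unique_lock< std::mutex > lock( queueMutex );
  // Sleep until the queue has room; spurious wake-ups are handled by the predicate.
  queueNotFull.wait( lock, [] { return bufferQueue.size() < maxQueueSize; } );
  bufferQueue.push_back( buffer );
}

int popAndNotify()
{
  int front;
  {
    std::lock_guard< std::mutex > lock( queueMutex );
    front = bufferQueue.front();   // assumes the queue is non-empty
    bufferQueue.pop_front();
  }
  queueNotFull.notify_one();
  return front;
}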
On a specially created test case where I save snapshots at each time step, I went from a reference of ~750s on 3 nodes of 6 GPUs to 232s just by moving to O_DIRECT writing, without using the LIFO. The LIFO run took ~670s without O_DIRECT and 198s with it. Without O_DIRECT, the system I/O buffering was leading to very long writes when the buffer was flushed (x1000 compared to buffered writing), which desynchronized the MPI ranks and created imbalance. With the LIFO I can cover the whole 32s of writing with computation.
Just noticed that O_DIRECT needs memory alignment, so my tests may not be valid... I'll fix it.
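For context, a minimal POSIX sketch of the alignment O_DIRECT requires (the 4096-byte block size is an assumption here; the real requirement depends on the filesystem and device, and the function name is hypothetical):

// O_DIRECT generally requires the buffer address, the file offset and the
// transfer size to all be multiples of the device block size.
#define _GNU_SOURCE            // needed for O_DIRECT with some libcs
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

bool writeDirect( char const * path, std::size_t const size )
{
  std::size_t constexpr alignment = 4096;
  std::size_t const alignedSize = ( size + alignment - 1 ) / alignment * alignment;

  // Allocate an aligned buffer; a plain malloc'd pointer would make write() fail with EINVAL.
  void * buffer = nullptr;
  if( posix_memalign( &buffer, alignment, alignedSize ) != 0 )
  { return false; }

  int const fd = open( path, O_WRONLY | O_CREAT | O_DIRECT, 0644 );
  if( fd < 0 )
  { free( buffer ); return false; }

  // ... fill buffer with the data to dump ...
  bool const ok = write( fd, buffer, alignedSize ) == static_cast< ssize_t >( alignedSize );

  close( fd );
  free( buffer );
  return ok;
}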
@rrsettgast, @corbett5, I finally found the bug that led to a deadlock: it was an uncaught erroneous GEOSX_THROW_IF. I fixed the test, but there is still no try/catch. Should I change it to a GEOSX_ERROR_IF instead? What is the good practice in GEOSX for exceptions/errors?
What is the good practice in GEOSX for exceptions/errors?
The exceptions are used exclusively for compatibility with Python. We don't try to do any error handling inside GEOSX itself, and in C++ land a GEOSX_ERROR and an exception are both meant to be fatal errors that abort the current simulation. However, when using pygeosx we can catch the exceptions and let the user try to start a new simulation after fixing their mistake. As such, exceptions should be used whenever the issue may have come from user input.
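Concretely, the distinction might look like this (hypothetical checks and variable names, with InputError used as an example exception type; the exact macro arguments may differ):

// Internal invariant that can only be violated by a bug in the code itself: abort.
GEOSX_ERROR_IF( m_size > m_storage.size( 0 ), "LIFO bookkeeping is corrupted." );

// Condition that may come from user input (e.g. a bad value in the input file):
// throw, so that pygeosx can catch it and let the user fix the input.
GEOSX_THROW_IF( numberOfBuffersOnDevice <= 0,
                "The number of buffers on device must be strictly positive.",
                InputError );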
Ok, then I'll change it to an ERROR, as this should not occur. Thanks.
Did the change from throw to error; this should be ready for merge now.
@corbett5 thanks a lot for your comments and suggestions! do you think this is ready to go into the merge queue?
so is this PR actually ready to be merged?
Yes it is ready