llvm icon indicating copy to clipboard operation
llvm copied to clipboard

[SYCL] Rewrite properties storage

Open rolandschulz opened this issue 1 year ago • 1 comments

Goal is to improve compilation time. Store all properties as map. A map is type-list of type-lists. The first entry of the inner list is the property key and the 2nd the property value. Allows efficient lookup. Runtime values are stored as base types. Allows efficient storage and retrieval. std::tuple usage is removed (slow to instanstiate).

rolandschulz avatar May 13 '24 23:05 rolandschulz

moved trivial changes into https://github.com/intel/llvm/pull/13777 to reduce size of this

rolandschulz avatar May 14 '24 06:05 rolandschulz

This is very interesting! More comments would be good, especially one describing how we use the properties as base-classes for the properties type.

let me know what else would benefit from comments

rolandschulz avatar May 22 '24 14:05 rolandschulz

  1. Do you have any data on the compile time improvements due to this?
  2. Any idea how the benefits are split between parsing (i.e. the feature isn't used by the customer, but still included in SYCL headers) vs when it's instantiated (customers uses some property)?
  3. Any idea what features/guarantees of std::tuple make it heavier than your approach?
  4. Should we be introducing features/builtins in FE to make std::tuple faster to compiler instead?

aelovikov-intel avatar May 22 '24 14:05 aelovikov-intel

  1. Do you have any data on the compile time improvements due to this?

I haven't done any before/after comparison (yet). Before I started I looked at the compilation trace of the header and noticed the impact of std::tuple instanstiation and very slow compilation for long property lists (in particular >10 properties) because of the recursion used for e.g. merge, get, has. While I created multiple prototypes I made sure that those issues are resolved. For doing before/after timing it matters a lot what one is interested in: e.g. short/long properties.

  1. Any idea how the benefits are split between parsing (i.e. the feature isn't used by the customer, but still included in SYCL headers) vs when it's instantiated (customers uses some property)?

I could imagine that it might benefit a bit given that the new implementation is shorter which should benefit parsing. But I haven't looked at any benchmarks for parsing. Note that the instantiation cost does affect users which don't use any properties. Properties are used internally by multiple of our extensions and they instantiate properties, e.g. as part of the default arguments. And it's not just empty property list because there are some default properties used.

  1. Any idea what features/guarantees of std::tuple make it heavier than your approach?

I didn't invent anything new. std::tuple being slow to compile is well documented. The most obvious problem was that std::tuple was being used in places a simple type list is sufficient (i.e. no runtime storage needed). detail::type_list doesn't require any work by the compiler to instantiate (it doesn't even have an implementation). For the runtime storage, the base class storage is more efficient because: 1) tuple requires extra overhead to support get by index (which isn't needed for us) 2) the std tuple implementations don't use base class storage but recursion for ABI/historic reasons. And templated recursion is always slow to instantiate because you need to instantiate each level. And additional having our own storage makes the constructor much simpler and faster to compile because we get the sorting of properties for free.

  1. Should we be introducing features/built-ins in FE to make std::tuple faster to compiler instead?

I don't think it is possible. You can't address the main issue of the recursive instanstiation with builtins. Suggestions for compilation speed improvements beyond this PR:

  • Other headers use tuple as type list. We should replace those with detail::type_list. This is a low hanging fruit we have to improve compilation speed.
  • It is easy to create a lightweight tuple with much faster compilation speed if one doesn't care about ABI combability. There are many of such on github (e.g. https://github.com/codeinred/tuplet). We could use something like that instead of the std::tuple internally in places we need the runtime storage of tuple but don't require it to be ABI compatibility with std::tuple. Note that even if we add such a lightweight tuple it would still be beneficial to integrate the storage directly into properties because of auto-sorting in the constructor.
  • The biggest remaining issue with compilation speed of properties is the sorting. I'm planning to upload a PR which uses insertion merge instead of qsort (used by mp_sort) for small lists where it is faster. But even insertion merge using template meta programming is quite slow. Here a built-in which does type-list sorting to replace the meta-programming could provide significant speedup.

rolandschulz avatar May 23 '24 02:05 rolandschulz

Given bench:

#include <sycl/ext/oneapi/properties/properties.hpp>

using namespace sycl::ext::oneapi::experimental;
namespace mp11 = sycl::detail::boost::mp11;

template<class N> struct K 
#if OLD
: detail::compile_time_property_key_base_tag
#endif
{
    static constexpr detail::PropKind Kind = (detail::PropKind)N::value;
}; 
template<class N> using V = property_value<K<N>, N>;
using IDs = mp11::mp_iota_c<100>;
#if OLD
using P = properties<mp11::mp_apply<std::tuple, mp11::mp_transform<V, IDs>>>;
#else
using P = properties<mp11::mp_transform<V, IDs>>;
#endif

//template<class K> using is_key = std::bool_constant<P::has_property_key<K>()>;
template<class N> using correct_value = std::is_same<decltype(P::get_property<K<N>>()), V<N>>;

static_assert(mp11::mp_all_of<IDs, correct_value>());

I see ~2s for the new version and ~3.4s for the old version.

rolandschulz avatar May 23 '24 05:05 rolandschulz

Hi @rolandschulz , thank you for the detailed reply to my questions. Can you please put most of it into the PR description? I think that reasoning is important to preserve/be found easy and I'd like to see it in-tree vs. hosting platform that can be changed in future (however unlikely that is).

aelovikov-intel avatar May 23 '24 15:05 aelovikov-intel

I see ~2s for the new version and ~3.4s for the old version.

I have similar numbers for your case (3.15s->1.7s), but

# Old
$ time clang++ -fsycl -include 'sycl/sycl.hpp' -x c++ /dev/null -c -o /dev/null

real    0m3.575s
user    0m3.431s
sys 0m0.143s
# New
$ time clang++ -fsycl -include 'sycl/sycl.hpp' -x c++ /dev/null -c -o /dev/null

real    0m3.634s
user    0m3.460s
sys 0m0.173s

To be clear, this is not a request/objection, just an observation.

aelovikov-intel avatar May 23 '24 16:05 aelovikov-intel

Looks good to me but @steffenlarsen is more fluent with this than I am, so I'm deferring the approval to him.

aelovikov-intel avatar May 24 '24 15:05 aelovikov-intel