eth-hash icon indicating copy to clipboard operation
eth-hash copied to clipboard

Consider dropping type checks in Keccak256 methods

Open gsalgado opened this issue 7 years ago • 17 comments

The __call__() and new() methods there will check the type of their arguments using isinstance, and although that should be a cheap check, these are code paths where performance is paramount given the number of times they are called when we perform trie operations on a fully synced DB. For example, when doing a full sync around 10% of the time is being spent into isinstance calls. Even though they're probably not all originating from these checks, I think we should consider removing them as these two methods have type annotations that should make it clear enough what is wrong in case they crash when passed an argument of the wrong type

gsalgado avatar Apr 30 '18 11:04 gsalgado

So, it turns out that during a full sync we're spending so much time on isinstance() calls because we're making an insane amount of them: 21926745 calls in a 3 minute run, but only 421334 (2%) of those are coming from these codepaths, so we might not gain anything by dropping these checks after all

gsalgado avatar Apr 30 '18 12:04 gsalgado

@gsalgado I'm betting most of those isinstance checks are happening in the validation code within the EVM thats checking things like this.:

  • https://github.com/ethereum/py-evm/blob/master/evm/vm/memory.py#L45-L49
  • https://github.com/ethereum/py-evm/blob/master/evm/vm/gas_meter.py#L27
  • https://github.com/ethereum/py-evm/blob/master/evm/vm/stack.py#L38

pipermerriam avatar Apr 30 '18 14:04 pipermerriam

The biggest offenders are actually rlp and trie: rlp/codec.py:72(encode_raw), trie/validation.py:11(validate_is_bytes), rlp/sedes/serializable.py:300(make_immutable). Just those 3 seem to account to almost half of all isinstance() calls

gsalgado avatar Apr 30 '18 14:04 gsalgado

Hrm, I could get onboard with dropping the check from trie. It's not clear to me how to drop the check from rlp but we do call make_immutable on every argument during object instantiation. It'd be nice to drop this but I like that when you pass a list into an rlp object it gets cast to a tuple.

pipermerriam avatar Apr 30 '18 16:04 pipermerriam

I haven't looked at the specific code, but I wonder if the behavior of some of those checks could be duplicated by appropriately coded try/except blocks. According to the docs, try/except is very efficient when an exception is not raised: https://docs.python.org/3.6/faq/design.html#how-fast-are-exceptions.

davesque avatar Apr 30 '18 19:04 davesque

not the right solution but I did have a look at functools.singledispatch as a potential remedy for this but it's about 4x slower than a manual isinstance check (but way more elegant)

pipermerriam avatar Apr 30 '18 20:04 pipermerriam

rlp/codec.py:72(encode_raw) trie/validation.py:11(validate_is_bytes) rlp/sedes/serializable.py:300(make_immutable)

I think we can improve performance on all of these with assert statements. That way, default usages (and our tests) will work slower & safer, but we can disable asserts when running trinity client.

trie/validation.py:11(validate_is_bytes)

Trivial to move to asserts

rlp/sedes/serializable.py:300(make_immutable)

Not sure about this one. Can we simply disallow list as a valid input type, and use assert statements to reject lists?

rlp/codec.py:72(encode_raw)

This one is tougher, but here is one way to reduce the calls of isinstance by 2/3rds:

old code

    if isinstance(item, Atomic):
        if len(item) == 1 and item[0] < 128:
            return item
        payload = item
        prefix_offset = 128  # string
    elif not isinstance(item, str) and isinstance(item, collections.Sequence):
        payload = b''.join(encode_raw(x) for x in item)
        prefix_offset = 192  # list
    else:
        msg = 'Cannot encode object of type {0}'.format(type(item).__name__)
        raise EncodingError(msg, item)

new code

    if isinstance(item, Atomic):
        if len(item) == 1 and item[0] < 128:
            return item
        payload = item
        prefix_offset = 128  # string
    else:
        try:
            assert not isinstance(item, str)
        except AssertionError as exc:
            raise encoding_error(item) from exc

        try:
            payload = b''.join(encode_raw(x) for x in item)
        except TypeError as exc:
            raise encoding_error(item) from exc

        prefix_offset = 192  # list

def encoding_error(item)
        msg = 'Cannot encode object of type {0}'.format(type(item).__name__)
        return EncodingError(msg, item)

carver avatar Apr 30 '18 22:04 carver

I guess this should really be moved to py-evm/pyrlp/py-trie -- seems like we shouldn't remove the type checks in eth-hash until they are clearly a bottleneck.

carver avatar Apr 30 '18 22:04 carver

We can vendor the encode_raw implementation into py-trie. That's an easy optimization, allowing us to drop checks without any compromise to the actual rlp library. Bonus is that if we go ahead and vendor the decode as well we can drop the rlp dependency.

pipermerriam avatar Apr 30 '18 23:04 pipermerriam

I've always felt weird about disabling asserts since some libraries might incorrectly use them as a general exception throwing mechanism. Thoughts on that?

davesque avatar May 01 '18 00:05 davesque

I hope that having libraries break when you turn off assertions would be very rare. I would certainly consider that broken. Is that something you've run into before? @davesque

carver avatar May 01 '18 01:05 carver

@carver Actually, no thankfully. And I guess it's probably not a very big risk and we'd notice it pretty quickly if a library broke in optimized mode. But it does seems a bit undesirable that one should have to use -O to get decent performance. I've never run across a library or tool where that was the case. Maybe I've just never had occasion to use one that worked that way. Or is there a way to disable asserts on a per module basis? If so, that would probably be a decent way to go.

But maybe it's possible that some of that manual type checking could just be removed altogether since a wrong value type would lead pretty quickly to a fairly obvious error (that's why I suggested the try/except approach)? It just feels a bit like we might be trying to shoehorn the wrong concerns into a dynamically typed language.

davesque avatar May 01 '18 04:05 davesque

eurika! (maybe)

We should be able to get roughly the same level of confidence with type checking via mypy.

pipermerriam avatar May 01 '18 04:05 pipermerriam

@pipermerriam Yes, I agree with that. Even if third party users of our libraries didn't use mypy, it would still give some guarantees about their internal correctness and the interactions between them.

davesque avatar May 01 '18 04:05 davesque

Well, currently we have type definitions for eth-keys defined like this: https://github.com/ethereum/py-evm/tree/master/stubs

We can move this stuff to a submodule and add type definitions for anything else which is stable, share the submodule across repositories and get some stronger type safety checks across library boundaries, so even consumers of py-trie/eth-hash/whatever should be able to still get solid type safety checks.

pipermerriam avatar May 01 '18 14:05 pipermerriam

We should be able to get roughly the same level of confidence with type checking via mypy.

Which comes at the cost of requiring that any module that uses the method to use mypy to get a reasonable developer experience. For example, if a module uses encode_raw() and sends a str in as the item (without running mypy checks), we end up in an infinite recursion.

Infinite recursion also happens if running with python -O in the above example, but at least that was "opt-out" of nicer error messages, rather than mypy, which requires "opt-in" to get nicer error messages.

carver avatar May 01 '18 20:05 carver

It still doesn't sit well with me to disable assertions for performance gains.

pipermerriam avatar May 01 '18 20:05 pipermerriam