petals
Allow filtering by max sequence length
Problem: if some (but not all) servers support a longer sequence length, inference at that length would be very inefficient because the client would constantly bump into servers with shorter limits.
Suggested solution: if servers report their max sequence length to the DHT, a client will be able to filter by sequence length as it reads DHT entries.
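A minimal sketch of the client-side filtering step, assuming a hypothetical `ServerInfo` record with a `max_seq_len` field that each server announces to the DHT (the field name and structure are illustrative, not the actual Petals API):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerInfo:
    # Hypothetical DHT entry: peer id plus the max sequence length the server reports
    peer_id: str
    max_seq_len: int


def filter_servers(servers: List[ServerInfo], needed_seq_len: int) -> List[ServerInfo]:
    """Keep only servers whose reported limit can handle the requested sequence length."""
    return [s for s in servers if s.max_seq_len >= needed_seq_len]


# Example: a client requesting 8192-token inference skips 2048-token servers
servers = [ServerInfo("peerA", 2048), ServerInfo("peerB", 8192)]
print([s.peer_id for s in filter_servers(servers, 8192)])  # → ['peerB']
```

With this filter applied while reading DHT entries, route planning would only consider servers that can actually serve the requested length.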
Could the max seq len for forward/backward also be reported? And since this issue's solution is to filter by sequence length, could the absolute max seq length be tied to the host's configured max, so that 65B could be trained and inferred above 2048, please? Only if a server chooses to host beyond that, of course. The Petals team is awesome, btw :)
@Jeduh, I agree that it would make sense. Just FYI, right now the limits are 8192 for Llama 2 (70B, 70B-Chat) and 2048 for all other models.
@justheuristic What do these numbers 256 * 1024 * 1024 represent?