gemma.cpp: unexpected crash in an unexpected place

I'm trying to use Gemma with Unreal Engine and got an exception in an unexpected place. Call stack:

gcpp::N_AVX2::Decompress<gcpp::CompressedArray<hwy::bfloat16_t,524288000>,float>(gcpp::CompressedArray<hwy::bfloat16_t,524288000> const & __ptr64,unsigned __int64,float * __ptr64,unsigned __int64) 0x000002528cc071b0
gcpp::N_AVX2::EmbedToken<gcpp::ConfigGemma2B<gcpp::SfpStream> >(int,unsigned __int64,unsigned __int64,gcpp::CompressedWeights<gcpp::ConfigGemma2B<gcpp::SfpStream>,void> const & __ptr64,gcpp::RowVectorBatch<float> & __ptr64) 0x000002528cd59486
`gcpp::N_AVX2::PrefillState::Prefill<gcpp::ConfigGemma2B<gcpp::SfpStream> >(hwy::Span<hwy::Span<int const > const > const & __ptr64,unsigned __int64,unsigned __int64,unsigned __int64,gcpp::CompressedWeights<gcpp::ConfigGemma2B<gcpp::SfpStream>,void> const & __ptr64,gcpp::RuntimeConfig const & __ptr64,hwy::Span<gcpp::KVCache> const & __ptr64) __ptr64'::`1'::<lambda_1>::operator()(unsigned __int64,unsigned __int64)const __ptr64 0x000002528cd592f6
hwy::ThreadPool::Run<`gcpp::N_AVX2::PrefillState::Prefill<gcpp::ConfigGemma2B<gcpp::SfpStream> >(hwy::Span<hwy::Span<int const > const > const & __ptr64,unsigned __int64,unsigned __int64,unsigned __int64,gcpp::CompressedWeights<gcpp::ConfigGemma2B<gcpp::SfpStream>,void> const & __ptr64,gcpp::RuntimeConfig const & __ptr64,hwy::Span<gcpp::KVCache> const & __ptr64) __ptr64'::`1'::<lambda_1> >(unsigned __int64,unsigned __int64,`gcpp::N_AVX2::PrefillState::Prefill<gcpp::ConfigGemma2B<gcpp::SfpStream> >(hwy::Span<hwy::Span<int const > const > const & __ptr64,unsigned __int64,unsigned __int64,unsigned __int64,gcpp::CompressedWeights<gcpp::ConfigGemma2B<gcpp::SfpStream>,void> const & __ptr64,gcpp::RuntimeConfig const & __ptr64,hwy::Span<gcpp::KVCache> const & __ptr64) __ptr64'::`1'::<lambda_1> const & __ptr64) __ptr64 0x000002528cd591ce
gcpp::N_AVX2::PrefillState::Prefill<gcpp::ConfigGemma2B<gcpp::SfpStream> >(hwy::Span<hwy::Span<int const > const > const & __ptr64,unsigned __int64,unsigned __int64,unsigned __int64,gcpp::CompressedWeights<gcpp::ConfigGemma2B<gcpp::SfpStream>,void> const & __ptr64,gcpp::RuntimeConfig const & __ptr64,hwy::Span<gcpp::KVCache> const & __ptr64) __ptr64 0x000002528cd582ab
gcpp::N_AVX2::GenerateT<gcpp::ConfigGemma2B<gcpp::SfpStream> >(std::unique_ptr<unsigned char [0],hwy::AlignedFreer> const & __ptr64,gcpp::Activations & __ptr64,gcpp::RuntimeConfig const & __ptr64,hwy::Span<hwy::Span<int const > const > const & __ptr64,unsigned __int64,unsigned __int64,hwy::Span<gcpp::KVCache> const & __ptr64,hwy::ThreadPool & __ptr64,gcpp::TimingInfo & __ptr64) 0x000002528cd57587
gcpp::N_AVX2::GenerateSingleT<gcpp::ConfigGemma2B<gcpp::SfpStream> >(std::unique_ptr<unsigned char [0],hwy::AlignedFreer> const & __ptr64,gcpp::RuntimeConfig const & __ptr64,hwy::Span<int const > const & __ptr64,unsigned __int64,gcpp::KVCache & __ptr64,hwy::ThreadPool & __ptr64,gcpp::TimingInfo & __ptr64) 0x000002528cd3a7c2
gcpp::GenerateSingle(gcpp::ConfigGemma2B<gcpp::SfpStream>,std::unique_ptr<unsigned char [0],hwy::AlignedFreer> const & __ptr64,gcpp::RuntimeConfig const & __ptr64,hwy::Span<int const > const & __ptr64,unsigned __int64,gcpp::KVCache & __ptr64,hwy::ThreadPool & __ptr64,gcpp::TimingInfo & __ptr64) 0x000002528cd3a1f4
gcpp::CallForModel<gcpp::SfpStream,gcpp::GenerateSingleT,std::unique_ptr<unsigned char [0],hwy::AlignedFreer> & __ptr64,gcpp::RuntimeConfig const & __ptr64,hwy::Span<int const > const & __ptr64,unsigned __int64 & __ptr64,gcpp::KVCache & __ptr64,hwy::ThreadPool & __ptr64,gcpp::TimingInfo & __ptr64>(gcpp::Model,std::unique_ptr<unsigned char [0],hwy::AlignedFreer> & __ptr64,gcpp::RuntimeConfig const & __ptr64,hwy::Span<int const > const & __ptr64,unsigned __int64 & __ptr64,gcpp::KVCache & __ptr64,hwy::ThreadPool & __ptr64,gcpp::TimingInfo & __ptr64) 0x000002528ca83dbf
gcpp::Gemma::Generate(gcpp::RuntimeConfig const & __ptr64,hwy::Span<int const > const & __ptr64,unsigned __int64,gcpp::KVCache & __ptr64,gcpp::TimingInfo & __ptr64) __ptr64 0x000002528ca8159a

I don't think it's a bug in gemma.cpp, since the CLI tool works fine. What could be causing this?

Sep 18 '24 15:09 NickTsaizer

Interesting :) Do I understand correctly that Gemma is being called from a C++ process that uses Unreal, and that the same prompt (including wrapping) works in the CLI? Looks like it's an AVX2-capable machine and we're unpacking bf16 to float32. That's pretty straightforward, hard to see why that should crash.

Is it possible the heap is corrupted? Does ASan detect anything? Are you initializing the shared state the same way that run.cc does?
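For reference, the bf16-to-float32 unpacking mentioned above really is trivial: a bf16 value is the upper 16 bits of an IEEE-754 float32, so widening is a 16-bit left shift. The scalar sketch below is not gemma.cpp's vectorized Decompress; the names BF16ToF32 and DecompressBF16 are hypothetical and only loosely mirror the (array, offset, output, count) shape visible in the stack trace. It illustrates why a crash at this step more likely points to an invalid source or destination pointer, or an offset/count outside the buffers, than to the arithmetic itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// A bf16 value holds the upper 16 bits of a float32, so widening is exact:
// shift into the high half and reinterpret the bits.
inline float BF16ToF32(std::uint16_t bits) {
  const std::uint32_t widened = static_cast<std::uint32_t>(bits) << 16;
  float f;
  std::memcpy(&f, &widened, sizeof(f));  // bit-cast without undefined behavior
  return f;
}

// Hypothetical scalar reference: decompresses `num` bf16 values starting at
// `ofs` into `out`. Returns false instead of reading out of bounds when
// [ofs, ofs + num) does not fit inside the `in_len`-element source array.
// The caller is still responsible for `out` having room for `num` floats.
inline bool DecompressBF16(const std::uint16_t* in, std::size_t in_len,
                           std::size_t ofs, float* out, std::size_t num) {
  if (ofs > in_len || num > in_len - ofs) return false;  // overflow-safe check
  for (std::size_t i = 0; i < num; ++i) out[i] = BF16ToF32(in[ofs + i]);
  return true;
}
```

Since the math cannot fault, a debugger stopped inside such a loop is usually worth inspecting for the pointer values and the offset/count arguments rather than the data.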

Sep 18 '24 16:09 jan-wassenberg

As far as I know, there is no way to use ASan with Unreal Engine on Windows, so I can't test this. The heap might be corrupted by the UE garbage collector, but I think I've done everything necessary to prevent that.

My initialization differs from run.cc, but only to compensate for the missing command-line arguments. I also use the UE logger instead of cout. The rest is identical because I copied the code from run.cc. Here is the modified code; maybe I made an obvious mistake and didn't notice it.

void UGemmaModel::Test()
{
	// Instantiate model and KV Cache
	hwy::ThreadPool pool(gcpp::AppArgs::GetSupportedThreadCount());
	gcpp::ModelInfo model_info(gcpp::Model::GEMMA_2B, gcpp::ModelTraining::GEMMA_PT, gcpp::Type::kSFP);
	gcpp::Gemma model = gcpp::Gemma(
	        gcpp::Path("D:\\Workspace\\trainhead\\Rubic\\Models\\Gemma\\tokenizer.spm"),
	        gcpp::Path("D:\\Workspace\\trainhead\\Rubic\\Models\\Gemma\\2b-it-sfp.sbs"), 
	        model_info, pool);
	
	gcpp::KVCache kv_cache =
		gcpp::KVCache::Create(model_info.model, size_t{64});
	size_t pos = 0;  // KV Cache position

	// Initialize random number generator
	std::mt19937 gen;
	std::random_device rd;
	gen.seed(rd());

	// Tokenize instructions.
	std::string prompt = "Write a greeting to the world.";
	const std::vector<int> tokens =
		gcpp::WrapAndTokenize(model.Tokenizer(), model_info, pos, prompt);
	size_t ntokens = tokens.size();

	// This callback function gets invoked every time a token is generated
	auto stream_token = [&pos, &ntokens, &model](int token, float) {
		++pos;
		if (pos < ntokens) {
			// print feedback
		} else if (token != gcpp::EOS_ID) {
			std::string token_text;
			HWY_ASSERT(model.Tokenizer().Decode({token}, &token_text));
			// UTF8_TO_TCHAR yields a TCHAR string, which matches %s (not %hs).
			UE_LOG(LogTemp, Warning, TEXT("%s"), UTF8_TO_TCHAR(token_text.c_str()));
		}
		return true;
	};

	gcpp::TimingInfo timing_info;
	gcpp::RuntimeConfig runtime_config = {
		.max_tokens = 1536,
		.max_generated_tokens = 1024,
		.temperature = 1.0,
		.verbosity = 0,
		.gen = &gen,
		.stream_token = stream_token,
	};
	model.Generate(runtime_config, tokens, 0, kv_cache, timing_info);
}

Anyway, thanks for the reply. Since this isn't a common crash, I must have made a mistake somewhere else. I'll keep digging into my code and report back on what was wrong.

Sep 19 '24 09:09 NickTsaizer

Hm, one thing I notice: you want ModelTraining::GEMMA_IT instead of ModelTraining::GEMMA_PT to match the model you've got (2b-it-sfp.sbs). But that is not going to cause a crash.
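A mismatch like this can be caught mechanically. A minimal sketch, assuming weight files follow the naming seen in this thread (2b-it-sfp.sbs, where -it marks instruction-tuned weights); Training and TrainingFromWeightsName are hypothetical names, not gemma.cpp API, and Training only stands in for the two gcpp::ModelTraining values discussed here.

```cpp
#include <string>

// Hypothetical stand-in for gcpp::ModelTraining; only the two values
// discussed in this thread are modeled.
enum class Training { kPT, kIT };

// Assumption: "-it" in the weights file name marks instruction-tuned weights,
// as in "2b-it-sfp.sbs". Only the file name is inspected, so directory names
// that happen to contain "-it" do not affect the result.
inline Training TrainingFromWeightsName(const std::string& path) {
  const std::string::size_type slash = path.find_last_of("/\\");
  const std::string name =
      slash == std::string::npos ? path : path.substr(slash + 1);
  return name.find("-it") != std::string::npos ? Training::kIT : Training::kPT;
}
```

The result could be compared with the training type passed to gcpp::ModelInfo before constructing the model, so a PT/IT mix-up is flagged up front instead of only showing up as poor output.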

For Windows heap checking, maybe BoundsChecker or GFlags?

It might also help to add InferenceArgs().CopyTo(runtime_config); after building runtime_config, to match our run.cc.
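One reason such a CopyTo call can matter: with designated initializers, every field you omit takes its default member initializer, or is value-initialized (zero for scalars) if it has none. The sketch below uses hypothetical ConfigLike and ArgsLike structs; their field names and default values are invented for illustration and do not describe the real gcpp::RuntimeConfig or InferenceArgs. It shows how an args object can fill in fields that a hand-written initializer silently left at zero. Designated initializers are standard from C++20.

```cpp
#include <cstddef>

// Hypothetical config; not the real gcpp::RuntimeConfig layout.
struct ConfigLike {
  std::size_t max_tokens;          // no default member initializer
  std::size_t prefill_batch_size;  // no default member initializer
  float temperature = 1.0f;        // has a default member initializer
  int verbosity = 1;
};

// Hypothetical args object holding sensible defaults, in the spirit of
// InferenceArgs::CopyTo.
struct ArgsLike {
  std::size_t max_tokens = 3072;
  std::size_t prefill_batch_size = 64;

  // Overwrites only the fields this object owns; other fields keep
  // whatever the caller already set.
  void CopyTo(ConfigLike& config) const {
    config.max_tokens = max_tokens;
    config.prefill_batch_size = prefill_batch_size;
  }
};
```

For example, after ConfigLike config = {.verbosity = 0}; both max_tokens and prefill_batch_size are 0 while temperature is 1.0f; calling ArgsLike().CopyTo(config) then sets them to 3072 and 64 and leaves verbosity at 0.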

Sep 19 '24 11:09 jan-wassenberg

Hi @NickTsaizer,

Could you please confirm whether this issue is resolved by the above comment? Please feel free to close the issue if it is.

Thank you.

Oct 16 '24 06:10 Gopi-Uppari

Hi! Sorry for the late response. No, it's not resolved, but it's not a problem on your side, so I'm closing the issue. Thank you for the advice.

Oct 16 '24 06:10 NickTsaizer