Piotr Wilkin (ilintar)
Update: we have output! My 500M version is producing very nice outputs already:

```console
user
Let's go!

assistant
Javier斫 fond𬸚עמק(cursorStick面對 Cunningham.semgetNumjest茶叶ador Ce serão_BG Delete Regular.LoadScene anchppelin.win้ม indexing een닙)object עצמו markedbaby干部继承所能...
```
@theo77186 Nah, I wouldn't expect the first version that actually produces output to produce *correct* output; that would be a miracle :) Now comes the part of comparing intermediate results...
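For anyone following along, a rough sketch of what "comparing intermediate results" can look like on the reference side: hook the HF model, dump per-layer activation stats, and diff them against tensors logged by the port. The model path and the module-name filter are placeholders, not the actual debug setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/qwen3-next-500m"  # placeholder checkpoint path
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.eval()

acts = {}

def make_hook(name):
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        acts[name] = out.detach().float().cpu()
    return hook

# Module-name suffixes are guesses; adjust to the actual architecture.
for name, module in model.named_modules():
    if name.endswith(("input_layernorm", "self_attn", "linear_attn", "mlp")):
        module.register_forward_hook(make_hook(name))

with torch.no_grad():
    model(**tok("Let's go!", return_tensors="pt"))

# Summary stats are usually enough to spot which layer first diverges.
for name, t in acts.items():
    print(f"{name}: shape={tuple(t.shape)} mean={t.mean():.6f} std={t.std():.6f}")
```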
@theo77186 added the exclusion of MTP layers from conversion
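A minimal sketch of what the MTP exclusion could look like, assuming a convert_hf_to_gguf.py-style flow where every checkpoint tensor passes through a `modify_tensors()` hook. The `"mtp"` name pattern is an assumption about how the multi-token-prediction tensors are named.

```python
from typing import Iterable

import torch


def modify_tensors(name: str, data_torch: torch.Tensor) -> Iterable[tuple[str, torch.Tensor]]:
    """Drop MTP tensors; pass everything else through (placeholder mapping)."""
    if "mtp" in name:
        return []  # the llama.cpp graph has no MTP head, so these are skipped
    return [(name, data_torch)]


# A tensor like "model.mtp.fc.weight" would be filtered out:
assert list(modify_tensors("model.mtp.fc.weight", torch.zeros(1))) == []
```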
Argh, it doesn't use the standard RMS norm either:

```python
class Qwen3NextRMSNormGated(nn.Module):
    def __init__(self, hidden_size, eps=1e-6, **kwargs):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states, gate=None):
        input_dtype =...
```
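For reference, a hedged sketch of how the truncated `forward()` typically continues in FLA-style gated norms: fp32 RMS normalization plus a SiLU gate. Implementations differ on whether the gate is applied before or after normalization, so this is an approximation, not a copy of the upstream code.

```python
import torch
import torch.nn.functional as F


def gated_rms_norm(hidden_states: torch.Tensor, weight: torch.Tensor,
                   eps: float, gate: torch.Tensor | None = None) -> torch.Tensor:
    input_dtype = hidden_states.dtype
    hidden_states = hidden_states.to(torch.float32)
    variance = hidden_states.pow(2).mean(-1, keepdim=True)
    hidden_states = hidden_states * torch.rsqrt(variance + eps)  # RMS normalize
    if gate is not None:
        # Gate order (before vs. after the norm) varies; check the original.
        hidden_states = hidden_states * F.silu(gate.to(torch.float32))
    return (weight * hidden_states).to(input_dtype)
```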
> glad im not an ai engineer

Neither am I :laughing:
Now that's a new one I haven't seen before :) I'll probably resume tomorrow, my brain is a bit fried.
> For some reason, for the 70M model, `conv_states` is 50% larger than expected, will try to see what's going on.

Just for reference, I can't make your 70M model...
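A hypothetical back-of-the-envelope check for the `conv_states` question: compute the expected per-sequence conv cache size from the config and compare it with the allocation. The field names follow the Qwen3-Next config as I understand it, and the `kernel - 1` factor is an assumption; some implementations cache the full kernel width, which alone changes the size by `kernel / (kernel - 1)`.

```python
def expected_conv_state_elems(cfg: dict) -> int:
    key_dim = cfg["linear_key_head_dim"] * cfg["linear_num_key_heads"]
    value_dim = cfg["linear_value_head_dim"] * cfg["linear_num_value_heads"]
    conv_dim = 2 * key_dim + value_dim  # causal conv runs over concat(q, k, v)
    return conv_dim * (cfg["linear_conv_kernel_dim"] - 1)


# Placeholder numbers, not the real 70M config:
cfg = {
    "linear_key_head_dim": 64, "linear_num_key_heads": 4,
    "linear_value_head_dim": 64, "linear_num_value_heads": 8,
    "linear_conv_kernel_dim": 4,
}
print("expected conv_states elements per sequence:", expected_conv_state_elems(cfg))
```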
TTS looks reasonable: whisper.cpp has whisper-server, so we can run the Whisper model from there. Also, llama.cpp has support for some TTS models, though not through the server endpoint....
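A quick sketch of the whisper-server route: POST the audio to its `/inference` endpoint and read back the transcription. The endpoint and form fields follow whisper.cpp's server example; the host, port, and audio path are placeholders.

```python
import requests


def transcribe(wav_path: str, base_url: str = "http://127.0.0.1:8080") -> str:
    with open(wav_path, "rb") as f:
        resp = requests.post(
            f"{base_url}/inference",
            files={"file": f},
            data={"temperature": "0.0", "response_format": "json"},
            timeout=120,
        )
    resp.raise_for_status()
    return resp.json()["text"]


print(transcribe("sample.wav"))  # placeholder audio file
```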
Any reason why you want to stop the server immediately? I'd see it more like another instance: start a whisper server on demand if needed, stop it if any...
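A sketch of the on-demand lifecycle suggested above: lazily launch whisper-server the first time it is needed and keep it alive across requests, stopping it only at shutdown. The binary path, model path, and readiness probe are placeholders, not a settled design.

```python
import atexit
import subprocess
import time
import urllib.request


class OnDemandWhisperServer:
    def __init__(self, binary: str = "./whisper-server",
                 model: str = "models/ggml-base.en.bin", port: int = 8080):
        self.cmd = [binary, "-m", model, "--port", str(port)]
        self.url = f"http://127.0.0.1:{port}/"
        self.proc: subprocess.Popen | None = None

    def ensure_running(self) -> None:
        if self.proc is not None and self.proc.poll() is None:
            return  # already up, reuse the running instance
        self.proc = subprocess.Popen(self.cmd)
        atexit.register(self.stop)
        for _ in range(50):  # poll until the HTTP endpoint answers
            try:
                urllib.request.urlopen(self.url, timeout=1)
                return
            except OSError:
                time.sleep(0.2)
        raise RuntimeError("whisper-server did not come up in time")

    def stop(self) -> None:
        if self.proc is not None and self.proc.poll() is None:
            self.proc.terminate()
            self.proc.wait(timeout=10)
```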
Cool! Doing a refactoring now to fix the thread launching logic, I'll try to merge when I'm done with that.