k2
Processing fst text lines
Hi
I didn't have this problem before, when I installed k2 with conda. I recently cloned and compiled it directly from source, and now I get this error when reading an FST (created by kaldilm):
k2/build_release_cpu_torch_cpu/k2/csrc/fsa_utils.cc:295:void k2::OpenFstStreamReader::ProcessLine(std::string&) Invalid line: 5 0 4 99458 0, eof=true, fail=true, src_state=5, dest_state=0
It looks to me as if the absence of a cost field in the line causes this issue (i.e. fail=true). If I add 0.0 as a fifth field, the error does not occur.
Any suggestions?
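For illustration, here is a minimal Python sketch of a tolerant parser for one transducer arc line in OpenFst text format. It is not k2's reader; the names Arc and parse_transducer_arc are hypothetical, and it assumes the usual convention that an omitted cost means 0.0.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    """One transducer arc: src dest ilabel olabel [cost]."""
    src: int
    dest: int
    ilabel: int
    olabel: int
    cost: float


def parse_transducer_arc(line: str, default_cost: float = 0.0) -> Arc:
    """Parse an arc line, accepting both the 4-field and 5-field forms.

    Assumption: a missing fifth field is treated as default_cost (0.0).
    """
    fields = line.split()
    if len(fields) not in (4, 5):
        raise ValueError(f"Invalid arc line: {line!r}")
    src, dest, ilabel, olabel = (int(f) for f in fields[:4])
    cost = float(fields[4]) if len(fields) == 5 else default_cost
    return Arc(src, dest, ilabel, olabel, cost)
```

With this sketch, a line with the cost and a line without it parse to the same arc, which is the behaviour I expected from the reader.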
Are you using the latest master?
right, after a merge it worked. Thanks.
By the way, this happened while trying to save an ARPA LM for use in LM rescoring. The unpruned ARPA file is 6 GB, and it causes a segmentation fault when saving with torch: torch.save(G.as_dict(), f"{args.lm_dir}/G_4_gram_asdict.new.pt")
If I prune it down to 1 GB, I have no issues. I understand this is not k2, but maybe you are aware of this issue? Is it expected with LMs of that size?
Can you show some debug info for the segmentation fault?
contrary to what I said, the segmentation fault is caused by the call to
G.as_dict() rather than to torch.save
I'm not sure if it helps, but I ran the Python script with gdb:
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
__memmove_sse2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:500
500 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
gdb --args python /path/to/xxx.py
(gdb) catch throw
(gdb) run
# When it segfaults
(gdb) backtrace
Please show the backtrace.
#0 __memmove_sse2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:500
#1 0x00007fff92b8cfcf in k2::Array1
does that help?
Thanks!
Could you build a debug version of k2 and show the output of:
(gdb) frame 1
(gdb) list
It calls Cat on 4 arrays, including the arcs, linearized so that each arc is 4 int32_t's. The size of that could definitely overflow int32_t if the number of arcs were more than about 2**(32 - 3) [-1 because it's signed, -2 because of the factor of 4]. I can't see an easy way to fix that without breaking older formats or introducing redundant formats.
.. I do see a problem though, at array_ops_in.h:349,
int32_t elem_size = src[0]->ElementSize();
this should be int64_t, so that when we multiply by the size it doesn't overflow.
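To make the two limits concrete, here is a small Python sketch of the arithmetic. The constants and function names are hypothetical; it assumes 4 int32_t values per arc and 4 bytes per int32_t, as described above, and a signed 32-bit maximum of 2**31 - 1.

```python
INT32_MAX = 2**31 - 1   # largest value a signed 32-bit integer can hold
INTS_PER_ARC = 4        # each arc is linearized to 4 int32_t's
BYTES_PER_INT32 = 4     # ElementSize() of an int32_t array


def max_arcs_for_int32_dim() -> int:
    """Most arcs whose flattened dimension (4 * num_arcs) fits in int32_t."""
    return INT32_MAX // INTS_PER_ARC


def max_arcs_for_int32_byte_count() -> int:
    """Most arcs whose byte count (4 * 4 * num_arcs) fits in int32_t."""
    return INT32_MAX // (INTS_PER_ARC * BYTES_PER_INT32)


print(max_arcs_for_int32_dim())         # 536870911, just under 2**29
print(max_arcs_for_int32_byte_count())  # 134217727, just under 2**27
```

So a 32-bit byte count overflows at roughly a quarter of the arc count at which the dimension itself would overflow. Widening elem_size to int64_t removes the earlier limit without touching the on-disk format.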
(gdb) frame 1
#1 0x00007fff928de83d in k2::Cat
Replacing int32_t with int64_t has indeed solved the problem in my case (6 GB 4-gram FST).
(gdb) print elem_size
(gdb) print this_dim
(gdb) print elem_size * this_dim
to see whether elem_size * this_dim overflows.
So anything bigger than 4 GB would fail? A 4-gram LM of that size can probably be pruned, but I have seen big HLG graphs.
Not sure if it's related, but I think the Kaldi arpa2fst code is smart and uses different data types depending on how big the LM is. I could imagine this causing issues wherever the graph is processed by another tool that does not know about this.
(gdb) print elem_size
$2 = 4
(gdb) print this_dim
$3 = 771885848
(gdb) print elem_size * this_dim
$4 = -1207423904
(gdb)
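The negative product can be reproduced outside gdb. The Python sketch below models two's-complement wraparound of a 32-bit signed multiply; wrap_int32 is a hypothetical helper, and in C++ signed overflow is formally undefined behaviour, so this only mirrors what gdb printed here.

```python
def wrap_int32(value: int) -> int:
    """Reduce an integer to the signed 32-bit range, two's-complement style."""
    value &= 0xFFFFFFFF                      # keep the low 32 bits
    return value - 2**32 if value >= 2**31 else value


elem_size = 4          # value printed by gdb as $2
this_dim = 771885848   # value printed by gdb as $3

wrapped = wrap_int32(elem_size * this_dim)   # what a 32-bit multiply yields
exact = elem_size * this_dim                 # what a 64-bit multiply yields

print(wrapped)  # -1207423904, matching $4
print(exact)    # 3087543392, about 2.9 GiB
```

The exact byte count is above 2**31 - 1, so it cannot be represented in int32_t, and the negative length is what ends up being passed to the copy.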
@armusc can you please make a PR?