Processing fst text lines

armusc opened this issue 3 years ago • 16 comments

Hi

I didn't have this problem before when I installed k2 with conda. I have recently cloned and compiled directly from source, and I now have this problem when reading an FST (created by kaldilm):

k2/build_release_cpu_torch_cpu/k2/csrc/fsa_utils.cc:295:void k2::OpenFstStreamReader::ProcessLine(std::string&) Invalid line: 5 0 4 99458 0, eof=true, fail=true, src_state=5, dest_state=0

It looks to me like the absence of a cost field in the line causes this issue (i.e. fail=true). If I add a 0.0 as a 5th field, it does not happen.

suggestions?

armusc avatar Sep 11 '22 17:09 armusc
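For reference, kaldilm writes G in OpenFst's text format, where each arc line is

src_state dest_state ilabel olabel [weight]

and the final weight field is optional (a missing weight is read as the semiring One, i.e. 0.0 in the tropical semiring), so a reader of this format has to accept arc lines both with and without it, e.g. (made-up arc):

3 7 42 42 1.5
3 7 42 42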

Are you using the latest master?

csukuangfj avatar Sep 12 '22 01:09 csukuangfj

right, after a merge it worked. Thanks.

By the way, this was when trying to save an ARPA LM for use in LM rescoring. The unpruned ARPA is 6 GB and it causes a segmentation fault when saving with torch.save(G.as_dict(), f"{args.lm_dir}/G_4_gram_asdict.new.pt")

If I prune it down to 1 GB, I have no issues. I understand it's not k2 itself, but maybe you are aware of this issue? Is it something expected with LMs of that size?

armusc avatar Sep 12 '22 07:09 armusc

Can you show some debug info for the segmentation fault?

danpovey avatar Sep 12 '22 09:09 danpovey

Contrary to what I said, the segmentation fault is caused by the call to G.as_dict() rather than by torch.save.

I'm not sure if it helps, but I ran the Python script with gdb:

Thread 1 "python" received signal SIGSEGV, Segmentation fault. __memmove_sse2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:500 500 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.

armusc avatar Sep 12 '22 10:09 armusc

gdb --args python /path/to/xxx.py
(gdb) catch throw
(gdb) run
# When it segfaults
(gdb) backtrace

Please show the backtrace.

csukuangfj avatar Sep 13 '22 03:09 csukuangfj

#0  __memmove_sse2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:500
#1  0x00007fff92b8cfcf in k2::Array1 k2::Cat(std::shared_ptr<k2::Context>, int, k2::Array1 const**) () from /home/amuscariello/mediaspeech/k2/build_release_cpu_torch_cpu/lib/libk2context.so
#2  0x00007fff92b85471 in k2::FsaVecToTensor(k2::Ragged<k2::Arc> const&) () from /home/amuscariello/mediaspeech/k2/build_release_cpu_torch_cpu/lib/libk2context.so
#3  0x00007fff92ea21bb in ?? () from /home/amuscariello/mediaspeech/k2/build_debug_cpu_torch_cpu/lib/_k2.cpython-38-x86_64-linux-gnu.so
#4  0x00007fff92ec6d85 in ?? () from /home/amuscariello/mediaspeech/k2/build_debug_cpu_torch_cpu/lib/_k2.cpython-38-x86_64-linux-gnu.so
#5  0x000055555568ff8e in cfunction_call_varargs (kwargs=0x0, args=0x7ffff7963400, func=0x7fff92f48590) at /usr/local/src/conda/python-3.8.13/Objects/call.c:743
#6  PyCFunction_Call (func=0x7fff92f48590, args=0x7ffff7963400, kwargs=0x0) at /usr/local/src/conda/python-3.8.13/Objects/call.c:773
#7  0x0000555555678651 in _PyObject_MakeTpCall (callable=0x7fff92f48590, args=, nargs=, keywords=) at /usr/local/src/conda/python-3.8.13/Python/errors.c:219
#8  0x0000555555674471 in _PyObject_Vectorcall (kwnames=0x0, nargsf=, args=0x7fff8b097fd8, callable=0x7fff92f48590) at /usr/local/src/conda/python-3.8.13/Include/cpython/abstract.h:125
#9  _PyObject_Vectorcall (kwnames=0x0, nargsf=, args=0x7fff8b097fd8, callable=0x7fff92f48590) at /usr/local/src/conda/python-3.8.13/Include/cpython/abstract.h:115
#10 call_function (kwnames=0x0, oparg=, pp_stack=, tstate=0x5555558e64a0) at /usr/local/src/conda/python-3.8.13/Python/ceval.c:4963
#11 _PyEval_EvalFrameDefault (f=, throwflag=) at /usr/local/src/conda/python-3.8.13/Python/ceval.c:3469
#12 0x000055555568f886 in PyEval_EvalFrameEx (throwflag=0, f=0x7fff8b097e40) at /usr/local/src/conda/python-3.8.13/Python/ceval.c:738
#13 function_code_fastcall (globals=, nargs=, args=, co=) at /usr/local/src/conda/python-3.8.13/Objects/call.c:284
#14 _PyFunction_Vectorcall (kwnames=, nargsf=, stack=0x555557ca93b8, func=0x7fff8e377b80) at /usr/local/src/conda/python-3.8.13/Objects/call.c:411
#15 _PyObject_Vectorcall (kwnames=, nargsf=, args=0x555557ca93b8, callable=0x7fff8e377b80) at /usr/local/src/conda/python-3.8.13/Include/cpython/abstract.h:127
#16 method_vectorcall (method=, args=0x555557ca93c0, nargsf=, kwnames=) at /usr/local/src/conda/python-3.8.13/Objects/classobject.c:60

does that help?

armusc avatar Sep 13 '22 18:09 armusc

does that help?

Thanks!

Could you build a debug version of k2 and show the output of

(gdb) frame 1
(gdb) list

csukuangfj avatar Sep 14 '22 02:09 csukuangfj

It calls Cat on 4 arrays, including the arcs linearized so that each arc is 4 int32_t's. The size of that could definitely overflow int32_t if the number of arcs were more than about 2**(32 - 3) [-1 because it's signed, -2 because of the factor of 4]. I can't see an easy way to fix that without breaking older formats or introducing redundant formats.

danpovey avatar Sep 14 '22 03:09 danpovey
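Rough arithmetic behind that bound, with an arc taken as four 32-bit fields as described above:

elements to concatenate ≈ 4 * num_arcs (int32_t values)
4 * num_arcs > 2^31 - 1 (INT32_MAX)  once  num_arcs > 2^29 ≈ 5.4 * 10^8

So a G with more than roughly half a billion arcs would already push the element count past what a signed 32-bit index can represent.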

.. I do see a problem though: at array_ops_inl.h:349, int32_t elem_size = src[0]->ElementSize(); this should be int64_t, so that when we multiply it by the dim the product doesn't overflow.

danpovey avatar Sep 14 '22 03:09 danpovey

(gdb) frame 1
#1  0x00007fff928de83d in k2::Cat (c=..., num_arrays=4, src=0x7fffffffd020) at /home/amuscariello/mediaspeech/k2/k2/csrc/array_ops_inl.h:353
353         memcpy(static_cast<void *>(ans_data),
(gdb) list
348         // CPU.
349         int32_t elem_size = src[0]->ElementSize();
350         for (int32_t i = 0; i < num_arrays; ++i) {
351           int32_t this_dim = src[i]->Dim();
352           const T *this_src_data = src[i]->Data();
353           memcpy(static_cast<void *>(ans_data),
354                  static_cast<const void *>(this_src_data), elem_size * this_dim);
355           ans_data += this_dim;
356         }
357       } else {
(gdb)

armusc avatar Sep 14 '22 09:09 armusc

Replacing int32_t with int64_t has indeed solved the problem in my case (6 GB 4-gram FST).

armusc avatar Sep 14 '22 10:09 armusc
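Concretely, in terms of the gdb listing above, the change that was tried amounts to something like this sketch (the shape of the fix, not necessarily an exact patch):

  // Widen elem_size so that the byte count passed to memcpy is computed
  // in 64-bit arithmetic and cannot wrap for arrays larger than 2 GB.
  int64_t elem_size = src[0]->ElementSize();  // was int32_t
  for (int32_t i = 0; i < num_arrays; ++i) {
    int32_t this_dim = src[i]->Dim();
    const T *this_src_data = src[i]->Data();
    memcpy(static_cast<void *>(ans_data),
           static_cast<const void *>(this_src_data),
           elem_size * this_dim);  // int64_t * int32_t -> int64_t
    ans_data += this_dim;
  }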

(gdb) print elem_size
(gdb) print this_dim
(gdb) print elem_size * this_dim

to see whether elem_size * this_dim overflows.

csukuangfj avatar Sep 14 '22 10:09 csukuangfj

So anything bigger than 4 GB would fail? A 4-gram LM of that size is probably something that can be pruned, but I have seen big HLGs.

armusc avatar Sep 14 '22 10:09 armusc

Not sure if it's related, but I think the Kaldi arpa2fst code is smart and uses a different data type depending on how big the LM is. I could imagine this causing issues somewhere, where the graph would be processed by another tool that doesn't know about this. y.

jtrmal avatar Sep 14 '22 10:09 jtrmal

(gdb) print elem_size
(gdb) print this_dim
(gdb) print elem_size * this_dim

to see whether elem_size * this_dim overflows.

(gdb) print elem_size
$2 = 4
(gdb) print this_dim
$3 = 771885848
(gdb) print elem_size * this_dim
$4 = -1207423904
(gdb)

armusc avatar Sep 14 '22 10:09 armusc
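Those values confirm the 32-bit wraparound:

4 * 771885848 = 3087543392 bytes
3087543392 > INT32_MAX (2147483647)
3087543392 - 2^32 = -1207423904

which is exactly the value gdb printed; a negative count converted to size_t for memcpy becomes a huge length, which would explain the segfault.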

@armusc can you please make a PR?

danpovey avatar Sep 14 '22 13:09 danpovey