Top2Vec
Memory and Performance of Top2Vec
I profiled the memory use and run time of Top2Vec on a collection of the COVID papers, e.g. metadata.csv:
1015768 rows × 19 columns
If you modify this line of code in filter_short() ... you get big memory savings:
    # papers_df["token_counts"] = papers_df["text"].str.split().map(len)
    papers_df["token_counts"] = papers_df["text"].map(lambda s: len(s.split()))
Here's a discussion:
https://github.com/pythonspeed/filprofiler/issues/369
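One plausible reason for the saving is that `.str.split()` builds a token list for every row and keeps them all alive until `.map(len)` runs, while the lambda version builds and discards one list at a time. The sketch below reproduces that difference with plain Python lists and `tracemalloc` rather than pandas. The helper names (`peak_bytes`, `counts_materialized`, `counts_streaming`) and the sample text are made up for illustration, and the numbers it prints will not match what a profiler reports for the real DataFrame.

```python
import tracemalloc


def peak_bytes(fn, *args):
    """Run fn(*args) and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    try:
        result = fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def counts_materialized(texts):
    # Rough analogue of .str.split().map(len):
    # every token list exists at the same time before any length is taken.
    token_lists = [t.split() for t in texts]
    return [len(tokens) for tokens in token_lists]


def counts_streaming(texts):
    # Rough analogue of .map(lambda s: len(s.split())):
    # each token list is dropped as soon as its length has been taken.
    return [len(t.split()) for t in texts]


# Hypothetical stand-in for the "text" column: 500 documents of 450 tokens each.
texts = ["the quick brown fox jumps over the lazy dog " * 50 for _ in range(500)]

counts_a, peak_a = peak_bytes(counts_materialized, texts)
counts_b, peak_b = peak_bytes(counts_streaming, texts)

print("same counts:", counts_a == counts_b)
print("peak bytes, materialized:", peak_a)
print("peak bytes, streaming:   ", peak_b)
```

Both functions return the same token counts; only the peak allocation differs, which is the effect the one-line change is relying on.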
Here's the performance profile from cProfile, e.g.:
$ python -m cProfile -s cumtime notebooks/test.py
1553599000 function calls (1550046781 primitive calls) in 808.029 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
7473/1 0.084 0.000 808.080 808.080 {built-in method builtins.exec}
1 3.063 3.063 808.080 808.080 test.py:1(<module>)
1 0.118 0.118 717.212 717.212 Top2Vec.py:369(__init__)
1 0.249 0.249 332.328 332.328 Top2Vec.py:820(_embed_documents)
7328/6776 0.019 0.000 326.557 0.048 traceback_utils.py:138(error_handler)
6410 0.125 0.000 326.303 0.051 def_function.py:887(__call__)
6410 0.153 0.000 325.936 0.051 def_function.py:934(_call)
6409 0.015 0.000 324.596 0.051 load.py:743(_call_attribute)
6411 0.032 0.000 323.104 0.050 execute.py:29(quick_execute)
6411 323.064 0.050 323.064 0.050 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_Execute}
6412 0.086 0.000 322.956 0.050 function.py:1782(_call_flat)
6408 0.063 0.000 322.646 0.050 function.py:2951(__call__)
6412 0.210 0.000 322.644 0.050 function.py:454(call)
1 0.017 0.017 289.578 289.578 Top2Vec.py:573(<listcomp>)
31103 0.031 0.000 289.560 0.009 Top2Vec.py:71(default_tokenizer)
31103 0.038 0.000 287.631 0.009 utils.py:287(simple_preprocess)
31103 0.048 0.000 179.592 0.006 utils.py:226(tokenize)
31103 1.544 0.000 178.418 0.006 utils.py:173(deaccent)
1268125/1266762 38.636 0.000 176.398 0.000 {method 'join' of 'str' objects}
564544074 90.115 0.000 137.545 0.000 utils.py:199(<genexpr>)
31103 25.852 0.001 108.002 0.003 utils.py:309(<listcomp>)
85562598 58.641 0.000 68.938 0.000 utils.py:269(simple_tokenize)
1 0.005 0.005 60.718 60.718 test.py:14(filter_short)
1 0.045 0.045 60.670 60.670 series.py:4086(map)
1 0.000 0.000 60.625 60.625 base.py:787(_map_values)
1 1.148 1.148 60.385 60.385 {pandas._libs.lib.map_infer}
3775007 7.763 0.000 59.236 0.000 test.py:16(<lambda>)
1 0.083 0.083 59.007 59.007 umap_.py:2249(fit)
3837331 51.258 0.000 51.258 0.000 {method 'split' of 'str' objects}
564700691 47.430 0.000 47.430 0.000 {built-in method unicodedata.category}
261/46 0.005 0.000 36.519 0.794 dispatcher.py:929(compile)
149/6 0.000 0.000 36.328 6.055 dispatcher.py:140(compile)
149/6 0.000 0.000 36.328 6.055 dispatcher.py:147(_compile_cached)
149/6 0.012 0.000 36.327 6.055 dispatcher.py:162(_compile_core)
149/6 0.000 0.000 36.326 6.054 compiler.py:660(compile_extra)
4073/149 0.008 0.000 36.228 0.243 compiler_lock.py:32(_acquire_compile_lock)
162/6 0.001 0.000 36.145 6.024 compiler.py:422(compile_extra)
162/6 0.001 0.000 36.141 6.024 compiler.py:487(_compile_bytecode)
169/6 0.002 0.000 36.141 6.024 compiler.py:445(_compile_core)
169/6 0.010 0.000 36.139 6.023 compiler_machinery.py:320(run)
4064/148 0.041 0.000 36.136 0.244 compiler_machinery.py:257(_runPass)
12192/444 0.012 0.000 36.129 0.081 compiler_machinery.py:261(check)
5 0.000 0.000 35.888 7.178 dispatcher.py:402(_compile_for_args)
1 0.000 0.000 35.609 35.609 umap_.py:254(nearest_neighbors)
1 0.000 0.000 35.572 35.572 pynndescent_.py:639(__init__)
169/21 0.003 0.000 30.614 1.458 typed_passes.py:98(run_pass)
169/21 0.004 0.000 30.614 1.458 typed_passes.py:61(type_inference_stage)
169/21 0.001 0.000 30.565 1.455 typeinfer.py:1055(propagate)
345/43 0.035 0.000 30.564 0.711 typeinfer.py:141(propagate)
9167/1171 0.053 0.000 30.448 0.026 typeinfer.py:568(resolve)
9711/1247 0.006 0.000 30.418 0.024 context.py:188(resolve_function_type)
9178/1171 0.010 0.000 30.414 0.026 typeinfer.py:1503(resolve_call)
9711/1247 0.032 0.000 30.412 0.024 context.py:232(_resolve_user_function_type)
2632/310 0.012 0.000 30.130 0.097 typeinfer.py:558(__call__)
130/59 0.000 0.000 30.059 0.509 functions.py:548(get_call_type)
130/59 0.001 0.000 30.057 0.509 dispatcher.py:362(get_call_template)
531 0.009 0.000 28.887 0.054 __init__.py:1(<module>)
169/144 0.010 0.000 24.016 0.167 typed_passes.py:365(run_pass)
50 0.000 0.000 23.570 0.471 caching.py:636(load_overload)
50 0.000 0.000 23.560 0.471 caching.py:647(_load_overload)
55 0.001 0.000 23.403 0.426 caching.py:505(_load_index)
50 0.000 0.000 23.401 0.468 caching.py:491(load)
724/104 0.033 0.000 23.377 0.225 {built-in method _pickle.loads}
620 0.001 0.000 23.307 0.038 serialize.py:133(custom_rebuild)
620 0.001 0.000 23.306 0.038 dispatcher.py:909(_rebuild)
1 0.004 0.004 19.895 19.895 umap_.py:2713(_fit_embed_data)
1 0.005 0.005 19.891 19.891 umap_.py:923(simplicial_set_embedding)
1 0.006 0.006 18.418 18.418 layouts.py:220(optimize_layout_euclidean)
89 0.005 0.000 17.159 0.193 __init__.py:15(<module>)
1 0.117 0.117 17.037 17.037 text.py:1288(fit_transform)
200 16.460 0.082 16.460 0.082 layouts.py:61(_optimize_layout_euclidean_single_epoch)
1 0.000 0.000 15.456 15.456 feather_format.py:90(read_feather)
1 0.356 0.356 15.446 15.446 feather.py:197(read_feather)
169/144 0.002 0.000 15.101 0.105 lowering.py:133(lower)
1 11.966 11.966 15.059 15.059 text.py:1192(_count_vocab)
169/144 0.001 0.000 13.495 0.094 lowering.py:184(lower_normal_function)
169/144 0.004 0.000 13.378 0.093 lowering.py:201(lower_function_body)
1785/1597 0.053 0.000 13.364 0.008 lowering.py:225(lower_block)
25062/22814 0.085 0.000 13.172 0.001 lowering.py:374(lower_inst)
10729/3978 0.076 0.000 11.717 0.003 <frozen importlib._bootstrap>:986(_find_and_load)
6623/10 0.020 0.000 11.652 1.165 <frozen importlib._bootstrap>:956(_find_and_load_unlocked)
6403/10 0.023 0.000 11.646 1.165 <frozen importlib._bootstrap>:650(_load_unlocked)
9753/12 0.004 0.000 11.644 0.970 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
6009/8 0.013 0.000 11.637 1.455 <frozen importlib._bootstrap_external>:837(exec_module)
167616/167273 4.247 0.000 11.151 0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
85570153 10.280 0.000 10.280 0.000 {method 'group' of 're.Match' objects}
1 0.000 0.000 9.937 9.937 Top2Vec.py:4(<module>)
3954/850 0.008 0.000 9.862 0.012 {built-in method builtins.__import__}
10963/10509 0.019