Memory and Performance of Top2vec

Open dbl001 opened this issue 3 years ago • 0 comments
I did some memory and performance analysis of Top2Vec on a collection of the Covid papers, e.g. metadata.csv (1015768 rows × 19 columns).
If you change this line of code in filter_shorts() from the commented-out version to the one below it, you get big memory savings:

    # papers_df["token_counts"] = papers_df["text"].str.split().map(len)
    papers_df["token_counts"] = papers_df["text"].map(lambda s: len(s.split()))
Here's a discussion:
https://github.com/pythonspeed/filprofiler/issues/369
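
A likely reason for the savings is that `.str.split()` first builds a token list for every row and keeps them all alive until `.map(len)` has run, while the lambda version builds and discards one list at a time. The sketch below is a pure-Python analogue of that difference, not the pandas code path itself; `count_eager`, `count_lazy`, `peak_bytes`, and the synthetic `texts` are illustrative names and data, not part of the original script.

```python
import tracemalloc


def peak_bytes(fn, texts):
    """Run fn(texts) and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    try:
        result = fn(texts)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def count_eager(texts):
    # Analogue of .str.split().map(len): every token list exists at once.
    token_lists = [s.split() for s in texts]
    return [len(tokens) for tokens in token_lists]


def count_lazy(texts):
    # Analogue of .map(lambda s: len(s.split())): one token list at a time.
    return [len(s.split()) for s in texts]


# Synthetic documents (assumption): 2000 texts of 200 tokens each.
texts = [" ".join(["token"] * 200) for _ in range(2000)]

eager_counts, eager_peak = peak_bytes(count_eager, texts)
lazy_counts, lazy_peak = peak_bytes(count_lazy, texts)

print(f"eager peak: {eager_peak / 1e6:.1f} MB")
print(f"lazy peak:  {lazy_peak / 1e6:.1f} MB")
```

Both functions return the same counts; only the peak memory differs, because the eager version holds all intermediate lists simultaneously.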
Here's the performance profile from cProfile, e.g.:

    $ python -m cProfile -s cumtime notebooks/test.py

         1553599000 function calls (1550046781 primitive calls) in 808.029 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   7473/1    0.084    0.000  808.080  808.080 {built-in method builtins.exec}
        1    3.063    3.063  808.080  808.080 test.py:1(<module>)
        1    0.118    0.118  717.212  717.212 Top2Vec.py:369(__init__)
        1    0.249    0.249  332.328  332.328 Top2Vec.py:820(_embed_documents)
7328/6776    0.019    0.000  326.557    0.048 traceback_utils.py:138(error_handler)
     6410    0.125    0.000  326.303    0.051 def_function.py:887(__call__)
     6410    0.153    0.000  325.936    0.051 def_function.py:934(_call)
     6409    0.015    0.000  324.596    0.051 load.py:743(_call_attribute)
     6411    0.032    0.000  323.104    0.050 execute.py:29(quick_execute)
     6411  323.064    0.050  323.064    0.050 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_Execute}
     6412    0.086    0.000  322.956    0.050 function.py:1782(_call_flat)
     6408    0.063    0.000  322.646    0.050 function.py:2951(__call__)
     6412    0.210    0.000  322.644    0.050 function.py:454(call)
        1    0.017    0.017  289.578  289.578 Top2Vec.py:573(<listcomp>)
    31103    0.031    0.000  289.560    0.009 Top2Vec.py:71(default_tokenizer)
    31103    0.038    0.000  287.631    0.009 utils.py:287(simple_preprocess)
    31103    0.048    0.000  179.592    0.006 utils.py:226(tokenize)
    31103    1.544    0.000  178.418    0.006 utils.py:173(deaccent)
1268125/1266762   38.636    0.000  176.398    0.000 {method 'join' of 'str' objects}
564544074   90.115    0.000  137.545    0.000 utils.py:199(<genexpr>)
    31103   25.852    0.001  108.002    0.003 utils.py:309(<listcomp>)
 85562598   58.641    0.000   68.938    0.000 utils.py:269(simple_tokenize)
        1    0.005    0.005   60.718   60.718 test.py:14(filter_short)
        1    0.045    0.045   60.670   60.670 series.py:4086(map)
        1    0.000    0.000   60.625   60.625 base.py:787(_map_values)
        1    1.148    1.148   60.385   60.385 {pandas._libs.lib.map_infer}
  3775007    7.763    0.000   59.236    0.000 test.py:16(<lambda>)
        1    0.083    0.083   59.007   59.007 umap_.py:2249(fit)
  3837331   51.258    0.000   51.258    0.000 {method 'split' of 'str' objects}
564700691   47.430    0.000   47.430    0.000 {built-in method unicodedata.category}
   261/46    0.005    0.000   36.519    0.794 dispatcher.py:929(compile)
    149/6    0.000    0.000   36.328    6.055 dispatcher.py:140(compile)
    149/6    0.000    0.000   36.328    6.055 dispatcher.py:147(_compile_cached)
    149/6    0.012    0.000   36.327    6.055 dispatcher.py:162(_compile_core)
    149/6    0.000    0.000   36.326    6.054 compiler.py:660(compile_extra)
 4073/149    0.008    0.000   36.228    0.243 compiler_lock.py:32(_acquire_compile_lock)
    162/6    0.001    0.000   36.145    6.024 compiler.py:422(compile_extra)
    162/6    0.001    0.000   36.141    6.024 compiler.py:487(_compile_bytecode)
    169/6    0.002    0.000   36.141    6.024 compiler.py:445(_compile_core)
    169/6    0.010    0.000   36.139    6.023 compiler_machinery.py:320(run)
 4064/148    0.041    0.000   36.136    0.244 compiler_machinery.py:257(_runPass)
12192/444    0.012    0.000   36.129    0.081 compiler_machinery.py:261(check)
        5    0.000    0.000   35.888    7.178 dispatcher.py:402(_compile_for_args)
        1    0.000    0.000   35.609   35.609 umap_.py:254(nearest_neighbors)
        1    0.000    0.000   35.572   35.572 pynndescent_.py:639(__init__)
   169/21    0.003    0.000   30.614    1.458 typed_passes.py:98(run_pass)
   169/21    0.004    0.000   30.614    1.458 typed_passes.py:61(type_inference_stage)
   169/21    0.001    0.000   30.565    1.455 typeinfer.py:1055(propagate)
   345/43    0.035    0.000   30.564    0.711 typeinfer.py:141(propagate)
9167/1171    0.053    0.000   30.448    0.026 typeinfer.py:568(resolve)
9711/1247    0.006    0.000   30.418    0.024 context.py:188(resolve_function_type)
9178/1171    0.010    0.000   30.414    0.026 typeinfer.py:1503(resolve_call)
9711/1247    0.032    0.000   30.412    0.024 context.py:232(_resolve_user_function_type)
 2632/310    0.012    0.000   30.130    0.097 typeinfer.py:558(__call__)
   130/59    0.000    0.000   30.059    0.509 functions.py:548(get_call_type)
   130/59    0.001    0.000   30.057    0.509 dispatcher.py:362(get_call_template)
      531    0.009    0.000   28.887    0.054 __init__.py:1(<module>)
  169/144    0.010    0.000   24.016    0.167 typed_passes.py:365(run_pass)
       50    0.000    0.000   23.570    0.471 caching.py:636(load_overload)
       50    0.000    0.000   23.560    0.471 caching.py:647(_load_overload)
       55    0.001    0.000   23.403    0.426 caching.py:505(_load_index)
       50    0.000    0.000   23.401    0.468 caching.py:491(load)
  724/104    0.033    0.000   23.377    0.225 {built-in method _pickle.loads}
      620    0.001    0.000   23.307    0.038 serialize.py:133(custom_rebuild)
      620    0.001    0.000   23.306    0.038 dispatcher.py:909(_rebuild)
        1    0.004    0.004   19.895   19.895 umap_.py:2713(_fit_embed_data)
        1    0.005    0.005   19.891   19.891 umap_.py:923(simplicial_set_embedding)
        1    0.006    0.006   18.418   18.418 layouts.py:220(optimize_layout_euclidean)
       89    0.005    0.000   17.159    0.193 __init__.py:15(<module>)
        1    0.117    0.117   17.037   17.037 text.py:1288(fit_transform)
      200   16.460    0.082   16.460    0.082 layouts.py:61(_optimize_layout_euclidean_single_epoch)
        1    0.000    0.000   15.456   15.456 feather_format.py:90(read_feather)
        1    0.356    0.356   15.446   15.446 feather.py:197(read_feather)
  169/144    0.002    0.000   15.101    0.105 lowering.py:133(lower)
        1   11.966   11.966   15.059   15.059 text.py:1192(_count_vocab)
  169/144    0.001    0.000   13.495    0.094 lowering.py:184(lower_normal_function)
  169/144    0.004    0.000   13.378    0.093 lowering.py:201(lower_function_body)
1785/1597    0.053    0.000   13.364    0.008 lowering.py:225(lower_block)
25062/22814    0.085    0.000   13.172    0.001 lowering.py:374(lower_inst)
10729/3978    0.076    0.000   11.717    0.003 <frozen importlib._bootstrap>:986(_find_and_load)
  6623/10    0.020    0.000   11.652    1.165 <frozen importlib._bootstrap>:956(_find_and_load_unlocked)
  6403/10    0.023    0.000   11.646    1.165 <frozen importlib._bootstrap>:650(_load_unlocked)
  9753/12    0.004    0.000   11.644    0.970 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
   6009/8    0.013    0.000   11.637    1.455 <frozen importlib._bootstrap_external>:837(exec_module)
167616/167273    4.247    0.000   11.151    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
 85570153   10.280    0.000   10.280    0.000 {method 'group' of 're.Match' objects}
        1    0.000    0.000    9.937    9.937 Top2Vec.py:4(<module>)
 3954/850    0.008    0.000    9.862    0.012 {built-in method builtins.__import__}
10963/10509    0.019
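
A listing this long can be easier to read when cut down to the top few entries. The following is a minimal sketch of producing the same cumulative-time ordering from inside Python with the standard `cProfile` and `pstats` modules; `tokenize_all` and `docs` are hypothetical stand-ins for the real workload in test.py.

```python
import cProfile
import io
import pstats


def tokenize_all(docs):
    # Hypothetical stand-in workload; the real script calls Top2Vec instead.
    return [d.lower().split() for d in docs]


docs = ["Some Example Text"] * 1000

# Profile only the region of interest rather than the whole script.
profiler = cProfile.Profile()
profiler.enable()
tokenize_all(docs)
profiler.disable()

# Sort by cumulative time (same ordering as `-s cumtime`) and keep 10 rows.
buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(10)

report = buf.getvalue()
print(report)
```

Limiting the profiled region this way also keeps import and JIT-compilation time out of the table when only the model-building step is of interest.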
May 18 '22 19:05 dbl001