
Error When matching Chinese name

Open · ZhihaoMa opened this issue on Oct 14, 2021 · 9 comments

Hi, I am trying to match Chinese firm names and I get the following error:

File "C:/Users/acemec/Documents/firm_data/name_match.py", line 14, in matches = match_most_similar(companies['company_name'], new_companies['assignee'], ignore_index=True) File "C:\Users\acemec\anaconda3\lib\site-packages\string_grouper\string_grouper.py", line 108, in match_most_similar string_grouper = StringGrouper(master, File "C:\Users\acemec\anaconda3\lib\site-packages\string_grouper\string_grouper.py", line 218, in init raise TypeError('Input does not consist of pandas.Series containing only Strings') TypeError: Input does not consist of pandas.Series containing only Strings

Here is my code:

import pandas as pd
import numpy as np
from string_grouper import match_strings, match_most_similar, group_similar_strings, compute_pairwise_similarities, StringGrouper
import dask.dataframe as dd

company_names = 'C:/Users/acemec/Documents/firm_data/company_annual.csv'
companies = dd.read_csv(company_names, on_bad_lines='skip', dtype=str, low_memory=False)

new_companies_name = 'C:/Users/acemec/Documents/firm_data/Pat_firm_list.csv'
new_companies = dd.read_csv(new_companies_name, on_bad_lines='skip', dtype=str, low_memory=False)

matches = match_most_similar(companies['company_name'], new_companies['assignee'], ignore_index=True)

match_result = pd.concat([new_companies, matches], axis=1)

df = pd.DataFrame(match_result)
df.to_csv('C:/Users/acemec/Documents/firm_data/file_name.csv', encoding='utf-8')

Could you give me some suggestions?

ZhihaoMa avatar Oct 14 '21 06:10 ZhihaoMa

Hi @ZhihaoMa

Thanks for your interest in string_grouper.

Have you used dask DataFrames with string_grouper successfully before? I ask in order to find out whether the error is caused by your use of dask rather than by the Chinese characters.

ParticularMiner avatar Oct 14 '21 08:10 ParticularMiner

I use dask DataFrames because the CSV file is too large (~20 GB). When I use pandas (pd.read_csv) directly, the error is:

Traceback (most recent call last):
  File "C:/Users/acemec/Documents/firm_data/name_match.py", line 9, in <module>
    companies = pd.read_csv(company_names, on_bad_lines='skip', dtype=str, low_memory=False)
  File "C:\Users\acemec\anaconda3\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\acemec\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\acemec\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 488, in _read
    return parser.read(nrows)
  File "C:\Users\acemec\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 1047, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "C:\Users\acemec\anaconda3\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 228, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 783, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 872, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 1925, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: out of memory

ZhihaoMa avatar Oct 14 '21 09:10 ZhihaoMa
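For reference, one way to keep memory usage down with plain pandas is to read only the name column, in chunks. The sketch below is not from the thread; the path and column name are copied from the code above, and the chunk size is an arbitrary choice:

import pandas as pd

# Read only the column needed for matching, in chunks, to limit peak memory.
chunks = pd.read_csv(
    'C:/Users/acemec/Documents/firm_data/company_annual.csv',
    usecols=['company_name'],
    dtype=str,
    on_bad_lines='skip',
    chunksize=500_000,  # adjust to the available RAM
)
company_name = pd.concat(chunk['company_name'] for chunk in chunks).astype(str)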

@ZhihaoMa

I understand that.

What I want to know is this: when you use a pandas DataFrame with a small dataset of Chinese strings, does string_grouper work or not?

If it works, then the problem is coming from dask, not the Chinese characters.

If it does not work, then the problem is the Chinese characters.

ParticularMiner avatar Oct 14 '21 09:10 ParticularMiner
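A minimal pandas-only test along the lines suggested above might look like this; the Chinese company names are illustrative and not taken from the issue's data:

import pandas as pd
from string_grouper import match_most_similar

# A tiny in-memory test: pandas Series of (made-up) Chinese firm names.
master = pd.Series(['阿里巴巴集团控股有限公司', '腾讯控股有限公司', '百度在线网络技术有限公司'])
duplicates = pd.Series(['阿里巴巴集团', '腾讯控股'])

# If this runs without the TypeError, the problem is the dask input, not the characters.
print(match_most_similar(master, duplicates, ignore_index=True))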

@ZhihaoMa

string_grouper was not designed with dask in mind. That said, I can see that supporting dask as an alternative to pandas would be very useful. Perhaps a future version of string_grouper will support it.

So I would be very grateful if you could let me know the answer to the question above, so that I know how best to incorporate dask into string_grouper.

ParticularMiner avatar Oct 14 '21 10:10 ParticularMiner

@ParticularMiner Sorry for the late reply. The package works well for Chinese files once the encoding is handled, but it does not support dask. When I load the data with dd.read_csv, I get:

TypeError: Input does not consist of pandas.Series containing only Strings

ZhihaoMa avatar Oct 16 '21 14:10 ZhihaoMa

Thanks @ZhihaoMa

I will take a closer look at dask. Or have you found another way?

ParticularMiner avatar Oct 16 '21 15:10 ParticularMiner
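One possible workaround, untested against the issue's data: since the error message says the inputs must be pandas Series containing only strings, the dask columns can be materialized into pandas Series first. Only the two name columns are pulled into memory, which may fit even when the full ~20 GB files do not:

from string_grouper import match_most_similar

# `companies` and `new_companies` are the dask DataFrames from the original post.
company_name = companies['company_name'].compute().astype(str)   # dask Series -> pandas Series
assignee = new_companies['assignee'].compute().astype(str)

matches = match_most_similar(company_name, assignee, ignore_index=True)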

Hello @ZhihaoMa, how did you go about the encoding? Can you explain what you did exactly? I am facing the same issue.

vherasme avatar Aug 10 '22 12:08 vherasme

Encountered this error and solved it by adding .astype(str):

companyNames = names['Name'].astype(str).drop_duplicates()
df = sg.match_strings(companyNames)

liri2006 avatar Mar 09 '23 13:03 liri2006
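For anyone hitting the same TypeError: the check fails whenever the Series holds anything other than str objects. Even with dtype=str, pandas leaves missing CSV cells as float NaN, and .astype(str) turns those into the string 'nan', which is why the fix above passes the check. A small illustration with made-up names:

import numpy as np
import pandas as pd

# A missing cell (NaN) makes the column a mix of str and float objects.
names = pd.DataFrame({'Name': ['阿里巴巴集团', np.nan, '腾讯控股有限公司']})

print(names['Name'].map(type).unique())              # [<class 'str'>, <class 'float'>]
print(names['Name'].astype(str).map(type).unique())  # [<class 'str'>] -- NaN becomes the string 'nan'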

> Encountered this error and solved it by adding .astype(str):
>
> companyNames = names['Name'].astype(str).drop_duplicates()
> df = sg.match_strings(companyNames)

Thanks!

ZhihaoMa avatar Mar 11 '23 13:03 ZhihaoMa