weld
Perform the lazy encoding conversion
I found that Grizzly's memory usage is much larger than pandas'. I looked into it and found that it may be caused by the change of encoding type when calling raw_column = np.array(self.df[key], dtype=str).
Can it be optimized by keeping the original encoding type in dataframe[key].values and performing the conversion at runtime (lazy encoding conversion)?
If the optimization I am proposing is correct, I can take this issue. Thanks.
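To make the suspected overhead concrete, here is a minimal sketch, assuming Python 3 with NumPy and pandas. The sample data is made up for illustration and is not taken from the Grizzly benchmarks. It compares the native object array with the result of the same `np.array(..., dtype=str)` call; the latter is a fixed-width Unicode array, so every element is padded to the length of the longest string.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption): many short strings plus one long one.
values = ["a"] * 999 + ["x" * 100]
series = pd.Series(values, dtype=object)

# Native representation: an object array holding one pointer per element.
native = series.values

# Forced conversion, as in the call quoted above: a fixed-width Unicode
# array whose item size is set by the longest string (4 bytes per character).
converted = np.array(series, dtype=str)

print("native:   ", native.dtype, native.nbytes, "bytes (pointers only)")
print("deep size:", series.memory_usage(index=False, deep=True), "bytes")
print("converted:", converted.dtype, converted.nbytes, "bytes")
```

Note that `native.nbytes` counts only the pointers, so the `deep=True` figure is the fairer comparison. On Python 2, `dtype=str` maps to a byte-string type instead, so the exact ratio would differ there.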
Hi @hustnn, I have encountered the same problem; see #127 for details.
I am still trying to read and understand the weld code, so I don't understand why the conversion is needed. I tried commenting out that line, but it triggered another error. How would you implement what you proposed?
Thanks.
@wjliu I think they are different problems. In your issue, the memory increases continuously, possibly because the object is never released. You can try freeing the object manually in your code.
In my issue, the memory overhead is larger than that of the native pandas implementation. You can compare the data_cleaning pandas implementation with the Grizzly implementation. I debugged the code and found that the main overhead appears after the str type conversion is performed. That is why I am proposing this lazy type conversion.
Simply deleting that line is not enough. More modifications are needed. I am looking into it.
@hustnn
Thanks, I will try it again.
In my issue, the continuously increasing memory is the reason the process gets killed, but the memory usage is also larger than pandas'. I debugged the code and found that the same line inflates memory usage.
Hi @hustnn,
The conversion is happening at runtime, but I do agree that it would be nice to do the conversion directly over the native column type (in this case, dataframe[key].values) and not have to force the conversion to str.
If you can get this to work, I am happy to look at a PR. Thanks!
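One way to picture the idea discussed above is a small wrapper that keeps the native column and only builds the `str` array if something actually asks for it. This is a hypothetical sketch, not Grizzly or Weld code: the `LazyStrColumn` class and its method names are invented for illustration, and a real change would also need the downstream code to accept the native type.

```python
import numpy as np
import pandas as pd


class LazyStrColumn:
    """Hypothetical wrapper: hold the native array, convert to str on demand."""

    def __init__(self, series):
        # Keep the column in its original representation (no copy forced here).
        self._native = series.values
        self._converted = None

    @property
    def native(self):
        return self._native

    def as_str(self):
        # Perform the fixed-width str conversion only once, and only if needed.
        if self._converted is None:
            self._converted = np.array(self._native, dtype=str)
        return self._converted


# Illustrative usage with made-up data.
df = pd.DataFrame({"zip": pd.Series(["02139", "94305", "10001"], dtype=object)})
col = LazyStrColumn(df["zip"])
converted_before_use = col._converted is not None  # False: nothing converted yet
as_list = col.as_str().tolist()                    # conversion happens here
```

Code paths that can work on `col.native` never pay for the conversion; only callers of `as_str()` do, and the result is cached after the first call.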
@deepakn94 I am working on it.