BERTopic icon indicating copy to clipboard operation
BERTopic copied to clipboard

Suggestion: make documents optional for BERTopic.transform()

Open jhrystrom opened this issue 4 years ago • 3 comments

Hi Maarten,

Thank you once again for your fantastic work :))

I'm using BERTopic as part of a sklearn-style pipeline. Looking at the transform function, it seems that it only uses the documents-input if embeddings are not provided. Therefore it might make sense to make documents an optional parameter. Right not I use a hacky workaround where I provide it with some sham-documents, but this introduces a bit of overhead for large datasets. Would this be possible or would it disturb the API too much?

I would be more than happy to implement it, but it shouldn't be too difficult (if you think it is a good idea)

jhrystrom avatar Dec 13 '21 15:12 jhrystrom

That is indeed quite difficult! Although I agree that it would be nicer to make documents optional in your use case, I am hesistant about both documents and embeddings being optional parameters. In theory, that would allow you to do .transform() without passing any documents or embeddings. Seeing as you definitely need one or the other, both being optional could be misleading.

MaartenGr avatar Dec 13 '21 16:12 MaartenGr

I have thought a bit about this, and haven't come up with any great solutions. The "best" solution would be to have one argument that should be either documents or embeddings - however, this would break a lot of stuff (probably) which isn't so great. Another option would be to throw a warning if both are None, which would alleviate the problem slightly. I do get if it's too much of a hassle though :))

jhrystrom avatar Jan 03 '22 12:01 jhrystrom

Hmmm, you are right that the cleanest way would be one argument to replace them both, and like you said that might break some stuff. Also, I like separating them for clarity.

One, perhaps hacky, way of approaching this might be to keep documents a required argument but allow it to have the value of None and, as you mentioned, check if both of those values are None. However, that might seem strange from an API perspective: topics, probs = topic_model.transform(None, embeddings) whilst you much rather would have topics, probs = topic_model.transform(embeddings=embeddings)

Hmmm, I am starting to lean towards your last option. Making them both optional but allowing for an error of some kind if both are None. It is not the most elegant way of approaching this but if I make sure the documentation is nicely written it should not confuse users.

A fix for this could take a bit longer as I am struggling with some errors in the GitHub action workflow pipeline. Those need to be fixed first before fixes and features are merged.

MaartenGr avatar Jan 04 '22 06:01 MaartenGr