When I merge two training texts into one and predict on it, the returned labels don't match the two texts' labels.
I'm new to multi-label classification, and thanks very much for your project, which let me create an example very quickly. I merged two training texts from your example data (shown below) into one text and used it for prediction. I expected the returned labels to favor the two labels that the two training texts are labelled with, but the result favors only one label.
For example, the two training texts I took are:
Text 1, which is labelled as 'Gravitation and Cosmology':
"The Central singularity in spherical collapse
The gravitational strength of the central singularity in spherically symmetric space-times is investigated. Necessary conditions for the singularity to be gravitationally weak are derived and it is shown that these are violated in a wide variety of circumstances. These conditions allow conclusions to be drawn about the nature of the singularity without having to integrate the geodesic equations. In particular, any geodesic with a non-zero amount of angular momentum which impinges on the singularity terminates in a strong curvature singularity."
Text 2, which is labelled as 'Theory-HEP':
"Proving the PP wave / CFT(2) duality
"We study the duality between IIB string theory on a pp-wave background, arising as a Penrose limit of the $AdS_3 \times S^3\times M$, where $M$ is $T^4$ (or $K3$), and the 2D CFT which is given by the ${\cal N}=(4,4)$ orbifold $(M)^N/S_N$, resolved by a blowing-up mode. After analyzing the action of the supercharges on both sides, we establish a correspondence between the states of the two theories. In particular and for the $T^4$ case, we identify both massive and massless oscillators on the pp-wave, with certain classes of excited states in the resolved CFT carrying large $R$-charge $n$. For the former, the excited states involve fractional modes of the generators of the ${\cal N}=4$ chiral algebra acting on the $Z_n$ ground states. For the latter, they involve fractional modes of the $U(1)^4_L\times U(1)^4_R$ super-current algebra acting on the $Z_n$ ground states. By using conformal perturbation theory we compute the leading order correction to the conformal dimensions of the first class of states, due to the presence of the blowing-up mode. We find agreement, to this order, with the corresponding spectrum of massive oscillators on the pp-wave. We also discuss the issue of higher order corrections."
When I create a merged text from the above two and predict on it, the returned labels favor only 'Theory-HEP'. I expected 'Gravitation and Cosmology' to have a high probability too.
[('Theory-HEP', 0.62880212), ('Gravitation and Cosmology', 0.22713795), ('Experiment-HEP', 0.025445288)]
I might not have the correct understanding of multi-label classification. The problem I am trying to address is this: I have a set of very long documents, and each document covers multiple topics. I could train a classifier per topic, but when processing a whole document I can't chunk it into multiple texts (I don't know where to chunk), so I can't feed the relevant chunk to a classifier. I am hoping to use multi-label classification: I train the model using a training data set per label, and then when I predict on the whole document, it returns multiple labels as the topics.
Would appreciate your suggestions. Thanks.
Jen
@xiejen
With regard to your observation: the dataset included is very small and is intended more as guidance on how to structure your own dataset than to give sane results, so the results don't surprise me. On top of that, the default parameters cut the document to its first 300 words, so the second document in the merge might end up being largely discarded (you may change this parameter if you wish).
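To make the truncation point concrete, here is an illustrative sketch (not the project's actual preprocessing code, and the 300-word cutoff is taken from the default mentioned above): if only the first 300 words are kept, a merged document can lose most of its second half.

```python
# Illustrative only: stand-in texts, not the real abstracts.
MAX_WORDS = 300  # assumed default cutoff

text_1 = "word " * 250   # stand-in for the ~250-word first abstract
text_2 = "other " * 250  # stand-in for the second abstract

merged = text_1 + text_2
truncated_words = merged.split()[:MAX_WORDS]

# Only 50 of text_2's 250 words survive the cutoff.
survivors_from_text_2 = sum(1 for w in truncated_words if w == "other")
print(survivors_from_text_2)  # 50
```

So with two ~250-word abstracts, the classifier would effectively see all of the first one but only a fifth of the second, which matches the skew you observed.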
With regard to your problem: multi-label classification can assign a document to more than one class. If you have large documents that belong to many classes, e.g. a news article that is in both the "sports" and "international" categories, then this approach is for you. The alternative you mentioned, which would also be sane, is to train K independent binary classifiers (K being the number of classes) and run each of your docs through every classifier. This approach would not leverage the latent relations between classes, but might be sufficient for your use case.
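A minimal sketch of the "K independent binary classifiers" idea (often called binary relevance): one yes/no classifier per label, each run over the whole document. The keyword "classifiers" below are toy stand-ins I made up for illustration, not a real trained model; in practice each would be a proper binary classifier.

```python
def make_keyword_classifier(keywords):
    """Toy binary classifier: does the document mention any keyword?"""
    def classify(doc):
        words = set(doc.lower().split())
        return any(k in words for k in keywords)
    return classify

# One independent binary classifier per label (K = 3 here).
# The keyword sets are illustrative, not from any real model.
classifiers = {
    "Gravitation and Cosmology": make_keyword_classifier({"singularity", "geodesic"}),
    "Theory-HEP": make_keyword_classifier({"pp-wave", "duality"}),
    "Experiment-HEP": make_keyword_classifier({"detector", "collider"}),
}

def predict_labels(doc):
    """Run every binary classifier on the doc; collect labels that fire."""
    return [label for label, clf in classifiers.items() if clf(doc)]

merged_doc = "the central singularity in spherical collapse proving the pp-wave duality"
print(predict_labels(merged_doc))
# ['Gravitation and Cosmology', 'Theory-HEP']
```

Because each classifier sees the whole document independently, a merged document can trigger several labels at once; the trade-off, as noted above, is that correlations between labels are ignored.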
Hope that helps :)