azure-sdk-for-python
Info about sentence segmentation used by the model
Good morning. First of all, I'd like to thank you for the amazing work done. I'm writing to ask whether you can provide further information about the deep learning model that performs the text summarization task. Is there a paper or something similar? In particular, I'm interested in the kind of sentence segmentation that is performed as a preprocessing step. Do you use an external sentence segmenter (e.g. spaCy)? I would also like to know if you are planning to increase the maximum number of sentences (which is 20 at the moment). I thank you in advance and I apologize for bothering you!
Document Details
⚠ Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.
- ID: 09a0a57d-b105-3d20-5384-ba399f3a1302
- Version Independent ID: bdba0610-7e75-e843-e711-c0e74136449d
- Content: azure.ai.textanalytics.ExtractSummaryAction class
- Content Source: preview/docs-ref-autogen/azure-ai-textanalytics/azure.ai.textanalytics.ExtractSummaryAction.yml
- GitHub Login: @VSC-Service-Account
Hi @EmanueleGusso, thank you for opening an issue! I'll tag the appropriate folks so we can look into this as soon as possible.
@abhahn are you able to share more background information about extractive summarization?
Hi everyone @kristapratico @abhahn @iscai-msft. Is there any update? Sorry for writing to you again, but having this information would be very important to me because it will influence my choice of whether or not to subscribe to Azure Cognitive Services. Thank you again for your great work!
I noticed that Azure's Extractive Summarization performs its own sentence segmentation. Is it possible to provide a list of sentences rather than the whole document? That is, I'd like to split the document into sentences myself and then get Azure's ranking of those sentences.
I also noticed that, given a short document (shorter than 20 sentences), even if I set the maximum number of sentences for the summarization to 20, some sentences get lost and are not reported in the result. Is it normal? What does it mean? Does it mean that those sentences are ranked as 0 importance? Please help me.
Hi @EmanueleGusso , thanks for your questions.
We are not able to disclose details of the model internals, other than that we are using a state-of-the-art (SotA) transformer-based model. I was also informed that I could say that we are not using spaCy, but I cannot say more than that in the affirmative about what we are using for sentence segmentation.
As for your question about increasing the maximum number of sentences returned, could you provide us with some feedback about why this increase is important for your use case? I can forward your feedback to my colleagues in order to determine when / if we can implement this change.
For your second set of questions in the comments: as of now, it is not possible to send the input already split into sentences. We currently only support sending whole documents, which are segmented by the service. I can also ask my colleagues about the option to send a document already split into sentences, but this would require an API update and would likely take some time to implement if it were given the green light.
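Since the service does the segmentation, one possible client-side workaround is to align the returned summary sentences with your own segmentation by character offsets. The following is a minimal sketch, not an official API: it assumes each returned sentence exposes an offset, a length, and a rank score (modeled here as plain dicts with the hypothetical keys "offset", "length" and "rank_score"), and that offsets are counted in the same units as Python string indices. The document text and scores are invented purely for illustration.

```python
from typing import List, Optional, Tuple


def align_scores(
    my_spans: List[Tuple[int, int]],
    summary_sentences: List[dict],
) -> List[Optional[float]]:
    """For each caller-defined (start, end) span, return the rank score of the
    returned summary sentence that overlaps it the most, or None if no
    returned sentence overlaps it at all."""
    scores: List[Optional[float]] = []
    for start, end in my_spans:
        best_overlap, best_score = 0, None
        for sent in summary_sentences:
            s_start = sent["offset"]
            s_end = s_start + sent["length"]
            # Number of characters shared by the two half-open intervals.
            overlap = min(end, s_end) - max(start, s_start)
            if overlap > best_overlap:
                best_overlap, best_score = overlap, sent["rank_score"]
        scores.append(best_score)
    return scores


# Illustrative data only: a made-up document and made-up service output.
document = "Azure is a cloud platform. It has many services. Pricing varies."
my_sentences = [
    "Azure is a cloud platform.",
    "It has many services.",
    "Pricing varies.",
]

# Locate each of our own sentences in the document to get (start, end) spans.
my_spans = []
cursor = 0
for sentence in my_sentences:
    start = document.index(sentence, cursor)
    my_spans.append((start, start + len(sentence)))
    cursor = start + len(sentence)

returned = [
    {"offset": 0, "length": 26, "rank_score": 0.9},
    {"offset": 49, "length": 15, "rank_score": 0.4},
]

scores = align_scores(my_spans, returned)
print(scores)  # [0.9, None, 0.4]
```

A None entry marks a sentence that the service did not return, which is also a quick way to list the "lost" sentences when preparing a repro. It does not provide a score for those sentences; only the service could supply that.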
If the document is shorter than 20 sentences, the total number of sentences returned will not exceed the number of sentences in the document, even if you set the value in the request to the maximum of 20. This parameter has no impact on how the document is segmented for processing; it only tells the service the maximum number of summarizing sentences to return.
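The bound described above can be written down as a small sketch. The function name and the lower limit of 1 are assumptions made for illustration. It only computes an upper bound on how many sentences can come back, and says nothing about how the service actually picks them, which is not disclosed.

```python
MAX_ALLOWED = 20  # current maximum for the request parameter, per this thread


def summary_size_upper_bound(segmented_sentences: int, max_sentence_count: int) -> int:
    """Upper bound on the number of summary sentences returned.

    segmented_sentences: how many sentences the service found in the document.
    max_sentence_count: the value sent in the request.
    """
    if not 1 <= max_sentence_count <= MAX_ALLOWED:
        raise ValueError(f"max_sentence_count must be between 1 and {MAX_ALLOWED}")
    # The service cannot return more sentences than the document contains,
    # nor more than the caller asked for.
    return min(segmented_sentences, max_sentence_count)


print(summary_size_upper_bound(12, 20))  # 12
print(summary_size_upper_bound(50, 20))  # 20
```

Note that the count is based on the service's own segmentation, so it can differ from the number of sentences you would count with another segmenter.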
For the issue of sentences getting lost, could you share a repro for this? I can take a look at the output to see if there is a bug in the service or if the behavior is expected.
@abhahn Thank you very much for your availability.
My need to increase the maximum number of sentences stems from the fact that I am interested in having the importance scores of all the sentences in my text. Since this function is not available and the API returns only the sentences chosen for the summary and their importance scores, I was hoping to set a very large maximum in order to get back all the sentences and their scores (which is what I'm interested in).
But, if I understood correctly, you are telling me that even if I set the maximum number to 20, it is not guaranteed that the summary will have 20 sentences; it may have fewer. Am I right?
So, to recap, I'm interested in the importance scores of all the sentences that make up my text. Could the API return this information? (I think that internally your model has this information in the step preceding the choice of sentences for the summary.) Thank you again for your support :)
@EmanueleGusso apologies for the late response on this, but I wanted to check in and make sure your questions were answered regarding the SDK. Is there anything else we can help you with?
Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!