Resolve performance issues with large datasets
What this PR does / why we need it
Resolves performance issues for API calls involving heavy datasets and their files.
The root cause of the detected issues is the use of the getFileMetadatas method of the DatasetVersion class in several areas of the application. This method queries the database to return all files present in a version. For small datasets this is not an expensive operation, but it becomes a performance bottleneck for big datasets such as the heavy one on beta (10,000 files).
Issue 1) Slow collections page / search API endpoint
Although the search API endpoint uses Solr to quickly search for results, there was a performance bottleneck when composing the JSON object returned by the API whenever one of the returned elements was a heavy dataset.
In particular, the JSON converter method of the SolrSearchResult class was calling the getFileMetadatas method to obtain the total number of files. See:
https://github.com/IQSS/dataverse/blob/develop/src/main/java/edu/harvard/iq/dataverse/search/SolrSearchResult.java#L574
I have replaced this expensive call with a custom count query, which was already present in the code (DatasetVersionFilesServiceBean):
https://github.com/IQSS/dataverse/blob/63a09cb5a9de10551557a6d1e6237ee1dd7d5d48/src/main/java/edu/harvard/iq/dataverse/DatasetVersionFilesServiceBean.java#L52
I also reorganized the code a bit and performed a general cleanup.
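To make the difference concrete, here is a minimal, hedged stand-in (these are not the actual Dataverse classes; the names and data structures are illustrative) showing why counting files via getFileMetadatas is expensive compared with a dedicated count query:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in: the old path materializes every FileMetadata row
// just to call size(), while the replacement asks the database for a single
// COUNT value, so only one number travels back to the application.
public class FileCountSketch {

    // Old behavior: analogous to datasetVersion.getFileMetadatas().size().
    // Every row is fetched and held in memory before counting.
    static long countByLoadingAll(List<String> fileRowsInDb) {
        List<String> materialized = new ArrayList<>(fileRowsInDb); // full fetch
        return materialized.size();
    }

    // New behavior: analogous to a count query in
    // DatasetVersionFilesServiceBean, where the database computes the total
    // server-side (e.g. SELECT COUNT(*) ... WHERE datasetversion_id = ?).
    static long countWithDedicatedQuery(List<String> fileRowsInDb) {
        return fileRowsInDb.size(); // stands in for the DB-side COUNT
    }

    public static void main(String[] args) {
        List<String> tenThousandFiles = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            tenThousandFiles.add("file-" + i);
        }
        System.out.println(countByLoadingAll(tenThousandFiles));      // 10000
        System.out.println(countWithDedicatedQuery(tenThousandFiles)); // 10000
    }
}
```

Both paths return the same number; the difference is where the counting happens and how much data crosses the JPA layer for a 10,000-file version.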
Performance monitoring
I have tested the affected Search endpoint, requesting the first page ordered by date (desc), forcing the heavy dataset to appear in the results. This is the same call js-dataverse uses.
curl -H "XXXXXXXXXXXXXX" -w "\n\n%{time_connect} + %{time_starttransfer} = %{time_total}\n" "https://beta.dataverse.org/api/v1/search?q=*&type=dataset&sort=date&order=desc&per_page=10&start=0&subtree=root"
The performance improvement obtained after the change is presented below:
Before optimization
0.140740 + 8.130421 = 8.131392 seconds
After optimization
0.134426 + 0.508241 = 0.509249 seconds
Achieved optimization: ~16x faster.
Considerations
While solving this problem, I also tried to optimize the index search itself, to see whether performance could be improved on that side too.
For example, I tested the search operation after configuring the dateSort field (used to sort the collection page results by date) to use docValues, which is Solr's recommended mechanism for efficient sorting and faceting. However, I found no significant improvement.
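For reference, enabling docValues on a sort field is a schema.xml change along these lines (a sketch only; the field type shown is illustrative, and changing docValues on an existing field requires a full reindex):

```xml
<!-- schema.xml: with docValues="true", Solr sorts and facets from a
     columnar on-disk structure instead of uninverting the index at
     query time. -->
<field name="dateSort" type="pdate" indexed="true" stored="true" docValues="true"/>
```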
Issue 2) Slow Files Tab / API endpoints using PermissionsServiceBean
The GetDataFileCommand command is a widely used command in the API to obtain a file. This command handles permission checks to verify that the calling user has permissions to access the requested file.
The permissions checking logic is located in the PermissionsServiceBean class.
We discovered possible performance bottlenecks in this class, especially when dealing with files belonging to large datasets. In particular, the isPublicallyDownloadable method called getFileMetadatas and then iterated over the returned files in a for loop, which caused significant performance degradation.
I developed a new native query to replace this behavior. The new query checks whether a datafile is present in a specific dataset version; in this particular scenario, it checks whether the datafile is present in the released dataset version. See:
- https://github.com/IQSS/dataverse/blob/solr-date-sort-optimization/src/main/java/edu/harvard/iq/dataverse/PermissionServiceBean.java#L455
- https://github.com/IQSS/dataverse/blob/solr-date-sort-optimization/src/main/java/edu/harvard/iq/dataverse/DatasetVersionFilesServiceBean.java#L209
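The idea behind the replacement can be sketched as follows (a hedged, self-contained illustration; the method name, map-based "database", and SQL in the comment are assumptions for the sketch, not the real Dataverse API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: instead of loading the released version's whole file
// list and scanning it in a Java for-loop, ask the database one
// EXISTS-style question, roughly:
//   SELECT EXISTS(SELECT 1 FROM filemetadata
//                 WHERE datasetversion_id = :versionId
//                   AND datafile_id = :fileId)
public class ReleasedFileCheckSketch {

    // Stand-in for the database's indexed lookup: a cheap membership test
    // instead of materializing and iterating over every file in the version.
    static boolean isFileInVersion(Map<Long, Set<Long>> fileIdsByVersionId,
                                   long versionId, long dataFileId) {
        Set<Long> fileIds = fileIdsByVersionId.get(versionId);
        return fileIds != null && fileIds.contains(dataFileId);
    }

    public static void main(String[] args) {
        Map<Long, Set<Long>> db = new HashMap<>();
        db.put(1L, Set.of(100L, 101L)); // released version 1 holds files 100, 101
        System.out.println(isFileInVersion(db, 1L, 100L)); // true
        System.out.println(isFileInVersion(db, 1L, 999L)); // false
    }
}
```

The native query moves this membership test into the database, which can answer it from an index without transferring any file rows to the application.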
Performance monitoring
To test GetDataFileCommand, I used the getFileData endpoint for an affected datafile.
curl -H "X-Dataverse-key:XXXXXX" -w "\n\n%{time_connect} + %{time_starttransfer} = %{time_total}\n" "https://beta.dataverse.org/api/v1/files/16588"
The performance improvement obtained after the change is presented below:
Before optimization
0.194373 + 8.678860 = 8.679027 seconds
After optimization
0.139459 + 0.443956 = 0.444019 seconds
Achieved optimization: ~19x faster.
Conclusions
Observing the nature of the issues found, we can affirm that use of the getFileMetadatas method should be avoided, or at least meticulously controlled, to ensure that it does not introduce performance bottlenecks into the code.
In all cases where we found this problem, it was possible to replace the call with a custom database query. We should keep in mind that a custom query designed for the particular use case will always be more efficient than this method plus the associated post-filtering code.
Which issue(s) this PR closes
- Closes #10415
Is there a release notes update needed for this change?
Yes, attached.