
Resolve performance issues with large datasets

GPortas opened this pull request 11 months ago · 24 comments

What this PR does / why we need it

Resolves performance issues for API calls involving heavy datasets and their files.

The root cause of the detected issues is the use of the getFileMetadatas method of the DatasetVersion class in several areas of the application.

This method queries the database for all files present in a version. For small datasets this is not an expensive operation, but it creates a performance bottleneck for large datasets, such as the heavy one on beta (10,000 files).

Issue 1) Slow collections page / search API endpoint

Although the search API endpoint uses Solr to find results quickly, there was a performance bottleneck when composing the JSON object returned by the API whenever one of the returned elements was a heavy dataset.

In particular, the JSON converter method of the SolrSearchResult class was calling the getFileMetadatas method to obtain the total number of files. See: https://github.com/IQSS/dataverse/blob/develop/src/main/java/edu/harvard/iq/dataverse/search/SolrSearchResult.java#L574

I have replaced this expensive call with a custom query, which was already present in the code (DatasetVersionFilesServiceBean):

https://github.com/IQSS/dataverse/blob/63a09cb5a9de10551557a6d1e6237ee1dd7d5d48/src/main/java/edu/harvard/iq/dataverse/DatasetVersionFilesServiceBean.java#L52
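The replacement follows the count-query pattern: ask the database for the number of files instead of materializing every FileMetadata row and taking `.size()`. A minimal sketch of the idea follows; the class name, method names, and JPQL below are illustrative assumptions, not the actual DatasetVersionFilesServiceBean code:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical sketch: names are illustrative, not the actual Dataverse API.
public class FileCountSketch {

    // Slow pattern: materialize every FileMetadata row, then take .size().
    // For a heavy dataset this forces loading 10,000+ rows just to count them.
    static int countBySize(List<String> allFileMetadatas) {
        return allFileMetadatas.size();
    }

    // Fast pattern: let the database count. With JPA this is roughly:
    //   em.createQuery("SELECT COUNT(fm) FROM FileMetadata fm "
    //                + "WHERE fm.datasetVersion.id = :versionId", Long.class)
    //     .setParameter("versionId", versionId)
    //     .getSingleResult();
    // Only the query string is shown here, since no database is available.
    static String countQueryJpql() {
        return "SELECT COUNT(fm) FROM FileMetadata fm"
             + " WHERE fm.datasetVersion.id = :versionId";
    }

    public static void main(String[] args) {
        // Simulate the materialized list a heavy dataset would produce.
        List<String> simulated = IntStream.range(0, 10_000)
                .mapToObj(i -> "file-" + i)
                .collect(Collectors.toList());
        System.out.println("materialized count: " + countBySize(simulated));
        System.out.println("count query: " + countQueryJpql());
    }
}
```

The key point is that the count query returns a single scalar from the database, so the cost no longer grows with the number of file rows transferred and hydrated into entities.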

I also reorganized the code a bit and performed some general cleanup.

Performance monitoring

I tested the affected search endpoint by requesting the first page ordered by date (descending), forcing the heavy dataset to appear in the results. This is the same call that js-dataverse uses.

curl -H "XXXXXXXXXXXXXX" -w "\n\n%{time_connect} + %{time_starttransfer} = %{time_total}\n" "https://beta.dataverse.org/api/v1/search?q=*&type=dataset&sort=date&order=desc&per_page=10&start=0&subtree=root"

The performance improvement obtained after the change is presented below:

Before optimization

0.140740 + 8.130421 = 8.131392 seconds

After optimization

0.134426 + 0.508241 = 0.509249 seconds

Achieved optimization: ~16x faster.

Considerations

While solving this problem, I also tried to optimize the index search to see whether performance could be improved in that part as well. However, I did not achieve any noticeable improvement.

For example, I tested the search operation after configuring the dateSort field (used to sort the collection page results by date) to use docValues, which is the recommended mechanism for efficient sorting and faceting in Solr. But, as mentioned above, I found no significant improvement.
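For reference, enabling docValues on a Solr field is a schema.xml change along these lines (a sketch: the field name comes from the text above, but the field type and other attributes are assumptions and may differ in Dataverse's actual schema):

```xml
<!-- schema.xml: docValues stores the field column-oriented on disk,
     which Solr recommends for fields used in sorting and faceting -->
<field name="dateSort" type="pdate" indexed="true" stored="false" docValues="true"/>
```

Note that changing docValues on an existing field requires reindexing before it takes effect, which is worth ruling out when an experiment like this shows no improvement.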

Issue 2) Slow Files Tab / API endpoints using PermissionsServiceBean

The GetDataFileCommand command is widely used in the API to obtain a file. It performs permission checks to verify that the calling user is allowed to access the requested file.

The permissions checking logic is located in the PermissionsServiceBean class.

We discovered potential performance bottlenecks in this class, especially when dealing with files belonging to large datasets. In particular, the isPublicallyDownloadable method called getFileMetadatas and then iterated over the files in a for loop, causing significant performance degradation.

I developed a new native query to replace this behavior. It checks whether a datafile is present in a specific dataset version; in this particular scenario, whether the datafile is present in the released dataset version. See:

  • https://github.com/IQSS/dataverse/blob/solr-date-sort-optimization/src/main/java/edu/harvard/iq/dataverse/PermissionServiceBean.java#L455

  • https://github.com/IQSS/dataverse/blob/solr-date-sort-optimization/src/main/java/edu/harvard/iq/dataverse/DatasetVersionFilesServiceBean.java#L209
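The idea behind the new query can be sketched as a single EXISTS check pushed down to the database, replacing the load-everything-and-loop approach. The table and column names below are assumptions for illustration only, not copied from the actual Dataverse schema or the linked code:

```java
// Hypothetical sketch: an EXISTS query lets the database answer
// "is this datafile in this version?" without returning any rows.
public class FilePresenceCheckSketch {

    // Native SQL, assuming one filemetadata join row per (version, datafile)
    // pair. With JPA this would run via em.createNativeQuery(existsSql())
    // with the version id and datafile id bound as parameters.
    static String existsSql() {
        return "SELECT EXISTS("
             + "SELECT 1 FROM filemetadata fm"
             + " WHERE fm.datasetversion_id = ?"
             + " AND fm.datafile_id = ?)";
    }

    public static void main(String[] args) {
        System.out.println(existsSql());
    }
}
```

Because EXISTS short-circuits on the first matching row, the check stays cheap even for a version with 10,000 files, whereas the old for loop scaled linearly with the file count.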

Performance monitoring

To test GetDataFileCommand, I used the getFileData endpoint for an affected datafile.

curl -H "X-Dataverse-Key:XXXXXX" -w "\n\n%{time_connect} + %{time_starttransfer} = %{time_total}\n" "https://beta.dataverse.org/api/v1/files/16588"

The performance improvement obtained after the change is presented below:

Before optimization

0.194373 + 8.678860 = 8.679027 seconds

After optimization

0.139459 + 0.443956 = 0.444019 seconds

Achieved optimization: ~19x faster.

Conclusions

Given the nature of the issues found, the use of the getFileMetadatas method should be avoided, or at least carefully controlled, to ensure that it does not introduce performance bottlenecks.

In every case where we found this problem, it was possible to replace the call with a custom database query. We should keep in mind that a custom query designed for the particular use case will always be far more efficient than this method plus the associated post-filtering code.

Which issue(s) this PR closes

  • Closes #10415

Is there a release notes update needed for this change?

Yes, attached.

GPortas · Mar 18 '24