Federated Search Implementation
The federated search functionality enables PhysioNet to query and aggregate research data from multiple external PhysioNet instances in real time, giving users a unified search experience across a distributed network of repositories. The implementation follows a modular adapter-pattern architecture that separates the concerns of search orchestration, API communication, and result normalization.
Workflow
When a user performs a search with federated search enabled, the system executes a multi-step process orchestrated by the FederatedSearchService:

1. The service queries the database for all enabled federated sites and initializes a ThreadPoolExecutor with up to 10 concurrent workers to parallelize the API requests.
2. For each federated site, a PhysioNetAdapter is instantiated, which constructs the full API endpoint URL by combining the site's base URL with the configured search endpoint.
3. The adapter makes an HTTP GET request with the user's search term and resource type filters, optionally including authentication headers if an API token is configured.
4. The API response, expected as JSON in a standardized PhysioNet schema, is parsed, and each result is normalized to a common internal format that includes both the original project metadata (title, slug, version, abstract, resource type, etc.) and federated-specific metadata (source site name, external URL, federated flag).
5. Results from all sites are collected as they complete (order of completion is kept for faster perceived performance) and returned alongside local search results.

The entire process is fail-silent: if one site times out or returns an error, the error is logged but does not prevent other sites from contributing results, keeping the search robust against network issues or site unavailability.
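Below is a minimal sketch of this flow, assuming `requests` for the HTTP calls and `concurrent.futures` for parallelism. The `FederatedSite` record, its field names, the query parameter names, and the external URL pattern are illustrative assumptions, not the actual PhysioNet implementation.

```python
# Minimal sketch of the orchestration described above.
# Assumptions (not the actual PhysioNet code): the FederatedSite record,
# its field names, the query parameter names, and the external URL pattern.
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
import logging

import requests

logger = logging.getLogger(__name__)


@dataclass
class FederatedSite:
    name: str
    base_url: str          # e.g. "https://example-node.org" (illustrative)
    search_path: str       # configured search endpoint, e.g. "/api/v1/search/"
    api_token: str = ""    # optional authentication token


class PhysioNetAdapter:
    """Queries one federated PhysioNet instance and normalizes its results."""

    def __init__(self, site: FederatedSite):
        self.site = site
        # Full endpoint: site base URL + configured search endpoint.
        self.endpoint = site.base_url.rstrip("/") + site.search_path

    def search(self, term, resource_type=None, timeout=5):
        headers = {}
        if self.site.api_token:
            headers["Authorization"] = f"Token {self.site.api_token}"
        params = {"search_term": term}
        if resource_type is not None:
            params["resource_type"] = resource_type
        response = requests.get(self.endpoint, params=params,
                                headers=headers, timeout=timeout)
        response.raise_for_status()
        return [self._normalize(item) for item in response.json()]

    def _normalize(self, item):
        # Map the standardized PhysioNet schema onto a common internal format
        # and attach the federated-specific metadata.
        return {
            "title": item.get("title"),
            "slug": item.get("slug"),
            "version": item.get("version"),
            "abstract": item.get("abstract"),
            "resource_type": item.get("resource_type"),
            "source_site": self.site.name,
            "external_url": f"{self.site.base_url.rstrip('/')}/content/{item.get('slug')}/",
            "is_federated": True,
        }


class FederatedSearchService:
    """Fans a query out to all enabled federated sites in parallel."""

    MAX_WORKERS = 10

    def __init__(self, sites):
        self.sites = sites  # enabled federated sites loaded from the database

    def search(self, term, resource_type=None):
        results = []
        with ThreadPoolExecutor(max_workers=self.MAX_WORKERS) as executor:
            futures = {
                executor.submit(PhysioNetAdapter(site).search, term, resource_type): site
                for site in self.sites
            }
            # Collect in order of completion; a failing site is logged and skipped.
            for future in as_completed(futures):
                site = futures[future]
                try:
                    results.extend(future.result())
                except Exception as exc:
                    logger.warning("Federated search failed for %s: %s", site.name, exc)
        return results
```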
@tompollard @bemoody We need to further flesh out how the federated search results will be displayed alongside the local search results.
- Currently implemented basic approach: Have a separate section, with 10 results from the local site and 10 from federated sites. Within the federated section, results are appended on a first-received basis as the API calls return.
- Another proposed approach, with a limitation: Update the API implementation to return a score; however, this is only possible for the PhysioNet API, which we control. This will do for the time being, as long as all federated sites run the PhysioNet implementation. It is not the preferred solution, since ideally we would integrate search across several open-source repositories and make PhysioNet/HDN/others the primary data repositories in their respective zones (US/Canada/Other).
- Another proposed approach, with added complexity: A long-term solution that would accommodate all of the features we might want to extend to is a local scoring mechanism. This might make the search implementation a bit heavy, but it would be fully independent and extensible. This is my preferred approach, as it would allow the search functionality in PhysioNet implementations to scale beyond PhysioNet deployments. A simple scoring metric we could start with is semantic matching (see the sketch below).
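A minimal sketch of what such a local semantic scoring pass could look like, assuming sentence-transformers as the embedding backend; the model name and result fields are illustrative and not part of this PR.

```python
# Illustrative local semantic scoring over merged (local + federated) results.
# Assumptions: sentence-transformers as the embedding backend, the model name,
# and the 'title'/'abstract' result fields.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")


def rank_results(query, results):
    """Score every result against the query locally and sort by similarity."""
    texts = [f"{r['title']} {r.get('abstract') or ''}" for r in results]
    query_emb = model.encode(query, convert_to_tensor=True)
    doc_embs = model.encode(texts, convert_to_tensor=True)
    # Cosine similarity between the query and every result text.
    scores = util.cos_sim(query_emb, doc_embs)[0]
    for result, score in zip(results, scores.tolist()):
        result["score"] = score
    return sorted(results, key=lambda r: r["score"], reverse=True)
```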
Thanks Rutvik. How much of an issue is latency (i.e. waiting for external repositories to return their results)?
> Currently implemented basic approach: Have a separate section, with 10 results from the local site and 10 from federated sites. Within the federated section, results are appended on a first-received basis as the API calls return.
I'm not especially keen on this approach, because I think the external results will just get lost below the page.
> Another proposed approach, with a limitation: Update the API implementation to return a score; however, this is only possible for the PhysioNet API, which we control.
I prefer this approach to the one above.
> Another proposed approach, with added complexity: A long-term solution that would accommodate all of the features we might want to extend to is a local scoring mechanism.
In the longer term, I think this is the approach we should use (particularly if integrating with a broad set of repositories, because we will need to be able to harmonize priority scores).
Things to Consider & Challenges for Merged Ranking
Technical Challenges
Result fetching strategy: Need to fetch significantly more results from all sources upfront (e.g., 50-100 per source) to properly rank across pages, increasing latency and API load
Pagination complexity: To show page 3 (results 21-30), must fetch and rank at least 30 results from every source; can't paginate per-source independently
Performance impact: Every search now waits for the slowest federated site instead of showing local results immediately
Caching requirements: Need sophisticated caching (Redis/Memcached) to store ranked result pools across pagination requests; session state management becomes complex
Text search scoring limitations: Can't run complex text relevance algorithms (BM25/TF-IDF) properly without full-text index of federated content; either need to pre-index external content or settle for simple keyword matching
IDF calculation problem: Inverse Document Frequency requires knowing the full corpus; federated results don't contribute to your local IDF statistics, making fair comparison difficult
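To make the IDF point above concrete, one common form of the term is

$$\mathrm{IDF}(t) = \log\frac{N}{1 + \mathrm{df}(t)}$$

where $N$ is the total number of documents in the corpus and $\mathrm{df}(t)$ is the number of documents containing term $t$. Federated results never contribute to $N$ or $\mathrm{df}(t)$ in the local index, and any scores a remote site computes are based on its own corpus statistics, so the resulting weights are not directly comparable across sources.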
Data Quality & Normalization
Inconsistent metadata: Federated sites may have varying completeness (some have DOI, citation counts, download stats; others don't)
Missing features: How do you score quality/authority when federated sites lack fields like citation counts or usage statistics?
Algorithmic & Ranking Challenges
Score normalization: Need to normalize scores across disparate sources so a high-quality local result isn't artificially boosted just because you have more metadata about it (a minimal sketch follows this list)
Source bias decisions: Should local results be preferred (home field advantage)? How much weight should federated site priority/reputation carry?
Algorithm complexity: Simple weighted scoring may not capture relevance well; sophisticated algorithms (BM25, LTR) require significant implementation effort
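As a starting point for the score-normalization and source-bias points above, here is a minimal sketch of per-source min-max normalization with an optional source weight; the function name, input shape, and the weighting knob are illustrative assumptions only.

```python
# Illustrative per-source min-max normalization into [0, 1] before merging,
# so one source's raw score scale cannot dominate the ranking.
# The optional source_weights dict (e.g. boosting local results) is an assumption.
def normalize_and_merge(results_by_source, source_weights=None):
    """results_by_source: {source_name: [{..., 'score': float}, ...]}"""
    source_weights = source_weights or {}
    merged = []
    for source, results in results_by_source.items():
        if not results:
            continue
        scores = [r["score"] for r in results]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
        weight = source_weights.get(source, 1.0)
        for r in results:
            r["normalized_score"] = weight * (r["score"] - lo) / span
            merged.append(r)
    return sorted(merged, key=lambda r: r["normalized_score"], reverse=True)
```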
User Experience Concerns
Increased latency: Must wait for all federated sites to respond before showing any results (even if local results are ready)
Timeout handling: If a federated site times out, do you re-rank without it? Do rankings change between page loads?
To be clear, none of these points is unsolvable; however, they are outside the scope of this issue/PR, which is only to implement the federated site search capability. The points above need to be discussed thoroughly, and once implemented they will improve the search functionality many times over, but it will take significant time before the specifications can be finalized and implemented.
Hence my final suggestion remains that we go with the simplest option first (baby steps: small PRs with focused implementations). We display the federated site results if a federated site is available and skip this feature entirely if none exists.
Summarizing the discussion with @tompollard & @bemoody
- Federated site search and a search aggregator are two different things.
- Federated site search aims to have a single search index with proper scoring. It only scans PhysioNet instances, which allows us to make several assumptions: ranking across instances is consistent, the scores are reliable for sorting, and the local schemas can (mostly) be re-used.
We are looking to implement a federated site search, not a generic search aggregator. The current implementation, in trying to stay general and extensible, leans towards a search aggregator. It will be re-designed and re-implemented, and hence this PR is being closed.