web-search-engine
API - a simple web search engine. The goal is to index an arbitrarily large list of URLs (web pages), and then quickly search for the URLs relevant to a query. This engine uses ElasticSearch as its database.
Indexing
Indexing a new URL first crawls the URL, then extracts the title and main text content from the page. A new document representing the URL's data is then saved in ElasticSearch, which indexes it.
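The extraction step can be sketched with the Python standard library alone. This is a minimal illustration and not the engine's actual code: the `PageExtractor` class, the `build_document` helper and the `url`/`title`/`body` field names are assumptions made for the example.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect the <title> text and the visible text of an HTML page."""

    # Tags whose content is never shown to the reader.
    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title_parts.append(text)
        elif self._skip_depth == 0:
            self.text_parts.append(text)


def build_document(url, html):
    """Return a dict for one page (hypothetical schema, not the engine's real mapping)."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": " ".join(parser.title_parts),
        "body": " ".join(parser.text_parts),
    }
```

A dict of this shape is the kind of document that would then be sent to ElasticSearch for indexing.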
Searching
When searching for relevant URLs, the engine will compare the query with the data of each document (web page), and retrieve a list of URLs matching the query, sorted by relevance.
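As an illustration of how such a relevance search can be expressed, here is a sketch of an ElasticSearch request body using the `multi_match` query. The `title` and `body` field names, the `^2` boost on the title and the `build_search_body` helper are assumptions for the example; the engine's real query and scoring may differ.

```python
def build_search_body(query, start=0, hits=10, highlight=False):
    """Build an ElasticSearch search body (query DSL) as a plain dict."""
    body = {
        "from": start,  # offset of the first hit to return
        "size": hits,   # number of hits to return
        "query": {
            "multi_match": {
                "query": query,
                # Assumed fields; "^2" weights title matches twice as much.
                "fields": ["title^2", "body"],
            }
        },
    }
    if highlight:
        # Ask ElasticSearch to return the matching fragments of the body field.
        body["highlight"] = {"fields": {"body": {}}}
    return body
```

ElasticSearch sorts hits by relevance score by default, so no explicit sort clause is needed in this sketch.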
UI
This search engine can be used with a UI: https://github.com/AnthonySigogne/web-search-engine-ui
Note
This API works for a finite list of languages; see the complete list here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html
DEMO
A demo can be found here: http://searchengine.byprog.com/
About 500 French URLs and 500 English URLs from the news network http://www.france24.com/ have been indexed.
INSTALL AND RUN
REQUIREMENTS
This tool requires Python 3+ and ElasticSearch 5+.
WITH PIP
git clone https://github.com/AnthonySigogne/web-search-engine.git
cd web-search-engine
pip install -r requirements.txt
Then, run the tool :
FLASK_APP=index.py HOST=<ip> PORT=<port> USERNAME=<username> PASSWORD=<password> flask run
Where:
- ip + port: route to ElasticSearch
- username + password: credentials to access ElasticSearch
To run in debug mode, prepend FLASK_DEBUG=1 to the command:
FLASK_DEBUG=1 ... flask run
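For illustration, the HOST, PORT, USERNAME and PASSWORD variables could be combined into an ElasticSearch address as follows. This is a sketch under assumptions, not the tool's actual code: the `elasticsearch_url` helper, the `http` scheme and the fallback values are all hypothetical (9200 is ElasticSearch's default HTTP port).

```python
import os
from urllib.parse import quote


def elasticsearch_url(env=None):
    """Build an ElasticSearch base URL from HOST, PORT, USERNAME and PASSWORD."""
    if env is None:
        env = os.environ
    host = env.get("HOST", "localhost")  # assumed fallback
    port = env.get("PORT", "9200")       # ElasticSearch default HTTP port
    user = env.get("USERNAME")
    password = env.get("PASSWORD")
    if user and password:
        # Percent-encode the credentials so characters like "@" stay unambiguous.
        credentials = f"{quote(user, safe='')}:{quote(password, safe='')}@"
    else:
        credentials = ""
    return f"http://{credentials}{host}:{port}"
```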
WITH DOCKER
To run the tool with Docker, you can use my Docker Hub image: https://hub.docker.com/r/anthonysigogne/web-search-engine/
docker run -p 5000:5000 \
-e "HOST=<ip>" \
-e "PORT=<port>" \
-e "USERNAME=<username>" \
-e "PASSWORD=<password>" \
anthonysigogne/web-search-engine
Where:
- ip + port: route to ElasticSearch
- username + password: credentials to access ElasticSearch
Or build the Docker image yourself:
git clone https://github.com/AnthonySigogne/web-search-engine.git
cd web-search-engine
docker build -t web-search-engine .
USAGE AND EXAMPLES
To list all the services of the API, open this endpoint in your web browser: http://localhost:5000/
INDEXING
Index a web page through its URL.
- URL: /index
- Method: POST
- Form Data Params
  Required: url=[string], the URL to index
- Success Response
  Code: 200
  Content: Success
- Error Response
  Code: 400 INVALID USAGE
- Sample Call (with cURL):
curl http://localhost:5000/index --data "language=en&url=https://www.byprog.com/en/"
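The same call can be made from Python with the standard library. This is a minimal sketch assuming the API runs at http://localhost:5000 as in the cURL call; `build_index_request` and `index_page` are hypothetical helper names. The `language` field is passed only because it appears in the sample call above.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_ROOT = "http://localhost:5000"  # assumed local deployment


def build_index_request(page_url, language=None, api_root=API_ROOT):
    """Build (without sending) the form-encoded POST request for /index."""
    params = {"url": page_url}
    if language:
        # Taken from the sample cURL call; not listed among the documented params.
        params["language"] = language
    data = urlencode(params).encode("utf-8")
    return Request(f"{api_root}/index", data=data, method="POST")


def index_page(page_url, language=None):
    """Send the request and return (status code, response text)."""
    with urlopen(build_index_request(page_url, language)) as response:
        return response.status, response.read().decode("utf-8")
```

Calling `index_page("https://www.byprog.com/en/", language="en")` sends the same form data as the cURL command. Note that `urlopen` raises `urllib.error.HTTPError` on a 400 response.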
SEARCHING
Query the engine to find a list of relevant URLs. Returns, in JSON, the sublist of matching URLs sorted by relevance, and the total number of matching URLs.
- URL: /search
- Method: POST
- Form Data Params
  Required:
    query=[string], the search query
  Not required:
    start=[integer], the start of hits (0 by default)
    hits=[integer], the number of hits returned by query (10 by default)
    highlight=[integer], return highlight parts for each URL (0 or 1, 0 by default)
- Success Response
  Code: 200
  Content:
    {
      "total": 1,
      "results": [
        {
          "title": "Anthony Sigogne / Freelance / Full-Stack Developer",
          "description": "Full-Stack Developer specialized in new technologies and innovative IT solutions.",
          "url": "https://www.byprog.com/en/"
        }
      ]
    }
- Error Response
  Code: 400 INVALID USAGE
- Sample Call (with cURL):
curl http://localhost:5000/search --data "query=freelance fullstack"
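A Python equivalent of this call, including decoding the JSON response, might look like the sketch below. It assumes the API runs at http://localhost:5000 and that the response has the `total` and `results` keys shown in the success response; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_search_request(query, start=0, hits=10, highlight=0,
                         api_root="http://localhost:5000"):
    """Build (without sending) the form-encoded POST request for /search."""
    params = {"query": query, "start": start, "hits": hits, "highlight": highlight}
    data = urlencode(params).encode("utf-8")
    return Request(f"{api_root}/search", data=data, method="POST")


def parse_search_response(payload):
    """Return (total, [(title, url), ...]) from the JSON text of a response."""
    decoded = json.loads(payload)
    results = [(hit["title"], hit["url"]) for hit in decoded["results"]]
    return decoded["total"], results


def search(query, **options):
    """Send the query and return the parsed result."""
    with urlopen(build_search_request(query, **options)) as response:
        return parse_search_response(response.read().decode("utf-8"))
```

If `start` is an offset into the hit list, as its description suggests, a second page of ten results would be requested with `search("freelance fullstack", start=10, hits=10)`.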
FUTURE FEATURES
- index more page features (keywords, etc.)
- better scoring function
- filter bad results
- create a Docker Compose file
- translate the tool into several languages
- connect to pixel tool
- better description of results
- redis to index a single url
LICENCE
MIT