Search Engine

Detailed Description:

First, we used Python to crawl 37,497 valid URLs under the ics.uci.edu domain and stored the crawled pages in a zip file.
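
The description does not say how a URL was judged valid, so the following is only a sketch of one plausible check: keep http(s) links whose host is ics.uci.edu or one of its subdomains, and drop the fragment so the same page is not stored twice. The function names `is_in_domain` and `normalize` are hypothetical and are not taken from the project.

```python
from urllib.parse import urlparse, urldefrag


def is_in_domain(url, domain="ics.uci.edu"):
    """Return True if url is http(s) and its host is `domain` or a subdomain of it."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    # Require an exact match or a dot boundary, so "evilics.uci.edu.com" is rejected.
    return host == domain or host.endswith("." + domain)


def normalize(url):
    """Strip the #fragment, which never changes the page that is fetched."""
    return urldefrag(url).url
```

A crawler would typically call `normalize` on every extracted link and skip any link for which `is_in_domain` returns False.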

Next, we parsed the crawlable HTML files and extracted their tokens, removed stop words, lemmatized the remaining tokens, and built an inverted index from them. The index is saved in a JSON file so that searches can be answered efficiently.
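
The indexing step can be sketched roughly as follows. This is a minimal illustration, not the project's code: the stop-word list is a tiny placeholder, the `lemmatize` argument defaults to the identity function (a real lemmatizer from an NLP library would be passed in), and the postings store only raw term counts per document.

```python
import json
import re
from collections import defaultdict

# Tiny illustrative stop-word list (assumption; the real list is not given).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, lemmatize=lambda token: token):
    """Map each term to {doc_id: term count} over all documents."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            if token in STOP_WORDS:
                continue
            term = lemmatize(token)
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


# Hypothetical two-page corpus, already reduced to plain text.
docs = {
    "d1": "The school of information",
    "d2": "Information and computer science",
}
index = build_index(docs)
serialized = json.dumps(index, sort_keys=True)  # what would be written to the JSON file
```

Storing the index keyed by term means a query only touches the postings of its own terms, rather than every page.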

Finally, we created an HTML template that prompts the user for a query. The program looks up the query terms in the index, scores each matching page by the TF-IDF of those terms and by the importance of the HTML tags in which they appear, and returns a ranked list of relevant pages.
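
One way such a ranking could work is sketched below. The exact formula and tag weights used by the project are not stated, so everything here is an assumption: a log-scaled TF-IDF score is multiplied by a hypothetical per-tag boost (`TAG_WEIGHTS`), and the postings layout (`{"tf": ..., "tag": ...}`) is invented for the example.

```python
import math

# Hypothetical boosts: a term in a title counts more than one in body text.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "body": 1.0}


def rank(query_terms, index, num_docs):
    """Return [(doc_id, score)] sorted by descending score.

    index maps term -> {doc_id: {"tf": int, "tag": str}}.
    """
    scores = {}
    for term in query_terms:
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in any indexed page
        idf = math.log10(num_docs / len(postings))
        for doc_id, info in postings.items():
            tf_weight = 1 + math.log10(info["tf"])
            boost = TAG_WEIGHTS.get(info["tag"], 1.0)
            scores[doc_id] = scores.get(doc_id, 0.0) + tf_weight * idf * boost
    # Sort by score, breaking ties by doc_id so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical index over a 4-page corpus.
example_index = {
    "informatics": {
        "d1": {"tf": 1, "tag": "title"},
        "d2": {"tf": 1, "tag": "body"},
    }
}
ranked = rank(["informatics"], example_index, num_docs=4)
```

In this example both pages contain the term once, so the page where it appears in the title ranks first purely because of the tag boost.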

Intended Audience:

The intended audience is UCI students searching for information on the ics.uci.edu website.

Team Info:

We currently have a three-person team and expect our project to last two years, with a budget of around $20,000.
