What is a web crawler and how does it discover new pages on the internet?

Search Engines Interview Questions

  1. [JUNIOR] What is the difference between crawling, indexing, and ranking in a search engine?
  2. [JUNIOR] What is an inverted index and why is it the core data structure of search engines?
  3. [SENIOR] How would you design a web-scale search engine that handles hundreds of billions of pages?
  4. [JUNIOR] What are the main components of a search engine architecture?
  5. [JUNIOR] What is a web crawler and how does it discover new pages on the internet?
  6. [JUNIOR] What is TF-IDF and how does it measure the relevance of a document to a query?
  7. [JUNIOR] What is PageRank and how does it determine the importance of a web page?
  8. [MID] How does BM25 improve upon TF-IDF for relevance scoring, and what role do its parameters k1 and b play?
  9. [MID] How does an inverted index enable fast full-text search across billions of documents?
  10. [SENIOR] How does a multi-stage ranking pipeline balance retrieval latency and relevance quality?
  11. [SENIOR] How would you shard a search index across thousands of machines and handle query fan-out?
  12. [SENIOR] How would you design a distributed web crawler that scales to billions of pages while respecting politeness constraints?
  13. [JUNIOR] What is the role of the robots.txt file in web crawling?
  14. [JUNIOR] What is tokenization and why is it a critical step in search engine indexing?
  15. [JUNIOR] What is the difference between keyword-based search and semantic search?
  16. [MID] What are the trade-offs between batch indexing and incremental real-time indexing?
  17. [MID] How does a URL frontier manage crawl prioritization and scheduling in a web crawler?
  18. [MID] How do search engines detect and handle duplicate content at both the URL and page level?
  19. [MID] How do sharding and replication work in a distributed search system like Elasticsearch?
  20. [MID] What role does caching play in achieving low-latency search responses?
  21. [SENIOR] How does learning-to-rank (LTR) use machine learning to improve search relevance beyond traditional scoring?
  22. [SENIOR] How do you handle index updates and schema changes without rebuilding the entire index?
  23. [SENIOR] What strategies ensure search system availability and fault tolerance when individual components fail?
  24. [SENIOR] How would you design an autocomplete or typeahead suggestion system for a search engine?
  25. [SENIOR] How would you optimize a search system to serve results with sub-second latency at global scale?
  26. [EXPERT] How do vector embeddings and approximate nearest neighbor (ANN) algorithms enable semantic search at scale?
  27. [JUNIOR] What is a posting list in the context of an inverted index?
  28. [JUNIOR] What are stop words and why might a search engine remove them during indexing?
  29. [JUNIOR] What is a forward index and how does it differ from an inverted index?
  30. [MID] How do stemming and lemmatization improve search recall?
  31. [MID] How does document length normalization affect relevance scoring?
  32. [MID] How does politeness work in web crawling and what mechanisms enforce it?
  33. [MID] What is a search analyzer and what roles do tokenizers, token filters, and character filters play?
  34. [MID] How does query expansion improve search results?
  35. [MID] How do search engines handle spelling correction and query auto-suggestions?
  36. [SENIOR] What compression techniques are used to reduce the memory footprint of search indexes?
  37. [SENIOR] How does deep pagination become problematic in distributed search systems, and what alternatives exist?
  38. [EXPERT] How would you design a hybrid retrieval system that combines lexical matching with vector-based semantic search?
  39. [EXPERT] What are the trade-offs between strong consistency and eventual consistency for distributed search indexes?
  40. [JUNIOR] What is a sitemap and how does it assist search engine crawlers?
  41. [JUNIOR] What is near real-time (NRT) search and why is it important?
  42. [MID] How do search engines handle multi-language queries and multilingual content?
  43. [MID] What are the key differences between Elasticsearch and Apache Solr?
  44. [MID] What is the difference between a term-level query and a full-text query in a search engine?
  45. [SENIOR] How would you design a search system that handles multiple content types such as web pages, images, and videos?
  46. [SENIOR] How would you implement efficient geospatial search indexing and querying?
  47. [EXPERT] How do search engines use click-through data and user engagement signals to refine ranking models without introducing bias or feedback loops?
  48. [EXPERT] What is the relationship between the CAP theorem and search engine design decisions?
  49. [EXPERT] What are the challenges of implementing real-time personalization in search results at scale?
  50. [EXPERT] How would you handle spider traps and adversarial content in a large-scale web crawler?
  51. [EXPERT] How would you implement exactly-once processing semantics in a distributed crawling and indexing pipeline?
  52. [EXPERT] How does zone scoring work and how can field-weighted relevance be used to boost specific document fields?
  53. [EXPERT] What are BKD trees and how are they used for efficient geospatial and numeric range queries in search engines?