Here are some details about how the data is stored in the backend, and how it is processed.
The goal of this model is to keep things simple, avoid unnecessary processing, and make the data easy to manipulate. As I have already said many times, content retrieved from the web cannot be taken for granted: it can be corrupted, duplicated, malformed, etc. So before the results sent back by the crawlers are integrated into the main database, some processing is required to check them.
Basically, there are three main tables in the system. One table, called work_proposed, simply contains all the links retrieved and approved, which are proposed to the client crawlers. Another one, called page, contains all the pages already parsed, received and approved, including their links, keywords, etc. And finally, page_proposed stores all the pages loaded by the crawlers and sent back to the server.
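To make the roles of the three tables concrete, here is a minimal sketch of what their schemas could look like. The column names and types are assumptions for illustration; the real backend may store the parsed data differently.

```python
import sqlite3

# Hypothetical minimal schema for the three tables described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE work_proposed (
    url TEXT PRIMARY KEY          -- approved links waiting for a crawler
);
CREATE TABLE page (
    url TEXT PRIMARY KEY,         -- pages parsed, received and approved
    content_hash TEXT,            -- used to detect content changes
    links TEXT,                   -- extracted links (serialized)
    keywords TEXT                 -- extracted keywords (serialized)
);
CREATE TABLE page_proposed (
    url TEXT PRIMARY KEY,         -- pages sent back, pending validation
    content_hash TEXT,
    links TEXT,
    keywords TEXT
);
""")
conn.commit()
```

Note that work_proposed really is just a list of URLs, while the two page tables carry the parsed payload.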
Now, let’s see how the data flows between all those tables. First, the crawler needs some workload, which it retrieves from the work_proposed table. This table only contains URLs that need to be downloaded from the Internet, parsed and sent back to the server; nothing more than the URL is stored there, no extra fields are needed. When a page is sent back to the server, it is stored in the page_proposed table.

A separate process then scans the page_proposed table to validate the entries. Each entry is first removed from the table, then checked for duplication against the page table. If it is a new page never retrieved in the past, or an existing page whose content has changed, the page is added to the page table, and all the links it contains are added to the work_proposed table, unless they are already there or already exist in the page table.

The diagram below shows how the links and content flow through the different tables and processes. It also shows the table sizes as of today. Since there is an average of 140 links per page, if all proposed pages are correct, more than 600 000 000 links might be added to the work_proposed table. This will be run soon and the results are going to be posted here.

So to estimate the number of pages already processed, we add the number of entries in the page table to the number of entries in the page_proposed table. And to estimate the number of pages still to be retrieved, we simply count the entries in the work_proposed table.
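The validation pass described above can be sketched as follows. This is a simplified illustration, not the actual server code: the schema, the content-hash comparison, and the whitespace-separated links column are all assumptions carried over from the hypothetical schema.

```python
import sqlite3

# Hypothetical minimal schema, as assumed earlier.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE work_proposed (url TEXT PRIMARY KEY);
CREATE TABLE page (url TEXT PRIMARY KEY, content_hash TEXT, links TEXT);
CREATE TABLE page_proposed (url TEXT PRIMARY KEY, content_hash TEXT, links TEXT);
""")

def validate_proposed(conn):
    """Move valid entries from page_proposed into page and work_proposed."""
    rows = conn.execute(
        "SELECT url, content_hash, links FROM page_proposed").fetchall()
    for url, content_hash, links in rows:
        # Each entry is first removed from page_proposed.
        conn.execute("DELETE FROM page_proposed WHERE url = ?", (url,))
        # Check for duplication against the page table.
        row = conn.execute(
            "SELECT content_hash FROM page WHERE url = ?", (url,)).fetchone()
        if row is not None and row[0] == content_hash:
            continue  # already known and unchanged: drop it
        # New page, or existing page whose content changed: keep it.
        conn.execute("INSERT OR REPLACE INTO page VALUES (?, ?, ?)",
                     (url, content_hash, links))
        # Propose every extracted link, unless it is already queued
        # or already present in the page table.
        for link in links.split():
            known = conn.execute(
                "SELECT 1 FROM page WHERE url = ?", (link,)).fetchone()
            if known is None:
                conn.execute(
                    "INSERT OR IGNORE INTO work_proposed VALUES (?)", (link,))
    conn.commit()

# Example: one proposed page containing two links.
conn.execute("INSERT INTO page_proposed VALUES (?, ?, ?)",
             ("http://example.com/a", "hash1",
              "http://example.com/b http://example.com/c"))
validate_proposed(conn)
```

After this run, page holds one entry and work_proposed holds the two extracted links, while page_proposed is empty again, matching the flow above.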