It is:

  • easy to set up
  • autonomous (requires little or no configuration)
  • not always accurate in what it collects
  • ideal for quickly building up a large volume of sources

The initialisation crawl

Some collection errors may occur, such as an incorrect publication date. For example, if an article contains several dates, the robot will not necessarily pick the correct one. In addition, not all pages on a site are dated; the robot dates such pages as “today”.

This can cause the “Initialisation Crawl” phenomenon.

Right after you create this kind of robot, many articles will be created and dated the same day, so they will all appear in one block in the results of your themes. This phenomenon is temporary: new pages published in the following days will be dated with a maximum lag of six hours from the original publication time.
It therefore affects only the site’s archives.
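
To make the effect concrete, here is a minimal sketch of the dating behaviour described above. It is an illustration only: the function assign_date, the page paths, and the "take the first date found" rule are assumptions, not the product's actual logic.

```python
from collections import Counter
from datetime import date

def assign_date(dates_found, crawl_day):
    """Pick a date for a page (hypothetical rule, for illustration)."""
    if dates_found:
        # Assumption: the first date found wins, which may be the wrong one.
        return dates_found[0]
    # Undated pages are dated "today", i.e. the day of the crawl.
    return crawl_day

crawl_day = date(2024, 1, 15)  # made-up day of the first crawl
archive = {
    "/about": [],
    "/archive/old-post": [],
    "/news/launch": [date(2023, 11, 2)],
    "/team": [],
}
dated = {page: assign_date(found, crawl_day) for page, found in archive.items()}

# Count how many items land on each day.
per_day = Counter(dated.values())
print(per_day[crawl_day])
```

In this made-up archive, every undated page ends up on the crawl day, which is the single block that shows up in your results.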

How the Crawler works

The Website Crawler starts at the URL you provided. It then moves from link to link and from page to page, clicking every button it encounters.

This robot will try to create one item per page encountered.
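
The traversal described above can be sketched as a breadth-first walk over links. This is not the product's implementation: the function crawl, the in-memory site map, and the example.com URLs are all hypothetical, and a real crawler would fetch pages over the network and also handle buttons.

```python
from collections import deque

def crawl(start_url, links_by_page):
    """Visit every page reachable from start_url, once each."""
    seen = {start_url}
    queue = deque([start_url])
    items = []
    while queue:
        url = queue.popleft()
        items.append(url)  # one item per page encountered
        for link in links_by_page.get(url, []):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return items

# Hypothetical site: each page maps to the links found on it.
site = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/"],
    "https://example.com/b": [],
}
items = crawl("https://example.com/", site)
print(items)
```

The set of already-seen URLs is what keeps the walk from looping forever when pages link back to each other.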

Create a website crawler

  1. Source > Website Crawler > Paste the URL of a page of the site you are interested in
  2. Fill out the form (only the name is required)
  3. Click “create”

This robot requires little or no configuration.

However, you can filter what it collects using the “must include” and “add block” fields.

Paste parts of a URL into these fields to force the crawler to collect only the pages whose URL contains that part (Must Include) or, conversely, to ignore those pages.

Example: https://www.lafermedigitale.fr

On this site, all articles have “/news/” in their URL.

So you can add /news/ to the MUST INCLUDE field, forcing the robot to collect only pages whose URL contains /news/.

Don’t forget to paste this part with its slashes, “/news/” (see example above).
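
As a rough illustration of why the slashes matter, the sketch below assumes the filters are plain substring tests on the URL. That matching rule, the function should_collect, and the page paths are assumptions made for the example; the actual matching may differ.

```python
def should_collect(url, must_include=(), block=()):
    """Return True if the URL passes both filters (assumed substring matching)."""
    # A blocked fragment always excludes the page.
    if any(fragment in url for fragment in block):
        return False
    # If "must include" is set, at least one fragment has to appear in the URL.
    if must_include and not any(fragment in url for fragment in must_include):
        return False
    return True

# Hypothetical pages on the example site.
urls = [
    "https://www.lafermedigitale.fr/news/some-article",
    "https://www.lafermedigitale.fr/newsletter",
    "https://www.lafermedigitale.fr/contact",
]

with_slashes = [u for u in urls if should_collect(u, must_include=["/news/"])]
without_trailing_slash = [u for u in urls if should_collect(u, must_include=["/news"])]
print(with_slashes)
print(without_trailing_slash)
```

Under this assumption, “/news/” keeps only the article page, whereas “/news” without the trailing slash would also let the “/newsletter” page through.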

Revision: 3
