Not all sources can be inserted via the “point & click” method described in the previous tutorial.
This can happen because:
- the page cannot be browsed in our module
- clicking on an article element is not possible
- the robot does not understand which specific box you are trying to insert
However, it is often possible to write these selectors manually.
To do this, you need to know some HTML basics.
HTML is the markup language used to build web pages.
How do developers design a web page?
They create boxes, embed them inside each other and then inject content inside (URL links, images, text, etc.).
The goal is to find which box contains the content you want for a selector and to point the Cikisi robot to it.
These boxes come in certain types, called “HTML tags”. You will use a dozen at most, and usually the same ones for the same selectors.
Here are the most common:
- h1, h2, h3, h4, h5 and h6: h stands for “heading”. These are boxes for titles. “h1” is the highest-level (usually the biggest) title and “h6” the lowest.
- p: for “paragraph”, contains text, like a description
- img: a box to put images (95% of “image” selectors)
- a: the box provided for URL links. 100% of “link” selectors.
- div: a catch-all box. When a developer wants to store content somewhere and no dedicated type of box exists for it, the content goes in a “div”.
- ul: an unordered (bulleted) list of several items (the ingredients of a recipe, for example)
- li: inside a “ul”, these are the individual items of the list
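To make the idea of nested boxes concrete, here is a minimal Python sketch that reads a small page and prints each tag indented under its parent. The markup in `SAMPLE_HTML` is invented for illustration and is not taken from a real site; the sketch only uses Python's standard `html.parser` module and says nothing about how the Cikisi robot itself reads a page.

```python
from html.parser import HTMLParser

# Hypothetical article markup, invented for this illustration.
SAMPLE_HTML = """
<div class="wrapper">
  <h2><a href="/news/1">First headline</a></h2>
  <img src="/img/1.jpg" alt="">
  <p>Short description of the article.</p>
  <ul>
    <li>First point</li>
    <li>Second point</li>
  </ul>
</div>
"""

# Tags that never have a closing tag, so they cannot contain other boxes.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class BoxOutline(HTMLParser):
    """Collects one indented line per opening tag, showing how boxes nest."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1  # the next tags are inside this box

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1  # we have left the box


outline = BoxOutline()
outline.feed(SAMPLE_HTML)
print("\n".join(outline.lines))
```

The printed outline shows “div” at the top, with “h2”, “img”, “p” and “ul” one level inside it, “a” inside “h2”, and the two “li” inside “ul”. This is the same parent and child structure you see in the browser's inspect panel.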
How do I find the type of box that interests me?
- Open the original page that interests you in your browser.
- Right-click the element of your choice (a title, for example)
- Click “Inspect” at the bottom of the menu; an HTML panel opens
- The “boxes” are written in pink in the panel that opens on the right
In the example above, you see in pink (indicated by arrows) and from top to bottom the wrapper, then h2 for the title and “a” for the link URL.
All you have to do is enter the type of box in the scraping bot using the right selector.
Specify a selector
Sometimes indicating the class and the HTML tag is not enough: several boxes can match, and not all of them are articles. This would then cause an error during the collection process.
There are several techniques for specifying a selector:
- Indicate the parent with “>”
Example: div > ul > li
This tells the robot that you want a “li” box that sits directly inside a “ul”, which itself sits directly inside a “div”.
- Indicate the child number with :nth-child(X)
Example: li:nth-child(3) indicates that you are looking for a “li” box that is the third child of its parent, which distinguishes it from its siblings. In the inspect panel, siblings appear at the same indentation level and are counted from top to bottom, the first one sitting just below the parent.
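The two techniques above can be tried out with a short Python sketch. It is a simplified matcher written for this tutorial, not the Cikisi robot's implementation: it understands only tag names, “>” and “:nth-child(N)”, and it assumes every tag in the page is properly closed. The markup and the helper names (`TreeBuilder`, `select`) are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical markup, invented for this illustration.
SAMPLE_HTML = """
<div>
  <ul>
    <li>Flour</li>
    <li>Eggs</li>
    <li>Milk</li>
  </ul>
  <p>Mix everything.</p>
</div>
"""


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a minimal tree of boxes (assumes every tag is closed)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def walk(node):
    """Yields every box below `node`, from top to bottom."""
    for child in node.children:
        yield child
        yield from walk(child)


def matches_step(node, step):
    """Checks one step such as 'li' or 'li:nth-child(3)'."""
    tag, _, pseudo = step.partition(":nth-child(")
    if node.tag != tag:
        return False
    if pseudo:
        # Position among the siblings; counting starts at 1.
        position = node.parent.children.index(node) + 1
        return position == int(pseudo.rstrip(")"))
    return True


def select(root, selector):
    """Returns the boxes matching a chain like 'div > ul > li'."""
    steps = [s.strip() for s in selector.split(">")]
    found = []
    for node in walk(root):
        current, ok = node, True
        # Compare the box, then its parent, then its grandparent...
        for step in reversed(steps):
            if current is None or not matches_step(current, step):
                ok = False
                break
            current = current.parent
        if ok:
            found.append(node)
    return found


builder = TreeBuilder()
builder.feed(SAMPLE_HTML)
all_items = [n.text for n in select(builder.root, "div > ul > li")]
third_item = [n.text for n in select(builder.root, "div > ul > li:nth-child(3)")]
print(all_items)   # ['Flour', 'Eggs', 'Milk']
print(third_item)  # ['Milk']
```

Note that “div > li” matches nothing in this sample, because the “li” boxes sit directly inside the “ul”, not directly inside the “div”.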