Web Scraping

(@google-rayazsiddiqi)
Posts: 95
Estimable Member
Topic starter
 

@ssadvisor any feedback please?

 
Posted : 04/10/2024 5:57 pm
Hasan Aboul Hasan
(@admin)
Posts: 1252
Member Admin
 

@google-rayazsiddiqi, we are just getting all h2 headings with the class "listing-company", then looping over them and printing each one.

h2 is a level 2 heading in HTML, and classes are usually used to identify and select specific tags and objects in a web page.

Does this make sense?
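
To make the selection step concrete, here is a minimal sketch that runs against a small inline HTML string instead of the live site. The markup and job titles are made up for illustration; only the tag name (h2) and the class name (listing-company) come from the thread.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two headings carry the class we want, one does not.
html = """
<h1>Jobs</h1>
<h2 class="listing-company"><a href="/jobs/1/">Python Developer</a></h2>
<h2 class="other">Not a job</h2>
<h2 class="listing-company"><a href="/jobs/2/">Data Engineer</a></h2>
"""

soup = BeautifulSoup(html, "html.parser")

# Only <h2> tags whose class attribute includes "listing-company" are returned.
headings = soup.find_all("h2", class_="listing-company")

# Loop over the matches and collect the visible text of each one.
titles = [heading.get_text(strip=True) for heading in headings]
for title in titles:
    print(title)
```

The class name is only a label chosen by whoever built the page, so it does not have to describe what the tag contains.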

 
Posted : 04/10/2024 6:11 pm
SSAdvisor
(@ssadvisor)
Posts: 1139
Noble Member
 

@google-rayazsiddiqi I plan to create an example for you. I need time to do that.

Regards,
Earnie Boyd, CEO
Seasoned Solutions Advisor LLC
Schedule 1-on-1 help
Join me on Slack

 
Posted : 04/10/2024 6:43 pm
(@google-rayazsiddiqi)
Posts: 95
Estimable Member
Topic starter
 

@ssadvisor thank you very much, sorry for bugging you!!

 
Posted : 04/10/2024 11:05 pm
(@google-rayazsiddiqi)
Posts: 95
Estimable Member
Topic starter
 

@admin not sure

 
Posted : 04/10/2024 11:05 pm
(@google-rayazsiddiqi)
Posts: 95
Estimable Member
Topic starter
 

I am struggling to understand how the job title comes under listing company!!

 
Posted : 04/11/2024 8:03 am
SSAdvisor
(@ssadvisor)
Posts: 1139
Noble Member
 

@google-rayazsiddiqi I've modified the code in the Solution of the training page to help explain what is happening. I hope this clears up your confusion.

import requests
from bs4 import BeautifulSoup

# Send a GET request to the Python job board
response = requests.get('https://www.python.org/jobs/')

# Parse the content of the response with BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Find all the job posts
job_posts = soup.find_all('h2', class_='listing-company')

# Print the title of each job post
# Loop through all of the job_posts
for job_post in job_posts:
    # job_post contains all the elements of the <h2> tag
    print(f"JOB_POST: {job_post}\n")
    # job_post.a is the <a href="..."> tag
    print(f"JOB_POST.A: {job_post.a}\n")
    # job_post.a.text is the text portion of the link.
    title = job_post.a.text
    print(f"JOB_POST.A.TEXT: {title}\n")
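
As to why the job title "comes under" listing-company: the title is simply the text of a link nested inside the h2 that happens to carry that class. The sketch below shows this on made-up markup (the nesting and the company name are assumptions, not a copy of the real page), and also skips any heading that has no link, since job_post.a is None in that case.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the title is the link text nested inside the <h2>.
html = """
<h2 class="listing-company">
  <span class="listing-company-name">
    <a href="/jobs/1/">Senior Python Developer</a><br/>
    Example Corp
  </span>
</h2>
<h2 class="listing-company">No link here</h2>
"""

soup = BeautifulSoup(html, "html.parser")

titles = []
for job_post in soup.find_all("h2", class_="listing-company"):
    # job_post.a is the first <a> tag anywhere inside the <h2>, or None.
    link = job_post.a
    if link is None:
        continue  # avoid an AttributeError on headings without a link
    titles.append(link.get_text(strip=True))

print(titles)
```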

Regards,
Earnie Boyd, CEO
Seasoned Solutions Advisor LLC

 
Posted : 04/11/2024 2:05 pm
(@google-rayazsiddiqi)
Posts: 95
Estimable Member
Topic starter
 

@ssadvisor I think I need 1-to-1 help on this, I'll schedule something in your diary!!

 
Posted : 04/11/2024 2:54 pm
(@google-rayazsiddiqi)
Posts: 95
Estimable Member
Topic starter
 

@ssadvisor I think this is a lot clearer, couple of things:

What do the f" and \n" in this code do: print(f"JOB_POST: {job_post}\n")

Also, where do you define the variable job_post?

 

Thanks

 
Posted : 04/11/2024 3:08 pm
(@husein)
Posts: 531
Member Moderator
 

@google-rayazsiddiqi the f in front of the quotes makes it an f-string (formatted string): Python replaces {job_post} with the current value of the job_post variable.

The \n is to go to a new line.
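
A small sketch of both points, using a made-up list of titles in place of the scraped results. It also covers the other question: job_post is not defined on a separate line, the for statement creates it and assigns each item of job_posts to it in turn.

```python
# Stand-in data; in the real script this list comes from soup.find_all(...).
job_posts = ["Python Developer", "Data Engineer"]

for job_post in job_posts:
    # The for statement assigns the next item to job_post on every pass.
    # The f prefix makes Python substitute {job_post} with that value,
    # and \n adds a line break at the end of the string.
    line = f"JOB_POST: {job_post}\n"
    print(line)

# Without the f prefix the braces are kept literally, nothing is substituted.
plain = "JOB_POST: {job_post}\n"
print(plain)
```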

 
Posted : 04/11/2024 3:12 pm
(@google-rayazsiddiqi)
Posts: 95
Estimable Member
Topic starter
 

@husein Thanks, I think this makes a lot more sense now. I'll have a play around with another website to see if I can do it myself!

 
Posted : 04/11/2024 3:21 pm
(@google-rayazsiddiqi)
Posts: 95
Estimable Member
Topic starter
 

I tried this on another website, see HTML below:

image

 

This is my code:

import requests
from bs4 import BeautifulSoup

# Send a GET request to the Clinks vacancies page
response = requests.get('https://www.clinks.org/vacancies')

# Parse the content of the response with BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Find all the job posts
job_posts = soup.find_all('div', class_='field-item')

# Print the title of each job post
# Loop through all of the job_posts
for job_post in job_posts:
    # job_post is one <div class="field-item"> tag
    title = job_post.text
    print(f"JOB_POST.TEXT: {title}\n")
 
Posted : 04/12/2024 4:00 pm
(@google-rayazsiddiqi)
Posts: 95
Estimable Member
Topic starter
 

And this is the output. It printed all the text; how do I modify the code to print just the job title?

JOB_POST.TEXT: The Thames Valley

JOB_POST.TEXT: We have two Trustee vacancies we wish to fill, to complement our existing Board and guide our successful and growing organisation into the next phase of development.

JOB_POST.TEXT: June 30, 2024

JOB_POST.TEXT: Voluntary

JOB_POST.TEXT: Full Time

JOB_POST.TEXT: Clinks
82A James Carter Road
Mildenhall
Suffolk
IP28 7DE
020 4502 6774
info@clinks.org
Clinks is a registered charity no. 1074546 and a company limited by guarantee, registered in England no. 3562176

JOB_POST.TEXT:
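
One likely reason everything gets printed is that "field-item" is a generic class shared by every field (location, description, date and so on), so find_all matches all of them. A possible fix is to select only the field-item that sits inside a more specific wrapper. The real markup of the page is not shown in this thread, so the wrapper class names below (vacancy, field-name-title, field-name-location) are hypothetical; check the actual names with the browser's Inspect tool.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: every field uses "field-item", but each one sits
# inside a wrapper with a more specific class.
html = """
<div class="vacancy">
  <div class="field-name-title"><div class="field-item">Trustee</div></div>
  <div class="field-name-location"><div class="field-item">The Thames Valley</div></div>
</div>
<div class="vacancy">
  <div class="field-name-title"><div class="field-item">Support Worker</div></div>
  <div class="field-name-location"><div class="field-item">Suffolk</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Too broad: matches the title and the location of every vacancy.
all_fields = soup.find_all("div", class_="field-item")

# Narrower: a CSS selector that only matches field-item inside the title wrapper.
title_tags = soup.select("div.field-name-title div.field-item")
titles = [tag.get_text(strip=True) for tag in title_tags]

for title in titles:
    print(f"JOB_TITLE: {title}")
```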

 

 
Posted : 04/12/2024 4:01 pm
SSAdvisor
(@ssadvisor)
Posts: 1139
Noble Member
 

@google-rayazsiddiqi you have an extra character in your code; it should be "class" instead of "class_".

Regards,
Earnie Boyd, CEO
Seasoned Solutions Advisor LLC

 
Posted : 04/12/2024 4:11 pm
(@google-rayazsiddiqi)
Posts: 95
Estimable Member
Topic starter
 

@ssadvisor I changed it and I got a syntax error...strange!

PS C:\Users\rayaz> & D:/Python/python.exe "c:/Users/rayaz/test file.py"
  File "c:\Users\rayaz\test file.py", line 11
    job_posts = soup.find_all('div', class='field-item')
                                     ^^^^^
SyntaxError: invalid syntax
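
The syntax error is expected: class is a reserved keyword in Python, so it cannot be used as an argument name. That is why BeautifulSoup spells the argument class_ with a trailing underscore, and the original class_='field-item' was valid. The sketch below, run on a made-up HTML string, shows three equivalent ways to filter by class.

```python
import keyword

from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = '<div class="field-item">Trustee</div><div class="other">x</div>'
soup = BeautifulSoup(html, "html.parser")

# "class" is reserved by Python, so class='...' cannot be a keyword argument.
print(keyword.iskeyword("class"))  # True

# 1. The class_ keyword argument (note the trailing underscore).
by_kwarg = soup.find_all("div", class_="field-item")

# 2. An attrs dictionary, where "class" is just a string key.
by_attrs = soup.find_all("div", attrs={"class": "field-item"})

# 3. A CSS selector.
by_css = soup.select("div.field-item")

# All three return the same single tag.
texts = [[tag.get_text() for tag in result] for result in (by_kwarg, by_attrs, by_css)]
print(texts)
```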

 
Posted : 04/12/2024 4:22 pm