
Following links in Python assignment using BeautifulSoup

I have this assignment for a Python class where I have to start from a specific link at a specific position, then follow that link a specific number of times. The first link counts as position 1. This is the link: http://python-data.dr-chuck.net/known_by_Fikret.html

[Screenshot of the traceback] I have trouble locating the link; the error "index out of range" comes up. Can anyone help me figure out how to locate the link at a given position? This is my code:

OK I wrote this code and it kind of works:

I'm still getting different links than the ones in the example; however, when I print the whole list of links, the positions match, so I don't know. Very weird.

  • beautifulsoup

–  mthe25

  • show your traceback completely, and if possible what is the url? is it public? –  joel goldstick Commented Jul 8, 2016 at 13:31
  • I updated the description –  mthe25 Commented Jul 8, 2016 at 13:46
  • You are importing bs wrong; it should be from bs4 import BeautifulSoup. But what do you mean by count and position? Are you saying: from the position, take the next count links? –  joel goldstick Commented Jul 8, 2016 at 13:46
  • You should check that your tags list has the correct length before you access it at arbitrary positions –  OneCricketeer Commented Jul 8, 2016 at 13:50
  • Yes Joel, the point of the assignment is to locate a specific link, then print the next x links from there on. However, I can't seem to figure out how to find the position of the specific link –  mthe25 Commented Jul 8, 2016 at 13:52

9 Answers

I put the solution below, tested and working as of today.

Import the required modules.

Access the website and retrieve all of the anchor tags.
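The answer's code isn't preserved in this copy; below is a sketch of the approach it describes (a reconstruction, not the original code), using the py4e-style inputs for count and position:

    import urllib.request
    from bs4 import BeautifulSoup

    url = 'http://python-data.dr-chuck.net/known_by_Fikret.html'
    count = int(input('Enter count: '))        # how many links to follow
    position = int(input('Enter position: '))  # 1-based position of the link

    for _ in range(count):
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, 'html.parser')
        tags = soup('a')              # retrieve all of the anchor tags
        tag = tags[position - 1]      # position 1 is the first link
        url = tag.get('href')         # follow that link on the next pass
        print('Retrieving:', url, '-', tag.contents[0])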

–  Rene Chan

[Edit: cut and pasted this line from the comments] Hi! I had to work on a similar exercise, and because I had some doubts I found your question. Here is my code, and I think it works. I hope it will be helpful for you.

–  kyriakosSt

  • Hi! I had to work in a similar exercise, and because i had some doubts i found your question. Here is my code and I think it works. I hope it will be helpful for you. –  Giselle Santamaria Commented Dec 5, 2017 at 18:22
  • You may consider editing to answer to explain why and how it might solve this problem, to improve its long-term value. –  Luis Orduz Commented Dec 5, 2017 at 18:26

Try this. You can skip entering the URL; there is a sample using your original link. Good luck!

–  S.Dbz

Your BeautifulSoup import was wrong; I don't think it works with the code you show. Also, your lower loop was confusing. You can get the list of urls you want by slicing the completely retrieved list.

I've hardcoded your url in my code because it was easier than typing it in each run.
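The answer's code isn't shown in this copy; here is a minimal sketch of the slicing idea it describes (a reconstruction, with example values for position and count):

    import urllib.request
    from bs4 import BeautifulSoup

    url = 'http://python-data.dr-chuck.net/known_by_Fikret.html'  # hardcoded, as noted above
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')

    # Retrieve the complete list of hrefs, then slice out the ones you want
    all_urls = [tag.get('href') for tag in soup('a')]
    position, count = 3, 4  # example values from the assignment
    print(all_urls[position - 1 : position - 1 + count])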

–  joel goldstick

Almost all solutions to this assignment use two sections to load the urls. Instead, I defined a function that prints the relevant link for any given url.

Initially, the function uses the Fikret.html url as input. Subsequent inputs rely on the refreshed urls that appear at the required position. The important line of code is this one: url = allerretour(url)[position-1]. This gets the new url that feeds the loop for another round.
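The full answer code isn't preserved here; the sketch below reconstructs the described approach around that line (allerretour is the answer's own function name, but its body here is an assumption):

    import urllib.request
    from bs4 import BeautifulSoup

    def allerretour(url):
        # Return the list of hrefs for every anchor tag on the page at url
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, 'html.parser')
        return [tag.get('href') for tag in soup('a')]

    url = 'http://python-data.dr-chuck.net/known_by_Fikret.html'
    position, count = 3, 4  # example values from the assignment
    for _ in range(count):
        url = allerretour(url)[position - 1]  # feed the new url into another round
        print(url)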

–  Martien Lubberink

This is my solution:

–  Carlos Riquelme

This is my answer that worked for me in Python 2.7:

–  Raunak Patni

Here is the working code, giving the desired output:

–  Harsh Beriwal

  • Would you please add some minimal explanation to your code? –  sɐunıɔןɐqɐp Commented Jun 10, 2018 at 10:56

–  Lalit Suthar

  • Briefly explain your answer. –  papabiceps Commented Feb 2, 2019 at 16:07



Guide to Parsing HTML with BeautifulSoup in Python


  • Introduction

Web scraping is programmatically collecting information from various websites. While there are many libraries and frameworks in various languages that can extract web data, Python has long been a popular choice because of its plethora of options for web scraping.

This article will give you a crash course on web scraping in Python with Beautiful Soup - a popular Python library for parsing HTML and XML.

  • Ethical Web Scraping

Web scraping is ubiquitous and gives us data as we would get with an API. However, as good citizens of the internet, it's our responsibility to respect the site owners we scrape from. Here are some principles that a web scraper should adhere to:

  • Don't claim scraped content as our own. Website owners sometimes spend a lengthy amount of time creating articles, collecting details about products or harvesting other content. We must respect their labor and originality.
  • Don't scrape a website that doesn't want to be scraped. Websites sometimes come with a robots.txt file - which defines the parts of a website that can be scraped. Many websites also have Terms of Use which may not allow scraping. We must respect websites that do not want to be scraped.
  • Is there an API available already? Splendid, there's no need for us to write a scraper. APIs are created to provide access to data in a controlled way as defined by the owners of the data. We prefer to use APIs if they're available.
  • Making requests to a website can take a toll on its performance. A web scraper that makes too many requests can be as debilitating as a DDoS attack. We must scrape responsibly so we don't cause any disruption to the regular functioning of the website.
  • An Overview of Beautiful Soup

The HTML content of web pages can be parsed and scraped with Beautiful Soup. In the following section, we will cover the functions that are useful for scraping web pages.

What makes Beautiful Soup so useful is the myriad of functions it provides to extract data from HTML. The image below illustrates some of the functions we can use:

[Figure: BeautifulSoup - An Overview]

Let's get hands-on and see how we can parse HTML with Beautiful Soup. Consider the following HTML page saved to file as doc.html:
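The article's original doc.html is not reproduced in this copy; the snippet below writes a minimal, hypothetical file that is consistent with the navigation and search examples that follow (a head title, a body title inside a "title"-classed paragraph, and three links, two of which carry the "element" class):

    # Write a hypothetical doc.html used by the examples below
    doc = """<html><head><title>Head's title</title></head>
    <body>
    <p class="title"><b>Body's title</b></p>
    <p class="story">line begins
    <a href="http://example.com/element1" class="element" id="link1">1</a>
    <a href="http://example.com/element2" class="element" id="link2">2</a>
    <a href="http://example.com/avatar1" class="avatar" id="link3">3</a>
    <p>line ends</p>
    </body></html>"""
    with open("doc.html", "w") as f:
        f.write(doc)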

The following code snippets were tested on Ubuntu 20.04.1 LTS. You can install the BeautifulSoup module by typing the following command in the terminal:
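The command itself isn't shown in this copy; the package is published on PyPI as beautifulsoup4:

    $ pip install beautifulsoup4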

The HTML file doc.html needs to be prepared. This is done by passing the file to the BeautifulSoup constructor. Let's use the interactive Python shell for this, so we can instantly print the contents of a specific part of a page:
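A minimal sketch, assuming Python's built-in html.parser (the article may have used another parser):

    >>> from bs4 import BeautifulSoup
    >>> with open("doc.html") as fp:
    ...     soup = BeautifulSoup(fp, "html.parser")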

Now we can use Beautiful Soup to navigate our website and extract data.

  • Navigating to Specific Tags

From the soup object created in the previous section, let's get the title tag of doc.html :
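Against the hypothetical doc.html sketched earlier, this would look like:

    >>> soup.head.title
    <title>Head's title</title>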

Here's a breakdown of each component we used to get the title:

[Figure: Navigating Specific Tags]

Beautiful Soup is powerful because our Python objects match the nested structure of the HTML document we are scraping.

To get the text of the first <a> tag, enter this:
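Assuming the hypothetical doc.html above, where the first <a> tag's text is "1":

    >>> soup.body.a.text
    '1'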

To get the title within the HTML's body tag (denoted by the "title" class), type the following in your terminal:
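A sketch against the hypothetical doc.html, where the body's title sits in a <b> tag inside the "title"-classed paragraph:

    >>> soup.body.p.b
    <b>Body's title</b>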

For deeply nested HTML documents, navigation could quickly become tedious. Luckily, Beautiful Soup comes with a search function so we don't have to navigate to retrieve HTML elements.

  • Searching the Elements of Tags

The find_all() method takes an HTML tag as a string argument and returns the list of elements that match the provided tag. For example, if we want all a tags in doc.html:

We'll see this list of a tags as output:
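Assuming the hypothetical doc.html sketched earlier, the call and its output would be:

    >>> soup.find_all("a")
    [<a class="element" href="http://example.com/element1" id="link1">1</a>,
     <a class="element" href="http://example.com/element2" id="link2">2</a>,
     <a class="avatar" href="http://example.com/avatar1" id="link3">3</a>]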

Here's a breakdown of each component we used to search for a tag:

[Figure: Searching Elements of Tags]

We can search for tags of a specific class as well by providing the class_ argument. Beautiful Soup uses class_ because class is a reserved keyword in Python. Let's search for all a tags that have the "element" class:

As we only have two links with the "element" class, you'll see this output:
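Against the hypothetical doc.html:

    >>> soup.find_all("a", class_="element")
    [<a class="element" href="http://example.com/element1" id="link1">1</a>,
     <a class="element" href="http://example.com/element2" id="link2">2</a>]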

What if we wanted to fetch the links embedded inside the a tags? Let's retrieve a link's href attribute using the find() method. It works just like find_all(), but it returns the first matching element instead of a list. Type this in your shell:
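A sketch against the hypothetical doc.html (subscripting the tag with the attribute name returns its value):

    >>> soup.find("a")["href"]
    'http://example.com/element1'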

The find() and find_all() functions also accept a regular expression instead of a string. Behind the scenes, the tag names are filtered using the compiled regular expression's search() method. For example:

Iterating over the list fetches the tags whose names start with the character b, which includes <body> and <b>:
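A sketch of the regular-expression search and its output:

    >>> import re
    >>> for tag in soup.find_all(re.compile("^b")):
    ...     print(tag.name)
    ...
    body
    b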


We've covered the most popular ways to get tags and their attributes. Sometimes, especially for less dynamic web pages, we just want the text. Let's see how we can get it!

  • Getting the Whole Text

The get_text() function retrieves all the text from the HTML document. Let's get all the text of the HTML document:

Your output should contain all the text of the document. Depending on the markup, the newline characters may be printed as well, so your output may contain extra blank lines.
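A sketch against the hypothetical doc.html (the exact spacing and blank lines depend on the markup):

    >>> print(soup.get_text())

    Head's title

    Body's title
    line begins
    1
    2
    3
    line ends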

Now that we have a feel for how to use Beautiful Soup, let's scrape a website!

  • Beautiful Soup in Action - Scraping a Book List

Now that we have mastered the components of Beautiful Soup, it's time to put our learning to use. Let's build a scraper to extract data from https://books.toscrape.com/ and save it to a CSV file. The site contains random data about books and is a great space to test out your web scraping techniques.

First, create a new file called scraper.py . Let's import all the libraries we need for this script:
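These imports match the module list that follows:

    import requests
    import time
    import csv
    import re
    from bs4 import BeautifulSoup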

In the modules mentioned above:

  • requests - performs the URL request and fetches the website's HTML
  • time - lets us pause between requests so we don't overwhelm the site
  • csv - helps us export our scraped data to a CSV file
  • re - allows us to write regular expressions that will come in handy for picking text based on its pattern
  • bs4 - yours truly, the scraping module to parse the HTML

You should have bs4 already installed, and time, csv, and re are built-in packages in Python. You'll need to install the requests module directly, like this:
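The requests package is on PyPI under the same name:

    $ pip install requests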

Before you begin, you need to understand how the webpage's HTML is structured. In your browser, let's go to http://books.toscrape.com/catalogue/page-1.html . Then right-click on the components of the web page to be scraped, and click on the inspect button to understand the hierarchy of the tags as shown below.

This will show you the underlying HTML for what you're inspecting.

From inspecting the HTML, we learn how to access the URL of the book, the cover image, the title, the rating, the price, and other fields. Let's write a function that scrapes a book item and extracts its data:
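The article's exact function isn't preserved in this copy; below is a sketch based on books.toscrape.com's markup (the function and field names are assumptions, and write_to_csv is the CSV helper added next):

    def scrape_books(html_soup, url):
        for book in html_soup.find_all("article", class_="product_pod"):
            # The full title lives in the title attribute of the h3's link
            title = book.h3.a["title"]
            # The href is relative to the catalogue directory
            link = url.rsplit("/", 1)[0] + "/" + book.h3.a["href"]
            # The rating is encoded as a class name, e.g. "star-rating Three"
            rating = book.find("p", class_="star-rating")["class"][1]
            # The price text looks like "£51.77"; keep only digits and the dot
            price = re.sub(r"[^0-9.]", "", book.find("p", class_="price_color").text)
            cover = book.find("img")["src"]
            write_to_csv([title, link, rating, price, cover])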

The last line of the above snippet points to a function to write the list of scraped strings to a CSV file. Let's add that function now:
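A minimal version, appending each book's row to allBooks.csv (the filename the article uses):

    def write_to_csv(book_data):
        with open("allBooks.csv", "a", newline="") as csv_file:
            writer = csv.writer(csv_file)
            writer.writerow(book_data)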

Now that we have a function that can scrape a page and export to CSV, we want another function that crawls through the paginated website, collecting book data on each page.

To do this, let's look at the URL we are writing this scraper for:

The only varying element in the URL is the page number. We can format the URL dynamically so it becomes a seed URL:
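A sketch of the seed URL with a placeholder for the page number:

    seed_url = "http://books.toscrape.com/catalogue/page-{}.html"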

This string-formatted URL with the page number can be fetched using requests.get(). We can then create a new BeautifulSoup object. Every time we get the soup object, the presence of the "next" button is checked so we can stop at the last page. We keep a counter for the page number, incremented by 1 after each successfully scraped page.
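A reconstruction of the crawler described above (not the article's exact code):

    def browse_and_scrape(seed_url, page_number=1):
        try:
            html_text = requests.get(seed_url.format(page_number)).text
            soup = BeautifulSoup(html_text, "html.parser")
            print("Now scraping page", page_number)
            scrape_books(soup, seed_url.format(page_number))
            time.sleep(1)  # pause between requests to scrape responsibly
            # If a "next" button exists, there are more pages to crawl
            if soup.find("li", class_="next") is not None:
                return browse_and_scrape(seed_url, page_number + 1)
            return True
        except Exception as e:
            return e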

The function above, browse_and_scrape() , is recursively called until the function soup.find("li",class_="next") returns None . At this point, the code will scrape the remaining part of the webpage and exit.

For the final piece of the puzzle, we initiate the scraping flow. We define the seed_url and call browse_and_scrape() to get the data. This is done under the if __name__ == "__main__" block:
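A sketch of the entry point, under the assumptions above:

    if __name__ == "__main__":
        seed_url = "http://books.toscrape.com/catalogue/page-{}.html"
        print("Web scraping has begun")
        result = browse_and_scrape(seed_url)
        if result is True:
            print("Web scraping is now complete!")
        else:
            print("Oops, something went wrong:", result)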

If you'd like to learn more about the if __name__ == "__main__" block, check out our guide on how it works .

You can execute the script as shown below in your terminal and get the output as:
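With the sketch above, a run would look roughly like this (the exact messages depend on your print statements; books.toscrape.com has 50 catalogue pages):

    $ python scraper.py
    Web scraping has begun
    Now scraping page 1
    Now scraping page 2
    ...
    Now scraping page 50
    Web scraping is now complete!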

The scraped data can be found in the current working directory under the filename allBooks.csv. Here's a sample of the file's content:
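An illustrative row in the format the sketch above produces (the cover path is abbreviated):

    A Light in the Attic,http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html,Three,51.77,../media/cache/.../cover.jpg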

Good job! If you wanted to have a look at the scraper code as a whole, you can find it on GitHub .

In this tutorial, we learned the ethics of writing good web scrapers. We then used Beautiful Soup to extract data from an HTML file using Beautiful Soup's object properties and its various methods like find(), find_all(), and get_text(). We then built a scraper that retrieves a book list online and exports it to CSV.

Web scraping is a useful skill that helps in various activities, such as extracting data like an API, performing QA on a website, checking for broken URLs, and more. What's the next scraper you're going to build?


BeautifulSoup – Scraping Link from HTML

Prerequisite: Implementing Web Scraping in Python with BeautifulSoup

In this article, we will understand how we can extract all the links from a URL or an HTML document using Python.

Libraries Required:

  • bs4 (BeautifulSoup): a library in Python that makes it easy to scrape information from web pages and helps in extracting data from HTML and XML files. It does not come bundled with Python and needs to be installed separately (see the commands below).
  • requests: this library enables us to send HTTP requests and fetch web page content very easily. It also needs to be installed separately (see the commands below).
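The install commands referenced in the list above:

    pip install bs4
    pip install requests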

Steps to be followed:

  • Import the required libraries (bs4 and requests).
  • Create a function to get the HTML document from the URL, using the requests.get() method and passing the URL to it.
  • Create a parse tree object (a soup object) using the BeautifulSoup() method, passing it the HTML document extracted above and Python's built-in HTML parser.
  • Use the a tag to extract the links from the BeautifulSoup object.
  • Get the actual URLs from all the anchor tag objects with the get() method, passing the href argument to it.
  • Moreover, you can get the title of the URLs with the get() method, passing the title argument to it.

Implementation:
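The article's code isn't preserved in this copy; the sketch below follows the steps listed above (the target URL is an arbitrary example):

    import requests
    from bs4 import BeautifulSoup

    def get_links(url):
        # Step 2: fetch the HTML document from the URL
        html = requests.get(url).text
        # Step 3: build the parse tree with Python's built-in HTML parser
        soup = BeautifulSoup(html, "html.parser")
        # Steps 4-6: extract every anchor tag's href and, when present, title
        for link in soup.find_all("a"):
            print(link.get("href"), link.get("title"))

    get_links("https://www.geeksforgeeks.org/")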

Abul-Hasan2338 / Practice_problem_12.2.py
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

#url = input('Enter - ')
url = 'http://python-data.dr-chuck.net/known_by_Fikret.html'
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
#print(soup)

# Retrieve all of the anchor tags
tags = soup('a')
val = int(input('Enter Count\n'))
b = int(input('Enter position\n'))
i = 1
while i <= val:
    i = i + 1
    c = 0
    for tag in tags:
        c = c + 1
        if c == b:
            print(tag.get('href', None))
            x = tag.get('href')
            #print(x)
            # Follow the link and re-parse the new page
            html = urllib.request.urlopen(x, context=ctx).read()
            soup = BeautifulSoup(html, 'html.parser')
            tags = soup('a')
            #print(tag.contents[0])
            break
        else:
            continue
print(tag.contents[0])
# Expected sequence: Fikret Montgomery Mhairade Butchi Anayah

Abul-Hasan2338 commented Oct 19, 2020 (edited):

import urllib.request, urllib.parse, urllib.error
import ssl
from bs4 import BeautifulSoup

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter URL - ')
repeat = int(input('Enter number of repetitions: '))
position = int(input('Enter the link position: '))

# to repeat desired times
for i in range(repeat):
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    count = 0
    for tag in tags:
        count = count + 1
        # The rest of this loop was cut off in the original comment; a
        # plausible completion is to stop at the requested position and
        # follow that link:
        if count == position:
            url = tag.get('href')
            name = tag.contents[0]
            break

print(name)


Wednesday, July 6, 2016

Using Python to Access Web Data, Week 4: Following Links in HTML Using BeautifulSoup

  • Sample problem: Start at http://python-data.dr-chuck.net/known_by_Fikret.html. Find the link at position 3 (the first name is 1). Follow that link. Repeat this process 4 times. The answer is the last name that you retrieve. Sequence of names: Fikret Montgomery Mhairade Butchi Anayah. Last name in sequence: Anayah
  • Actual problem: Start at http://python-data.dr-chuck.net/known_by_Lana.html. Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve. Hint: The first character of the name of the last page that you will load is: L

6 comments:


What does the step seq = seq + tags[pos].contents[0] + ' ' do? What is contents[0] + ' '?

Could you please explain the algorithm? I seem to be having trouble understanding it. Thanks.

What does seq do?

Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve. Hint: The first character of the name of the last page that you will load is: Z

