Building an Efficient Web Scraper for News Aggregation

Introduction to NewsFoldr

Hello, everyone! I’m thrilled to introduce you to NewsFoldr, my latest initiative in the news aggregation space. NewsFoldr gathers articles from various news outlets and consolidates them into a user-friendly web application. Explore it at NewsFoldr.com.

In addition, I’ve launched an API for NewsFoldr, which serves as a useful resource for other media organizations. You can find it listed on RapidAPI: NewsFoldr — The Ultimate News Aggregation Hub.

The Technical Foundation of NewsFoldr

The effectiveness of NewsFoldr hinges on building efficient web scrapers tailored to each news source. A prime example is the scraper I created for Al Jazeera, a prominent source of global news; their website is at aljazeera.com.

Al Jazeera’s platform is well-organized, making it ideal for scraping. However, much of its content loads asynchronously, and that dynamic behavior significantly shaped my scraping approach.

Delving into the Scraping Process

Let’s take a closer look at the technical details involved:

  1. Analyzing the Website Structure: The initial step is to examine the page source and recognize patterns for effective navigation through the web pages.
  2. Selecting the Appropriate Tools: My toolkit includes libraries such as requests, Selenium, and web drivers for browsers like Firefox and Chrome, which are essential for managing the complexities of dynamic content.
  3. Scraping Strategy: For Al Jazeera, I chose screen scraping. This method entails inspecting page sources to pinpoint key elements like IDs and classes, which are crucial for data extraction. Another option is to scrape the API that powers the page by making timed requests and mapping the returned payload to our requirements (a sketch of this alternative follows this list).
  4. Implementing the Code: The process flows from requesting the page using web drivers to extracting category links and story details. Each segment of code is meticulously designed to ensure efficient and precise data retrieval.
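
To illustrate that second, API-based option, here is a minimal sketch of timed polling against a JSON endpoint. The URL, parameters, and field names below are hypothetical placeholders for illustration, not Al Jazeera’s actual API:

import time
import requests

# Hypothetical endpoint and payload shape, used purely for illustration.
API_URL = "https://example-news-site.com/api/latest-stories"

def pollStoriesFromApi(max_polls=3, delay_seconds=60):
    """Poll a JSON endpoint at timed intervals and map the payload to our schema."""
    stories = []
    for _ in range(max_polls):
        response = requests.get(API_URL, params={"page": 1}, timeout=30)
        response.raise_for_status()
        for item in response.json().get("articles", []):
            # Keep only the fields the pipeline needs.
            stories.append({
                "title": item.get("title", ""),
                "link": item.get("url", ""),
                "published": item.get("date", ""),
            })
        time.sleep(delay_seconds)
    return stories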

A Peek at the Code

The scraper I developed operates by loading the homepage, retrieving the news categories available on the site, gathering stories from these categories, and ultimately scraping the story details.

Visiting a Page

The function below lets me fetch and parse the contents of a page; it is designed as a reusable utility:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

def requestPageUsingWebDriver(self, link, web_agent="chrome"):
    options = driver = None
    if web_agent == "firefox":
        options = webdriver.FirefoxOptions()
        options.add_argument("--headless")
        driver = webdriver.Firefox(options=options)
    elif web_agent == "chrome":
        options = Options()
        options.add_argument("--disable-dev-shm-usage")
        options.add_argument("--headless")
        options.add_argument("--no-sandbox")
        driver = webdriver.Chrome(options=options)
    else:
        raise Exception("Invalid Web Agent! Please use Firefox or Chrome.")

    driver.get(link)
    # Give dynamically loaded content time to appear before reading the source.
    driver.implicitly_wait(50)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()
    return soup
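
Because the pages load content asynchronously, an explicit wait on a specific element is often more reliable than implicitly_wait before reading page_source. A minimal sketch of such a helper, assuming a hypothetical CSS selector for the content container, which could be called right after driver.get(link):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def waitForContent(driver, css_selector="main", timeout=30):
    # Block until the (hypothetical) content node exists, so that
    # driver.page_source reflects the dynamically loaded content.
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
    )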

Retrieving Category Links

The category links live in the site header’s navigation bar. After obtaining the homepage, we search the header for its anchor elements and extract each href value along with its display text, which represents the actual category.

def findNavigation(base_url):
    parsed_page = utils.requestPageUsingWebDriver(base_url)
    nav_tag = parsed_page.find('nav', class_='site-header__navigation css-15ru6p1')
    li_tags = nav_tag.find_all('li', class_='menu__item menu__item--aje')
    # Map each category's display text to an absolute URL.
    categories = {
        a.text.strip(): f"{domain_url}{a['href']}" if domain_url not in a["href"] else a['href']
        for li in li_tags for a in li.find_all('a', href=True)
    }
    return categories

def alJazeera():
    global GLOBAL_CURRENT_ARTICLE_NO
    global GLOBAL_START_TIME
    global GLOBAL_MAX_ARTICLES
    global GLOBAL_MAX_RUN_TIME_IN_MINS
    try:
        categories = findNavigation(base_url)
        for category, category_link in categories.items():
            scrapeCategoryLinks(category_link, category)
    except requests.HTTPError as err:
        print(f"HTTP error occurred {err}")

Retrieving Article Links from a Category

This section illustrates how story links are collected and passed to the next function:

def extractLatestStories(parsed_page):
    result_links = []
    for a_tag in parsed_page.find_all("a", class_='u-clickable-card__link'):
        if (
            a_tag
            and "page" not in a_tag["href"]
            and "https" not in a_tag["href"]
        ):
            # Skip pagination and absolute links, then build the full story URL.
            result_links.append(f'{domain_url}{a_tag["href"]}')
    # Deduplicate before returning.
    return list(set(result_links))

Please note that the class searched for here, 'u-clickable-card__link', returns all the anchor tags that lead to the available stories.
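
As a quick illustration of the filtering logic, here is a small self-contained run against hypothetical markup (the hrefs and the domain_url value are invented placeholders):

from bs4 import BeautifulSoup

domain_url = "https://www.aljazeera.com"  # module-level constant in the real scraper

sample_html = """
<a class="u-clickable-card__link" href="/news/example-story-1">Story 1</a>
<a class="u-clickable-card__link" href="/news/example-story-1">Story 1 (duplicate card)</a>
<a class="u-clickable-card__link" href="/news/page/2">Pagination link</a>
<a class="u-clickable-card__link" href="https://external.example.com/story">External link</a>
"""

parsed = BeautifulSoup(sample_html, "html.parser")
print(extractLatestStories(parsed))
# Expected output: ['https://www.aljazeera.com/news/example-story-1']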

Scraping the Article Data

At this stage, the script has a list of story links, so another function is needed to collect the metadata associated with each story:

def scrapeArticle(link, category):
    global GLOBAL_CURRENT_ARTICLE_NO
    try:
        parsed_page = utils.requestPage(link)
        title = story_date = image_url = youtube = ""
        author = "Al Jazeera"
        paragraphs = []
        try:
            header = parsed_page.find('header', class_='article-header')
            title = header.find('h1').text if header and header.find('h1') else "No Title"
            author_info = parsed_page.find('div', class_='article-author-name')
            author = author_info.find('a', class_='author-link').text if author_info and author_info.find('a', class_='author-link') else "Unknown"
            date_info = parsed_page.find('div', class_='article-dates')
            story_date = date_info.find('span').text.strip() if date_info and date_info.find('span') else "Unknown"
            figure = parsed_page.find('figure', class_='article-featured-image')
            image_url = figure.find('img')['src'] if figure and figure.find('img') else ""
            content_area = parsed_page.find('div', class_='wysiwyg')
            paragraphs = [para.text for para in content_area.find_all('p')] if content_area else []
        except Exception as err:
            print(err)

        # categorize the data based on some weights (sorry, can't share here)
        # formulate data and push it to the database
        # upload the data schema to a database
        telegram_bot.sendArticle(data)
    except IndexError as err:
        print(err)
    except requests.HTTPError as err:
        print(err)
    except Exception as err:
        print(err)
    finally:
        # register the link as an already scraped story to prevent double scraping
        GLOBAL_CURRENT_ARTICLE_NO += 1
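
The "already scraped" registry is elided above; one simple way to implement it, assuming an in-memory set (the production scraper presumably persists this in its database instead), could look like this:

# Hypothetical dedup registry, kept in memory for illustration only.
SCRAPED_LINKS = set()

def alreadyScraped(link):
    return link in SCRAPED_LINKS

def registerScraped(link):
    SCRAPED_LINKS.add(link)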

Conclusion

In summary, NewsFoldr is more than just a news aggregation website; it serves as a portal to endless opportunities, such as trend analysis and financial forecasting based on real-time news. If you’re interested in utilizing this data, don’t miss out on the NewsFoldr API.

Do you have a specific news source in mind that you would like me to scrape for the NewsFoldr project? Or are you interested in building custom scrapers for your needs? Let’s connect! Feel free to leave a comment or send a private message. If you found this article valuable, please give it a clap and subscribe to my newsletter for future updates. I’m working on some exciting projects, including the development of GPTs using this API, among other initiatives.

Thank you for your time!

