Malmö University Publications
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
How to Build a Web Scraper for Social Media
Malmö University, Faculty of Technology and Society (TS).
Malmö University, Faculty of Technology and Society (TS).
2019 (English)Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
Abstract [en]

In recent years, the act of scraping websites for information has become increasingly relevant. However, along with this increase in interest, the internet has also grown substantially and advances and improvements to websites over the years have in fact made it more difficult to scrape. One key reason for this is that scrapers simply account for a significant portion of the traffic to many websites, and so developers often implement anti-scraping measures along with the Robots Exclusion Protocol (robots.txt) to try to stymie this traffic. The popular use of dynamically loaded content – content which loads after user interaction – poses another problem for scrapers. In this paper, we have researched what kinds of issues commonly occur when scraping and crawling websites – more specifically when scraping social media – and how to solve them. In order to understand these issues better and to test solutions, a literature review was performed and design and creation methods were used to develop a prototype scraper using the frameworks Scrapy and Selenium. We found that automating interaction with dynamic elements worked best to solve the problem of dynamically loaded content. We also theorize that having an artificial random delay when scraping and randomizing intervals between each visit to a website would counteract some of the anti-scraping measures. Another, smaller aspect of our research was the legality and ethicality of scraping. Further thoughts and comments on potential solutions to other issues have also been included.

Place, publisher, year, edition, pages
Malmö universitet/Teknik och samhälle , 2019. , p. 40
Keywords [en]
scraping, scraper, scrape, crawling, crawler, crawl, scrapy, selenium, social media, dynamic content, web, anti-scraping, anti-crawling, ajax
National Category
Engineering and Technology
Identifiers
URN: urn:nbn:se:mau:diva-20594Local ID: 29173OAI: oai:DiVA.org:mau-20594DiVA, id: diva2:1480471
Educational program
TS Systemutvecklare
Supervisors
Examiners
Available from: 2020-10-27 Created: 2020-10-27Bibliographically approved

Open Access in DiVA

fulltext(682 kB)4472 downloads
File information
File name FULLTEXT01.pdfFile size 682 kBChecksum SHA-512
7e795d61cf06072aa8660be1b74ebbd5efa481ca43cb55cc3a677fd03f9aea9ea2daeddb62e95c2b21674d447ec3763a558cce670ceae6a9016851b7ee839f99
Type fulltextMimetype application/pdf

By organisation
Faculty of Technology and Society (TS)
Engineering and Technology

Search outside of DiVA

GoogleGoogle Scholar
Total: 4481 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

urn-nbn

Altmetric score

urn-nbn
Total: 1621 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf