Publications from Malmö University
adXtractor – Automated and Adaptive Generation of Wrappers for Information Retrieval
Malmö högskola, Fakulteten för teknik och samhälle (TS).
2017 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
Abstract [en]

The aim of this project is to investigate the feasibility of retrieving unstructured automotive listings from structured web pages on the Internet. The research has two major purposes: (1) to investigate whether it is feasible to pair information extraction algorithms with the computation of wrappers, and (2) to demonstrate the results of pairing these techniques and evaluate the measurements. We merge two training sets available on the web to construct reference sets, which are the basis for the information extraction. The wrappers are computed by identifying data properties with information extraction techniques such as fuzzy string matching, regular expressions and document tree analysis. The results demonstrate that it is possible to pair these techniques successfully and retrieve the majority of the listings. The findings also suggest that many platforms utilise lazy loading to populate image resources, which the algorithm is unable to capture. In conclusion, the study demonstrates that it is possible to use information extraction to compute wrappers dynamically by identifying data properties. Furthermore, the study demonstrates the ability to open non-queryable domain data through a unified service.
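
As a rough illustration of how fuzzy string matching against a reference set and regular expressions might be combined to identify data properties in a listing, consider the minimal sketch below. It is not the thesis's implementation: the reference set contents, the function names `best_reference_match` and `extract_properties`, the 0.6 similarity cutoff and the year pattern are all assumptions made for this example, and document tree analysis is left out.

```python
import difflib
import re

# Hypothetical reference set of (make, model) pairs. The thesis builds its
# reference sets from training data, which is not reproduced here.
REFERENCE_SET = [
    ("Volvo", "V70"),
    ("Volkswagen", "Golf"),
    ("Toyota", "Corolla"),
]

# Assumed pattern for a model year between 1980 and 2029.
YEAR_RE = re.compile(r"\b(19[89]\d|20[0-2]\d)\b")


def best_reference_match(text, cutoff=0.6):
    """Return the reference (make, model) that best matches the text, or None."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    best, best_score = None, 0.0
    for make, model in REFERENCE_SET:
        score = 0.0
        for field in (make.lower(), model.lower()):
            # Score each field by its most similar token in the listing text.
            score += max(
                (difflib.SequenceMatcher(None, field, tok).ratio() for tok in tokens),
                default=0.0,
            )
        score /= 2  # average similarity over the two fields
        if score > best_score:
            best, best_score = (make, model), score
    return best if best_score >= cutoff else None


def extract_properties(text):
    """Identify make, model and year in a free-text listing."""
    match = best_reference_match(text)
    year = YEAR_RE.search(text)
    return {
        "make": match[0] if match else None,
        "model": match[1] if match else None,
        "year": int(year.group(1)) if year else None,
    }


if __name__ == "__main__":
    # "V7O" (letter O) is a deliberate typo that fuzzy matching tolerates.
    print(extract_properties("Volvo V7O 2012 diesel, well kept"))
```

Averaging per-field similarity keeps the sketch short; a real system would likely weight fields differently and tune the cutoff on labelled data.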

Place, publisher, year, edition, pages
Malmö högskola/Teknik och samhälle, 2017. p. 105
Keywords [en]
wrapper generation, information extraction, content of interest identification, wrapper rules, text extraction, key value pair, wrapper generate, main content identification, web scraping, information extraction algorithms, web extraction, dom tree analysis, dom analysis
Identifiers
URN: urn:nbn:se:mau:diva-20071
Local ID: 22427
OAI: oai:DiVA.org:mau-20071
DiVA, id: diva2:1479939
Educational program
TS Media Software Design, Master's Programme in Computer Science
Available from: 2020-10-27. Created: 2020-10-27. Last updated: 2022-06-27. Bibliographically approved

Open Access in DiVA

Full text (9891 kB), 231 downloads
File information
File name: FULLTEXT01.pdf. File size: 9891 kB. Checksum: SHA-512
5e2b422177edc5ee5d2466c618fd084f57edf39714031954bd5ddf1a04e9f676b35332faa8123561bd5486982a3514b11d55116fb5be3406f5bb7700ca297693
Type: fulltext. Mimetype: application/pdf
