Topic: Extracting information from the HTML of a web page

Posted by vuott « on: 13 July 2013, 19:14:51 »
I'm reposting this discussion, which appeared on the international mailing list:

« I am trying to get some info from a web page in the format of:

Code: html
<div class="result">
  <div class="col">Text I Want</div>
  <div class="col">
    And Some More i Want
  </div>
  <div class="col">
    And The last bit
  </div>
</div>

What would be the best way to go about this? I have tried a few ways,
but I feel there must be an easy way to do it.

Thanks
Shane
»


« Quite frankly, there is no one easy way to go about this. It depends on
how well structured the data in the web page is, and also on how likely
it is that the web page will change format. We scrape upwards of 300
pages daily and have some fairly mature ways of approaching it. Here are
some of the techniques we use; short sketches of each follow this
message.

1) Always try and find an XML feed equivalent of the page data.
Sometimes this can be found as a raw feed, or sometimes as a hidden feed
behind an active page. Once you get the feed URL and either find or
write a schema, parsing the XML is relatively trivial.

2) If the page is well structured and relatively stable, then the next
best approach, I suppose, would be to follow Randall's suggestion and
write an HTML DOM parser. But if you go down this route, develop a
"meta" schema for your parser so you can accommodate changes to the page
format and raw HTML with minimal pain.

3) Sometimes we have found that it is better to ignore the HTML
completely and process the page text only. This is particularly true
for pages that use large, well-formed "tables" of data that are unlikely
to change in layout (such as when there is an "industry standard" way of
presenting the data). I find that the easiest way to get the raw text in
a format that allows reasonable scraping is to use wget, html2text,
links or lynx to download the page as you need it. The choice of
downloader depends on which one can give you the best "layout" of the
text to make parsing easier. Again, try and develop some
meta-description of the text.

4) Always include code to detect possible page format changes and to
describe exactly which bit of the page is no longer scrapable!  This can
save hours of work when a tiny bit changes and renders your parser
incorrect or unusable.

5) Finally, we have encountered some pages where the target texts are
seemingly impossible to predict. For example, one feed we use randomly
inserts advertising data inside the data table rows; that is, only some
of the rows include this extra stuff and some don't. For these, we have
had to resort to "restructuring" the semi-parsed data and writing it out
to an intermediate file. We then try to parse that file automatically
and, if that fails, manually edit the intermediate file back into a
usable format.

Hope this gives you some ideas of how to approach your situation.  Above
all, try and design an approach before leaping into the coding stage.
Several times we have been caught out assuming that a page is "simple"
and have had to go back and rethink the whole design for that feed
because the provider made some change to the page presentation.

regards
Bruce
»
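
To illustrate Bruce's point 1: once you have a feed URL, parsing the XML
is a few lines in any language. A minimal Python sketch; the feed URL
and the <result>/<col> element names are assumptions, chosen to mirror
the fragment in the original question:

Code: python
# Sketch of Bruce's point 1: prefer an XML feed over scraping the HTML.
# The feed URL and element names are hypothetical.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "http://example.com/data/feed.xml"  # hypothetical feed endpoint

with urllib.request.urlopen(FEED_URL) as response:
    tree = ET.parse(response)

# Extract every <col> element inside each <result> record.
for result in tree.getroot().iter("result"):
    print([col.text for col in result.iter("col")])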
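
For point 2, the "meta" schema can be as small as one constant that
names the elements worth keeping, held apart from the parsing logic so
that a page change means editing data rather than code. A sketch using
only the Python standard library; the class name is an assumption taken
from the fragment above:

Code: python
# Sketch of Bruce's point 2: a hand-written HTML parser whose
# page-specific knowledge lives in one "meta schema" constant.
from html.parser import HTMLParser

TARGET_CLASS = "col"  # the "meta schema": edit this, not the parser logic

class ColExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while inside a target <div>
        self.texts = []   # collected text, one entry per target <div>

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
        elif tag == "div" and dict(attrs).get("class") == TARGET_CLASS:
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data

parser = ColExtractor()
parser.feed('<div class="result"><div class="col">Text I Want</div>'
            '<div class="col">And Some More i Want</div></div>')
print([t.strip() for t in parser.texts])
# ['Text I Want', 'And Some More i Want']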
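
For point 3, a text-mode dump turns a rigid HTML table into lines of
aligned columns that ordinary string handling can split. A sketch that
shells out to lynx, assuming it is installed (wget piped through
html2text works the same way); the URL is hypothetical:

Code: python
# Sketch of Bruce's point 3: ignore the HTML, scrape the rendered text.
import subprocess

URL = "http://example.com/table-page.html"  # hypothetical page

# "lynx -dump" renders the page as plain text; -nolist drops the link list.
text = subprocess.run(
    ["lynx", "-dump", "-nolist", URL],
    capture_output=True, text=True, check=True,
).stdout

# On a well-formed table, each data row becomes a line of aligned fields.
for line in text.splitlines():
    fields = line.split()
    if fields:
        print(fields)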
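
Point 4 costs only a few checks but pays for itself the first time the
page changes. A sketch of the idea; the invariants (three columns, a
non-empty first field) are page-specific assumptions:

Code: python
# Sketch of Bruce's point 4: fail loudly, and precisely, on a format change.
def check_page_format(cols):
    # Hypothetical invariants for this page; adjust to the real layout.
    if len(cols) != 3:
        raise ValueError(f"expected 3 'col' divs, found {len(cols)}: "
                         "the page layout may have changed")
    if not cols[0].strip():
        raise ValueError("first 'col' div is empty: has the header moved?")

cols = ["Text I Want", "And Some More i Want", "And The last bit"]
check_page_format(cols)  # raises a precise error if the page changed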
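
And for point 5, writing the semi-parsed rows to an intermediate file
gives you somewhere to repair by hand whatever the automatic pass could
not recognise, such as the randomly inserted advertising rows Bruce
mentions. A minimal sketch; the row shapes are invented for illustration:

Code: python
# Sketch of Bruce's point 5: park semi-parsed rows in an intermediate
# file, flagging anything that does not look like a real data row.
import csv

rows = [
    ["2013-07-13", "42", "ok"],
    ["** advertisement **"],   # the random extra row Bruce describes
    ["2013-07-14", "57", "ok"],
]

with open("intermediate.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        if len(row) == 3:                       # looks like a data row
            writer.writerow(row)
        else:                                   # park it for manual editing
            writer.writerow(["UNPARSED"] + row)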


« There is a parsing tool in Gambas for HTML:

gb.xml.html

It's our own HTML DOM parser. It allows you to generate well-formatted
HTML5 pages and to parse existing HTML pages.

It's one of the fastest parsers I know.

Fabien Bodard
 »


« You need to use the right tool for the job. I find the Python tool
BeautifulSoup one of the best for parsing and extracting data from web
pages.

http://www.crummy.com/software/BeautifulSoup/

Kind regards,
Caveat
»
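
Applied to the fragment from the original question, the extraction is a
couple of lines. A minimal sketch of the BeautifulSoup suggestion, not
Caveat's own code:

Code: python
# Minimal BeautifulSoup sketch for Shane's fragment.
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<div class="result">
  <div class="col">Text I Want</div>
  <div class="col">And Some More i Want</div>
  <div class="col">And The last bit</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
print([div.get_text(strip=True)
       for div in soup.select("div.result div.col")])
# ['Text I Want', 'And Some More i Want', 'And The last bit']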


« You can use the DOM parser of the gb.qt4.webkit component too.

--
Benoît Minisini
»


« I too have done a lot of data scraping over the past few years. I think
picking the right tool for the job will ease your development. I have
many Python and Java tools and have played with the Gambas parser. But I
have found very little that matches the ease of development I find with
Python and Selenium. Selenium is not just a scraping tool; in fact, it
wasn't meant for that at all. It is a browser automation tool and website
test framework. With it, I've had little problem dealing with typical
changes in content. It is also great for comparing the page code sent to
different browsers. BeautifulSoup is great for well-structured pages, but
once that structure is lost it often fails. xmllib, htmllib and other
Python modules just don't seem to match the productivity I find with
Selenium.

It all comes down to how general your solution needs to be. Is this a
one-off scrape or something that you intend to do over a long period of
time? Do you know Python or Java, and can you learn them quickly? Must
your solution scale to large projects, or is it just for this one use?

So answer these questions and then review the options. If it is something
the Gambas parser can handle, then use it. Or, if the page is very stable
and well structured, then write a parser. A basic parser is not difficult
to write; search the internet for Jack Crenshaw's article on building a
simple parser. However, if the page is complex and this is a long-term
project, you may want to consider a more powerful and stable solution.

As Bruce said, put in lots of tests along the way because some pages do
change constantly. Having a reporting system that allows you to locate such
changes is very helpful in a high production environment.

Hope this helps
Randall Morgan
»
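
For reference, a minimal sketch in the spirit of Randall's Selenium
suggestion; it drives a real browser, so it also works when the content
is generated by JavaScript. The URL and selector are assumptions:

Code: python
# Minimal Selenium sketch (pip install selenium; also needs a browser
# with a matching WebDriver available).
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
try:
    driver.get("http://example.com/page.html")  # hypothetical page
    # Selenium queries the live DOM, after any JavaScript has run.
    for div in driver.find_elements(By.CSS_SELECTOR, "div.result div.col"):
        print(div.text)
finally:
    driver.quit()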


« Like Bruce and Randall said, there is no perfect solution if the
structure of the parsed page changes.

So you need some control points before parsing, to be sure that you get
the right result at the end. If the checks show a structure change, then
inform the user that the parser needs to be reviewed.

Now, there are two kinds of parsing.

Manual: using InStr and other common text-manipulation tools. I use this
when I need to find one piece of data on one line, because it is quicker
than a DOM tool, which needs to parse the whole HTML structure first.

But if you need many pieces of information in many places of the web
page, the DOM is better and tolerates more changes to the web page
before you need to change the parser's structure, simply because the
searches are made using tags and attributes.

I will send you an example tomorrow.

Fabien Bodard
»
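
Fabien's "manual" approach translates directly to any language's
substring functions. A sketch of the idea in Python, where str.find
plays the role of Gambas's InStr; the markers are assumptions based on
the fragment in the original question:

Code: python
# Sketch of the "manual" approach: find one piece of data by its
# surrounding markers, without parsing the whole HTML structure.
def between(text, start_marker, end_marker):
    """Return the text between start_marker and the next end_marker."""
    start = text.find(start_marker)  # like InStr in Gambas
    if start == -1:
        return None                  # marker gone: page layout changed?
    start += len(start_marker)
    end = text.find(end_marker, start)
    return text[start:end] if end != -1 else None

html = '<div class="result"><div class="col">Text I Want</div></div>'
print(between(html, '<div class="col">', '</div>'))  # Text I Want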