  #1  
Old 31st August 2009, 03:16
unknowngeek
Need shell script for extracting website content from a list of given websites

Hello,

I used bash scripting for some time in the past and know quite a bit of
Linux, but I haven't used it in a long time.

I am between jobs and have taken a data-entry job where I need to do some statistics work on a very long list of websites.

I get these from a website called quantcast.com, which shows statistics
for a given site when I substitute each site from my list into the
SITENAME part of this URL:

http://www.quantcast.com/SITENAME/demographics


Here's the url for google.nl stats on quantcast:

http://www.quantcast.com/google.nl/demographics


Here's a screenshot example for google.nl in place of SITENAME above:

http://s995.photobucket.com/albums/a...=quantcast.jpg


(The problem is that quantcast.com presents the stats as images instead of text, so I will have a hard time getting the values out of the images.)


Once I have this kind of list, it has to go into an xls sheet, so I
will need the information in some file format that can easily be
converted or imported into Excel: CSV or similar, I guess.

I know this can be done, but I would need to relearn quite a lot of
bash scripting, since I have not used it in some time.

Can someone please help me write this script? I will be able to
understand the basics for sure.




This is what I had intended to do at first, when I didn't know that the stats are in images rather than text:


What I had intended to do was in these steps:

1. Copy/paste the site list into a text file. (Each site is already on
its own line.)

2. Use something like sed/awk to insert http://www.quantcast.com/
before each of these sitenames.

3. To each of the results in (2) above, append "/demographics"
to the end of the URL.

4. The result will now be a file with a link for each of the sites on
quantcast.com. Using wget, download each page and save it under some
name or number.

5. Tidy the html or somehow batch-convert the html to plain-text.

6. From each text file, extract the field names and their values (that
is, the numbers), and save this information for each site into a new
file that also includes only the original sitename (without the
quantcast.com part), using the cut command or something similar.

7. Convert this file into a CSV or tab-delimited format.

As I see it, the CSV conversion would be the last step.
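[Editor's note: steps 1-4 of the plan above could be sketched roughly like this in bash. The filenames sites.txt, urls.txt, and page_N.html are only example names, and the wget line is left commented out so nothing is fetched by accident.]

```shell
# Sample site list for illustration -- in practice this would be the
# pasted list, one site per line (step 1).
printf 'google.nl\nexample.com\n' > sites.txt

# Steps 2 and 3: prepend the base URL and append /demographics with sed.
sed -e 's|^|http://www.quantcast.com/|' -e 's|$|/demographics|' sites.txt > urls.txt

# Step 4: fetch each page, saving it under a numbered name.
n=0
while read -r url; do
    n=$((n + 1))
    : # wget -q -O "page_${n}.html" "$url"   # uncomment to actually download
done < urls.txt

cat urls.txt
```

Using `|` as the sed delimiter avoids having to escape the slashes in the URL.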

But first, I need to try to get each page, save it with a numbered
name, then extract the images.

Save the images for each site in separate folders.
(The image names are the same for every site; for example, demograp.png,
demograq.png, demograr.png, demogras.png.)

Then extract the content of each image using whatever utility I can
find. I hope this part really can be done.
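[Editor's note: the image step could be sketched as below. The sample HTML stands in for a real downloaded page, the paths are illustrative, and tesseract is only one possible OCR tool, assumed to be installed.]

```shell
# Stand-in for a downloaded demographics page; a real run would use the
# page saved by wget instead.
site="google.nl"
cat > "page_${site}.html" <<'EOF'
<img src="/demos/demograp.png"><img src="/demos/demograq.png">
EOF

# Pull the .png URLs out of the page and store them in a per-site folder.
mkdir -p "images/${site}"
grep -o 'src="[^"]*\.png"' "page_${site}.html" \
    | sed 's/^src="//; s/"$//' > "images/${site}/png_urls.txt"
cat "images/${site}/png_urls.txt"

# Each URL could then be fetched with wget -P "images/${site}" and run
# through an OCR tool, for example (assuming tesseract is available):
#   tesseract "images/${site}/demograp.png" "images/${site}/demograp"
```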



Any help is very much appreciated, and thanks.
Regards,
mowgli
  #2  
Old 28th September 2009, 02:29
cfajohnson

Quote:
Originally Posted by unknowngeek
(The problem now is that the quantcast.com site gives the stats in images instead of text so I will have a hard time getting the values out of the images.)

Do the images not have an alt text that you could use?
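[Editor's note: a quick way to test this, sketched here with a sample file standing in for a real saved page:]

```shell
# Stand-in HTML; the real check would run against a saved quantcast page.
cat > sample.html <<'EOF'
<img src="demograp.png" alt="Demographics"><img src="demograq.png" alt="Demographics">
EOF

# List the distinct alt texts -- if only a generic label comes back,
# the alt attributes won't help.
grep -o 'alt="[^"]*"' sample.html | sed 's/^alt="//; s/"$//' | sort -u
```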
  #3  
Old 28th September 2009, 09:41
unknowngeek
Unfortunately, no

Quote:
Originally Posted by cfajohnson
Do the images not have an alt text that you could use?
They don't have the text that is needed from the images, just a generic alt text that is the same for all the demographics pages.

Anyway, I've been typing them all in by hand and have completed about 3,000 of them; about 7,000 still remain.
  #4  
Old 14th October 2009, 18:44
PatrickMc
Collecting data from web pages

Yes, automating manual data entry with a scripting language is easy, and collecting and organizing data with biterscripting is very easy. I could write a custom script for you, but rather than me scratching my head, take a look at the Google group article http://groups.google.com/group/biter...d3e7d953b7dc10 , where they are doing exactly the same thing and have posted some scripts for collecting business addresses, telephone numbers, etc. from web pages.

Patrick