The www package

in the astronomy & astrophysics toolbox for MATLAB

Description:

The www package contains functions for interacting with the World Wide Web. These include functions to retrieve files via HTTP or FTP, parse HTML pages, access git, and construct HTML pages.

To view all the functions type: "www." followed by <tab>.

To avoid using the "www.pwget" syntax, you can use:

import www.*

help pwget

  Parallel wget to retrieve multiple files simultaneously
  Package: www
  Description: Parallel wget function designed to retrieve multiple files
               using parallel wget commands.
               If fast communication is available, running several wget
               commands in parallel allows almost linear increase in the
               download speed.
               After executing pwget.m it is difficult to kill it. In
               order to stop the execution while it is running you
               have to create a file named 'kill_pwget' in the directory
               in which pwget is running (e.g., "touch kill_pwget").
  Input  : - Cell array of URL file links to download.
             Alternatively a URL string.
           - Additional string to pass to the wget command
             e.g., '-q'. Default is empty string ''.
           - Maximum number of wget commands to run in parallel.
             Default is 5.
           - An optional URL base to concatenate to the beginning of each
             link. This is useful if the Links cell array contains only
             relative positions. Default is empty string ''.
             If empty matrix then use default.
  Output : Original names of retrieved files.
  Tested : Matlab 2012a
      By : Eran O. Ofek                    Oct 2012
     URL : http://weizmann.ac.il/home/eofek/matlab/
  Example: tic;www.pwget(Links,'',10);toc
  Speed  : On my computer in the Weizmann network I get the following
           results while trying to download 20 corrected SDSS fits images:
           MaxGet=1  runs in 83 seconds
           MaxGet=2  runs in 41 seconds
           MaxGet=5  runs in 19 seconds
           MaxGet=10 runs in 9 seconds
           MaxGet=20 runs in 6 seconds
  Reliable: 2
--------------------------------------------------------------------------

This file is accessible through manual.www

Credit

If you are using this code or products in your scientific publication please give a reference to Ofek (2014; ascl.soft 07005).

License

Unless specified otherwise, this code and its products are released under the GNU General Public License version 3.

Installation

See http://weizmann.ac.il/home/eofek/matlab/doc/install.html for installation instructions and additional documentation.

Selected examples

Find and retrieve URLs within a webpage:

Find all URLs in a web page, and return a cell array of URLs:

List=www.find_urls('http://www.weizmann.ac.il/home/eofek/matlab/');

You can also use regular expressions to search for URLs with a specific syntax. For example, search for URLs with the suffix ".fits.gz" (i.e., gzip-compressed FITS files):

URL = 'https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/';

List=www.find_urls(URL,'match','.*?\.fits\.gz');
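
To illustrate the idea of selecting links by a regular expression, here is a minimal sketch that uses only base MATLAB functions on a made-up HTML string. It is not the www.find_urls implementation; the HTML snippet and the file names are hypothetical.

```matlab
% Sketch: extract href targets from an HTML string and keep those that match
% a pattern. The HTML snippet and file names are made up for illustration.
Html = ['<a href="obs_img2.fits.gz">img</a> ', ...
        '<a href="readme.txt">readme</a> ', ...
        '<a href="obs_evt2.fits.gz">evt</a>'];

% Capture the quoted value of every href attribute
Tokens = regexp(Html, 'href="([^"]*)"', 'tokens');
Links  = cellfun(@(T) T{1}, Tokens, 'UniformOutput', false);

% Keep only the links that end with .fits.gz (both dots are escaped)
IsFits   = cellfun(@(L) ~isempty(regexp(L, '\.fits\.gz$', 'once')), Links);
FitsList = Links(IsFits);

disp(FitsList);
```

Escaping the dots in the pattern matters: an unescaped "." matches any character, so a looser pattern could also accept names that merely resemble the suffix.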

Parallel wget

wget is a powerful Linux command for downloading files over HTTP or FTP. Running several wget commands at once can shorten the total download time, sometimes by a factor of more than 10. The www.pwget command calls wget in parallel for multiple files to obtain this speedup.

The following command takes a cell array containing a list of URLs to retrieve and starts up to 10 parallel instances of wget:

www.pwget(List,'',10);

[1] 44431
[2] 44432
[3] 44433
[4] 44434
[5] 44435
[6] 44436
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libssl.so.1.0.0: no version information available (required by wget)
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libcrypto.so.1.0.0: no version information available (required by wget)
wget: wget/usr/local/MATLAB/R2016b/bin/glnxa64/libssl.so.1.0.0: no version information available (required by wget)
: /usr/local/MATLAB/R2016b/bin/glnxa64/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libssl.so.1.0.0: no version information available (required by wget)
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libssl.so.1.0.0: no version information available (required by wget)
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libssl.so.1.0.0: no version information available (required by wget)
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libssl.so.1.0.0: no version information available (required by wget)
--2017-01-20 08:50:36--  https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/hrcf08533N004_full_img2.fits.gz
--2017-01-20 08:50:36--  https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/hrcf08533N004_cntr_img2.fits.gz
--2017-01-20 08:50:36--  https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/hrcf08533_000N004_dtf1.fits.gz
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libssl.so.1.0.0: no version information available (required by wget)
--2017-01-20 08:50:36--  https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/hrcf08533_000N004_fov1.fits.gz
--2017-01-20 08:50:36--  https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/hrcf08533N004_evt2.fits.gz
--2017-01-20 08:50:36--  https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/orbitf301665901N001_eph1.fits.gz
--2017-01-20 08:50:36--  https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/pcadf301669698N002_asol1.fits.gz
Resolving heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)... Resolving heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)... Resolving heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)... Resolving heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)... Resolving heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)... Resolving heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)... 129.164.179.23, 2001:4d0:2310:150::23
129.164.179.23Connecting to heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)|129.164.179.23|:443... , 2001:4d0:2310:150::23
Connecting to heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)|129.164.179.23|:443... Resolving heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)... 129.164.179.23, 2001:4d0:2310:150::23
129.164.179.23, 2001:4d0:2310:150::23
Connecting to heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)|129.164.179.23|:443... Connecting to heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)|129.164.179.23|:443... 129.164.179.23, 2001:4d0:2310:150::23
Connecting to heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)|129.164.179.23|:443... 129.164.179.23, 2001:4d0:2310:150::23
Connecting to heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)|129.164.179.23|:443... 129.164.179.23, 2001:4d0:2310:150::23
Connecting to heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)|129.164.179.23|:443... connected.
connected.
connected.
connected.
connected.
connected.
connected.
ERROR: cannot verify heasarc.gsfc.nasa.gov's certificate, issued by ‘/C=US/O=Entrust, Inc./OU=See www.entrust.net/legal-terms/OU=(c) 2012 Entrust, Inc. - for authorized use only/CN=Entrust Certification Authority - L1K’:
  Unable to locally verify the issuer's authority.
To connect to heasarc.gsfc.nasa.gov insecurely, use `--no-check-certificate'.
ERROR: cannot verify heasarc.gsfc.nasa.gov's certificate, issued by ‘/C=US/O=Entrust, Inc./OU=See www.entrust.net/legal-terms/OU=(c) 2012 Entrust, Inc. - for authorized use only/CN=Entrust Certification Authority - L1K’:
  Unable to locally verify the issuer's authority.
To connect to heasarc.gsfc.nasa.gov insecurely, use `--no-check-certificate'.
ERROR: cannot verify heasarc.gsfc.nasa.gov's certificate, issued by ‘/C=US/O=Entrust, Inc./OU=See www.entrust.net/legal-terms/OU=(c) 2012 Entrust, Inc. - for authorized use only/CN=Entrust Certification Authority - L1K’:
  Unable to locally verify the issuer's authority.
To connect to heasarc.gsfc.nasa.gov insecurely, use `--no-check-certificate'.
ERROR: cannot verify heasarc.gsfc.nasa.gov's certificate, issued by ‘/C=US/O=Entrust, Inc./OU=See www.entrust.net/legal-terms/OU=(c) 2012 Entrust, Inc. - for authorized use only/CN=Entrust Certification Authority - L1K’:
  Unable to locally verify the issuer's authority.
To connect to heasarc.gsfc.nasa.gov insecurely, use `--no-check-certificate'.
ERROR: cannot verify heasarc.gsfc.nasa.gov's certificate, issued by ‘/C=US/O=Entrust, Inc./OU=See www.entrust.net/legal-terms/OU=(c) 2012 Entrust, Inc. - for authorized use only/CN=Entrust Certification Authority - L1K’:
  Unable to locally verify the issuer's authority.
To connect to heasarc.gsfc.nasa.gov insecurely, use `--no-check-certificate'.
[6]  - Exit 5                        wget https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/orbitf301665901N001_eph1.fits.gz
[5]  - Exit 5                        wget https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/hrcf08533_000N004_fov1.fits.gz
[4]  - Exit 5                        wget https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/hrcf08533_000N004_dtf1.fits.gz
[1]    Exit 5                        wget https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/hrcf08533N004_cntr_img2.fits.gz
Subscripted assignment dimension mismatch.

Error in Util.files.dir_cell (line 17)
    Dir(Icell) = dir(Cell{Icell});

Error in www.pwget (line 101)
    Dir1 = Util.files.dir_cell(Files);

You can pass additional options to the wget command as a string. The following example runs wget in quiet mode:

www.pwget(List,'-q',10);

% Sometimes wget fails because it cannot verify the server certificate (as in the output above). In this case, skip the certificate check:

www.pwget(List,'--no-check-certificate',10);
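
The general idea of running several wget processes at once can be sketched by building shell command strings. The sketch below is not the www.pwget source. It assumes a POSIX-like shell in which "&" puts a job in the background and "wait" blocks until all background jobs finish, and it uses hypothetical URLs. It simplifies the scheduling by waiting for a whole batch before starting the next one.

```matlab
% Sketch: split a URL list into batches of at most MaxGet and build one shell
% command per batch that runs the wget calls in the background.
% Assumption: POSIX-like shell ("&" backgrounds a job, "wait" waits for all).
Urls   = {'http://example.org/a.fits', 'http://example.org/b.fits', ...
          'http://example.org/c.fits'};   % hypothetical URLs
Extra  = '-q';      % extra options passed to every wget call
MaxGet = 2;         % maximum number of simultaneous wget processes

Nurl   = numel(Urls);
Nbatch = ceil(Nurl / MaxGet);
Cmds   = cell(1, Nbatch);
for Ibatch = 1:Nbatch
    % Indices of the URLs that belong to this batch
    Ind   = (Ibatch-1)*MaxGet + 1 : min(Ibatch*MaxGet, Nurl);
    Parts = cellfun(@(U) sprintf('wget %s "%s" &', Extra, U), Urls(Ind), ...
                    'UniformOutput', false);
    Cmds{Ibatch} = [strjoin(Parts, ' '), ' wait'];
end

% Print the commands instead of running them; pass each one to system()
% to actually execute it.
for Ibatch = 1:Nbatch
    disp(Cmds{Ibatch});
end
```

Printing the commands first makes it easy to inspect what would be executed before any download starts.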

FTP file listing

www.ftp_dir_list can be used to recursively list the files in an FTP directory tree.

FullURL=www.ftp_dir_list('ftp://legacy.gsfc.nasa.gov/chandra/data/science/ao01/cat1/1/primary/');

Retrieve all files in an FTP site

The following example demonstrates how to use the www.rftpget command to recursively retrieve (using the Linux wget command) all files in an FTP directory tree:

[DirsList,DirsListURL,FilesList,FilesListURL]=www.rftpget('ftp://legacy.gsfc.nasa.gov/chandra/data/science/ao01/cat1/1/','rftp',1);

Parsing HTML tables

Sometimes we would like to read online HTML tables directly into MATLAB. In many cases, but not always, www.parse_html_table can handle this task.

The following example will parse the OGLE 2008 microlensing events into a cell array:

[Cell,Header]=www.parse_html_table('http://ogle.astrouw.edu.pl/ogle3/ews/2008/ews.html',1,'y','y','cm');
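
For readers curious about what parsing an HTML table involves, the following minimal sketch extracts rows and cells from a small, made-up table using base MATLAB regular expressions. It is not the www.parse_html_table implementation. It assumes well-formed, non-nested tr/td/th tags, and the event names and values are invented.

```matlab
% Sketch: parse a simple HTML table into a header and a cell array of strings.
% Assumption: well-formed, non-nested <tr>, <th> and <td> tags.
Html = ['<table><tr><th>Event</th><th>I0</th></tr>', ...
        '<tr><td>EV-001</td><td>17.5</td></tr>', ...
        '<tr><td>EV-002</td><td>18.2</td></tr></table>'];   % made-up table

% One token per table row
Rows   = regexp(Html, '<tr[^>]*>(.*?)</tr>', 'tokens');
Parsed = cell(numel(Rows), 1);
for Irow = 1:numel(Rows)
    % One token per cell (header or data) within the row
    CellTok      = regexp(Rows{Irow}{1}, '<t[dh][^>]*>(.*?)</t[dh]>', 'tokens');
    Parsed{Irow} = cellfun(@(T) T{1}, CellTok, 'UniformOutput', false);
end

Header = Parsed{1};              % first row holds the column names
Data   = vertcat(Parsed{2:end}); % remaining rows, one table row per cell row

disp(Header);
disp(Data);
```

All values are returned as strings; numeric columns can be converted afterwards, for example with str2double(Data(:,2)). Real-world pages often contain nested or malformed tags, which is one reason a simple approach like this does not always work.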

Construct HTML webpages

You can use the commands www.html_page and www.html_table to construct an HTML page or table.

cgibin parsing

The www.cgibin_parse_query_str command can be used to parse a URL query string into parameter-value pairs. The command either takes a query string as input or reads it from the Linux environment variable QUERY_STRING. The following example parses a query string:

[ST,CT]=www.cgibin_parse_query_str('q=parse+post+get+cgi-bin+matlav&oq=parse+post+get+cgi-bin+matlav');
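
The splitting step can be sketched with base MATLAB functions as follows. This is not the www.cgibin_parse_query_str source. It assumes that pairs are separated by "&", that names and values are separated by "=", and that "+" encodes a space. Percent-decoding is omitted, and the query string is hypothetical.

```matlab
% Sketch: split a query string into parameter names and values.
% Assumptions: "&" separates pairs, "=" separates name and value, "+" encodes
% a space; percent-decoding (e.g., %20) is not handled here.
Query = 'q=parse+post+get&lang=en';   % hypothetical query string
% In a CGI script the string could instead be read with getenv('QUERY_STRING')

Pairs  = strsplit(Query, '&');
Names  = cell(size(Pairs));
Values = cell(size(Pairs));
for I = 1:numel(Pairs)
    % First token: everything before the first "="; second token: the rest
    Tok       = regexp(Pairs{I}, '^([^=]*)=?(.*)$', 'tokens', 'once');
    Names{I}  = Tok{1};
    Values{I} = strrep(Tok{2}, '+', ' ');
end

for I = 1:numel(Names)
    fprintf('%s : %s\n', Names{I}, Values{I});
end
```

Keeping names and values in two parallel cell arrays preserves the original order of the parameters and allows repeated parameter names.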

GIT access

The www.git command calls git with specific parameters, while the www.gitac command adds and commits a list of files to a repository.