The www package
in the astronomy & astrophysics toolbox for MATLAB
Description:
The www package contains functions to interact with the World Wide Web. This includes functions to retrieve files via the WWW or FTP, parse HTML pages, access git, and construct HTML pages.
To view all the functions type: "www." followed by <tab>.
To avoid using the "www.pwget" syntax, you can use:
help pwget
Parallel wget to retrieve multiple files simultaneously
Package: www
Description: Parallel wget function designed to retrieve multiple files
using parallel wget commands.
If fast communication is available, running several wget
commands in parallel allows almost linear increase in the
download speed.
After executing pwget.m it is difficult to kill it. In
order to stop the execution while it is running you
have to create a file named 'kill_pwget' in the directory
in which pwget is running (e.g., "touch kill_pwget").
Input : - Cell array of URL file links to download.
Alternatively a URL string.
- Additional string to pass to the wget command
e.g., '-q'. Default is empty string ''.
- Maximum number of wget commands to run in parallel.
Default is 5.
- An optional URL base to concatenate to the beginning of each
link. This is useful if the Links cell array contains only
relative positions. Default is empty string ''.
If empty matrix then use default.
Output : Original names of retrieved files.
Tested : Matlab 2012a
By : Eran O. Ofek Oct 2012
URL : http://weizmann.ac.il/home/eofek/matlab/
Example: tic;www.pwget(Links,'',10);toc
Speed : On my computer in the Weizmann network I get the following
results while trying to download 20 corrected SDSS fits images:
MaxGet=1 runs in 83 seconds
MaxGet=2 runs in 41 seconds
MaxGet=5 runs in 19 seconds
MaxGet=10 runs in 9 seconds
MaxGet=20 runs in 6 seconds
Reliable: 2
--------------------------------------------------------------------------
This file is accessible through manual.www
Credit
If you are using this code or products in your scientific publication please give a reference to Ofek (2014; ascl.soft 07005).
License
Unless specified otherwise this code and products are released under the GNU general public license version 3.
Installation
See http://weizmann.ac.il/home/eofek/matlab/doc/install.html for installation instructions and additional documentation.
Selected examples
Find and retrieve URLs within a webpage:
Find all URLs in a web page, and return a cell array of URLs:
List=www.find_urls('http://www.weizmann.ac.il/home/eofek/matlab/');
You can also use regular expressions to search for URLs with a specific syntax, for example URLs ending in ".fits.gz" (i.e., gzipped FITS files):
URL = 'https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/';
List=www.find_urls(URL,'match','.*?\.fits\.gz');
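The same kind of filtering can be applied to any cell array of URLs after the fact. The following is a minimal sketch using only the built-in regexp function; the URLs are hypothetical placeholders and this is not how www.find_urls is implemented.

```matlab
% Hypothetical list of URLs (illustrative only; not output of www.find_urls).
List = {'http://example.com/data/img1.fits.gz', ...
        'http://example.com/data/readme.html', ...
        'http://example.com/data/img2.fits.gz'};

% Keep only entries whose name ends in ".fits.gz".
% For a cell array input, regexp returns a cell array of match start
% indices; an empty cell means the pattern did not match.
IsFits   = ~cellfun(@isempty, regexp(List, '\.fits\.gz$', 'once'));
FitsList = List(IsFits);

disp(FitsList(:));
```

Escaping the dots (\.) makes them match a literal "." rather than any character.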
Parallel wget
wget is a powerful Linux command for downloading files from the web or FTP. Running multiple wget commands at once can shorten the download time, sometimes by a factor of more than 10. The www.pwget command calls wget in parallel for multiple files, thereby speeding up the download.
The following command takes a cell array containing a list of URLs to retrieve and initiates 10 parallel instances of the Linux wget command:
www.pwget(List,'',10);
[1] 44431
[2] 44432
[3] 44433
[4] 44434
[5] 44435
[6] 44436
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libssl.so.1.0.0: no version information available (required by wget)
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libcrypto.so.1.0.0: no version information available (required by wget)
wget: wget/usr/local/MATLAB/R2016b/bin/glnxa64/libssl.so.1.0.0: no version information available (required by wget)
: /usr/local/MATLAB/R2016b/bin/glnxa64/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libssl.so.1.0.0: no version information available (required by wget)
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libssl.so.1.0.0: no version information available (required by wget)
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libssl.so.1.0.0: no version information available (required by wget)
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libssl.so.1.0.0: no version information available (required by wget)
--2017-01-20 08:50:36-- https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/hrcf08533N004_full_img2.fits.gz
--2017-01-20 08:50:36-- https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/hrcf08533N004_cntr_img2.fits.gz
--2017-01-20 08:50:36-- https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/hrcf08533_000N004_dtf1.fits.gz
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /usr/local/MATLAB/R2016b/bin/glnxa64/libssl.so.1.0.0: no version information available (required by wget)
--2017-01-20 08:50:36-- https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/hrcf08533_000N004_fov1.fits.gz
--2017-01-20 08:50:36-- https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/hrcf08533N004_evt2.fits.gz
--2017-01-20 08:50:36-- https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/orbitf301665901N001_eph1.fits.gz
--2017-01-20 08:50:36-- https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/pcadf301669698N002_asol1.fits.gz
Resolving heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)... Resolving heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)... Resolving heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)... Resolving heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)... Resolving heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)... Resolving heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)... 129.164.179.23, 2001:4d0:2310:150::23
129.164.179.23Connecting to heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)|129.164.179.23|:443... , 2001:4d0:2310:150::23
Connecting to heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)|129.164.179.23|:443... Resolving heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)... 129.164.179.23, 2001:4d0:2310:150::23
129.164.179.23, 2001:4d0:2310:150::23
Connecting to heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)|129.164.179.23|:443... Connecting to heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)|129.164.179.23|:443... 129.164.179.23, 2001:4d0:2310:150::23
Connecting to heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)|129.164.179.23|:443... 129.164.179.23, 2001:4d0:2310:150::23
Connecting to heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)|129.164.179.23|:443... 129.164.179.23, 2001:4d0:2310:150::23
Connecting to heasarc.gsfc.nasa.gov (heasarc.gsfc.nasa.gov)|129.164.179.23|:443... connected.
connected.
connected.
connected.
connected.
connected.
connected.
ERROR: cannot verify heasarc.gsfc.nasa.gov's certificate, issued by ‘/C=US/O=Entrust, Inc./OU=See www.entrust.net/legal-terms/OU=(c) 2012 Entrust, Inc. - for authorized use only/CN=Entrust Certification Authority - L1K’:
Unable to locally verify the issuer's authority.
To connect to heasarc.gsfc.nasa.gov insecurely, use `--no-check-certificate'.
ERROR: cannot verify heasarc.gsfc.nasa.gov's certificate, issued by ‘/C=US/O=Entrust, Inc./OU=See www.entrust.net/legal-terms/OU=(c) 2012 Entrust, Inc. - for authorized use only/CN=Entrust Certification Authority - L1K’:
Unable to locally verify the issuer's authority.
To connect to heasarc.gsfc.nasa.gov insecurely, use `--no-check-certificate'.
ERROR: cannot verify heasarc.gsfc.nasa.gov's certificate, issued by ‘/C=US/O=Entrust, Inc./OU=See www.entrust.net/legal-terms/OU=(c) 2012 Entrust, Inc. - for authorized use only/CN=Entrust Certification Authority - L1K’:
Unable to locally verify the issuer's authority.
To connect to heasarc.gsfc.nasa.gov insecurely, use `--no-check-certificate'.
ERROR: cannot verify heasarc.gsfc.nasa.gov's certificate, issued by ‘/C=US/O=Entrust, Inc./OU=See www.entrust.net/legal-terms/OU=(c) 2012 Entrust, Inc. - for authorized use only/CN=Entrust Certification Authority - L1K’:
Unable to locally verify the issuer's authority.
To connect to heasarc.gsfc.nasa.gov insecurely, use `--no-check-certificate'.
ERROR: cannot verify heasarc.gsfc.nasa.gov's certificate, issued by ‘/C=US/O=Entrust, Inc./OU=See www.entrust.net/legal-terms/OU=(c) 2012 Entrust, Inc. - for authorized use only/CN=Entrust Certification Authority - L1K’:
Unable to locally verify the issuer's authority.
To connect to heasarc.gsfc.nasa.gov insecurely, use `--no-check-certificate'.
[6] - Exit 5 wget https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/orbitf301665901N001_eph1.fits.gz
[5] - Exit 5 wget https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/hrcf08533_000N004_fov1.fits.gz
[4] - Exit 5 wget https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/hrcf08533_000N004_dtf1.fits.gz
[1] Exit 5 wget https://heasarc.gsfc.nasa.gov/FTP/chandra/data/science/ao09/cat4/8533/primary/hrcf08533N004_cntr_img2.fits.gz
Subscripted assignment dimension mismatch.
Error in Util.files.dir_cell (line 17)
Dir(Icell) = dir(Cell{Icell});
Error in www.pwget (line 101)
Dir1 = Util.files.dir_cell(Files);
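You can pass additional options to the wget command as a string, e.g., '-q' to run wget in quiet mode. If wget cannot verify the server certificate, the download fails. The following example skips the certificate check:
% Sometimes wget will do nothing because the URL is untrusted; in this case use: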
www.pwget(Links,'--no-check-certificate',10);
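To illustrate the idea of batching downloads, the sketch below builds shell command strings that start up to MaxGet wget processes in the background and then wait for the batch to finish. It is a rough illustration under stated assumptions (a POSIX-like shell, hypothetical URLs, and variable names chosen here), not the actual implementation of www.pwget. The commands are only printed, not executed.

```matlab
% Hypothetical inputs (illustrative only).
Links  = {'http://example.com/a.fits', 'http://example.com/b.fits', ...
          'http://example.com/c.fits'};
Extra  = '-q';     % additional wget options
MaxGet = 2;        % maximum number of simultaneous wget processes

Nlinks = numel(Links);
Cmds   = {};
for Istart = 1:MaxGet:Nlinks
    Iend  = min(Istart + MaxGet - 1, Nlinks);
    Batch = Links(Istart:Iend);
    % One background wget per file, then "wait" for the whole batch.
    Parts = cellfun(@(U) sprintf('wget %s "%s" &', Extra, U), Batch, ...
                    'UniformOutput', false);
    Cmds{end+1} = [strjoin(Parts, ' '), ' wait']; %#ok<AGROW>
end

% Print the commands instead of running them; each string could be
% passed to system() to actually start the downloads.
fprintf('%s\n', Cmds{:});
```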
FTP file listing
www.ftp_dir_list can be used to recursively search for file listings in an FTP directory tree.
FullURL=www.ftp_dir_list('ftp://legacy.gsfc.nasa.gov/chandra/data/science/ao01/cat1/1/primary/');
Retrieve all files in an FTP site
The following example demonstrates how to use the www.rftpget command to recursively retrieve (using the Linux wget command) all files in an FTP directory tree:
[DirsList,DirsListURL,FilesList,FilesListURL]=www.rftpget('ftp://legacy.gsfc.nasa.gov/chandra/data/science/ao01/cat1/1/','rftp',1);
Parsing HTML tables
Sometimes we would like to read online HTML tables directly into MATLAB. In many cases, but not always, www.parse_html_table can handle this task.
The following example will parse the OGLE 2008 microlensing events into a cell array:
[Cell,Header]=www.parse_html_table('http://ogle.astrouw.edu.pl/ogle3/ews/2008/ews.html',1,'y','y','cm');
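For intuition about what such parsing involves, here is a minimal sketch that extracts the cells of a small, well-formed HTML table with regular expressions. The HTML string and variable names are hypothetical, and this is not the algorithm used by www.parse_html_table. Real-world pages (nested tables, missing closing tags) generally need more care.

```matlab
% Hypothetical, well-formed HTML table (illustrative only).
Html = ['<table><tr><th>Event</th><th>Mag</th></tr>', ...
        '<tr><td>ev-1</td><td>15.2</td></tr>', ...
        '<tr><td>ev-2</td><td>17.9</td></tr></table>'];

% Each token is the content of one <tr>...</tr> row.
Rows = regexp(Html, '<tr[^>]*>(.*?)</tr>', 'tokens');

Ncol = 2;                       % number of columns, assumed known here
Tab  = cell(numel(Rows), Ncol);
for Irow = 1:numel(Rows)
    % Each token is the content of one <th> or <td> cell in the row.
    Cols = regexp(Rows{Irow}{1}, '<t[hd][^>]*>(.*?)</t[hd]>', 'tokens');
    Tab(Irow, :) = [Cols{:}];
end

disp(Tab);
```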
Construct HTML webpages
You can use the commands www.html_page and www.html_table to construct an HTML page or table.
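As a rough illustration of what constructing an HTML table involves, the sketch below builds the table markup from a cell array using sprintf only. It deliberately does not call www.html_page or www.html_table, whose signatures are not described here; the data and variable names are hypothetical, and no HTML escaping is performed.

```matlab
% Hypothetical table content (illustrative placeholders only).
Header = {'Name', 'X', 'Y'};
Data   = {'obj1', '10.0', '-5.0'; ...
          'obj2', '20.0', '3.5'};

% Header row: sprintf repeats the format once per cell in Header.
Html = sprintf('<table>\n<tr>%s</tr>\n', sprintf('<th>%s</th>', Header{:}));

% One <tr> per data row, one <td> per column.
for Irow = 1:size(Data, 1)
    Html = [Html, sprintf('<tr>%s</tr>\n', ...
                          sprintf('<td>%s</td>', Data{Irow, :}))]; %#ok<AGROW>
end
Html = [Html, sprintf('</table>\n')];

fprintf('%s', Html);
```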
cgibin parsing
The www.cgibin_parse_query_str command can be used to parse an HTML query string into parameter-value pairs. The command either takes a query string as input or reads it from the Linux environment variable QUERY_STRING. The following example parses a query string:
[ST,CT]=www.cgibin_parse_query_str('q=parse+post+get+cgi-bin+matlav&oq=parse+post+get+cgi-bin+matlav');
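The structure of a query string can be seen by splitting it by hand. The following minimal sketch splits the same string at '&' and '=' and converts '+' to spaces; the variable names are chosen here for illustration, percent-decoding (e.g., %20) is omitted, and this is not the implementation of www.cgibin_parse_query_str.

```matlab
Query = 'q=parse+post+get+cgi-bin+matlav&oq=parse+post+get+cgi-bin+matlav';

% Parameters are separated by '&'.
Pairs = strsplit(Query, '&');
Keys  = cell(size(Pairs));
Vals  = cell(size(Pairs));
for I = 1:numel(Pairs)
    % Split at the first '=' only, so that values may contain '='.
    Ieq = find(Pairs{I} == '=', 1);
    if isempty(Ieq)
        Keys{I} = Pairs{I};
        Vals{I} = '';
    else
        Keys{I} = Pairs{I}(1:Ieq-1);
        % In query strings '+' encodes a space.
        Vals{I} = strrep(Pairs{I}(Ieq+1:end), '+', ' ');
    end
end

for I = 1:numel(Keys)
    fprintf('%s -> %s\n', Keys{I}, Vals{I});
end
```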
GIT access
The www.git command calls the git command with specific parameters, while the www.gitac command adds and commits a list of files to a repository.