NAME

HTML::ParseTables - Extract text from HTML tables. Version 0.05 (pre-alpha)


SYNOPSIS

    use HTML::ParseTables;

    my $p = HTML::ParseTables->new();
    if (my $html = shift) {
        $p->parse_file($html);
    }
    else {
        while (<DATA>) { $p->parse($_) }
    }
    
    print $p->table_count, " tables found\n";
    my %h = $p->all_tables;

    foreach my $table (sort { $a <=> $b } keys %h) {
        my @rows = $p->get_table($table);
        my $row_count = 0;
        print "\nTABLE $table. ",
              $p->row_count($table),
              " rows:\n";
        foreach my $row (@rows) {
            print ++$row_count,
                  " (", scalar(@{$row}), " cells)\t",
                  join("\t", @{$row}), "\n";
        }
    }

    print "Table 1, Cell B2    : ",
          $p->get_cell(1, 'B2'), "\n";
    print "Last table, cell A1 : ",
          $p->get_cell('A1'), "\n";

    __DATA__

    <HTML><BODY>
    <P> paragraph before table </P>
    <TABLE>
        <TR> <TD>A1</TD> <TD>B1</TD> </TR>
        <TR> <TD>A2</TD> <TD>B2</TD> </TR>
    </TABLE>
    <TABLE>
        <TR> <TD>T2-A1</TD> <TD>T2-B1</TD> </TR>
        <TR> <TD>T2-A2</TD> <TD>T2-B2</TD> </TR>
    </TABLE>


DESCRIPTION

Easy extraction of text from HTML documents containing tables. The module tries to offer an intuitive interface to table content. In particular, it accepts several notations for addressing individual cells, including the popular spreadsheet "B2" notation.

This version should be considered "pre-alpha": it may contain many bugs, much of it is undocumented, and the interface may change quickly. The only reliable documentation is the code itself. I have no time to polish it now, but I was asked to post it, so here it is.

It works well for me in a few scripts that run daily, so hopefully you can use it too.


DETAILS

Important: the module uses 1, not 0, as the index of the first table, column, row and cell!


%config

Is meant to allow setting user preferences for output and for what is retained during parsing.


get_table([$table])

Returns table $table as a list of rows, each row being a reference to a list of cells. The first table is table 1. Without an argument, returns the last table.
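To illustrate the shape described above, here is a minimal sketch that walks such a list of rows. It does not call the module: @rows is a hand-built stand-in for what get_table(1) is described as returning, and numbered_lines is a hypothetical helper name used only for this example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the list get_table(1) is described as returning:
# one array reference per row, each holding that row's cell strings.
my @rows = ( [ 'A1', 'B1' ], [ 'A2', 'B2' ] );

# Build one printable line per row. The Perl list is 0-based, so we add 1
# to get the row number the module would use.
sub numbered_lines {
    my @table = @_;
    my @lines;
    for my $i (0 .. $#table) {
        my @cells = @{ $table[$i] };
        push @lines, join("\t", $i + 1, @cells);
    }
    return @lines;
}

print "$_\n" for numbered_lines(@rows);
```

With a real parser object, the same loop should work on the result of $p->get_table(1).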


get_table_as_text([$table])

Returns table $table as a string, with newlines between rows. The separator between cells depends on $config{format}. Without an argument, returns the last table.
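The layout described above can be approximated by hand as follows. This is only a sketch: rows_to_text is a hypothetical helper, the separator is passed explicitly because the values accepted by $config{format} are not documented here, and whether the module adds a trailing newline is not known.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: join the cells of each row
# with $sep, then join the rows with newlines (no trailing newline).
sub rows_to_text {
    my ($sep, @table) = @_;
    return join("\n", map { join($sep, @$_) } @table);
}

# Hand-built rows in the shape get_table is described as returning.
my @rows = ( [ 'A1', 'B1' ], [ 'A2', 'B2' ] );
print rows_to_text("\t", @rows), "\n";
```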


get_row([$table,] $row)

Returns row $row from table $table as a list of cells. If $table is omitted, uses the last table. The first row is row 1.


get_cell()

Accepts several argument formats:

    get_cell($table, $column, $row)
    get_cell($table, 'B3')
    get_cell($column, $row)
    get_cell('A5')

Returns the cell content. If $table is omitted, uses the last table.
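For readers unfamiliar with the spreadsheet notation, the following sketch shows one way a reference such as 'B3' can be split into 1-based column and row numbers. parse_cell_ref is a hypothetical helper and not the module's own code; in particular, treating multi-letter columns such as 'AA' as column 27 is an assumption borrowed from spreadsheets, and the module may or may not support them.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::ParseTables: split a spreadsheet
# reference such as 'B3' into 1-based (column, row) numbers.
# Returns an empty list if the string is not a valid reference.
sub parse_cell_ref {
    my ($ref) = @_;
    my ($letters, $row) = $ref =~ /^([A-Za-z]+)([1-9][0-9]*)$/
        or return;
    my $col = 0;
    # Read the letters as a base-26 number with digits A=1 .. Z=26.
    for my $ch (split //, uc $letters) {
        $col = $col * 26 + (ord($ch) - ord('A') + 1);
    }
    return ($col, $row);
}

my ($col, $row) = parse_cell_ref('B3');
print "column $col, row $row\n";    # column 2, row 3
```

Under these assumptions, get_cell($table, 'B3') and get_cell($table, 2, 3) would address the same cell.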


table_count()

Returns the number of tables found. Takes no arguments.


row_count([$table])

Returns the number of rows in table $table, or in the last table if $table is omitted.


cell_count($table, $row)

Returns the number of cells in row $row of table $table.


all_tables

Returns a hash with all tables. Keys are numbers from 1 to the number of tables. Values are references to lists of lists (rows of cells).
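Because the keys are numbers, they are best sorted numerically when walking the hash; a plain string sort would place tables 10 and 11 before table 2. The sketch below shows this on a hand-built hash standing in for the return value of all_tables; table_order is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: return the table numbers in ascending numeric order.
sub table_order {
    my %tables = @_;
    return sort { $a <=> $b } keys %tables;
}

# Hand-built stand-in for the hash all_tables is described as returning:
# table number => reference to a list of rows (each a list of cells).
my %tables;
for my $n (1 .. 11) {
    $tables{$n} = [ [ "T$n-A1", "T$n-B1" ] ];
}

for my $n (table_order(%tables)) {
    my $rows = $tables{$n};
    printf "table %d: %d row(s), first cell %s\n",
        $n, scalar @$rows, $rows->[0][0];
}
```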


LIMITATIONS

Lots for now:

Doesn't handle nested tables.

Doesn't understand colspan and rowspan.

Documentation incomplete and possibly even wrong.

This is really not finished.

...?


BUGS

Let me know what you find.


AUTHOR

Milivoj Ivkovic <mi@alma.ch>


COPYRIGHT

Copyright Milivoj Ivkovic, 1999. Same license as Perl itself.