FRDCSA:

sde


Structured Data Extractor (sde) is an implementation of DEPTA (Data Extraction based on Partial Tree Alignment), a method for extracting data from web pages (HTML documents). DEPTA was invented by Yanhong Zhai and Bing Liu of the University of Illinois at Chicago and published in their paper "Structured Data Extraction from the Web based on Partial Tree Alignment" (IEEE Transactions on Knowledge and Data Engineering, 2006). Given a web page, sde detects the data records it contains and extracts them into a table structure (rows and columns). You can download the application from this link: Download Structured Data Extractor.
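
Tree-alignment methods need a way to score how similar two tag trees are. One classic scoring routine is Simple Tree Matching, which counts the largest number of node pairs that can be matched while preserving sibling order. The sketch below is only an illustration of that idea and is not taken from the sde source code; the class SimpleTreeMatchingSketch, the TagNode type, and the match method are hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.List;

class SimpleTreeMatchingSketch {

    // Hypothetical minimal tag tree: a tag name plus ordered children.
    static final class TagNode {
        final String tag;
        final List<TagNode> children;

        TagNode(String tag, TagNode... children) {
            this.tag = tag;
            this.children = Arrays.asList(children);
        }
    }

    // Returns the size of the largest order-preserving matching between
    // the two trees. Trees whose roots differ do not match at all.
    static int match(TagNode a, TagNode b) {
        if (!a.tag.equals(b.tag)) {
            return 0;
        }
        int m = a.children.size();
        int n = b.children.size();
        // best[i][j]: best score using the first i children of a
        // and the first j children of b.
        int[][] best = new int[m + 1][n + 1];
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                int paired = best[i - 1][j - 1]
                        + match(a.children.get(i - 1), b.children.get(j - 1));
                best[i][j] = Math.max(paired,
                        Math.max(best[i - 1][j], best[i][j - 1]));
            }
        }
        // Add 1 for the matched pair of roots.
        return best[m][n] + 1;
    }

    public static void main(String[] args) {
        // Two table rows that differ only by one trailing cell.
        TagNode row1 = new TagNode("tr",
                new TagNode("td"),
                new TagNode("td", new TagNode("a")));
        TagNode row2 = new TagNode("tr",
                new TagNode("td"),
                new TagNode("td", new TagNode("a")),
                new TagNode("td"));
        System.out.println("Matched nodes: " + match(row1, row2));
    }
}
```

A high score relative to the tree sizes suggests that two subtrees are instances of the same repeated record layout, which is the kind of signal a record detector can use.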

Usage

1. The URI_input parameter can refer to a local file or a remote file, as long as it is a valid URI. A URI referring to a local file must be preceded by "file:///". For example, in a Windows environment: "file:///D:/Development/Proyek/structured_data_extractor/bin/input/input.html", or in a UNIX environment: "file:///home/seagate/input/input.html".
2. Extracted data can be viewed in the output file. The output file is an HTML document, and the extracted data is presented in HTML tables.
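
The "file:///" form required for local files can be produced from an ordinary path with the standard Java library rather than written by hand. The following is a minimal sketch, not part of sde itself; the class FileUriExample and its methods toFileUri and isAbsoluteUri are hypothetical helpers introduced for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.file.Path;
import java.nio.file.Paths;

class FileUriExample {

    // Converts a local filesystem path into a "file:///..." URI string.
    static String toFileUri(String localPath) {
        Path absolute = Paths.get(localPath).toAbsolutePath().normalize();
        return absolute.toUri().toString();
    }

    // Returns true if the text parses as a URI that has a scheme
    // (for example "file" or "http"), false otherwise.
    static boolean isAbsoluteUri(String text) {
        try {
            return new URI(text).isAbsolute();
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String uri = toFileUri("/home/seagate/input/input.html");
        System.out.println(uri);
        System.out.println(isAbsoluteUri(uri));
        // A bare file name has no scheme, so it is not an absolute URI.
        System.out.println(isAbsoluteUri("input.html"));
    }
}
```

Checking the value before passing it as URI_input catches the common mistake of supplying a bare path without the "file:///" prefix.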

Source Code

SDE source code is available at GitHub.

Dependencies

SDE was developed using these libraries:
- Neko HTML Parser by Andy Clark and Marc Guillemot. Licensed under the Apache License, Version 2.0.
- Xerces by The Apache Software Foundation. Licensed under the Apache License, Version 2.0.

License

SDE is licensed under the MIT license.

Author

Sigit Dewanto, sigitdewanto11[at]yahoo[dot]co[dot]uk, 2009.
