by jamesra » Sat Jan 11, 2014 6:15 am
In this particular case I'm using CEF to render an off-screen version of a web page, and then I want to do content extraction (title, image, text). In academia, most published papers on content extraction use only the HTML code for extraction and no style or layout information, which I believe would improve the process a lot.
I currently have a prototype of a new content extraction algorithm running as a Chrome plug-in. Using a plug-in is an easy way to try out different content extraction approaches, as I can easily get my hands on the layout/style information and HTML code of a rendered page. However, I have now taken steps to convert the CE algorithm to C/C++ and would like to use CEF to render an off-screen version of a web page, get access via the CEF API to the HTML source plus the required layout/style information per DOM node, and then finally run my algorithm on top of it all. The layout/style information that I'm currently using includes the x,y coordinates and width & height of each DOM node, css-display, css-fontfamily, css-fontstyle, css-fontweight, css-fontsize, css-color, css-visibility, css-float.
Based on your reply, I'm assuming that I can't get direct access to the CSS attributes of each DOM node through the API. Or is there a (relatively easy) way to do it? Naturally I would need the "computed" version of each DOM node's style attributes, i.e. the combined, actual values, including those inherited from parent nodes.
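
To make it concrete what I mean by "computed", here is a minimal C++ sketch of the lookup I have in mind. Everything in it is hypothetical and for illustration only: Node, IsInherited and ResolveStyle are names I made up, not part of the CEF API. It also simplifies a lot, since it ignores the cascade, initial values and relative units; it only shows that inherited properties (color, font-*, visibility) fall back to the parent's value while display and float do not.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical node type for illustration; this is NOT a CEF class.
struct Node {
    std::string tag;
    // Style values declared directly on this node, keyed by CSS property name.
    std::map<std::string, std::string> declared;
    const Node* parent = nullptr;
};

// Of the properties I use, these are inherited by default in CSS.
// display and float are not inherited, so they are absent here.
bool IsInherited(const std::string& prop) {
    static const std::set<std::string> kInherited = {
        "color", "font-family", "font-size",
        "font-style", "font-weight", "visibility"};
    return kInherited.count(prop) > 0;
}

// Returns the value declared on the node itself, or, for inherited
// properties only, the nearest ancestor's value. Returns an empty
// string when nothing is found (a real engine would use the
// property's initial value instead).
std::string ResolveStyle(const Node& node, const std::string& prop) {
    for (const Node* n = &node; n != nullptr; n = n->parent) {
        auto it = n->declared.find(prop);
        if (it != n->declared.end()) {
            return it->second;
        }
        if (!IsInherited(prop)) {
            break;  // non-inherited: do not look at ancestors
        }
    }
    return std::string();
}
```

With a body node that declares color and display, and a child p node that declares nothing, ResolveStyle(p, "color") returns the body's color, while ResolveStyle(p, "display") comes back empty. What I'm hoping to get from CEF is the equivalent of this, but with the real values already computed by the rendering engine.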