I'm trying to scrape some HTML (with permission from the author). I was using the PHP library suggested here, and it was working well until I encountered a link that looks like this:
I believe it's an ASP.NET thing. Clicking it doesn't change the URL; it just loads some new content into the page, which I'd also like to scrape.
How can I get around this?
I suppose I would need to simulate the click, but I can't do that when processing raw HTML. I'd need some kind of browser or JS interpreter, no?
Is there a better-suited library for this task? I'm not limited to PHP, but it's preferred.
Answer
I ended up using Python with the Selenium Firefox WebDriver. Since it drives a real browser, I can do everything Firefox can, including clicking links that run JavaScript.
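For anyone landing here later, this is roughly the shape of it. It's a minimal sketch, not my exact script: the URL and link text are placeholders, the helper names (click_and_capture, scrape_with_firefox) are made up for illustration, and it assumes the selenium package plus Firefox and geckodriver are installed. It finds the link by its visible text, clicks it, and polls until the page source differs from what it was before the click. That's a crude "new content has loaded" signal, and Selenium's own WebDriverWait is the more usual way to wait.

```python
import time


def click_and_capture(driver, link_text, timeout=10.0, poll=0.25):
    """Click the link with the given visible text, then return the page HTML
    as soon as it differs from the HTML seen before the click.

    `driver` is a Selenium WebDriver (or anything with the same
    `page_source` / `find_element` interface).
    """
    before = driver.page_source
    # "link text" is the locator-strategy string behind selenium's By.LINK_TEXT.
    driver.find_element("link text", link_text).click()

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        after = driver.page_source
        if after != before:  # crude signal that new content arrived
            return after
        time.sleep(poll)
    raise TimeoutError(f"page did not change within {timeout} seconds")


def scrape_with_firefox(url, link_text):
    """Open `url` in a real Firefox, click the link, return the updated HTML."""
    # Imported here so the helper above stays usable without selenium installed.
    from selenium import webdriver  # third-party: pip install selenium

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return click_and_capture(driver, link_text)
    finally:
        driver.quit()  # always close the browser, even on errors
```

You'd call it as something like scrape_with_firefox("https://example.com/listing", "Next") (both arguments are placeholders) and hand the returned HTML to whatever parser you were already using.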