Why Selenium? Selenium is a framework for automated GUI testing of web applications. To do this it lets you drive a real browser and navigate a site like a user would. Having a real browser for screen scraping is convenient because it takes care of cookies, session ids, and JavaScript, and it makes the scraping harder to detect for the site operator. Sounds perfect for the job.
At first it was a dream. Navigating through the pages is pretty easy because you can find links via id, class name, link text, and even tag name. You can also use an id to find a certain part of the DOM tree and then search by tag name only within that subtree. The main task here is to find some fixed but unique ids or class names that lead to the required link.
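To give an idea of what that looks like, here is a minimal WebDriver-style sketch in Java. The original code may have been written against an older Selenium API, and the link text and ids below are hypothetical placeholders, not Myspace's real markup.

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class NavigationSketch {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        driver.get("http://www.myspace.com");

        // Links can be located by id, class name, link text, or tag name.
        driver.findElement(By.linkText("Mail")).click();   // hypothetical link text

        // Use an id to narrow the search down to one part of the DOM tree,
        // then look for tags only within that subtree.
        WebElement inbox = driver.findElement(By.id("inboxContainer"));   // hypothetical id
        for (WebElement link : inbox.findElements(By.tagName("a"))) {
            System.out.println(link.getText());
        }

        driver.quit();
    }
}
```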
The real problems started in the inbox, where Myspace uses AJAX for navigation. The inbox overview is paginated, and moving to a certain page triggers an AJAX request. Up to this point there is no problem. But when you select a mail and then go back to the inbox, the first page of the pagination is shown again, so as soon as you view a mail that isn't listed on page one you have to select its page all over again. It took me some time to realize that there are buttons that let you jump directly to the next/previous message without going through the inbox at all.
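A rough sketch of that workaround, reusing the driver from the snippet above (plus java.util.List); the class name is again a hypothetical placeholder:

```java
// Jump from message to message via the "next" button instead of going back
// to the paginated inbox each time.
while (true) {
    // ... extract sender, date, and body of the current message here ...
    List<WebElement> next = driver.findElements(By.className("nextMessageButton"));
    if (next.isEmpty()) {
        break;              // last message reached, nothing more to click
    }
    next.get(0).click();    // go straight to the next mail
}
```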
With the navigation solved, it was time to extract the content. Doing this for individual pieces of information like the sender or the received date was easy, as each one lives in a single element and I only had to read its text. It got more exciting when I wanted to get the profile picture. Selenium can only take screenshots of the whole browser or hand me the link to the profile picture; getting the image the browser had already downloaded wasn't possible. So I skipped the image extraction and decided that the name would have to be sufficient.
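For single values this boils down to getText(), and for the picture the best Selenium offers is the src attribute. Again a sketch with hypothetical class names:

```java
// Single values live in single elements, so getText() is all that is needed.
String sender   = driver.findElement(By.className("messageSender")).getText();
String received = driver.findElement(By.className("messageDate")).getText();

// For the profile picture Selenium only hands out the element and its
// attributes, i.e. the image URL, not the bytes the browser already loaded.
String pictureUrl = driver.findElement(By.className("profileImage"))
                          .getAttribute("src");
```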
After getting around the images, the messages were the next challenge. While getting a single element of the DOM is quite easy, there is no way to get a part of the tree: you can only fetch single elements. After some reading through the javadoc I found out that it is possible to execute JavaScript and to get its result back. The downside of this solution is that the JavaScript starts at the DOM of the whole page; it isn't possible to preselect a part of it with Selenium first. To find the content in JavaScript I rebuilt getElementsByClassName as described on this page (German only), and to get the subtree from the browser into my Java code I serialized it to a string (German again). This worked more easily than I had expected.
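Assuming WebDriver's JavascriptExecutor (import org.openqa.selenium.JavascriptExecutor), the JavaScript route could look roughly like this. Modern browsers ship document.getElementsByClassName, whereas the original code rebuilt an equivalent helper; the messageBody class name is a placeholder.

```java
// Run JavaScript against the whole page DOM and hand the subtree back as a
// string that can then be parsed in Java.
JavascriptExecutor js = (JavascriptExecutor) driver;
String messageHtml = (String) js.executeScript(
        "var nodes = document.getElementsByClassName('messageBody');"
      + "return nodes.length > 0 ? nodes[0].outerHTML : null;");
```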
Conclusion:
I've got all my mails, but I won't use Selenium for screen scraping in future projects because
- It was pretty slow. Starting the browser took ages and each request took longer than in a normal browsing session. The latter matters because every action waits until the whole page has finished loading. That is crucial for a testing tool, but for scraping you can see that everything you need is already there and you still have to wait until the last advertisement image has loaded.
- It sometimes failed to execute commands correctly. Maybe Myspace failed the "test", but I tend to blame Selenium.
- Extraction of DOM subtrees is difficult (as described above).