Wednesday, December 09, 2015

Playing with JSoup and crawling a Greek newspaper ...in order to deliver news in the middle of the ocean :)

Recently I stumbled upon several articles and examples of a handy library called JSoup, and I wanted to give it a try. It was also a good opportunity to play and experiment with CSS selectors.

My main motivation was a family request. My father is a captain in the merchant navy. He still travels the oceans on big tankers and cargo ships. (Like those below; actually, this is one of them.)



Nowadays all of these vessels have satellite comms, but opening the link and transferring any data costs a lot. Most of the crew usually get some kind of prepaid card from the satellite internet provider, so they can use Skype or any other service for a short time... a very short time. To cut a long story short, if anyone wants to read the news on a regular site, just opening the page, with all its images and other content, costs a lot in credits.

My father wants to keep in touch with the news back in Greece, so for some time I was manually copy-pasting news into a simple HTML or Word file and sending him this summary by email. Of course, doing this by hand every morning was kind of boring and error prone. I needed to create something that would do the same thing for me.

I spent less than 2 hours last night, and with some basic calls and functionality provided by Jsoup, I hacked together my custom crawler (it's not rocket science). Note that this is a very specific tool: it crawls a specific newspaper (tovima.gr, that is) and it outputs a summary of the article content to a plain HTML file.
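To give a rough idea of what such a crawler does, here is a minimal sketch of the select-and-summarize step. It is not the code from the project: the class name, the CSS selectors (`div.article`, `h2.title`, `p.summary`) and the sample markup are all assumptions made up for illustration, since the real tovima.gr markup is not shown here. It parses an in-memory HTML string so that it runs without a network connection; for a live page you would obtain the `Document` with something like `Jsoup.connect(url).get()` instead.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class NewsSummarySketch {

    // Hypothetical selectors: adjust them to the markup of the site you crawl.
    static final String ARTICLE_SELECTOR = "div.article";
    static final String TITLE_SELECTOR = "h2.title";
    static final String SUMMARY_SELECTOR = "p.summary";

    /** Extracts one "title: summary" line per article block found in the page. */
    static List<String> extractSummaries(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element article : doc.select(ARTICLE_SELECTOR)) {
            // text() returns only the text content, so images and markup are dropped.
            String title = article.select(TITLE_SELECTOR).text();
            String summary = article.select(SUMMARY_SELECTOR).text();
            lines.add(title + ": " + summary);
        }
        return lines;
    }

    /** Builds a small text-only HTML page listing the extracted lines. */
    static String toPlainHtml(List<String> lines) {
        Document out = Document.createShell("");
        out.title("News summary");
        Element list = out.body().appendElement("ul");
        for (String line : lines) {
            // text(String) escapes the value, so it is safe to pass raw text.
            list.appendElement("li").text(line);
        }
        return out.outerHtml();
    }

    public static void main(String[] args) {
        // Made-up sample page standing in for a downloaded front page.
        String html =
              "<div class='article'><h2 class='title'>Title one</h2>"
            + "<img src='big.jpg'><p class='summary'>First summary.</p></div>"
            + "<div class='article'><h2 class='title'>Title two</h2>"
            + "<p class='summary'>Second summary.</p></div>";
        System.out.println(toPlainHtml(extractSummaries(html)));
    }
}
```

The point of the design is that only text leaves the crawler: the resulting page contains no images or scripts, so it stays small enough to send as an email attachment over an expensive link.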



You can find the project and code here. Feel free to use it if by any chance you have the same need, or extend it, maybe by adding a similar crawler for another newspaper. There is a README and a helper bash script.

You need Java 8 to run it and Maven 3 to build it. I am using the maven-shade plugin to create a small uber-jar.

