Thursday, December 17, 2015

My evolved news crawler :) v1.8

Well, I needed to kill some time during this strange intermission period between jobs. My original 1 hour hack (less than 100 lines of code) evolved into something more flexible and useful (I hope so). In the end my father is very happy: instead of 1 newspaper summary he now receives 10.

He was also kind enough to email me (from the Pacific) about some early bugs, like duplicate entries and formatting issues, which I tried to resolve. It is always fun to have someone use your code, isn't it?

Of course, in order to honor my Java development heritage, I had to create my own mini framework / crawling logic for this small tool. All Java devs do it!! It's not that complex actually, and now I can easily add more crawlers for similar sites.
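To give an idea of what such a mini framework can look like, here is a minimal sketch. It is not the actual code from the repository: the names `CrawlerSketch`, `SiteCrawler`, `RssCrawler` and `Article` are made up for illustration, and I assume each site exposes a plain RSS 2.0 feed with `item`, `title` and `link` elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class CrawlerSketch {

    /** One article pulled from a feed (hypothetical type). */
    static final class Article {
        final String title;
        final String link;

        Article(String title, String link) {
            this.title = title;
            this.link = link;
        }
    }

    /** Contract that every site-specific crawler implements (hypothetical). */
    interface SiteCrawler {
        String siteName();

        List<Article> parse(String feedXml) throws Exception;
    }

    /** Shared logic for sites that publish a plain RSS 2.0 feed (assumption). */
    static class RssCrawler implements SiteCrawler {
        private final String name;

        RssCrawler(String name) {
            this.name = name;
        }

        @Override
        public String siteName() {
            return name;
        }

        @Override
        public List<Article> parse(String feedXml) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(feedXml.getBytes(StandardCharsets.UTF_8)));
            NodeList items = doc.getElementsByTagName("item");
            List<Article> articles = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                articles.add(new Article(text(item, "title"), text(item, "link")));
            }
            return articles;
        }

        /** Returns the trimmed text of the first child with this tag, or "" if absent. */
        private static String text(Element parent, String tag) {
            NodeList nodes = parent.getElementsByTagName(tag);
            return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
        }
    }

    public static void main(String[] args) throws Exception {
        // Inline sample feed so the sketch runs without network access.
        String xml = "<rss><channel>"
                + "<item><title>Hello</title><link>http://example.com/1</link></item>"
                + "</channel></rss>";
        SiteCrawler crawler = new RssCrawler("example-site");
        for (Article a : crawler.parse(xml)) {
            System.out.println(crawler.siteName() + ": " + a.title + " -> " + a.link);
        }
    }
}
```

With a design like this, supporting a new site that follows the same feed layout is just another `RssCrawler` instance, and a site with a different layout gets its own `SiteCrawler` implementation.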

Currently I support several sites (all Greek for the time being), but I will keep adding more.
I have also added 2 optional command line arguments:
  • a flag to control the maximum number of articles to be crawled and included in the final report.
  • a flag to control the creation of zip files, one for each HTML report. That way I manage to reduce the size even more, so when I email the reports the payload is far smaller :).
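Here is a rough sketch of how two flags like these could be handled. The flag names `--max-articles=N` and `--zip`, as well as the class and method names, are invented for this example and are not necessarily the ones the tool uses (the GitHub page has the real ones). The zipping itself relies only on the standard `java.util.zip` classes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ReportOptionsSketch {

    /** Parsed command line options (hypothetical holder). */
    static final class Options {
        int maxArticles = Integer.MAX_VALUE; // no limit unless the flag is given
        boolean zip = false;                 // plain HTML unless the flag is given
    }

    /** Reads the two optional flags; anything else is ignored in this sketch. */
    static Options parse(String[] args) {
        final String maxPrefix = "--max-articles="; // hypothetical flag name
        Options options = new Options();
        for (String arg : args) {
            if (arg.startsWith(maxPrefix)) {
                options.maxArticles = Integer.parseInt(arg.substring(maxPrefix.length()));
            } else if (arg.equals("--zip")) {       // hypothetical flag name
                options.zip = true;
            }
        }
        return options;
    }

    /** Keeps at most max articles, preserving their order. */
    static <T> List<T> limit(List<T> articles, int max) {
        int end = Math.max(0, Math.min(max, articles.size()));
        return articles.subList(0, end);
    }

    /** Packs one HTML report into an in-memory zip archive with a single entry. */
    static byte[] zip(String entryName, String html) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(html.getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Options options = parse(args);
        List<String> titles = limit(Arrays.asList("first", "second", "third"), options.maxArticles);
        String html = "<html><body>" + String.join("<br>", titles) + "</body></html>";
        if (options.zip) {
            System.out.println("zipped report: " + zip("report.html", html).length + " bytes");
        } else {
            System.out.println(html);
        }
    }
}
```

Running it with `--max-articles=2 --zip` would keep the first two titles and print the size of the zipped report instead of the HTML itself.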
You can find more on the official GitHub page. By the way, I try to keep my documentation up to date.

There you will find all the material required to run or compile this small utility, plus any requirements.

I will soon add a small section for those (if there is anyone interested) who would like to plug in extra crawling implementations for other RSS based sites.

Of course there is a lot of stuff that I could do in order to improve the utility, and most probably I will continue to add crawlers for more sites and make the design more 'modular'.

Happy crawling.
