Thursday, August 13, 2009

Heritrix 1

    About Heritrix binaries:
  • To start the Heritrix server at 127.0.0.1:8080, just right-click the project in Eclipse and run it.
  • The ARC Writer crawl output goes to:
    C:\tools\Heritrix\archive-crawler\heritrix\jobs
  • Unzip IAH-20090813141952-00000-stutzmann.arc.gz with 7-Zip
  • Run perl C:\tools\Heritrix\ArcExtractor\ArcExtractor.pl
    You get the crawled files with their names changed but their extensions preserved
  • To give the JVM a 512 MB maximum heap, change in bin/heritrix: JAVA_OPTS=" -Xmx512M"
  • You can shut down through the web interface: http://127.0.0.1:8080/
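
If 7-Zip is not at hand, the unzip step above can also be sketched in Python. This is only a minimal sketch: `gunzip_arc` is a hypothetical helper name of my own, not part of Heritrix or ArcExtractor, and it assumes the file is ordinary gzip data (Python's `gzip` module also reads files made of several concatenated gzip members).

```python
import gzip
import shutil
from pathlib import Path


def gunzip_arc(src, dst=None):
    """Decompress a .arc.gz file and return the path of the plain .arc file.

    gunzip_arc is a hypothetical helper name, not part of Heritrix.
    """
    src = Path(src)
    # Default output: same name without the trailing ".gz".
    dst = Path(dst) if dst is not None else src.with_suffix("")
    if dst == src:
        raise ValueError("output path would overwrite the input file")
    # gzip.open reads through concatenated gzip members transparently.
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, `gunzip_arc("IAH-20090813141952-00000-stutzmann.arc.gz")` would write `IAH-20090813141952-00000-stutzmann.arc` next to the compressed file, which can then be handed to the extraction script.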
    About the dev version on Vista:
  • I built heritrix-1.15.4.jar on Windows using "maven dist" (Maven 1.0.2). I got a build error that actually comes from the tests (C:\Users\toto\.maven\cache\maven-test-plugin-1.6.2\plugin.jelly), but if you comment out the tests in project.xml, the jar is fine (heritrix-1.15.4.jar in C:\tools\Heritrix\archive-crawler\heritrix\target). I had to use JDK 1.5, because the build would not work with JDK 1.6 because of tools.jar
  • I copied the jar into a Linux Heritrix 1.14.3 binary install and renamed the existing heritrix-1.14.3.jar to *.old
  • I ran it on Linux with:
    "bin/heritrix --bind=/ --admin=LOGIN:PASSWORD"
  • Note that if you add a writer to the jar file, you can choose it in the web interface under:
    "Profiles", edit your profile, "Modules", and writer
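
The jar swap described above (keep the old jar as *.old, drop the new one in) can be sketched as a small Python helper. This is only an illustration: `swap_jar` and its parameters are hypothetical names, and the directory that holds the jar in your install is passed in as an argument rather than assumed.

```python
import shutil
from pathlib import Path


def swap_jar(install_dir, old_jar_name, new_jar):
    """Back up old_jar_name inside install_dir as *.old, then copy new_jar in.

    swap_jar and its parameters are hypothetical names for illustration.
    Returns the path of the copied jar.
    """
    install_dir = Path(install_dir)
    old_jar = install_dir / old_jar_name
    backup = old_jar.with_name(old_jar.name + ".old")
    # Keep the original jar around so the swap can be undone.
    old_jar.rename(backup)
    # copy2 copies into the directory and keeps the new jar's own file name.
    return Path(shutil.copy2(new_jar, install_dir))
```

A call such as `swap_jar(jar_dir, "heritrix-1.14.3.jar", "heritrix-1.15.4.jar")`, with `jar_dir` set to wherever heritrix-1.14.3.jar lives in the install, leaves `heritrix-1.14.3.jar.old` and `heritrix-1.15.4.jar` side by side. To roll back, delete the new jar and rename the *.old file back.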
    About Maven (Heritrix needs Maven 1.0.2):
  • To display all the targets:
    maven -g > toto.txt
  • The list of targets is in project.xml; see the "build" target
  • The "dist:" postgoals are described in maven.xml
  • To add a jar such as the MySQL one, you need to add it to both project.properties and project.xml