Heritrix 1
About Heritrix binaries:
- To start the Heritrix server at 127.0.0.1:8080, right-click the project in Eclipse and run it.
- The ARC Writer crawl output goes to:
C:\tools\Heritrix\archive-crawler\heritrix\jobs
- Unzip IAH-20090813141952-00000-stutzmann.arc.gz with 7-Zip
- Run perl C:\tools\Heritrix\ArcExtractor\ArcExtractor.pl
This gives you the crawled files with their names changed but their extensions preserved
- To set the maximum Java heap size (512 MB here), change this in bin/heritrix: JAVA_OPTS=" -Xmx512M"
- You can shut the server down through the web interface: http://127.0.0.1:8080/
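On a machine with a Unix-like shell, the unzip step above can also be done with gzip instead of 7zip. This is only a minimal sketch, assuming gzip is installed; unpack_arc is a hypothetical helper name, not something shipped with Heritrix or ArcExtractor.

```shell
# Hypothetical helper: decompress an .arc.gz next to the original and keep the .gz.
# $1 = path to the compressed ARC file.
unpack_arc() {
  gz="$1"
  out="${gz%.gz}"                    # strip the trailing .gz to get the .arc name
  [ "$out" != "$gz" ] || return 1    # refuse input that does not end in .gz
  gzip -dc "$gz" > "$out" || return 1
  printf '%s\n' "$out"               # print the path of the decompressed file
}
```

For example, "unpack_arc IAH-20090813141952-00000-stutzmann.arc.gz" should leave IAH-20090813141952-00000-stutzmann.arc in the same directory, ready for the ArcExtractor step.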
About the dev version on Vista:
- I build heritrix-1.15.4.jar on Windows using "maven dist" (Maven 1.0.2). The build fails with an error that actually comes from the tests (C:\Users\toto\.maven\cache\maven-test-plugin-1.6.2\plugin.jelly), but if you comment out the tests in project.xml, the jar is fine (heritrix-1.15.4.jar in C:\tools\Heritrix\archive-crawler\heritrix\target). I had to use JDK 1.5, because JDK 1.6 would not work due to a problem with tools.jar
- I copy the jar to a Linux heritrix 1.14.3 binary install and rename the existing heritrix-1.14.3.jar to *.old
- I run it on Linux with:
"bin/heritrix --bind=/ --admin=LOGIN:PASSWORD"
- Note that if you add a writer to the jar file, you can select it in the web interface under:
"Profiles", edit your profile, "Modules", and writer
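The jar swap on the Linux box can be sketched as a small shell function. This is only an illustration of the copy and rename steps above: swap_heritrix_jar is a hypothetical name, and both arguments are placeholders for wherever your install keeps heritrix-1.14.3.jar and wherever you copied the new jar.

```shell
# Hypothetical helper: set the shipped jar aside as *.old and drop in the new build.
# $1 = directory that holds heritrix-1.14.3.jar (placeholder path)
# $2 = path to the freshly built jar, e.g. heritrix-1.15.4.jar (placeholder path)
swap_heritrix_jar() {
  jar_dir="$1"
  new_jar="$2"
  old_jar="$jar_dir/heritrix-1.14.3.jar"
  [ -f "$old_jar" ] || { echo "missing $old_jar" >&2; return 1; }
  [ -f "$new_jar" ] || { echo "missing $new_jar" >&2; return 1; }
  mv "$old_jar" "$old_jar.old" || return 1   # keep the original for rollback
  cp "$new_jar" "$jar_dir/"                  # install the new jar beside it
}
```

After that, start the crawler with the same bin/heritrix command as before. To roll back, delete the new jar and rename the *.old file to its original name.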
About Maven (Heritrix needs Maven 1.0.2):
- To display all the targets:
maven -g > toto.txt
- The list of targets is in project.xml, see the "build" target
- "dist:" postgoals are described in maven.xml
- To add a jar like mysql, you need to add it to project.properties and project.xml