MAC Nutch+MySQL集成笔记
- 发表于
 - 周边
 
目的:Nutch爬虫引擎抓取的数据自动存入MySQL
隶属:Nutch+Hadoop+HBase(MySQL)+Elasticsearch+PHP 系列实践
MAC MySQL安装
不需要什么配置,就是next最后记住弹出的窗口里的密码就行,如:

安装的时候忘记截图了,网上找了个配图。
下载地址:http://dev.mysql.com/downloads/mysql/
Nutch的安装与配置以及使用
1、Nutch-2.3.1下载:http://nutch.apache.org/downloads.html下载,然后解压至本地安装目录,如本地根目录为${NUTCH_HOME};
2、配置nutch对mysql的支持,修改${APACHE_NUTCH_HOME}/ivy/ivy.xml文件,分别:
1)找到以下行取消注释
|   1  |  <dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>  |  
2)修改以下行
默认为
|   1  |  <dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>  |  
修改后为
|   1  |  <dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>  |  
3)取消注释以下行
|   1  |  <dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />  |  
注释:上2)、3)如果不修改会有异常异常信息为
Exception in thread “main” Java.lang.ClassNotFoundException:org.apache.gora.sql.store.SqlStore
3、数据库连接配置
编辑${NUTCH_HOME}/conf/gora.properties文件,注释掉默认的数据库连接配置,同时添加以下配置内容:
|   1 2 3 4 5 6 7  |  ############################### # MySQL properties            ############################### gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver gora.sqlstore.jdbc.url=jdbc:mysql://192.168.58.1:3306/nutch?createDatabaseIfNotExist=true gora.sqlstore.jdbc.user=root gora.sqlstore.jdbc.password=  |  
写上你需要连接的数据库地址以及用户名密码
4、修改nutch-site配置文件
将以下内容添加至${NUTCH_HOME}/conf/nutch-site.xml中的configuration节点中
|   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34  |  <property> <name>http.agent.name</name> <value>LiuXun Nutch Spider</value> </property> <property> <name>http.accept.language</name> <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value> <description>Value of the “Accept-Language” request header field. This allows selecting non-English language as default one to retrieve. It is a useful setting for search engines build for certain national group. </description> </property> <property> <name>parser.character.encoding.default</name> <value>utf-8</value> <description>The character encoding to fall back to when no other information is available</description> </property> <property> <name>storage.data.store.class</name> <value>org.apache.gora.sql.store.SqlStore</value> <description>The Gora DataStore class for storing and retrieving data. Currently the following stores are available: …. </description> </property> //特别添加 <property>     <name>generate.batch.id</name>     <value>*</value> </property>  |  
5、编译Nutch-2.3.1
- 进入${NUTCH_HOME}目录下执行ant命令:ant runtime
 - 编译成功后${NUTCH_HOME}目录下会有runtime这个目录
 
编译Nutch
|   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29  |  ➜  apache-nutch-2.3.1 (master) ✗ ant Buildfile: /Users/hackgyj/apache-nutch-2.3.1/build.xml Trying to override old definition of task javac   [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found. ivy-probe-antlib: ivy-download:   [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found. ivy-download-unchecked: ivy-init-antlib: ivy-init: init:     [mkdir] Created dir: /Users/hackgyj/apache-nutch-2.3.1/build     [mkdir] Created dir: /Users/hackgyj/apache-nutch-2.3.1/build/classes     [mkdir] Created dir: /Users/hackgyj/apache-nutch-2.3.1/build/release     [mkdir] Created dir: /Users/hackgyj/apache-nutch-2.3.1/build/test     [mkdir] Created dir: /Users/hackgyj/apache-nutch-2.3.1/build/test/classes clean-lib: resolve-default: [ivy:resolve] :: Apache Ivy 2.3.0 - 20130110142753 :: http://ant.apache.org/ivy/ :: [ivy:resolve] :: loading settings :: file = /Users/hackgyj/apache-nutch-2.3.1/ivy/ivysettings.xml  |  
上面报错了,需要下载sonar的jar包(sonar-ant-task-2.2.jar),并将jar包放到解压好的apache-nutch-2.3.1文件夹内的lib文件内内。由于需要连接网络下载资源,需要一些时间,根据网络情况时间不等,我自己用了大概一小时!
然后命令行执行:
|   1  |  ant clear  |  
再执行
|   1  |  ant runtime  |  
OK,没再出错,编译成功,目录下多出:build、runtime两个文件夹,其中runtime就是编译好的目录。
6、网页抓取以及配置
- 进入${NUTCH_HOME}/runtime/local目录下
 - 设置抓取的网站
 
执行命令
|   1 2 3  |  mkdir -p urls //建议爬虫连接文件夹 echo 'http://www.oschina.net/' > urls/seed.txt //写入爬取的连接 bin/nutch crawl urls -depth 3 -topN 5 //开始爬虫工作  |  
Error: JAVA_HOME is not set.
提示JAVA_HOME未设置
MAC OS X El Capitan 10.11.6 查找和设置$JAVA_HOME,命令如下
|   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33  |  ➜  ~ (master) ✗ which java /usr/bin/java ➜  ~ (master) ✗ ls -l /usr/bin/java lrwxr-xr-x  1 root  wheel  74 Oct 20  2015 /usr/bin/java -> /System/Library/Frameworks/JavaVM.framework/Versions/Current/Commands/java ➜  ~ (master) ✗ ls -l /System/Library/Frameworks/JavaVM.framework/Versions total 64 lrwxr-xr-x   1 root  wheel   10 Oct 20  2015 1.4 -> CurrentJDK lrwxr-xr-x   1 root  wheel   10 Oct 20  2015 1.4.2 -> CurrentJDK lrwxr-xr-x   1 root  wheel   10 Oct 20  2015 1.5 -> CurrentJDK lrwxr-xr-x   1 root  wheel   10 Oct 20  2015 1.5.0 -> CurrentJDK lrwxr-xr-x   1 root  wheel   10 Oct 20  2015 1.6 -> CurrentJDK lrwxr-xr-x   1 root  wheel   10 Oct 20  2015 1.6.0 -> CurrentJDK drwxr-xr-x  10 root  wheel  340 Oct  7 13:55 A lrwxr-xr-x   1 root  wheel    1 Oct 20  2015 Current -> A lrwxr-xr-x   1 root  wheel   52 Oct 20  2015 CurrentJDK -> /Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents ➜  ~ (master) ✗ java -version java version "1.8.0_91" Java(TM) SE Runtime Environment (build 1.8.0_91-b14) Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode) ➜  ~ (master) ✗ /usr/libexec/java_home -V Matching Java Virtual Machines (3):     1.8.0_91, x86_64: "Java SE 8" /Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home     1.6.0_65-b14-468, x86_64: "Java SE 6" /Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home     1.6.0_65-b14-468, i386: "Java SE 6" /Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home /Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home //打开用户配置文件 //添加路径:export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home ➜  ~ (master) ✗ open ~/.profile //保存后刷新用户配置 ➜  ~ (master) ✗ source ~/.profile ➜  ~ (master) ✗ echo $JAVA_HOME /Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home  |  
Command crawl is deprecated, please use bin/crawl instead
当执行bin/nutch crawl urls -depth 3 -topN 5时显示这个错误,经查资料发现是因为Nutch2.3.1不支持这么写了。
1.7和2.2.1及以上版本用bin/crawl取代bin/nutch crawl.正确的写法:
|   1  |  bin/crawl url/ test 5  |  
好了,能执行了,但问题又出现:
|   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19  |  Exception in thread "main" Java.lang.NoClassDefFoundError: org/apache/avro/ipc/ByteBufferOutputStream     at java.lang.Class.forName0(Native Method)     at java.lang.Class.forName(Class.java:191)     at org.apache.nutch.storage.StorageUtils.getDataStoreClass(StorageUtils.java:93)     at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:77)     at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218)     at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)     at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)     at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284) Caused by: java.lang.ClassNotFoundException: org.apache.avro.ipc.ByteBufferOutputStream     at java.NET.URLClassLoader$1.run(URLClassLoader.java:366)     at java.Net.URLClassLoader$1.run(URLClassLoader.java:355)     at java.security.AccessController.doPrivileged(Native Method)     at java.net.URLClassLoader.findClass(URLClassLoader.java:354)     at java.lang.ClassLoader.loadClass(ClassLoader.java:425)     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)     at java.lang.ClassLoader.loadClass(ClassLoader.java:358)     ... 9 more  |  
崩溃的感觉,再查发现答案是,Nutch2.3.1不支持MySQL,What………………
解决方法是:
- 要么使用2.2x版本,要么退回使用nutch1.x版本
 - 或者更换MySQL为hbase存储
 
显示我的选择是,放弃nutch2.3.1使用nutch2.2.1。浪费我大量时间!
如出现下面的错误,请搜索本文“特别添加”来解决。
|   1 2 3 4 5 6 7 8  |  Exception in thread "main" java.lang.RuntimeException: job failed: name=generate: null, jobid=job_local200289520_0002  at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:55)  at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:199)  at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)  at org.apache.nutch.crawl.Crawler.run(Crawler.java:152)  at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)  at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)  |  
nutch2.2.1成功
|   1 2 3 4 5 6 7 8 9 10 11 12  |  ➜  local (master) ✗ bin/nutch crawl urls -depth 3 -topN 5 InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class. InjectorJob: total number of urls rejected by filters: 0 InjectorJob: total number of urls injected after normalization and filtering: 1 Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property. FetcherJob: threads: 10 FetcherJob: parsing: false FetcherJob: resuming: false FetcherJob : timelimit set for : -1 Using queue mode : byHost Fetcher: threads: 10 QueueFeeder finished: total 1 records. Hit by time limit :0  |  
nutch2.2.1的安装及配置和上面一样,其中细节版本号错误等看错误信息修正就行,最后成功。
nutch命令前面章节介绍到了执行完在mysql中即查看到爬虫抓取的内容,如下图:

原文连接
的情况下转载,若非则不得使用我方内容。