Writing a Simple Crawler in Objective-C (Part 2)

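In the first post of this series we built a crawler that could fetch basic data. A single web page only holds so much, though: the program from last time found barely a dozen images, which clearly isn't enough.

So let's keep improving the project until it can do better.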

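Required tools

Required skills

I recommend reading Writing a Simple Crawler in Objective-C (Part 1) first to pick up the basics. It isn't my best writing, but it should keep you from getting lost in this post.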

Goal of this post

Crawl deeper to collect more data.

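Analysis

In the previous post, the program was given one URL, grabbed the images on that page, and stopped. How do we keep it running and move on to other URLs? Giving SpiderOption an array of URLs won't do: the input to a program is finite, so that doesn't solve the problem. As you have probably guessed, while the crawler is extracting images from a page it can also extract the links on that page. If we crawl the images on those linked pages, and then the pages they link to, going deeper and deeper, the images just keep coming. 😎 Let's get started!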

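Coding

Preparation

First we rework the Spider class and give it a few variables:
1. _finashUrl stores the URLs that have already been crawled, so that we never fetch the same page twice and end up in a loop.
2. _session performs the network requests that fetch the HTML. In the first post we created this object inside a method; here it becomes an instance variable so it is not created over and over.
3. _urlExpression is the regular expression that matches URLs.
4. _imgExpression is the regular expression that matches img tags.
5. fetchHtmlQueue is the queue for regex matching.
6. operationQueue is the queue for network requests.

With these declared, initialize them in the init method: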

```objc
@interface Spider () <NSURLSessionDataDelegate>

@property (nonatomic, strong) NSOperationQueue *fetchHtmlQueue;
@property (nonatomic, strong) NSOperationQueue *operationQueue;

@end

@implementation Spider
{
    // URLs that have already been scheduled for crawling
    NSMutableSet<NSString *> *_finashUrl;

    NSRegularExpression *_urlExpression;
    NSRegularExpression *_imgExpression;

    NSURLSession *_session;
}

- (instancetype)initWithOption:(SpiderOption *)option {
    self = [super init];
    if (self) {
        _option = option;

        _finashUrl = [NSMutableSet set];

        _imgExpression = [NSRegularExpression regularExpressionWithPattern:option.pattern
                                                                   options:NSRegularExpressionCaseInsensitive
                                                                     error:nil];

        // The dot before "html" is escaped so that it only matches a literal "."
        _urlExpression = [NSRegularExpression regularExpressionWithPattern:@"https?:.+?\\.html"
                                                                   options:NSRegularExpressionCaseInsensitive
                                                                     error:nil];

        // Queue that starts the network requests
        _operationQueue = [[NSOperationQueue alloc] init];
        _operationQueue.maxConcurrentOperationCount = 3;
        _operationQueue.name = @"com.html.loadQueue";

        // Queue for regex matching; it also receives the session's delegate callbacks.
        // Note: Apple's documentation recommends a serial queue for delegateQueue.
        _fetchHtmlQueue = [[NSOperationQueue alloc] init];
        _fetchHtmlQueue.maxConcurrentOperationCount = 3;
        _fetchHtmlQueue.name = @"com.html.fetchQueue";

        NSURLSessionConfiguration *configuration = [NSURLSessionConfiguration defaultSessionConfiguration];
        configuration.timeoutIntervalForRequest = 30;
        _session = [NSURLSession sessionWithConfiguration:configuration
                                                 delegate:self
                                            delegateQueue:_fetchHtmlQueue];
    }
    return self;
}
```
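
The URL pattern only does what we want if the `.` in front of `html` is matched literally; an unescaped `.` matches any character. Here is a small sketch that compares the two patterns, assuming a hypothetical helper `SPMatches` that is not part of the Spider class:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the original project): returns every
// substring of `text` matched by `pattern`, or an empty array if the
// pattern fails to compile.
static NSArray<NSString *> *SPMatches(NSString *text, NSString *pattern) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (!regex) {
        NSLog(@"Invalid pattern: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *found = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *results =
        [regex matchesInString:text options:0 range:NSMakeRange(0, text.length)];
    for (NSTextCheckingResult *result in results) {
        [found addObject:[text substringWithRange:result.range]];
    }
    return found;
}
```

For the input `<a href="http://a.com/xhtml5/page.html">`, the pattern `https?:.+?.html` stops early at `http://a.com/xhtml`, while `https?:.+?\.html` (written `\\.` inside an Objective-C string literal) returns the full `http://a.com/xhtml5/page.html`.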

Rewriting the methods

Because we are now multithreaded and every URL load has to go through a queue, the loadUrl method needs to change:

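```objc
- (void)loadUrl:(NSURL *)url {

    // WeakSelf / StrongSelf are the project's macros; StrongSelf defines sSelf
    WeakSelf;

    @synchronized (self) {
        [self->_operationQueue addOperationWithBlock:^{

            StrongSelf;
            // The spider may already be gone by the time this block runs
            if (!sSelf) {
                return;
            }

            NSURLSessionDataTask *task = [sSelf->_session dataTaskWithURL:url];

            [task resume];

        }];
    }
}
```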

Once a request from the previous step succeeds, the session calls its delegate methods, so we need to implement them:

```objc
// MARK: - NSURLSessionDataDelegate
- (void)URLSession:(NSURLSession *)session
          dataTask:(NSURLSessionDataTask *)dataTask
didReceiveResponse:(NSURLResponse *)response
 completionHandler:(void (^)(NSURLSessionResponseDisposition))completionHandler {

    // Cancel by default; only let responses with status 200 through
    NSURLSessionResponseDisposition disposition = NSURLSessionResponseCancel;

    NSInteger statusCode = 0;

    if ([response respondsToSelector:@selector(statusCode)]) {
        statusCode = [((NSHTTPURLResponse *)response) statusCode];
    }

    if (statusCode == 200) {
        disposition = NSURLSessionResponseAllow;
    }

    completionHandler(disposition);
}

- (void)URLSession:(NSURLSession *)session
          dataTask:(NSURLSessionDataTask *)dataTask
    didReceiveData:(NSData *)data {

    // Note: this callback can fire several times per response, each time
    // with only part of the body, so `html` may be a fragment of the page.
    NSString *html = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];

    if (html && html.length > 0) {
        [self fetchUrlWithHtml:html];

        [self fetchImgWithHtml:html];
    }
}
```
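
Since `URLSession:dataTask:didReceiveData:` may deliver a response in several chunks, a chunk that ends in the middle of a multi-byte UTF-8 character cannot be decoded on its own. One possible way around this, sketched here with a hypothetical `SPResponseBuffer` class that is not part of the original project, is to collect the chunks per task and decode only once the task has finished:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption, not the author's code): collects the
// chunks of each response, keyed by task identifier, and decodes the whole
// body as UTF-8 when the task is finished.
@interface SPResponseBuffer : NSObject
- (void)appendData:(NSData *)data forTask:(NSUInteger)taskIdentifier;
// Returns the decoded body and forgets the task; nil if nothing was
// buffered or the bytes are not valid UTF-8.
- (NSString *)finishTask:(NSUInteger)taskIdentifier;
@end

@implementation SPResponseBuffer {
    NSMutableDictionary<NSNumber *, NSMutableData *> *_buffers;
}

- (instancetype)init {
    self = [super init];
    if (self) {
        _buffers = [NSMutableDictionary dictionary];
    }
    return self;
}

- (void)appendData:(NSData *)data forTask:(NSUInteger)taskIdentifier {
    // Delegate callbacks may arrive on several threads, so guard the dictionary
    @synchronized (self) {
        NSMutableData *buffer = _buffers[@(taskIdentifier)];
        if (!buffer) {
            buffer = [NSMutableData data];
            _buffers[@(taskIdentifier)] = buffer;
        }
        [buffer appendData:data];
    }
}

- (NSString *)finishTask:(NSUInteger)taskIdentifier {
    NSString *html = nil;
    @synchronized (self) {
        NSData *body = _buffers[@(taskIdentifier)];
        if (body) {
            html = [[NSString alloc] initWithData:body encoding:NSUTF8StringEncoding];
        }
        [_buffers removeObjectForKey:@(taskIdentifier)];
    }
    return html;
}

@end
```

In the Spider, `didReceiveData` would call `appendData:forTask:` with `dataTask.taskIdentifier`, and `URLSession:task:didCompleteWithError:` would call `finishTask:` and pass the result to the fetch methods.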

Next we modify fetchImgWithHtml: img matching also goes onto a queue, and once matching is done we call back with the results:

```objc
- (void)fetchImgWithHtml:(NSString *)html {

    WeakSelf;
    [self.fetchHtmlQueue addOperationWithBlock:^{
        StrongSelf;
        if (!sSelf) {
            return;
        }
        NSArray<NSString *> *strings = [sSelf fetchStringsWithHtml:html
                                                        expression:sSelf->_imgExpression];

        // Hand the matched img strings to the caller, if a callback was set
        sSelf.option.complete ? sSelf.option.complete(strings) : nil;
    }];
}
```

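In the same way we modify fetchUrlWithHtml, adding each URL we are about to request to _finashUrl and checking before each request whether it has already been requested: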

```objc
- (void)fetchUrlWithHtml:(NSString *)html {

    WeakSelf;
    [self.fetchHtmlQueue addOperationWithBlock:^{
        StrongSelf;
        if (!sSelf) {
            return;
        }
        NSArray<NSString *> *strings = [sSelf fetchStringsWithHtml:html
                                                        expression:sSelf->_urlExpression];

        for (NSString *string in strings) {
            NSURL *url = [NSURL URLWithString:string];
            // Skip strings that do not form a valid URL
            if (!url) {
                continue;
            }

            // fetchHtmlQueue runs up to 3 operations at once and
            // NSMutableSet is not thread-safe, so guard the check-and-insert
            @synchronized (sSelf->_finashUrl) {
                if ([sSelf->_finashUrl containsObject:string]) {
                    continue;
                }
                [sSelf->_finashUrl addObject:string];
            }

            [sSelf loadUrl:url];
        }
    }];
}
```
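
The "have we seen this URL?" check and the insert into the set have to happen as one step, because fetchHtmlQueue allows three operations at a time and NSMutableSet is not thread-safe. A minimal sketch of that idea in isolation, using a hypothetical `SPVisitedSet` class (an assumption for illustration, not the project's code):

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: a set of visited URL strings whose
// check-and-insert is a single atomic step.
@interface SPVisitedSet : NSObject
// Returns YES only the first time `url` is seen; NO for repeats
// and for empty strings.
- (BOOL)markIfNew:(NSString *)url;
@end

@implementation SPVisitedSet {
    NSMutableSet<NSString *> *_seen;
}

- (instancetype)init {
    self = [super init];
    if (self) {
        _seen = [NSMutableSet set];
    }
    return self;
}

- (BOOL)markIfNew:(NSString *)url {
    if (url.length == 0) {
        return NO;
    }
    // Only one thread at a time may test and update the set
    @synchronized (self) {
        if ([_seen containsObject:url]) {
            return NO;
        }
        [_seen addObject:url];
        return YES;
    }
}

@end
```

With a helper like this, the loop body reduces to `if ([visited markIfNew:string]) { [self loadUrl:url]; }`.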

Finally, here is the fetchStringsWithHtml method:

```objc
- (NSArray<NSString *> *)fetchStringsWithHtml:(NSString *)html
                                   expression:(NSRegularExpression *)expression {

    // Find every match of the expression in the whole string
    NSArray<NSTextCheckingResult *> *results = [expression matchesInString:html
                                                                   options:NSMatchingReportCompletion
                                                                     range:NSMakeRange(0, html.length)];

    NSMutableArray<NSString *> *strings = [NSMutableArray arrayWithCapacity:results.count];

    // Turn each match range back into the matched substring
    for (NSTextCheckingResult *result in results) {
        NSString *sub = [html substringWithRange:result.range];
        [strings addObject:sub];
    }

    return strings;
}
```

Run

Now run the program: it keeps fetching images, and the collection keeps growing.

Visualization

To browse the images more comfortably, we display them in a UICollectionView in the view controller. Here's the code:

```objc
- (void)appendUrls:(NSArray<NSString *> *)urls {

    if (urls.count > 0) {
        WeakSelf;

        // UI updates must happen on the main queue
        dispatch_async(dispatch_get_main_queue(), ^{
            [weakSelf.collectionView performBatchUpdates:^{

                NSMutableArray<NSIndexPath *> *idxs = [NSMutableArray array];

                // New items are appended after the existing ones
                for (NSUInteger i = 0; i < urls.count; i++) {
                    NSIndexPath *idx = [NSIndexPath indexPathForItem:weakSelf.urls.count + i inSection:0];
                    [idxs addObject:idx];
                }

                [weakSelf.urls addObjectsFromArray:urls];

                [weakSelf.collectionView insertItemsAtIndexPaths:idxs];

            } completion:nil];
        });
    }
}

- (void)viewDidLayoutSubviews {
    [super viewDidLayoutSubviews];
    self.collectionView.frame = self.view.bounds;
}

- (NSInteger)collectionView:(UICollectionView *)collectionView numberOfItemsInSection:(NSInteger)section {
    return self.urls.count;
}

- (__kindof UICollectionViewCell *)collectionView:(UICollectionView *)collectionView cellForItemAtIndexPath:(NSIndexPath *)indexPath {
    CollectionViewCell *cell = [collectionView dequeueReusableCellWithReuseIdentifier:CollectionViewCellID forIndexPath:indexPath];
    cell.url = self.urls[indexPath.item];
    return cell;
}

- (CGFloat)collectionView:(UICollectionView *)collectionView layout:(UICollectionViewLayout *)collectionViewLayout minimumLineSpacingForSectionAtIndex:(NSInteger)section {
    return 0.5;
}

- (CGFloat)collectionView:(UICollectionView *)collectionView layout:(UICollectionViewLayout *)collectionViewLayout minimumInteritemSpacingForSectionAtIndex:(NSInteger)section {
    return 0.01;
}

- (UIEdgeInsets)collectionView:(UICollectionView *)collectionView layout:(UICollectionViewLayout *)collectionViewLayout insetForSectionAtIndex:(NSInteger)section {
    return UIEdgeInsetsZero;
}

- (CGSize)collectionView:(UICollectionView *)collectionView layout:(UICollectionViewLayout *)collectionViewLayout sizeForItemAtIndexPath:(NSIndexPath *)indexPath {
    UICollectionViewFlowLayout *layout = (UICollectionViewFlowLayout *)collectionViewLayout;

    // Eight square cells per row
    CGFloat w = (CGRectGetWidth(collectionView.frame) -
                 layout.minimumInteritemSpacing -
                 layout.sectionInset.left -
                 layout.sectionInset.right) / 8 - 0.5;

    return CGSizeMake(w, w);
}

- (void)collectionView:(UICollectionView *)collectionView didSelectItemAtIndexPath:(NSIndexPath *)indexPath {
    DetailController *detail = [[DetailController alloc] init];
    detail.url = self.urls[indexPath.item];
    [self.navigationController pushViewController:detail animated:YES];
}

// Lazily created collection view
- (UICollectionView *)collectionView {
    if (!_collectionView) {

        UICollectionViewFlowLayout *layout = [[UICollectionViewFlowLayout alloc] init];
        layout.minimumLineSpacing = 0.5;
        layout.minimumInteritemSpacing = 0;
        layout.sectionInset = UIEdgeInsetsZero;

        _collectionView = [[UICollectionView alloc] initWithFrame:CGRectZero collectionViewLayout:layout];
        _collectionView.delegate = self;
        _collectionView.dataSource = self;

        [_collectionView registerClass:[CollectionViewCell class]
            forCellWithReuseIdentifier:CollectionViewCellID];

        _collectionView.backgroundColor = [UIColor whiteColor];
    }
    return _collectionView;
}
```

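With that, we've reached our goal. Watching the images keep loading in is a little exciting, isn't it? 😄
As written, though, the program crawls without end. How do we control how deep it goes?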

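Controlling depth

My approach is to add a category on NSURL with a depth property that records how deep the URL is. The initial URL has depth 0, URLs found on that page have depth 1, and so on, until we reach the depth we specified and stop.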
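
To make the depth rule concrete, here is a rough sketch of a depth-limited traversal that leaves out the network and the category. It assumes a hypothetical in-memory link graph (a dictionary from page to the links found on it) and a hypothetical function `SPCrawl`; depth is tracked in a dictionary here purely for illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical sketch: `graph` stands in for the network and maps a page to
// the links found on it. Returns the pages in the order they would be loaded.
static NSArray<NSString *> *SPCrawl(NSDictionary<NSString *, NSArray<NSString *> *> *graph,
                                    NSString *start,
                                    NSUInteger maxDepth) {
    NSMutableArray<NSString *> *order = [NSMutableArray arrayWithObject:start];
    // Depth of every page seen so far; doubles as the "already visited" set
    NSMutableDictionary<NSString *, NSNumber *> *depthOf =
        [NSMutableDictionary dictionaryWithObject:@0 forKey:start];

    // `order` grows while we walk it, so use an index instead of fast enumeration
    for (NSUInteger i = 0; i < order.count; i++) {
        NSString *page = order[i];
        NSUInteger depth = depthOf[page].unsignedIntegerValue;

        // Links are only followed from pages shallower than maxDepth
        if (depth >= maxDepth) {
            continue;
        }

        for (NSString *link in graph[page]) {
            if (depthOf[link]) {
                continue; // already scheduled
            }
            depthOf[link] = @(depth + 1);
            [order addObject:link];
        }
    }
    return order;
}
```

With the graph `a → [b, c]`, `b → [a, d]`, `d → [e]`, a maximum depth of 1 loads `a, b, c`, and a maximum depth of 2 also loads `d`, while `e` is never reached.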

NSURL+depth.h

```objc
#import <Foundation/Foundation.h>

@interface NSURL (depth)

// Crawl depth of this URL; 0 for the starting URL
@property (nonatomic, assign) NSUInteger depth;

@end
```

NSURL+depth.m

```objc
#import "NSURL+depth.h"
#import <objc/runtime.h>

static const void * depth_key = "depth_key";

@implementation NSURL (depth)
@dynamic depth;

- (NSUInteger)depth {
    return [objc_getAssociatedObject(self, depth_key) unsignedIntegerValue];
}

- (void)setDepth:(NSUInteger)depth {
    // The value is boxed in an NSNumber, so the association must retain it;
    // OBJC_ASSOCIATION_ASSIGN could leave a dangling pointer.
    objc_setAssociatedObject(self, depth_key, @(depth), OBJC_ASSOCIATION_RETAIN_NONATOMIC);
}

@end
```

Then, in our start method, assign the depth of the first URL:

```objc
- (void)start {
    NSURL *url = [NSURL URLWithString:self.option.website];
    url.depth = 0;

    [self loadUrl:url];
}
```

In the session callback we check the URL's depth to decide whether to keep crawling:

```objc
- (void)URLSession:(NSURLSession *)session
          dataTask:(NSURLSessionDataTask *)dataTask
    didReceiveData:(NSData *)data {

    NSString *html = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
    // The depth is attached to the NSURL instance passed to loadUrl:, so this
    // relies on the task's request still holding that same instance
    NSUInteger depth = dataTask.originalRequest.URL.depth;

    if (html && html.length > 0) {
        // Only follow links while we are above the maximum depth
        if (depth < self.option.maxDepth) {
            [self fetchUrlWithHtml:html depth:depth + 1];
        }
        [self fetchImgWithHtml:html];
    }
}
```

Then change fetchUrlWithHtml so that it assigns a depth to each new URL:

```objc
- (void)fetchUrlWithHtml:(NSString *)html depth:(NSUInteger)depth {

    WeakSelf;
    [self.fetchHtmlQueue addOperationWithBlock:^{
        StrongSelf;
        if (!sSelf) {
            return;
        }
        NSArray<NSString *> *strings = [sSelf fetchStringsWithHtml:html
                                                        expression:sSelf->_urlExpression];

        for (NSString *string in strings) {
            NSURL *url = [NSURL URLWithString:string];
            // Skip strings that do not form a valid URL
            if (!url) {
                continue;
            }

            // NSMutableSet is not thread-safe, so guard the check-and-insert
            @synchronized (sSelf->_finashUrl) {
                if ([sSelf->_finashUrl containsObject:string]) {
                    continue;
                }
                [sSelf->_finashUrl addObject:string];
            }

            // Record how deep this URL is before loading it
            url.depth = depth;
            [sSelf loadUrl:url];
        }
    }];
}
```

One last thing I forgot to mention: `SpiderOption`. Add a maximum depth property to this class so that callers can control it:

```objc
/**
 Maximum crawl depth. Defaults to NSUIntegerMax.
 */
@property (nonatomic, assign) NSUInteger maxDepth;
```

OK, set the maximum depth in the view controller (preferably not too large) and give it a run!

Related links:

Project source code

Writing a Simple Crawler in Objective-C (Part 1)
