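Writing a Simple Crawler in Objective-C (Part 2)

In the first part we built a crawler that could fetch basic data. A single web page only holds so much, though: the program from last time collected barely a dozen images, which clearly is not enough.

Let's keep improving the project so that it can do better.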

Required tools

A Mac with an internet connection,
A browser (Google Chrome is recommended),
Xcode,
The source code from Writing a Simple Crawler in Objective-C (Part 1)

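Required skills

Read Writing a Simple Crawler in Objective-C (Part 1) first so that the basics are familiar. My writing is rough, but with that background this article should not leave you lost.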

Goal of this article

Crawl deeper to collect more data
As shown in the figure:

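Analysis

In the previous part, the program simply finished after crawling the images from the one url it was given. How do we keep it running and have it crawl other urls? Giving SpiderOption an array of urls does not solve the problem, because whatever the program takes as input is finite. You have probably already thought of it: while the crawler is collecting images from a page, it can also collect the links on that page. If we then crawl the images on those linked pages, and the links found on those pages as well, going deeper level by level, the images just keep coming. 😎 Let's get started!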

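Coding

Preparation

First we rework the Spider class and add a few variables:
1. _finashUrl stores the urls that have already been crawled, so that no url is fetched twice and the crawler does not loop forever.
2. _session performs the network requests that fetch the html. In part one we created an equivalent object inside the method; here it becomes an instance variable so that it is not recreated for every request.
3. _urlExpression is the regular expression that matches urls.
4. _imgExpression is the regular expression that matches images.
5. fetchHtmlQueue is the queue for regex matching.
6. operationQueue is the queue for network requests.

After declaring these variables, initialize them in the init method: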

@interface Spider()<NSURLSessionDataDelegate>

@property (nonatomic, strong) NSOperationQueue* fetchHtmlQueue;
@property (nonatomic, strong) NSOperationQueue* operationQueue;

@end

@implementation Spider
{
    // URLs that have already been crawled
    NSMutableSet<NSString*>* _finashUrl;
    
    NSRegularExpression* _urlExpression;
    NSRegularExpression* _imgExpression;
    
    NSURLSession* _session;
}

- (instancetype)initWithOption:(SpiderOption *)option{
    self = [super init];
    if (self) {
        _option = option;
        
        _finashUrl = [NSMutableSet set];
        
        _imgExpression = [NSRegularExpression regularExpressionWithPattern:option.pattern
                                                                   options:NSRegularExpressionCaseInsensitive
                                                                     error:nil];
        
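        // Match absolute http(s) links ending in ".html". The dot is escaped so that it
        // only matches a literal ".", and the character class stops a match at
        // whitespace, quotes, or angle brackets
        _urlExpression = [NSRegularExpression regularExpressionWithPattern:@"https?://[^\\s\"'<>]+?\\.html"
                                                                   options:NSRegularExpressionCaseInsensitive
                                                                     error:nil];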
        
        _operationQueue = [[NSOperationQueue alloc] init];
        _operationQueue.maxConcurrentOperationCount = 3;
        _operationQueue.name = @"com.html.loadQueue";

        _fetchHtmlQueue = [[NSOperationQueue alloc] init];
        _fetchHtmlQueue.maxConcurrentOperationCount = 3;
        _fetchHtmlQueue.name = @"com.html.fetchQueue";
        
        NSURLSessionConfiguration* configuration = [NSURLSessionConfiguration defaultSessionConfiguration];
        configuration.timeoutIntervalForRequest = 30;
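        // Delegate callbacks are delivered on _fetchHtmlQueue. Apple's documentation
        // recommends a serial queue here so that callbacks arrive in order
        _session = [NSURLSession sessionWithConfiguration:configuration delegate:self delegateQueue:_fetchHtmlQueue];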
        
    }
    return self;
}
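
The url pattern deserves a closer look, because an unescaped dot in a regular expression matches any character. The sketch below is only an illustration and not part of the original project; MatchesForPattern is a hypothetical helper that makes it easy to compare a loose pattern with a stricter one on the same input.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the original Spider class): returns every
// substring of `html` that matches `pattern`, ignoring case.
static NSArray<NSString *> *MatchesForPattern(NSString *pattern, NSString *html) {
    NSError *error = nil;
    NSRegularExpression *expression =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (expression == nil) {
        NSLog(@"Invalid pattern %@: %@", pattern, error);
        return @[];
    }

    NSArray<NSTextCheckingResult *> *results =
        [expression matchesInString:html
                            options:0
                              range:NSMakeRange(0, html.length)];

    NSMutableArray<NSString *> *matches = [NSMutableArray arrayWithCapacity:results.count];
    for (NSTextCheckingResult *result in results) {
        [matches addObject:[html substringWithRange:result.range]];
    }
    return matches;
}
```

For the input `http://example.com/xhtml-guide http://example.com/a.html`, the loose pattern `https?:.+?.html` also reports `http://example.com/xhtml`, because the unescaped dot matches the "x". The stricter pattern `https?://[^\s"'<>]+?\.html` reports only `http://example.com/a.html`.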

Rewriting the methods

Because we now use multithreading, every url has to be loaded through a queue, so the loadUrl method needs to change:

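- (void)loadUrl:(NSURL*)url{
    
    // Skip strings that could not be parsed into a url
    if (!url) {
        return;
    }
    
    WeakSelf;
    
    // NSOperationQueue is thread-safe, so adding an operation needs no extra lock
    [self->_operationQueue addOperationWithBlock:^{
        
        StrongSelf;
        if (!sSelf) {
            return;
        }
        
        // The response is delivered through the session delegate methods
        NSURLSessionDataTask* task = [sSelf->_session dataTaskWithURL:url];
        
        [task resume];
        
    }];
}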

Once the request from the previous step succeeds, the session calls its delegate methods, so we need to implement them:

///MARK:- NSURLSessionDataDelegate
- (void)URLSession:(NSURLSession *)session
          dataTask:(NSURLSessionDataTask *)dataTask
didReceiveResponse:(NSURLResponse *)response
 completionHandler:(void (^)(NSURLSessionResponseDisposition))completionHandler{
    
    // Cancel by default; only a 200 response is allowed to continue
    NSURLSessionResponseDisposition disposition = NSURLSessionResponseCancel;
    
    NSInteger statusCode = 0;
    
    if ([response respondsToSelector:@selector(statusCode)]){
        statusCode = [((NSHTTPURLResponse*)response) statusCode];
    }
    
    if (statusCode == 200) {
        disposition = NSURLSessionResponseAllow;
    }
    
    completionHandler(disposition);
}

- (void)URLSession:(NSURLSession *)session dataTask:(NSURLSessionDataTask *)dataTask didReceiveData:(NSData *)data{
    
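    // Note: this method can be called several times for one response, each time
    // with only part of the body, so `html` may be a fragment of the page
    NSString* html = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];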
    
    if (html && html.length > 0) {
        [self fetchUrlWithHtml:html];

        [self fetchImgWithHtml:html];
        
    }
}
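
One caveat: URLSession:dataTask:didReceiveData: can be called several times for a single response, each time with only part of the body. Decoding each chunk on its own may therefore yield partial html, or no string at all if a chunk ends in the middle of a multi-byte UTF-8 character. Here is a minimal sketch of one way around this, assuming a hypothetical ResponseBuffer class (not part of the original project) that collects the chunks per task and decodes them only once the task has finished:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the original project): collects the chunks
// delivered for each task and decodes them only when the task has finished.
@interface ResponseBuffer : NSObject
- (void)appendData:(NSData *)data forTaskIdentifier:(NSUInteger)identifier;
- (NSString *)finishTaskWithIdentifier:(NSUInteger)identifier;
@end

@implementation ResponseBuffer {
    NSMutableDictionary<NSNumber *, NSMutableData *> *_buffers;
}

- (instancetype)init {
    self = [super init];
    if (self) {
        _buffers = [NSMutableDictionary dictionary];
    }
    return self;
}

// Append one chunk to the buffer that belongs to the given task.
- (void)appendData:(NSData *)data forTaskIdentifier:(NSUInteger)identifier {
    @synchronized (self) {
        NSMutableData *buffer = _buffers[@(identifier)];
        if (buffer == nil) {
            buffer = [NSMutableData data];
            _buffers[@(identifier)] = buffer;
        }
        [buffer appendData:data];
    }
}

// Remove the task's buffer and decode it as UTF-8.
// Returns nil if nothing was buffered or the bytes are not valid UTF-8.
- (NSString *)finishTaskWithIdentifier:(NSUInteger)identifier {
    NSData *body = nil;
    @synchronized (self) {
        body = _buffers[@(identifier)];
        [_buffers removeObjectForKey:@(identifier)];
    }
    if (body == nil) {
        return nil;
    }
    return [[NSString alloc] initWithData:body encoding:NSUTF8StringEncoding];
}

@end
```

In the Spider class, appendData:forTaskIdentifier: could be called from didReceiveData with dataTask.taskIdentifier, and finishTaskWithIdentifier: from URLSession:task:didCompleteWithError:, before the html is handed to the matching methods.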

Next we modify the fetchImgWithHtml method. Matching images also goes onto a queue, and the callback is invoked once matching is done:

- (void)fetchImgWithHtml:(NSString*)html{
    
    WeakSelf;
    [self.fetchHtmlQueue addOperationWithBlock:^{
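        StrongSelf;
        if (!sSelf) {
            return;
        }
        NSArray<NSString*>* strings = [sSelf fetchStringsWithHtml:html expression:sSelf->_imgExpression];
        
        // Hand the matched image strings to the caller, if a callback was provided
        if (sSelf.option.complete) {
            sSelf.option.complete(strings);
        }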
    }];
}

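In the same way we modify the fetchUrlWithHtml method: each url that is about to be requested is added to _finashUrl, and before requesting we check whether it has been requested already: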

- (void)fetchUrlWithHtml:(NSString*)html{
    
    WeakSelf;
    [self.fetchHtmlQueue addOperationWithBlock:^{
        StrongSelf;
        NSArray<NSString*>* strings = [sSelf fetchStringsWithHtml:html expression:sSelf->_urlExpression];
        
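        for (NSString* string in strings) {
            NSURL* url = [NSURL URLWithString:string];
            if (!url) {
                continue;
            }
            
            // Check and insert in one critical section: blocks on this queue
            // can run concurrently, and NSMutableSet is not thread-safe
            @synchronized (sSelf) {
                if ([sSelf->_finashUrl containsObject:string]) {
                    continue;
                }
                [sSelf->_finashUrl addObject:string];
            }
            
            [sSelf loadUrl:url];
        }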
    }];
}
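
The check against _finashUrl and the insert into it have to happen as one step, because fetchHtmlQueue can run up to three blocks at once and NSMutableSet is not thread-safe. As a sketch, that check-and-insert can be packaged into a small class. VisitedSet and markVisited: are hypothetical names for this illustration, not part of the original project:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: a visited-url set whose "check and insert" is one atomic step.
@interface VisitedSet : NSObject
// Returns YES only the first time `url` is seen.
- (BOOL)markVisited:(NSString *)url;
- (NSUInteger)count;
@end

@implementation VisitedSet {
    NSMutableSet<NSString *> *_urls;
}

- (instancetype)init {
    self = [super init];
    if (self) {
        _urls = [NSMutableSet set];
    }
    return self;
}

- (BOOL)markVisited:(NSString *)url {
    // Both the lookup and the insert happen while holding the same lock,
    // so two threads can never both be told that a url is new.
    @synchronized (self) {
        if ([_urls containsObject:url]) {
            return NO;
        }
        [_urls addObject:url];
        return YES;
    }
}

- (NSUInteger)count {
    @synchronized (self) {
        return _urls.count;
    }
}

@end
```

With a helper like this, the loop only needs one call: a url is loaded when markVisited: returns YES and skipped otherwise.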

Finally, here is the fetchStringsWithHtml method:

- (NSArray<NSString*>*)fetchStringsWithHtml:(NSString*)html
                                 expression:(NSRegularExpression*)expression{
    
    NSArray<NSTextCheckingResult*>* results = [expression matchesInString:html
                                                                  options:NSMatchingReportCompletion
                                                                    range:NSMakeRange(0, html.length)];
    
    NSMutableArray<NSString*>* strings = [NSMutableArray arrayWithCapacity:results.count];
    
    for (NSTextCheckingResult* result in results) {
        NSString* sub = [html substringWithRange:result.range];
        [strings addObject:sub];
    }
    
    return strings;
}

Running

Now run the program. It keeps crawling images, and the collection keeps growing.

Visualization

To browse the collected images more comfortably, we display them in the controller with a UICollectionView. Here is the code:


- (void)appendUrls:(NSArray<NSString *>*)urls{
    
    if (urls.count > 0) {
        WeakSelf;
        
        dispatch_async(dispatch_get_main_queue(), ^{
            [weakSelf.collectionView performBatchUpdates:^{
                
                NSMutableArray<NSIndexPath*>* idxs = [NSMutableArray array];
                
                // New items are appended after the existing ones
                NSUInteger start = weakSelf.urls.count;
                for (NSUInteger i = 0; i < urls.count; i ++) {
                    NSIndexPath* idx = [NSIndexPath indexPathForItem:start + i inSection:0];
                    [idxs addObject:idx];
                }
                
                [weakSelf.urls addObjectsFromArray:urls];

                [weakSelf.collectionView insertItemsAtIndexPaths:idxs];
                
            } completion:nil];
        });
    }
}

- (void)viewDidLayoutSubviews{
    [super viewDidLayoutSubviews];
    self.collectionView.frame = self.view.bounds;
}

- (NSInteger)collectionView:(UICollectionView *)collectionView numberOfItemsInSection:(NSInteger)section{
    return self.urls.count;
}

- (__kindof UICollectionViewCell*)collectionView:(UICollectionView *)collectionView cellForItemAtIndexPath:(NSIndexPath *)indexPath{
    CollectionViewCell* cell = [collectionView dequeueReusableCellWithReuseIdentifier:CollectionViewCellID forIndexPath:indexPath];
    cell.url = self.urls[indexPath.row];
    return cell;
}

- (CGFloat)collectionView:(UICollectionView *)collectionView layout:(UICollectionViewLayout *)collectionViewLayout minimumLineSpacingForSectionAtIndex:(NSInteger)section{
    return 0.5;
}

- (CGFloat)collectionView:(UICollectionView *)collectionView layout:(UICollectionViewLayout *)collectionViewLayout minimumInteritemSpacingForSectionAtIndex:(NSInteger)section{
    return 0.01;
}

- (UIEdgeInsets)collectionView:(UICollectionView *)collectionView layout:(UICollectionViewLayout *)collectionViewLayout insetForSectionAtIndex:(NSInteger)section{
    return UIEdgeInsetsZero;
}

- (CGSize)collectionView:(UICollectionView *)collectionView layout:(UICollectionViewLayout *)collectionViewLayout sizeForItemAtIndexPath:(NSIndexPath *)indexPath{
    UICollectionViewFlowLayout* layout = (UICollectionViewFlowLayout*)collectionViewLayout;
    
    CGFloat w = (CGRectGetWidth(collectionView.frame) -
                 layout.minimumInteritemSpacing -
                 layout.sectionInset.left -
                 layout.sectionInset.right) / 8 - 0.5;
    
    return CGSizeMake(w, w);
}

- (void)collectionView:(UICollectionView *)collectionView didSelectItemAtIndexPath:(NSIndexPath *)indexPath{
    DetailController* detail = [[DetailController alloc] init];
    detail.url = self.urls[indexPath.row];
    [self.navigationController pushViewController:detail animated:YES];
}

- (UICollectionView *)collectionView{
    if (!_collectionView) {
        
        UICollectionViewFlowLayout* layout = [[UICollectionViewFlowLayout alloc] init];
        layout.minimumLineSpacing = 0.5;
        layout.minimumInteritemSpacing = 0;
        layout.sectionInset = UIEdgeInsetsZero;
        
        _collectionView = [[UICollectionView alloc] initWithFrame:CGRectZero collectionViewLayout:layout];
        _collectionView.delegate = self;
        _collectionView.dataSource = self;
        
        [_collectionView registerClass:[CollectionViewCell class]
            forCellWithReuseIdentifier:CollectionViewCellID];
        
        _collectionView.backgroundColor = [UIColor whiteColor];
    }
    return _collectionView;
}

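That completes our goal. Watching the images keep loading is a little exciting, isn't it? 😄
As written, the program will keep crawling indefinitely. So how do we control the crawl depth?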

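Controlling depth

My approach is to create a category on NSURL with a depth property that records how deep the url is. The initial url has depth 0, urls crawled from that page have depth 1, and so on, until the depth we specify is reached and crawling stops.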

NSURL+depth.h

@interface NSURL (depth)

@property (nonatomic, assign) NSUInteger depth;

@end

NSURL+depth.m

#import "NSURL+depth.h"
#import <objc/runtime.h>

static const void * depth_key = "depth_key";

@implementation NSURL (depth)
@dynamic depth;

- (NSUInteger)depth{
    return [objc_getAssociatedObject(self, depth_key) unsignedIntegerValue];
}

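- (void)setDepth:(NSUInteger)depth{
    // The value is boxed in an NSNumber, so the association must retain it;
    // OBJC_ASSOCIATION_ASSIGN could leave a dangling pointer
    objc_setAssociatedObject(self, depth_key, @(depth), OBJC_ASSOCIATION_RETAIN_NONATOMIC);
}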

@end

Then, in our start method, assign the depth of the first url:

- (void)start{
    NSURL* url = [NSURL URLWithString:self.option.website];
    url.depth = 0;
    
    [self loadUrl:url];
}

In the session callback, we check the depth of this url to decide whether to keep crawling:

- (void)URLSession:(NSURLSession *)session dataTask:(NSURLSessionDataTask *)dataTask didReceiveData:(NSData *)data{
    
    NSString* html = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
    NSUInteger depth = dataTask.originalRequest.URL.depth;
    
    if (html && html.length > 0) {
        if (depth < self.option.maxDepth) {
            [self fetchUrlWithHtml:html depth:depth + 1];
        }
        [self fetchImgWithHtml:html];

    }
}

Then change the fetchUrlWithHtml method so that it assigns a depth to each new url:

- (void)fetchUrlWithHtml:(NSString*)html depth:(NSUInteger)depth{
    
    WeakSelf;
    [self.fetchHtmlQueue addOperationWithBlock:^{
        StrongSelf;
        NSArray<NSString*>* strings = [sSelf fetchStringsWithHtml:html expression:sSelf->_urlExpression];
        
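        for (NSString* string in strings) {
            NSURL* url = [NSURL URLWithString:string];
            if (!url) {
                continue;
            }
            
            // Check and insert in one critical section: blocks on this queue
            // can run concurrently, and NSMutableSet is not thread-safe
            @synchronized (sSelf) {
                if ([sSelf->_finashUrl containsObject:string]) {
                    continue;
                }
                [sSelf->_finashUrl addObject:string];
            }
            
            url.depth = depth;
            [sSelf loadUrl:url];
        }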
    }];
}
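
To see what the depth limit does without touching the network, here is a small sketch that walks an in-memory link graph level by level. LinkGraph and CrawlOrder are hypothetical names for this illustration only. The rule is the same as in the code above: a page at depth d is only expanded when d is less than the maximum depth.

```objc
#import <Foundation/Foundation.h>

// Hypothetical stand-in for the network: maps each page to the links found on it.
typedef NSDictionary<NSString *, NSArray<NSString *> *> LinkGraph;

// Returns the pages visited when crawling from `start`. The start page has
// depth 0, and links are only followed from pages whose depth is below `maxDepth`.
static NSArray<NSString *> *CrawlOrder(LinkGraph *graph, NSString *start, NSUInteger maxDepth) {
    NSMutableArray<NSString *> *visitedInOrder = [NSMutableArray arrayWithObject:start];
    NSMutableSet<NSString *> *seen = [NSMutableSet setWithObject:start];
    NSArray<NSString *> *currentLevel = @[start];

    for (NSUInteger depth = 0; depth < maxDepth && currentLevel.count > 0; depth++) {
        NSMutableArray<NSString *> *nextLevel = [NSMutableArray array];
        for (NSString *page in currentLevel) {
            // Pages without an entry in the graph simply have no links
            for (NSString *link in graph[page]) {
                if ([seen containsObject:link]) {
                    continue;
                }
                [seen addObject:link];
                [visitedInOrder addObject:link];
                [nextLevel addObject:link];
            }
        }
        currentLevel = nextLevel;
    }
    return visitedInOrder;
}
```

For a graph where a links to b and c, b links to d and a, c links to d, and d links to e, a maximum depth of 1 visits a, b, c, and a maximum depth of 2 additionally visits d.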

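One last thing I forgot to mention is SpiderOption: add a maximum depth property to this class so that callers can control it:

/**
 Maximum crawl depth. Defaults to NSUIntegerMax
 */
@property (nonatomic, assign) NSUInteger maxDepth;

OK, set the maximum depth in the controller (preferably not too large) and run it!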