A Boring Book

04.09.2019
Garfield

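I have been reading a boring book lately. The hero rules the underworld and ends up with a gorgeous female CEO; he is invincible, protects the heroine, and countless beautiful women fall for him. Well, losers love this kind of fantasy, and so do I.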

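These days I read everything on my Kindle, but I searched in a lot of places and could not find a TXT download anywhere. So I simply wrote a crawler to scrape the full text from a website.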

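First we need the book's table of contents, that is, the links to every chapter. I scraped www.ldks.cc. Looking at the HTML of the contents page, the list is easy to pick out, because the site's template gives it a unique marker, class='listmain'. So all I need is every link under that div tag, and I have the table of contents.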

I had planned to write it in Python, but Visual Studio was still open, so I just created a new C# project. C# works just as well.

I downloaded an HTML parsing library, html-agility-pack ( https://html-agility-pack.net ), and started with this:

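        // HtmlAgilityPack types are fully qualified here, so no extra using directive is needed.
        void GetList(string _html)
        {
            PageLinkList.Clear();
            listBox1.Items.Clear();

            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(_html);

            // The table of contents lives in the single <div class="listmain">.
            HtmlAgilityPack.HtmlNode _n = doc.DocumentNode.SelectSingleNode("//div[@class='listmain']");
            if (_n == null)
                return; // not a table-of-contents page

            // Search relative to that div (".//a"); "//a" would match every link on the page.
            HtmlAgilityPack.HtmlNodeCollection _n2 = _n.SelectNodes(".//a");
            if (_n2 == null)
                return; // by default SelectNodes returns null when nothing matches

            foreach (HtmlAgilityPack.HtmlNode _node in _n2)
            {
                string _ht = _node.OuterHtml;

                // Skip any link whose markup mentions the site's own domain (an absolute link, not a chapter path).
                if (_ht.IndexOf("www.ldks.cc") > 0)
                {
                    continue;
                }

                // Chapter links look like <a href="/path.html">第N章 ...</a>.
                int _s1 = _ht.IndexOf("href=");
                int _e1 = _ht.IndexOf(">第");

                // Require both markers, so a title that does not start with 第 cannot produce a negative length.
                if (_s1 >= 0 && _e1 > _s1 + 7 && _ht.IndexOf("章") > 0)
                {
                    // Skip the 6 characters of href=" and stop before the closing quote;
                    // this assumes href is the last attribute of the tag.
                    string _link_str = _ht.Substring(_s1 + 6, _e1 - _s1 - 7);

                    PageLinkList.Add(_link_str);
                    listBox1.Items.Add(_link_str);
                }
            }
        }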
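
Slicing the link out of OuterHtml works for this template, but it depends on the exact attribute order. As a more tolerant variant, here is a minimal sketch that reads the href attribute directly through html-agility-pack. The class name LinkExtractor and the method name GetChapterLinks are made up for illustration, and the filter on the link text (starts with 第, contains 章) is my assumption about how the chapter titles look:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

// Hypothetical helper, not part of the original project.
static class LinkExtractor
{
    // Returns the href of every chapter link inside <div class="listmain">.
    public static List<string> GetChapterLinks(string html)
    {
        var links = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Only anchors below the listmain div are considered.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//div[@class='listmain']//a");
        if (anchors == null)
            return links; // no table of contents found

        foreach (HtmlNode a in anchors)
        {
            string href = a.GetAttributeValue("href", "");
            string title = a.InnerText.Trim();

            // Skip empty links and absolute links back to the site itself.
            if (href.Length == 0 || href.Contains("www.ldks.cc"))
                continue;

            // Assumption: chapter titles start with 第 and contain 章.
            if (title.StartsWith("第", StringComparison.Ordinal) && title.Contains("章"))
                links.Add(href);
        }
        return links;
    }
}
```

Calling LinkExtractor.GetChapterLinks(html) on the contents page should give the same relative paths that GetList collects, without depending on where href sits inside the tag.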

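That gives us the links to all the chapters. Next comes the page for a single chapter. The HTML turned out to be simple as well. In my book the site forcibly splits every chapter into two pages, 1 and 2, so I wrote Page1 and Page2. Each page also has a strong marker, class='showtxt', so again I only need the content inside that div tag. Here is the complete scraping code for the two pages: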

        // Needs at the top of the file: using System; using System.Net; using System.Text;
        void Page1(int _index)
        {
            string _link = string.Format("https://www.ldks.cc{0}", PageLinkList[_index]);
            string pageHtml;
            using (WebClient MyWebClient = new WebClient())
            {
                MyWebClient.Credentials = CredentialCache.DefaultCredentials; // network credentials used to authenticate the request
                Byte[] pageData = MyWebClient.DownloadData(_link); // download the page from the site
                pageHtml = Encoding.UTF8.GetString(pageData); // the page is decoded as UTF-8
            }

            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(pageHtml);

            ChapterNode _node = new ChapterNode();

            // Title and body both sit inside <div class="content">.
            HtmlAgilityPack.HtmlNode _n1 = doc.DocumentNode.SelectSingleNode("//div[@class='content']//h1");
            HtmlAgilityPack.HtmlNode _n2 = doc.DocumentNode.SelectSingleNode("//div[@class='content']//div[@class='showtxt']");
            if (_n1 == null || _n2 == null)
                return; // unexpected page layout: skip this chapter

            _node._name1 = _n1.InnerText;
            _node._page1 = _n2.InnerText;

            Page2(ref _node, _index);
        }

        void Page2(ref ChapterNode _node, int _index)
        {
            // The second half of a chapter is at the same URL with "_2" before ".html".
            string _link = string.Format("https://www.ldks.cc{0}", PageLinkList[_index]);
            _link = _link.Replace(".html", "_2.html");
            string pageHtml;
            using (WebClient MyWebClient = new WebClient())
            {
                MyWebClient.Credentials = CredentialCache.DefaultCredentials; // network credentials used to authenticate the request
                Byte[] pageData = MyWebClient.DownloadData(_link); // download the page from the site
                pageHtml = Encoding.UTF8.GetString(pageData); // the page is decoded as UTF-8
            }

            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(pageHtml);

            HtmlAgilityPack.HtmlNode _n1 = doc.DocumentNode.SelectSingleNode("//div[@class='content']//h1");
            HtmlAgilityPack.HtmlNode _n2 = doc.DocumentNode.SelectSingleNode("//div[@class='content']//div[@class='showtxt']");
            if (_n1 == null || _n2 == null)
                return; // unexpected page layout: skip this chapter

            _node._name2 = _n1.InnerText;
            _node._page2 = _n2.InnerText;

            // Clean up the text, store the finished chapter, and show it in the UI.
            _node.Repair();
            ChapterList.Add(_node);
            ShowChapter(ChapterList.Count - 1);
        }
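
The code above uses a ChapterNode type with a Repair() method that is not listed in this post. To make the snippet easier to follow, here is a hypothetical reconstruction. The four field names come from the code above; everything Repair() does here is my guess at a sensible cleanup (decoding HTML entities that InnerText leaves in place, and trimming), not the original implementation:

```csharp
using System;
using System.Net; // WebUtility.HtmlDecode

// Hypothetical reconstruction: the real ChapterNode is not shown in the post.
class ChapterNode
{
    public string _name1 = ""; // title of part 1
    public string _page1 = ""; // text of part 1
    public string _name2 = ""; // title of part 2
    public string _page2 = ""; // text of part 2

    // Assumed behavior: tidy up the raw InnerText of all four fields.
    public void Repair()
    {
        _name1 = Clean(_name1);
        _page1 = Clean(_page1);
        _name2 = Clean(_name2);
        _page2 = Clean(_page2);
    }

    static string Clean(string text)
    {
        // Decode entities such as &nbsp; and &amp; into real characters.
        string decoded = WebUtility.HtmlDecode(text ?? "");
        // Turn non-breaking spaces into ordinary spaces, then trim both ends.
        return decoded.Replace('\u00A0', ' ').Trim();
    }
}
```

Because this sketch makes ChapterNode a class, the ref in Page2's signature is not strictly needed, though it does no harm.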

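What remains is saving. I loop over the chapters, writing each one out as it is downloaded, and end up with the TXT file I wanted. The saving code is as follows: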

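        // Needs at the top of the file: using System.Diagnostics; using System.IO;
        void Save()
        {
            // Output file: <current directory>\<process name>.txt
            string filePath = Path.Combine(Directory.GetCurrentDirectory(), Process.GetCurrentProcess().ProcessName + ".txt");

            // FileMode.Create overwrites an existing file, so no explicit delete is needed.
            // The using blocks close the file even if a download throws.
            using (FileStream fs = new FileStream(filePath, FileMode.Create))
            using (StreamWriter _writer = new StreamWriter(fs))
            {
                for (int i = 0; i < PageLinkList.Count; i++)
                {
                    int _before = ChapterList.Count;
                    Page1(i); // downloads both halves and appends the chapter to ChapterList

                    // Write only if a chapter was really added, and take the newest entry,
                    // so a skipped chapter or an already filled ChapterList cannot shift the index.
                    if (ChapterList.Count > _before)
                    {
                        ChapterNode _c = ChapterList[ChapterList.Count - 1];
                        _writer.WriteLine(_c._name1);
                        _writer.WriteLine(_c._page1);
                        _writer.WriteLine(_c._name2);
                        _writer.WriteLine(_c._page2);
                        _writer.Flush();
                    }
                }
            }
        }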

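Over. After ten-odd minutes the whole book, about 900 chapters, was written to the TXT file. Finally, TXTTOPDF converts it into a PDF that can be put on the Kindle. This site is fairly simple and has no real anti-scraping measures, so the whole thing was easy. To generate the file faster you could split the chapters into groups and fetch them on multiple threads, but I only wanted this one book, so I could not be bothered.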
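
For anyone who does want the faster version, here is a rough sketch of the idea. I did not write or run this for the book. It uses tasks with a concurrency limit instead of hand-made threads, and it keeps the results in chapter order so the TXT file still reads correctly. The names ParallelFetch and FetchAllAsync are made up, and the fetch delegate stands in for whatever actually downloads and parses one chapter:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of the original project.
static class ParallelFetch
{
    // Runs fetch for every link with at most maxParallel calls in flight,
    // and returns the results in the same order as the input links.
    public static async Task<string[]> FetchAllAsync(
        IReadOnlyList<string> links,
        Func<string, Task<string>> fetch,
        int maxParallel)
    {
        using (var gate = new SemaphoreSlim(maxParallel))
        {
            // ToList() starts every task now; the semaphore limits how many run at once.
            List<Task<string>> tasks = links.Select(async link =>
            {
                await gate.WaitAsync();
                try
                {
                    return await fetch(link);
                }
                finally
                {
                    gate.Release();
                }
            }).ToList();

            // Task.WhenAll returns results in task order, not in completion order.
            return await Task.WhenAll(tasks);
        }
    }
}
```

The delegate could wrap a download such as HttpClient.GetStringAsync, which is an assumption since the code in this post uses WebClient. A small maxParallel, for example 4, keeps the load on the site modest.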
