找回密码
 立即注册

QQ登录

只需一步,快速开始

扫一扫,访问微社区

查看: 342|回复: 9

[求助] 【新人】利用pandas库中的read_html方法快速抓取网页中出现.....

1

主题

5

帖子

5

积分

贫民

积分
5
yanqinyanqin 发表于 2019-2-14 00:40:56 | 显示全部楼层 |阅读模式
求助各位爬虫大神,若能解答不甚感激,请问运行了代码之后出现这样的报错:
1、不知道这些问题出现是什么意思?
2、不清楚这些问题该如何解决?

/usr/local/bin/python3.7 /Users/yq37222/PycharmProjects/untitled8/.idea/wangyetiqu.py
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/pymysql/connecti**.py", line 583, in connect
    **kwargs)
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/socket.py", line 727, in create_connection
    raise err
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/socket.py", line 716, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 61] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/yq37222/PycharmProjects/untitled8/.idea/wangyetiqu.py", line 87, in <module>
    main(178)
  File "/Users/yq37222/PycharmProjects/untitled8/.idea/wangyetiqu.py", line 78, in main
    generate_mysql()
  File "/Users/yq37222/PycharmProjects/untitled8/.idea/wangyetiqu.py", line 57, in generate_mysql
    db='wade')
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/pymysql/__init__.py", line 94, in Connect
    return Connection(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/pymysql/connecti**.py", line 325, in __init__
    self.connect()
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/pymysql/connecti**.py", line 630, in connect
    raise exc
pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on 'localhost' ([Errno 61] Connection refused)")

Process finished with exit code 1



回复

使用道具 举报

0

主题

400

帖子

400

积分

骑士

积分
400
sheeboard 发表于 2019-2-14 17:16:18 | 显示全部楼层
连mysql数据库要用sqlalchemy的,先建关系映射,再用引擎连数据库,书里是一笔带过的,具体代码pandas的官方文档在stackoverflow网站里写的很清楚的,去找一下吧。
回复 支持 反对

使用道具 举报

1

主题

5

帖子

5

积分

贫民

积分
5
yanqinyanqin  楼主| 发表于 2019-2-14 23:31:28 | 显示全部楼层
sheeboard 发表于 2019-2-14 17:16
连mysql数据库要用sqlalchemy的,先建关系映射,再用引擎连数据库,书里是一笔带过的,具体代码pandas的官 ...

很感谢您的解答,您是说写的代码出现问题了吗?我不太明白报错时出现connection refuse error和你说的mysql数据库要用sqlalchemy有什么关系?
回复 支持 反对

使用道具 举报

0

主题

400

帖子

400

积分

骑士

积分
400
sheeboard 发表于 2019-2-15 10:22:04 | 显示全部楼层
本帖最后由 sheeboard 于 2019-2-15 11:49 编辑

书上的原话
This is quite a bit of munging that you’d rather not repeat each time you query the
database. The SQLAlchemy project is a popular Python SQL toolkit that abstracts
away many of the common differences between SQL databases. pandas has a
read_sql function that enables you to read data easily from a general SQLAlchemy
connection. Here, we’ll connect to the same SQLite database with SQLAlchemy and
read data from the table created before:
上面写的是读操作,写操作也一样。你要先保证用mysqlclient能连接你的数据库,如果是远程连接的还需要grant权限,orm关系映射是面向对象的编程方法也可以不用,直接用sql语句。
我只是根据报的错写的,也可能是由其他问题引起数据库无法连接,当然这个和read_html没什么关系,要解决问题最好是要贴代码。
回复 支持 反对

使用道具 举报

1

主题

5

帖子

5

积分

贫民

积分
5
yanqinyanqin  楼主| 发表于 2019-2-15 19:13:33 | 显示全部楼层
sheeboard 发表于 2019-2-15 10:22
书上的原话
This is quite a bit of munging that you’d rather not repeat each time you query the
data ...

谢谢您的回复,代码已经贴在下面:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from lxml import etree
import time
import pymysql
from sqlalchemy import create_engine
from urllib.parse import urlencode  # 编码 URL 字符串

start_time = time.time()  # 计算程序运行时间


def get_one_page(i):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
        }
        paras = {
            'reportTime': '2017-12-31',
            # 可以改报告日期,比如2018-6-30获得的就是该季度的信息
            'pageNum': i  # 页码
        }
        url = 'http://s.askci.com/stock/a/?' + urlencode(paras)
        resp**e = requests.get(url, headers=headers)
        if resp**e.status_code == 200:
            return resp**e.text
        return None
    except RequestException:
        print('爬取失败')


def parse_one_page(html):
    soup = BeautifulSoup(html, 'lxml')
    content = soup.select('#myTable04')[0]  # [0]将返回的list改为bs4类型
    tbl = pd.read_html(content.prettify(), header=0)[0]
    # prettify()优化代码,[0]从pd.read_html返回的list中提取出DataFrame
    tbl.rename(columns={'序号': 'serial_number', '股票代码': 'stock_code', '股票简称': 'stock_abbre', '公司名称': 'company_name',
                        '省份': 'province', '城市': 'city', '主营业务收入(201712)': 'main_bussiness_income',
                        '净利润(201712)': 'net_profit', '员工人数': 'employees', '上市日期': 'listing_date', '招股书': 'zhaogushu',
                        '公司财报': 'financial_report', '行业分类': 'industry_classification', '产品类型': 'industry_type',
                        '主营业务': 'main_business'}, inplace=True)

    # print(tbl)
    return tbl


# rename将中文名改为英文名,便于存储到mysql及后期进行数据分析
# tbl = pd.DataFrame(tbl,dtype = 'object') #dtype可统一修改列格式为文本

def generate_mysql():
    conn = pymysql.connect(
        host='localhost',
        user='root',
        password='******',
        port=3306,
        charset='utf8',
        db='wade')
    cursor = conn.cursor()

    sql = 'CREATE TABLE IF NOT EXISTS listed_company (serial_number INT(20) NOT NULL,stock_code INT(20) ,stock_abbre VARCHAR(20) ,company_name VARCHAR(20) ,province VARCHAR(20) ,city VARCHAR(20) ,main_bussiness_income VARCHAR(20) ,net_profit VARCHAR(20) ,employees INT(20) ,listing_date DATETIME(0) ,zhaogushu VARCHAR(20) ,financial_report VARCHAR(20) , industry_classification VARCHAR(20) ,industry_type VARCHAR(100) ,main_business VARCHAR(200) ,PRIMARY KEY (serial_number))'
    # listed_company是要在wade数据库中建立的表,用于存放数据

    cursor.execute(sql)
    conn.close()


def write_to_sql(tbl, db='wade'):
    engine = create_engine('mysql+pymysql://root:******@localhost:3306/{0}?charset=utf8'.format(db))
    try:
        # df = pd.read_csv(df)
        tbl.to_sql('listed_company2', con=engine, if_exists='append', index=False)
    # append表示在原有表基础上增加,但该表要有表头
    except Exception as e:
        print(e)


def main(page):
    generate_mysql()
    for i in range(1, page):
        html = get_one_page(i)
        tbl = parse_one_page(html)
        write_to_sql(tbl)


# # 单进程
if __name__ == '__main__':
    main(178)

    endtime = time.time() - start_time
    print('程序运行了%.2f秒' % endtime)

# 多进程
# from multiprocessing import Pool
# if __name__ == '__main__':
#         pool = Pool(4)
#         pool.map(main, [i for i in range(1,178)])  #共有178页

#         endtime = time.time()-start_time
#         print('程序运行了%.2f秒' %(time.time()-start_time))
回复 支持 反对

使用道具 举报

0

主题

400

帖子

400

积分

骑士

积分
400
sheeboard 发表于 2019-2-17 10:14:32 | 显示全部楼层
只抓前10页的内容,抛砖引玉。
  1. from sqlalchemy import create_engine
  2. import pandas as pd
  3. import requests
  4. from bs4 import BeautifulSoup

  5. engine=create_engine('mysql+pymysql://root:123456@localhost:3306/stock?charset=utf8')

  6. df=pd.DataFrame()

  7. for i in range(1,11):
  8.     data=[]
  9.     url=("http://s.askci.com/stock/a/?reportTime=2018-09-30&pageNum=%i#QueryCondition" % i)
  10.     r=requests.get(url)
  11.     soup=BeautifulSoup(r.content,'html.parser')
  12.     table=soup.find('table',attrs={'id':'myTable04'})
  13.     rows=table.find_all('tr')
  14.     for row in rows:
  15.         cols=row.find_all('td')
  16.         cols=[ele.text.strip() for ele in cols]
  17.         data.append(cols)
  18.     del data[0]
  19.     df=df.append(data,ignore_index=True)

  20. df.columns=['serial_number','stock_code','stock_abbre','company_name','province',\
  21. 'city','main_bussiness_income','net_profit','employees','listing_date','zhaogushu',\
  22. 'financial_report','industry_classification','industry_type','main_business']

  23. df.to_sql('listed_company2', con=engine, if_exists='append', index=False)
复制代码
Screenshot from 2019-02-17 10-11-59.png
回复 支持 反对

使用道具 举报

1

主题

5

帖子

5

积分

贫民

积分
5
yanqinyanqin  楼主| 发表于 2019-3-2 13:57:48 | 显示全部楼层
sheeboard 发表于 2019-2-17 10:14
只抓前10页的内容,抛砖引玉。

我用了你的代码尝试了一下,出现了一个问题:
/usr/local/bin/python3.7 /Users/yq37222/PycharmProjects/untitled1/.idea/cp.py
Traceback (most recent call last):
  File "/Users/yq37222/PycharmProjects/untitled1/.idea/cp.py", line 28, in <module>
    df.to_sql('listed_company2', con=engine, if_exists='append', index=False)
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/pandas/core/generic.py", line 2522, in to_sql
    dtype=dtype, method=method)
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 459, in to_sql
    chunksize=chunksize, dtype=dtype, method=method)
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 1172, in to_sql
    table.create()
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 571, in create
    if self.exists():
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 559, in exists
    return self.pd_sql.has_table(self.name, self.schema)
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 1200, in has_table
    schema or self.meta.schema,
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2124, in run_callable
    with self.contextual_connect() as conn:
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2192, in contextual_connect
    self._wrap_pool_connect(self.pool.connect, None),
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2228, in _wrap_pool_connect
    return fn()
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/sqlalchemy/pool.py", line 425, in connect
    return _ConnectionFairy._checkout(self)
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/sqlalchemy/pool.py", line 822, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/sqlalchemy/pool.py", line 554, in checkout
    rec = pool._do_get()
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/sqlalchemy/pool.py", line 1250, in _do_get
    self._dec_overflow()
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 67, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 277, in reraise
    raise value
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/sqlalchemy/pool.py", line 1247, in _do_get
    return self._create_connection()
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/sqlalchemy/pool.py", line 370, in _create_connection
    return _ConnectionRecord(self)
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/sqlalchemy/pool.py", line 499, in __init__
    self.__connect(first_connect_check=True)
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/sqlalchemy/pool.py", line 701, in __connect
    connection = pool._invoke_creator(self)
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
    return dialect.connect(*cargs, **cparams)
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 437, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/pymysql/__init__.py", line 94, in Connect
    return Connection(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/pymysql/connecti**.py", line 325, in __init__
    self.connect()
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/site-packages/pymysql/connecti**.py", line 583, in connect
    **kwargs)
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/socket.py", line 707, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/Library/Frameworks/Python.framework/Versi**/3.7/lib/python3.7/socket.py", line 748, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
TypeError: getaddrinfo() argument 1 must be string or None
回复 支持 反对

使用道具 举报

0

主题

400

帖子

400

积分

骑士

积分
400
sheeboard 发表于 2019-3-2 20:59:04 | 显示全部楼层
我猜测还是数据库问题,新版mysql密码默认是不能用明码的,认为这是不安全的。我也是改了mysql的配置才用的明码。怎么用socket,我再去看看书。
回复 支持 反对

使用道具 举报

1

主题

5

帖子

5

积分

贫民

积分
5
yanqinyanqin  楼主| 发表于 2019-3-2 22:50:11 | 显示全部楼层
sheeboard 发表于 2019-3-2 20:59
我猜测还是数据库问题,新版mysql密码默认是不能用明码的,认为这是不安全的。我也是改了mysql的配置才用的 ...

好的  谢谢你    那么长一串链接出现的是什么问题呢
回复 支持 反对

使用道具 举报

0

主题

400

帖子

400

积分

骑士

积分
400
sheeboard 发表于 2019-3-3 01:20:08 | 显示全部楼层
dns解析localhost是不是127.0.0.1,试试看运行mysql_secure_installation,安全配置一下,还是前面说过的,本地客户端先要能连数据库。
回复 支持 反对

使用道具 举报

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

快速回复 返回顶部 返回列表