Using Regular Expressions in Python - 白红宇

Published: 2019-04-30

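Regular expressions are typically used to search for, and replace, text that matches a pattern (a rule). Python handles regular expressions with the re module.

I. Regular expressions

1. The wildcard

The period . matches any character except a newline, and it matches exactly one character.

For example, the regular expression '.ython' matches the string 'python' in full, but it is not a full match for 'cpython' (one character too many) or 'ython' (nothing left for the period to match).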
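The wildcard behaviour can be checked with a short sketch. It uses re.fullmatch so that "matches" means "matches the whole string"; that choice is an assumption made here, since re.search would still find 'python' inside 'cpython'.

```python
import re

# The period stands for exactly one character (any character except a newline).
whole = re.fullmatch('.ython', 'python')      # match object: 'p' fills the period
too_long = re.fullmatch('.ython', 'cpython')  # None: one character too many
too_short = re.fullmatch('.ython', 'ython')   # None: nothing left for the period
inside = re.search('.ython', 'cpython')       # search finds 'python' inside the string

print(bool(whole), too_long, too_short, inside.group())
```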

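2. Escaping special characters

To match a special character literally, escape it with a backslash. In an ordinary string literal the backslash itself must be doubled; in a raw string (prefix r) a single backslash is enough.

For example, the pattern 'python\\.org', or equivalently r'python\.org', matches the string 'python.org'.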
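A small sketch of the difference between an escaped and an unescaped period. The test string 'pythonzorg' is made up for illustration, and re.escape is shown only as a convenience for building the escaped form.

```python
import re

# Both spellings produce the same pattern text: python\.org
same_text = ('python\\.org' == r'python\.org')

escaped = re.fullmatch(r'python\.org', 'python.org')    # matches the literal dot
rejected = re.fullmatch(r'python\.org', 'pythonzorg')   # None: 'z' is not a dot
unescaped = re.fullmatch('python.org', 'pythonzorg')    # matches: a bare period is a wildcard

# re.escape builds the escaped form of a literal string
auto = re.escape('python.org')

print(same_text, bool(escaped), rejected, bool(unescaped), auto)
```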

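3. Character sets

A character set is a group of characters enclosed in square brackets. It matches exactly one character from the group; for example, '[pj]ython' matches both 'python' and 'jython'.

You can also use ranges: '[a-zA-Z0-9]' matches any uppercase letter, lowercase letter, or digit.

To negate a character set, put a ^ at the start: '[^ab]' matches any character other than a and b.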
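The three forms of character set can be sketched as follows; the sample strings are invented for illustration.

```python
import re

words = ['python', 'jython', 'cython']
# A plain set: the first character must be p or j
pj = [w for w in words if re.fullmatch('[pj]ython', w)]

# A set with ranges: pick out the letters and digits
alnum = re.findall('[a-zA-Z0-9]', 'a-B_9!')

# A negated set: every character except a and b
not_ab = re.findall('[^ab]', 'abcabd')

print(pj, alnum, not_ab)
```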

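4. Alternation and subpatterns

The pipe character | matches either of two alternatives: 'python|perl' matches 'python' and 'perl'.

To apply | to only part of a pattern, enclose that part (the subpattern) in parentheses, as in 'p(ython|erl)'.

A single character can also be regarded as a subpattern.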
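A brief sketch of why the parentheses matter; the word list is an assumption for illustration.

```python
import re

languages = ['python', 'perl', 'ruby']

plain = [w for w in languages if re.fullmatch('python|perl', w)]
grouped = [w for w in languages if re.fullmatch('p(ython|erl)', w)]

# Without parentheses the | splits the whole pattern:
# 'python|erl' means 'python' or 'erl', so 'perl' is no longer a full match.
ungrouped = [w for w in languages if re.fullmatch('python|erl', w)]

print(plain, grouped, ungrouped)
```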

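5. Start and end of the string

Use a caret ^ to anchor the pattern at the start of the string and a dollar sign $ to anchor it at the end.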
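As a rough illustration of both anchors (the example strings are made up):

```python
import re

at_start = re.search('^ht+p', 'http://python.org')   # matches 'http' at the start
not_start = re.search('^ht+p', 'www.http.org')       # None: 'http' is not at the start
at_end = re.search(r'org$', 'www.python.org')        # matches 'org' at the end
not_end = re.search(r'org$', 'www.python.org/doc')   # None: the string ends with 'doc'

print(at_start.group(), not_start, at_end.group(), not_end)
```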

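6. Optional and repeated patterns

Appending one of the following operators to a subpattern makes it optional or repeated.

(pattern)? : pattern is optional (it appears 0 or 1 times)

(pattern)* : pattern is repeated 0 or more times

(pattern)+ : pattern is repeated 1 or more times

(pattern){m,n} : pattern is repeated from m to n times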
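The four operators above can be sketched like this. The URL pattern and the candidate strings are assumptions chosen for illustration, not part of the original article.

```python
import re

# ? makes a subpattern optional: both the scheme and the www. prefix may be absent
url = re.compile(r'(http://)?(www\.)?python\.org')
candidates = ['http://www.python.org', 'http://python.org',
              'www.python.org', 'python.org', 'ftp://python.org']
accepted = [c for c in candidates if url.fullmatch(c)]

star = re.fullmatch(r'w*\.python', '.python')   # * allows zero repetitions
plus = re.fullmatch(r'w+\.python', '.python')   # + needs at least one, so this is None

# {m,n} bounds the repetition count: one to three w characters
counted = [s for s in ['w.p', 'www.p', 'wwww.p'] if re.fullmatch(r'w{1,3}\.p', s)]

print(accepted, bool(star), plus, counted)
```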

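Repetition operators are greedy by default: they match as much text as possible.

For example, applying r'\*(.+)\*' to the string '*This* is *it*!' matches *This* is *it*

Adding a question mark ? after a repetition operator makes it non-greedy.

For example, applying r'\*(.+?)\*' to the string '*This* is *it*!' matches *This* and *it*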
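The two patterns above can be compared directly. This sketch uses re.search for the greedy case and re.finditer to collect every non-greedy match.

```python
import re

text = '*This* is *it*!'

# Greedy: .+ runs to the last asterisk in the string
greedy = re.search(r'\*(.+)\*', text).group()

# Non-greedy: .+? stops at the nearest closing asterisk, giving two matches
lazy = [m.group() for m in re.finditer(r'\*(.+?)\*', text)]

print(greedy)
print(lazy)
```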

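II. The re module: functions for working with regular expressions

1. search(pattern, string[, flags])

(1) Searches the given string for the first substring that matches the regular expression. If one is found, it returns a MatchObject (which evaluates as true); otherwise it returns None (which evaluates as false).

The pattern parameter is the regular expression, string is the string to search, and flags is an optional set of flags that control, for example, whether matching is case-sensitive.

(2) The MatchObject

A MatchObject holds information about the substring that matched the pattern, including which parts of the string matched which parts of the pattern. These parts are called groups.

A group is a subpattern enclosed in parentheses. Groups are numbered by counting opening parentheses from the left; group 0 is the entire pattern.

Some important MatchObject methods:

groups() returns a tuple of the strings matched by all groups, starting from group 1; group 0 is not included.

group([group1, ...]) returns the substring matched by the given subpattern (group); if no group number is given, it defaults to 0.

start([group]) returns the start position of the substring matched by the given group.

end([group]) returns the end position of the substring matched by the given group (as with slices, the end position itself is excluded).

span([group]) returns the start and end positions of the substring matched by the given group.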

import re

m = re.search(r'www\.(.*)\.(.{3})', 'WWW.python.org', re.I)  # re.I: ignore case
if m:
    print(m.groups())  # groups are counted from 1
    print('Group 0:')
    print(m.group())
    print(m.group(0))
    print('Group 1:')
    print(m.group(1))
    print(m.start(1))
    print(m.end(1))
    print(m.span(1))
    print('Group 2:')
    print(m.group(2))
    print(m.start(2))
    print(m.end(2))
    print(m.span(2))

Output:

('python', 'org')
Group 0:
WWW.python.org
WWW.python.org
Group 1:
python
4
10
(4, 10)
Group 2:
org
11
14
(11, 14)

2. match(pattern, string[, flags])

match is similar to search, except that it looks for a match only at the beginning of the given string.

import re

m1 = re.search(r'python', 'www.python.org')
if m1:
    print('search: match found')
else:
    print('search: no match')

m2 = re.match(r'python', 'www.python.org')
if m2:
    print('match: match found')
else:
    print('match: no match')

Output:

search: match found
match: no match

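3. compile(pattern[, flags])

When you call functions such as search and match with a regular expression written as a string, they convert it to a pattern object internally.

compile converts a regular expression string into a pattern object up front, so no further conversion is needed.

Pattern objects also have search and match methods, so

pat = re.compile(pattern[, flags])

pat.search(string) (where pat is a pattern object created with compile)

is equivalent to re.search(pattern, string[, flags])

import re

m1 = re.search(r'python', 'www.python.org')
if m1:
    print('search: match found')
else:
    print('search: no match')

pat = re.compile(r'python')
m2 = pat.search('www.python.org')
if m2:  # test m2, the result of the compiled pattern
    print('compile search: match found')
else:
    print('compile search: no match')

Output:

search: match found
compile search: match found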
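One place a compiled pattern is convenient is when the same expression is applied to many strings. The sketch below assumes a made-up list of host names; note that flags are passed once, to compile, rather than to each search call.

```python
import re

pat = re.compile(r'python', re.I)   # flags are given once, at compile time
hosts = ['www.python.org', 'WWW.PYTHON.ORG', 'www.perl.org']  # illustrative data

compiled_hits = [h for h in hosts if pat.search(h)]
direct_hits = [h for h in hosts if re.search(r'python', h, re.I)]

print(compiled_hits)
print(compiled_hits == direct_hits)
```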

4. split(pattern, string[, maxsplit=0])

Splits the string at every match of the pattern and returns a list.

import re

res = re.split('[, ]', 'ab,cd 123')  # split on spaces and commas
print(res)

Output:

['ab', 'cd', '123']
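Two details of split are worth a quick sketch: adjacent separators produce empty strings unless the pattern absorbs them, and maxsplit limits the number of splits. The input strings are invented for illustration.

```python
import re

# A comma followed by a space is two separate separators for '[, ]'
with_empty = re.split('[, ]', 'ab, cd 123')

# Adding + treats a run of separators as one
merged = re.split('[, ]+', 'ab, cd 123')

# maxsplit=1 splits only at the first separator
limited = re.split('[, ]+', 'ab, cd 123', maxsplit=1)

print(with_empty, merged, limited)
```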

5. findall(pattern, string)

Returns a list of all substrings of the string that match the pattern.

import re

result = re.findall(r'\d+', 'ab,cd 123 456')  # find the numbers
print(result)

Output:

['123', '456']
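When the pattern contains groups, findall returns the group contents instead of the whole match, which is what the scraping example later relies on. A minimal sketch with a made-up key=value string:

```python
import re

text = 'a=1, b=22'  # illustrative input

no_groups = re.findall(r'\w+=\d+', text)       # whole matches
one_group = re.findall(r'\w+=(\d+)', text)     # only the contents of the single group
two_groups = re.findall(r'(\w+)=(\d+)', text)  # one tuple per match

print(no_groups, one_group, two_groups)
```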

6. sub(pattern, repl, string[, count=0])

Replaces every substring of the string that matches pattern with repl.

import re

result = re.sub(r'\D', '', 'abc123def')  # remove every non-digit character
print(result)

Output:

123
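The count parameter and group references in repl are not shown above, so here is a short sketch of both. The emphasis example and its input string are assumptions for illustration.

```python
import re

# count=2 replaces only the first two matches
first_two = re.sub(r'\D', '', 'abc123def', count=2)

# \1 in repl refers to the text captured by group 1
emphasized = re.sub(r'\*(.+?)\*', r'<em>\1</em>', 'Hello, *world*!')

print(first_two)
print(emphasized)
```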

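III. Example: scraping information from my blog's home page

Goal: extract the title, URL, and publication date of every article on the home page.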

Looking at the HTML source, the markup for each article looks roughly like this (原 marks an original post; 阅读数 and 评论数 are the view and comment counts):

原 Python目录和文件处理总结
1、判断目录是否存在、判断文件是否存在、创建目录、重命名目录或文件import os#获取当前目录路径：  E:\Work\Projects\pythonprint(os.getcwd()) #判断当前目录是否存在，不存在则创建目录dir1if not os.path.isdir...
2019-08-22 11:02:28
阅读数 0
评论数 0

原 Python 正则表达式的使用
正则表达式通常被用来检索、替换那些符合某个模式(规则)的文本，Python使用re模块来处理正则表达式。一、正则表达式1、通配符句点 . 与除换行符外的任何字符都匹配，并且只与一个字符匹配。例如正则表达式'.ython'与字符串'python'匹配，不与'cpython'或'ython'...
2019-08-21 17:14:05
阅读数 6
评论数 0

After several rounds of testing and adjusting the regular expression, the final code is as follows:

from urllib.request import urlopen
import re
import pprint

# The re.DOTALL flag makes the period in the pattern match any character, including newlines
p = re.compile('
   
    .*?
    
     \\s*(.*?)\\s*.*?
     (.*?)', re.DOTALL)
text = urlopen('https://blog.csdn.net/gdjlc').read().decode()
pprint.pprint(p.findall(text))

The output is as follows:

[('https://blog.csdn.net/gdjlc/article/details/100010516',
  'Python目录和文件处理总结',
  '2019-08-22 11:02:28'),
 ('https://blog.csdn.net/gdjlc/article/details/99977171',
  'Python 正则表达式的使用',
  '2019-08-21 17:14:05'),
 ('https://blog.csdn.net/gdjlc/article/details/99867026',
  'emmet的用法',
  '2019-08-20 18:03:24'),
 ('https://blog.csdn.net/gdjlc/article/details/99682462',
  'Sublime Text 3 插件安装、搭建Python、Java开发环境',
  '2019-08-16 17:30:43'),
 ('https://blog.csdn.net/gdjlc/article/details/99089928',
  'python 字符串用法总结',
  '2019-08-10 17:38:43'),
 ('https://blog.csdn.net/gdjlc/article/details/98941184',
  'Activit 5.13 工作流部署新版本后回退到上一个版本',
  '2019-08-09 09:58:39'),
 ('https://blog.csdn.net/gdjlc/article/details/98874049',
  '一个java的http请求的封装工具类',
  '2019-08-08 15:59:31'),
 ('https://blog.csdn.net/gdjlc/article/details/98218877',
  'FastJSON使用例子',
  '2019-08-02 17:56:07'),
 ('https://blog.csdn.net/gdjlc/article/details/98033447',
  'SoapUI、Postman测试WebService',
  '2019-08-01 10:06:58'),
 ('https://blog.csdn.net/gdjlc/article/details/96422860',
  'PLSQL连接oracle数据库',
  '2019-07-18 09:12:55'),
 ('https://blog.csdn.net/gdjlc/article/details/95802645',
  'python函数修饰符@的使用',
  '2019-07-13 22:50:30'),
 ('https://blog.csdn.net/gdjlc/article/details/95040707',
  'Python上下文管理器的使用',
  '2019-07-07 22:50:49'),
 ('https://blog.csdn.net/gdjlc/article/details/95040436',
  'Python使用DB-API操作MySQL数据库',
  '2019-07-07 22:43:28'),
 ('https://blog.csdn.net/gdjlc/article/details/95039385',
  'Python类的定义、方法和属性使用',
  '2019-07-07 22:38:05'),
 ('https://blog.csdn.net/gdjlc/article/details/93755639',
  'tomcat配置通过域名访问项目',
  '2019-06-26 17:24:47'),
 ('https://blog.csdn.net/gdjlc/article/details/93417916',
  'Python对文件的读写操作',
  '2019-06-23 23:02:07'),
 ('https://blog.csdn.net/gdjlc/article/details/93381137',
  '模板引擎Jinja2的基本用法',
  '2019-06-23 14:53:32'),
 ('https://blog.csdn.net/gdjlc/article/details/93379033',
  '使用Flask构建一个Web应用',
  '2019-06-23 11:15:26'),
 ('https://blog.csdn.net/gdjlc/article/details/93376344',
  'Python函数使用',
  '2019-06-22 23:03:31'),
 ('https://blog.csdn.net/gdjlc/article/details/92620860',
  'Python的4个内置数据结构',
  '2019-06-17 14:29:09')]
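The same technique can be tried offline without fetching a page. The sketch below is only an illustration under stated assumptions: the markup, the class names, and the example.com URLs are all invented here and are not the blog's actual HTML, and the pattern is not the author's. It shows how re.DOTALL, non-greedy groups, and findall combine to yield one (url, title, date) tuple per article.

```python
import re
import pprint

# Hypothetical markup: tag structure, class names and URLs are invented for illustration.
html = '''
<div class="article">
  <h4><a href="https://example.com/post/1">
      First post
  </a></h4>
  <span class="date">2019-08-22 11:02:28</span>
</div>
<div class="article">
  <h4><a href="https://example.com/post/2">
      Second post
  </a></h4>
  <span class="date">2019-08-21 17:14:05</span>
</div>
'''

# re.DOTALL lets .*? cross line breaks; each (.*?) captures as little as possible.
p = re.compile(
    r'<div class="article">.*?'
    r'<a href="(.*?)">\s*(.*?)\s*</a>.*?'
    r'<span class="date">(.*?)</span>',
    re.DOTALL,
)

rows = p.findall(html)  # one (url, title, date) tuple per article
pprint.pprint(rows)
```

For real pages, a dedicated HTML parser is usually more robust than a regular expression, since small markup changes can break a pattern like this one.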