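[Discussion] Regular Expression Series

nwcwww · posted 2013-6-20 10:31:43

Last edited by nwcwww on 2013-6-20 11:50

Let me tidy up the code from posts #17 and #18; the two are really the same thing. I'll go into more detail here for those who are still getting started. Everyone else, just adding a few comments to your code is enough.
The idea behind the code is plain brute force. The dynamic expression added on top is just showing off and doesn't matter much.

Test setup:

txt = 'Hello World, from MATLAB';
nl = 5; % length of target words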

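Step 1. Case makes no difference when counting, and we need to check the length of each word, so we may as well convert the whole string to lower case and split it into words first.
\< and \> are the word-boundary position anchors, similar to \b in other programming environments. They mark the start and the end of a word respectively, so \<expr\> matches expr against a whole word.
\w{5} matches exactly five word characters.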

>> a = regexp(lower(txt), '\<\w{5}\>','match')
a =
'hello' 'world'

Step 2. Use strcat to join the words that meet the length requirement into a single string, which makes the later regexp calls easier:

>> b = strcat(a{:})
b =
helloworld

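Step 3. For any given letter, the number of times it occurs in b can be obtained directly with regexp. Taking the letter l as an example, it occurs 3 times: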

Count_l = length(regexp(b, 'l'))
Count_l =
3

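From a programming point of view, spelling out 'a', 'b', all the way to 'z' is inconvenient, so I use the char function here. The ASCII table shows that the code of a lower-case letter is its position in the alphabet plus 96, so the 12th letter, 'l', is char(96+12):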

>> char(96+12)
ans =
l

So Count_l can be rewritten as

Count_l = length(regexp(b, char(96+12)))
Count_l =
3

Going from char(96+1) to char(96+26) covers every letter. The iteration can be written as a loop or wrapped in arrayfun; I'll use the latter here:

Count_all = arrayfun(@(x) length(regexp(b, char(96+x))), 1:26)
Count_all =
Columns 1 through 16
0 0 0 1 1 0 0 1 0 0 0 3 0 0 2 0
Columns 17 through 26
0 1 0 0 0 0 1 0 0 0

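This gives the answer to the problem. The value 3 in column 12 corresponds to the letter l that we singled out earlier.

So if I were writing the code from #17 in my usual way, it would most likely look like this: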

a = regexp(lower(txt),'\<\w{5}\>','match');
b = strcat(a{:});
Count_all = arrayfun(@(x) length(regexp(b, char(96+x))), 1:26);

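The Cody answer did not use \w{5} directly because the nl inside {nl} is variable, and building the pattern with ['\<\w{', num2str(nl), '}\>'] felt clumsy.
So I settled for second best and moved the length check to the strcat step. In comparison, this is less elegant than putting {nl} straight into the regexp. I suggest looking at the sprintf approach in qibbxxt's post below.

a = regexp(lower(txt), '\<\w+\>','match');
%or use strsplit to do the same job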
b = strcat(a{cellfun(@numel, a)==nl});
Count_all = arrayfun(@(x) length(regexp(b, char(96+x))), 1:26);
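
For readers who want the length folded straight into the pattern, here is a minimal sketch of the three-line version with the quantifier built by sprintf, as suggested above. It assumes the txt and nl from the test setup; the variable name pat is only for illustration.

```matlab
txt = 'Hello World, from MATLAB';
nl  = 5;                                   % length of target words

% Build '\<\w{5}\>' at run time. The backslashes are doubled because
% sprintf treats a single backslash as the start of an escape sequence.
pat = sprintf('\\<\\w{%d}\\>', nl);

a = regexp(lower(txt), pat, 'match');      % words of exactly nl characters
b = strcat(a{:});                          % join them into one string
Count_all = arrayfun(@(x) length(regexp(b, char(96+x))), 1:26);
```

Changing nl then changes the pattern without touching the rest of the code.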

liuyalong008 · posted 2013-6-20 10:57:59

qibbxxt's post a few days ago reminded me of the histc function

s=regexp(lower(txt), '\<\w+\>','match')
s(cellfun(@length,s)==nl)
histc([ans{:}],97:122)

qibbxxt · posted 2013-6-20 11:25:40

accumarray(reshape(char(regexp(lower(txt),sprintf('\\<\\w{%d}\\>',nl),'match'))-'`',[],1),1,[26,1])'

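lower(txt): converts the string to lower case
sprintf(...): builds \<\w{nl}\>. sprintf is used because the value of nl has to change dynamically; nwcwww has already explained what the regular expression itself means
char: converts the cell array into a character array, the counterpart of cellstr
reshape: the result of char is 2 rows by 5 columns, so for ease of computation it is reshaped into a single column
accumarray: counts the frequencies. This is only one use of the function; the accumulation function defaults to @sum. The function is fairly involved, so I'll cover it in detail another time
-'`': this is simply -96, which maps a-z to 1-26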
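
To make the last two items concrete, here is a small sketch on a single word. It is only an illustration of the -'`' offset and of accumarray used as a counter; the names idx and counts are made up for this example.

```matlab
% '`' is char(96), so subtracting it maps 'a'..'z' to 1..26.
idx = 'hello' - '`';                        % [8 5 12 12 15]

% Each index adds 1 to its own bin; the output is forced to 26-by-1
% so that letters which never occur still get a zero.
counts = accumarray(idx(:), 1, [26, 1])';   % 1-by-26 letter histogram
```

Here counts(12) is 2 because l occurs twice in 'hello'.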

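histc is probably the better tool for counting frequencies. I used accumarray for variety, since liuyalong008 has already used histc.

With histc, it can be written as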

sum(histc(char(regexp(lower(txt),sprintf('\\<\\w{%d}\\>',nl),'match')),97:122),2)'

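liuyalong008 · posted 2013-6-24 11:38:12

Last edited by liuyalong008 on 2013-6-24 22:20

I came across a post online asking how to extract specific information from a complicated text file.
I couldn't stand the long loops and needless code in it,
so I'm posting the problem in this board for discussion.
The task is to extract three pieces of information from the text: first, whatever is enclosed in []; second, the numeric part of the e exponent; third, the string that follows the exponent. Let's call the three field names name, num and string: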

>gb|AAN73709.1|AF484508_1 gag-pol fusion polyprotein [swine]
Score=440 bits (1131) Expect=e-122
PIVQNAQGQMIYQALSPRTLNAWVKVIEERGFSPEVIP
>gb|AAL34796.1| gag protein [monkey]
Score=457 bits (1177) Expect=e-127
PIVQNLQGQMVHQSISPRTLNAWVKVIEEKAFSPEVIPMFTALSEGATPQL
>gb|AAL34796.1| gag protein [swine]
Score=457 bits (1177) Expect=e-127
PIVQNLQGQMVHQSISPRTLNAWVKVIEEKAFSPEVIPMFTALSEGATPQL
>gb|AAL34796.1| gag protein [Human]
Score=457 bits (1177) Expect=e-127
PIVQNLQGQMVHQSISPRTLNAWVKVIEEKAFSPEVIPMFTALSEGATPQL

The text is attached to this post
ps: we've done plenty with tokens and match already, so this time the focus is on names

Partial result:

>> tokenNames(1)
ans =
name: 'swine'
num: '-122'
string: 'PIVQNAQGQMIYQALSPRTLNAWVKVIEERGFSPEVIP'

liuyalong008 · posted 2013-6-25 09:45:13

Last edited by liuyalong008 on 2013-6-25 09:47

Here's mine; criticism welcome:

fid = fopen('C:\Users\Administrator\Desktop\test.txt');
data = textscan(fid, '%s','delimiter','\n') % read line by line into a cell
fclose(fid)

tt=regexprep([data{:}],'(.*)','$1 ') % append a space to every string; $1 is the first token. For some reason \s does not work here

expression=['(?<name>(?<=\[)\w+(?=\])).*?',...
'(?<num>(?<=Expect=e)[+-]?\d+).*?',...
'(?<string>(?<=Expect=e-\d+\s+)[A-Z]+)']
tokenNames = regexp([tt{:}],expression,'names')

The structure of expression is:

(?<name>tag1).*?(?<num>tag2).*?(?<string>tag3)

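Here (?<***>tag) is a fixed structure (a named token). .*? is a lazy expression that matches as few characters as possible (its counterpart .* matches as many characters as possible).
(?<=\[) requires that the string be preceded by [; (?=\]) requires that it be followed by ].
In lookaround, (?<=***) goes before the string to be matched (its negated form is (?<!***)),
and (?=***) goes after it (its negated form is (?!***)).
There are two ways to turn the cell into a string:
one is char(tt), but this produces a 2-D character array in which shorter strings are padded with spaces;
the other is [tt{:}], which gives a single row that is easy to match with the regex functions.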
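
The points above can be tried out on small inputs. The following is a rough sketch rather than part of the solution: line1 is one record of the sample text flattened by hand, and s and tt are made-up toy inputs.

```matlab
line1 = '>gb|AAL34796.1| gag protein [monkey] Score=457 bits (1177) Expect=e-127 ';

% Lookbehind and lookahead: the brackets must be present,
% but they are not part of the returned match.
name = regexp(line1, '(?<=\[)\w+(?=\])', 'match', 'once');

% Greedy versus lazy on a string with two bracketed parts.
s = '[a][b]';
greedy = regexp(s, '\[.*\]',  'match', 'once');   % runs to the last ]
lazy   = regexp(s, '\[.*?\]', 'match', 'once');   % stops at the first ]

% Cell array to text: char pads to a matrix, [tt{:}] gives one row.
tt = {'ab ', 'cdef '};
m = char(tt);      % 2-by-5 char matrix, first row padded with blanks
r = [tt{:}];       % single row 'ab cdef '
```

With the single-row form, one regexp call can walk across all records, which is why the post above uses [tt{:}].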

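lin2009 · posted 2013-6-25 16:33:57

The code in #25 that builds the target string can be simplified further, as follows:

fid = fopen('test.txt');
str = fscanf(fid, '%s'); % read the entire file as one string (whitespace is skipped)
fclose(fid)
p1 = '(?<name>(?<=\[).*?(?=\])).*?'; % matches what is inside []; the trailing .*? stands for the irrelevant characters between sub-patterns
p2 = '(?<num>(?<=e)(\-\d+)).*?'; % matches the numeric part of the e exponent
p3 = '(?<string>[a-zA-Z]+)'; % matches the string that follows the exponent
pattern = [p1, p2, p3]; % the overall pattern
loc = regexp(str, pattern, 'names') % returns a struct
struct2cell(loc) % display the result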

qibbxxt · posted 2013-6-26 11:04:30

Last edited by qibbxxt on 2013-6-26 14:13

pattern = '\[(?<name>.*?)\].*?e(?<num>-\d+).*?(?<string>[A-Z]+)' ;
[~,name] = regexp(fileread('test.txt'), pattern, 'tokens', 'names');
struct2cell(name)

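liuyalong008 has already explained the key points, so I won't repeat them

bainhome · posted 2013-6-26 12:11:40

Last edited by bainhome on 2013-7-2 10:49

Please keep going, everyone. I'm already compiling your excellent code into a summary, but I've been busy lately and haven't finished. I'm posting the outline here for you to look at; the key part is Section 3.3 on the general rules of regular expressions. If anything is missing, please add to it and I'll revise accordingly.
ps: I'll collect all the problems posted from here on into this pdf. No rush, let's take them one at a time :)
The pdf has been removed from this post; the updated attachment is in #43 for now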

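liuyalong008 · posted 2013-6-26 15:20:05

qibbxxt posted on 2013-6-26 11:04
liuyalong008 has already explained the key points, so I won't repeat them

A slight digression:
reading lin2009's and qibbxxt's posts, I suddenly realized how many functions MATLAB provides for reading text.
Here are some of them for now: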

fread
fscanf
fgetl
textread
dlmread
tblread
tdfread
caseread
textscan
fileread
importdata
dataset

liuyalong008 · posted 2013-6-26 16:38:23

Regex series: regexprep
Problem 171. Reverse the Words (not letters) of a String
Reverse the order of all the words:

Description
Change the words of a string such that the words appear in reverse order. You may assume that the string is a number of words split by a single space character. The only characters in the input are letters and spaces.

Example
input = 'Will the ecological jail rule outside the tear';
output = 'tear the outside rule jail ecological the Will';

FYI: the string contains only letters and spaces

Test cases:

%%
x = 'Will the ecological jail rule outside the tear';
y_correct = 'tear the outside rule jail ecological the Will';
%%
x = 'That computer programmer kept the room warm';
y_correct = 'warm room the kept programmer computer That';
%%
x = 'trivial';
y_correct = 'trivial';

lin2009 · posted 2013-6-26 18:08:14

Two steps: reverse the letters within each word, then reverse the whole string.

fliplr(regexprep(x,'(\<\w+\>)','${fliplr($1)}'))

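Notes on the code:
% newStr = regexprep(str,expression,replace); that is, new string = regexprep(target string, pattern of the text to find, replacement)
% The two $ signs do different jobs. ${command} is the regexprep syntax for running an extra command, where command is an executable MATLAB statement or function;
% $1 refers to the first token of the pattern in regexprep, which here is the word that was found.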
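
To see the two steps separately, here is a minimal sketch on the first test case. The intermediate names step1, y and u are made up for illustration, and the upper example is only there to show that any function can go inside ${...}.

```matlab
x = 'Will the ecological jail rule outside the tear';

% ${...} is evaluated once per match; $1 is the first token (the word).
step1 = regexprep(x, '(\<\w+\>)', '${fliplr($1)}');  % letters of each word reversed
y = fliplr(step1);                                   % then the whole string reversed

% Same mechanism with a different function, for illustration only.
u = regexprep('abc def', '(\w+)', '${upper($1)}');
```

Reversing every word and then the whole string leaves each word readable again but in the opposite order.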

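qibbxxt · posted 2013-6-26 21:39:18

Following lin2009's idea with a small rewrite; lin2009 has already explained what it means ($0 stands for the whole match, so no capturing parentheses are needed)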

fliplr(regexprep(x,'\w+','${fliplr($0)}'))

liuyalong008 · posted 2013-6-26 22:24:43

Last edited by liuyalong008 on 2013-6-26 22:30

Here's a longer one

[match,noMatch] = regexp(s1,' ','match','split');
strjoin(noMatch(end:-1:1),match);

regexprep(fliplr(s1),'(\w+)','${fliplr($1)}')

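nwcwww · posted 2013-6-27 01:49:24

I'm late; everything has already been said. For this problem the leading solution in my mind is also the dynamic expression + fliplr approach.

The conventional way looks something like this, and it isn't much trouble either:

y = strsplit(x);
z = strtrim(sprintf('%s ' ,y{end:-1:1})); % strjoin would also work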
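
For completeness, a sketch of the strjoin variant mentioned in the comment, run on the second test case. The variable name words is made up for this example.

```matlab
x = 'That computer programmer kept the room warm';

words = strsplit(x);               % split on whitespace
y = strjoin(fliplr(words), ' ');   % reverse the word order and rejoin

% A single word comes back unchanged.
t = strjoin(fliplr(strsplit('trivial')), ' ');
```

Because strjoin only puts the delimiter between elements, no strtrim is needed afterwards.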

liuyalong008 · posted 2013-6-29 17:25:23

Regex series: finding words
Problem 1376. Find names/words that start and end with the same letter.

Find names/words (from a string) that start and end with the same letter.
Case-insensitive.
If a name/word is not at the end of the string, it can be followed by a white-space or a comma.
Names/words contain only letters or dashes.
Underscores are NOT considered as letters. Words separated by underscores count as distinct words, e.g. in 'NAN_CONST' the 'NAN' is matched.
Words are at least two letters long, so e.g. 'a' is not matched.
Example:
in = 'Cedric loves regular expressions'
out = {'Cedric', 'regular'}

Test cases:

1
%%
inStr = 'Cedric loves regular expressions' ;
output_correct = {'Cedric', 'regular'} ;
2
%%
inStr = 'Single neuron Cedric, Anna-Maria, Andrei, a koala' ;
output_correct = {'neuron', 'Cedric', 'Anna-Maria'} ;
3
%%
inStr = '__dEdiCaTeD__, REGULAR_EXPRESSION.. Rotor-1 and abracadabra' ;
output_correct = { 'dEdiCaTeD', 'REGULAR', 'abracadabra'} ;

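liuyalong008 · posted 2013-6-30 11:08:37

The problem above is mainly about finding words whose first and last letters are the same. Note that \w also counts the underscore as a word character

ps: I've noticed that fewer people read the thread once it is pinned, so let's leave it unpinned for now

nwcwww · posted 2013-6-30 12:31:03

Last edited by nwcwww on 2013-6-30 12:34

It seems that on every forum, the more a thread is pinned, the less it gets read.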

regexpi(in, '(?<=_|\<)([a-z])[\w-]*\1(?=[\s_,.]|$)' , 'match');

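It can also be wrapped in a dynamic expression, going from 12 to 11.

The parts of the pattern:
(?<=_|\<): lookbehind; preceded by _, or at the start of a word
([a-z]): matches one letter and captures it
[\w-]*: a word character (letter, digit or underscore) or - matched zero or more times
\1: the previously captured letter appears again
(?=[\s_,.]|$): lookahead; followed by whitespace, an underscore, a comma or a period, or else at the end of the string.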
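
Two of these pieces, the \1 backreference and the trouble with underscores, can be seen in isolation. The patterns below are simplified illustrations written for this note, not the Cody solution above; w1, w2 and w3 are made-up names.

```matlab
% Backreference: \1 must repeat whatever the first token captured.
% regexpi ignores case, so 'A' and 'a' count as the same letter.
w1 = regexpi('Anna level koala', '\<([a-z])\w*\1\>', 'match');

% \w includes the underscore, so 'NAN_CONST' is a single "word" for
% \< and \>, and the 'NAN' inside it is not found with plain anchors.
w2 = regexpi('NAN_CONST', '\<([a-z])\w*\1\>', 'match');

% Replacing \> by a lookahead that also accepts '_' recovers it.
w3 = regexpi('NAN_CONST', '\<([a-z])[a-z]*?\1(?=_|\W|$)', 'match');
```

This is why the patterns in this part of the thread describe the word edges with lookarounds instead of \< and \>.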

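lin2009 · posted 2013-6-30 18:32:57

% In this problem a name/word consists only of upper- and lower-case letters and '-' (dash).
%
% Based on this basic property of a name/word (built from letters and '-', with [a-z-] as its character set), the following pattern is constructed:
%
% pattern = '(?<![a-z0-9-])([a-z])[a-z-]+\1(?![a-z0-9-])' ;
%
% where
% (?<![a-z0-9-])...(?![a-z0-9-]) requires that the characters before and after the name/word are neither in the name/word character set nor digits.
% In ...([a-z])...+\1..., ([a-z]) means the word starts with a letter and \1 means it ends with the same letter.
% 'match' returns the matched (found) strings, without the characters tested by the conditions before and after.
% (Lookaround assertions look for string patterns that immediately precede or follow the intended match,
% but are not part of the match.)
%
% Note that in the regular expressions of regexp and related functions, \w* also denotes a word ('\w*' identifies a word).
% But its definition of a word differs from the usual one (built from letters and hyphens, no digits), and differs even more from the name/word of this problem;
% it is closer to the definition of an identifier in ordinary programming languages (C, etc.).
% In MATLAB regular expressions the metacharacter \w stands for any letter, digit or underscore (Any alphabetic, numeric, or underscore character).
% For the English character set, \w is equivalent to [a-zA-Z_0-9].
% Also, just as \d stands for the digits 0-9, MATLAB ought to have a metacharacter for the 26 letters.
%
% Therefore \w cannot be used for the elements of a word here. The word-start and word-end anchors \< and \> also fail in this problem, so the start and end of a word have to be located by other means.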

pattern = '(?<![a-z0-9-])([a-z])[a-z-]+\1(?![a-z0-9-])' ;
inStr = 'Cedric loves regular expressions';
outStr = regexpi(inStr,pattern,'match')
inStr = 'Single neuron Cedric, Anna-Maria, Andrei, a koala';
outStr = regexpi(inStr,pattern,'match')
inStr = '__dEdiCaTeD__, REGULAR_EXPRESSION.. Rotor-1 and abracadabra';
outStr = regexpi(inStr,pattern,'match')

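qibbxxt · posted 2013-6-30 23:22:08

The reason \< and \> cannot be used in this problem comes from the _ and -1 in the third test case. Once those two are replaced, this works: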

regexpi(regexprep(inStr,{'_','-1'},{' ','a'}),...
'(\<\w)[\w-]*(\1\>)','match')

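This is a bit of a cheap trick, but it is still one way to approach it

liuyalong008 · posted 2013-7-1 14:44:58

Last edited by liuyalong008 on 2013-7-1 14:51

qibbxxt posted on 2013-6-30 23:22
The reason \< and \> cannot be used in this problem comes from the _ and -1 in the third test case. Once those two are replaced, this works. This is a bit of a cheap trick, but ...

Rather than replacing -1 with a, it might be better to replace it with a digit, so that it cannot be confused with words starting with a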

regexpi(regexprep(inStr,{'_','-1'},{' ','5'}),'(\<\w)[\w-]*(\1\>)','match')

Here's one of mine along the same lines:

regexpi(regexp(inStr,'[,_\s\.]','split'),'^(\w).*\1$','match')
[ans{:}]

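The string is split on commas, underscores, whitespace and periods, and then the first and last letters of each piece are matched. The drawback is that the results have to be extracted once more

I really like the solution given by the problem's author, and his remarks: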

function output = getWordsSameStartEnd( inStr )
% No "ans" trick, so it is still
% possible to take the lead!
% I built this problem to test..
% - 'tokens' with the requirement
% of matching 1st and last letters.
% - 'lookarounds' with the annoying
% requirement about underscores not
% being letters (so we cannot use
% \<\> directly).
output = regexpi(inStr, '(?<=(^|\W|_))(?<start>[a-zA-Z])[\w-]*?\k<start>(?=($|[\s,_]))', 'match') ;
end

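(?<=(^|\W|_)) is again the lookaround condition on what precedes the word
(?<start>[a-zA-Z])[\w-]*?\k<start> is rather neat
For how (?<start>[a-zA-Z]) is used, see #24~#28
[a-zA-Z] picks out the first letter; as nwcwww's post points out, [a-z] is enough with regexpi
[\w-]*? matches the characters in the middle lazily, taking as few as possible
\k<start> matches the content of the token named start; the general form is \k<name>
Overall this solution feels clean and easy to follow
You can also output tokens and names to have a look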

[m,t,n]=regexpi(inStr, '(?<=(^|\W|_))(?<start>[a-z])[\w-]*?\k<start>(?=($|[\s,_]))', 'match','tokens','names')

In addition, there are three ways to make a MATLAB regular expression case-insensitive:

'(?i)'
regexp(inStr, '(?i)(?<=(^|\W|_))(?<start>[a-z])[\w-]*?\k<start>(?=($|[\s,_]))', 'match')
or use
regexpi,
or
regexp(str,expression,'match','ignorecase')
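
A quick way to convince yourself that the three forms agree is to run them side by side on the first test case. This is only a sketch; expr is the author's pattern with [a-z] in place of [a-zA-Z], and r1, r2, r3 and same are made-up names.

```matlab
inStr = 'Cedric loves regular expressions';
expr  = '(?<=(^|\W|_))(?<start>[a-z])[\w-]*?\k<start>(?=($|[\s,_]))';

r1 = regexp(inStr, ['(?i)' expr], 'match');        % inline flag
r2 = regexpi(inStr, expr, 'match');                % case-insensitive function
r3 = regexp(inStr, expr, 'match', 'ignorecase');   % option argument

same = isequal(r1, r2, r3);                        % true if all three agree
```

Without any of the three, 'Cedric' would be missed, since 'C' and 'c' would no longer count as the same letter.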
