[Discussion] Regular Expression Series

lin2009 · posted 2013-7-1 20:06:46

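This problem is poorly posed, and the author's solution does not cover all the cases either.
1. The problem needs to define clearly what may come before and after a name/word.
The only statement given is about what follows, and that is not enough: "If a name/word is not at the end of the string, it can be followed by a white-space or a comma."
So what exactly delimits a name/word on either side? We can only fall back on convention.
2. The test examples do not cover all the situations the problem implies either; see below.
4. The solution misses other cases: it excludes words that should be counted. Try the following example.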

inStr= 'How about Rotor? Yes, Rotor!'
output = regexpi(inStr, '(?<=(^|\W|_))(?<start>[a-zA-Z])[\w-]*?\k<start>(?=($|[\s,_]))', 'match')


output =
{}

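The solution also wrongly accepts strings such as 'abc8bca' that do not meet the requirement (Names/words contain only letters or dashes), because \w also matches digits.
5. The named token (?<start>[a-zA-Z]) with \k<start> only has an advantage over a plain (..) with \1 when the token has to be returned in the output. Using it here feels like overkill.
Of course, if the code only has to handle the three test examples, none of the above is a problem, haha.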
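
To illustrate point 5, here is a minimal sketch showing that a numbered token with \1 finds the same words as the named token with \k<start>. The test string and the simplified word-boundary pattern are my own assumptions, not the Cody solution.

```matlab
% Hypothetical test string; lowercase only, so plain regexp is enough.
str = 'a rotor and a level are not a motor';

% Named token: (?<start>...) captures a letter, \k<start> refers back to it.
named = regexp(str, '\<(?<start>[a-z])[a-z]*\k<start>\>', 'match');

% Numbered token: (...) captures a letter, \1 refers back to it.
numbered = regexp(str, '\<([a-z])[a-z]*\1\>', 'match');

% Both calls look for words that start and end with the same letter.
disp(isequal(named, numbered))
```

The named form pays off mainly when the 'names' output is requested, since the captured letter can then be read from a struct field.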

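bainhome · posted 2013-7-2 10:23:48

Last edited by bainhome on 2013-7-2 12:20

Problem 8: Remove all the consonants
Given a string, remove the consonants and only the consonants. See the original Cody link here.
Problem statement: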

Remove all the consonants in the given phrase.
Example:
Input s1 = 'Jack and Jill went up the hill';
Output s2 is 'a a i e u e i';

Test cases:

%% 1
s1 = 'Jack and Jill went up the hill';
s2 = 'a a i e u e i';
assert(isequal(s2,refcn(s1)))


%% 2
s1 = 'I don''t want to work. I just want to bang on the drum all day.';
s2 = 'I o'' a o o. I u a o a o e u a a.';
assert(isequal(s2,refcn(s1)))

This problem probably has many solutions too.

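bainhome · posted 2013-7-2 10:35:21

Last edited by bainhome on 2013-7-7 14:28

The problems everyone has posted are quite challenging, so here are a few easy ones for a break. My notes currently go up to problem 5. The last two problems on the forum feel like a big jump from the earlier ones and I struggle with them myself, so I won't summarize them for now, sorry.
There are too few basic problems, only two so far, so I plan to post three in a row to fill in the fundamentals. I'll keep adding problems I find good later on. For convenience they stay in this post for now; I have deleted the older versions above.

If I'm not mistaken, this is the first time the forum has discussed regular expressions through worked problems, so both systematic coverage and the basics need attention. Let's try to turn this into a classic thread that later readers pick first when learning regular expressions.
PS: everyone has thoughtfully explained their expressions in detail in the examples. Kudos for that!
Also, could someone with spare time explain the tokens, match and names options with a simple example?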
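
As a starting point for that request, here is a minimal sketch of the three output options of regexp. The input string and the token names key and val are made up for illustration.

```matlab
% Hypothetical input: key=value pairs separated by commas.
str  = 'width=10, height=25';
expr = '(?<key>[a-z]+)=(?<val>\d+)';

% 'match'  : the full text of each match
% 'tokens' : the text captured by each pair of parentheses, one cell per match
% 'names'  : a struct array whose fields are the named tokens
[m, t, n] = regexp(str, expr, 'match', 'tokens', 'names');

disp(m{1})       % full text of the first match
disp(t{1}{2})    % second token of the first match
disp(n(2).key)   % named token 'key' of the second match
```

The outputs come back in the order the options are listed, so the same call can be trimmed to whichever of the three is needed.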

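liuyalong008 · posted 2013-7-2 12:45:45

bainhome wrote on 2013-7-2 10:23
Given a string, remove the consonants and only the consonants. See the original Cody link here.
Problem statement: Test cases: This problem probably has many solutions too.
...

Here is a non-regex one: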

s1(ismember(lower(s1),setdiff(97:122,'aieou')))=[]

setdiff(97:122,'aieou') gives all the lowercase consonants.
ismember then picks out the indices of every consonant in s1,
and assigning [] to those positions deletes them.

liuyalong008 · posted 2013-7-2 13:38:29

Last edited by liuyalong008 on 2013-7-2 13:57

And a regex one:

s1(regexp(s1,'(?i)[^aeiou\W]'))=[]

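After using the 'match' option so often, I had overlooked the default 'start' output.
regexp(s1,'(?i)[^aeiou\s\W]') returns exactly the indices of the consonants.

In this example the expression for consonants is built in two steps:
1 Exclude aeiou
2 Exclude the space \s and other punctuation \W (\s is already covered by \W)
Put together this gives [^aeiou\W]

The MATLAB help has a classic way of constructing consonants: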

'(?=[a-z])[^aeiou]'

'(?=str1)str2' acts as a logical AND: the character must satisfy both str1 (checked by the lookahead) and str2.
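
A small sketch of that AND behaviour, on a made-up string that mixes letters with a digit, spaces and an underscore:

```matlab
% Hypothetical string with letters, a digit, an underscore and spaces.
s = 'Up 2 you_';

% (?=[a-z]) requires the next character to be a letter (checked without
% consuming it), and [^aeiou] requires that same character to be a non-vowel.
% regexpi makes both tests case-insensitive.
consonants = regexpi(s, '(?=[a-z])[^aeiou]', 'match');

% Without the lookahead, spaces, the digit and the underscore match as well.
nonVowels = regexpi(s, '[^aeiou]', 'match');

disp(consonants)
disp(numel(nonVowels))
```

Only the characters that pass both conditions end up in consonants; dropping the lookahead lets every non-vowel character through.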

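lin2009 · posted 2013-7-2 18:19:01

Last edited by lin2009 on 2013-7-2 18:20

"liuyalong008 wrote on 2013-7-2 13:38
And a regex one: After using the 'match' option so often, I had overlooked the default 'start' output.
regexp(s1,'(?i)[^aeiou\s\W]') returns exactly the ..." (The "advanced mode" for replies is too hard to use!)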

s1(regexp(s1,'(?i)[^aeiou\W]'))=[]

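This deletes more than just the consonants, because [^aeiou\W] covers not only consonants but also digits and the underscore. Also, regexpi could be used instead of regexp with the (?i) flag; the two are equivalent here.
Look at this example:
s1 = ['Tiangong-1 is China''s first space station,',...
'an experimental testbed to demonstrate orbital rendezvous and docking capabilities.'];
s1(regexp(s1,'(?i)[^aeiou\W]'))=[]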

s1 =
iao- i ia' i ae aio,a eeiea ee o eoae oia eeou a oi aaiiie.

The MATLAB help has a classic way of constructing consonants:
'(?=[a-z])[^aeiou]'
It is indeed a classic, and more direct: the digit 1 is now kept.

s1 = ['Tiangong-1 is China''s first space station,',...
'an experimental testbed to demonstrate orbital rendezvous and docking capabilities.'];
s1(regexpi(s1,'(?=[a-z])[^aeiou]'))=[]

s1 =
iao-1 i ia' i ae aio,a eeiea ee o eoae oia eeou a oi aaiiie.

nwcwww · posted 2013-7-2 18:33:27

A regexprep solution, though essentially the same idea:
s2 = regexprep(s1, '(?=\w)[^aeiouAEIOU]', '');

qibbxxt · posted 2013-7-3 08:47:47

regexprep(s1,'(?i)[b-df-hj-np-tv-z]','');

Everything that needs saying has already been said above.

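bainhome · posted 2013-7-4 01:45:48

Last edited by bainhome on 2013-7-4 01:49

Problem 9: Extract the numbers from text and sum them
Extract the numbers from the given text and add them up. For the original Cody link see: Problem 57. Summing Digits within Text
Problem statement: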

Given a string with text and digits, add all the numbers together.
Examples:
Input str = '4 and 20 blackbirds baked in a pie'
Output total is 24
Input str = '2 4 6 8 who do we appreciate?'
Output total is 20

Test case 1:

%% 1
str = '4 and 20 blackbirds baked in a pie';
total = 24;
assert(isequal(number_sum(str),total))

Test case 2:

%% 2
str = '2 4 6 8 who do we appreciate?';
total = 20;
assert(isequal(number_sum(str),total))

Test case 3:

%% 3
str = 'He worked at the 7-11 for $10 an hour';
total = 28;
assert(isequal(number_sum(str),total))

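The write-up for the previous problem is still being edited. You are all formidable, coming up with so many solutions that writing these notes has become a happy headache...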

nwcwww · posted 2013-7-4 02:57:05

The conventional approach:

sum(str2double(regexp(str, '(\d+)', 'match')))

The solution by the great Alfonso:

str2num(regexprep(str,'[^\d]+','+0'));

Quite ingenious.
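
To see why the second one works, it helps to look at the intermediate string. A minimal sketch using test case 1 (the variable names expr and total are mine):

```matlab
str = '4 and 20 blackbirds baked in a pie';

% Every run of non-digit characters becomes '+0', so the text turns into
% an arithmetic expression.
expr = regexprep(str, '[^\d]+', '+0');
disp(expr)

% str2num evaluates the expression, which performs the addition.
total = str2num(expr);
disp(total)
```

Because str2num evaluates its input, this trick is best kept to strings you trust.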

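lin2009 · posted 2013-7-4 21:21:17

nwcwww wrote on 2013-6-20 10:31
Let me reorganize the code from posts 17 and 18. The two are really the same thing. I'll write it out in more detail in this post for those who are just getting started. Everyone else, in ...

The original problem is too far back, so I'll quote it directly:
Build a function with two input arguments: a string and a word length (number of letters),
that outputs a vector of counts of the 26 letters of the alphabet, specific to words with a given length.

Everyone's code was verified against the following example.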
txt = 'Hello World, from MATLAB';
n1 = 5; % the character after n is the digit 1.

So if I wrote the code from post 17 in my usual style, it would probably look like this:
a = regexp(lower(txt),'\<\w{5}\>','match');
b = strcat(a{:});
Count_all = arrayfun(@(x) length(regexp(b, char(96+x))), 1:26);

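I really like the code above: it is a model of combining arrayfun and regexp, two of the more flexible functions in MATLAB.
In post #23 qibbxxt mentioned accumarray and wrote code with it, but it is a little complicated.

accumarray(reshape(char(regexp(lower(txt),sprintf('\\<\\w{%d}\\>',n1),'match'))-'`',[],1),1,[26,1])'

Here it is rewritten using another form of the accumarray call.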

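txt = 'Hello World, from MATLAB'; % input argument
n1 = 5; % input argument
pattern = ['\<\w{',num2str(n1),'}\>']; % sprintf() would also work, but this form is easier to read here.
a = regexpi(txt,pattern,'match'); % a = {'Hello', 'World'}; regexpi keeps the original case. For lowercase output write a = regexp(lower(txt),pattern,'match').
% I don't know what rule is used to compute code size; code split across lines should count the same as code merged into one line, and splitting should be encouraged.
b = strcat(a{:}); % b = 'HelloWorld'; b = [a{:}] works directly too.
subs = arrayfun(@(x) regexpi(char(97:122),x),b);
% Swap the roles of the pattern [a-z] and the searched string b: this returns, for each letter of b, its position in the string 'a'..'z'.
% The idea is the same as qibbxxt's.
% subs =
% 8 5 12 12 15 23 15 18 12 4
Count_all = accumarray(subs',1,[26,1])' % count how often each position (that is, each letter) occurs.
% Combined into one expression:
% Count_all = accumarray(arrayfun(@(x) regexpi(char(97:122),x),b)',1,[26,1])' % four nested functions.
% qibbxxt uses sprintf(...) to bring in the dynamic parameter n1, which deepens the nesting, adds complexity and hurts readability.
% My personal view: concise code must still be readable, and for most people three or four nested calls in one line is the limit.
% Also, the code by liuyalong008 in post #22 is probably the most concise.
txt = 'Hello World, from MATLAB'; % input argument
n1 = 5; % input argument
pattern = ['\<\w{',num2str(n1),'}\>']; % or written as sprintf('\\<\\w{%d}\\>',n1)
a = regexp(lower(txt),pattern,'match'); % lowercase first so that histc over 97:122 counts every letter
Count_all = histc([a{:}],97:122)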


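nwcwww · posted 2013-7-4 21:55:54

lin2009 wrote on 2013-7-4 21:21
The original problem is too far back, so I'll quote it directly:
Build a function with two input arguments: a string and a word ...

Size is computed from the number of nodes in the M-code parse tree. Splitting code across statements, or declaring extra variables, does increase the size: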

function a = sizetestbody(txt, n1)
pattern = ['\<\w{',num2str(n1),'}\>'];
a = regexpi(txt,pattern,'match');
end
%the size is 25


function a = sizetestbody(txt, n1)
['\<\w{',num2str(n1),'}\>'];
a = regexpi(txt,ans,'match');
end
%the size is 24


function a = sizetestbody(txt, n1)
a = regexpi(txt,['\<\w{',num2str(n1),'}\>'],'match');
end
%the size is 21


function ans = sizetestbody(txt, n1)
regexpi(txt,['\<\w{',num2str(n1),'}\>'],'match');
end
%the size is 19


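bainhome · posted 2013-7-5 20:18:15

Last edited by bainhome on 2013-7-5 20:19

Problem 10: Trimming Spaces
See the original Cody link here.
Strip the extra spaces, and only the spaces, from both ends of a string (other characters are left alone).
Problem statement: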

Given a string, remove all leading and trailing spaces (where space is defined as ASCII 32).

Test case 1:

%% 1
a = 'no extra spaces';
b = 'no extra spaces';
assert(isequal(b,removeSpaces(a)))

Test case 2:

%% 2
a = ' lots of space in front';
b = 'lots of space in front';

Test case 3:

%% 3
a = 'lots of space in back ';
b = 'lots of space in back';

Test case 4:

%% 4
a = ' space on both sides ';
b = 'space on both sides';

Test case 5:

%% 5
a = sprintf('\ttab in front, space at end ');
b = sprintf('\ttab in front, space at end');

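The sneaky part of this problem is test case 5.
I have updated the document with my own take on the previous two problems (see the PDF file in post #43). It is meant to help beginners organize their thinking; experts can simply smile and move on.
PS: are the solutions from nwcwww and Alfonso the only ones for problem 9? Are there any other approaches? It looks a bit lonely and abrupt. In any case, my job of posting basic problems is done and you have all had enough rest, so I look forward to further discussion.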

qibbxxt · posted 2013-7-6 09:45:58

Adding one for problem 9:

sum(str2num(regexprep(str,'[^0-9]',blanks(1))))

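Approach:
Replace every character that is not 0-9 with a space,
then convert the result to a numeric array with str2num,
and finally add it up with sum.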

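liuyalong008 · posted 2013-7-6 11:01:33

bainhome wrote on 2013-7-4 01:45
Extract the numbers from the given text and add them up. For the original Cody link see: Problem 57. Summing Digits within Text
Problem ...

Here is my answer to problem 9 as well.
Finding a truly different route is hard.
Mine is similar to moderator qi's: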

str(~ismember(str,[32 48:57]))=32
sum(str2num(str))

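Mine is more cumbersome, though.
First turn every character other than spaces and digits into a space,
then convert and sum.

One more point to consider is whether decimals should be handled as well.
If so, it can be changed to: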

str(~ismember(str,[32 46 48:57]))=32
sum(str2num(str))


str='.2 4 6 8 who do we appreciate?'
>> str(~ismember(str,[32 46 48:57]))=32
sum(str2num(str))
str =
.2 4 6 8
ans =
18.2


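liuyalong008 · posted 2013-7-6 12:31:37

bainhome wrote on 2013-7-5 20:18
See the original Cody link here.
Strip the extra spaces, and only the spaces, from both ends of a string (other characters are left alone).
Problem statement: Test case 1: Test case 2 ...

Test case 5 is sneaky indeed:
The main point is \s --> Any white-space character; equivalent to [ \f\n\r\t\v]
so \t has to be excluded.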

regexprep(a,'^(?=[^\t])\s+|\s+$','')

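^(?=[^\t])\s+ matches leading white-space only when the first character is not \t; ^ anchors the match at the start of the string.
Of course this pattern only guards against a tab at the very beginning; guarding the end as well is another matter.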
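
The difference between \s and a literal space can be seen on a tiny made-up string (the names ws, idxAny and idxSpace are only for illustration):

```matlab
% Hypothetical string: a space, a tab and a newline.
ws = sprintf(' \t\n');

idxAny   = regexp(ws, '\s');   % start indices of any white-space character
idxSpace = regexp(ws, ' ');    % start indices of the plain space (ASCII 32) only

disp(idxAny)
disp(idxSpace)
```

So a pattern that must leave tabs alone can simply use a literal space where \s would otherwise go.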

lin2009 · posted 2013-7-6 16:49:26

Consolidating and tidying up the answers to problem 9.
str = '4 and 20 blackbirds baked in a pie';

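sum(str2double(regexp(str,'\d+','match')))  % \d+ needs no parentheses around it. Replacing str2double with str2num fails here because str2num does not accept a cell array; the two differ in usage.
str2num(regexprep(str,'\D+','+0')) % Turn every run of non-digits into +0 so the whole string becomes an addition expression, which saves the explicit sum. A clever idea. \D is more direct than [^\d].

sum(eval(sprintf('[%s]',regexprep(str,'\D+',' ')))) % Use sprintf to build a row-vector literal, then sum.
sum(str2num(regexprep(str,'\D+',' ')))             % Use str2num to get a row vector; \D is more direct than [^0-9] and may be slightly faster.

str(~ismember(str,[32 48:57]))=32;                % Keep the digits 0-9 (and spaces) by ASCII code. A different route, but ismember is a set operation and may cost more time.
sum(str2num(str))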

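lin2009 · posted 2013-7-6 16:54:34

Last edited by lin2009 on 2013-7-6 18:19

Problem 10:
The key is to build a suitable pattern that removes only the (half-width) spaces (ASCII 32) at the start and end of the string.
pat = '^ *([^ ].*[^ ]) *$'; % The string may have spaces at either end. What remains starts and ends with a non-space character.
c = regexprep(a,pat,'$1');
assert(isequal(b,c))
Verification:

pat = '^ *([^ ].*[^ ]) *$';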
%% 1
a = 'no extra spaces';
b = 'no extra spaces';
c = regexprep(a,pat,'$1');
assert(isequal(b,c))
%% 2
a = ' lots of space in front';
b = 'lots of space in front';
c = regexprep(a,pat,'$1');
assert(isequal(b,c))
%% 3
a = 'lots of space in back ';
b = 'lots of space in back';
c = regexprep(a,pat,'$1');
assert(isequal(b,c))
%% 4
a = ' space on both sides ';
b = 'space on both sides';
c = regexprep(a,pat,'$1');
assert(isequal(b,c))
%% 5
a = sprintf('\ttab in front, space at end ');
b = sprintf('\ttab in front, space at end');
c = regexprep(a,pat,'$1');
assert(isequal(b,c))
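
One edge case outside the five tests is a string with a single non-space character, since [^ ].*[^ ] needs at least two. A hedged sketch of that case follows; patAlt is a hypothetical alternative using a lazy token, not part of the original answer.

```matlab
pat    = '^ *([^ ].*[^ ]) *$';   % pattern from the post above
patAlt = '^ *(.*?) *$';          % hypothetical alternative with a lazy token

a = ' x ';                       % only one non-space character

c1 = regexprep(a, pat, '$1');    % pat finds no match, so a comes back unchanged
c2 = regexprep(a, patAlt, '$1'); % the lazy token keeps just the 'x'

disp(['[' c1 ']'])
disp(['[' c2 ']'])
```

The lazy form still treats only ASCII 32 as trimmable, so a leading tab as in test case 5 is kept.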


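liuyalong008 · posted 2013-7-7 11:11:12

Last edited by liuyalong008 on 2013-7-7 11:39

bainhome wrote on 2013-7-5 20:18
See the original Cody link here.
Strip the extra spaces, and only the spaces, from both ends of a string (other characters are left alone).
Problem statement: Test case 1: Test case 2 ...

It just occurred to me that these test cases actually have a bug too.
If you don't believe it, try these: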

regexprep(a,'\s{2,}','')
a( cumsum(deblank(a)-32 )~=0)
deblank(strjust(a,'left'))

The following test case should be added:

a = sprintf('\ttab in front, space at end\t ')

Of course, even with such a test case lin's code comes through unscathed.
:lol

bainhome · posted 2013-7-7 11:22:10

Last edited by liuyalong008 on 2013-7-7 13:27

I did it with two regexprep calls, which keeps the cost of the regex matching a bit lower.

regexprep(regexprep(a,'^ *',''),' *$','')

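If the problem did not insist that the tab character is not a space, the high-level strtrim command, a function made specifically for trimming leading and trailing white-space, would do as well.
PS: why can't I edit the closing dollar sign in my code? It keeps showing up many times over. Weird. There should be only one, just to be clear.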
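
A quick sketch of why strtrim does not pass test case 5 while the nested regexprep does (the variable names viaStrtrim and viaRegexp are only for illustration):

```matlab
a = sprintf('\ttab in front, space at end ');   % test case 5 from the problem

viaStrtrim = strtrim(a);                                 % strips all leading/trailing white-space
viaRegexp  = regexprep(regexprep(a,'^ *',''),' *$','');  % strips only ASCII 32

disp(double(viaStrtrim(1)))   % character code of the first character
disp(double(viaRegexp(1)))    % 9 would mean the leading tab is still there
```

strtrim treats the tab as white-space and removes it, which is exactly what the problem statement rules out.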
