I was trying to parse a string with pyparsing so that all the words are separated from the punctuation signs. I was using this expression to do it:
OneOrMore(Word(alphanums)) + OneOrMore(Char(printables))
But when I parse the following string with this expression:
return abc(1, ULLONG_MAX);
Every word after the opening parenthesis gets split into single characters:
['return', 'abc', '(', '1', ',', 'U', 'L', 'L', 'O', 'N', 'G', '_', 'M', 'A', 'X', ')', ';']
But if I use this expression:
OneOrMore(Word(alphanums)) + OneOrMore(Char(string.punctuation))
Only a part of the string gets parsed:
['return', 'abc', '(']
What is wrong with those expressions?
Personally, I would recommend using a regex for this instead, which also makes it easier to test your expressions. You could then get the list with:
import re
# \w already includes the underscore, so ULLONG_MAX stays a single word, if that's what you want
result = re.findall(r'\w+|\S', yourstring)
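To make the regex suggestion concrete, here is a minimal sketch run against the sample line from the question. The names `source` and `tokens` are made up for illustration, and the sample string assumes the trailing semicolon that the question's output suggests.

```python
import re

# Sample line from the question (trailing semicolon assumed).
source = "return abc(1, ULLONG_MAX);"

# \w+ grabs runs of letters, digits and underscores;
# \S picks up any other single non-whitespace character.
tokens = re.findall(r"\w+|\S", source)

print(tokens)
# ['return', 'abc', '(', '1', ',', 'ULLONG_MAX', ')', ';']
```

Because `\w+` comes first in the alternation, words are consumed whole, and `\S` only handles whatever is left over, one character at a time.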
As for what’s wrong with your expressions:
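First expression: once the parser hits (, OneOrMore(Char(printables)) takes over and keeps matching every printable character, one at a time. Letters and digits are printable too, so the remaining words get split into single characters. Instead, use alternation (|) with the alphanumeric word first so that it takes priority:
OneOrMore(Word(alphanums) | Char(printables))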
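As a quick check of the alternation approach, here is a minimal sketch run against the sample line. The names `source`, `tokenizer`, and `tokens` are made up for illustration, the sample string assumes a trailing semicolon, and `Char` assumes pyparsing 2.3 or later.

```python
from pyparsing import Char, OneOrMore, Word, alphanums, printables

# Sample line from the question (trailing semicolon assumed).
source = "return abc(1, ULLONG_MAX);"

# Try a whole alphanumeric word first; fall back to a single
# printable character only when no word matches at this position.
tokenizer = OneOrMore(Word(alphanums) | Char(printables))

tokens = tokenizer.parseString(source).asList()
print(tokens)
# ['return', 'abc', '(', '1', ',', 'ULLONG', '_', 'MAX', ')', ';']
```

The underscore is not part of `alphanums`, so it still comes out as its own token here.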
Second expression: you're running into the same issue with your use of +. Once Char(string.punctuation) takes over, it keeps matching punctuation until it reaches a character that is not punctuation, and then parsing stops. That is why you only get ['return', 'abc', '(']: the 1 after the parenthesis is not punctuation. Instead you can write:
parser = OneOrMore(Word(alphanums) | Word(string.punctuation))
result = parser.parseString(yourstring)
Do note that the underscore counts as punctuation, so ULLONG_MAX will be split. I'm not sure whether that's what you want.
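If you would rather keep identifiers such as ULLONG_MAX in one piece, one possible variation (a sketch, not the only way to do it) is to add the underscore to the word characters. The names `identifier`, `parser`, and `tokens` are hypothetical. This sketch also uses `Char` rather than `Word` for punctuation, since `Word(string.punctuation)` would join adjacent signs such as `);` into a single token.

```python
import string

from pyparsing import Char, OneOrMore, Word, alphanums

# Sample line from the question (trailing semicolon assumed).
source = "return abc(1, ULLONG_MAX);"

# Treat the underscore as a word character so identifiers stay whole.
identifier = Word(alphanums + "_")

# Char matches exactly one punctuation sign, so ")" and ";" stay separate.
parser = OneOrMore(identifier | Char(string.punctuation))

tokens = parser.parseString(source).asList()
print(tokens)
# ['return', 'abc', '(', '1', ',', 'ULLONG_MAX', ')', ';']
```

Since the identifier alternative is tried first, the underscore is consumed as part of the word before the punctuation alternative ever sees it.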