Utilities to manipulate text
Support for common text manipulation tasks, such as stripping content in brackets.
- clldutils.text.strip_brackets(text, brackets=None, strip_surrounding_whitespace=True)
Strip brackets and what is inside brackets from text.
>>> from clldutils.text import strip_brackets
>>> strip_brackets('outside <inside> (outside)', brackets={'<': '>'})
'outside  (outside)'
Note
If the text contains an opening bracket with no matching closing bracket, everything after that opening bracket is ignored. This is intentional: the function is meant to tolerate malformed input rather than raise errors.
- Parameters:
  - text (str)
  - brackets (typing.Optional[dict])
  - strip_surrounding_whitespace (bool)
- Return type:
str
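The behaviour described in the note can be pictured with a small stack-based scan. The following is a minimal sketch under that assumption, not the clldutils implementation: `strip_brackets_sketch` and `DEFAULT_BRACKETS` are hypothetical names, and only three bracket pairs are included.

```python
# Hypothetical stand-in for illustration only; not the clldutils implementation.
DEFAULT_BRACKETS = {"(": ")", "[": "]", "{": "}"}


def strip_brackets_sketch(text, brackets=None, strip_surrounding_whitespace=True):
    """Drop bracketed spans, tracking the expected closing tokens on a stack."""
    brackets = DEFAULT_BRACKETS if brackets is None else brackets
    expected = []  # closing tokens we are still waiting for
    kept = []
    for char in text:
        if char in brackets:
            expected.append(brackets[char])  # entering a bracketed span
        elif expected and char == expected[-1]:
            expected.pop()  # leaving the innermost span
        elif not expected:
            kept.append(char)  # only text outside all brackets is kept
    result = "".join(kept)
    return result.strip() if strip_surrounding_whitespace else result


balanced = strip_brackets_sketch("a (b) c")
unbalanced = strip_brackets_sketch("a (b c")
print(repr(balanced))
print(repr(unbalanced))
```

In this sketch an unmatched opening bracket never leaves the stack, so all remaining characters are skipped and no exception is raised.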
- clldutils.text.split_text_with_context(text, separators='\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000', brackets=None)
Splits text at separators outside of brackets.
- Parameters:
  - text (str)
  - separators (str) – An iterable of single character tokens.
  - brackets (typing.Optional[dict])
- Return type:
typing.List[str]
- Returns:
A list of non-empty chunks.
>>> from clldutils.text import split_text_with_context
>>> split_text_with_context('split-me (but-not-me)', separators='-')
['split', 'me (but-not-me)']
Note
This function leaves content in brackets in the chunks.
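Splitting at separators outside of brackets can be sketched with the same kind of stack-based scan. This is an illustrative stand-in rather than the library's code; `split_outside_brackets` is a hypothetical name, and the default bracket pairs are an assumption.

```python
def split_outside_brackets(text, separators, brackets=None):
    """Split text at separator characters that are not inside brackets."""
    brackets = {"(": ")", "[": "]", "{": "}"} if brackets is None else brackets
    expected = []  # closing tokens of the brackets currently open
    chunks, current = [], []
    for char in text:
        if char in brackets:
            expected.append(brackets[char])
            current.append(char)  # bracket content stays in the chunk
        elif expected and char == expected[-1]:
            expected.pop()
            current.append(char)
        elif not expected and char in separators:
            chunks.append("".join(current))  # separator outside brackets: cut here
            current = []
        else:
            current.append(char)
    chunks.append("".join(current))
    return [chunk for chunk in chunks if chunk]  # keep non-empty chunks only


parts = split_outside_brackets("a,b [c,d],e", separators=",")
print(parts)
```

The comma inside the square brackets is treated as ordinary content, so the bracketed span stays attached to its chunk.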
- clldutils.text.split_text(text, separators=re.compile('\\s'), brackets=None, strip=False)
Split text along the separators unless they appear within brackets.
- Parameters:
  - text (str)
  - separators (typing.Union[typing.Iterable, re.Pattern]) – An iterable of single characters or a compiled regex pattern.
  - brackets (typing.Optional[dict]) – dict mapping start tokens to end tokens of what is to be recognized as brackets.
  - strip (bool)
- Return type:
typing.List[str]
>>> from clldutils.text import split_text
>>> split_text('split-me (but-not-me)', separators='-')
['split', 'me']
Note
This function will also strip content within brackets.
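Since separators may be either an iterable of single characters or a compiled pattern, one plausible way to treat both uniformly is to turn the characters into a regex character class. The sketch below covers only that separator handling and leaves brackets aside; `compile_separators` and `split_nonempty` are hypothetical helpers, not part of clldutils.

```python
import re


def compile_separators(separators):
    """Return a compiled pattern for a regex or an iterable of characters."""
    if isinstance(separators, re.Pattern):
        return separators
    # Escape each character so that e.g. '-' or ']' is taken literally
    # inside the character class.
    return re.compile("[" + "".join(re.escape(c) for c in separators) + "]")


def split_nonempty(text, separators):
    """Split text on the separators and drop empty chunks."""
    return [chunk for chunk in compile_separators(separators).split(text) if chunk]


by_chars = split_nonempty("a-b_c", "-_")
by_regex = split_nonempty("a  b", re.compile(r"\s"))
print(by_chars, by_regex)
```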
- clldutils.text.strip_chars(chars, sequence)
Strip the specified chars from anywhere in the text.
- Parameters:
  - chars (typing.Iterable) – An iterable of single character tokens to be stripped out.
  - sequence (typing.Iterable) – An iterable of single character tokens.
- Return type:
str
- Returns:
Text string concatenating all tokens in sequence which were not stripped.
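The documented behaviour amounts to filtering a sequence of tokens and joining what remains. A minimal sketch of that idea follows; `strip_chars_sketch` is a hypothetical stand-in, not the library function.

```python
def strip_chars_sketch(chars, sequence):
    """Join all tokens of sequence that do not occur in chars."""
    unwanted = set(chars)  # a set gives constant-time membership tests
    return "".join(token for token in sequence if token not in unwanted)


cleaned = strip_chars_sketch("-_", "a-b_c")
print(cleaned)
```

Unlike str.strip, which only trims both ends, this removes the characters wherever they occur.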
- clldutils.text.replace_pattern(pattern, repl, text, flags=0)
Works much like re.sub, except that the replacement callable is expected to be a generator function that yields strings.
- Parameters:
  - pattern (typing.Union[str, re.Pattern]) – Compiled regex pattern or regex specified by str.
  - repl (typing.Callable[[re.Match], typing.Generator[str, None, None]]) – callable accepting a match instance as sole argument, yielding str as replacements for the match.
  - text (str) – str in which to replace the pattern.
  - flags – Flags suitable for passing to re.compile in case pattern is a str.
- Return type:
str
- Returns:
Text string with pattern replaced as implemented by repl.
>>> from clldutils.text import replace_pattern
>>> def multiply(m):
...     for i in range(int(m.string[m.start():m.end()])):
...         yield 'x'
...
>>> replace_pattern('[0-9]+', multiply, 'a1b2c3')
'axbxxcxxx'
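The relationship to re.sub can be sketched by joining whatever the generator yields for each match. This is one plausible reading of the description above, not the library's code; `replace_pattern_sketch` and `repeat_letter` are hypothetical names.

```python
import re


def replace_pattern_sketch(pattern, repl, text, flags=0):
    """Substitute each match with the joined strings yielded by repl(match)."""
    if isinstance(pattern, str):
        # flags are only applied when the pattern still has to be compiled
        pattern = re.compile(pattern, flags)
    return pattern.sub(lambda match: "".join(repl(match)), text)


def repeat_letter(match):
    """Yield the captured letter as many times as the captured number says."""
    letter, count = match.group(1), int(match.group(2))
    for _ in range(count):
        yield letter


expanded = replace_pattern_sketch(r"([a-z])([0-9]+)", repeat_letter, "a3b1c2")
print(expanded)
```

A generator that yields nothing results in the match being replaced by the empty string.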
- clldutils.text.BRACKETS = {'(': ')', '[': ']', '{': '}', '«': '»', '⁽': '⁾', '₍': '₎', '『': '』', '【': '】', '（': '）'}
Brackets are pairs of single characters, mapping <start-token> to <end-token>.
- clldutils.text.WHITESPACE = '\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'
A string of all Unicode characters regarded as whitespace by Python's re module (that is, matched by \s). See also http://stackoverflow.com/a/37903645
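The claim about \s can be checked directly with the standard library. The sketch below copies the characters from the constant shown above into a local name (`WHITESPACE_CHARS`, chosen for this example) and tests each one against the re module.

```python
import re

# Copied from the constant shown above; the name is local to this sketch.
WHITESPACE_CHARS = (
    "\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680"
    "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a"
    "\u2028\u2029\u202f\u205f\u3000"
)

# Every character in the string should be matched by \s on str patterns.
all_match = all(re.fullmatch(r"\s", char) for char in WHITESPACE_CHARS)

# The string can also serve as the argument to str.strip.
stripped = "\u3000text\u2009".strip(WHITESPACE_CHARS)
print(all_match, repr(stripped))
```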