Utilities to manipulate text

Support for common text manipulation tasks such as stripping content in brackets.

clldutils.text.strip_brackets(text, brackets=None, strip_surrounding_whitespace=True)

Strip brackets and what is inside brackets from text.

>>> from clldutils.text import strip_brackets
>>> strip_brackets('outside <inside> (outside)', brackets={'<': '>'})
'outside  (outside)'

Note

If the text contains an opening bracket without a matching closing bracket, the rest of the text will be ignored. This is a feature, not a bug: it keeps the function from raising errors too easily on malformed input.

Parameters:
  • text (str) –

  • brackets (typing.Optional[dict]) –

  • strip_surrounding_whitespace (bool) –

Return type:

str
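The behaviour described in the note above can be illustrated with a minimal sketch. This is not the clldutils implementation: strip_brackets_sketch and its small default bracket mapping are hypothetical, and the sketch assumes that open brackets are tracked with a stack of expected closing tokens.

```python
def strip_brackets_sketch(text, brackets=None, strip_surrounding_whitespace=True):
    """Hypothetical re-implementation for illustration only."""
    # Assumed default; the library's own default mapping is larger.
    brackets = brackets or {'(': ')', '[': ']', '{': '}'}
    closers = []  # stack of closing tokens we are still waiting for
    kept = []
    for char in text:
        if closers:
            # Inside brackets: drop everything, but track nesting.
            if char == closers[-1]:
                closers.pop()
            elif char in brackets:
                closers.append(brackets[char])
        elif char in brackets:
            closers.append(brackets[char])
        else:
            kept.append(char)
    result = ''.join(kept)
    return result.strip() if strip_surrounding_whitespace else result


print(strip_brackets_sketch('kept (dropped, never closed'))
```

Because an unmatched opening bracket never leaves the stack, every character after it is treated as bracket content and dropped, and no error is raised.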

clldutils.text.split_text_with_context(text, separators='\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000', brackets=None)

Splits text at separators outside of brackets.

Parameters:
  • text (str) –

  • separators (str) – An iterable of single character tokens.

  • brackets (typing.Optional[dict]) –

Return type:

typing.List[str]

Returns:

A list of non-empty chunks.

>>> from clldutils.text import split_text_with_context
>>> split_text_with_context('split-me (but-not-me)', separators='-')
['split', 'me (but-not-me)']

Note

This function leaves content in brackets in the chunks.
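As a rough illustration of splitting only at separators outside of brackets, the following sketch walks the text once and keeps bracketed content inside the current chunk. The name split_outside_brackets_sketch and its default arguments are hypothetical; this is an assumption about one way to obtain the documented behaviour, not the library's code.

```python
def split_outside_brackets_sketch(text, separators=' ', brackets=None):
    """Hypothetical helper for illustration only."""
    brackets = brackets or {'(': ')', '[': ']', '{': '}'}  # assumed default
    closers = []   # stack of expected closing tokens
    chunk = []     # characters of the chunk being built
    chunks = []
    for char in text:
        if closers:
            # Inside brackets: separators lose their meaning.
            if char == closers[-1]:
                closers.pop()
            elif char in brackets:
                closers.append(brackets[char])
            chunk.append(char)
        elif char in brackets:
            closers.append(brackets[char])
            chunk.append(char)
        elif char in separators:
            chunks.append(''.join(chunk))
            chunk = []
        else:
            chunk.append(char)
    chunks.append(''.join(chunk))
    # Only non-empty chunks are returned.
    return [c for c in chunks if c]


print(split_outside_brackets_sketch('split-me (but-not-me)', separators='-'))
```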

clldutils.text.split_text(text, separators=re.compile('\\s'), brackets=None, strip=False)

Split text along the separators unless they appear within brackets.

Parameters:
  • text (str) –

  • separators (typing.Union[typing.Iterable, re.Pattern]) – An iterable of single characters or a compiled regex pattern.

  • brackets (typing.Optional[dict]) – dict mapping start tokens to end tokens of what is to be recognized as brackets.

  • strip (bool) –

Return type:

typing.List[str]

>>> from clldutils.text import split_text
>>> split_text('split-me (but-not-me)', separators='-')
['split', 'me']

Note

This function will also strip content within brackets.
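Since separators may be either an iterable of single characters or a compiled pattern, one plausible way to treat both uniformly is to turn the characters into a regex character class. The sketch below shows only that step; bracket handling is left out on purpose, and the names to_pattern_sketch and split_sketch are hypothetical, not part of clldutils.

```python
import re


def to_pattern_sketch(separators):
    """Return a compiled pattern for either kind of separators argument."""
    if isinstance(separators, re.Pattern):
        return separators
    # Escape each character so that e.g. '-' or ']' is taken literally.
    return re.compile('[{}]'.format(''.join(re.escape(c) for c in separators)))


def split_sketch(text, separators=re.compile(r'\s'), strip=False):
    """Split text and drop empty chunks (brackets are ignored here)."""
    parts = to_pattern_sketch(separators).split(text)
    if strip:
        parts = [part.strip() for part in parts]
    return [part for part in parts if part]


print(split_sketch('a, b;c', separators=',;', strip=True))
```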

clldutils.text.strip_chars(chars, sequence)

Strip the specified chars from anywhere in the text.

Parameters:
  • chars (typing.Iterable) – An iterable of single character tokens to be stripped out.

  • sequence (typing.Iterable) – An iterable of single character tokens.

Return type:

str

Returns:

Text string concatenating all tokens in sequence which were not stripped.
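The documented behaviour amounts to filtering a sequence of tokens and joining what remains. A minimal sketch, assuming nothing beyond the description above (strip_chars_sketch is a hypothetical name, not the library function):

```python
def strip_chars_sketch(chars, sequence):
    """Join all tokens of sequence that are not listed in chars."""
    unwanted = set(chars)  # set membership keeps the filter cheap
    return ''.join(token for token in sequence if token not in unwanted)


print(strip_chars_sketch('-_', 'a-b_c'))
```

Note that the characters are removed from anywhere in the sequence, unlike str.strip, which only touches both ends.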

clldutils.text.replace_pattern(pattern, repl, text, flags=0)

Works much like re.sub, except that repl is expected to be a generator function: the strings it yields are joined to form the replacement for each match.

Parameters:
  • pattern (typing.Union[str, re.Pattern]) – Compiled regex pattern or regex specified by str.

  • repl (typing.Callable[[re.Match], typing.Generator[str, None, None]]) – callable accepting a match instance as sole argument, yielding str as replacements for the match.

  • text (str) – str in which to replace the pattern.

  • flags – Flags suitable for passing to re.compile in case pattern is a str.

Return type:

str

Returns:

Text string with pattern replaced as implemented by repl.
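The relationship to re.sub can be sketched in a few lines. This is an assumption about how such a wrapper could look, not the clldutils source; replace_pattern_sketch and repeat_x are hypothetical names.

```python
import re


def replace_pattern_sketch(pattern, repl, text, flags=0):
    """Like re.sub, but repl yields the pieces of each replacement."""
    if isinstance(pattern, str):
        # flags only apply when the pattern still has to be compiled.
        pattern = re.compile(pattern, flags=flags)
    # Join whatever the generator yields for this match.
    return pattern.sub(lambda match: ''.join(repl(match)), text)


def repeat_x(match):
    # Yield one 'x' per unit of the matched number.
    for _ in range(int(match.group())):
        yield 'x'


print(replace_pattern_sketch('[0-9]+', repeat_x, 'a1b2c3'))
```

A generator that yields nothing removes the match altogether, which plain string replacements in re.sub would express as an empty string.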

>>> from clldutils.text import replace_pattern
>>> def multiply(m):
...     for i in range(int(m.string[m.start():m.end()])):
...         yield 'x'
...
>>> replace_pattern('[0-9]+', multiply, 'a1b2c3')
'axbxxcxxx'

clldutils.text.BRACKETS = {'(': ')', '[': ']', '{': '}', '«': '»', '⁽': '⁾', '₍': '₎', '『': '』', '【': '】', '（': '）'}

Brackets are pairs of single characters (<start-token>, <end-token>).

clldutils.text.WHITESPACE = '\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'

A string of all Unicode characters regarded as whitespace (by Python’s re module \s). See also http://stackoverflow.com/a/37903645
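A string like this can be derived by testing every code point against \s. The sketch below is one way to do that with the standard library; collect_whitespace_sketch is a hypothetical name, and the exact result may differ between Python versions because it depends on the Unicode data shipped with the interpreter.

```python
import re
import sys


def collect_whitespace_sketch():
    """Return all characters that the re module's \\s matches."""
    whitespace = re.compile(r'\s')
    return ''.join(
        chr(codepoint)
        for codepoint in range(sys.maxunicode + 1)
        if whitespace.match(chr(codepoint))
    )


found = collect_whitespace_sketch()
print(len(found), repr(found))
```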