Repair line breaks within a field of a delimited file
Delimited files sometimes arrive with line break characters (carriage return and/or line feed) inside a field that is not quoted. I previously wrote about the case where the problematic fields are quoted, and about using non-ASCII characters as field and record separators to avoid clashes.
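To make the problem concrete, here is a minimal sketch that flags suspicious lines in a small, made-up sample by comparing each line's delimiter count with the header's. The sample data and the helper name `short_lines` are illustrative assumptions, not part of any existing tool.

```python
def short_lines(text, dlm="|"):
    """Return (line_number, line) pairs with fewer delimiters than the header.

    Hypothetical helper for illustration; line numbers are 1-based.
    """
    lines = text.splitlines(keepends=True)
    if not lines:
        return []
    target = lines[0].count(dlm)  # delimiter count of the header line
    return [
        (i, line)
        for i, line in enumerate(lines, start=1)
        if line.count(dlm) < target
    ]


# Made-up sample: the "name" field of record 1 contains a line break,
# so the record is split across physical lines 2 and 3.
sample = "id|name|note\n1|Ann\nLee|first\n2|Bob|ok\n"

for number, line in short_lines(sample):
    print(number, repr(line))
```

Lines with too few delimiters are a hint, not proof: a line break inside the last field leaves the first fragment with the full delimiter count, so only the continuation line would be flagged.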
The following script reads from stdin and writes repaired lines to stdout. It takes the number of delimiters (|) in the first (header) line as the target number of delimiters, and ensures every output line has at least that many: it keeps concatenating lines, replacing each line break with a space, until concatenating the next line would yield more delimiters than the target. The script looks more complicated than it should because it has to handle two cases: a field that contains more than one line break (so it keeps concatenating rather than joining just one line), and a line that has more delimiters than the target (which could lead to an infinite loop if the number of delimiters were required to equal the target exactly).
```{python}
#! /usr/bin/env python

dlm = '|'

import sys
from signal import signal, SIGPIPE, SIG_DFL # http://stackoverflow.com/questions/14207708/ioerror-errno-32-broken-pipe-python
## no error when exiting a pipe like less
signal(SIGPIPE,SIG_DFL)

line = sys.stdin.readline()
n_dlm = line.count(dlm) ## target number of delimiters, taken from the header

line0 = line ## last candidate that did not exceed the target
line_next = 'a' ## any non-empty value, so the first iteration does not look like EOF
while line:
    if line.count(dlm) > n_dlm or line_next=='':
        ## adding line_next overshot the target (or we hit EOF): emit what we had
        sys.stdout.write(line0)
        line = line_next
        # line = sys.stdin.readline()
        if line.count(dlm) > n_dlm: ## line with more delimiters than target?
            line0 = line_next
            line_next = sys.stdin.readline()
            line = line.replace('\r', ' ').replace('\n', ' ') + line_next
    else:
        ## still within the target: remember this candidate and try appending the next line
        line0 = line
        line_next = sys.stdin.readline()
        line = line.replace('\r', ' ').replace('\n', ' ') + line_next
```