Repair line breaks within a field of a delimited file


Vinh Nguyen


October 9, 2015

Delimited files sometimes arrive with line break characters (carriage return and/or line feed) inside a field that is not quoted. I previously wrote about the case where the problematic fields are quoted, and about using non-ASCII characters as field and record separators to avoid such clashes.
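To make the problem concrete, here is a small sketch using a made-up pipe-delimited sample (the data and the variable names are assumptions for illustration only). It counts delimiters on each physical line, so the record that was split by an unquoted line break shows up as two lines that each fall short of the header's count.

```python
# Hypothetical sample: the record for "Bob Smith" has an unquoted
# line break inside its second field.
raw = "id|name|note\n1|Ann|first\n2|Bob\nSmith|second\n3|Cy|third\n"

lines = raw.splitlines(keepends=True)

# Number of delimiters in the header line.
target = lines[0].count("|")

# Delimiters found on each physical line.
counts = [line.count("|") for line in lines]

# Indices of physical lines with fewer delimiters than the header.
short = [i for i, c in enumerate(counts) if c < target]

for i in short:
    print(i, repr(lines[i]))
```

Note that a break inside the last field would not show up this way, because the first half of that record already carries the full number of delimiters.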

The following script reads from stdin and writes repaired lines to stdout. It takes the number of delimiters (|) in the first (header) line as the target number of delimiters and tries to give every output line that many. To do so it keeps concatenating lines, replacing each embedded line break with a space, until concatenating the next line would yield more delimiters than the target. The script looks more complicated than it should because it has to handle two cases: a field that contains more than one line break (so it cannot stop after joining a single line), and a line that has more delimiters than the target (which could lead to an infinite loop if the number of delimiters were required to equal the target exactly).

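#! /usr/bin/env python

import sys
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE, SIG_DFL) ## no error when exiting a pipe like less

dlm = '|' ## field delimiter

line = sys.stdin.readline() ## first/header line
n_dlm = line.count(dlm) ## target number of delimiters

line0 = line ## last candidate that did not exceed the target
line_next = 'a' ## any non-empty placeholder; '' means end of input
while line:
    if line.count(dlm) > n_dlm or line_next == '':
        ## appending line_next overshot the target (or input ended): emit line0
        sys.stdout.write(line0)
        line = line_next
        if line.count(dlm) > n_dlm: ## line with more delimiters than target?
            ## it can never fit, so queue it for output as-is and read ahead
            line0 = line_next
            line_next = sys.stdin.readline()
            line = line.replace('\r', ' ').replace('\n', ' ') + line_next
    else:
        ## still within the target: remember it, then try appending the next line
        line0 = line
        line_next = sys.stdin.readline()
        line = line.replace('\r', ' ').replace('\n', ' ') + line_next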
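The same greedy rule can also be sketched as a generator, which makes it easy to try on a small in-memory sample without going through stdin and stdout. This is not the script above, only an illustrative rewrite of the rule it describes; the function name `repair_lines` and the sample data are assumptions.

```python
def repair_lines(lines, dlm="|"):
    """Yield repaired records from an iterable of raw lines.

    Hypothetical helper: joins a line with the following one as long as
    the joined text has no more delimiters than the header line.
    """
    lines = iter(lines)
    current = next(lines, "")
    if not current:
        return  # empty input: nothing to repair
    target = current.count(dlm)  # delimiters in the header line
    for nxt in lines:
        # Replace the embedded line break with spaces before joining.
        joined = current.replace("\r", " ").replace("\n", " ") + nxt
        if joined.count(dlm) <= target:
            current = joined  # still fits: keep accumulating
        else:
            yield current     # joining would overshoot: emit and restart
            current = nxt
    yield current


# Made-up sample: the "Bob" record has a line break in its last field.
raw = [
    "id|name|note\n",
    "1|Ann|first\n",
    "2|Bob|line one\n",
    "line two\n",
    "3|Cy|ok\n",
]
repaired = list(repair_lines(raw))
for record in repaired:
    print(record, end="")
```

One limitation of this rule, as far as I can tell: a fragment with no delimiters is always attached to the record before it, so a line break inside the first field of a record would be glued onto the previous record instead.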