regex - Remove text between comma and dash in R with regular expressions

- April 15, 2012

I want to remove the text between commas and dashes in a long string of variable labels that are saved comma-separated. Here's a minimal example of the string:

myvarlabels <- ("participant number, how following products-green tea, how following products-beer,\"how much, if @ all, willing pay these products if ...-japanese, chinese, , indian green tea\",\"how much, if @ all, willing pay these products if ...-japanese, chinese, , indian beer\"")

Importantly, the variable labels appear in two different forms and should be shortened in the following way:

how following products-green tea
should be reduced to: green tea
\"how much, if @ all, willing pay these products if ...-japanese, chinese, , indian green tea\"
should be reduced to: \"japanese, chinese, , indian green tea\"

I tried to use gsub and regular expressions to identify and delete the text between the commas and the dash (i.e., replacing that text with "").

Does anyone have a suggestion for how to use gsub to remove the text between the commas that indicate the start of a new column and the dashes that are followed by the text I want to keep, while preserving the double quotes?

Edit 1

To be more precise, my data include four types of comma-separated chunks of text. They specify the information that the corresponding variables contain:

1. Short descriptions consisting of one or more words (e.g., participant number)
2. Longer descriptions where the relevant information appears after the dash (e.g., how following products-green tea)
3. Same as above, but with commas present somewhere before the dash (e.g., how much, if @ all, ...); this is why this type of chunk of text is preceded and followed by \" (otherwise it would not be read correctly)
4. Same as above (quoted), but with no commas before the dash (e.g., how experience have following products)

The four types of text sequences are preceded and followed by commas and can appear in any order.

Here's a new minimal example that reflects the real data more accurately than the first example:

(myvarlabels3 <- ("participant number,age,gender,body mass index,how following products-green tea,how following products-beer,outdoor temperature,season,\"how experience have following products-indian spices\",\"how much, if @ all, willing pay these products if ...-japanese, chinese, , indian beer\",email,telephone number"))

Cath's code (Edit 2) works up to a point. When I add more of the "simple" type 1 sequences of text at the beginning of the string, or when I add a text sequence as specified under 4. in the above list, the code doesn't work anymore.

However, when Cath's code from Edit 2 is run in two steps, it works perfectly:

# step 1: shorten the text sequences specified under 3. and 4. in the list above
myvarlabels3 <- gsub("((?<=,\")[^-]*[^-]+-)|((?<=,\")[^-],*[^-]+-)", "", myvarlabels3, perl=TRUE)
[1] "participant number,age,gender,body mass index,how following products-green tea,how following products-beer,outdoor temperature,season,\"indian spices\",\"japanese, chinese, , indian beer\",email,telephone number"

# step 2: shorten the text sequences specified under 2. in the list above
gsub("((?<=,)[^-\",]+-)", "", myvarlabels3, perl=TRUE)
[1] "participant number,age,gender,body mass index,green tea,beer,outdoor temperature,season,\"indian spices\",\"japanese, chinese, , indian beer\",email,telephone number"

I think it should be possible to do this in one line of code, but I couldn't figure out how. Anyway, this will facilitate my workflow when I import messy CSV files from Qualtrics.
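One way the two steps might be merged into a single call is sketched below. It is only an illustration, not part of the original answers: the names `labels`, `one_pass`, and `shortened` are made up for this example, and the pattern assumes that the text to keep never contains a dash after a comma.

```r
# Same content as myvarlabels3; `labels` is a name chosen for this sketch.
labels <- "participant number,age,gender,body mass index,how following products-green tea,how following products-beer,outdoor temperature,season,\"how experience have following products-indian spices\",\"how much, if @ all, willing pay these products if ...-japanese, chinese, , indian beer\",email,telephone number"

# Alternative 1: unquoted chunk. After a comma, drop everything up to the dash,
#                as long as no comma, quote, or dash occurs before that dash.
# Alternative 2: quoted chunk. After a comma plus opening quote, drop everything
#                up to the first dash (commas allowed, which covers types 3 and 4).
one_pass <- "(?<=,)[^-\",]+-|(?<=,\")[^-]+-"

shortened <- gsub(one_pass, "", labels, perl = TRUE)
print(shortened)
```

Both alternatives use a lookbehind, so the comma and the opening quote stay in the result. Labels whose kept part itself contains a dash would need a stricter pattern.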

I'm not sure I understand what your desired output is, but you can try spotting the "start of a new column" based on "how much" and go on until you "meet" a dash:

gsub("(^[^,]+, )|(how much[^-]+-)", "", myvarlabels, perl=TRUE)
[1] "green tea, beer,\"japanese, chinese, , indian green tea\",\"japanese, chinese, , indian beer\""

Edit

Considering your patterns, you can try the following:

gsub("((?<=, )[^-\"]+-)|((?<=,\")[^-]*,[^-]+-)", "", myvarlabels, perl=TRUE)
[1] "participant number, green tea, beer,\"japanese, chinese, , indian green tea\",\"japanese, chinese, , indian beer\""

I use two possible patterns, according to the two possible forms you described, with lookbehinds to specify what should be there but needs to be kept.
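To make the role of the lookbehind concrete, here is a small sketch on a made-up string (`x` and the other variable names are hypothetical, not taken from the question). It compares a pattern that consumes the comma with one that only checks for it, and shows a capture group as a base-regex alternative.

```r
# Toy input: one plain label followed by one "question-answer" label.
x <- "id,question text-green tea"

# Matching the comma as part of the pattern deletes it with the rest.
consumed <- sub(",[^-,]+-", "", x)

# A lookbehind only asserts that a comma precedes the match, so the comma survives.
kept <- sub("(?<=,)[^-,]+-", "", x, perl = TRUE)

# Without perl = TRUE, a capture group that is put back gives the same result.
captured <- sub("(,)[^-,]+-", "\\1", x)

print(c(consumed, kept, captured))
```

The lookbehind form keeps the replacement string empty, which is convenient when several alternatives with different prefixes are combined in one pattern.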

Edit 2

If you don't have a space between the comma and the question, and the question doesn't begin with a quote, you can do:

myvarlabels_2 <- ("participant number,how following products-green tea, how following products-beer,\"how much, if @ all, willing pay these products if ...-japanese, chinese, , indian green tea\",\"how much, if @ all, willing pay these products if ...-japanese, chinese, , indian beer\"")
gsub("((?<=,)[^-\",]+-)|((?<=,\")[^-]*,[^-]+-)", "", myvarlabels_2, perl=TRUE)
[1] "participant number,green tea,beer,\"japanese, chinese, , indian green tea\",\"japanese, chinese, , indian beer\""
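A different technique from the single-regex approach above, offered only as a sketch: let R's own CSV-style parsing split the labels first, then shorten each label separately. The variable names are hypothetical, and the sketch assumes that everything before the first dash in a label can be dropped. Note that `scan()` strips the surrounding double quotes, so they would have to be added back if the result is written out as one comma-separated string again.

```r
# Same content as myvarlabels3; `labels` is a name chosen for this sketch.
labels <- "participant number,age,gender,body mass index,how following products-green tea,how following products-beer,outdoor temperature,season,\"how experience have following products-indian spices\",\"how much, if @ all, willing pay these products if ...-japanese, chinese, , indian beer\",email,telephone number"

# Split on commas while respecting double-quoted fields.
fields <- scan(text = labels, what = character(), sep = ",",
               quote = "\"", quiet = TRUE)

# Within each field, remove everything up to and including the first dash;
# fields without a dash are returned unchanged.
short_labels <- sub("^[^-]*-", "", fields)
print(short_labels)
```

Working per field avoids the need to distinguish the quoted and unquoted cases in the pattern, at the cost of an extra parsing step.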
