JSON/JSONL¶
Compare¶
Compare two JSONL files that are not in the same order.
This implies that we need to sort the files on some key.
Use a trick à la Schwartzian transform: prepend the sort key, sort on that key, and then remove the key.
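# Decorate each record with its id, sort on it, then strip it again.
# String interpolation is used instead of @tsv because @tsv escapes
# backslashes, which would corrupt JSON text containing \" or \n.
jqdiff \
<(zcat train.jsonl.gz \
| jq --raw-output '"\(.id)\t\(tojson)"' \
| sort -t $'\t' -k1,1 \
| cut -f2-) \
<(zcat source/train.jsonl.gz \
| jq --raw-output '"\(.id)\t\(tojson)"' \
| sort -t $'\t' -k1,1 \
| cut -f2-)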
When the records are already in the same order, call the jqdiff alias directly on the two files.
It essentially does
vimdiff <(jq --sort-keys . file1.json) <(jq --sort-keys . file2.json)
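The decorate, sort, undecorate steps can also be sketched in Python, which is handy for a quick yes/no check before opening a diff. This is a minimal sketch, not the jqdiff alias itself: the function names and the sample records are made up for illustration, and it assumes every record carries the sort key.

```python
import json

def sort_jsonl_by_key(lines, key="id"):
    """Decorate each line with its sort key, sort, then drop the key again."""
    decorated = [
        (json.loads(line)[key], line.rstrip("\n"))
        for line in lines
        if line.strip()
    ]
    decorated.sort(key=lambda pair: pair[0])
    return [line for _, line in decorated]

def same_records(lines_a, lines_b, key="id"):
    """True when both inputs hold the same records, ignoring record order and key order."""
    a = [json.loads(line) for line in sort_jsonl_by_key(lines_a, key)]
    b = [json.loads(line) for line in sort_jsonl_by_key(lines_b, key)]
    return a == b

# Hypothetical sample data: same records, different order and key order.
left = ['{"id": 2, "text": "b"}', '{"id": 1, "text": "a"}']
right = ['{"text": "a", "id": 1}', '{"id": 2, "text": "b"}']
print(same_records(left, right))
```

Comparing parsed objects rather than raw lines plays the role of jq --sort-keys: key order inside a record does not matter.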
Parallel Processing¶
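Note that we use --keep-order so the output comes out in input order, --spreadstdin (an alias of --pipe) to split stdin into blocks that are spread across jobs, and --recend='\n' so that each block ends on a newline and no JSONL record is cut in half.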
function process {
cat
}
export -f process
zcat input.gz \
| time \parallel \
--keep-order \
--spreadstdin \
--recend='\n' \
--env=process \
'process' \
| gzip \
> output.gz
Counting Elements¶
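Count the number of entries/sentence pairs that have the .unparsable key set.
Note that select(.unparsable) keeps an entry only when the value is neither null nor false.
Feeding inputs straight into reduce avoids building an intermediate array of every match.
pv Huge.jsonl \
| jq --null-input 'reduce (inputs | select(.unparsable)) as $item (0; . + 1)'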
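For comparison, here is a minimal Python sketch of the same streaming count. It mirrors jq's truthiness rule (only null and false are falsy) and reads one line at a time. The function name and the sample records are assumptions for illustration.

```python
import io
import json

def count_with_truthy_key(lines, key="unparsable"):
    """Count records whose `key` is present and neither null nor false."""
    count = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        value = json.loads(line).get(key)
        if value is not None and value is not False:
            count += 1
    return count

# Hypothetical sample standing in for Huge.jsonl.
sample = io.StringIO(
    '{"id": 1, "unparsable": true}\n'
    '{"id": 2}\n'
    '{"id": 3, "unparsable": false}\n'
    '{"id": 4, "unparsable": "bad quote"}\n'
)
print(count_with_truthy_key(sample))
```

Any iterable of lines works, so an open file handle can be passed in place of the StringIO.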
Group by X and Merge¶
Context: after generating *.scores.json using sacrebleu --width=14 reference --metrics bleu chrf ter < translation > scores.json.
Can we extract BLEU scores from all our experiments and tabulate the result using mlr?
find -type f -name \*scores.json \
| xargs dirname \
| parallel 'jq "{\"expt_name\": \"{//}\", \"{/}\": (.[] | select(.name == \"BLEU\") | .score)}" {}/*scores.json' \
| jq --slurp --sort-keys 'group_by(.expt_name) | [.[] | add]' \
| mlr --json --opprint --barred cat \
| less
Aggregate a Field¶
Given a list of objects where some of them share the same id but carry different values in one field, aggregate that field for each id.
This happens when you extract data from mysql: it doesn't allow subqueries to return multiple rows with multiple columns, so you have to do the same work using JOIN, which repeats each object once per joined row.
Merge Arrays of JSON
echo -e '{"id":1, "b":[{"c":1}]}{"id":1, "b":[{"c":2}]}'
{
"id": 1,
"b": [
{
"c": 1
}
]
}
{
"id": 1,
"b": [
{
"c": 2
}
]
}
- group entries by id
- for each group
  - take the first element and aggregate all of the b in a list
  - return that first element that has been augmented with a list of b
echo -e '{"id":1, "b":[{"c":1}]}{"id":1, "b":[{"c":2}]}' \
| jq --slurp 'group_by(.id) | .[] | (.[0].b=([.[].b]|flatten)) | .[0]'
{
"id": 1,
"b": [
{
"c": 1
},
{
"c": 2
}
]
}
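The grouping steps above can be sketched in Python as well. This is an illustrative sketch under two assumptions: every record has the id key, and the field to aggregate is always a list. Unlike jq's flatten, it only concatenates one level deep. The function name is made up.

```python
import json
from itertools import groupby

def merge_field_by_id(records, key="id", field="b"):
    """Group records by `key` and concatenate their `field` lists onto the first record."""
    merged = []
    # sorted() is stable, so records with the same id keep their input order.
    ordered = sorted(records, key=lambda record: record[key])
    for _, group in groupby(ordered, key=lambda record: record[key]):
        group = list(group)
        first = dict(group[0])  # copy so the input is left untouched
        first[field] = [item for record in group for item in record[field]]
        merged.append(first)
    return merged

records = [{"id": 1, "b": [{"c": 1}]}, {"id": 1, "b": [{"c": 2}]}]
print(json.dumps(merge_field_by_id(records)))
```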
Zip Multiple files¶
Merge arrays
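The key here is the transpose: it turns three parallel arrays into one array of triples, and add then merges each triple into a single object.
--slurpfile is used rather than --argfile, which the jq manual discourages; both bind the same array here since each file holds several JSON values.
zcat translation.fr.json.gz \
| jq \
--slurp \
--slurpfile src <(jq -R '{"src":.}' source.en) \
--slurpfile ref <(jq -R '{"ref":.}' reference.fr) \
'[., $src, $ref] | transpose | map(add) | .[]'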
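As a rough Python counterpart, zip plays the role of transpose and a dict merge plays the role of add. This sketch differs from the jq version in one deliberate way: jq's transpose pads shorter arrays with null, whereas this raises on a length mismatch. The function name and the sample sentences are hypothetical.

```python
def zip_records(translations, sources, references):
    """Pair the i-th translation object with the i-th source and reference line."""
    if not (len(translations) == len(sources) == len(references)):
        raise ValueError("inputs must have the same number of entries")
    return [
        {**translation, "src": src, "ref": ref}
        for translation, src, ref in zip(translations, sources, references)
    ]

# Hypothetical sample data standing in for the three files.
translations = [{"hyp": "Bonjour"}, {"hyp": "Merci"}]
sources = ["Hello", "Thank you"]
references = ["Bonjour", "Merci beaucoup"]
for record in zip_records(translations, sources, references):
    print(record)
```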
Flat Files to Structured json¶
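When you have multiple flat files that you want to combine, line by line, into structured JSON.
Each file name below is followed by the lines it contains in this example.
Miller turns dotted labels such as eng_spa.lid into nested objects in its JSON output.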
lingua_eng_spa/Tilde-worldbank-1-eng-spa.spa.gz
SPA 0.9998978843092705
SPA 0.9991979235059277
lingua_all_languages/Tilde-worldbank-1-eng-spa.spa.gz
SPA 0.9999975457963204
SPA 0.9847735076254288
Tilde-worldbank-1-eng-spa.spa.gz
"Igualmente, hacemos notar la importancia de abordar el problema del hambre y la malnutrición”.
"La vida es muy difícil.
Tilde-worldbank-1-eng-spa.eng.gz
" We also note the importance of addressing hunger and malnutrition.”
"[Life] is extremely difficult.
paste \
<(zcat lingua_eng_spa/Tilde-worldbank-1-eng-spa.spa.gz) \
<(zcat lingua_all_languages/Tilde-worldbank-1-eng-spa.spa.gz) \
<(zcat Tilde-worldbank-1-eng-spa.spa.gz) \
<(zcat Tilde-worldbank-1-eng-spa.eng.gz) \
| mlr --tsv --ojson --implicit-csv-header \
label eng_spa.lid,eng_spa.confidence,all.lid,all.confidence,spa,eng \
| jq '.[]'
{
"eng_spa": {
"lid": "SPA",
"confidence": 0.9998978843092705
},
"all": {
"lid": "SPA",
"confidence": 0.9999975457963204
},
"spa": "\"Igualmente, hacemos notar la importancia de abordar el problema del hambre y la malnutrición”.",
"eng": "\" We also note the importance of addressing hunger and malnutrition.”"
}
{
"eng_spa": {
"lid": "SPA",
"confidence": 0.9991979235059277
},
"all": {
"lid": "SPA",
"confidence": 0.9847735076254288
},
"spa": "\"La vida es muy difícil.",
"eng": "\"[Life] is extremely difficult."
}
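The same column-wise combination can be sketched in Python, where zip stands in for paste and the nesting is written out explicitly. This is a sketch under assumptions: the function name is made up, the language-id files are taken to hold a label and a confidence separated by whitespace, and the inputs are already decompressed lists of lines.

```python
def combine_rows(eng_spa_lid, all_lid, spa, eng):
    """Combine four parallel lists of lines into one nested record per line."""
    rows = []
    for pair_lid, any_lid, spa_line, eng_line in zip(eng_spa_lid, all_lid, spa, eng):
        pair_label, pair_conf = pair_lid.split()
        any_label, any_conf = any_lid.split()
        rows.append({
            "eng_spa": {"lid": pair_label, "confidence": float(pair_conf)},
            "all": {"lid": any_label, "confidence": float(any_conf)},
            "spa": spa_line,
            "eng": eng_line,
        })
    return rows

rows = combine_rows(
    ["SPA 0.9998978843092705", "SPA 0.9991979235059277"],
    ["SPA 0.9999975457963204", "SPA 0.9847735076254288"],
    ['"Igualmente, hacemos notar la importancia de abordar el problema del hambre y la malnutrición”.', '"La vida es muy difícil.'],
    ['" We also note the importance of addressing hunger and malnutrition.”', '"[Life] is extremely difficult.'],
)
print(rows[1]["all"])
```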
XML to json¶
Using yq, we can convert an XML document into a JSON file.
yq -p xml -o json < input.xml > output.json
Convert to Array¶
Given
[
{ "seg": ["A", "B"] },
{ "seg": "C" }
]
The seg of the second object is NOT an array, but you need it to be one to process all elements the same way. You can make sure all segments are arrays by doing:
jq '.[] | .seg | (if type == "array" then . else [.] end) | .[]'
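The same normalization, wrapping anything that is not already a list, can be sketched in Python. The function name is an assumption for illustration.

```python
def iter_segments(records, field="seg"):
    """Yield every segment, wrapping scalar values so all records are handled alike."""
    for record in records:
        value = record[field]
        if not isinstance(value, list):
            value = [value]
        yield from value

records = [{"seg": ["A", "B"]}, {"seg": "C"}]
print(list(iter_segments(records)))
```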
Filter-out SubObjects¶
Given
<?xml version='1.0' encoding='utf-8'?>
<dataset id="wmttest2024">
<collection id="general">
<doc origlang="en" id="test-en-news_beverly_press.3585" domain="news">
<src lang="en">
<p>
<seg id="1">Siso's depictions of land, water center new gallery exhibition</seg>
</p>
</src>
<ref lang="es" translator="refA">
<p>
<seg id="1">Representaciones de la tierra y el agua de Siso centran una nueva exposición</seg>
</p>
</ref>
</doc>
<doc origlang="en" id="test-en-news_brisbanetimes.com.au.228963" domain="NOT_news">
<src lang="en">
<p>
<seg id="1">Adapt the old, accommodate the new to solve issue</seg>
</p>
</src>
<ref lang="es" translator="refA">
<p>
<seg id="1">Adapta lo viejo, incorpora lo nuevo para resolver el problema</seg>
</p>
</ref>
</doc>
</collection>
</dataset>
Remove the documents that are NOT in the news domain while keeping the document's structure.
~/.local/bin/yq 'del(.dataset.collection.doc[] | select(.["+@domain"] != "news"))' wmttest2024.en-es.xml
<?xml version='1.0' encoding='utf-8'?>
<dataset id="wmttest2024">
<collection id="general">
<doc origlang="en" id="test-en-news_beverly_press.3585" domain="news">
<src lang="en">
<p>
<seg id="1">Siso's depictions of land, water center new gallery exhibition</seg>
</p>
</src>
<ref lang="es" translator="refA">
<p>
<seg id="1">Representaciones de la tierra y el agua de Siso centran una nueva exposición</seg>
</p>
</ref>
</doc>
</collection>
</dataset>
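When yq is not at hand, a similar filter can be sketched with Python's standard xml.etree.ElementTree. This is only an illustration: the function name and the shortened sample document are made up, and it assumes doc elements sit directly under collection as in the example above.

```python
import xml.etree.ElementTree as ET

def keep_domain(xml_text, domain="news"):
    """Drop every <doc> whose domain attribute differs from `domain`."""
    root = ET.fromstring(xml_text)
    for collection in root.iter("collection"):
        # Iterate over a copy because we remove children while looping.
        for doc in list(collection.findall("doc")):
            if doc.get("domain") != domain:
                collection.remove(doc)
    return root

# Hypothetical, shortened sample with the same shape as the dataset above.
sample = """<dataset id="wmttest2024">
  <collection id="general">
    <doc id="a" domain="news"><src lang="en"><p><seg id="1">kept</seg></p></src></doc>
    <doc id="b" domain="NOT_news"><src lang="en"><p><seg id="1">dropped</seg></p></src></doc>
  </collection>
</dataset>"""
root = keep_domain(sample)
print([doc.get("id") for doc in root.iter("doc")])
```

ET.tostring(root, encoding="unicode") serializes the filtered tree back to text.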