- To: slug@xxxxxxxxxxx
- Subject: [SLUG] Re: Python, XML, and Splitting a 750M XML File?
- From: Bill Donoghoe <donoghoew@xxxxxxxxx>
- Date: Tue, 1 Feb 2011 18:32:35 +1100
G'Day,
An option modelled on XSLT and XPath is STX (Streaming Transformations
for XML). There is a Java implementation called Joost.
Here is a link to the STX sourceforge page (http://stx.sourceforge.net/).
That page includes a link to Joost.
Regards,
Bill Donoghoe
--------- Forwarded message ----------
> From: Tom Deckert <tdeckert@xxxxxxxxxxxxxxxxxx>
> To: slug@xxxxxxxxxxx
> Date: Mon, 31 Jan 2011 12:32:48 +1100
> Subject: [SLUG] Re: Python, XML, and Splitting a 750M XML File?
>
> G'Day,
>
> Apologies for not responding sooner - I've been too embarrassed.
> Re-googling instantly gave the answer - xml_split. On my
> Ubuntu Linux desktop, it's in package xml-twig-tools.
>
> Thanks to Peter who reminded me about awk (I'd not forgotten
> about it), and thanks to Chris for writing 160 lines of
> shell code, but I knew there had to be a trivially easy
> tool out there.
>
> A thing about Python I just learned and really love is:
>
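> > import sys
> > import apt
> >
> > cache = apt.Cache()
> > # newer python-apt releases spell this is_installed (formerly isInstalled)
> > if not cache['xml-twig-tools'].is_installed:
> >     print("Please install xml-twig-tools and rerun")
> >     sys.exit(1)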
>
> This makes it mind-bogglingly easy for a Python script to check
> whether a tool it needs is installed. Fantastic!
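
The apt check above only works on Debian/Ubuntu systems with python-apt installed. As a rough, more portable sketch (assuming Python 3.3 or later, and assuming it is enough to find the executable on PATH rather than the package), the standard library's shutil.which can do a similar job. The helper name require_tool is made up for illustration:

```python
import shutil
import sys


def require_tool(name):
    """Return the full path of an executable found on PATH, or exit with a hint.

    require_tool is a hypothetical helper name, not part of any library.
    """
    path = shutil.which(name)  # None when no matching executable is on PATH
    if path is None:
        # sys.exit with a string prints it to stderr and exits with status 1
        sys.exit("Please install %s and rerun" % name)
    return path
```

Calling require_tool("xml_split") at the top of the script would then stop early with the same message if the tool is missing.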
>
> Cheers,
> Tom
>
>
>
> On Thu, 2011-01-06 at 13:51 +1100, Tom Deckert wrote:
> > G'Day,
> >
> > Any easy XML (Python or otherwise) tools for splitting a 750M
> > XML file down into smaller portions?
> >
> > Because the file is so large
> > and exceeds memory size, I think the tool needs to be a 'streaming'
> > tool. On IBM DeveloperWorks site, I found an article detailing
> > using XSLT, but in other places it states XSLT tools usually
> > aren't streaming, so I'm guessing none of the XSLT processors
> > (xalan, saxon) will succeed. (Not to mention it's been more than
> > 10 years since I last worked with XSLT.)
> >
> > Original file looks like:
> > <?xml version="1.0"?>
> > <!DOCTYPE BigFile SYSTEM "BigFile.dtd">
> > <BigFile>
> > <TrivialHeader> blah </TrivialHeader>
> > <Datum> A couple hundred thousand Datum elements.</Datum>
> > <Datum> 'Datum' are non-trivial, containing extensive subtrees.</Datum>
> > <Datum> ...etc... </Datum>
> > <TrivialFooter> blah </TrivialFooter>
> > </BigFile>
> >
> >
> > I'd like a tool to split that into maybe
> > 10 different, valid XML files, all of which have the <BigFile>,
> > <TrivialHeader> and <TrivialFooter> tags,
> > but 1/10th as many <Datum>s per file.
> >
> >
> > The problem is that on my 4 GB laptop, I run out of memory
> > with any tool which tries to read in the whole tree at
> > one time. In my case, Python's ElementTree fails, a la:
> >
> > > import xml.etree.ElementTree
> > >
> > > fin = open("BigFile.xml", "r")
> > > tree = xml.etree.ElementTree.parse(fin)  # --> Out of Memory
> >
> >
> > The solution doesn't have to be Python, but it would be nicest
> > if it were, as the rest of the processing is all done in
> > a Python script.
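
For a pure-Python route, ElementTree also has a streaming interface, iterparse, which hands over elements one at a time instead of building the whole tree. What follows is only a sketch under some assumptions: the tag names are taken from the sample above, the function name split_xml and the part_NN.xml naming are made up, the caller picks how many Datum elements go in each file, and the output files carry no DOCTYPE line (so they are well-formed, but would need the DOCTYPE added back to validate against BigFile.dtd). Because the footer only shows up at the end of the input, it is appended to every part after the single pass finishes.

```python
import xml.etree.ElementTree as ET


def split_xml(src, per_file, out_pattern="part_{:02d}.xml"):
    """Stream src and write per_file <Datum> elements into each output file.

    split_xml and out_pattern are hypothetical names for this sketch.
    Returns the list of files written.
    """
    paths, out, count = [], None, 0
    header = footer = b""
    depth = 0  # nesting depth below the root element

    context = ET.iterparse(src, events=("start", "end"))
    _, root = next(context)  # start event for the root <BigFile>

    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth != 0:
            continue  # still inside a subtree, or the root itself is closing

        elem.tail = None  # drop trailing whitespace; newlines are added below
        if elem.tag == "TrivialHeader":
            header = ET.tostring(elem) + b"\n"
        elif elem.tag == "TrivialFooter":
            footer = ET.tostring(elem) + b"\n"
        elif elem.tag == "Datum":
            if out is None:
                # start a new part: root tag plus the header seen earlier
                paths.append(out_pattern.format(len(paths)))
                out = open(paths[-1], "wb")
                out.write(b"<BigFile>\n" + header)
            out.write(ET.tostring(elem) + b"\n")
            count += 1
            if count == per_file:
                out.close()
                out, count = None, 0
        root.clear()  # release the finished child so memory stays flat

    if out is not None:
        out.close()
    # the footer is only known once the input is exhausted, so append it now
    for path in paths:
        with open(path, "ab") as f:
            f.write(footer + b"</BigFile>\n")
    return paths
```

With a couple hundred thousand Datum elements, something like split_xml("BigFile.xml", 20000) would give roughly ten parts. Memory use should be bounded by the largest single Datum subtree, since each one is dropped from the root as soon as it has been written out.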
> >
> >
> > Cheers,
> > Tom
> >
> >
> >
> >
> >
>
>